What is it about?
Patient Information Leaflets (PILs) provide essential information about medication usage, side effects, precautions, and interactions, making them a valuable resource for Question Answering (QA) systems in healthcare. However, no dedicated benchmark currently exists for evaluating QA systems on PILs, which limits progress in this domain. To address this gap, we introduce a fact-supported synthetic benchmark of multiple-choice questions and answers generated from real PILs. The benchmark is built with a fully automated pipeline that leverages multiple Large Language Models (LLMs) to generate diverse, realistic, and contextually relevant question-answer pairs, and it is publicly released as a standardized evaluation framework for assessing how well LLMs process and reason over PIL content. To validate its effectiveness, we conduct an initial evaluation with state-of-the-art LLMs, showing that the benchmark poses a realistic and challenging task and provides a solid foundation for advancing QA research in the healthcare domain.
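This page does not spell out the generation pipeline in detail, so the sketch below is only a rough, hypothetical illustration of the general idea: prompting an LLM to turn a single PIL excerpt into a fact-supported multiple-choice question. The call_llm function, the prompt wording, and the JSON schema are assumptions made for demonstration, not the authors' actual method or models.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM API call (not the authors' pipeline).
    It returns a canned response so the sketch runs end to end."""
    return json.dumps({
        "question": "What is the maximum number of tablets allowed in 24 hours?",
        "options": ["4", "6", "8", "12"],
        "answer": "8",
        "supporting_fact": "Do not take more than 8 tablets in 24 hours.",
    })

def generate_mcq(pil_excerpt: str) -> dict:
    """Prompt the model for one multiple-choice question grounded in the PIL text,
    returning the question, options, answer, and the supporting fact from the leaflet."""
    prompt = (
        "Read the following patient information leaflet excerpt and write one "
        "multiple-choice question with four options. Return JSON with keys "
        "'question', 'options', 'answer', and 'supporting_fact' (a sentence "
        "copied verbatim from the excerpt).\n\n" + pil_excerpt
    )
    return json.loads(call_llm(prompt))

if __name__ == "__main__":
    mcq = generate_mcq("Do not take more than 8 tablets in 24 hours.")
    print(mcq["question"], mcq["options"], mcq["answer"])
```

In a real pipeline, the canned response would be replaced by calls to one or more actual LLMs, with the "supporting_fact" field serving to keep each generated question grounded in the source leaflet.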
Read the Original
This page is a summary of: PILs of Knowledge: A Synthetic Benchmark for Evaluating Question Answering Systems in Healthcare, July 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3726302.3730283.