What is it about?

Large language models (LLMs), the technology behind many chatbots and automatic translators, are now widely used to summarise documents, answer questions, and translate text. However, these systems sometimes produce information that sounds convincing but is factually wrong, a behaviour often referred to as “hallucination”. Detecting such errors reliably is essential if AI systems are to be trusted in education, research, healthcare, and other safety‑critical settings.

This review examines how researchers currently measure faithfulness, that is, the extent to which AI‑generated text remains accurate and consistent with its source material. It surveys a wide range of evaluation methods across three common applications: text summarisation, question answering, and machine translation. It explains how traditional automated metrics compare surface text similarity, why these methods often fail to detect deeper factual errors, and how newer approaches attempt to capture meaning rather than wording.

A key focus of the review is the increasing use of AI models themselves as evaluators, an approach sometimes called “LLM‑as‑a‑judge”. These systems assess whether a generated answer logically follows from the source information and, in many cases, agree more closely with human judgement than older metrics do. The paper also discusses techniques that reduce hallucinations, such as grounding AI responses in external documents or using structured prompting strategies. Overall, the review provides a clear picture of what currently works, what does not, and where future research is needed to make AI outputs more reliable.
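To make the point about surface similarity concrete, here is a minimal sketch of a word-overlap score, loosely in the spirit of ROUGE-1. It is not taken from the review and is not the official ROUGE implementation. The function name `unigram_f1` and the example sentences are hypothetical, invented purely for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between two strings (a simplified, ROUGE-1-like score).

    Hypothetical helper for illustration only; tokenisation is a plain
    lowercase whitespace split.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumptions, not data from the review).
source = "the trial enrolled 120 patients and the drug reduced symptoms"
faithful = "the drug reduced symptoms in a trial of 120 patients"
unfaithful = "the trial enrolled 120 patients and the drug increased symptoms"

score_faithful = unigram_f1(faithful, source)
score_unfaithful = unigram_f1(unfaithful, source)

print(f"faithful paraphrase:  {score_faithful:.2f}")
print(f"one-word falsehood:   {score_unfaithful:.2f}")
```

In this toy case the sentence that reverses the finding scores higher (0.90) than the accurate paraphrase (0.70), because it shares more words with the source. This is the kind of failure that motivates meaning-based evaluation.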

Why is it important?

This work is important because AI systems are increasingly used in situations where factual errors can have serious consequences. Despite rapid progress in language modelling, there is still no single, reliable way to measure whether AI‑generated text is faithful to the truth. By systematically comparing evaluation methods across multiple domains, this review highlights the strengths and weaknesses of commonly used metrics and shows why many popular benchmarks are insufficient for real‑world use.

The review is particularly timely because it demonstrates that evaluation based on open‑ended text generation, rather than multiple‑choice testing, provides a more realistic picture of AI reliability. It also identifies LLM‑based evaluation as one of the most promising current approaches, while clearly outlining its limitations, such as over‑confidence and sensitivity to prompting. These insights help researchers, developers, and policymakers design safer AI systems and choose evaluation strategies that better reflect human judgement.

Perspectives

Writing this review reinforced how challenging it is to define and measure “truthfulness” in language‑based AI systems. What seems correct on the surface can hide subtle factual errors that only become apparent with careful analysis. Bringing together work from summarisation, question answering, and machine translation made it clear that no single metric works everywhere. I hope this paper helps researchers think more critically about how they evaluate AI systems and encourages the development of more robust, transparent, and human‑aligned evaluation methods. Ultimately, improving how we measure faithfulness is a necessary step towards deploying AI responsibly in high‑impact domains.

Prof Tatiana Kalganova
Brunel University

Read the Original

This page is a summary of: A Review of Faithfulness Metrics for Hallucination Assessment in Large Language Models, IEEE Journal of Selected Topics in Signal Processing, October 2025, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/jstsp.2025.3579203.
