What is it about?

AI systems, especially large language models (LLMs), are increasingly used to evaluate human performance on text-based assessments. We measured how much the AI agrees with a human evaluator and compared this with the agreement between two human experts. We also simulated how disagreements on individual answers affect an overall evaluation across multiple questions.
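
To make the idea of grader agreement concrete, here is a minimal Python sketch. It is not taken from the paper: this summary does not say which agreement statistic was used, so raw percent agreement and Cohen's kappa are assumptions, and the grades below are invented purely for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of answers on which two graders gave the same grade."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each grader's own grade frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both graders used one identical grade throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical per-answer grades for one exam (1 = correct, 0 = incorrect).
human = [1, 1, 0, 1, 0, 1, 1, 0]
llm = [1, 0, 0, 1, 0, 1, 1, 1]

agreement = percent_agreement(human, llm)
kappa = cohen_kappa(human, llm)

# Overall evaluation across all questions, here simply the sum of grades.
human_total = sum(human)
llm_total = sum(llm)

print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
print(f"human total={human_total}, LLM total={llm_total}")
```

In this made-up data the two graders disagree on two of eight answers, yet both arrive at the same total because the disagreements point in opposite directions. With other data the disagreements could add up instead, which is why the effect on the overall evaluation has to be examined separately from per-answer agreement.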

Why is it important?

In some areas, such as customer service, the use of AI is already standard, whereas in others, such as human performance evaluation, there are legal and ethical concerns. Our paper provides data that can contribute to the discussion of these concerns.

Perspectives

This article has already led to many discussions about broader topics, including the societal impact and ethical use of AI.

Sebastian Speiser
Hochschule für Technik Stuttgart

Read the Original

This page is a summary of: Assessing the Real-World Impact of Disagreement Between Human Graders and LLMs, March 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3672608.3707736.
