Assessing the Real-World Impact of Disagreement Between Human Graders and LLMs

Sebastian Speiser

doi:10.1145/3672608.3707736

What is it about?

AI systems, especially LLMs, are increasingly used to evaluate human performance on text-based assessments. We measured how much an LLM agrees with a human evaluator and compared this to the agreement between two human experts. Furthermore, we simulated how disagreements on individual answers affect an overall evaluation across multiple questions.
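
To make the idea of grader agreement concrete, here is a minimal Python sketch. It uses Cohen's kappa, one common chance-corrected agreement measure; this summary does not say which measure the paper uses, so treat that choice as an assumption. The grades, the 0-2 point scale, and the helper names `cohen_kappa` and `total_score_shift` are hypothetical and only serve as illustration; they are not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical grades."""
    if len(a) != len(b) or not a:
        raise ValueError("grade lists must be non-empty and of equal length")
    n = len(a)
    # Share of answers on which both graders gave the same grade.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each grader's grade frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical grade throughout
    return (observed - expected) / (1 - expected)


def total_score_shift(a, b):
    """Difference in summed points across all questions (b minus a)."""
    return sum(b) - sum(a)


# Hypothetical points (0-2) for one student on eight questions.
human_1 = [2, 1, 0, 2, 1, 2, 0, 1]
human_2 = [2, 1, 1, 2, 1, 2, 0, 0]
llm = [2, 2, 0, 2, 2, 1, 0, 1]

print("human vs human kappa:", round(cohen_kappa(human_1, human_2), 3))
print("human vs LLM kappa:  ", round(cohen_kappa(human_1, llm), 3))
print("total shift, human 2:", total_score_shift(human_1, human_2))
print("total shift, LLM:    ", total_score_shift(human_1, llm))
```

In this made-up data the two human graders disagree on two answers, yet their totals are identical because the differences cancel out, while the LLM's disagreements move the total by one point. This is the kind of effect on an overall evaluation that per-answer agreement figures alone do not show.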

Why is it important?

In some areas, such as customer service, the use of AI is already standard, whereas in others, such as human performance evaluation, there are legal and ethical concerns. Our paper provides data that can contribute to the discussion of these concerns.

Perspectives

This article has already led to many discussions about broader topics, including the societal impact and ethical use of AI.
Sebastian Speiser
Hochschule für Technik Stuttgart

This page is a summary of: Assessing the Real-World Impact of Disagreement Between Human Graders and LLMs, March 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3672608.3707736.

Contributors

The following have contributed to this page

Sebastian Speiser
Hochschule für Technik Stuttgart

How well do humans and AI agree when grading student answers?
