Reliable but not rigorous: Evaluating ChatGPT's reliability, validity, and bias in automated academic grading

Raed Awashreh; Hisham Al Ghunaimi; Said AlGhenaimi

doi:10.1016/j.ssaho.2026.102788

What is it about?

Artificial intelligence tools such as ChatGPT are increasingly being considered for grading student assignments. They can provide feedback quickly and apply assessment criteria consistently. However, consistency does not necessarily mean that the grades are accurate or fair.

This study compared grades awarded by ChatGPT with those assigned by human instructors for 61 undergraduate assignments in Political Science and Public Administration. Both ChatGPT and the instructors used the same grading rubrics. The results showed that ChatGPT generally ranked stronger and weaker assignments in a similar order to the instructors. However, it consistently awarded substantially higher marks and produced a narrower range of grades. ChatGPT tended to reward clear structure, fluent writing, good formatting, and well-organised presentation. It was less effective at identifying weaknesses in analytical depth, theoretical engagement, originality, and critical reasoning. Even when its written feedback recognised these limitations, the numerical penalties were often too small.

The findings suggest that ChatGPT can be useful for providing formative feedback, helping students improve early drafts, and supporting instructors with preliminary reviews. However, it should not replace academic judgement when final grades are awarded. Universities should use a carefully calibrated human–AI approach in which instructors retain responsibility for summative assessment, fairness, and academic standards.
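The difference between ranking assignments in a similar order and awarding similar marks can be made concrete with a small calculation. The sketch below is illustrative only: the marks are invented for six hypothetical assignments and are not data from the study, and the three measures (Spearman rank correlation, mean difference, ratio of standard deviations) are one reasonable way to separate the two ideas, not necessarily the statistics the authors used.

```python
from math import sqrt
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented marks out of 100 for six hypothetical assignments (NOT study data).
instructor = [48, 55, 62, 70, 78, 85]
# Hypothetical AI marks: same ordering, but shifted upward and compressed.
ai = [72, 74, 77, 80, 83, 86]

rank_agreement = spearman(instructor, ai)        # agreement on ordering
mean_bias = mean(ai) - mean(instructor)          # positive means AI marks higher
spread_ratio = pstdev(ai) / pstdev(instructor)   # below 1 means a narrower range

print(f"rank agreement: {rank_agreement:.2f}")
print(f"mean bias:      {mean_bias:+.1f} marks")
print(f"spread ratio:   {spread_ratio:.2f}")
```

In this made-up example the two graders order the assignments identically, so the rank agreement is as high as it can be, yet the AI marks are higher on average and span a narrower range. A high rank correlation on its own therefore says nothing about whether the marks themselves are well calibrated.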


Why is it important?

Universities are under growing pressure to assess student work efficiently while maintaining fairness and academic standards. Although ChatGPT can review assignments quickly and apply rubrics consistently, this study shows that it may also award marks that are systematically higher than those given by instructors. This creates a risk of grade inflation and may weaken the credibility of academic assessment. The issue is particularly important because clear writing and good formatting are not always evidence of deep understanding, critical thinking, or originality. If AI-generated grades are accepted without human oversight, students may receive marks that do not accurately reflect their academic performance. The study therefore provides a practical message for universities: ChatGPT can support formative feedback and reduce routine workload, but final grading decisions should remain with qualified instructors. A carefully governed human–AI assessment model can improve efficiency without compromising fairness, accountability, or trust in educational qualifications.

Perspectives

AI can support educators, but it should not become the final decision-maker in academic assessment. Our study shows a clear distinction between grading consistently and grading rigorously. ChatGPT can recognise structure, fluency, and presentation effectively, yet it may be overly positive and insufficiently sensitive to deeper intellectual qualities such as critical analysis, originality, and theoretical reasoning. The practical message is not to reject AI, but to use it responsibly. ChatGPT is most valuable as a diagnostic and formative tool: it can provide rapid feedback, highlight areas for improvement, and reduce routine workload. Final grades, particularly for analytical assignments, should remain under the supervision of qualified instructors. A responsible human–AI assessment model can combine efficiency with academic integrity, provided that institutions establish clear policies, regular calibration procedures, transparency, and appropriate safeguards for student data.
Dr Hisham Al Ghunaimi

This page is a summary of: Reliable but not rigorous: Evaluating ChatGPT's reliability, validity, and bias in automated academic grading, Social Sciences & Humanities Open, June 2026, Elsevier, DOI: 10.1016/j.ssaho.2026.102788.

Contributors

The following have contributed to this page

Dr Hisham Al Ghunaimi

Can ChatGPT grade student assignments fairly? It is consistent, but often too generous
