Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

arXiv:2512.14561v2. Abstract: Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is h
The study synthesizes recent findings from January 2022 to August 2025, providing a timely assessment of LLM capabilities as their integration into various applications accelerates.
This research provides critical empirical evidence on the reliability of LLMs in a specific high-stakes application, informing development and deployment strategies for AI-powered assessment tools.
The mixed findings on LLM-human agreement rates indicate that AI in essay scoring is not a universally solved problem, requiring more nuanced application and development rather than broad, uncritical adoption.
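To make "agreement" concrete, here is a minimal sketch of quadratic weighted kappa (QWK), a chance-corrected statistic commonly used to compare two sets of essay scores. The abstract excerpt does not say which agreement statistics the synthesized studies reported, so the choice of QWK, the function name, and the sample scores are assumptions for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(human, llm, min_score, max_score):
    """Chance-corrected agreement between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means agreement no better than
    chance, and negative values mean systematic disagreement.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and equally long")
    categories = max_score - min_score + 1
    if categories < 2:
        raise ValueError("need at least two score categories")

    n = len(human)
    observed = Counter(zip(human, llm))  # joint counts of (human, llm) pairs
    human_hist = Counter(human)          # marginal counts per rater
    llm_hist = Counter(llm)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger score gaps count more heavily.
            weight = (i - j) ** 2 / (categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Disagreement expected if the two raters were independent.
            weighted_expected += weight * human_hist[i] * llm_hist[j] / n ** 2

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one score")
    return 1.0 - weighted_observed / weighted_expected


# Illustrative scores on a 1-4 scale (made up, not data from the paper).
human_scores = [1, 2, 3, 4]
llm_scores = [1, 2, 3, 3]
kappa = quadratic_weighted_kappa(human_scores, llm_scores, 1, 4)
```

Because the statistic depends on the score scale and on how scores are distributed, the same LLM can produce quite different kappa values on different essay sets, which is one reason reported agreement can vary within a single study.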
- AI ethics researchers
- Developers of specialized AI scoring models
- Educational technology platforms focusing on qualitative assessment
- Companies pushing undifferentiated LLM-based scoring solutions
- Users expecting perfect AI scoring consistency
Further research and development will likely focus on improving LLM reliability and interpretability in automated essay scoring.
Educational institutions may adopt hybrid human-AI scoring models, leveraging AI for efficiency while retaining human oversight for accuracy and fairness.
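One way such a hybrid workflow might look is sketched below. It assumes each essay is scored several times by an LLM and is sent to a human rater whenever those repeated scores disagree by more than a threshold. The routing rule, the `route_essay` name, and the `max_spread` default are hypothetical and do not come from the paper.

```python
from statistics import median_low


def route_essay(llm_scores, max_spread=1):
    """Decide whether repeated LLM scores for one essay can be accepted.

    Returns ("auto", score) when the scores are consistent enough, or
    ("human_review", None) when the essay should go to a human rater.
    """
    if not llm_scores:
        raise ValueError("need at least one LLM score")
    spread = max(llm_scores) - min(llm_scores)
    if spread > max_spread:
        # The LLM disagrees with itself, so keep a human in the loop.
        return ("human_review", None)
    # median_low always returns one of the actual scores, never an average.
    return ("auto", median_low(llm_scores))


# Illustrative inputs (made up): a consistent essay and an inconsistent one.
consistent = route_essay([3, 3, 4])
inconsistent = route_essay([2, 4, 3])
```

A rule like this trades efficiency against oversight: a lower `max_spread` sends more essays to human raters, while a higher one accepts more LLM scores automatically.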
The development of robust, explainable AI for assessment could eventually redefine educational evaluation methodologies and standards.
Read at arXiv cs.CL