SIGNAL · AI · May 27, 2026, 4:00 AM · Signal 75 · Short term

Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

arXiv:2512.14561v2. Abstract: Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is h
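For readers unfamiliar with how "agreement" between two sets of scores is quantified, the sketch below computes quadratic weighted kappa, a statistic often used in essay-scoring work. The abstract excerpt does not say which agreement metrics the synthesized studies report, so the choice of this metric is an assumption, and the function name and sample scores are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(human, llm, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Illustrative only: the metric choice is an assumption, not taken
    from the paper summarized above.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score points")
    for s in list(human) + list(llm):
        if not min_score <= s <= max_score:
            raise ValueError(f"score {s} is outside the scale")

    n = len(human)
    span = (max_score - min_score) ** 2
    observed = Counter(zip(human, llm))  # joint counts of (human, llm) pairs
    human_hist = Counter(human)          # marginal counts per rater
    llm_hist = Counter(llm)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span  # larger gaps are penalized more
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * human_hist[i] * llm_hist[j] / (n * n)

    if weighted_expected == 0:
        # Both raters gave the same single score to every essay.
        raise ValueError("kappa is undefined when neither rater varies")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 scale.
human_scores = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human_scores, [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa(human_scores, [4, 3, 2, 1], 1, 4))
```

A value of 1 means the two raters always match, 0 means agreement no better than chance, and negative values mean systematic disagreement.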

Why this matters

Why now

The synthesis covers studies from January 2022 to August 2025, so it offers a current assessment of LLM scoring reliability at a time when LLMs are being integrated into more applications.

Why it’s important

This research provides empirical evidence on how reliably LLMs score essays compared with human raters, a high-stakes use case, and can inform how AI-powered assessment tools are developed and deployed.

What changes

The mixed findings on LLM-human agreement indicate that essay scoring with LLMs is not a solved problem. It calls for context-specific application and further development rather than broad, uncritical adoption.

Winners

· AI ethics researchers
· Developers of specialized AI scoring models
· Educational technology platforms focusing on qualitative assessment

Losers

· Companies pushing undifferentiated LLM-based scoring solutions
· Users expecting perfect AI scoring consistency

Second-order effects

Direct

Further research and development will likely focus on improving LLM reliability and interpretability in automated essay scoring.

Second

Educational institutions may adopt hybrid human-AI scoring models, leveraging AI for efficiency while retaining human oversight for accuracy and fairness.

Third

The development of robust, explainable AI for assessment could eventually redefine educational evaluation methodologies and standards.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.CL

#cs.CL

Tracked by The Continuum Brief · live intelligence network
