LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

arXiv:2606.18709v1 Abstract: Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where mo
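To make the abstract's definition of item discrimination concrete, here is a minimal sketch of one common classical index: the corrected item-total (point-biserial) correlation between getting an item right and the score on the remaining items. The excerpt does not say which discrimination statistic the paper uses, so this choice is an assumption, and the function names and the `responses` matrix are hypothetical illustrations rather than data from the study.

```python
from statistics import mean, pstdev


def point_biserial(item_scores, other_scores):
    """Pearson correlation between 0/1 item scores and numeric scores.

    Returns 0.0 when either input has no variance, because an item that
    everyone answers the same way cannot separate students at all.
    """
    if len(item_scores) != len(other_scores):
        raise ValueError("score lists must have the same length")
    sd_item = pstdev(item_scores)
    sd_other = pstdev(other_scores)
    if sd_item == 0 or sd_other == 0:
        return 0.0
    mean_item = mean(item_scores)
    mean_other = mean(other_scores)
    covariance = mean(
        [(x - mean_item) * (y - mean_other)
         for x, y in zip(item_scores, other_scores)]
    )
    return covariance / (sd_item * sd_other)


def item_discrimination(responses, item):
    """Corrected item-total correlation for one item.

    `responses` is a list of per-student rows of 0/1 scores. The item's own
    score is removed from each total so it is not correlated with itself.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return point_biserial(item_scores, rest_scores)


# Hypothetical data: six students (rows) by five items (columns).
# Item 0 is answered correctly by everyone, items 1 to 3 are answered
# correctly mostly by stronger students, and item 4 mostly by weaker ones.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

for item in range(5):
    print(f"item {item}: discrimination = {item_discrimination(responses, item):+.2f}")
```

Under this index, a positive value means students who do well on the rest of the test tend to get the item right, a value near zero means the item does not separate students, and a negative value flags an item that weaker students answer correctly more often than stronger ones.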
The proliferation of Large Language Models (LLMs) in various applications, particularly education, necessitates a clear understanding of their capabilities and limitations in complex tasks like psychometric assessment.
This research highlights a critical limitation in LLMs' ability to estimate psychometric properties of test items, such as item discrimination, which affects how reliably they can be deployed in high-stakes educational or professional assessments.
The findings suggest that current LLMs may not be suitable for fully automating or replacing human judgment in psychometric analysis, particularly for understanding subtle differences in proficiency levels.
- Human psychometricians
- Specialized educational assessment platforms
- Researchers focused on LLM bias and limitations
- General-purpose LLM providers in education
- Companies attempting fully-automated assessment using LLMs
- Proponents of LLMs as universal assessment tools
Further research is likely to focus on methods for improving LLMs' psychometric capabilities, potentially through fine-tuning, retrieval augmentation, or specialized architectures.
Educational institutions may exercise increased caution in adopting LLM-based assessment tools for discerning proficiency differences, maintaining human oversight in critical evaluation processes.
This limitation could spur the development of hybrid assessment models in which LLMs augment human experts rather than replace them, particularly in fields that require a nuanced understanding of human performance.
Read at arXiv cs.CL