Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

arXiv:2601.07506v2. Abstract: While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-
As LLMs are deployed more widely for automated evaluation, particularly in QA, a deeper understanding of their failure modes becomes necessary.
This research reveals a fundamental limitation in using LLMs as judges when their internal parametric knowledge conflicts with external reference data, directly impacting the reliability and trustworthiness of AI evaluation systems.
The finding challenges the assumption that LLM-judges are unbiased and purely reference-adherent. It calls for new methodologies for robust AI evaluation and may limit the scope of current LLM-based assessment tools.
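The failure mode described above can be illustrated with a toy adherence check. The sketch below is not the paper's protocol: the `Item` type, the `adherence_rate` function, both toy judges, and the example questions (including a deliberately counterfactual reference and a fictional country) are all hypothetical. It assumes a judge is any callable that takes a question, a reference, and a candidate answer and returns whether the candidate is accepted. Every candidate here matches its reference, so a judge that strictly follows the reference should accept all of them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Item:
    question: str
    reference: str  # reference answer shown to the judge
    candidate: str  # answer being graded


# Hypothetical judge interface: (question, reference, candidate) -> accepted?
Judge = Callable[[str, str, str], bool]


def adherence_rate(judge: Judge, items: list[Item]) -> float:
    """Fraction of items the judge accepts.

    Assumes every candidate matches its reference, so any rejection
    means the judge did not follow the provided reference.
    """
    if not items:
        raise ValueError("items must not be empty")
    accepted = sum(judge(it.question, it.reference, it.candidate) for it in items)
    return accepted / len(items)


# Stand-in for a judge's parametric knowledge (illustrative only).
KNOWN_FACTS = {"What is the capital of France?": "Paris"}


def knowledge_biased_judge(question: str, reference: str, candidate: str) -> bool:
    # Toy judge: prefers its own "knowledge" over the reference when it has any.
    truth = KNOWN_FACTS.get(question, reference)
    return candidate.strip().lower() == truth.strip().lower()


def reference_following_judge(question: str, reference: str, candidate: str) -> bool:
    # Toy judge: compares the candidate only against the provided reference.
    return candidate.strip().lower() == reference.strip().lower()


items = [
    # Counterfactual reference that conflicts with the toy judge's knowledge.
    Item("What is the capital of France?", "Lyon", "Lyon"),
    # Fictional question the toy judge has no knowledge about.
    Item("What is the capital of Freedonia?", "Zenda", "Zenda"),
]

biased_rate = adherence_rate(knowledge_biased_judge, items)
following_rate = adherence_rate(reference_following_judge, items)
print(biased_rate, following_rate)
```

In a real setting the toy judges would be replaced by calls to an actual LLM judge, and the gap between the two rates on conflicting references gives a rough sense of how far the judge's own knowledge overrides the reference.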
- Developers of robust AI evaluation methodologies
- Companies focused on explainable and bias-mitigated AI
- Researchers in AI safety and alignment
- Developers relying solely on LLMs for black-box evaluation
- Users of AI systems evaluated by compromised LLM-judges
- Uncritical adopters of LLM-based QA systems
The finding is likely to prompt a reassessment of LLM evaluation practices and the development of more sophisticated hybrid evaluation frameworks.
It could spur innovation in 'context-aware' or 'reference-constrained' LLMs designed to prioritize explicit references over internal knowledge during evaluation tasks.
Long-term, this research contributes to the broader challenge of ensuring AI systems are reliable and aligned with human intent, especially in critical decision-making contexts where factual adherence is paramount.
Read at arXiv cs.CL