
arXiv:2605.16023v2. Abstract: LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse,
The rapid adoption of LLM-as-a-judge for model evaluation creates an immediate need to understand and address its inherent inconsistencies.
This research provides critical insights into the internal mechanics of LLMs, enabling more reliable and less biased evaluation paradigms for AI models at scale.
The study of LLM judgment moves beyond input-output observation to causal investigation of internal circuits, which could improve AI alignment and evaluation.
- AI developers
- AI safety researchers
- Model evaluation platforms
- Untuned LLM evaluation methods
- Users relying on inconsistent LLM judgments
This research reveals internal mechanisms behind the inconsistencies in LLM judgments that arise when the output format changes.
Improved understanding could lead to more robust and format-agnostic LLM evaluators, enhancing the reliability of AI development.
More consistent AI evaluation could accelerate the development of safer and more aligned advanced AI systems, impacting their societal integration.
Read at arXiv cs.CL