
arXiv:2606.13685v1 Announce Type: cross Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rat
LLM-as-a-Judge systems are now used throughout AI development, from ranking model outputs to training reward models, so their reliability needs closer study as they become critical infrastructure.
Run-to-run inconsistency and bias in LLM-as-a-Judge evaluations, with pairwise preferences flipping 13.6% of the time on average across judges, undermine the scientific rigor and trustworthiness of AI model assessment and development.
The findings challenge the assumption that LLM-based evaluation is objective and consistent, and may prompt a re-evaluation of current AI benchmarking methodologies and closer scrutiny of leaderboards.
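To make the flip-rate figures concrete, here is a minimal sketch of how such a statistic could be computed from repeated identical trials. The definition used (the share of trials that disagree with the most common verdict for a question) is an assumption for illustration, since the truncated abstract does not state the paper's exact formula. The function names and sample data are hypothetical.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of trials that disagree with the most common verdict.

    Assumed definition for illustration; the source paper may define
    flip rate differently.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def summarize(trials_by_question, threshold=0.20):
    """Return (mean flip rate, share of questions above the threshold)."""
    rates = [flip_rate(v) for v in trials_by_question.values()]
    mean_rate = sum(rates) / len(rates)
    share_above = sum(r > threshold for r in rates) / len(rates)
    return mean_rate, share_above


# Hypothetical data: 50 pairwise trials per question, verdict "A" or "B".
trials = {
    "q1": ["A"] * 45 + ["B"] * 5,   # 5 of 50 trials disagree with the majority
    "q2": ["A"] * 30 + ["B"] * 20,  # 20 of 50 trials disagree with the majority
}
mean_rate, share_above = summarize(trials)
print(f"mean flip rate: {mean_rate:.1%}, questions above 20%: {share_above:.0%}")
```

Under this definition a perfectly consistent judge scores 0, and a judge that splits evenly between two verdicts scores 0.5.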
- Human evaluators
- Robust AI evaluation frameworks
- Explainable AI research
- LLM-as-a-Judge only systems
- Public AI leaderboards (if not adjusted)
- AI models optimized solely on unreliable LLM feedback
This study will likely lead to calls for greater transparency and improved methodologies in LLM-as-a-Judge systems.
AI developers might pivot towards multi-modal or ensemble evaluation approaches to counteract individual LLM judge biases and inconsistencies.
A loss of trust in automated AI evaluation could slow the adoption of certain AI applications where objective performance metrics are crucial.
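One way an ensemble evaluation approach could look in practice is a majority vote over several judge verdicts that abstains when agreement is weak. This is a sketch under stated assumptions, not a method described in the source: the function name `ensemble_verdict`, the `min_agreement` threshold, and the `"undecided"` label are all hypothetical.

```python
from collections import Counter


def ensemble_verdict(verdicts, min_agreement=0.6):
    """Majority vote over judge verdicts; abstain when agreement is weak.

    `min_agreement` is a hypothetical threshold chosen for illustration.
    Returns (verdict, agreement), where verdict is "undecided" if the top
    choice falls below the threshold.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    top, top_count = Counter(verdicts).most_common(1)[0]
    agreement = top_count / len(verdicts)
    if agreement < min_agreement:
        return "undecided", agreement
    return top, agreement


# Hypothetical verdicts from three judges (or three repeated runs of one judge).
verdict, agreement = ensemble_verdict(["A", "A", "B"])
print(verdict, round(agreement, 2))
```

Reporting the agreement level alongside the verdict keeps the inconsistency visible instead of hiding it behind a single label.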
Read at arXiv cs.AI