
arXiv:2607.02104v1. Abstract: Large language models (LLMs) are increasingly used as cheap, scalable judges that compare candidate outputs pairwise to rank responses, select models, or triage papers. Yet LLM judges are both noisy and systematically biased: they favor verbose or well-formatted answers and exhibit position effects, so simply aggregating their votes recovers a ranking of presentation, not of true quality. We study the practical goal of identifying the top-k items under a fixed comparison budget, and make two contributions. First, we cast judging as Bayesian
The paper addresses an immediate, practical problem: LLM judges are being rapidly integrated into evaluation workflows, and their inherent biases need to be corrected.
As LLMs become ubiquitous for ranking and selection, understanding and mitigating their biases is crucial for fair and accurate results in both model development and output evaluation.
This research provides a methodological framework to improve the reliability of LLM-based judgments, moving beyond simple aggregation to a more bias-aware approach.
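To make the failure of simple aggregation concrete, here is a minimal toy sketch. It is not the paper's method: the items, their scores, the `judge` rule, and the `verbosity_bias` parameter are all hypothetical, chosen only to show how counting pairwise wins from a judge that rewards verbosity can return a top-k of presentation rather than of quality.

```python
from itertools import combinations

# Hypothetical items: (name, true_quality, verbosity). All values are made up.
ITEMS = [
    ("a", 9, 1),
    ("b", 8, 3),
    ("c", 5, 9),
    ("d", 4, 8),
    ("e", 3, 3),
]


def judge(x, y, verbosity_bias):
    """Toy deterministic judge: prefers the higher perceived score.

    Perceived score = true quality + verbosity_bias * verbosity.
    Ties go to the first-listed item.
    """
    score_x = x[1] + verbosity_bias * x[2]
    score_y = y[1] + verbosity_bias * y[2]
    return x[0] if score_x >= score_y else y[0]


def top_k_by_wins(items, k, verbosity_bias):
    """Naive aggregation: compare every pair once, rank items by win count."""
    wins = {name: 0 for name, _, _ in items}
    for x, y in combinations(items, 2):
        wins[judge(x, y, verbosity_bias)] += 1
    # Sort by wins (descending), breaking ties alphabetically by name.
    ranked = sorted(wins, key=lambda name: (-wins[name], name))
    return ranked[:k]


unbiased_top2 = top_k_by_wins(ITEMS, k=2, verbosity_bias=0)
biased_top2 = top_k_by_wins(ITEMS, k=2, verbosity_bias=1)
print("unbiased judge:", unbiased_top2)
print("verbosity-biased judge:", biased_top2)
```

In this toy setup the unbiased judge selects the two highest-quality items, while the biased judge selects the two most verbose ones, even though every vote is counted correctly. Collecting more votes would not help here, which is why a bias-aware model of the judge is needed rather than more aggregation.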
- AI developers
- Evaluation platforms
- Researchers using LLM judges
- Users of ranked AI outputs
- Uncritical LLM judge aggregators
- Platforms relying on naive LLM scoring
The quality of LLM-derived rankings and selections could improve, supporting more robust AI-driven decision-making.
More reliable judging could accelerate the adoption of autonomous AI agents that depend on accurate self-evaluation or peer review.
Greater trust in automated evaluation could in turn influence resource allocation and competition in AI development.
Read at arXiv cs.LG