
arXiv:2512.03019v2. Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating coun
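To make the contrast in the abstract concrete, here is a minimal Python sketch of two ways to aggregate n three-way judge samples for one item: a plain majority vote, and a fit of the textbook Davidson tie extension of the Bradley-Terry model to the rating counts. This is not the paper's calibrated procedure, which the truncated abstract does not spell out. The label names ("A", "B", "tie"), the function names, and the 0.5 pseudo-count are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def majority_vote(ratings):
    """Baseline: most frequent label; report 'tie' if the top count is shared."""
    if not ratings:
        raise ValueError("need at least one rating")
    ranked = Counter(ratings).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def davidson_probs(strength_a, strength_b, nu):
    """Davidson tie model: P(tie) is proportional to nu * sqrt(strength_a * strength_b)."""
    tie_mass = nu * sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return {
        "A": strength_a / total,
        "B": strength_b / total,
        "tie": tie_mass / total,
    }


def fit_davidson_single_pair(ratings, pseudo=0.5):
    """Fit (strength_a, strength_b, nu) to the counts of a single pair.

    With one pair and three outcomes the model is saturated, so the fit has
    a closed form. The pseudo-count (an illustrative assumption) keeps the
    estimates finite when an outcome was never observed.
    """
    counts = Counter(ratings)
    n_a = counts["A"] + pseudo
    n_b = counts["B"] + pseudo
    n_t = counts["tie"] + pseudo
    strength_a = n_a / n_b          # strength_b is fixed to 1 for identifiability
    nu = n_t / sqrt(n_a * n_b)      # tie propensity
    return strength_a, 1.0, nu


# Hypothetical ratings from eight independent judge samples on one item.
samples = ["A", "A", "tie", "B", "A", "tie", "A", "B"]
strength_a, strength_b, nu = fit_davidson_single_pair(samples)
probs = davidson_probs(strength_a, strength_b, nu)
print(majority_vote(samples), probs)
```

Majority vote returns only a label, while the fitted model returns a probability for each of the three outcomes, so the tie rate is modeled explicitly instead of being decided by a tie-breaking rule.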
The rapid advancement and adoption of LLMs for evaluative tasks necessitate refined methods for accurate and consistent judgment, especially as their outputs become critical inputs for other systems.
Improving the reliability and consistency of LLM-as-a-judge frameworks is crucial for robust AI development, quality control, and the deployment of autonomous systems that rely on such evaluations.
This research introduces a principled, distribution-calibrated method for aggregating multiple LLM judgments per item. It targets the inconsistency of common rules such as majority vote and soft self-consistency when ties are allowed.
- AI developers
- MLOps platforms
- LLM providers
- Ad-hoc LLM evaluation methods
- Low-reliability AI judging systems
More reliable benchmarks and evaluation processes for iterative LLM improvement.
Accelerated development of more capable and trustworthy AI agents due to clearer performance signals.
Enhanced trust in AI-driven decision-making systems where objective evaluation is paramount.
Read at arXiv cs.LG