
arXiv:2602.13110v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and biases. We propose SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework that calibrates an acceptance threshold so that, under exchangeability, the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To supply SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response
As LLMs become ubiquitous in evaluation, addressing their inherent biases and miscalibration is critical for their reliable application across various domains.
Improving the accuracy and neutrality of LLM evaluations is essential for fair and dependable progress in AI development, impacting everything from model training to content moderation.
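The proposed SCOPE framework and Bidirectional Preference Entropy aim to make pairwise LLM judging better calibrated: the judge abstains on uncertain comparisons so that, under exchangeability, the error rate among the judgments it does issue stays at or below a user-specified level $\alpha$.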
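The thresholding idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a labeled calibration set of (uncertainty score, judge-was-correct) pairs, treats lower scores as more confident, and uses a simple add-one correction `(errors + 1) / (accepted + 1)` as a conservative error estimate. The names `calibrate_threshold` and `accept` are hypothetical, and this simple correction does not by itself prove the guarantee stated in the abstract.

```python
from typing import Sequence


def calibrate_threshold(
    scores: Sequence[float], correct: Sequence[bool], alpha: float
) -> float:
    """Pick the largest uncertainty threshold whose conservative error
    estimate on the calibration set is at most alpha.

    Assumption (not from the source): the estimate is
    (errors + 1) / (accepted + 1) over calibration items with score <= t.
    Returns -inf when no threshold qualifies, meaning "abstain on everything".
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # Sort calibration items from most confident to least confident.
    pairs = sorted(zip(scores, correct))
    best = float("-inf")
    errors = 0
    i = 0
    n = len(pairs)
    while i < n:
        # Consume all items tied at the same score so the threshold is well defined.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            errors += not pairs[j][1]
            j += 1
        accepted = j  # number of items with score <= pairs[i][0]
        if (errors + 1) / (accepted + 1) <= alpha:
            best = pairs[i][0]
        i = j
    return best


def accept(score: float, threshold: float) -> bool:
    """Issue a judgment only when the uncertainty score is within the threshold."""
    return score <= threshold
```

At evaluation time, a new comparison would be judged only if `accept(score, threshold)` is true; otherwise the judge abstains. A lower `alpha` yields a stricter threshold and more abstentions.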
- AI developers
- Machine learning researchers
- Companies relying on AI for evaluation
- Users of LLM-generated content
- Uncalibrated LLM evaluation methods
- Developers relying on biased LLM judgments
More reliable and less biased deployment of large language models for evaluation tasks.
Accelerated development of more accurate and ethical AI systems due to improved feedback mechanisms.
Increased trust in AI-driven decisions and content, potentially widening AI adoption in sensitive areas.
Read at arXiv cs.AI