
arXiv:2602.07842v2. Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct
As LLMs are deployed in complex decision-making systems, their confidence calibration needs to be understood beyond single-answer question answering, in particular for questions that admit several valid answers.
Reliable confidence calibration is crucial for trust and safety in AI applications, particularly as LLMs are integrated into critical systems where multiple correct answers are common.
This research introduces MACE, a new benchmark that exposes a systematic weakness in current LLM confidence methods for multiple-answer questions, pushing the field towards more robust calibration techniques.
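The underestimation effect can be illustrated with a toy sketch. It assumes that confidence is estimated from agreement among sampled answers, which is one common training-free approach and not necessarily the exact method the paper evaluates. The sample answers and the helpers `agreement_confidence` and `fraction_valid` are hypothetical and exist only for illustration.

```python
from collections import Counter


def normalize(answer):
    """Lowercase and trim an answer so trivial variants compare equal."""
    return answer.strip().lower()


def agreement_confidence(samples):
    """Return the most frequent answer and the share of samples that match it.

    Assumption: this stands in for a generic agreement-based,
    training-free confidence estimate.
    """
    counts = Counter(normalize(s) for s in samples)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(samples)


def fraction_valid(samples, valid_answers):
    """Share of samples that fall inside a known set of valid answers.

    This needs gold answers, so it is a diagnostic, not a deployable estimator.
    """
    valid = {normalize(a) for a in valid_answers}
    return sum(normalize(s) in valid for s in samples) / len(samples)


# Hypothetical samples for a single-answer question ("capital of France").
single_answer_samples = ["Paris"] * 10

# Hypothetical samples for a multiple-answer question ("name a noble gas").
# Every sample is correct, but the samples disagree with each other.
multi_answer_samples = ["helium"] * 4 + ["neon"] * 3 + ["argon"] * 3
noble_gases = ["helium", "neon", "argon", "krypton", "xenon", "radon"]

single_top, single_conf = agreement_confidence(single_answer_samples)
multi_top, multi_conf = agreement_confidence(multi_answer_samples)
multi_valid = fraction_valid(multi_answer_samples, noble_gases)

print(single_top, single_conf)
print(multi_top, multi_conf)
print(multi_valid)
```

In this toy setup every sampled answer to the multiple-answer question is correct, yet the agreement score is well below the score for the single-answer question. The gap comes only from disagreement among valid answers, which is the kind of underestimation the abstract describes.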
- AI safety researchers
- LLM developers
- Industries relying on AI for complex decision-making
- LLMs with poor calibration
- Applications that misinterpret LLM confidence
- Current training-free calibration methods
Improved reliability and trustworthiness of large language models in diverse applications.
Accelerated development of more sophisticated calibration techniques and benchmarks for AI systems.
Enhanced AI adoption in regulated industries due to increased confidence in model outputs, potentially influencing regulatory frameworks.
Read at arXiv cs.CL