
arXiv:2603.28304v2 Announce Type: replace Abstract: Using large language models (LLMs) as judges for evaluating model outputs has emerged as an important paradigm for automated evaluation. However, the decoding temperature in LLM-as-a-judge settings is still largely chosen empirically, with limited systematic evidence on its impact. To address this gap, we conduct a systematic study of how temperature affects judgment behavior across different LLM judge models, prompting strategies, and evaluation paradigms. Our results show that higher temperatures generally decrease judgment consis
As LLMs take on a larger role in automated evaluation across domains, robust and reliable assessment methods become more important.
The study offers systematic evidence for tuning LLM-as-a-judge systems, which underpin quality assessment in AI development and applications.
An empirical understanding of how temperature settings influence LLM judgment consistency could lead to more standardized and effective evaluation practices.
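To make the temperature-versus-consistency relationship concrete, here is a minimal sketch in Python. It is not the paper's method: the judge is simulated by sampling a verdict from fixed, made-up logits, and `consistency` (the share of repeated verdicts that match the most common one) is an assumed metric chosen for illustration. All function names, labels and logit values are hypothetical.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by a temperature > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_verdict(logits, labels, temperature, rng):
    """Draw one verdict; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return labels[best]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(labels, weights=probs, k=1)[0]


def consistency(verdicts):
    """Assumed metric: fraction of verdicts equal to the most common verdict."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)


def consistency_by_temperature(logits, labels, temperatures, n_samples=200, seed=0):
    """Repeat the simulated judgment n_samples times at each temperature."""
    rng = random.Random(seed)
    results = {}
    for t in temperatures:
        verdicts = [sample_verdict(logits, labels, t, rng) for _ in range(n_samples)]
        results[t] = consistency(verdicts)
    return results


if __name__ == "__main__":
    # Hypothetical judge that mildly prefers answer "A" over "B" or a tie.
    labels = ["A", "B", "tie"]
    logits = [2.0, 1.0, 0.0]
    scores = consistency_by_temperature(logits, labels, [0, 0.5, 1.0, 2.0])
    for t, score in scores.items():
        print(f"T={t}: consistency={score:.2f}")
```

In this toy setup, greedy decoding always returns the same verdict, while higher temperatures flatten the distribution and spread repeated verdicts across labels. A real study would replace the simulated sampler with calls to an actual judge model.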
- AI developers
- Evaluation platform providers
- Researchers in NLP
- Developers relying on arbitrary LLM evaluation
- Unreliable LLM-as-a-judge methods
Systematic guidance on temperature selection could improve the accuracy and reproducibility of LLM-based evaluations.
Enhanced evaluation reliability could accelerate iterative development cycles for new AI models and applications.
More trustworthy evaluation benchmarks may foster greater public and institutional confidence in AI-driven systems.
Read at arXiv cs.CL