Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

arXiv:2606.26185v1. Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampling temperature to 0 makes grading deterministic. We test this assumption against a real safety-evaluation codebase (Japan AISI's open-source aisev) and show it fails on two levels. First, the harness invokes its grader without setting temperature or seed; the underlying provider silently applies its default of 1.0, so
Growing reliance on LLM-as-judge systems for safety evaluations, where a verdict can gate deployment, makes findings about their non-deterministic behavior timely.
The paper reports a reproducibility flaw in current AI safety evaluation practice: if grader verdicts are not reproducible, deployment decisions based on them become less reliable, which could undermine trust in systems labeled 'safe'.
The paper argues that setting the grader's temperature to 0 is necessary but not sufficient for deterministic, reproducible safety evaluations, which calls for a re-evaluation of current practices and toolchains.
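One way to act on this is to pin the grader's sampling parameters explicitly and then measure reproducibility empirically rather than assume it. The following is a minimal sketch under stated assumptions, not the paper's or aisev's method: the `GraderFn` signature, the `PINNED_PARAMS` names, and both stub graders are hypothetical stand-ins for a real grader call.

```python
from collections import Counter
from itertools import cycle
from typing import Any, Callable, Dict, List

# Hypothetical grader interface: takes an item and request parameters,
# returns a verdict string such as "pass" or "fail".
GraderFn = Callable[[str, Dict[str, Any]], str]

# Assumed parameter names. The point is to set them explicitly instead of
# relying on whatever default the provider applies.
PINNED_PARAMS: Dict[str, Any] = {"temperature": 0.0, "seed": 1234}


def verdict_agreement(
    grade: GraderFn, item: str, params: Dict[str, Any], runs: int = 10
) -> float:
    """Grade the same item `runs` times; return the share matching the majority verdict."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    verdicts: List[str] = [grade(item, params) for _ in range(runs)]
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / runs


def stable_stub(item: str, params: Dict[str, Any]) -> str:
    """Stand-in for a grader that always returns the same verdict."""
    return "pass"


def make_flaky_stub() -> GraderFn:
    """Stand-in for a grader whose verdict alternates between runs."""
    flips = cycle(["pass", "fail"])

    def grade(item: str, params: Dict[str, Any]) -> str:
        return next(flips)

    return grade


if __name__ == "__main__":
    print(verdict_agreement(stable_stub, "sample response", PINNED_PARAMS))
    print(verdict_agreement(make_flaky_stub(), "sample response", PINNED_PARAMS))
```

An agreement score below 1.0 with pinned parameters would indicate that the verdict is not reproducible for that item, so a harness could flag such items instead of reporting a single pass/fail result.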
- AI safety researchers
- Auditing and verification firms
- Developers of robust evaluation harnesses
- LLM developers relying on current evaluation practices
- Organizations deploying LLMs based on flawed safety assessments
- Users impacted by non-deterministic safety mechanisms
Immediate re-evaluation and potential redesign of LLM safety evaluation protocols and development pipelines.
Increased scrutiny and demand for transparency and reproducibility in all AI model testing and deployment frameworks.
A potential slowdown in the deployment of advanced LLMs until more robust and trustworthy safety evaluation methods are universally adopted.
Read at arXiv cs.LG