Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

arXiv:2606.00920v1. Abstract: Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under
The proliferation of LLMs in production environments necessitates a deeper understanding of their reliability beyond simplified benchmarks.
This research highlights a critical limitation of current LLM evaluation methods, particularly for sensitive applications like code generation: headline pass rates can make models look more reliable than they are across repeated runs.
The focus of LLM development and evaluation will likely shift towards improving stability and consistent output across multiple invocations, rather than just peak accuracy.
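The difference between accuracy and stability can be made concrete with a small sketch. The definitions below are assumptions for illustration, since the brief does not give the paper's exact formulas: "run-level pass rate" is taken to be the share of all individual (task, run) attempts that pass, and "retry-free coverage" the share of tasks that pass on every run. The task names and outcomes are made-up data, not results from the paper.

```python
from typing import Dict, List

# Hypothetical outcomes: task name -> pass/fail for each repeated run.
# These values are invented for illustration only.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [True, True, False, True],
    "task_d": [False, False, False, False],
}


def run_level_pass_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Share of all (task, run) attempts that passed (assumed definition)."""
    total_runs = sum(len(runs) for runs in outcomes.values())
    passed_runs = sum(sum(runs) for runs in outcomes.values())
    return passed_runs / total_runs


def retry_free_coverage(outcomes: Dict[str, List[bool]]) -> float:
    """Share of tasks that passed on every single run (assumed definition)."""
    stable_tasks = sum(all(runs) for runs in outcomes.values())
    return stable_tasks / len(outcomes)


pass_rate = run_level_pass_rate(results)
coverage = retry_free_coverage(results)
gap_points = (pass_rate - coverage) * 100  # gap in percentage points

print(f"run-level pass rate: {pass_rate:.1%}")   # 62.5%
print(f"retry-free coverage: {coverage:.1%}")    # 25.0%
print(f"gap: {gap_points:.1f} percentage points")  # 37.5
```

In this toy data, two tasks pass most of the time but not always, so they raise the run-level pass rate without counting toward retry-free coverage. Under these assumed definitions, that is the kind of gap the abstract describes.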
- AI researchers focusing on reliability
- Developers of robust LLM testing frameworks
- Companies demanding high reliability from AI systems
- LLM providers overstating model performance
- Benchmarks focused solely on single-run accuracy
- Applications deploying LLMs without robust reliability testing
Industry standards for LLM reliability testing will become more sophisticated.
Demand for 'stable' or 'deterministic' LLM versions will increase, potentially leading to new model architectures.
The perceived readiness of LLMs for mission-critical tasks might be recalibrated, slowing adoption in some conservative sectors until reliability improves.
Read at arXiv cs.LG