The Consistency Dilemma in LLMs: Generator-Evaluator Agreement and Vulnerability to Mistakes

arXiv:2606.30653v1 Announce Type: cross Abstract: Large language models are increasingly deployed in agentic pipelines that depend on the model evaluating its own outputs without external verification. The reliability of these pipelines depends on an implicit assumption: that the model applies relevant concepts the same way when it generates an output and later evaluates that output. We propose a new measure, generator-evaluator self-consistency, to test this assumption directly and apply it to 10 frontier models across 491 concepts. We find, first, that there is substantial variation in self-
As LLMs are increasingly deployed in agentic pipelines, whether a model applies concepts the same way when generating and when evaluating becomes a practical reliability question.
This research highlights a critical vulnerability in autonomous AI systems, directly impacting their trustworthiness and the scope of tasks they can reliably perform without human oversight.
Explicitly measuring generator-evaluator self-consistency provides a new metric for assessing LLM reliability and exposes variation across models where consistency had previously been assumed implicitly.
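The truncated abstract does not give the formula for the measure, so the following is only a minimal sketch. It assumes self-consistency can be approximated as the fraction of concepts for which a model, acting as evaluator, accepts the output it produced as generator. The names `self_consistency_rate`, `generate`, and `evaluate` are hypothetical, and the toy stubs stand in for real model calls.

```python
from typing import Callable, Iterable


def self_consistency_rate(
    concepts: Iterable[str],
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], bool],
) -> float:
    """Fraction of concepts where the evaluator accepts the generator's own output.

    Assumption: this agreement rate is an illustrative proxy, not the
    paper's definition, which the truncated abstract does not state.
    """
    concept_list = list(concepts)
    if not concept_list:
        raise ValueError("need at least one concept")

    agreements = 0
    for concept in concept_list:
        output = generate(concept)      # model in the generator role
        if evaluate(concept, output):   # same model in the evaluator role
            agreements += 1
    return agreements / len(concept_list)


# Hypothetical stubs; a real pipeline would call the same model twice.
def toy_generate(concept: str) -> str:
    return f"an example of {concept}"


def toy_evaluate(concept: str, output: str) -> bool:
    # Pretend the evaluator rejects its own output for one concept.
    return concept != "irony"


if __name__ == "__main__":
    rate = self_consistency_rate(
        ["justice", "irony", "recursion"], toy_generate, toy_evaluate
    )
    print(f"self-consistency rate: {rate:.2f}")
```

A rate below 1.0 in this sketch means the evaluator role disagreed with the generator role on at least one concept, which is the kind of gap the measure is meant to surface.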
- AI safety researchers
- LLM evaluators
- Developers of robust AI systems
- Developers relying on unchecked LLM self-evaluation
- Early adopters of fully autonomous AI agents
Further research and development will focus on improving LLM self-consistency to bridge the gap between generation and evaluation capabilities.
New standards and benchmarks for AI agent reliability will emerge, incorporating measures of internal consistency.
The development and deployment timeline for fully autonomous AI agents in sensitive applications may be extended as reliability issues are addressed.
Read at arXiv cs.AI