How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

arXiv:2605.28840v1 (announce type: cross). Abstract: Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we stu
The increasing deployment of LLM agents in production systems necessitates a deeper understanding of their reliability, particularly behavioral consistency in real-world, multi-step applications.
For strategic readers, the reliability and predictability of LLM agents directly impact their suitability for critical enterprise applications and autonomous systems, influencing investment and deployment decisions.
This research highlights a crucial limitation of current LLM agents, namely limited behavioral reproducibility, which could temper expectations of immediate widespread adoption in highly sensitive tasks until consistency improves significantly.
- LLM researchers focused on reliability
- Companies developing agent orchestration frameworks
- Sectors requiring high determinism in AI applications
- Developers deploying inconsistent LLM agents prematurely
- Applications requiring perfect behavioral reproducibility with current LLMs
- Sectors reliant on unproven agent autonomy
The study reveals that LLM agents exhibit inconsistencies in tool selection, order, and arguments across identical invocations.
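To make the three levels of consistency concrete (same tools, same order, same arguments), here is a minimal Python sketch of how repeated runs of one prompt could be scored. It is an illustration under stated assumptions, not the paper's method: the trace format, the function names, the metric names, and the "agreement with the most common outcome" score are all hypothetical choices made for this example.

```python
import json
from collections import Counter


def canonical(call):
    """Turn one (tool_name, args_dict) call into a hashable value.

    Arguments are serialized with sorted keys so that key order alone
    does not count as a difference.
    """
    name, args = call
    return (name, json.dumps(args, sort_keys=True))


def modal_agreement(items):
    """Fraction of runs that match the most common outcome."""
    counts = Counter(items)
    return counts.most_common(1)[0][1] / len(items)


def consistency_report(runs):
    """Score repeated runs of the same prompt (hypothetical metrics).

    runs: list of traces; each trace is a list of (tool_name, args_dict).
    """
    tool_sets = [frozenset(name for name, _ in trace) for trace in runs]
    tool_seqs = [tuple(name for name, _ in trace) for trace in runs]
    full_traces = [tuple(canonical(c) for c in trace) for trace in runs]
    return {
        "tool_set": modal_agreement(tool_sets),       # same tools, any order
        "tool_order": modal_agreement(tool_seqs),     # same tools, same order
        "exact_trace": modal_agreement(full_traces),  # same order and arguments
    }


# Four made-up traces from an identical invocation (illustrative data only).
runs = [
    [("search", {"q": "weather Paris"}), ("summarize", {"max_words": 50})],
    [("search", {"q": "weather Paris"}), ("summarize", {"max_words": 50})],
    [("summarize", {"max_words": 50}), ("search", {"q": "weather Paris"})],
    [("search", {"q": "Paris weather"}), ("summarize", {"max_words": 50})],
]
report = consistency_report(runs)
print(report)
```

In this toy data every run uses the same two tools, one run swaps their order, and one run rephrases an argument, so the score drops at each stricter level. Exact string matching on arguments is a deliberately strict assumption; a real evaluation might treat semantically equivalent arguments as equal.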
This lack of reproducibility will drive further research and development into more robust and deterministic agent architectures and evaluation metrics.
Improved agent consistency could accelerate the adoption of LLM agents in mission-critical applications, potentially transforming white-collar workflows with greater confidence.
Read at arXiv cs.AI