Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

arXiv:2605.23574v1. Abstract: Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matche
As long-horizon LLM agents proliferate, their failures on complex, multi-step goals become harder to ignore, which makes this research timely.
The research targets a core limitation of current AI agents, namely stopping before the requested amount of work is actually done; addressing it would make agents more reliable for automating complex real-world tasks.
Explicitly measuring and enforcing Quantitative Goal Persistence shifts the focus from a simple task-completion flag to verifiable, sustained progress toward a numerical target, which should make agents more robust.
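To make the idea concrete, here is a minimal sketch of a verifier-backed loop in the spirit of the abstract's description. It is not PushBench's code or API: the names `PersistenceReport`, `run_until_verified`, `DONE`, and the `is_valid` callback are all hypothetical, and the agent is reduced to a plain stream of submitted strings. The loop counts only distinct items the verifier accepts, and it tallies duplicates, rejected items, and premature "done" claims separately instead of folding them into one success flag.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set

# Hypothetical sentinel the agent emits when it claims the task is finished.
DONE = "DONE"


@dataclass
class PersistenceReport:
    target: int                 # number of distinct valid items requested
    accepted: int = 0           # distinct items the verifier confirmed
    duplicates: int = 0         # resubmissions of an already-accepted item
    invalid: int = 0            # items the verifier rejected
    false_completions: int = 0  # "done" claims made before the target was met

    @property
    def complete(self) -> bool:
        return self.accepted >= self.target


def run_until_verified(
    target: int,
    steps: Iterable[str],
    is_valid: Callable[[str], bool],
) -> PersistenceReport:
    """Consume agent submissions until the verifier has accepted `target`
    distinct items or the agent runs out of steps."""
    report = PersistenceReport(target=target)
    seen: Set[str] = set()
    for step in steps:
        if step == DONE:
            # The agent's own claim is never trusted; only the count matters.
            report.false_completions += 1
            continue
        if step in seen:
            report.duplicates += 1
        elif is_valid(step):
            seen.add(step)
            report.accepted += 1
            if report.complete:
                break  # stop only once the verifier-backed count is reached
        else:
            report.invalid += 1
    return report


if __name__ == "__main__":
    submissions = ["a.py", "a.py", DONE, "bad", "b.py", "c.py", "d.py"]
    print(run_until_verified(3, submissions, lambda s: s.endswith(".py")))
```

In this sketch the run ends only when the external check has confirmed the requested count, so an early `DONE` is recorded as a false completion rather than ending the episode.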
- AI Agent Developers
- Automation Software Providers
- Enterprises Adopting LLM Agents
- Ineffective Automation Solutions
- Manual Workflow Operators
More reliable and persistent AI agents capable of handling complex, long-duration tasks will emerge.
Adoption of AI agents will increase across industries for workflows that require sustained effort and verifiable outputs.
More sophisticated external verifiers and auditing systems for autonomous AI operations will become a new area of innovation.
Read at arXiv cs.LG