OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

arXiv:2605.29253v1. Abstract: Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contai
As AI agents spread into real-world applications, evaluation needs to go beyond task completion: an agent can pass its final check while its execution process is flawed, a mismatch the paper calls the 'Outcome-Process Gap'.
The work reflects a maturing view of agent performance, shifting the focus from task success alone to the reliability and safety of the execution process.
Explicit benchmarking of process anomalies could drive the development of more resilient, transparent, and trustworthy autonomous systems.
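To make the Outcome-Process Gap concrete, the sketch below counts how many oracle-passing sessions still carry a process anomaly. It is only an illustration under stated assumptions: the `Trajectory` record, the label strings, the `outcome_process_gap` function, and the four toy sessions are all hypothetical, and the ratio shown is one simplified reading of the gap, not the metric or schema used by OpenClawBench.

```python
from dataclasses import dataclass, field

# Anomaly categories named in the abstract; the label strings are illustrative.
ANOMALY_TYPES = {
    "unresolved_ambiguity",
    "unsafe_external_write",
    "ignored_error",
    "weakly_grounded_commitment",
    "capability_overcommitment",
}


@dataclass
class Trajectory:
    """Hypothetical record: one agent session with an outcome flag and process labels."""
    session_id: str
    passed_oracle: bool
    anomalies: set[str] = field(default_factory=set)


def outcome_process_gap(trajectories: list[Trajectory]) -> float:
    """Share of oracle-passing trajectories that have at least one process anomaly.

    Assumed, simplified definition for illustration only.
    """
    passed = [t for t in trajectories if t.passed_oracle]
    if not passed:
        return 0.0
    flagged = [t for t in passed if t.anomalies & ANOMALY_TYPES]
    return len(flagged) / len(passed)


# Toy data, made up for this example.
sessions = [
    Trajectory("s1", True),
    Trajectory("s2", True, {"ignored_error"}),
    Trajectory("s3", False, {"unsafe_external_write"}),
    Trajectory("s4", True, {"unsafe_external_write", "unresolved_ambiguity"}),
]

gap = outcome_process_gap(sessions)
print(f"{gap:.2f}")  # prints 0.67: two of the three passing sessions are flagged
```

An outcome-only evaluation would score the toy sessions at 3 passes out of 4 and stop there; the process-side view shows that two of those three passes hide an anomaly.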
- AI agent developers focused on reliability
- Organizations deploying AI agents in critical applications
- AI safety researchers
- Developers of AI debugging and monitoring tools
- AI agent developers prioritizing speed over robustness
- Companies with black-box agent solutions
- Early, unrefined AI agent deployments
OpenClawBench could become a standard for evaluating agentic AI, pushing developers toward more rigorous practices.
This framework could lead to regulatory or industry standards for agent trustworthiness, similar to software security certifications.
The enhanced reliability of agents could accelerate their adoption in high-stakes domains, further automating complex real-world processes.
Source: arXiv cs.AI