
arXiv:2606.13715v1 Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unin
A two-year comparison of agent performance on the same standardized benchmark, WorkBench, shows rapid progress in AI agent capabilities and may mark a near-term inflection point for autonomous systems.
The results show a large gain in both agent proficiency and safety, which could accelerate the deployment and impact of autonomous systems across white-collar professions.
Leading agents now complete far more tasks (89% versus 43%) while taking far fewer unintended harmful actions (2.5% versus 26%), which suggests they are becoming reliable enough for broader enterprise adoption.
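To put the size of the change in concrete terms, the short sketch below works through the arithmetic on the four figures quoted in the abstract. The variable and function names are illustrative assumptions and do not come from the paper, and the comparison is only between the best agent reported at each date.

```python
# Figures quoted in the abstract, as percent of WorkBench tasks.
# The dictionary layout and names are illustrative, not from the paper.
RESULTS = {
    "GPT-4 (March 2024)": {"completed": 43.0, "harmful": 26.0},
    "Claude Opus 4.8 (June 2026)": {"completed": 89.0, "harmful": 2.5},
}


def point_change(old: float, new: float) -> float:
    """Absolute change in percentage points."""
    return new - old


def relative_change(old: float, new: float) -> float:
    """Change as a fraction of the starting value."""
    return (new - old) / old


old = RESULTS["GPT-4 (March 2024)"]
new = RESULTS["Claude Opus 4.8 (June 2026)"]

completion_gain_points = point_change(old["completed"], new["completed"])
completion_ratio = new["completed"] / old["completed"]
harm_drop_points = point_change(old["harmful"], new["harmful"])
harm_relative = relative_change(old["harmful"], new["harmful"])

print(f"Task completion: {completion_gain_points:+.1f} points "
      f"({completion_ratio:.2f}x the 2024 rate)")
print(f"Harmful actions: {harm_drop_points:+.1f} points "
      f"({harm_relative:.1%} relative change)")
```

On these figures, task completion roughly doubles, while the rate of unintended harmful actions falls by about 90% in relative terms.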
- AI agent developers
- Enterprises adopting AI agents
- Knowledge workers augmented by agents
- SaaS companies for workflow automation
- Businesses slow to adopt agents
- Human-only back-office operations
Companies will increasingly integrate autonomous AI agents into operational workflows to boost efficiency.
This widespread integration will lead to significant re-skilling requirements and potential displacement in certain knowledge work sectors.
The enhanced capability and safety of agents could accelerate regulatory efforts to govern autonomous AI systems due to their direct operational and ethical implications.
Read at arXiv cs.AI