
arXiv:2606.29537v1 (announce type: new)
Abstract: Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude
The AI research community keeps pushing agent capabilities forward and needs more robust benchmarks to measure progress that existing ones can no longer capture.
This benchmark reveals the current limitations of frontier AI agents in complex, long-horizon real-world computer tasks, highlighting significant gaps that need to be addressed for true autonomous functionality.
The introduction of OSWorld 2.0 provides a more realistic and challenging yardstick for evaluating AI agents, shifting the focus towards practical, multi-step problem-solving rather than simpler, isolated tasks.
- AI research labs developing agentic systems
- Developers of robust autonomous agent frameworks
- Companies seeking to automate complex workflows
- AI models that perform well on narrow benchmarks but fail on real-world complexity
- Organizations relying on simple benchmarks for agent deployment decisions
The benchmark will drive significant research and development efforts into improving AI agent capabilities for complex, long-duration tasks.
Improved AI agents resulting from this research could automate a wider range of white-collar and professional workflows, increasing efficiency across industries.
The automation of complex computer use by advanced AI agents could lead to fundamental restructuring of digital work, impacting job markets and enterprise software consumption patterns.
Read at arXiv cs.AI