SIGNAL · AI · Jun 30, 2026, 4:00 AM · Signal 85 · Medium term

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

arXiv:2606.29537v1. Abstract: Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude

Why this matters

Why now

Frontier agents have outgrown existing computer-use benchmarks, which no longer capture the realism, complexity, and long-horizon demands of real-world computer use, so researchers need harder benchmarks to measure progress.

Why it’s important

The benchmark is built to expose where frontier AI agents fall short on complex, long-horizon computer tasks, gaps that must close before agents can handle end-to-end workflows autonomously.

What changes

OSWorld 2.0 provides a more realistic and challenging yardstick for evaluating AI agents: 108 end-to-end workflows that take human users a median of about 1.6 hours each. It shifts the focus from simpler, isolated tasks to practical, multi-step problem-solving.

Winners

· AI research labs developing agentic systems
· Developers of robust autonomous agent frameworks
· Companies seeking to automate complex workflows

Losers

· AI models that perform well on narrow benchmarks but fail on real-world complexity
· Organizations relying on simple benchmarks for agent deployment decisions

Second-order effects

Direct

The benchmark will drive significant research and development efforts into improving AI agent capabilities for complex, long-duration tasks.

Second

Improved AI agents resulting from this research could automate a wider range of white-collar and professional workflows, increasing efficiency across industries.

Third

The automation of complex computer use by advanced AI agents could lead to fundamental restructuring of digital work, impacting job markets and enterprise software consumption patterns.

Editorial confidence: 95 / 100 · Structural impact: 70 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.AI

Tracked by The Continuum Brief · live intelligence network
