WorldRoamBench: An Open-World Benchmark for Long-Horizon Stability of Interactive World Models

arXiv:2606.31672v1 Announce Type: cross Abstract: Despite rapid progress in interactive world models (IWMs), existing benchmarks evaluate action following only at trajectory level and ignore memory and interaction physics. We introduce WorldRoamBench, an open-world benchmark for long-horizon stability across four dimensions, each with tailored innovations: (i) Action: per-frame action metric bypassing cross-model semantic scale disparity and exposing failures hidden by trajectory; (ii) Vision: segment-based drift metric capturing non-monotonic mid-sequence collapse missed by start-vs-end compa
Rapid progress in interactive world models (IWMs) has outpaced how they are evaluated: existing benchmarks assess action following only at the trajectory level and ignore memory and interaction physics, which motivates a benchmark built specifically for long-horizon stability.
Better IWM benchmarks matter because AI agents and autonomous systems built on these models must stay stable and reliable over extended periods in complex, dynamic environments.
WorldRoamBench shifts IWM evaluation toward long-horizon stability. It brings memory and interaction physics into scope and replaces trajectory-level and start-vs-end scoring with per-frame action metrics and segment-based vision metrics.
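To illustrate why a segment-based drift measure can catch what a start-vs-end comparison misses, here is a minimal Python sketch. It is not the benchmark's actual metric, which this brief does not specify. The function names, the 0-100 quality scale, and the sample scores are all hypothetical, chosen only to show a non-monotonic mid-sequence collapse followed by recovery.

```python
from statistics import mean


def start_vs_end_drift(quality):
    """Hypothetical baseline: quality drop between the first and last frame."""
    return quality[0] - quality[-1]


def segment_drift(quality, segment_len):
    """Hypothetical segment-based drift.

    Splits the sequence into consecutive segments and returns the largest
    drop of any segment's mean quality below the first segment's mean.
    """
    segments = [
        quality[i:i + segment_len]
        for i in range(0, len(quality), segment_len)
    ]
    means = [mean(segment) for segment in segments]
    baseline = means[0]
    return max(baseline - m for m in means)


# Made-up per-frame quality scores (0-100): the rollout collapses in the
# middle of the sequence and then recovers before the final frame.
scores = [90] * 4 + [30] * 4 + [90] * 4

endpoint = start_vs_end_drift(scores)  # 0: first and last frames match
segmented = segment_drift(scores, 4)   # 60: the middle segment collapsed
```

In this toy sequence the endpoint comparison reports no drift at all, while the segment view exposes a 60-point drop in the middle of the rollout.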
- AI researchers in world models
- Developers of autonomous systems
- Robotics companies
- Simulation platform providers
- AI models with poor long-term memory
- Systems relying on short-horizon evaluation metrics
- Benchmarks lacking depth in interaction physics
The new benchmark will accelerate the development of more robust and stable interactive world models.
More stable world models will enable the deployment of more capable and reliable AI agents in real-world scenarios.
The widespread adoption of highly stable AI agents could lead to significant automation breakthroughs in various industries, potentially impacting workforce structures.
Read at arXiv cs.AI