
arXiv:2605.30100v1 Announce Type: new Abstract: World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random
Continued progress in AI calls for better benchmarks of complex capabilities, particularly state tracking and multi-step reasoning.
This benchmark, derived from real games, offers a way to assess and improve a model's ability to maintain a coherent internal representation, which is critical for agentic systems and world models.
A large-scale, exact state-tracking benchmark for a structured environment like chess provides a more rigorous testing ground than previous synthetic or language-based tests.
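To make the task concrete, here is a minimal Python sketch of exact state tracking: replay a move sequence, then compare a predicted board with the true one. It is an illustration under stated assumptions, not the benchmark's actual format or scoring code. The function names, the coordinate move notation ("e2e4"), the FEN-style piece-placement string, and the all-or-nothing comparison are all hypothetical choices, and castling, en passant, and promotion are deliberately left out.

```python
# Simplified illustration only: tracks piece placement for moves given in
# coordinate notation such as "e2e4". Castling, en passant, and promotion
# are NOT handled; a real evaluation would need full chess rules.

START_ROWS = [
    "rnbqkbnr", "pppppppp", "........", "........",
    "........", "........", "PPPPPPPP", "RNBQKBNR",
]  # rank 8 first, as in FEN


def start_board():
    """Return the initial position as a mutable 8x8 grid."""
    return [list(row) for row in START_ROWS]


def square_to_index(square):
    """Map a square like 'e2' to (row, col), with row 0 = rank 8."""
    col = ord(square[0]) - ord("a")
    row = 8 - int(square[1])
    return row, col


def apply_moves(moves):
    """Replay moves from the start position (hypothetical helper)."""
    board = start_board()
    for move in moves:
        r0, c0 = square_to_index(move[:2])
        r1, c1 = square_to_index(move[2:4])
        board[r1][c1] = board[r0][c0]  # a capture simply overwrites the target
        board[r0][c0] = "."
    return board


def placement(board):
    """Encode the grid as a FEN-style piece-placement string."""
    rows = []
    for row in board:
        out, empty = "", 0
        for cell in row:
            if cell == ".":
                empty += 1
            else:
                if empty:
                    out += str(empty)
                    empty = 0
                out += cell
        if empty:
            out += str(empty)
        rows.append(out)
    return "/".join(rows)


def exact_match(predicted, moves):
    """Assumed scoring rule: correct only if every square matches."""
    return predicted == placement(apply_moves(moves))
```

Under this all-or-nothing rule, a prediction with a single misplaced piece counts as wrong, which is what makes exact state tracking stricter than per-square accuracy.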
- AI researchers
- World model developers
- Gaming AI companies
- AI models with poor state-tracking capabilities
Improved training and evaluation of AI world models for structured environments.
Accelerated development of more robust and reliable AI agents capable of complex sequential decision-making.
Potential for breakthroughs in AI applications requiring precise, long-term state maintenance beyond gaming, such as robotic control or complex system simulations.
Read at arXiv cs.LG