
arXiv:2605.28277v1
Abstract: Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from
Evaluating LLM capabilities is an active research area, and multilingual coverage matters more as models are developed and deployed globally.
Whether LLMs build internal 'world models' from text bears on their architectural design and safety, and on how much trust AI products built on them can earn.
MentalMap provides a structured, multilingual framework for diagnosing spatial reasoning in LLMs: a six-level capability hierarchy (L0-L5) plus four diagnostic axes (frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination) that allow a more precise assessment of where models succeed and fail.
- AI researchers
- LLM developers
- Multilingual AI products
- Cognitive science
- LLMs lacking spatial reasoning
- Developers ignoring multilingual testing
The benchmark is designed to expose specific strengths and weaknesses of current LLMs in spatial reasoning across different languages.
That diagnostic detail can guide the development of more robust and linguistically versatile LLM architectures.
In the longer run, it could speed up work on general-purpose AI agents that understand and interact with the physical world across diverse cultural and linguistic contexts.
Read at arXiv cs.AI