Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

arXiv:2606.01914v1

Abstract: Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet cons
The paper uses nine open-weight MLLMs to identify and characterize a specific, widespread failure mode in spatial reasoning, reflecting a growing focus on diagnostics as MLLM capabilities advance.
Understanding and mitigating spatial lexical bias is crucial for developing reliable multimodal AI, especially as these models move into applications requiring precise spatial understanding.
The identification of 'spatial lexical bias' as a distinct and widespread failure mode adds a new facet to the ongoing research into MLLM limitations beyond purely visual attention issues.
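One way to make the bias described above measurable is sketched below. This is an illustrative assumption, not the paper's protocol: the `Trial` record, its field names, and the `added_option_capture_rate` metric are all hypothetical. The sketch counts how often a model that answered the binary question correctly switches to the newly added, incorrect option once a further spatial relation word is offered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One question asked twice (hypothetical record layout)."""
    binary_correct: bool   # model was right on the two-option version
    added_option: str      # label of the extra spatial word, e.g. "C"
    expanded_choice: str   # option the model picked in the expanded version
    expanded_answer: str   # ground-truth option in the expanded version


def added_option_capture_rate(trials: List[Trial]) -> float:
    """Fraction of eligible trials where the model picked the added option.

    A trial is eligible when the binary version was answered correctly and
    the added option is not the correct answer, so choosing it is an error.
    """
    eligible = [
        t for t in trials
        if t.binary_correct and t.added_option != t.expanded_answer
    ]
    if not eligible:
        return 0.0
    captured = sum(t.expanded_choice == t.added_option for t in eligible)
    return captured / len(eligible)


# Toy data, invented purely to exercise the function.
trials = [
    Trial(True, "C", "C", "A"),   # eligible, captured by the added option
    Trial(True, "C", "A", "A"),   # eligible, stayed correct
    Trial(False, "C", "C", "A"),  # excluded: binary version was wrong
    Trial(True, "C", "C", "C"),   # excluded: added option is the right answer
]
rate = added_option_capture_rate(trials)
```

Conditioning on a correct binary answer separates the lexical effect from cases where the model simply failed to read the scene, which is the distinction the brief draws between this failure mode and visual attention issues.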
- AI researchers
- Multimodal AI developers
- Companies building MLLM evaluation tools
- Unreliable MLLMs
- Applications requiring high spatial precision from current MLLMs
Ongoing MLLM development will need to incorporate diagnostics and mitigation strategies for spatial lexical bias to improve reliability.
Improved spatial reasoning in MLLMs will enable more robust applications in areas like robotics, augmented reality, and complex scene understanding.
The ability of MLLMs to perform complex spatial reasoning reliably could accelerate the development of autonomous systems with human-level environmental understanding.
Read at arXiv cs.CL