Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

arXiv:2606.31719v1. Abstract: In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Ou
As vision-language models (VLMs) are deployed in increasingly interactive systems, understanding their limitations in human-like communication becomes critical.
The study suggests that some current VLMs may fail in complex human-AI collaboration because they cannot reliably assess common ground, which is fundamental to effective dialogue and shared action.
The focus shifts from merely 'seeing' to understanding shared context, which calls for more sophisticated interaction and grounding mechanisms in human-AI collaboration.
- AI researchers focused on dialogue and cognitive architectures
- Developers of more robust human-AI interaction systems
- Companies investing in explainable and interpretable AI
- Companies deploying unrefined VLMs in high-stakes collaborative environments
- Applications relying solely on passive multimodal perception for interaction
- Simplistic approaches to AI 'common sense'
Further research and development will likely target VLMs' ability to establish and maintain common ground in interactive settings.
Until it is resolved, this limitation will likely slow the deployment of fully autonomous AI agents in sensitive collaborative settings.
The need for explicit grounding could lead to new architectures that prioritize interactive learning and theory of mind over pure data correlation.
Read at arXiv cs.CL