Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

arXiv:2506.05412v4. Abstract: Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ...
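The evaluation design described in the abstract can be illustrated with a minimal scoring sketch. It is not the authors' code: the field names (`condition`, `target`, `head_object`, `predicted`), the condition labels, and the example trials are all hypothetical. It assumes each trial records which object the person gazes at, which object the head points toward, and which object the model picked.

```python
from collections import defaultdict


def accuracy_by_condition(trials):
    """Fraction of trials per head-orientation condition where the
    predicted object equals the true gaze target."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t["condition"]] += 1
        if t["predicted"] == t["target"]:
            correct[t["condition"]] += 1
    return {c: correct[c] / total[c] for c in total}


def head_following_rate(trials):
    """Among trials where the head points at a distractor, the fraction
    where the prediction matches the head-directed object instead of
    the gaze target. Returns None if there are no such trials."""
    conflict = [t for t in trials if t["condition"] == "toward_distractor"]
    if not conflict:
        return None
    followed = sum(1 for t in conflict if t["predicted"] == t["head_object"])
    return followed / len(conflict)


# Made-up trials for illustration only; these are not data from the paper.
trials = [
    {"condition": "toward_target", "target": "mug", "head_object": "mug", "predicted": "mug"},
    {"condition": "toward_target", "target": "pen", "head_object": "pen", "predicted": "pen"},
    {"condition": "toward_distractor", "target": "mug", "head_object": "pen", "predicted": "pen"},
    {"condition": "toward_distractor", "target": "pen", "head_object": "mug", "predicted": "pen"},
    {"condition": "unconstrained", "target": "mug", "head_object": None, "predicted": "mug"},
]

scores = accuracy_by_condition(trials)
follow = head_following_rate(trials)
print(scores)
print(follow)
```

Splitting accuracy by condition is what makes a head-orientation confusion visible: a model that tracks the head rather than the eyes would be expected to score well when the head points at the target and poorly when it points at a distractor.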
This research provides a current assessment of Vision-Language Models' limitations in understanding nuanced social cues, despite rapid advancements in general VLM capabilities.
For strategy-minded readers, the finding exposes a critical gap between VLM and human perception, one that limits VLM reliability in complex human-centric applications and agentic systems.
We now have clearer evidence that current VLMs struggle with fundamental nonverbal communication, suggesting that simple visual input processing is not sufficient for robust social intelligence.
- Researchers in AI ethics and human-AI interaction
- Developers of VLM training methodologies focused on nuanced social cues
- Companies specializing in human behavior understanding via sensors
- Developers deploying VLMs in high-stakes social interaction roles prematurely
- General-purpose VLM architectures without explicit social cue training
VLMs will require more sophisticated training data and architectures specifically designed to differentiate subtle nonverbal cues like gaze from cruder indicators like head orientation.
The development of truly 'socially intelligent' AI agents will either be delayed or depend on hybrid models that incorporate explicit cognitive frameworks for human interaction.
This limitation could create opportunities for specialized AI models or human-in-the-loop systems to bridge the social perception gap in critical applications, affecting trust and adoption.
Read at arXiv cs.CL