
arXiv:2308.06035v4. Abstract: Humans routinely draw on visual context to predict upcoming words. To what extent current vision-language models produce comparable behaviour is unclear. Here we placed five state-of-the-art pretrained systems side-by-side with 600 human participants in a web-based Visual-World Paradigm. On each of 100 six-second movie clips, models and participants received either text only or synchronised video and text and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context
This research arrives as AI development moves rapidly toward multimodal, more human-like understanding, which makes aligning these systems with human perception critical for future applications.
The study highlights that human-AI alignment in multimodal understanding depends not only on model size but also on how models process attention, a point that matters for building AI systems that behave intuitively.
The focus for improving vision-language models may shift from mere scaling to more sophisticated architectural designs that better mimic human attentional mechanisms.
- AI researchers focusing on cognitive architectures
- Developers of multimodal AI applications
- Companies investing in explainable AI
- AI development strategies solely focused on parameter scaling
- Companies without strong cognitive science integration in their AI teams
Advances in understanding human visual-language processing inform new multimodal AI architectures.
More robust and less 'brittle' AI systems emerge that better handle nuanced real-world input.
Ethical AI alignment becomes more tractable as models align more closely with human perceptual understanding.
Read at arXiv cs.CL