
arXiv:2510.02528v2 Announce Type: replace-cross Abstract: Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First,
The rapid advancement and adoption of Large Multimodal Models call for a deeper understanding of their internal mechanics, both for control and for performance, which makes research into function vectors timely.
This research offers insight into how LMMs can be interpreted and manipulated, which could support more robust, controllable, and efficient AI systems, especially in complex visual reasoning tasks.
The work gives an empirical account of how LMMs transmit representations of visual relations, and it suggests that specific internal components (attention heads) can be targeted to modify model behavior rather than relying solely on external fine-tuning.
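To make the idea of extracting and applying a function vector concrete, here is a minimal sketch in plain Python. It assumes one recipe used in function-vector work on language models: average each selected head's output over several demonstration runs, sum those averages into a single vector, and add that vector to a hidden state. The abstract does not spell out the authors' procedure, so the helper names (`mean_head_outputs`, `function_vector`, `steer`), the `(layer, head)` identifiers, and the toy numbers are all hypothetical.

```python
def mean_head_outputs(runs):
    """Average each head's output vector over a list of runs.

    runs: list of dicts mapping a head id to that head's output vector
    (a list of floats) for one demonstration prompt.
    """
    n = len(runs)
    return {
        head: [sum(run[head][i] for run in runs) / n
               for i in range(len(runs[0][head]))]
        for head in runs[0]
    }


def function_vector(mean_outputs, selected_heads):
    """Sum the mean outputs of the selected heads into one vector."""
    dim = len(next(iter(mean_outputs.values())))
    fv = [0.0] * dim
    for head in selected_heads:
        for i, value in enumerate(mean_outputs[head]):
            fv[i] += value
    return fv


def steer(hidden_state, fv, scale=1.0):
    """Add the scaled function vector to a hidden state (the intervention)."""
    return [h + scale * v for h, v in zip(hidden_state, fv)]


# Toy data (made up): two hypothetical heads keyed by (layer, head),
# each producing a 3-dimensional output on two demonstration prompts.
runs = [
    {(0, 1): [1.0, 0.0, 2.0], (3, 4): [0.0, 1.0, 0.0]},
    {(0, 1): [3.0, 0.0, 0.0], (3, 4): [0.0, 3.0, 0.0]},
]

means = mean_head_outputs(runs)
fv = function_vector(means, selected_heads=[(0, 1), (3, 4)])
steered = steer([0.0, 0.0, 0.0], fv, scale=0.5)

print(fv)       # [2.0, 2.0, 1.0]
print(steered)  # [1.0, 1.0, 0.5]
```

In a real model the head outputs would come from forward hooks on the attention layers, and choosing which heads to include is the hard part; the sketch simply takes that selection as given.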
- AI researchers and developers
- Companies utilizing LMMs for visual tasks
- AI safety and interpretability organizations
- Developers relying on black-box LMM optimization
- Inefficient LMM fine-tuning methods
Increased interpretability and targeted intervention within large multimodal models become possible.
More robust and specialized LMMs could be developed, with better performance on visual relational tasks and less data required for adaptation.
The ability to 'program' specific relational capacities into LMMs could accelerate the development of more general and autonomous AI agents capable of complex environmental interaction.
Read at arXiv cs.LG