Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

arXiv:2606.19120v1. Abstract: On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final a
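The "dense token-level targets" idea in the abstract can be illustrated with a toy sketch. It assumes the target is a per-token KL divergence between a frozen teacher (which sees the reference) and the student, averaged over a rollout sampled from the student. The abstract does not state which divergence the authors use, and the function names below (softmax, token_kl, opsd_loss) are hypothetical, so this is an illustration of the general technique rather than the ViGOS objective.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position.

    Assumption: the teacher logits come from a frozen copy of the model
    that was additionally conditioned on the reference target.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opsd_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL over one rollout sampled from the student."""
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must score the same rollout")
    if not teacher_logits_seq:
        raise ValueError("rollout must contain at least one token")
    per_token = [
        token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout: 2 token positions over a 3-word vocabulary.
student = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
loss = opsd_loss(teacher, student)
```

Because every token position contributes its own term, the student gets a signal at each step of its rollout instead of a single reward at the end. The shortcut the abstract describes would arise when the teacher's distribution is shaped mostly by the text reference rather than by the image.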
MLLMs are advancing quickly, and training methods that close off shortcuts are needed to keep their reasoning grounded in the image, which makes this work timely.
By decoupling perception from reasoning, the method targets a known limitation of MLLMs: answers steered by text cues rather than by the visual input. Less shortcut-prone models matter for real-world applications.
The training procedure has the model first describe what it sees in the image and only then reason toward an answer, rather than relying on textual proxies.
- MLLM developers
- AI safety researchers
- Industries relying on multimodal AI
- Developers of less robust MLLM training methods
Improved MLLM performance and reduced instances of superficial reasoning in multimodal tasks.
Faster adoption and deployment of MLLMs in critical applications due to increased trustworthiness.
Accelerated development of more complex AI agents capable of nuanced understanding across modalities.
Read at arXiv cs.LG