Hidden Forgetting in Continual Multimodal Learning: When Accuracy Survives but Grounding Fails

arXiv:2607.02020v1 (Announce Type: new)

Abstract: Multimodal large language models must continually adapt to evolving tasks and domains, yet standard continual learning metrics mainly measure whether old answers remain correct, leaving the stability of multimodal grounding largely unexamined. We study this overlooked failure mode and ask whether a continually adapted MLLM can preserve not only what it answers, but also how it uses visual, textual, OCR, chart, and document evidence. We identify hidden evidence-use forgetting, where answer accuracy is retained while the model silently shift…
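To make the distinction concrete, here is a minimal sketch of how an auditor might separate "still correct" from "still grounded the same way." The abstract does not say how the paper measures evidence use, so everything here is an assumption: each probe example is taken to carry per-source attribution scores (over the five evidence types the abstract names) before and after an adaptation step, the shift is measured with total variation distance, and the 0.3 threshold, the `Probe` record and the `audit` function are all hypothetical. This is not the authors' method.

```python
from dataclasses import dataclass

# Evidence types named in the abstract; how the scores are obtained is assumed.
SOURCES = ("visual", "textual", "ocr", "chart", "document")


@dataclass
class Probe:
    """Hypothetical record: one example evaluated before and after a continual-learning step."""
    correct_before: bool
    correct_after: bool
    evidence_before: dict[str, float]  # attribution mass per evidence source (assumed input)
    evidence_after: dict[str, float]


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Clip negative scores to zero and rescale into a distribution over SOURCES."""
    clipped = {s: max(scores.get(s, 0.0), 0.0) for s in SOURCES}
    total = sum(clipped.values())
    if total == 0:
        # No attribution signal at all: fall back to a uniform distribution.
        return {s: 1.0 / len(SOURCES) for s in SOURCES}
    return {s: v / total for s, v in clipped.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: half the L1 distance between two distributions."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in SOURCES)


def audit(probes: list[Probe], threshold: float = 0.3) -> tuple[float, float]:
    """Return (accuracy retention, hidden-forgetting rate) over originally correct probes.

    A probe counts as hidden forgetting (under this sketch's assumed definition)
    when it is answered correctly both before and after adaptation, but its
    evidence distribution moved by more than `threshold` in total variation.
    """
    was_correct = [p for p in probes if p.correct_before]
    if not was_correct:
        return 0.0, 0.0
    retained = [p for p in was_correct if p.correct_after]
    hidden = [
        p
        for p in retained
        if total_variation(normalize(p.evidence_before), normalize(p.evidence_after)) > threshold
    ]
    return len(retained) / len(was_correct), len(hidden) / len(was_correct)


# Toy usage with made-up scores (illustrative only, not results from the paper).
probes = [
    # Still correct, but attribution moved from the chart to the question text.
    Probe(True, True, {"chart": 0.8, "textual": 0.2}, {"chart": 0.1, "textual": 0.9}),
    # Still correct, evidence use essentially unchanged.
    Probe(True, True, {"visual": 0.7, "ocr": 0.3}, {"visual": 0.65, "ocr": 0.35}),
    # Forgotten outright: the answer is now wrong.
    Probe(True, False, {"document": 1.0}, {"document": 1.0}),
]
retention, hidden_rate = audit(probes)
print(f"accuracy retention={retention:.3f}, hidden forgetting rate={hidden_rate:.3f}")
```

On this toy data an accuracy-only metric reports two of three answers retained, while the grounding check flags one of those two as having silently switched its evidence source, which is the gap the abstract describes.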
The rapid development and deployment of multimodal large language models (MLLMs) make it increasingly urgent to understand their reliability and failure modes under continual adaptation.
This research highlights a critical vulnerability in continually adapting MLLMs: a model can keep its answer accuracy while silently changing how it grounds those answers in visual, textual, OCR, chart, and document evidence, which poses risks for applications that depend on trustworthy grounding.
Evaluation of continually learning MLLMs will expand beyond answer accuracy to cover the stability and integrity of their multimodal grounding.
- AI safety researchers
- MLLM developers focused on robustness
- Auditing and validation services
- MLLM deployers relying solely on accuracy metrics
- Applications requiring high-fidelity evidence grounding
Increased research into metrics and techniques for detecting and mitigating hidden evidence-use forgetting in MLLMs.
New MLLM architectures or training methods emerge that prioritize grounding stability alongside accuracy, potentially trading some short-term efficiency for long-term reliability.
Regulatory bodies or industry standards begin to mandate specific grounding stability tests for MLLMs used in critical applications.
Read at arXiv cs.AI