
arXiv:2604.16358v2. Abstract: MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We a
The growing deployment of multimodal large language models (MLLMs) in real-world, multi-turn settings has exposed critical safety vulnerabilities, particularly 'long-context safety decay' under adversarial conditions.
Ensuring that MLLMs are safe and robust in interactive settings is essential both for their widespread adoption and for mitigating the risk of misuse or unintended harm.
This research introduces SaFeR-Steer, an adaptive training framework that moves beyond static, single-turn safety alignment to address the mismatch between how models are currently trained and how they are deployed.
- AI developers
- MLLM users
- AI safety researchers
- Platform providers
- Attackers exploiting MLLM vulnerabilities
- Developers relying solely on single-turn safety data
Multi-turn MLLMs could become more resilient to adversarial attacks and malicious prompts.
Greater resilience could increase trust in MLLM applications and ease their integration into sensitive and complex workflows.
The methodology could inform broader safety alignment strategies across other AI modalities and agentic systems.
Read at arXiv cs.LG