
arXiv:2605.28860v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerabili
Ongoing research into LLM fine-tuning and the challenge of catastrophic forgetting is central to improving AI model stability and performance.
Understanding the mechanistic origins of catastrophic forgetting provides a critical pathway to developing more robust and efficient AI models, reducing retraining costs and improving continuous learning capabilities.
Understanding why RL preserves circuits better than SFT shifts research focus toward leveraging RL's strengths or mitigating SFT's weaknesses for more stable model updates.
- AI researchers
- LLM developers
- Companies deploying AI agents
- AI-dependent industries
- Developers reliant solely on SFT
- AI applications requiring frequent and extensive retraining
Improved model retention and reduced forgetting in large language models will lead to more stable and adaptable AI systems.
This could accelerate the development and deployment of sophisticated AI agents and continuously learning systems across various sectors.
Enhanced model stability might reduce the computational and energy burden associated with frequent model updates, indirectly impacting compute supply chains and energy consumption.
Read at arXiv cs.LG