Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

arXiv:2605.30833v1 (cross-listed). Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we in
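The two quantities the abstract refers to, teacher confidence and the token-level reverse-KL signal, can be made concrete with a minimal sketch. This is not the paper's code or data: the function names, the synthetic logits, and the `sharpness` schedule are all hypothetical, chosen only to show how one might measure per-position teacher entropy and KL(student || teacher) along a student-generated sequence.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def teacher_entropy(teacher_logits):
    # Per-position entropy (in nats) of the teacher's next-token distribution.
    # Higher entropy means a less confident teacher.
    logp = log_softmax(teacher_logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

def reverse_kl(student_logits, teacher_logits):
    # Per-position KL(student || teacher): the expectation is taken under
    # the student's distribution, which is what "reverse KL" denotes.
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return (np.exp(log_s) * (log_s - log_t)).sum(axis=-1)

# Synthetic illustration only (assumption, not measured data): the teacher's
# logits are one fixed random vector scaled by a factor that shrinks with
# position, mimicking a teacher that grows less confident on longer prefixes.
rng = np.random.default_rng(0)
seq_len, vocab = 8, 16
sharpness = np.linspace(4.0, 0.5, seq_len)[:, None]   # hypothetical decay
base = rng.normal(size=vocab)
teacher_logits = sharpness * base[None, :]
student_logits = rng.normal(size=(seq_len, vocab))

entropy = teacher_entropy(teacher_logits)
kl = reverse_kl(student_logits, teacher_logits)
for t in range(seq_len):
    print(f"position {t}: teacher entropy = {entropy[t]:.3f} nats, "
          f"reverse KL = {kl[t]:.3f}")
```

With real models, `teacher_logits` and `student_logits` would come from scoring the same student-generated trajectory with both networks; plotting the teacher entropy against position is one simple way to check whether the decay described in the abstract appears in a given setup.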
As large language models continue to scale, efficient training and deployment depend increasingly on distillation, which makes work on Supervision Fidelity Decay timely.
Addressing Supervision Fidelity Decay matters for building more robust and capable models, because it bears directly on the quality and reliability of AI agents and automated reasoning systems.
Keeping the teacher confident and discriminative during on-policy distillation should make the transfer of complex reasoning capabilities to student models more effective.
- AI developers
- AI-driven automation platforms
- Companies using distilled AI models
- Inefficient AI training methods
- Models prone to drift in long reasoning chains
Improved performance and efficiency of large language models and other AI agents.
Faster development cycles and deployment of more sophisticated autonomous AI systems.
Accelerated adoption of AI in critical sectors as reliability and reasoning capabilities improve, potentially leading to more complex AI agent ecosystems.
Read at arXiv cs.AI