Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

arXiv:2606.00755v1. Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating met
The rapid deployment of large language models has exposed a recurring technical hurdle in reinforcement learning: entropy collapse, which this research aims to address.
Improving the stability and efficiency of reinforcement learning from verifiable rewards matters for building more capable and robust reasoning models, and for their practical applications.
This research proposes an internal mechanism (TS-OPSD) to mitigate entropy collapse in RL, offering a more integrated solution than existing external interventions and potentially accelerating model development.
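As a rough illustration of the external intervention the abstract contrasts with TS-OPSD, the sketch below shows how raising the sampling temperature increases the entropy of a sharply peaked softmax policy. It is not the TS-OPSD method itself, which the abstract excerpt does not spell out; the logits and the function names `softmax` and `entropy` are hypothetical and chosen only for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits for a concentrated ("collapsed") policy.
logits = [8.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0, 4.0):
    h = entropy(softmax(logits, t))
    print(f"T={t}: entropy={h:.4f} nats")
```

Entropy grows with temperature and is bounded above by log(4) for four outcomes. The change lives in the sampler rather than in the model weights, which is the sense in which the abstract calls such remedies external to the model parameters.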
- AI researchers
- Large Language Model developers
- AI companies
- Developers relying solely on external RL regularization methods
More efficient and diverse policy learning in reinforcement learning from verifiable rewards for LLMs.
Faster development and deployment of more robust and reasoning-capable AI agents.
Enhanced AI capabilities leading to wider automation and new applications in complex problem-solving domains.
Read at arXiv cs.CL