Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

arXiv:2605.22731v1. Abstract: Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the curren
The paper arrives as LLM post-training methods are advancing rapidly, with attention focused on performance and efficiency.
A clearer understanding of the state distribution used in LLM post-training could lead to more effective and robust models across a wide range of applications.
The paper widens the analysis from loss functions alone to the state distribution on which supervision is applied, offering a complementary perspective for analyzing and improving post-training methods.
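The distinction between dataset states and policy-induced states can be illustrated with a toy sketch. It is not taken from the paper: the function names, the lookup-table policy, and the single demonstration are all hypothetical, chosen only to show that a state is a (prompt, prefix) pair and that a policy's own rollouts can visit states a fixed dataset never contains.

```python
# Toy illustration (assumed setup, not the paper's code).
# A state is (prompt, generated prefix), as defined in the abstract.

def dataset_states(prompt, target):
    """States seen when training on a fixed demonstration (SFT-style)."""
    return [(prompt, tuple(target[:t])) for t in range(len(target))]

def rollout_states(prompt, policy, max_len):
    """States seen when the policy extends its own prefix (RL / OPD-style)."""
    prefix, states = [], []
    for _ in range(max_len):
        states.append((prompt, tuple(prefix)))
        token = policy(prompt, tuple(prefix))
        if token == "<eos>":
            break
        prefix.append(token)
    return states

def toy_policy(prompt, prefix):
    """Hypothetical deterministic policy that departs from the demo at step 2."""
    table = {(): "a", ("a",): "c", ("a", "c"): "d"}
    return table.get(prefix, "<eos>")

demo = ["a", "b", "d"]                      # hypothetical demonstration
sft = dataset_states("q", demo)             # fixed dataset states
onp = rollout_states("q", toy_policy, 5)    # states induced by the policy

shared = set(sft) & set(onp)
off_data = [s for s in onp if s not in set(sft)]

print("dataset states:  ", sft)
print("on-policy states:", onp)
print("shared:", len(shared), "| on-policy only:", len(off_data))
```

In this sketch the two state sets agree only until the policy emits a token the demonstration does not contain. Every later on-policy state lies outside the dataset, which is the kind of mismatch a state-distribution view is meant to expose.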
- AI researchers
- Large language model developers
- Companies deploying LLMs
- Inefficient LLM training methodologies
Improved understanding and optimization of supervised fine-tuning, reinforcement learning, and distillation for large language models.
Development of more stable, performant, and generalizable AI agents and applications.
Accelerated progress towards more autonomous and human-like AI systems capable of complex reasoning and interaction.
Read at arXiv cs.LG