Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

arXiv:2606.04466v1 Announce Type: new Abstract: Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specif
The paper takes up a question that SFT-then-RL pipelines for SLMs rarely address: which data should be learned at each stage.
If the approach holds up, aligning data with the distinct role of each stage could make SLM post-training more effective and more resource-efficient, with stronger reasoning as the result.
The proposed difficulty-aware framework organizes training data into stage-specific sets: skills the model has not yet mastered go to SFT, and skills it can already partially access go to RL.
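The idea of routing data by difficulty can be illustrated with a minimal sketch. The abstract as indexed here does not say how difficulty is measured or where the cut-offs lie, so everything specific below is an assumption made for illustration: the `Problem` type, the use of a sampled pass rate as the difficulty signal, the `split_by_difficulty` helper, and the `sft_max` and `rl_max` thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    problem_id: str
    # Assumed difficulty signal: fraction of sampled attempts the current
    # model solved correctly, in [0, 1]. The paper may use something else.
    pass_rate: float


def split_by_difficulty(problems, sft_max=0.0, rl_max=0.9):
    """Partition problems into (sft_set, rl_set, mastered).

    Hypothetical thresholds:
      pass_rate <= sft_max          -> not yet mastered, routed to SFT
      sft_max < pass_rate < rl_max  -> partially accessible, routed to RL
      pass_rate >= rl_max           -> treated as already mastered
    """
    if not 0.0 <= sft_max < rl_max <= 1.0:
        raise ValueError("expected 0 <= sft_max < rl_max <= 1")

    sft_set, rl_set, mastered = [], [], []
    for p in problems:
        if p.pass_rate <= sft_max:
            sft_set.append(p)
        elif p.pass_rate < rl_max:
            rl_set.append(p)
        else:
            mastered.append(p)
    return sft_set, rl_set, mastered


if __name__ == "__main__":
    pool = [Problem("a", 0.0), Problem("b", 0.25), Problem("c", 1.0)]
    sft_set, rl_set, mastered = split_by_difficulty(pool)
    print([p.problem_id for p in sft_set])   # ['a']
    print([p.problem_id for p in rl_set])    # ['b']
    print([p.problem_id for p in mastered])  # ['c']
```

Under these assumptions, problems the model never solves become SFT material, problems it solves some of the time become RL material, and problems it solves almost always are set aside. The real framework may draw these boundaries differently.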
- AI model developers
- Companies using SLMs
- AI infrastructure providers
- Inefficient SLM training methodologies
- Developers ignoring data strategy
Improved performance and cost-efficiency in developing Small Language Models for reasoning tasks.
Accelerated deployment of capable SLMs into specialized AI agents and applications requiring advanced cognitive functions.
Enhanced competition among foundation model developers as optimized smaller models challenge larger, general-purpose ones.
Read at arXiv cs.CL