The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

arXiv:2606.16152v1. Abstract: Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling acro
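The self-generated baseline mentioned in the abstract, SLM traces selected through rejection sampling, can be illustrated with a minimal sketch. It is not the paper's implementation: the acceptance rule (keep a trace only if its final answer matches a reference answer), the `Answer:` trace format, and the names `extract_final_answer`, `rejection_sample`, and `make_toy_sampler` are all assumptions made for illustration, and a toy sampler stands in for a real SLM.

```python
import itertools
from typing import Callable, List


def extract_final_answer(trace: str) -> str:
    """Return the text after the last 'Answer:' marker (hypothetical trace format)."""
    marker = "Answer:"
    idx = trace.rfind(marker)
    if idx == -1:
        return ""  # no parseable answer, so the trace will be rejected
    return trace[idx + len(marker):].strip()


def rejection_sample(
    problem: str,
    reference_answer: str,
    sample_fn: Callable[[str], str],
    num_samples: int = 8,
) -> List[str]:
    """Draw num_samples traces from the model and keep the ones that pass the filter.

    Assumption: the filter is exact match between the trace's final answer
    and the reference answer. The source does not state the criterion.
    """
    kept = []
    for _ in range(num_samples):
        trace = sample_fn(problem)
        if extract_final_answer(trace) == reference_answer.strip():
            kept.append(trace)
    return kept


def make_toy_sampler(outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for an SLM: cycles through fixed outputs so the demo is deterministic."""
    cycle = itertools.cycle(outputs)

    def sample(problem: str) -> str:
        return next(cycle)

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler([
        "2 + 2 = 4. Answer: 4",
        "2 + 2 = 5. Answer: 5",
        "I am not sure.",
    ])
    kept = rejection_sample("What is 2 + 2?", "4", sampler, num_samples=6)
    for trace in kept:
        print(trace)
```

In this sketch the retained traces come from the model's own output distribution, and the filter only removes incorrect ones. That differs from selecting or rewriting data to maximize a reward model score, which is the practice the abstract questions.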
With distillation methods for improving Small Language Models (SLMs) proliferating, the question of which training data actually helps is timely.
This paper challenges the assumption that higher reward model scores mean more useful training data, which could lead to more efficient and effective SLM development.
The focus for improving SLMs shifts from simply maximizing reward scores in training data to carefully considering the origin and specific utility of that data.
- SLM developers
- AI efficiency research
- On-device AI applications
- Oversimplified data quality metrics
- Purely reward-model-driven distillation practices
Researchers will begin exploring more sophisticated metrics for data utility beyond simple reward scores for SLM training.
This could lead to new architectures or training methodologies specifically designed to leverage 'lower quality but higher utility' data efficiently.
The development of highly performant, small AI models could accelerate, broadening AI accessibility and deployment on resource-constrained devices.
Source: arXiv cs.AI