How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

arXiv:2605.25698v1. Abstract: High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without
As LLM training scales, the scarcity of high-quality data becomes a critical bottleneck, necessitating theoretical guidance for its optimal utilization.
This research provides a theoretical framework for scheduling scarce high-quality data during LLM training, with direct implications for training efficiency and model performance.
Joint scheduling of data quality and batch size in LLM training will become more principled and efficient.
- AI model developers
- Cloud AI providers
- Data curation platforms
- AI research institutions
- Developers solely relying on brute-force scaling
- Generic data providers
- Less efficient LLM training approaches
More efficient and powerful LLMs will emerge from optimized data consumption strategies.
The value of high-quality, curated datasets will significantly increase, leading to new data market dynamics.
Access to high-performance AI models could expand as the training cost per unit of performance decreases.
Read at arXiv cs.LG