Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

arXiv:2605.20314v1. Abstract: This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute savings during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures, and optimizers, and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using
This research offers a new theoretical and empirical account of how dataset size affects AI training efficiency, which is highly relevant as compute costs become a critical constraint.
A strategic reader should care because this finding suggests a potential pathway to significantly reduce the compute and energy requirements for training AI models, impacting development costs and accessibility.
The conventional understanding that more data always equals better or faster training is challenged, specifically highlighting benefits in speed from repeating smaller datasets under certain conditions.
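To make the comparison concrete, here is a minimal sketch of the two sampling regimes under an equal step budget: a small dataset that is repeated many times versus a large dataset seen once. The function names `batch_schedule` and `repeats_per_sample`, the dataset sizes, and the batch size are all hypothetical choices for illustration. The sketch only builds the batch schedules; it does not train a model and does not reproduce the paper's results.

```python
import random


def batch_schedule(dataset_size, batch_size, num_steps, seed=0):
    """Return num_steps batches of sample indices.

    Each epoch is a fresh shuffle of range(dataset_size); epochs repeat
    until the step budget is used up. (Hypothetical helper, not from the paper.)
    """
    if batch_size > dataset_size:
        raise ValueError("batch_size must not exceed dataset_size")
    rng = random.Random(seed)
    batches = []
    while len(batches) < num_steps:
        order = list(range(dataset_size))
        rng.shuffle(order)
        # Drop any incomplete final batch so every step costs the same.
        for start in range(0, dataset_size - batch_size + 1, batch_size):
            batches.append(order[start:start + batch_size])
            if len(batches) == num_steps:
                break
    return batches


def repeats_per_sample(batches, dataset_size):
    """Count how many times each sample index appears across all batches."""
    counts = [0] * dataset_size
    for batch in batches:
        for index in batch:
            counts[index] += 1
    return counts


# Same compute budget in both regimes: 1000 steps at batch size 32.
small = batch_schedule(dataset_size=320, batch_size=32, num_steps=1000)
large = batch_schedule(dataset_size=32000, batch_size=32, num_steps=1000)

small_counts = repeats_per_sample(small, 320)      # every sample repeated
large_counts = repeats_per_sample(large, 32000)    # every sample seen once
```

With these assumed sizes, both schedules cost the same number of gradient steps, but the small-dataset run revisits each sample many times while the large-dataset run makes a single pass. This is the kind of setup in which the abstract reports faster learning for the smaller dataset.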
- AI developers with limited compute
- Hardware developers focused on efficiency
- AI research institutions investigating scaling laws
- Cloud providers offering AI training services
- AI developers exclusively focused on massive datasets
- Inefficient AI training practices
AI model training becomes more efficient, potentially reducing compute costs and time to deployment.
This could democratize AI development, allowing more players to train competitive models with fewer resources.
Reduced compute demands could also alleviate pressure on energy grids and contribute to more sustainable AI development practices.
Read at arXiv cs.LG