Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

arXiv:2606.05863v1. Abstract: Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weig
Research on training dynamics is giving a more precise account of phenomena such as grokking.
In grokking, fitting the training data and learning the underlying rule happen on different time scales. Understanding that separation matters for building models that are more efficient, robust, and interpretable.
This paper offers a theoretical framework for the 'two training clocks' in grokking, which could support targeted algorithmic improvements instead of reliance on empirical observation alone.
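As a rough illustration of what a pair of stopping times looks like, the sketch below computes two first-hitting times on synthetic curves. Everything in it is an assumption made for illustration: the linear margin growth rate `GAMMA`, the slowly decaying complexity proxy and its rate `DECAY`, both thresholds, and the helper `first_hitting_time` are hypothetical and are not taken from the paper, whose exact definitions are not given in the excerpt above.

```python
import math


def first_hitting_time(values, threshold):
    """Return the first index t with values[t] <= threshold, or None if never reached."""
    for t, v in enumerate(values):
        if v <= threshold:
            return t
    return None


# Synthetic curves (assumptions for illustration, not results from the paper).
GAMMA = 0.5    # hypothetical margin growth rate per step
DECAY = 0.01   # hypothetical slow decay rate of a representation-complexity proxy
STEPS = 5000

# Logistic loss evaluated at a margin that grows linearly as GAMMA * t.
loss = [math.log1p(math.exp(-GAMMA * t)) for t in range(STEPS)]
# A complexity proxy that shrinks only polynomially in t.
complexity = [1.0 / (1.0 + DECAY * t) for t in range(STEPS)]

EPS = 1e-3                  # target loss level
COMPLEXITY_TARGET = 0.1     # hypothetical "simple enough" threshold

fit_clock = first_hitting_time(loss, EPS)
simplify_clock = first_hitting_time(complexity, COMPLEXITY_TARGET)

# Since log(1 + x) <= x, the loss is at most exp(-GAMMA * t), so the fit clock
# is bounded by log(1 / EPS) / GAMMA, which is logarithmic in 1 / EPS.
fit_clock_bound = math.ceil(math.log(1.0 / EPS) / GAMMA)

print("fit clock:", fit_clock, "bound:", fit_clock_bound)
print("simplification clock:", simplify_clock)
```

Under these toy assumptions the loss threshold is crossed after a number of steps that scales like log(1/ε), while the complexity proxy takes far longer to reach its target, which is the qualitative gap between the two clocks that the abstract describes.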
- AI researchers
- Deep learning practitioners
- Developers of foundational AI models
Improved understanding of how AI models generalize beyond training data.
Development of new optimization algorithms that explicitly manage the trade-off between memorization and generalization.
More predictable and robust AI systems across various applications, reducing unexpected failures or biases.
Read at arXiv cs.LG