arXiv:2602.16340v3 Announce Type: replace Abstract: We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that \textit{momentum steepest descent} algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are \textit{approximate} steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, an
Source: arXiv cs.LG — read the full report at the original publisher.
