arXiv:2603.00742v2 Announce Type: replace Abstract: While Adam has long been the ubiquitous default optimizer for deep neural networks, Muon has recently seen rapid adoption due to its superior training speed. Although much of the literature focuses on validating the benefits of Muon, our work investigates the potential downsides of the mechanism driving this speedup. On the theoretical front, we analyze the learning dynamics of simplified Muon on deep linear networks and linear attention. Our analysis reveals that Muon gains speed by avoiding saddle points, but does so at the expense of the s

Source: arXiv cs.LG — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.