
arXiv:2606.16899v1 Abstract: Matrix-based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3-style models up to 1.2B parameters, Muon Hyperball achieves 20–30% token equ
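The abstract describes the core mechanism only briefly: the Frobenius norms of each weight matrix and of its optimizer update are held at fixed constants. The pure-Python sketch below illustrates one way that idea could look, and it is not the paper's implementation. The function names (`frobenius_norm`, `rescale_to_norm`, `hyperball_step`), the constants `weight_norm` and `update_norm`, and the order of operations (normalize the update, subtract it, re-normalize the weights) are all assumptions made for illustration.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def rescale_to_norm(matrix, target_norm, eps=1e-12):
    """Return a copy of `matrix` scaled so its Frobenius norm is `target_norm`."""
    scale = target_norm / max(frobenius_norm(matrix), eps)
    return [[x * scale for x in row] for row in matrix]


def hyperball_step(weight, base_update, weight_norm, update_norm):
    """Hypothetical wrapper step (assumed order of operations).

    `base_update` stands in for whatever direction the base optimizer
    (e.g. Adam or Muon) proposes for this weight matrix.
    """
    # 1. Fix the Frobenius norm of the update to a constant.
    update = rescale_to_norm(base_update, update_norm)
    # 2. Apply the normalized update.
    stepped = [
        [w - u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]
    # 3. Fix the Frobenius norm of the weights to a constant.
    return rescale_to_norm(stepped, weight_norm)


# Toy example with made-up numbers.
weight = rescale_to_norm([[3.0, 0.0], [0.0, 4.0]], 1.0)
base_update = [[0.5, -0.5], [0.5, 0.5]]
new_weight = hyperball_step(weight, base_update, weight_norm=1.0, update_norm=0.1)
```

Under these assumptions the weight matrix stays on a sphere of fixed radius after every step, so the step size relative to the weights is set by the ratio `update_norm / weight_norm` rather than by a weight decay coefficient.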
Efficient large language model pretraining at scale depends on optimizers whose advantages persist as model size and data grow.
Improved pretraining optimizers directly impact the speed, cost, and feasibility of developing larger and more capable AI models, accelerating research and deployment.
The proposed Hyperball wrapper could make matrix-based optimizers such as Muon more consistent across model and data scales, countering the shrinking gains over AdamW observed under standard constant decoupled weight decay.
- AI research labs
- Cloud providers
- Large language model developers
- Hardware manufacturers (GPUs)
- Developers stuck with less efficient optimization methods
Faster and more cost-effective development of foundation models.
Increased competition among AI developers due to reduced barriers to training large models.
Acceleration of AI capabilities, potentially leading to more advanced applications emerging sooner.
Read at arXiv cs.LG