Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

arXiv:2511.18159v2. Abstract: Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) maskin
The paper addresses a significant challenge in training masked diffusion models (MDMs), a promising alternative to autoregressive models (ARMs): it offers a theoretical explanation of their high training variance, decomposing it into three sources, and sets out to reduce it.
Improving the stability and performance of MDMs could lead to more robust and powerful generative AI models, impacting various applications from content generation to scientific discovery.
This research provides a framework for optimizing MDM training, potentially closing the performance gap between MDMs and autoregressive models after task-specific training and making wider adoption of MDMs more practical.
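To make "training variance" concrete, the following is a minimal toy sketch, not the paper's method or decomposition. It assumes a hypothetical list of per-token losses (`TOKEN_LOSSES`) that do not depend on which other tokens are masked, and a mask ratio drawn uniformly from `[t_min, 1]`; the function names and `t_min` are likewise illustrative. Under those assumptions, an autoregressive-style loss scores every token once and is deterministic for a given sequence, while a masked-diffusion-style loss is a Monte Carlo estimate whose value changes with the sampled mask ratio and mask positions.

```python
import random
import statistics

# Hypothetical per-token negative log-likelihoods for one 8-token sequence.
# Simplifying assumption: they do not depend on which other tokens are masked.
TOKEN_LOSSES = [0.5, 1.0, 2.0, 0.2, 1.5, 0.8, 3.0, 0.1]


def arm_loss(token_losses):
    """Autoregressive-style loss: every token is scored once, no sampling."""
    return sum(token_losses) / len(token_losses)


def mdm_loss_sample(token_losses, rng, t_min=0.1):
    """One Monte Carlo sample of a masked-diffusion-style loss.

    Draw a mask ratio t, mask each token independently with probability t,
    and reweight the masked-token losses by 1/t so the estimate stays
    unbiased under the fixed-loss assumption above.
    """
    t = rng.uniform(t_min, 1.0)
    masked_sum = sum(loss for loss in token_losses if rng.random() < t)
    return masked_sum / (t * len(token_losses))


def compare(num_samples=20000, seed=0):
    """Return the deterministic ARM loss and the mean/variance of MDM samples."""
    rng = random.Random(seed)
    samples = [mdm_loss_sample(TOKEN_LOSSES, rng) for _ in range(num_samples)]
    return {
        "arm_loss": arm_loss(TOKEN_LOSSES),
        "mdm_mean": statistics.fmean(samples),
        "mdm_variance": statistics.pvariance(samples),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting the masked estimate averages out to the same value as the autoregressive-style loss, but any single sample can land far from it. That spread is the kind of noise that carries over into gradient estimates during training.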
- AI researchers
- Generative AI developers
- Cloud compute providers
- Developers reliant solely on autoregressive models
Increased research and development into masked diffusion models due to improved training stability.
New applications and capabilities emerge from more powerful and stable generative AI models.
Accelerated development of AI agents capable of complex creative and problem-solving tasks.
Read at arXiv cs.LG