AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

arXiv:2606.03347v1 (announce type: new)

Abstract: Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoisin
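The idea of separating conditioning from supervision can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is cut off before the second step is fully described, so the masked loss below is an assumption based on the title's mention of masking. The helper names (`augment_missing`, `masked_denoising_loss`) are hypothetical, and the per-column resampling is a simple stand-in for the paper's lightweight auxiliary models, whose actual form is not given here.

```python
import numpy as np


def augment_missing(x, rng):
    """Fill NaN cells with random draws from the same column's observed values.

    Assumption: this marginal resampling stands in for the conditional
    auxiliary models mentioned in the abstract. Columns with no observed
    values are left as NaN.
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)  # True where the original value is present
    filled = x.copy()
    for j in range(x.shape[1]):
        obs_vals = x[observed[:, j], j]
        n_missing = int((~observed[:, j]).sum())
        if n_missing and obs_vals.size:
            # A fresh random draw per call, so the fill is stochastic
            filled[~observed[:, j], j] = rng.choice(obs_vals, size=n_missing)
    return filled, observed


def masked_denoising_loss(pred_noise, true_noise, observed):
    """Mean squared error computed over originally observed cells only."""
    sq_err = (np.asarray(pred_noise) - np.asarray(true_noise)) ** 2
    n_obs = max(int(np.sum(observed)), 1)
    return float(np.where(observed, sq_err, 0.0).sum() / n_obs)


# Toy usage: the backbone sees a fully specified input,
# but only observed cells contribute to the loss.
rng = np.random.default_rng(0)
x = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0]])
filled, observed = augment_missing(x, rng)
noise = rng.standard_normal(x.shape)
noisy_input = filled + 0.5 * noise   # 0.5 is an arbitrary noise level
pred = np.zeros_like(noise)          # placeholder for a denoiser's output
loss = masked_denoising_loss(pred, noise, observed)
```

In this sketch the augmented values only serve as inputs (conditioning), while errors on cells that were originally missing never reach the loss (supervision).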
As diffusion models grow more capable and more widely used, methods for handling imperfect real-world data become necessary, which makes this work timely.
This development addresses a fundamental limitation in applying powerful generative models to common, incomplete tabular datasets, opening new avenues for data synthesis, imputation, and analysis across various domains.
The ability to train diffusion models effectively on incomplete tabular data makes these advanced AI techniques more accessible and robust for real-world applications where data quality is often imperfect.
- AI researchers and developers
- Data scientists
- Industries relying on tabular data (finance, healthcare, retail)
- Traditional data imputation methods
- AI models requiring perfectly clean data
Diffusion models could be trained directly on real-world tabular datasets that contain missing values, reducing the need for extensive manual data cleaning.
Improved synthetic data generation from incomplete sources could accelerate AI development and privacy-preserving data sharing.
Generative models that tolerate imperfect data could, over time, change how data-driven decisions are made and how automated insights are produced.
Read at arXiv cs.LG