Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training

arXiv:2602.16065v2. Abstract: As artificial intelligence (AI)-generated content proliferates, models are increasingly trained on their own outputs, risking progressive degradation or collapse. In this article, we provide the first positive, rigorous theoretical results, to the best of our knowledge, showing that under model-agnostic mild conditions, the model converges to the true data-generating distribution. The convergence rate is the minimum of the model's intrinsic rate and the fraction of real data at each training iteration, revealing a phase transition between dat
The proliferation of AI-generated content and the increasing reliance on recursive training necessitate a theoretical understanding of model stability and convergence to safeguard against degradation.
This research provides theoretical guarantees for the reliability of generative AI systems, directly addressing model collapse caused by training on self-generated data.
The findings offer a pathway to design and train more robust generative AI models, potentially mitigating the risk of their quality deteriorating over time through recursive training on their own outputs.
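To make the setting concrete, here is a minimal toy sketch of recursive training on a mix of real and self-generated data. It is not the paper's model, proof, or experiment: the 1-D Gaussian model, the function name `simulate`, and the parameters `real_fraction`, `n`, and `generations` are all assumptions chosen for illustration. Each generation refits a Gaussian on a dataset in which a fixed fraction of points are fresh real samples and the rest are drawn from the previous generation's fit.

```python
import math
import random


def fit_gaussian(data):
    """Maximum-likelihood mean and standard deviation of a 1-D sample."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)


def simulate(real_fraction, n=50, generations=1000, seed=0):
    """Toy contaminated recursive training (hypothetical setup, not the paper's).

    The true distribution is N(0, 1). Every generation is fitted on n points:
    round(real_fraction * n) fresh real draws, and the remainder sampled from
    the model fitted in the previous generation.
    """
    rng = random.Random(seed)
    n_real = round(real_fraction * n)

    # Generation 0 sees real data only.
    mu, sigma = fit_gaussian([rng.gauss(0.0, 1.0) for _ in range(n)])

    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        mu, sigma = fit_gaussian(real + synthetic)
    return mu, sigma


if __name__ == "__main__":
    for fraction in (0.0, 0.5):
        mu, sigma = simulate(fraction)
        print(f"real_fraction={fraction}: mu={mu:.3f}, sigma={sigma:.3g}")
```

In this toy, `simulate(0.0)` trains purely on its own outputs and the fitted standard deviation typically shrinks toward zero, while `simulate(0.5)` keeps the fit close to the true distribution. It only illustrates the qualitative contrast; it says nothing about the convergence rates stated in the abstract.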
- AI developers
- Generative AI platforms
- Data scientists
- AI-reliant industries
- Platforms with weak data governance
- Low-quality generative AI models
- Data brokers of synthetic data
- Unsupervised AI training methods
More resilient and trustworthy generative AI models may emerge, able to train recursively on their own outputs without catastrophic degradation as long as enough real data is retained.
This improved reliability could accelerate the adoption of generative AI across mission-critical applications and autonomous systems.
The enhanced foundational stability of AI might enable more rapid, cascading advancements in AI capabilities previously constrained by data quality concerns, including agentic systems.
Read at arXiv cs.LG