ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

arXiv:2605.11215v2. Abstract: Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework i
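The invariant stated in the abstract (a constant number of microbatches per iteration) can be illustrated with a small sketch. This is not ReCoVer's implementation, which the truncated abstract does not describe; it only assumes that when data-parallel replicas fail, the same global microbatch count is redistributed over the survivors. The function name `rebalance_microbatches` and the replica rank numbers are hypothetical.

```python
def rebalance_microbatches(total_microbatches, alive_replicas):
    """Split a fixed per-iteration microbatch count over surviving replicas.

    Hypothetical helper: the total never changes, only the per-replica share.
    """
    if not alive_replicas:
        raise ValueError("no surviving replicas to schedule on")
    # Even split, with any remainder given to the lowest-ranked survivors.
    base, extra = divmod(total_microbatches, len(alive_replicas))
    return {
        rank: base + (1 if i < extra else 0)
        for i, rank in enumerate(sorted(alive_replicas))
    }


# Failure-free run: 8 replicas share 32 microbatches per iteration.
healthy = rebalance_microbatches(32, [0, 1, 2, 3, 4, 5, 6, 7])

# Replicas 3 and 6 fail: the survivors absorb their share.
degraded = rebalance_microbatches(32, [0, 1, 2, 4, 5, 7])

# The invariant: the global microbatch count is unchanged.
assert sum(healthy.values()) == sum(degraded.values()) == 32
```

In this sketch a failure makes each surviving replica do more work per iteration, but the global batch that the gradient is computed from keeps the same size.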
The increasing scale and hardware demands of LLM pre-training have made hardware faults a critical bottleneck, necessitating more resilient system designs.
ReCoVer targets a fundamental reliability problem in training large AI models: keeping a run on its failure-free trajectory when hardware faults occur, which matters for reducing wasted compute.
If the approach holds up, LLM pre-training could become more robust and efficient, with fewer costly interruptions, which may in turn make it easier to train larger, more complex models.
- AI compute providers
- Large Language Model developers
- Data center operators
- AI research institutions
- Inefficient AI training systems
- Hardware manufacturers with high failure rates
More stable and faster training cycles for state-of-the-art AI models.
Reduced operational costs and resource consumption associated with failed or interrupted training runs.
Accelerated development and deployment of advanced AI applications, impacting various industries.
Read at arXiv cs.AI