
arXiv:2605.02125v3 (announce type: replace-cross)
Abstract: Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation, which (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (ii
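The first two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract as indexed here is truncated and gives no formulas, so everything below is an assumption rather than FedQueue's actual method: the exponentially weighted moving average predictor, the rule for dropping over-stale buffered updates, and all names (`QueueDelayPredictor`, `local_step_budget`, `CutoffAdmitter`, `Update`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


class QueueDelayPredictor:
    """Online queue-delay estimate for one facility.

    Assumption: an exponentially weighted moving average stands in for
    whatever predictor the paper uses.
    """

    def __init__(self, alpha=0.5, initial=0.0):
        self.alpha = alpha
        self.estimate = initial

    def observe(self, delay_s):
        # Blend the newest observed delay into the running estimate.
        self.estimate = self.alpha * delay_s + (1 - self.alpha) * self.estimate
        return self.estimate


def local_step_budget(round_budget_s, predicted_delay_s, step_time_s, min_steps=1):
    """Number of local steps that fit in the time left after the predicted queue wait."""
    remaining = max(0.0, round_budget_s - predicted_delay_s)
    return max(min_steps, math.floor(remaining / step_time_s))


@dataclass
class Update:
    facility: str
    base_round: int       # global round whose model this update started from
    arrival_time: float   # seconds after the current round opened


@dataclass
class CutoffAdmitter:
    cutoff: float         # hypothetical per-round admission deadline, in seconds
    max_staleness: int    # hypothetical bound on update age, in rounds
    buffer: list = field(default_factory=list)

    def close_round(self, current_round, arrivals):
        """Return the updates admitted to this round's aggregation."""
        # Updates buffered in earlier rounds are reconsidered first;
        # any that exceed the staleness bound are dropped (an assumption).
        candidates, self.buffer = self.buffer, []
        admitted = [u for u in candidates
                    if current_round - u.base_round <= self.max_staleness]
        for u in arrivals:
            if u.arrival_time <= self.cutoff:
                admitted.append(u)
            else:
                self.buffer.append(u)  # late arrival: held for the next round
        return admitted
```

In this sketch, a facility whose predicted queue wait consumes most of the round budget is assigned fewer local steps, and an update that misses the cutoff is aggregated one round later rather than blocking the round or arriving arbitrarily stale.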
The increasing complexity and scale of AI models necessitate distributed training, pushing the boundaries of current federated learning implementations in high-performance computing environments.
Improving the efficiency and reliability of federated learning across multiple HPC facilities directly impacts the scalability and accessibility of advanced AI research and development.
The result would be the ability to run federated learning more effectively across distributed HPC resources, without severe performance bottlenecks from queueing and staleness.
- AI researchers
- HPC facility operators
- Organizations training large distributed AI models
- Distributed computing frameworks
- Synchronous FL implementations
- Asynchronous FL without staleness control
More efficient and scalable distributed AI model training becomes possible across geographically dispersed HPC resources.
This could accelerate the development of larger, more complex AI models and enable collaboration across institutional boundaries without centralizing data.
Institutions with limited individual compute power would face lower barriers to participating in frontier AI research, potentially democratizing access to advanced AI capabilities.
Read at arXiv cs.LG