
arXiv:2603.02376v2. Abstract: Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance s
The increasing scale and complexity of distributed LLM training call for more efficient compute-communication co-design, which manual tuning struggles to deliver.
Automating the co-design of compute and communication for LLMs can significantly reduce training and inference costs and accelerate AI development by improving hardware utilization and performance.
Reliance on deep systems expertise for optimizing LLM infrastructure may give way to agentic frameworks, potentially democratizing high-performance AI deployment.
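The fast-path/slow-path split named in the abstract can be pictured with a toy search loop. The sketch below is not CUCo's implementation: the design space (a chunk count plus an overlap flag), the cost model, and every function name are hypothetical, chosen only to show a correctness-first baseline followed by an evolutionary refinement that never accepts an incorrect candidate.

```python
import random

# Hypothetical toy design space: a candidate is (chunks, overlap).
# Nothing here comes from CUCo; it only illustrates the two-phase idea.


def reference(data):
    """Ground-truth result every candidate must reproduce."""
    return sum(data)


def run_candidate(candidate, data):
    """Sum `data` in `chunks` equal pieces. Chunk counts that do not divide
    the input silently drop the tail, standing in for a buggy kernel."""
    chunks, _overlap = candidate
    size = len(data) // chunks
    return sum(sum(data[i * size:(i + 1) * size]) for i in range(chunks))


def is_correct(candidate, data):
    return run_candidate(candidate, data) == reference(data)


def modeled_cost(candidate, compute=8.0, comm=6.0, launch_overhead=0.25):
    """Assumed cost model: a two-stage pipeline when overlap is on,
    plain compute + communication otherwise, plus per-chunk overhead."""
    chunks, overlap = candidate
    c, m = compute / chunks, comm / chunks
    if overlap:
        pipeline = c + m + (chunks - 1) * max(c, m)
    else:
        pipeline = compute + comm
    return pipeline + launch_overhead * chunks


def fast_path(candidates, data):
    """Correctness first: return the first candidate that passes the check."""
    for candidate in candidates:
        if is_correct(candidate, data):
            return candidate
    raise ValueError("no correct baseline found")


def mutate(candidate, rng):
    """Change one design decision at random."""
    chunks, overlap = candidate
    if rng.random() < 0.5:
        chunks = max(1, chunks + rng.choice([-1, 1]))
    else:
        overlap = not overlap
    return (chunks, overlap)


def slow_path(seed, data, generations=50, children=4, rng=None):
    """Simple (1+lambda)-style evolution: keep a child only if it is correct
    and no slower than the current best (ties allow neutral moves)."""
    rng = rng or random.Random(0)
    best = seed
    for _ in range(generations):
        for _ in range(children):
            child = mutate(best, rng)
            if is_correct(child, data) and modeled_cost(child) <= modeled_cost(best):
                best = child
    return best


if __name__ == "__main__":
    data = list(range(12))
    baseline = fast_path([(5, False), (1, False)], data)
    tuned = slow_path(baseline, data)
    print("baseline", baseline, modeled_cost(baseline))
    print("tuned", tuned, modeled_cost(tuned))
```

In this toy, the correctness check gates every candidate, so the refinement stage can only trade speed against speed and never against correctness. That is the property the "correctness-first" wording suggests.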
- AI developers
- Cloud providers
- HPC hardware manufacturers
- AI infrastructure software vendors
- Manual optimization experts
- Less agile AI infrastructure solution providers
Faster and cheaper development of large language models and other distributed AI systems.
Increased accessibility to state-of-the-art AI capabilities for a broader range of organizations due to reduced operational overhead.
Accelerated innovation in AI models as the compute barrier to experimentation lowers, leading to new applications and capabilities.
Read at arXiv cs.LG