How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

arXiv:2605.28302v1. Abstract: Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogenei
The increasing scale and complexity of LLMs, especially MoE models, are pushing existing inference serving architectures to their limits, necessitating innovative disaggregation techniques.
Efficient serving of large language models, particularly MoE architectures, is critical for scaling AI applications and managing the compute costs associated with advanced AI capabilities.
New architectural approaches such as Attention-FFN Disaggregation (AFD) are emerging. They split LLM inference at the operator level so that memory-bound attention and compute-intensive expert FFNs can each be provisioned for their own resource demands.
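To make the "different computational demands" concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte read) for the two operators during decode. It is not taken from the paper. The function names and the cost model are illustrative assumptions: plain multi-head attention with no KV sharing, 16-bit weights and cache, and activation traffic ignored.

```python
def attention_decode_intensity(context_len: int, head_dim: int,
                               bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte for one decode step of one attention head.

    Assumption: the new token's query attends over `context_len` cached
    keys and values; each request reads its own KV cache, so batching
    does not amortize these reads.
    """
    # QK^T and AV each cost about 2 * context_len * head_dim FLOPs.
    flops = 4 * context_len * head_dim
    # The K and V caches are each read once.
    bytes_read = 2 * context_len * head_dim * bytes_per_elem
    return flops / bytes_read


def ffn_intensity(tokens: int, d_model: int, d_ff: int,
                  bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte for one FFN weight matrix.

    Assumption: the weight matrix is read once and reused by every one
    of the `tokens` tokens routed to it; activation traffic is ignored.
    """
    flops = 2 * tokens * d_model * d_ff
    weight_bytes = d_model * d_ff * bytes_per_elem
    return flops / weight_bytes


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print(attention_decode_intensity(context_len=4096, head_dim=128))
    for tokens in (1, 16, 64, 256):
        print(tokens, ffn_intensity(tokens, d_model=4096, d_ff=14336))
```

Under these assumptions, attention intensity stays constant regardless of context length or batch size, while FFN intensity grows linearly with the number of tokens routed to the same weights. That gap is one way to see why the two operators favor different hardware and batching strategies.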
- Cloud AI providers
- Hyperscalers
- LLM developers
- Specialized AI hardware manufacturers
- Generic server architectures
- Inefficient LLM serving techniques
Improved performance and cost-efficiency for hosting and operating frontier LLMs.
Accelerated deployment and broader accessibility of highly capable AI models due to lower operational costs.
Increased demand for specialized AI hardware optimized for disaggregated workloads, further diversifying the AI compute supply chain.
Read at arXiv cs.LG