Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

arXiv:2606.10493v1 Announce Type: cross Abstract: Local deployment of large Mixture-of-Experts (MoE) models falls short of the service quality achieved in cloud-scale environments, even under low-concurrency workloads. We identify four key gaps in local MoE inference: reliance on capacity-reduced models (quantized, distilled, rerouted), inability to meet 30-second TTFT for long prefills (more than 12K), sub-baseline decode throughput (under 20 tokens/s), and poor concurrency under mixed prefill-decode and batched decode workloads. We present a CPU-GPU hybrid system that achieves cloud-level SL
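The thresholds quoted in the abstract imply a minimum prefill rate, which a short calculation makes concrete. The sketch below is illustrative only and is not taken from the paper: it assumes "12K" means 12,000 tokens, treats TTFT as prefill time alone (ignoring queueing and scheduling overhead), and uses hypothetical names (`SloTargets`, `required_prefill_rate`, `meets_slo`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTargets:
    # Thresholds quoted in the abstract; treating them as hard limits is an assumption.
    max_ttft_s: float = 30.0            # time to first token, in seconds
    min_decode_tok_per_s: float = 20.0  # decode throughput floor


def required_prefill_rate(prompt_tokens: int, max_ttft_s: float) -> float:
    """Minimum prefill throughput (tokens/s) needed to finish prefill within the TTFT budget."""
    return prompt_tokens / max_ttft_s


def meets_slo(prompt_tokens: int,
              prefill_tok_per_s: float,
              decode_tok_per_s: float,
              targets: SloTargets = SloTargets()) -> bool:
    """Check both targets, approximating TTFT as prompt_tokens / prefill rate."""
    ttft_s = prompt_tokens / prefill_tok_per_s
    return (ttft_s <= targets.max_ttft_s
            and decode_tok_per_s >= targets.min_decode_tok_per_s)


if __name__ == "__main__":
    # Assumption: a "12K" prompt is 12,000 tokens.
    print(required_prefill_rate(12_000, 30.0))
    print(meets_slo(12_000, prefill_tok_per_s=300.0, decode_tok_per_s=25.0))
```

Under these assumptions, a 12,000-token prompt needs a sustained prefill rate of at least 400 tokens/s to stay inside a 30-second TTFT budget. A system that prefills at 300 tokens/s misses the target even if its decode rate clears 20 tokens/s.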
Growing MoE model complexity and demand for high-quality local inference are driving work on hybrid CPU-GPU architectures for AI deployments outside hyperscale clouds.
The work aims to make cloud-grade service level objectives (SLOs) achievable for large MoE models on local hardware, broadening access to state-of-the-art AI without exclusive reliance on cloud infrastructure.
Local deployments of large Mixture-of-Experts models could match the service quality previously available only in hyperscale cloud environments, which would affect model deployment strategies and hardware optimization.
- AI hardware manufacturers (CPU, GPU)
- Edge AI providers
- Enterprises deploying large AI models
- AI framework developers
- Cloud-exclusive AI inference providers (potentially, over time)
- Developers relying solely on cloud for high-performance MoE inference
- General-purpose hardware not optimized for hybrid AI inference
Improved performance and broader deployment of complex AI models in local or edge environments.
Increased demand for specialized hardware and integrated CPU-GPU solutions catering to hybrid AI workloads.
Potential acceleration of sovereign AI capabilities as nations and enterprises can achieve advanced AI performance without relying on external cloud providers.
Read at arXiv cs.LG