CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

arXiv:2606.24506v1 (announce type: cross)
Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU
As services host more sparse Mixture-of-Experts (MoE) models, most of which receive few requests and stay cold, multi-LLM serving runs into a GPU memory problem: reserving worst-case KV-cache capacity for every model wastes memory.
Efficiently serving multiple large language models, especially 'cold' MoE models, is crucial for scaling AI services, reducing operational costs, and optimizing hardware utilization within the compute supply chain.
This research proposes disaggregating KV-cache from model weights so that GPU memory can be shared across models, with a shared KV-cache pool sized for aggregate active demand rather than each model's worst case.
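The provisioning argument in the abstract can be illustrated with a small calculation. The sketch below is not the paper's method; it only compares two sizing rules on made-up numbers. The model names, the demand traces, and the helper functions `per_model_reservation` and `shared_pool_requirement` are all hypothetical and chosen for illustration.

```python
from typing import Dict, List


def per_model_reservation(traces_gb: Dict[str, List[float]]) -> float:
    """Static sizing: each model reserves its own worst-case KV-cache demand."""
    return sum(max(trace) for trace in traces_gb.values())


def shared_pool_requirement(traces_gb: Dict[str, List[float]]) -> float:
    """Pooled sizing: the pool must cover the largest aggregate demand at any step."""
    # zip(*...) walks the traces step by step, giving one tuple of demands per step.
    return max(sum(step) for step in zip(*traces_gb.values()))


# Hypothetical KV-cache demand in GB for three cold models over five time steps.
# These values are invented; they are not measurements from the paper.
traces_gb = {
    "model_a": [0.0, 8.0, 0.0, 0.0, 1.0],
    "model_b": [0.0, 0.0, 6.0, 0.0, 0.0],
    "model_c": [2.0, 0.0, 0.0, 10.0, 0.0],
}

static_gb = per_model_reservation(traces_gb)    # 8 + 6 + 10 = 24 GB
pooled_gb = shared_pool_requirement(traces_gb)  # busiest step needs 10 GB

print(f"per-model worst case: {static_gb} GB, shared pool: {pooled_gb} GB")
```

Because the three hypothetical peaks never coincide, the pooled requirement equals the single largest peak, while static reservation pays for all three. When peaks do overlap, the two figures converge, so the saving depends on how rarely cold models are busy at the same time.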
- Cloud AI providers
- GPU manufacturers
- AI service companies
- Inefficient AI deployment strategies
- Monolithic GPU architectures
Reduced operational costs and increased throughput for hosting multiple LLMs.
Accelerated adoption of MoE models due to improved economic viability.
Enhanced competition in the AI services market as more models can be served affordably.
Read at arXiv cs.AI