
arXiv:2503.18893v2. Abstract: Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either require expensive pretraining or rely on per-token cross-layer cosine similarity that is often limited in practice. We show, via Centered Kernel Alignment (CKA), that the dominant singular vectors of KV-Cache are well aligned across layers. Motivated by this observation, we propose xKV, a post-training compression met
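The contrast the abstract draws between per-token cosine similarity and CKA can be illustrated with a minimal sketch. It uses the standard linear CKA formula on synthetic matrices, not real KV-Cache data and not the paper's code. The names `linear_cka`, `mean_token_cosine`, and the two "layers" built from a shared low-rank token basis are all hypothetical, chosen only for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (tokens, features) matrices."""
    # Center each feature column so mean offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_token_cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Average cosine similarity between matching rows (tokens)."""
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float(np.mean(num / den))


# Synthetic stand-ins for two layers (an assumption, not measured data):
# both are built from the same low-rank token basis but use independent
# random mixing matrices, so they share a subspace without matching
# token by token.
rng = np.random.default_rng(0)
tokens, rank, dim = 256, 4, 64
shared_basis = rng.standard_normal((tokens, rank))
layer_a = shared_basis @ rng.standard_normal((rank, dim))
layer_b = shared_basis @ rng.standard_normal((rank, dim))

print("mean per-token cosine:", mean_token_cosine(layer_a, layer_b))
print("linear CKA:", linear_cka(layer_a, layer_b))
```

In this toy setup the two matrices span the same token subspace, which is the kind of structure a subspace-level measure such as CKA is designed to detect and a row-by-row cosine comparison can miss. Whether real KV-Caches behave this way is the paper's empirical claim, not something this sketch demonstrates.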
As Large Language Models (LLMs) grow in size and context length, KV-Cache memory becomes a limiting cost, which makes KV-Cache compression a pressing problem.
This work targets the memory cost of the KV-Cache, a major bottleneck for long-context LLMs. Easing that bottleneck could accelerate the deployment of more capable AI agents.
The ability to significantly reduce KV-Cache memory consumption post-training without requiring expensive pretraining makes long-context LLMs more accessible and efficient to run.
- AI developers
- Cloud infrastructure providers
- LLM researchers
- Less efficient LLM architectures
- Companies reliant on expensive high-memory hardware
Reduced operational costs for running advanced LLMs, making them more affordable.
Faster development and deployment of sophisticated AI agents due to improved LLM efficiency.
Enhanced competition in the LLM space as smaller entities can run more powerful models on less extreme hardware.
Read at arXiv cs.LG