
arXiv:2605.01910v2 Announce Type: replace Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified and systematic sam
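The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain i.i.d. sampling with replacement rather than the stratified or systematic schemes the abstract mentions, and it computes the softmax densely. The function names (`softmax`, `exact_attention`, `sampled_attention`) are hypothetical. If index $I$ is drawn with probability $p_I$, then $\mathbb{E}[v_I] = \sum_i p_i v_i$, so the average of $S$ gathered value rows is an unbiased estimate of the dense result.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_attention(probs, values):
    """Dense value stage: sum_i p_i * v_i, one multiply-accumulate per element."""
    dim = len(values[0])
    out = [0.0] * dim
    for p, row in zip(probs, values):
        for d in range(dim):
            out[d] += p * row[d]
    return out


def sampled_attention(probs, values, num_samples, rng):
    """Sparse value stage (illustrative): draw indices i ~ p with replacement,
    then gather and add only those value rows. No per-row multiply is needed;
    the single division by num_samples happens once at the end."""
    dim = len(values[0])
    acc = [0.0] * dim
    indices = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    for i in indices:
        row = values[i]
        for d in range(dim):
            acc[d] += row[d]
    return [a / num_samples for a in acc]
```

In this sketch only `num_samples` value rows are touched per query, compared with all $n_k$ rows in `exact_attention`. Any single estimate is noisy, and its error shrinks as the sample count grows.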
Larger language models and longer context windows are straining memory bandwidth during decoding, which motivates methods that read less of the KV cache per generated token.
If the reported results hold in practice, the approach could make long-context inference more efficient on existing hardware.
Models could process longer contexts with less memory traffic per token, which may enable new applications and lower the cost of inference.
- AI model developers
- Cloud providers
- AI hardware manufacturers
- AI inference services
- Developers of less efficient inference solutions
Reduced operational costs for deploying large-scale AI models, particularly those with long context requirements.
Acceleration of research and development in AI architectures optimized for memory efficiency and throughput.
Broader accessibility and deployment of advanced AI capabilities due to lower resource demands, driving new market segments and applications.
Read at arXiv cs.LG