Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding

arXiv:2606.30389v1 Announce Type: new Abstract: Dynamic sparse attention (DSA) accelerates long-context LLM decoding by attending to only the top-K KV blocks relevant to each query, but it introduces a serialized selection-to-attention dependency that emerges as a new latency bottleneck. We present PRR, a speculate-reuse-repair runtime that exploits temporal locality in DSA selections to predict likely blocks, speculate the attention over them while selection is in flight, and incrementally repair missed blocks once the true selected set is known. PRR uses a lightweight EMA-based predictor, a

Source: arXiv cs.LG — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.

Stay ahead of the systems reshaping markets.