arXiv:2606.00724v1 Announce Type: new Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often face a dilemma where generation quality degrades drastically, where the core challenge lies in precisely and efficiently filtering critical tokens within

Source: arXiv cs.CL — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.