DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv:2605.15422v2. Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across se
Scaling RL post-training of large language models to long contexts and large rollouts exposes a compute and memory bottleneck in the policy update step.
Training efficiency, which methods like DualKV target, determines how scalable and affordable it is to develop advanced AI agents.
The research introduces a method that reduces compute and memory overhead in RL training by avoiding redundant computation on the shared prompt, which could shorten the development cycle for large-scale agentic systems.
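The invariance the abstract describes can be checked on a toy example. The sketch below is not the DualKV kernel: it is a single-head, forward-only illustration in plain Python, and every name in it (`causal_attention`, `replicated`, `shared_prompt`, `token_counts`) is hypothetical. It assumes only what the abstract states: under causal masking, prompt tokens never attend to response tokens, so their outputs can be computed once and their keys and values reused by all $N$ responses. The response length used in the token-count example ($R = 1024$) is an assumed value, not a figure from the paper.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total
            for d in range(dim)]


def causal_attention(q, k, v):
    """Token i attends only to positions 0..i."""
    return [attend(q[i], k[: i + 1], v[: i + 1]) for i in range(len(q))]


def replicated(prompt, responses):
    """Baseline: copy the prompt in front of every response."""
    outputs = []
    for resp in responses:
        q = prompt["q"] + resp["q"]
        k = prompt["k"] + resp["k"]
        v = prompt["v"] + resp["v"]
        outputs.append(causal_attention(q, k, v))
    return outputs


def shared_prompt(prompt, responses):
    """Compute prompt rows once; each response reuses the prompt K/V."""
    prompt_out = causal_attention(prompt["q"], prompt["k"], prompt["v"])
    outputs = []
    for resp in responses:
        resp_out = [
            attend(resp["q"][j],
                   prompt["k"] + resp["k"][: j + 1],
                   prompt["v"] + resp["v"][: j + 1])
            for j in range(len(resp["q"]))
        ]
        outputs.append(prompt_out + resp_out)
    return outputs


def token_counts(n, p, r):
    """Query rows processed: (replicated layout, shared-prompt layout)."""
    return n * (p + r), p + n * r


def random_block(rng, length, dim):
    """Random q/k/v vectors standing in for projected hidden states."""
    return {name: [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(length)]
            for name in ("q", "k", "v")}


rng = random.Random(0)
prompt = random_block(rng, length=5, dim=4)
responses = [random_block(rng, length=3, dim=4) for _ in range(3)]

baseline = replicated(prompt, responses)
shared = shared_prompt(prompt, responses)
max_diff = max(abs(x - y)
               for seq_a, seq_b in zip(baseline, shared)
               for row_a, row_b in zip(seq_a, seq_b)
               for x, y in zip(row_a, row_b))
print("max difference:", max_diff)
print("rows (replicated, shared):", token_counts(16, 8192, 1024))
```

Both layouts produce the same outputs, while the shared layout processes $P + NR$ rows instead of $N(P + R)$. With the assumed sizes $N = 16$, $P = 8192$, $R = 1024$, that is 24,576 rows instead of 147,456. The sketch counts rows only; it says nothing about the backward pass or about how a fused kernel would organize memory.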
- AI model developers
- Cloud computing providers (reduced cost for customers)
- AI research institutions
- Companies deploying advanced AI agents
- Inefficient AI training approaches
- Hardware not optimized for memory efficiency
Reduced compute costs and faster training times for large RL models.
Accelerated development and deployment of more sophisticated and capable AI agents across industries.
Potential for new classes of AI applications that were previously cost-prohibitive due to training demands, leading to broader economic impact.
Read at arXiv cs.LG