Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

arXiv:2605.28184v1 (announce type: new)

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a
This research addresses a known obstacle to integrating MTP into RL for LLMs: joint training has degraded performance, so current practice detaches MTP gradients.
Improved joint training methods for RL and MTP could unlock more sophisticated reasoning capabilities in large language models, leading to more capable AI systems.
The paper clarifies why previous joint training efforts failed, which could inform new architectures and training methods for advanced AI.
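For readers unfamiliar with the setup, the difference between detaching MTP gradients and training jointly can be illustrated with a toy scalar model. This is only a sketch under stated assumptions, not the paper's method: the quadratic losses, the function name `combined_grads`, and the parameters `theta` (shared trunk), `phi` (MTP head), and `coeff` (MTP loss weight) are all hypothetical, and the paper's coefficient calibration is not reproduced here.

```python
def combined_grads(theta, phi, x, a, b, coeff, detach):
    """Gradients of L = L_rl + coeff * L_mtp for a toy scalar model.

    Hypothetical setup (not from the paper):
      h     = theta * x           shared trunk output
      L_rl  = (h - a) ** 2        stand-in for the RL objective
      L_mtp = (phi * h - b) ** 2  stand-in for the MTP auxiliary loss

    With detach=True, h is treated as a constant inside L_mtp, so the
    MTP loss updates only the MTP head (phi), never the trunk (theta).
    """
    h = theta * x
    mtp_err = phi * h - b

    # Trunk gradient from the RL stand-in loss.
    g_theta = 2.0 * (h - a) * x

    # The MTP head always receives its own gradient.
    g_phi = coeff * 2.0 * mtp_err * h

    # Joint training also lets the MTP loss flow back into the trunk.
    if not detach:
        g_theta += coeff * 2.0 * mtp_err * phi * x

    return g_theta, g_phi


if __name__ == "__main__":
    args = dict(theta=1.0, phi=2.0, x=1.0, a=0.0, b=0.0, coeff=0.5)
    print("detached:", combined_grads(detach=True, **args))
    print("joint:   ", combined_grads(detach=False, **args))
```

In the detached case the trunk gradient comes from the RL term alone. In the joint case the MTP term adds a contribution scaled by `coeff`, which is the quantity a calibration scheme would have to choose.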
- AI researchers
- Large language model developers
- AI-driven product companies
- Companies relying on less sophisticated AI training methods
More efficient and powerful large language models could be developed.
Advanced LLMs could accelerate research and development in other AI subfields.
This could contribute to the broader development of highly autonomous and intelligent AI agents.
Read at arXiv cs.LG