MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

arXiv:2606.06058v1 (announce type: cross). Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sam
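This research addresses a stability problem in reinforcement learning: standard GRPO becomes unstable when rewards are discrete and low-dispersion, as they are in multi-constraint instruction following. The problem grows more pressing as AI systems become more complex and are expected to follow instructions robustly across domains.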
Improved stability in multi-constraint instruction following is essential for the reliable deployment of advanced AI agents in real-world applications where precise adherence to rules is paramount.
The proposed MDP-GRPO method offers a more stable and effective approach to training AI agents for tasks involving discrete, low-dispersion rewards, overcoming previous limitations of standard GRPO.
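To make the three pathologies named in the abstract concrete, the sketch below applies plain z-score group normalization to small reward groups. It is an illustration only: the function name `group_zscore_advantages`, the `eps` value, the example rewards, and the reading of each pathology from its name are assumptions, not the paper's formal definitions, and the code does not implement MDP-GRPO itself.

```python
from statistics import mean, pstdev


def group_zscore_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: (r - group mean) / (group std + eps).

    `eps` is a hypothetical stabilizer value chosen for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward groups: each reward is the fraction of constraints met.
wide_gap = [0.0, 0.0, 0.0, 1.0]       # one response is far better than the rest
narrow_gap = [0.75, 0.75, 0.75, 1.0]  # one response is only slightly better
low_quality = [0.0, 0.0, 0.0, 0.25]   # same spread as narrow_gap, shifted down
all_equal = [1.0, 1.0, 1.0, 1.0]      # homogeneous group

# Low-variance amplification (assumed reading): dividing by a small std
# rescales a small reward gap to the same advantage size as a large gap.
print("wide gap   :", group_zscore_advantages(wide_gap))
print("narrow gap :", group_zscore_advantages(narrow_gap))

# Mean-centering blindness (assumed reading): subtracting the group mean
# discards absolute quality, so a weak group and a strong group with the
# same spread receive the same advantages.
print("low quality:", group_zscore_advantages(low_quality))

# Zero-variance collapse (assumed reading): identical rewards give all-zero
# advantages, so the group contributes no learning signal.
print("all equal  :", group_zscore_advantages(all_equal))
```

In this toy setting the first three groups produce nearly identical advantage vectors even though they differ in both reward gap and absolute quality, which is one way to see why homogeneous, discrete rewards are a hard case for this normalization.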
- AI developers
- Robotics companies
- Industries adopting autonomous agents
- Research institutions
- Developers relying on unstable RL methods
- Applications with high failure tolerances
More effective and reliable AI agent training for complex, constrained tasks.
Accelerated development and adoption of AI agents in critical infrastructure and high-stakes environments.
Enhanced trust and integration of autonomous AI systems into daily operations and public life.
Read at arXiv cs.CL