arXiv:2603.10250v2 Announce Type: replace Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over samples from the behavior policy, which often induces an overgreedy policy and fails to utilize feedback from negative samples. In this work, we introduce GeMPO, a simple and unified framework that generalizes reweighting scheme in diffusion RL from softmax to general monotonic functions. GeMPO revisits diffusion RL via a measure matching perspective: First, we construct a virtual target policy measure via solving a regularized policy optimization
Source: arXiv cs.LG — read the full report at the original publisher.
