arXiv:2605.27877v1 Announce Type: new Abstract: Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored t

Source: arXiv cs.LG — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.