arXiv:2605.24357v1 Announce Type: new Abstract: In this paper, we study the role of the critic in actor--critic for entropy-regularized, finite, discounted environments. We establish that, when the critic is exact, using it as a baseline is a variance-reduction method in a strong sense. In this case, actor--critic with stochastic gradients matches the sample complexity of deterministic policy gradient, reaching an $\epsilon$-optimal regularized value with $\tilde{O}(\log(1/\epsilon))$ samples. In practice, the critic is learned alongside the actor: the variance of the actor update is t
Source: arXiv cs.LG — read the full report at the original publisher.
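
The abstract's claim that an exact critic, used as a baseline, reduces the variance of the actor update can be illustrated on a toy problem. The sketch below is not the paper's algorithm or setting: it assumes a single-state (bandit) problem, a softmax policy over logits `theta`, and an entropy temperature `tau`, and it computes the mean and variance of the score-function gradient estimator exactly by enumerating actions. All names (`softmax`, `estimator_stats`, the reward values) are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def estimator_stats(theta, rewards, tau, baseline):
    """Exact mean and total variance of the score-function estimator.

    For a sampled action a, the estimator is
        grad log pi(a) * (r(a) - tau * log pi(a) - baseline),
    with grad log pi(a) = e_a - pi for a softmax policy.
    Assumption: single-state problem, so no discounting is involved.
    """
    pi = softmax(theta)
    k = len(theta)
    score = np.eye(k) - pi                      # row a holds grad log pi(a)
    signal = rewards - tau * np.log(pi) - baseline
    samples = score * signal[:, None]           # row a: estimator value for action a
    mean = pi @ samples                         # expectation over a ~ pi
    second_moment = pi @ (samples ** 2)
    total_var = float((second_moment - mean ** 2).sum())  # trace of the covariance
    return mean, total_var


# Hypothetical toy values, chosen so that rewards share a large common offset.
theta = np.array([0.0, 0.5, -0.5])
rewards = np.array([10.0, 11.0, 12.0])
tau = 0.1

pi = softmax(theta)
# "Exact critic" for this toy problem: the entropy-regularized value of pi.
value = float(pi @ (rewards - tau * np.log(pi)))

mean_plain, var_plain = estimator_stats(theta, rewards, tau, baseline=0.0)
mean_base, var_base = estimator_stats(theta, rewards, tau, baseline=value)

print("means agree:", np.allclose(mean_plain, mean_base))
print("total variance without baseline:", var_plain)
print("total variance with exact-value baseline:", var_base)
```

In this toy case the baseline leaves the expected gradient unchanged, because the score function has zero mean under the policy, while removing the common reward offset from the estimator. Whether and how strongly this carries over to the discounted, multi-state setting is what the paper itself addresses.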
