Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

arXiv:2605.23182v1. Abstract: Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $\mu_0$, GPI only requires identifying a policy with expected reward in an episode at least $\mu_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative in

Source: arXiv cs.LG — read the full report at the original publisher.

