arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with \emph{non-observable states} and possibly \emph{constrained} decision epochs. The focus is restricted to a ``pure'' regret benchmark, that compares the performance of the learning algorithm to the best \emph{pure policy} which -- akin to optimal policies of stochastic bandits -- picks the optimal arm from start to finish without ever switching. We introduce a generalization of rested Markovian bandits, \emph{self-degrading Markovian bandits}, for which pure policies a
Source: arXiv cs.LG — read the full report at the original publisher.
