arXiv:2506.00286v3 Announce Type: replace-cross Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Ba

Source: arXiv cs.AI — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.