While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moment of reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
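
As a quick sanity check on the exponents, the rates stated above can be specialised to the two ends of the range $\epsilon\in(0,1]$. This is only arithmetic on the bounds as quoted in the abstract, not a result taken from the paper's analysis. With $\epsilon = 1$ (finite variance), $T^{\frac{1-\epsilon}{2(1+\epsilon)}} = T^{0} = 1$ and $K^{\frac{1}{1+\epsilon}} = \sqrt{K}$, so the bandit upper bound, the RL upper bound, and the RL lower bound become, respectively,

\[
\tilde{O}\Big(d \sqrt{\textstyle\sum_{t=1}^T \nu_t^2} + d\Big),
\qquad
\tilde{O}\big(d \sqrt{H \mathcal{U}^* K} + d \sqrt{H \mathcal{V}^* K}\big),
\qquad
\Omega\big(d H \sqrt{K} + d \sqrt{H^3 K}\big).
\]

At the other end, as $\epsilon \to 0$ the exponents satisfy $\frac{1-\epsilon}{2(1+\epsilon)} \to \frac{1}{2}$ and $\frac{1}{1+\epsilon} \to 1$, so the factor $T^{\frac{1-\epsilon}{2(1+\epsilon)}}$ approaches $\sqrt{T}$ and the term $K^{\frac{1}{1+\epsilon}}$ approaches linear growth in $K$.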

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds