We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy, which is the one most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves a regret of $\tilde{O}(\operatorname{poly}(dH)\sqrt{K})$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde{O}(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
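To make the feedback model concrete, the following is a minimal sketch of how preference feedback over a trajectory pair could be simulated. It is not the paper's algorithm. The abstract does not specify the preference model, so the logistic (Bradley-Terry style) link used here is an assumption, and the function names `trajectory_return`, `preference_probability`, and `sample_preference` are hypothetical. The per-step rewards exist only inside the simulated overseer; the agent would observe just the binary preference.

```python
import math
import random


def trajectory_return(rewards):
    """Sum of the hidden per-step rewards along one trajectory of length H."""
    return sum(rewards)


def preference_probability(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over trajectory B.

    Assumption: a logistic (Bradley-Terry style) link on the difference of
    returns. The abstract does not state which link the paper uses.
    """
    diff = trajectory_return(rewards_a) - trajectory_return(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))


def sample_preference(rewards_a, rewards_b, rng):
    """Simulated overseer feedback: 1 if A is preferred, 0 if B is preferred."""
    return 1 if rng.random() < preference_probability(rewards_a, rewards_b) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    traj_a = [1.0, 0.5, 0.0]  # hidden rewards, never shown to the agent
    traj_b = [0.0, 0.5, 0.0]
    print("P(A preferred):", preference_probability(traj_a, traj_b))
    print("sampled feedback:", sample_preference(traj_a, traj_b, rng))
```

Under this assumed link, two trajectories with equal return are preferred with probability 0.5 each, and the preference becomes nearly deterministic as the gap in returns grows.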

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation