Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings (offline, online, and hybrid) and propose efficient algorithms with finite-sample theoretical guarantees.

Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. These include an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiments with large language models demonstrate that the proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their potent practical implementations.
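
To make the "reverse-KL regularized" formulation named in the abstract more concrete, here is a minimal sketch, not the authors' code. It assumes a single prompt with a small finite set of candidate responses, a known reward for each response, and a KL coefficient called `eta`; the function names `gibbs_policy` and `dpo_loss` are hypothetical. The first function uses the standard closed form for maximising expected reward minus `eta` times the KL divergence to a reference policy. The second evaluates the standard DPO loss for one preference pair.

```python
import math


def gibbs_policy(ref_probs, rewards, eta):
    """Maximiser of E_pi[r] - eta * KL(pi || pi_ref) over a finite response set.

    The closed form is pi(a) proportional to pi_ref(a) * exp(r(a) / eta).
    ref_probs, rewards and eta are illustrative inputs (assumptions).
    """
    weights = [p * math.exp(r / eta) for p, r in zip(ref_probs, rewards)]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, eta):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    margin = eta * ((logp_chosen - ref_logp_chosen)
                    - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log sigmoid(margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Uniform reference policy over three responses with made-up rewards.
    policy = gibbs_policy([1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, -1.0], eta=0.5)
    print(policy)
    # A policy identical to the reference has zero margin.
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0, eta=0.5))
```

A smaller `eta` concentrates the resulting policy on high-reward responses, while a larger `eta` keeps it close to the reference policy. The rewards and probabilities in the usage lines are made up for illustration only.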