Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective in the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning. To enhance the uncertainty quantification abilities for reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations. Then we optimize policy models utilizing penalized rewards, determined by both rewards and uncertainties provided by the diverse reward LoRA ensembles. Our experimental results, based on two real human preference datasets, showcase the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. Additionally, uncertainty regularization in UP-RLHF proves to be pivotal in mitigating overoptimization, thereby contributing to the overall performance.

Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles
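
The abstract describes penalized rewards that combine the reward and the uncertainty reported by a reward-model ensemble, but it does not give the exact formula. The sketch below is one plausible reading, not the paper's method: it assumes the reward is the ensemble mean, the uncertainty is the ensemble standard deviation, and the two are combined as mean minus a weighted penalty. The function name `penalized_reward` and the weight `beta` are hypothetical, introduced here only for illustration.

```python
from statistics import mean, pstdev


def penalized_reward(ensemble_rewards, beta=1.0):
    """Return an uncertainty-penalized reward for one response.

    ensemble_rewards: scores for the same response from each ensemble member.
    beta: hypothetical penalty weight (an assumption, not from the paper).
    """
    if not ensemble_rewards:
        raise ValueError("ensemble_rewards must contain at least one score")
    # Assumed reward estimate: the average score across ensemble members.
    reward = mean(ensemble_rewards)
    # Assumed uncertainty estimate: disagreement among ensemble members.
    uncertainty = pstdev(ensemble_rewards)
    return reward - beta * uncertainty


# Two responses with the same mean reward but different ensemble agreement.
agreeing = [0.9, 1.0, 1.1]
disagreeing = [-1.0, 1.0, 3.0]

print(penalized_reward(agreeing))
print(penalized_reward(disagreeing))
```

Under these assumptions, a response on which the ensemble disagrees scores lower than one with the same mean reward and close agreement, which is the effect the abstract attributes to uncertainty regularization: the policy gains less from pushing into regions where the reward estimate is unreliable.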