Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
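
The abstract does not give the training objective, so the following is only a minimal sketch of how a Product-of-Experts combination could look for pairwise reward modeling. It assumes a Bradley-Terry preference model, in which each expert outputs a reward margin (reward of the chosen response minus reward of the rejected one). The function names, the length-only biased expert, its `weight` parameter, and the Gaussian-noise perturbation are all hypothetical illustrations, not the paper's implementation.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def poe_preference_prob(main_margin, bias_margin):
    """Normalized product of two Bradley-Terry experts.

    Each margin is reward(chosen) - reward(rejected) under one expert.
    The product of the two experts' probabilities is renormalized over
    the two outcomes (chosen preferred vs. rejected preferred).
    """
    agree = sigmoid(main_margin) * sigmoid(bias_margin)
    disagree = sigmoid(-main_margin) * sigmoid(-bias_margin)
    return agree / (agree + disagree)


def poe_loss(main_margin, bias_margin):
    """Negative log-likelihood of the human preference under the product."""
    return -math.log(poe_preference_prob(main_margin, bias_margin))


def length_only_margin(chosen_len, rejected_len, weight=0.01):
    """Hypothetical biased expert that sees only response lengths."""
    return weight * (chosen_len - rejected_len)


def perturb(features, scale, rng):
    """Hypothetical perturbation: add Gaussian noise to the biased expert's
    input features so that it has less semantic signal to rely on."""
    return [x + rng.gauss(0.0, scale) for x in features]


if __name__ == "__main__":
    main_margin = 0.5                            # main expert slightly prefers chosen
    bias_margin = length_only_margin(300, 100)   # chosen is 200 tokens longer
    print("main expert alone:", sigmoid(main_margin))
    print("product of experts:", poe_preference_prob(main_margin, bias_margin))
    print("PoE loss:", poe_loss(main_margin, bias_margin))
    print("perturbed features:", perturb([0.2, -0.1, 0.7], 0.5, random.Random(0)))
```

Under these assumptions the normalized product reduces to `sigmoid(main_margin + bias_margin)`, so the two experts' margins simply add. When the biased expert already explains a preference through length alone, the loss and its gradient with respect to the main expert's margin are small, which leaves the main expert less incentive to encode length. In this style of debiasing, only the main expert would be kept as the reward model after training.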

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback