DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar


Open Access

Publication year: 2024

Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the analytical reasons with explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.
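The sentence-to-paragraph decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `split_sentences`, `toy_judge`, `divide_conquer_evaluate`, and `to_score` are hypothetical. The judge here is a toy word-overlap check standing in for the LLM-based evaluator, and scoring by the fraction of sentences judged consistent is an assumption, since the abstract does not state how AMC computes its score.

```python
import re
from typing import Callable, List, Tuple

# (sentence, is_consistent, reason) for one sentence-to-paragraph comparison.
Verdict = Tuple[str, bool, str]


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def toy_judge(sentence: str, reference: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for an LLM judge (assumption, not the paper's).

    Marks a sentence inconsistent if it uses any word absent from the
    reference paragraph, and returns a short reason either way.
    """
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ref_words = set(re.findall(r"[a-z0-9]+", reference.lower()))
    missing = sorted(words - ref_words)
    if missing:
        return False, "terms not found in reference: " + ", ".join(missing)
    return True, "all terms appear in reference"


def divide_conquer_evaluate(
    candidate: str,
    reference: str,
    judge: Callable[[str, str], Tuple[bool, str]] = toy_judge,
) -> List[Verdict]:
    """Divide the candidate into sentences; judge each against the whole reference."""
    verdicts = []
    for sentence in split_sentences(candidate):
        ok, reason = judge(sentence, reference)
        verdicts.append((sentence, ok, reason))
    return verdicts


def to_score(verdicts: List[Verdict]) -> float:
    """Assumed numeric conversion: fraction of sentences judged consistent."""
    if not verdicts:
        raise ValueError("no sentences to score")
    return sum(1 for _, ok, _ in verdicts if ok) / len(verdicts)


reference = "The patient has a fever. The patient was given aspirin."
candidate = "The patient has a fever. The patient was given ibuprofen."
verdicts = divide_conquer_evaluate(candidate, reference)
score = to_score(verdicts)
# Reasons for inconsistent sentences are what an improver step could consume.
reasons = [reason for _, ok, reason in verdicts if not ok]
```

In this sketch the per-sentence reasons are kept alongside the verdicts, so a later improvement step could pass them to a generator when rewriting the candidate. Replacing `toy_judge` with a real model call only requires a callable with the same signature.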

Language(s): English
