Automatically Assessing Machine Summary Content Without a Gold Standard

Annie Louis; Ani Nenkova

Open Access

Author(s) - Annie Louis, Ani Nenkova

Publication year - 2012

Publication title - Computational Linguistics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.314

H-Index - 98

eISSN - 1530-9312

pISSN - 0891-2017

DOI - 10.1162/coli_a_00123

Subject(s) - computer science, gold standard (test), replicate, set (abstract data type), similarity (geometry), measure (data warehouse), information retrieval, quality (philosophy), data mining, machine learning, artificial intelligence, protocol (science), statistics, medicine, philosophy, alternative medicine, mathematics, epistemology, pathology, image (mathematics), programming language

The most widely adopted approaches for evaluation of summary content follow some protocol for comparing a summary with gold-standard human summaries, which are traditionally called model summaries. This evaluation paradigm falls short when human summaries are not available and becomes less accurate when only a single model is available. We propose three novel evaluation techniques. Two of them are model-free and do not rely on a gold standard for the assessment. The third technique improves standard automatic evaluations by expanding the set of available model summaries with chosen system summaries. We show that quantifying the similarity between the source text and its summary with appropriately chosen measures produces summary scores which replicate human assessments accurately. We also explore ways of increasing evaluation quality when only one human model summary is available as a gold standard. We introduce pseudomodels, which are system summaries deemed to contain good content according to automatic evaluation. Combining the pseudomodels with the single human model to form the gold standard leads to higher correlations with human judgments compared to using only the one available model. Finally, we explore the feasibility of another measure: similarity between a system summary and the pool of all other system summaries for the same input. This method of comparison with the consensus of systems produces impressively accurate rankings of system summaries, achieving correlation with human rankings above 0.9.
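
The two model-free ideas in the abstract (scoring a summary by its similarity to the source text, and by its similarity to the pool of all other system summaries) can be illustrated with a small sketch. The abstract does not say which similarity measures are used, so the choice of Jensen-Shannon divergence over smoothed unigram distributions is an assumption made here for illustration. The whitespace tokenizer, the smoothing constant, and every function name below are likewise hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def word_distribution(text, vocab, smoothing=0.005):
    """Smoothed unigram distribution of `text` over a fixed vocabulary.

    Whitespace tokenization and the smoothing value are illustrative assumptions.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two distributions over the same keys."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)

    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def divergence_from_reference(summary, reference):
    """Divergence between a summary and any reference text (lower means more similar)."""
    vocab = set(summary.lower().split()) | set(reference.lower().split())
    return js_divergence(
        word_distribution(summary, vocab),
        word_distribution(reference, vocab),
    )


def rank_by_input(source, summaries):
    """Model-free ranking: summaries closest to the source text come first."""
    return sorted(
        summaries,
        key=lambda name: divergence_from_reference(summaries[name], source),
    )


def rank_by_consensus(summaries):
    """Rank each summary by closeness to the pool of all *other* system summaries."""
    def pooled_others(name):
        return " ".join(text for n, text in summaries.items() if n != name)

    return sorted(
        summaries,
        key=lambda name: divergence_from_reference(summaries[name], pooled_others(name)),
    )
```

Both ranking functions take a dictionary mapping a system name to its summary text and return the names ordered from best to worst under the chosen measure. Neither needs a human-written model summary, which is the property the abstract emphasizes. The pseudomodel technique is not covered by this sketch.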
