Cost effective ontology population with data from lists in OCRed historical documents



Open Access


Author(s) - Thomas Packer, David W. Embley

Publication year - 2013

Publication title - CiteSeerX (The Pennsylvania State University)

Language(s) - English

Resource type - Conference proceedings

DOI - 10.1145/2501115.2501132

Subject(s) - computer science, metric (unit), information retrieval, variety (cybernetics), artificial intelligence, ontology, natural language processing, field (mathematics), process (computing), information extraction, selection (genetic algorithm), sequence labeling, hidden markov model, machine learning, programming language, philosophy, operations management, mathematics, management, epistemology, pure mathematics, economics, task (project management)

A method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would contribute to making a variety of historical knowledge machine searchable, queryable, and linkable. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose ListReader, a wrapper-induction solution for information extraction that is specialized for lists in OCRed documents. ListReader can induce either a regular-expression grammar or a Hidden Markov Model. Each can infer list structure and field labels from OCR text. We decrease the cost and improve the accuracy of the induction process using semi-supervised machine learning and active learning, allowing induction of a wrapper from almost a single hand-labeled instance per field per list. After applying an induced wrapper, ListReader automatically maps the labeled text it produces to a rich variety of ontologically structured predicates. We evaluate our implementation on family history books in terms of the typical F-measure and a new metric, "Label Efficiency", which measures both extraction quality and cost in a single number. We show with statistical significance that ListReader reaches values closer to optimal levels than a state-of-the-art statistical sequence labeler.
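To make the idea of inducing a regular-expression wrapper from a single hand-labeled instance more concrete, here is a minimal Python sketch. It is not ListReader's algorithm, which the abstract does not detail. The function names (`induce_wrapper`, `extract`), the field labels, and the sample records are all hypothetical. The sketch generalizes each labeled token to a coarse character class, keeps unlabeled tokens as literal delimiters, and applies the resulting pattern to other records in the same list.

```python
import re

def generalize(token: str) -> str:
    """Map a labeled token to a coarse character-class pattern (an assumption)."""
    if token.isdigit():
        return r"\d+"
    if token.isalpha():
        return r"[A-Za-z]+"
    return re.escape(token)

def induce_wrapper(labeled_tokens):
    """Build a regex from one hand-labeled record.

    labeled_tokens is a list of (token, label) pairs. A label of None marks
    a delimiter, which is kept literally. Labels must be unique and must be
    valid Python identifiers, because they become named groups.
    """
    parts = []
    for token, label in labeled_tokens:
        if label is None:
            parts.append(re.escape(token))
        else:
            parts.append(f"(?P<{label}>{generalize(token)})")
    # Allow optional whitespace between any two tokens.
    return re.compile(r"\s*".join(parts))

def extract(wrapper, line: str):
    """Return a field dictionary if the whole line matches, else None."""
    match = wrapper.fullmatch(line.strip())
    return match.groupdict() if match else None

# One hypothetical hand-labeled record: "Smith, John, b. 1812"
EXAMPLE = [
    ("Smith", "surname"), (",", None),
    ("John", "given"), (",", None),
    ("b", None), (".", None),
    ("1812", "birth_year"),
]

wrapper = induce_wrapper(EXAMPLE)
print(extract(wrapper, "Brown, Mary, b. 1820"))
print(extract(wrapper, "Jones, Ann, b. l8S4"))  # OCR-damaged year
```

The second record fails to match because its year contains OCR errors. A rigid pattern like this one therefore needs the tolerance to OCR errors that the abstract calls for.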
