Discovery of Language Resources on the Web: Information Extraction from Heterogeneous Documents

Open Access

Author(s) - Vladimír Pekar, Richard Evans

Publication year - 2007

Publication title - Literary and Linguistic Computing

Language(s) - English

Resource type - Journals

eISSN - 1477-4615

pISSN - 0268-1145

DOI - 10.1093/llc/fqm010

Subject(s) - computer science, information extraction, web crawler, information retrieval, web page, coreference, terminology, task (project management), domain (mathematical analysis), named entity recognition, world wide web, focus (optics), field (mathematics), natural language processing, artificial intelligence, resolution (logic), mathematical analysis, linguistics, philosophy, physics, mathematics, management, pure mathematics, optics, economics

The present article is concerned with the problem of automatic database population via information extraction (IE) from web pages obtained from heterogeneous sources, such as those retrieved by a domain crawler. Specifically, we address the task of filling single multi-field templates from individual documents, a common scenario that involves free-format documents with the same communicative goal such as job adverts, CVs, or meeting/seminar announcements. We discuss challenges that arise in this scenario and propose solutions to them at different levels of the processing of web page content. Our main focus is on the issue of information extraction, which we address with a two-step machine learning approach that first aims to determine segments of a page that are likely to contain relevant facts and then delimits specific natural language expressions with which to fill template fields. We also present a range of techniques for the enrichment of web pages with semantic annotations, such as recognition of named entities, domain terminology and coreference resolution, and examine their effect on the information extraction method. We evaluate the developed IE system on the task of automatically populating a database with information on language resources available on the web.
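
The two-step idea in the abstract (first select the page segments likely to contain relevant facts, then delimit the expressions that fill template fields) can be illustrated with a minimal sketch. This is not the authors' system: the abstract describes learned models for both steps, whereas the sketch below swaps in a hand-written cue-word score and regular expressions purely to show the control flow. The cue list, the field names (resource_type, language, size), the patterns, and the threshold are all hypothetical.

```python
import re

# ASSUMPTION: hand-picked cue words stand in for a trained segment classifier.
SEGMENT_CUES = {"corpus", "lexicon", "treebank", "language", "licence", "download"}

# ASSUMPTION: regular expressions stand in for a trained field extractor.
# The template fields themselves are hypothetical, not taken from the article.
FIELD_PATTERNS = {
    "resource_type": re.compile(r"\b(corpus|lexicon|treebank|dictionary)\b", re.I),
    "language": re.compile(r"\b(English|Czech|German|French|Spanish)\b"),
    "size": re.compile(
        r"\b(\d[\d,.]*\s*(?:million\s+)?(?:words|tokens|entries))\b", re.I
    ),
}


def split_segments(page_text):
    """Split page text into segments at blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", page_text) if s.strip()]


def score_segment(segment):
    """Return the fraction of alphabetic tokens that are cue words."""
    tokens = re.findall(r"[a-z]+", segment.lower())
    if not tokens:
        return 0.0
    return sum(t in SEGMENT_CUES for t in tokens) / len(tokens)


def select_segments(segments, threshold=0.05):
    """Step 1: keep only segments that look likely to hold relevant facts."""
    return [s for s in segments if score_segment(s) >= threshold]


def fill_template(segments):
    """Step 2: take the first matching expression for each template field."""
    template = {name: None for name in FIELD_PATTERNS}
    for segment in segments:
        for name, pattern in FIELD_PATTERNS.items():
            if template[name] is None:
                match = pattern.search(segment)
                if match:
                    template[name] = match.group(1)
    return template


def extract(page_text):
    """Run both steps on one document and return one filled template."""
    return fill_template(select_segments(split_segments(page_text)))
```

Restricting the second step to the segments chosen by the first keeps incidental mentions elsewhere on the page (for example, a language name in a navigation blurb) from being extracted as field values, which is one plausible motivation for a two-step design on free-format pages.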
