Discovery of Language Resources on the Web: Information Extraction from Heterogeneous Documents
Author(s) - Vladimír Pekar, Richard Evans
Publication year - 2007
Publication title - Literary and Linguistic Computing
Language(s) - English
Resource type - Journals
eISSN - 1477-4615
pISSN - 0268-1145
DOI - 10.1093/llc/fqm010
Subject(s) - computer science, information extraction, web crawler, information retrieval, web page, coreference, terminology, task (project management), domain (mathematical analysis), named entity recognition, world wide web, focus (optics), field (mathematics), natural language processing, artificial intelligence, resolution (logic), mathematical analysis, linguistics, philosophy, physics, mathematics, management, pure mathematics, optics, economics
The present article is concerned with the problem of automatic database population via information extraction (IE) from web pages obtained from heterogeneous sources, such as those retrieved by a domain crawler. Specifically, we address the task of filling single multi-field templates from individual documents, a common scenario that involves free-format documents with the same communicative goal such as job adverts, CVs, or meeting/seminar announcements. We discuss challenges that arise in this scenario and propose solutions to them at different levels of the processing of web page content. Our main focus is on the issue of information extraction, which we address with a two-step machine learning approach that first aims to determine segments of a page that are likely to contain relevant facts and then delimits specific natural language expressions with which to fill template fields. We also present a range of techniques for the enrichment of web pages with semantic annotations, such as recognition of named entities, domain terminology and coreference resolution, and examine their effect on the information extraction method. We evaluate the developed IE system on the task of automatically populating a database with information on language resources available on the web.
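The two-step approach described in the abstract (first select page segments likely to contain relevant facts, then delimit the expressions that fill template fields) can be illustrated with a minimal sketch. This is not the authors' system: the abstract describes learned models for both steps, whereas the sketch below substitutes a keyword-ratio score and regular expressions purely to show the control flow. The cue words, field names, patterns, and threshold are all hypothetical.

```python
import re

# Hypothetical cue words. The article learns segment relevance from data;
# a fixed word list is used here only as a stand-in.
CUE_WORDS = {"corpus", "lexicon", "treebank", "download", "licence", "license"}

# Hypothetical template fields and patterns, standing in for the learned
# component that delimits field-filling expressions.
FIELD_PATTERNS = {
    "name": re.compile(r"\b((?:[A-Z][\w-]* )+(?:Corpus|Lexicon|Treebank))\b"),
    "language": re.compile(r"\b(English|German|French|Spanish|Czech)\b"),
}


def split_segments(page_text):
    """Split page text into segments separated by blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", page_text) if s.strip()]


def segment_score(segment):
    """Step 1 stand-in: fraction of tokens in the segment that are cue words."""
    tokens = re.findall(r"[a-z]+", segment.lower())
    if not tokens:
        return 0.0
    return sum(t in CUE_WORDS for t in tokens) / len(tokens)


def fill_template(page_text, threshold=0.05):
    """Fill one multi-field template from a single document.

    Segments scoring below the threshold are skipped (step 1); in the
    remaining segments, the first match for each field is kept (step 2).
    """
    template = {}
    for segment in split_segments(page_text):
        if segment_score(segment) < threshold:
            continue
        for field_name, pattern in FIELD_PATTERNS.items():
            if field_name not in template:
                match = pattern.search(segment)
                if match:
                    template[field_name] = match.group(1)
    return template


page = (
    "Welcome to our department home page. We teach many courses.\n\n"
    "You can download the Prague Dependency Treebank, a Czech corpus, here."
)
result = fill_template(page)
```

In this example the first segment contains no cue words and is skipped, so only the second segment is searched for field values. Restricting field extraction to pre-selected segments is the design point being illustrated; how segments are scored and how expressions are delimited would in practice come from trained models rather than hand-written rules.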