From French Wikipedia to Erudit: A test case for cross‐domain open information extraction
Author(s) - Gotti Fabrizio, Langlais Philippe
Publication year - 2018
Publication title - Computational Intelligence
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.353
H-Index - 52
eISSN - 1467-8640
pISSN - 0824-7935
DOI - 10.1111/coin.12120
Subject(s) - computer science, classifier (uml), pipeline (software), information extraction, entity linking, information retrieval, open domain, domain (mathematical analysis), natural language processing, task (project management), artificial intelligence, named entity recognition, precision and recall, question answering, knowledge base, mathematics, programming language, mathematical analysis, management, economics
Abstract - In this paper, we describe an open information extraction pipeline based on ReVerb for extracting knowledge from French text. We put it to the test by using the extracted information triples to build an entity classifier, i.e., a system able to label a given instance with its type (e.g., Michel Foucault is a philosopher). The classifier requires little supervision. One novel aspect of this study is that we show how general‐domain information triples (extracted from French Wikipedia) can be used for deriving new knowledge from domain‐specific documents unrelated to Wikipedia, in our case scholarly articles focusing on the humanities. We believe that the present study is the first that focuses on such a cross‐domain, recall‐oriented approach in open information extraction. While our system's performance shows room for improvement, manual assessments show that the task is quite hard, even for a human, in part because of the cross‐domain aspect of the problem we tackle.
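To make the idea of typing entities from extracted triples concrete, here is a minimal Python sketch. It is not the authors' classifier: the triples, the seed labels, and the overlap-based scoring are all hypothetical and chosen only for illustration. It assumes triples of the form (argument 1, relation, argument 2) and labels an unseen entity with the type of the seed entity whose relations it shares most.

```python
from collections import Counter, defaultdict

# Hypothetical triples (arg1, relation, arg2), in the general shape an
# open information extractor might emit. Invented for illustration.
TRIPLES = [
    ("Michel Foucault", "a écrit", "Surveiller et punir"),
    ("Michel Foucault", "a enseigné au", "Collège de France"),
    ("Jean-Paul Sartre", "a écrit", "L'Être et le Néant"),
    ("Jean-Paul Sartre", "a enseigné au", "lycée du Havre"),
    ("Claude Monet", "a peint", "Impression, soleil levant"),
    ("Paul Cézanne", "a peint", "Les Joueurs de cartes"),
    ("Paul Cézanne", "est né à", "Aix-en-Provence"),
]

# Hypothetical seed labels standing in for the small amount of supervision.
SEEDS = {"Jean-Paul Sartre": "philosopher", "Claude Monet": "painter"}


def relation_profiles(triples):
    """Count, for each arg1 entity, how often each relation occurs."""
    profiles = defaultdict(Counter)
    for arg1, relation, _arg2 in triples:
        profiles[arg1][relation] += 1
    return profiles


def classify(entity, profiles, seeds):
    """Return the label of the seed sharing the most relations, or None."""
    scores = Counter()
    for seed, label in seeds.items():
        # Multiset intersection: relations the entity and the seed share.
        shared = profiles[entity] & profiles[seed]
        scores[label] += sum(shared.values())
    if not scores or max(scores.values()) == 0:
        return None  # no evidence in common with any seed
    return scores.most_common(1)[0][0]


if __name__ == "__main__":
    profiles = relation_profiles(TRIPLES)
    for name in ("Michel Foucault", "Paul Cézanne", "Marie Curie"):
        print(name, "->", classify(name, profiles, SEEDS))
```

In the cross‐domain setting the abstract describes, the triples would come from French Wikipedia while the entities to label come from unrelated scholarly articles, so an entity with no shared relations (the `None` case here) is the situation a recall‐oriented system has to handle.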