
Pipeline for a Data-driven Network of Linguistic Terms
Author(s) - Søren Wichmann
Publication year - 2021
Publication title - Linköping Electronic Conference Proceedings
Language(s) - English
Resource type - Conference proceedings
eISSN - 1650-3740
pISSN - 1650-3686
DOI - 10.3384/ecp184176
Subject(s) - terminology , computer science , pipeline (software) , natural language processing , artificial intelligence , rank (graph theory) , linguistic description , linguistics , deep linguistic processing , pruning , conjunction (astronomy) , ranking (information retrieval) , pointwise , information retrieval , mathematics , philosophy , physics , combinatorics , astronomy , agronomy , biology , programming language , mathematical analysis
The present work aims at (1) developing a search engine adapted to the large DReaM corpus of descriptive linguistic literature and (2) gaining insight into how a data-driven ontology of linguistic terminology might be built. Starting from close to 20,000 text documents from the language-description literature, either born-digital or scanned and OCR’d, we extract keywords and pass them through a pruning pipeline in which mainly those keywords that can be considered linguistic terminology survive. We then quantify relations among these terms using Normalized Pointwise Mutual Information (NPMI) and use the resulting measures, together with Google PageRank (GPR), to build networks of linguistic terms.
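The last step of the abstract, scoring term pairs with NPMI and then ranking terms in the resulting network, can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not say how co-occurrence is counted or how NPMI and PageRank are combined, so the sketch assumes document-level co-occurrence, keeps only pairs with positive NPMI as weighted undirected edges, and runs a plain power-iteration PageRank with a damping factor of 0.85. The toy documents and the names `npmi_scores` and `pagerank` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(docs):
    """Return {(term_a, term_b): NPMI} from document-level co-occurrence.

    NPMI(a, b) = log(p(a, b) / (p(a) * p(b))) / -log(p(a, b)),
    where probabilities are fractions of documents (an assumption).
    """
    n = len(docs)
    term_count = Counter()
    pair_count = Counter()
    for terms in docs:
        unique = sorted(set(terms))
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    scores = {}
    for (a, b), c in pair_count.items():
        if c == n:
            # Pair occurs in every document: NPMI is undefined (0 / 0).
            continue
        p_ab = c / n
        pmi = math.log(p_ab / ((term_count[a] / n) * (term_count[b] / n)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph.

    `edges` maps (node_a, node_b) to a positive weight, so every node
    has at least one neighbour and no dangling-node handling is needed.
    """
    neighbours = {}
    for (a, b), w in edges.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w
    nodes = sorted(neighbours)
    strength = {v: sum(neighbours[v].values()) for v in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            for u, w in neighbours[v].items():
                # Each node passes its rank on in proportion to edge weight.
                new[u] += damping * rank[v] * w / strength[v]
        rank = new
    return rank


# Hypothetical toy corpus: each document is its list of surviving keywords.
docs = [
    ["ergative", "absolutive", "case"],
    ["ergative", "absolutive"],
    ["case", "noun"],
    ["noun", "verb", "case"],
    ["verb", "tense"],
]

scores = npmi_scores(docs)
# Assumption: only positively associated pairs become edges.
edges = {pair: s for pair, s in scores.items() if s > 0}
ranks = pagerank(edges)

for term, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{r:.3f}")
```

NPMI ranges from -1 (terms never co-occur) to 1 (terms only occur together), which makes it convenient as an edge weight once negative values are dropped; other thresholds or a directed graph would be equally defensible choices.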