Collocations and statistical analysis of n-grams
Author(s) - Gunn Inger Lyse, Gisle Andersen
Publication year - 2012
Publication title - Studies in Corpus Linguistics
Language(s) - English
Resource type - Book series
ISSN - 1388-0373
DOI - 10.1075/scl.49.05lys
Subject(s) - natural language processing, computer science, linguistics, philosophy
Multiword expressions (MWEs) are words that co-occur so often that they are perceived as a linguistic unit. Since MWEs pervade natural language, their identification is pertinent for a range of tasks within lexicography, terminology and language technology. We apply various statistical association measures (AMs) to word sequences from the Norwegian Newspaper Corpus (NNC) in order to rank two- and three-word sequences (bigrams and trigrams) in terms of their tendency to co-occur. The results show that some statistical measures favour relatively frequent MWEs (e.g. i motsetning til ‘as opposed to’), whereas other measures favour relatively low-frequency units, which typically comprise loan words (de facto), technical terms (notarius publicus) and phrasal anglicisms (practical jokes; cf. G. Andersen, this volume). On this basis we evaluate the relevance of each of these measures for lexicography, terminology and language technology purposes.
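The contrast between measures that favour frequent units and measures that favour rare ones can be illustrated with a minimal sketch. The abstract does not name the AMs used, so pointwise mutual information (PMI) and the t-score are assumed here purely as two common examples; the function name `bigram_scores` and the toy token list are hypothetical and are not drawn from the NNC. The sketch covers bigrams only.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (observed, pmi, t_score)} for adjacent word pairs.

    PMI and t-score are assumed example measures, not necessarily
    the ones evaluated in the chapter.
    """
    n = len(tokens) - 1                      # number of bigram positions
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected count if w1 and w2 were independent (simple approximation).
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, pmi, t_score)
    return scores

# Toy data (invented): one frequent sequence and one rare loan expression.
tokens = ["i", "motsetning", "til", "dette"] * 5 + ["de", "facto"]
scores = bigram_scores(tokens)

for pair in [("i", "motsetning"), ("de", "facto")]:
    freq, pmi, t = scores[pair]
    print(pair, f"freq={freq} pmi={pmi:.2f} t={t:.2f}")
```

On this toy input, PMI ranks the single occurrence of "de facto" above the five occurrences of "i motsetning", while the t-score reverses the order, which mirrors the frequency bias described in the abstract.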