
RUSSIAN LANGUAGE AND CORPUS DIVERSITY
Author(s) - Alexander Piperski
Publication year - 2020
Publication title - kompʹûternaâ lingvistika i intellektualʹnye tehnologii
Language(s) - English
Resource type - Conference proceedings
ISSN - 2075-7182
DOI - 10.28995/2075-7182-2020-19-615-627
Subject(s) - computer science, linguistics, syntax, natural language processing, selection (genetic algorithm), corpus linguistics, artificial intelligence, grammar, variation (astronomy), the internet, text corpus, world wide web, philosophy, physics, astrophysics
This paper discusses the use of the most widely known Russian corpora, namely the Russian National Corpus, ruTenTen, the General Internet Corpus of Russian, and Araneum Russicum Maximum, for the theoretical study of the Russian language. Based on a sample of papers from 2019, I demonstrate that scholars, especially theoretical linguists, tend to ignore the opportunities provided by a wide range of Web corpora, even though these resources are well known to the NLP community. I present a selection of case studies to show that data from “non-classical” corpora can be used for studying various linguistic phenomena, such as: 1) variation in morphology and syntax; 2) word formation and lexical change; 3) construction grammar. I also claim that the underuse of non-classical corpora is partly due to the fact that they are, or are perceived as, not quite user-friendly.