
QUANTILE-BASED APPROACH TO ESTIMATING COGNITIVE TEXT COMPLEXITY
Author(s) - Maksim Eremeev, Konstantin Vorontsov
Publication year - 2020
Publication title -
Komp'juternaja Lingvistika i Intellektual'nye Tehnologii (Computational Linguistics and Intellectual Technologies)
Language(s) - English
Resource type - Conference proceedings
ISSN - 2075-7182
DOI - 10.28995/2075-7182-2020-19-256-269
Subject(s) - readability , computer science , quantile , artificial intelligence , crowdsourcing , natural language processing , language model , machine learning , statistics , mathematics
This paper introduces an approach to measuring the cognitive complexity of texts at various language levels. While standard readability indices are based on linear combinations of primary statistics, our general approach allows us to estimate complexity at the morphological, lexical, syntactic, and discursive levels. Each model is defined by the tokens of the specific language level and a complexity function over a single token. We then use a reference collection of moderately complex texts and a quantile-based approach to spot abnormally rare tokens. The proposed supervised ensemble, based on the ElasticNet model, incorporates models from all language levels. Having collected a labeled dataset through crowdsourcing, consisting of pairs of articles from the Russian Wikipedia, we consider several models and ensembles and compare them to common baselines. The suggested models are flexible due to the freedom in choosing the reference collection. The described experiments confirm the competitiveness of the proposed approach, as the ensembles achieve the best value of the target metric.
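To make the quantile-based idea concrete, here is a minimal sketch for the lexical level only: token frequencies are estimated on a reference collection, a quantile of those frequencies serves as a rarity threshold, and a text's complexity is scored as the share of its tokens that are at or below the threshold. All function names, the whitespace tokenizer, and the choice of score are illustrative assumptions, not the paper's exact formulation (which defines per-token complexity functions and combines levels with an ElasticNet ensemble).

```python
# Hypothetical sketch of quantile-based rarity scoring (lexical level).
# Assumptions: whitespace tokenization, lowercase normalization, and a
# simple "share of rare tokens" score; the paper's actual models differ.
from collections import Counter

def build_reference_counts(reference_texts):
    """Count token frequencies over the reference collection."""
    counts = Counter()
    for text in reference_texts:
        counts.update(text.lower().split())
    return counts

def complexity_score(text, ref_counts, q=0.1):
    """Fraction of tokens whose reference frequency is at or below
    the q-quantile of all reference token frequencies."""
    freqs = sorted(ref_counts.values())
    threshold = freqs[int(q * (len(freqs) - 1))]
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # Unseen tokens get frequency 0 and are always counted as rare.
    rare = sum(1 for t in tokens if ref_counts.get(t, 0) <= threshold)
    return rare / len(tokens)
```

A text drawing mostly on tokens that are frequent in the reference collection scores near 0, while one full of unseen or below-threshold tokens scores near 1; the freedom to swap the reference collection is what makes the approach adaptable to different audiences.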