Open Access
THE SMALLER THE BETTER? HETEROGENEITY OF CORPUS, TRAINING SIZE, AND MORPHOLOGICAL TAGGING
Author(s) - O. N. Lyashevskaya, L. N. Ostyakova, E. A. Salnikov, Olena Semenova
Publication year - 2020
Publication title - kompʹûternaâ lingvistika i intellektualʹnye tehnologii
Language(s) - English
Resource type - Conference proceedings
ISSN - 2075-7182
DOI - 10.28995/2075-7182-2020-19-1091-1108
Subject(s) - computer science, normalization (sociology), artificial intelligence, natural language processing, training set, training (meteorology), slavic languages, linguistics, geography, sociology, meteorology, anthropology, philosophy
Orthographic and morphological heterogeneity in historical texts of premodern Slavic languages causes many difficulties for POS and morphological tagging. Existing approaches to these tasks achieve state-of-the-art results without normalization, but they remain highly sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe, trained on different parts of the Middle Russian corpus, shows a boost in accuracy when less training data is used. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.
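The kind of comparison named in the last sentence can be illustrated with a minimal Python sketch. It is not the authors' procedure: the CoNLL-U input format is assumed because UDPipe reads and writes it, while the two toy samples, the helper names (`make_conllu`, `read_tokens`, `profile`, `total_variation`), the "short word" threshold of two characters, and the use of total variation distance as the comparison measure are all assumptions made for illustration only.

```python
from collections import Counter


def make_conllu(tokens):
    """Build a toy CoNLL-U sentence from (form, upos) pairs (hypothetical helper)."""
    rows = ["# text = toy sample"]
    for i, (form, upos) in enumerate(tokens, start=1):
        # 10 tab-separated columns: ID, FORM, LEMMA, UPOS, then 6 unused fields.
        rows.append("\t".join([str(i), form, form, upos] + ["_"] * 6))
    return "\n".join(rows) + "\n"


def read_tokens(conllu_text):
    """Yield (form, upos) pairs, skipping comments, blank lines,
    multiword-token ranges (IDs like 1-2) and empty nodes (IDs like 1.1)."""
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        yield cols[1], cols[3]


def profile(conllu_text, short_len=2):
    """Return (relative UPOS frequencies, share of tokens with <= short_len chars).
    The threshold short_len=2 is an assumption, not taken from the paper."""
    tokens = list(read_tokens(conllu_text))
    total = len(tokens)
    if total == 0:
        return {}, 0.0
    counts = Counter(upos for _, upos in tokens)
    pos_dist = {tag: n / total for tag, n in counts.items()}
    short_share = sum(1 for form, _ in tokens if len(form) <= short_len) / total
    return pos_dist, short_share


def total_variation(p, q):
    """Total variation distance between two tag distributions (0 = identical)."""
    tags = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tags)


# Invented toy subcorpora; real input would be CoNLL-U files of each subcorpus.
SAMPLE_A = make_conllu([("i", "CCONJ"), ("slovo", "NOUN"), ("be", "AUX"), ("v", "ADP")])
SAMPLE_B = make_conllu([("gramota", "NOUN"), ("dana", "VERB"), ("knyazyu", "NOUN"), ("v", "ADP")])

pos_a, short_a = profile(SAMPLE_A)
pos_b, short_b = profile(SAMPLE_B)
distance = total_variation(pos_a, pos_b)

print("short-word share:", short_a, short_b)
print("POS distribution distance:", distance)
```

A large distance between a training subcorpus and the test data, or a marked difference in the share of short words, would be one signal that adding that subcorpus to training contributes heterogeneity rather than useful evidence.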
