Open Access
DOC2VEC OR BETTER INTERPRETABILITY? A METHOD STUDY FOR AUTHORSHIP ATTRIBUTION
Author(s) - Elena Pimonova, Oleg Durandin, Alexey Malafeev
Publication year - 2020
Publication title - kompʹûternaâ lingvistika i intellektualʹnye tehnologii
Language(s) - English
Resource type - Conference proceedings
ISSN - 2075-7182
DOI - 10.28995/2075-7182-2020-19-606-614
Subject(s) - computer science, interpretability, natural language processing, authorship attribution, artificial intelligence, representation (politics), feature (linguistics), set (abstract data type), code (set theory), style (visual arts), linguistics, philosophy, archaeology, politics, political science, law, history, programming language
In this work, we perform a method study for the problem of authorship attribution in Russian and English. The datasets consist of 324 works written in Russian and 207 works written in English. We propose a set of text representation models that reflect various linguistic phenomena, in particular morphological and syntactic ones. A distinctive feature of the proposed models is that they are interpretable. The models are evaluated individually and in combination against a Doc2Vec baseline. For Russian, some of our models outperform Doc2Vec; for English they do not, for various reasons. However, the proposed models can also be used together with Doc2Vec, dramatically improving its performance: by 16.79% for Russian and by 7.2% for English. Additionally, we experiment with two methods for separating texts into blocks of K sentences (contiguous and bootstrapped) and tune the parameter K. Finally, we conduct a feature importance analysis and show which linguistic markers of author style are the most pertinent for Russian, for English, and for both languages. All code used in this work is made freely available to the community.
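The two block-splitting schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code. The function names `contiguous_blocks` and `bootstrapped_blocks` are hypothetical, and the abstract does not define "bootstrapped", so the sketch assumes it means drawing K sentences at random, with replacement, from a single text. It also assumes the text has already been split into sentences.

```python
import random
from typing import List, Optional


def contiguous_blocks(sentences: List[str], k: int) -> List[List[str]]:
    """Split a text into consecutive, non-overlapping blocks of k sentences.

    A trailing remainder shorter than k is dropped (an assumption; the
    abstract does not say how leftovers are handled).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [sentences[i:i + k] for i in range(0, len(sentences) - k + 1, k)]


def bootstrapped_blocks(
    sentences: List[str], k: int, n_blocks: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Build n_blocks blocks, each made of k sentences sampled with replacement.

    Assumed reading of "bootstrapped": sentence order and adjacency are not
    preserved, and the same sentence may occur more than once in a block.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not sentences:
        raise ValueError("cannot sample from an empty text")
    rng = random.Random(seed)  # local generator, so results are reproducible
    return [rng.choices(sentences, k=k) for _ in range(n_blocks)]


if __name__ == "__main__":
    text = [f"Sentence {i}." for i in range(7)]
    # Seven sentences with k=3 give two full blocks; the seventh is dropped.
    print(contiguous_blocks(text, k=3))
    # Four random blocks of three sentences each.
    print(bootstrapped_blocks(text, k=3, n_blocks=4, seed=0))
```

Under this reading, contiguous blocks preserve local discourse order, while bootstrapped blocks can yield any number of training samples from one text at the cost of breaking sentence adjacency. K plays the same role in both schemes, which is why it can be tuned as a single parameter.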