
TOXIC COMMENTS DETECTION IN RUSSIAN
Author(s) -
Sergey Smetanin
Publication year - 2020
Publication title -
kompʹûternaâ lingvistika i intellektualʹnye tehnologii
Language(s) - English
Resource type - Conference proceedings
ISSN - 2075-7182
DOI - 10.28995/2075-7182-2020-19-1149-1159
Subject(s) - computer science , encoder , sentence , annotation , artificial intelligence , the internet , natural language processing , source code , data science , machine learning , information retrieval , world wide web , operating system
Currently, social network sites tend to be one of the major communication platforms in both offline and online space. Freedom of expression of various points of view, including toxic, aggressive, and abusive comments, might have a long-term negative impact on people’s opinions and social cohesion. As a consequence, the ability to automatically identify and moderate toxic content on the Internet to eliminate the negative consequences is one of the necessary tasks for modern society. This paper aims at the automatic detection of toxic comments in the Russian language. As a source of data, we utilized anonymously published Kaggle dataset and additionally validated its annotation quality. To build a classification model, we performed fine-tuning of two versions of Multilingual Universal Sentence Encoder, Bidirectional Encoder Representations from Transformers, and ruBERT. Finetuned RuBERT achieved F1 = 92.20%, demonstrating the best classification score. We made trained models and code samples publicly available to the research community.