Open Access
Testing the classifier adapted to recognize the languages of works based on the Latin alphabet
Author(s) - З. Д. Усманов, Abdunabi A. Kosimov
Publication year - 2021
Publication title - sistemy analiza i obrabotki dannyh
Language(s) - English
Resource type - Journals
eISSN - 2782-215X
pISSN - 2782-2001
DOI - 10.17212/2782-2001-2021-2-83-94
Subject(s) - classifier (uml), computer science, artificial intelligence, natural language processing, homogeneity (statistics), alphabet, word lists by frequency, linguistics, machine learning, philosophy, sentence
Using a model collection of 10 texts in five languages written in the Latin script (English, German, Spanish, Italian, and French), the article establishes that the γ-classifier can automatically recognize the language of a work from the frequencies of the 26 common Latin letters. The mathematical model of the γ-classifier is a triad. Its first component is the digital portrait (DP) of a text - the frequency distribution of alphabetic unigrams in the text; the second is the formulas for calculating the distances between the DPs of texts; and the third is a machine learning algorithm that implements the hypothesis of “homogeneity” of works written in one language and “heterogeneity” of works written in different languages. The algorithm was tuned on a table of pairwise distances between all works in the model collection by determining the optimal value of the real parameter γ, that is, the value that minimizes the error of violating the “homogeneity” hypothesis. The γ-classifier trained on the texts of the model collection recognized the languages of the works with 100% accuracy. To test the classifier, six additional random texts were selected, five of which were in the same languages as the texts of the model collection. Under the nearest-neighbor method (by distance), each of these five texts confirmed its homogeneity with the corresponding pair of works in the same language. The sixth text, in Romanian, proved heterogeneous with respect to all elements of the collection. In terms of minimum distance, however, it was closest first to the two texts in Spanish and then to the two works in Italian.
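The triad described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the distance formula or the tuning procedure, so the sum of absolute frequency differences used in `distance`, the threshold rule in `is_homogeneous`, and all function names are assumptions made only for illustration.

```python
from collections import Counter
import string

# The 26 common Latin letters; accented letters are ignored in this sketch.
ALPHABET = string.ascii_lowercase


def digital_portrait(text):
    """Return the relative frequency of each of the 26 letters in the text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no Latin letters")
    return [counts[ch] / total for ch in ALPHABET]


def distance(p, q):
    """Assumed metric: sum of absolute differences between two portraits."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_neighbor(text, collection):
    """Return (label, distance) for the closest labeled text in the collection.

    `collection` is a list of (language_label, text) pairs.
    """
    dp = digital_portrait(text)
    scored = ((label, distance(dp, digital_portrait(t))) for label, t in collection)
    return min(scored, key=lambda pair: pair[1])


def is_homogeneous(d, gamma):
    """Assumed decision rule: two texts are homogeneous if their distance is at most gamma."""
    return d <= gamma
```

With a labeled collection in hand, `nearest_neighbor(new_text, collection)` returns the closest language label and its distance, and `is_homogeneous` can then be applied with a value of γ chosen on the collection's pairwise distances.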