Testing the classifier adapted to recognize the languages of works based on the Latin alphabet

Open Access

Author(s) - З. Д. Усманов, Abdunabi A. Kosimov

Publication year - 2021

Publication title - Analysis and Data Processing Systems

Language(s) - English

Resource type - Journals

eISSN - 2782-215X

pISSN - 2782-2001

DOI - 10.17212/2782-2001-2021-2-83-94

Subject(s) - classifier (uml) , computer science , artificial intelligence , natural language processing , homogeneity (statistics) , alphabet , word lists by frequency , linguistics , machine learning , philosophy , sentence

Using a model collection of 10 texts in five languages written in the Latin script (English, German, Spanish, Italian, and French), the article establishes the applicability of the γ-classifier to automatic recognition of the language of a work from the frequencies of the 26 common Latin letters. The mathematical model of the γ-classifier is represented as a triad. Its first component is the digital portrait (DP) of a text, that is, the frequency distribution of alphabetic unigrams in the text; the second is a set of formulas for calculating the distances between the DPs of texts; and the third is a machine learning algorithm that implements the hypothesis of “homogeneity” of works written in one language and “heterogeneity” of works written in different languages. The algorithm was tuned on a table of pairwise distances between all works in the model collection by determining the optimal value of the real parameter γ, the value that minimizes the error of violating the “homogeneity” hypothesis. The γ-classifier trained on the texts of the model collection recognized the languages of the works with 100% accuracy. To test the classifier, six additional random texts were selected, five of them in the same languages as the texts of the model collection. By the nearest-neighbor method (in terms of distance), all five of these texts confirmed their homogeneity with the corresponding pairs of works in the same language. The sixth text, in Romanian, proved heterogeneous with respect to all elements of the collection. At the same time, its smallest distances were, first of all, to two texts in Spanish and then to two works in Italian.
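The triad described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the distance formula or the exact tuning procedure, so this sketch makes two assumptions. It takes the distance to be the sum of absolute differences between letter frequencies, and it treats γ as a threshold chosen to minimize misclassified pairs. All function names are hypothetical, and the sketch should not be read as the authors' implementation.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase  # the 26 common Latin letters


def digital_portrait(text):
    """Relative frequency of each of the 26 letters; other characters are ignored."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no Latin letters")
    return [counts[ch] / total for ch in ALPHABET]


def distance(dp_a, dp_b):
    """ASSUMED metric: sum of absolute frequency differences (not from the source)."""
    return sum(abs(a - b) for a, b in zip(dp_a, dp_b))


def is_homogeneous(dp_a, dp_b, gamma):
    """Two texts count as 'homogeneous' when their distance does not exceed gamma."""
    return distance(dp_a, dp_b) <= gamma


def tune_gamma(portraits, labels):
    """ASSUMED tuning rule: choose the pairwise distance that, used as a
    threshold, misclassifies the fewest same-language/different-language pairs."""
    n = len(portraits)
    pairs = [
        (distance(portraits[i], portraits[j]), labels[i] == labels[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

    def errors(gamma):
        return sum((d <= gamma) != same for d, same in pairs)

    return min(sorted(d for d, _ in pairs), key=errors)


def nearest_neighbor(dp, portraits, labels):
    """Label and distance of the collection text closest to dp."""
    best = min(range(len(portraits)), key=lambda i: distance(dp, portraits[i]))
    return labels[best], distance(dp, portraits[best])
```

In this sketch, a new text would be assigned the language of its nearest neighbor only when the distance to that neighbor does not exceed γ. A text like the Romanian one would then be reported as heterogeneous to the whole collection, while still revealing which languages lie closest to it.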
