Open Access
The development of stemming algorithm for the Uzbek language
Author(s) - Бакаев Илхом Изатович
Publication year - 2021
Publication title - Kibernetika i programmirovanie
Language(s) - English
Resource type - Journals
ISSN - 2644-5522
DOI - 10.25136/2644-5522.2021.1.35847
Subject(s) - computer science, Uzbek, morpheme, agglutinative language, lexical analysis, lexicon, natural language processing, linguistics, artificial intelligence, algorithm, philosophy
The automatic processing of unstructured natural-language texts is one of the pressing problems in computer analysis and synthesis of text. Within this problem, the author singles out the task of text normalization, which usually involves such processes as tokenization, stemming, and lemmatization. Most existing stemming algorithms are oriented towards synthetic languages with inflectional morphemes. Uzbek is an agglutinative language, characterized by the polysemy of its affixal and auxiliary morphemes. It differs greatly from languages such as English, which stemming algorithms process successfully, and there are virtually no examples of an effective stemming algorithm for Uzbek; this question is therefore of scientific interest and defines the goal of this work. In the course of the research, the author solved the task of bringing Uzbek texts, which had been tokenized and cleared of stop words at a preliminary stage, to normal form. The author developed a method for normalizing Uzbek texts based on a stemming algorithm. The algorithm was developed using a hybrid approach that combines an algorithmic method, a lexicon of linguistic rules, and a database of normal word forms of the Uzbek language. The precision of the proposed algorithm depends on the precision of the tokenization algorithm. The article does not address finding the roots of paired words separated by spaces, since that task is handled at the tokenization stage. The algorithm can be integrated into various automated systems for machine translation, information extraction, data retrieval, etc.
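As a rough illustration of the hybrid idea described in the abstract, the following minimal Python sketch combines a lookup in a set of normal forms with rule-based suffix stripping. It is not the author's algorithm: the suffix list, the tiny lexicon, the `min_len` threshold, and the function name `stem` are all assumptions chosen for the example, and the input is assumed to be already tokenized.

```python
# Illustrative subset of common Uzbek suffixes (an assumption for this
# sketch, not the rule lexicon used in the article).
SUFFIXES = ["lar", "imiz", "ning", "dan", "ni", "ga", "da"]

# Hypothetical stand-in for the database of normal word forms.
NORMAL_FORMS = {"kitob", "maktab", "bola", "shahar", "tanga"}


def stem(token, lexicon=NORMAL_FORMS, suffixes=SUFFIXES, min_len=3):
    """Strip suffixes from the end of a token until it matches a normal form.

    If no normal form is reached, the most stripped candidate is returned,
    which corresponds to a purely rule-based fallback.
    """
    word = token.lower()
    # Try longer suffixes first so that "dan" is preferred over "da".
    by_length = sorted(suffixes, key=len, reverse=True)
    while word not in lexicon:
        for suffix in by_length:
            # Keep at least min_len characters to limit over-stemming.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                break
        else:
            # No suffix matched: stop stripping.
            break
    return word


if __name__ == "__main__":
    for token in ["kitoblarimizdan", "maktabda", "Bolalar", "tanga"]:
        print(token, "->", stem(token))
```

In this sketch the lexicon check runs before any stripping, so a word such as "tanga", which happens to end in the suffix-like string "ga", is returned unchanged. A pure suffix stripper would shorten it, which is one reason a hybrid design may be preferable for agglutinative languages.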
