Open Access
A modified algorithm of the latent semantic analysis for text processing in the Russian language
Author(s) - А. А. Иванов, S V Holtzer
Publication year - 2021
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1715/1/012009
Subject(s) - latent semantic analysis , computer science , lemmatisation , natural language processing , artificial intelligence , simple (philosophy) , dimension (graph theory) , algorithm , mathematics , epistemology , pure mathematics , philosophy
The paper presents a methodology for analyzing texts in the Russian language, based on the Latent Semantic Analysis (LSA) algorithm. Several disadvantages of the classical method are considered, and modified methods for extracting N-grams from the text are proposed. Compared with the standard method, the modified method reduces the number of extracted N-grams and increases the meaningfulness of the retrieved collection. The smaller collection reduces the dimension of the TF-IDF matrix and accelerates the execution of the SVD method. The advantages of the developed machine learning algorithm are demonstrated on simple sentences. The ideas discussed also make it possible to parallelize text processing effectively at the lemmatization step.
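The effect of N-gram filtering on the TF-IDF matrix can be illustrated with a minimal sketch. The abstract does not describe the paper's actual modification, so the rule used here (dropping N-grams that begin or end with a stop word) is only a hypothetical stand-in, and the names `STOP_WORDS`, `ngrams`, and `tfidf_matrix` are assumptions made for illustration. English toy sentences are used for readability, and the lemmatization and SVD steps are omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the paper's real filtering rule is not given in the abstract.
STOP_WORDS = {"the", "a", "on", "in", "of"}


def ngrams(tokens, n_max=2, drop_stop_edges=False):
    """Return all 1..n_max-grams; optionally skip those that start or end with a stop word."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if drop_stop_edges and (gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS):
                continue
            out.append(" ".join(gram))
    return out


def tfidf_matrix(docs, **ngram_options):
    """Build a dense documents-by-terms TF-IDF matrix; returns (vocabulary, rows)."""
    counts = [Counter(ngrams(d.lower().split(), **ngram_options)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against an empty document
        # Term frequency times a smoothed inverse document frequency.
        rows.append([
            (c[t] / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t in vocab
        ])
    return vocab, rows


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat chased the dog",
]

full_vocab, full_rows = tfidf_matrix(docs)
reduced_vocab, reduced_rows = tfidf_matrix(docs, drop_stop_edges=True)

print("columns without filtering:", len(full_vocab))
print("columns with filtering:   ", len(reduced_vocab))
```

Because the number of columns equals the number of distinct N-grams, any filter that discards uninformative N-grams shrinks the matrix passed to SVD, which is the source of the speed-up the abstract describes.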
