Open Access
Ways to Improve N-Gram Language Models for OCR and Speech Recognition of Slavic Languages
Author(s) - Владимир Юрьевич Тарануха
Publication year - 2014
Publication title - The Advanced Science Journal
Language(s) - English
Resource type - Journals
eISSN - 2219-7478
pISSN - 2219-746X
DOI - 10.15550/asj.2014.04.065
Subject(s) - Slavic languages, computer science, linguistics, natural language processing, speech recognition, artificial intelligence, philosophy
The paper investigates the problems that n-gram models pose for OCR and speech recognition of Slavic languages and proposes methods applicable to most of these languages. Two approaches are tested: filtering the n-gram model, and alternative ways of carrying out smoothing. The filtering relies on heuristics based on word frequencies and morphological features. The smoothing uses classes based on morphological features in combination with a new discounting formula, and it can also be combined with inner filtering. Numerical experiments for the Ukrainian language show that both approaches produce interesting results. Smoothing is the more promising of the two, but it is also more complex: outperforming standard smoothing techniques will require further work on developing suitable classes from morphological information.
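As an illustration only, the two ideas named in the abstract can be sketched in Python. The abstract does not give the paper's heuristics, morphological classes, or discounting formula, so everything below is an assumption: `filter_bigrams` uses a plain count threshold, `word_class` is a hypothetical mapping from words to morphological tags, and the smoothing is standard interpolated absolute discounting with a class-based backoff, standing in for the author's method rather than reproducing it.

```python
from collections import Counter


def filter_bigrams(bigrams, min_count=2):
    """Keep only bigrams seen at least min_count times (assumed heuristic)."""
    return Counter({bg: c for bg, c in bigrams.items() if c >= min_count})


def make_class_smoothed_model(bigrams, word_class, discount=0.5):
    """Return prob(w1, w2): absolute discounting with class-based backoff.

    Backoff distribution (an assumption, not the paper's formula):
        P(class(w2) | class(w1)) * P(w2 | class(w2))
    word_class must contain every word that occurs in bigrams.
    """
    ctx_total = Counter()    # c(w1, *)
    ctx_types = Counter()    # number of distinct successors of w1
    cls_bigram = Counter()   # c(class1, class2)
    cls_ctx = Counter()      # c(class1, *)
    word_count = Counter()   # c(*, w2)
    cls_total = Counter()    # c(*, class2)
    for (w1, w2), c in bigrams.items():
        c1, c2 = word_class[w1], word_class[w2]
        ctx_total[w1] += c
        ctx_types[w1] += 1
        cls_bigram[c1, c2] += c
        cls_ctx[c1] += c
        word_count[w2] += c
        cls_total[c2] += c

    def prob(w1, w2):
        c1, c2 = word_class[w1], word_class[w2]
        backoff = 0.0
        if cls_ctx[c1] and cls_total[c2]:
            backoff = (cls_bigram[c1, c2] / cls_ctx[c1]) * (
                word_count[w2] / cls_total[c2]
            )
        if ctx_total[w1] == 0:
            # Unknown context: rely on the class-based estimate alone.
            return backoff
        seen = max(bigrams.get((w1, w2), 0) - discount, 0.0) / ctx_total[w1]
        # Probability mass removed from seen bigrams, given to the backoff.
        reserved = discount * ctx_types[w1] / ctx_total[w1]
        return seen + reserved * backoff

    return prob


# Toy data; the words and class tags are made up for the example.
counts = Counter({("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 2, ("b", "z"): 1})
classes = {"a": "N", "b": "N", "x": "V", "y": "V", "z": "A"}
model = make_class_smoothed_model(counts, classes)
filtered = filter_bigrams(counts, min_count=2)
```

In this toy setup the bigram ("a", "z") never occurs, yet `model("a", "z")` is nonzero because the class pair (N, A) was observed after "b". That is the general motivation for class-based smoothing in morphologically rich languages, where many valid word pairs are absent from the training data.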
