Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System

Open Access

Author(s) - Kelana Jaya, Deepa Gupta

Publication year - 2016

Publication title - International Journal of Electrical and Computer Engineering (IJECE)

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.277

H-Index - 22

ISSN - 2088-8708

DOI - 10.11591/ijece.v6i3.pp1059-1071

Subject(s) - computer science, machine translation, hindi, natural language processing, artificial intelligence, evaluation of machine translation, standardization, phrase, machine translation software usability, example based machine translation, translation (biology), text corpus, biochemistry, chemistry, messenger rna, gene, operating system

Although a great deal of Statistical Machine Translation (SMT) research is under way for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different dataset, different parameters, and a different number of sentences in the various phases of translation, which leads to varied translation output. As a result, comparing these models, understanding their results, gaining insight into corpus behavior, and reproducing the reported results all become tedious. This calls for standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is an approach to standardize the dataset and to identify the combination of parameters that gives the best performance. The paper also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. The model works well in a scarce-resource setting without incorporating an external parallel corpus for the underlying language pair. The experiments are carried out using the open-source phrase-based toolkit Moses, on the Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus. Even with a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.
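One ingredient of dataset standardization mentioned above is fixing how many sentence pairs go to each phase of translation. The following is a minimal sketch of a reproducible train/tune/test split of a parallel corpus, not the procedure used in the paper: the function name `split_parallel_corpus`, the seed, and the split sizes are all hypothetical, since the abstract does not state them.

```python
import random
from typing import Dict, List, Tuple

# One aligned sentence pair: (English sentence, Hindi sentence).
Pair = Tuple[str, str]


def split_parallel_corpus(
    pairs: List[Pair],
    n_tune: int,
    n_test: int,
    seed: int = 13,  # hypothetical seed; any fixed value gives a repeatable split
) -> Dict[str, List[Pair]]:
    """Split aligned sentence pairs into train/tune/test sets deterministically.

    The same corpus, sizes, and seed always yield the same split, so that
    different experiments can be compared on identical data.
    """
    if n_tune < 0 or n_test < 0:
        raise ValueError("split sizes must be non-negative")
    if n_tune + n_test >= len(pairs):
        raise ValueError("tune and test sets leave no data for training")

    # Shuffle indices with a private generator so global random state is untouched.
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)

    test_idx = indices[:n_test]
    tune_idx = indices[n_test:n_test + n_tune]
    train_idx = indices[n_test + n_tune:]

    def pick(idx: List[int]) -> List[Pair]:
        # Restore original corpus order inside each split.
        return [pairs[i] for i in sorted(idx)]

    return {"train": pick(train_idx), "tune": pick(tune_idx), "test": pick(test_idx)}


if __name__ == "__main__":
    # Toy placeholder data, not sentences from the ILCI corpus.
    corpus = [(f"english sentence {i}", f"hindi sentence {i}") for i in range(100)]
    splits = split_parallel_corpus(corpus, n_tune=10, n_test=10)
    print({name: len(part) for name, part in splits.items()})
```

Publishing the seed and split sizes alongside the results would let other researchers regenerate the same three sets before training and tuning a system with a toolkit such as Moses.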
