Open Access
Unsupervised Parallel Sentences of Machine Translation for Asian Language Pairs
Author(s) - Shaolin Zhu, Chenggang Mi, Tianqi Li, Yong Yang, Xu Chun
Publication year - 2023
Publication title - ACM Transactions on Asian and Low-Resource Language Information Processing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.239
H-Index - 14
eISSN - 2375-4702
pISSN - 2375-4699
DOI - 10.1145/3486677
Subject(s) - computer science, machine translation, sentence, natural language processing, artificial intelligence, similarity (geometry), parallel corpora, word (group theory), metric (unit), linguistics, philosophy, operations management, economics, image (mathematics)
Parallel sentence pairs play a very important role in many natural language processing (NLP) tasks, especially cross-lingual tasks such as machine translation. However, many Asian language pairs still lack bilingual parallel sentences, and because collecting bilingual parallel data is time-consuming and difficult, automatically mining such data is especially important for low-resource Asian language pairs. While existing methods have shown encouraging results, they either rely heavily on bilingual data or have notable drawbacks in unsupervised settings. To address these issues, we propose a new unsupervised similarity calculation and dynamic selection metric for obtaining parallel sentence pairs. First, our method learns a bilingual word embedding (BWE) mapping via post hoc adversarial training, which rotates the source embedding space to match the target space without any parallel data. Then, we introduce a new cross-domain similarity adaptation to obtain parallel sentence pairs. Experimental results on real-world datasets show that our model achieves better accuracy and recall when mining parallel sentence pairs. We also show that the extracted bilingual sentence corpora can significantly improve the performance of neural machine translation.
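The abstract describes two technical steps: adversarially rotating source word embeddings into the target space, and then scoring candidate sentence pairs with a hub-corrected cross-lingual similarity. The paper does not spell out the exact formulation here, but the second step resembles cross-domain similarity local scaling (CSLS). Below is a minimal NumPy sketch under that assumption; the function names, the mutual-best selection rule, and the threshold are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

def csls_scores(src_emb, tgt_emb, k=10):
    """CSLS-style similarity: cosine similarity corrected by the average
    similarity of each vector to its k nearest cross-lingual neighbours,
    which penalizes 'hub' vectors that are close to everything."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T  # cosine similarity matrix (n_src x n_tgt)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2.0 * sims - r_src - r_tgt

def mine_pairs(src_emb, tgt_emb, threshold=0.0, k=10):
    """Keep (i, j) only when the two sentences are mutual best matches
    under CSLS and the score clears a threshold (an assumed stand-in
    for the paper's dynamic selection metric)."""
    scores = csls_scores(src_emb, tgt_emb, k=k)
    pairs = []
    for i in range(scores.shape[0]):
        j = int(scores[i].argmax())
        if int(scores[:, j].argmax()) == i and scores[i, j] >= threshold:
            pairs.append((i, j, float(scores[i, j])))
    return pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for sentence embeddings; in the paper's setting the source
    # side would first be mapped by the adversarially learned rotation W,
    # e.g. src = raw_src @ W.T, before scoring.
    src = rng.normal(size=(100, 300))
    tgt = rng.normal(size=(120, 300))
    print(len(mine_pairs(src, tgt, threshold=0.0)))
```

The mutual-best check is a common filtering heuristic in bilingual lexicon and parallel sentence mining; the paper's actual dynamic selection metric may differ in how it sets or adapts the acceptance criterion.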
