Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings

Phillip Keung, Julián Salazar, Yichao Lu, Noah A. Smith

Open Access

Publication year - 2020

Publication title - Transactions of the Association for Computational Linguistics

Language(s) - English

Resource type - Journals

ISSN - 2307-387X

DOI - 10.1162/tacl_a_00348

Subject(s) - computer science, machine translation, artificial intelligence, natural language processing, language model, sentence, initialization

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT’14 French-English and WMT’16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT’15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
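The nearest-neighbor mining step and the F1 evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the filtering rule, or the self-training procedure, so the mutual-nearest-neighbor check, the `threshold` value, and the function names `cosine`, `mine_pairs`, and `f1_score` are all assumptions made for illustration. The short toy vectors stand in for sentence embeddings that a model such as multilingual BERT would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Return (src_index, tgt_index, score) for mutual nearest neighbors.

    Assumption: a pair is kept only if each sentence is the other's nearest
    neighbor and the similarity is at least `threshold`. The paper may use
    a different criterion.
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Nearest target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Nearest source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


def f1_score(predicted, gold):
    """F1 of predicted (src, tgt) index pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for sentence embeddings (hypothetical values).
src_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.1], [-0.2, 0.1, -1.0]]

mined = mine_pairs(src_vecs, tgt_vecs)
mined_indices = [(i, j) for i, j, _ in mined]

# Hypothetical gold alignment for the toy data.
gold_pairs = [(0, 1), (1, 0), (2, 2)]
score = f1_score(mined_indices, gold_pairs)
```

In this toy run, the third source sentence has no sufficiently similar mutual neighbor and is left unpaired, which lowers recall while keeping precision high. In a self-training setup, mined pairs like these would be fed back to adapt the embedding model before mining again; that loop is omitted here because the abstract gives no details about it.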
