Open Access
Probing language identity encoded in pre-trained multilingual models: a typological view
Author(s) - JianYu Zheng, Ying Li
Publication year - 2022
Publication title - PeerJ Computer Science
Language(s) - English
Resource type - Journals
ISSN - 2376-5992
DOI - 10.7717/peerj-cs.899
Subject(s) - computer science, identity (music), natural language processing, encoding (memory), language model, artificial intelligence, linguistics, philosophy, physics, acoustics
Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that the models preserve at encoding time, which we call "language identity". We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through the lens of language typology, and we examined differences between models and variations across languages, typological features, and internal hidden layers. We found that the models, both as a whole and layer by layer, rank as follows in their ability to preserve language identity: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order, and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, whereas the ability of mBERT fluctuates severely. Our findings summarize how well each pre-trained multilingual model and each of its hidden layers stores language identity and typological features, and they provide insights for later researchers working on cross-lingual information processing.
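To make the layer-wise probing idea concrete, the sketch below trains a simple classifier on frozen hidden-layer representations of mBERT to predict a typological property of the input language. This is only an illustration of the general technique, not the paper's actual protocol: the model name bert-base-multilingual-cased (via Hugging Face transformers), the LogisticRegression probe, the toy sentences, and the word-order labels are all assumptions introduced here for demonstration.

```python
# Layer-wise probing sketch: classify a typological feature (toy word-order
# labels) from frozen hidden states of mBERT. Illustrative only; the paper's
# exact data, features, and probing setup are not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT; swap in XLM or XLM-R to compare
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

# Toy (sentence, label) pairs; labels are illustrative, real probes use far more data.
examples = [
    ("The cat chased the mouse.", "SVO"),
    ("Der Hund hat die Katze gesehen.", "SOV"),
    ("She reads a book every evening.", "SVO"),
    ("Ich habe das Buch gelesen.", "SOV"),
] * 25  # repeat so the classifier has something to fit

def sentence_embeddings(texts, layer):
    """Mean-pooled token embeddings from one hidden layer, gradients disabled."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]   # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

texts, labels = zip(*examples)
for layer in range(model.config.num_hidden_layers + 1):  # layer 0 = embeddings
    X = sentence_embeddings(list(texts), layer)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X_te, y_te):.2f}")
```

Running the same loop for XLM and XLM-R checkpoints and comparing per-layer probe accuracy is one way to reproduce the kind of layer-wise comparison the abstract describes.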
