Language ID for a Thousand Languages

Open Access

Author(s) - Fei Xia, Carrie Lewis, William D. Lewis

Publication year - 2010

Publication title - LSA Annual Meeting Extended Abstracts

Language(s) - English

Resource type - Journals

ISSN - 2377-3367

DOI - 10.3765/exabs.v0i0.504

Subject(s) - coreference, computer science, task (project management), natural language processing, artificial intelligence, resolution (logic), management, economics

ODIN, the Online Database of INterlinear text, is a resource built from language data harvested from linguistic documents (Lewis, 2006). It currently holds approximately 190,000 instances of Interlinear Glossed Text (IGT) from over 1,100 languages, automatically extracted from nearly 3,000 documents crawled from the Web. A crucial step in building ODIN is identifying the languages of the extracted IGT, a challenging task due to the large number of languages and the lack of training data. We demonstrate that a coreference approach to the language ID task significantly outperforms existing algorithms, as it provides an elegant solution to the unseen language problem. We also discuss several issues that make automated language ID and the maintenance of ODIN very difficult.
