Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings

Ivan Vulić, Marie-Francine Moens

Open Access

Author(s) - Ivan Vulić, Marie-Francine Moens

Publication year - 2015

Publication title - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

Language(s) - English

Resource type - Conference proceedings

DOI - 10.1145/2766462.2767752

Subject(s) - computer science, clef, natural language processing, artificial intelligence, word (group theory), latent dirichlet allocation, word embedding, vector space, embedding, information retrieval, topic model, linguistics, mathematics, philosophy, geometry, management, economics, task (project management)

We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR, using the benchmark CLEF 2001-2003 collections and queries, demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by combining the WE-based approach with a unigram language model. We also report significant improvements in ad-hoc IR tasks for our WE-based framework over the state-of-the-art framework for learning text representations from comparable data, which is based on latent Dirichlet allocation (LDA).
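The idea of building query and document vectors from word embeddings and ranking in a shared cross-lingual space can be illustrated with a minimal sketch. The abstract does not name the composition model, so plain vector addition is assumed here, along with cosine similarity as the ranking function. The toy vectors, the `EMBEDDINGS` and `DOCUMENTS` tables, and the function names `compose`, `cosine`, and `rank` are all hypothetical and are not taken from the paper.

```python
import math

# Toy 3-dimensional vectors in a hypothetical shared English-Dutch space.
# The values are invented for illustration; they are not learned embeddings.
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "hond": [0.8, 0.2, 0.1],
    "tax": [0.0, 0.9, 0.3],
    "belasting": [0.1, 0.8, 0.4],
    "law": [0.1, 0.6, 0.7],
    "wet": [0.0, 0.5, 0.8],
}


def compose(tokens, embeddings=EMBEDDINGS):
    """Build a text vector by summing the vectors of its known tokens.

    Additive composition is an assumption made for this sketch.
    Tokens without an embedding are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        total = [a + b for a, b in zip(total, vector)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, documents, embeddings=EMBEDDINGS):
    """Score every document against the query and sort by descending score."""
    query_vector = compose(query_tokens, embeddings)
    scored = [
        (doc_id, cosine(query_vector, compose(tokens, embeddings)))
        for doc_id, tokens in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical Dutch documents, already tokenized.
DOCUMENTS = {
    "nl-1": ["hond"],
    "nl-2": ["belasting", "wet"],
}

# English query against Dutch documents (English-to-Dutch CLIR).
ranking = rank(["tax", "law"], DOCUMENTS)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because queries and documents in both languages map into the same space, the same `rank` function would serve the monolingual case by passing documents in the query language. Combining these scores with a unigram language model, as the abstract reports for the best results, is not shown, since the abstract does not specify how the two are combined.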
