Open Access
POS-tagging a bilingual parallel corpus: methods and challenges
Author(s) - Irene Doval Reixa
Publication year - 2017
Publication title - Research in Corpus Linguistics
Language(s) - English
Resource type - Journals
ISSN - 2243-4712
DOI - 10.32714/ricl.05.03
Subject(s) - computer science , annotation , natural language processing , german , part of speech tagging , corpus linguistics , artificial intelligence , process (computing) , part of speech , linguistics , information retrieval , programming language , philosophy
This paper reviews the author’s experience of tokenizing and POS-tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fictional texts. The work is part of an ongoing effort to annotate the corpus with part-of-speech information, and the study discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly on fictional data; on the other, pre-existing annotation schemes are all language-specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.
