
POS-tagging a bilingual parallel corpus: methods and challenges
Author(s) - Irene Doval Reixa
Publication year - 2017
Publication title - Research in Corpus Linguistics
Language(s) - English
Resource type - Journals
ISSN - 2243-4712
DOI - 10.32714/ricl.05.03
Subject(s) - computer science, annotation, natural language processing, German, part of speech tagging, corpus linguistics, artificial intelligence, process (computing), part of speech, linguistics, information retrieval, programming language, philosophy
This paper reviews the author’s experience of tokenizing and POS-tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fictional texts. The work is part of an ongoing effort to annotate the corpus with part-of-speech information, and the study discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly on fictional data; on the other, pre-existing annotation schemes are all language-specific. To improve accuracy during post-editing, the author has developed a common tagset and identified the major error patterns.
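The idea of a common tagset over language-specific schemes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say which source tagsets or which common categories the author uses. The sketch assumes STTS-style tags for German, EAGLES-style tags for Spanish (where the leading characters encode the word class), and a hypothetical coarse target inventory (`NOUN`, `VERB`, and so on). The function name `to_common` is likewise hypothetical.

```python
# Hypothetical mapping tables; the paper's actual common tagset is not
# given in the abstract, so these categories are illustrative only.
STTS_TO_COMMON = {
    "NN": "NOUN",     # common noun
    "NE": "PROPN",    # proper noun
    "VVFIN": "VERB",  # finite full verb
    "ART": "DET",     # article
    "ADJA": "ADJ",    # attributive adjective
    "APPR": "ADP",    # preposition
}

# EAGLES-style tags encode the word class in their first characters,
# so the Spanish side is matched by prefix rather than by full tag.
EAGLES_PREFIX_TO_COMMON = {
    "NC": "NOUN",
    "NP": "PROPN",
    "V": "VERB",
    "D": "DET",
    "A": "ADJ",
    "S": "ADP",
}


def to_common(tag: str, lang: str) -> str:
    """Map a language-specific tag to the (assumed) common tagset.

    Unknown tags fall back to "X" so they can be flagged in post-editing.
    """
    if lang == "de":
        return STTS_TO_COMMON.get(tag, "X")
    if lang == "es":
        # Try longer prefixes first so "NC"/"NP" are checked before "N..."-like
        # single-letter entries could shadow them.
        for prefix in sorted(EAGLES_PREFIX_TO_COMMON, key=len, reverse=True):
            if tag.startswith(prefix):
                return EAGLES_PREFIX_TO_COMMON[prefix]
        return "X"
    raise ValueError(f"unsupported language: {lang!r}")


# One aligned sentence pair, tagged with the language-specific schemes.
german = [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN")]
spanish = [("El", "DA0MS0"), ("perro", "NCMS000"), ("duerme", "VMIP3S0")]

german_common = [to_common(tag, "de") for _, tag in german]
spanish_common = [to_common(tag, "es") for _, tag in spanish]
```

With both sides projected onto the same categories, tag sequences from aligned German and Spanish segments become directly comparable, which is one way mismatches could be surfaced as candidates for manual post-editing.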