POS-tagging a bilingual parallel corpus: methods and challenges
Author(s) - Irene Doval
Publication year - 2017
Publication title - Research in Corpus Linguistics
Language(s) - English
Resource type - Journals
ISSN - 2243-4712
DOI - 10.32714/ricl.05.03
Subject(s) - computer science, annotation, natural language processing, German, part-of-speech tagging, corpus linguistics, artificial intelligence, process (computing), part of speech, linguistics, information retrieval, programming language, philosophy
This paper reviews the author’s experience of tokenizing and POS-tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fiction. The work is part of an ongoing effort to annotate the corpus for part-of-speech information, and the paper discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly on fictional data; on the other, pre-existing annotation schemes are all language-specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.
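The idea of a common tagset over language-specific schemes can be illustrated with a minimal sketch. It is not the author's tagset, which is not reproduced in this record. It assumes, purely for illustration, that the German side carries STTS tags and the Spanish side carries EAGLES-style tags, and it maps both to a small set of hypothetical coarse labels (`NOUN`, `VERB`, and so on). The function name `to_common` and the fallback label `X` are likewise assumptions.

```python
# Illustrative only: coarse labels and mappings are assumptions,
# not the common tagset described in the paper.

# A few German STTS tags mapped to hypothetical shared categories.
STTS_TO_COMMON = {
    "NN": "NOUN", "NE": "PROPN",
    "VVFIN": "VERB", "VVINF": "VERB",
    "ADJA": "ADJ", "ADJD": "ADJ",
    "ART": "DET", "APPR": "ADP",
    "PPER": "PRON", "ADV": "ADV", "KON": "CONJ",
}

# EAGLES-style Spanish tags encode the word class in the first character.
EAGLES_PREFIX_TO_COMMON = {
    "N": "NOUN", "V": "VERB", "A": "ADJ", "D": "DET",
    "S": "ADP", "P": "PRON", "R": "ADV", "C": "CONJ",
}


def to_common(tag: str, lang: str) -> str:
    """Map a language-specific tag to a shared coarse label ('X' if unknown)."""
    if lang == "de":
        return STTS_TO_COMMON.get(tag, "X")
    if lang == "es":
        if tag.startswith("NP"):  # EAGLES proper nouns begin with NP
            return "PROPN"
        return EAGLES_PREFIX_TO_COMMON.get(tag[:1], "X")
    raise ValueError(f"unsupported language: {lang}")


# An aligned sentence pair, tagged with each language's own scheme.
german = [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN")]
spanish = [("El", "DA0MS0"), ("perro", "NCMS000"), ("duerme", "VMIP3S0")]

german_common = [to_common(tag, "de") for _, tag in german]
spanish_common = [to_common(tag, "es") for _, tag in spanish]
print(german_common)
print(spanish_common)
```

With a shared label set of this kind, aligned segments can be compared tag by tag, which is one way mismatches might be surfaced for post-editing.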