Automatic Encoding and Language Detection in the GSDL
Author(s) - Otakar Pinkas
Publication year - 2014
Publication title - Journal of Systems Integration
Language(s) - English
Resource type - Journals
ISSN - 1804-2724
DOI - 10.20470/jsi.v5i4.211
Subject(s) - encoding (memory) , computer science , natural language processing , programming language , artificial intelligence
Automatic detection of text encoding and language is part of the Greenstone Digital Library Software (GSDL), which is used for building and distributing digital collections. GSDL is developed by the University of Waikato (New Zealand) in cooperation with UNESCO. Automatic encoding and language detection for Slavic languages is difficult and sometimes fails. The aim of this work is to identify the cases in which it fails. Automatic detection in GSDL is based on the n-gram method. The most frequent n-grams for Czech are presented, and the whole process of automatic detection in GSDL is described. The input documents for the test collections are plain texts encoded in ISO-8859-1, ISO-8859-2 and Windows-1250. We manually evaluated the quality of automatic detection. The causes of errors include the predominance of an improper language model and an incorrect switch to Windows-1250. We carried out further tests on more complex documents; a separate article is devoted to them.
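To illustrate the general idea behind n-gram detection, the following is a minimal sketch of a rank-order n-gram classifier. It is not the GSDL code and is not taken from the article: the function names (`ngram_profile`, `out_of_place`, `detect`), the profile size, the toy training sentences, and the choice to decode the input with each candidate encoding before scoring are all assumptions made for illustration. Real language models would be trained on far larger corpora.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Rank the character n-grams (length 1..max_n) of a text by frequency."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )

def detect(raw_bytes, models):
    """Return the (language, encoding) pair whose model is closest to the input.

    models maps hypothetical (language, encoding) keys to ranked profiles.
    """
    best = None
    for (lang, enc), model_profile in models.items():
        try:
            text = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = out_of_place(ngram_profile(text), model_profile)
        if best is None or score < best[0]:
            best = (score, (lang, enc))
    return best[1] if best else None

# Toy training data (assumed; a real model needs a large corpus).
CS_SAMPLE = "příliš žluťoučký kůň úpěl ďábelské ódy"
EN_SAMPLE = "the quick brown fox jumps over the lazy dog"

MODELS = {
    ("cs", "iso-8859-2"): ngram_profile(CS_SAMPLE),
    ("cs", "cp1250"): ngram_profile(CS_SAMPLE),
    ("en", "iso-8859-1"): ngram_profile(EN_SAMPLE),
}

if __name__ == "__main__":
    print(detect(CS_SAMPLE.encode("cp1250"), MODELS))
    print(detect(CS_SAMPLE.encode("iso-8859-2"), MODELS))
```

In this sketch the Czech sample is attributed to the right encoding only because it contains š, ž and ť, which are among the few Czech letters that ISO-8859-2 and Windows-1250 place at different byte values. A document without such letters yields identical scores for both encodings, which suggests one reason why distinguishing the two can be fragile.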