Open Access
Automatic Encoding and Language Detection in the GSDL
Author(s) - Otakar Pinkas
Publication year - 2014
Publication title - Journal of Systems Integration
Language(s) - English
Resource type - Journals
ISSN - 1804-2724
DOI - 10.20470/jsi.v5i4.211
Subject(s) - encoding (memory), computer science, natural language processing, programming language, artificial intelligence
Automatic detection of text encoding and language is part of the Greenstone Digital Library Software (GSDL), a package for building and distributing digital collections. It is developed by the University of Waikato (New Zealand) in cooperation with UNESCO. Automatic encoding and language detection for Slavic languages is difficult and sometimes fails. The aim of this article is to identify the cases in which it fails. Automatic detection in the GSDL is based on the n-gram method. The most frequent n-grams for Czech are presented, and the whole process of automatic detection in the GSDL is described. The input documents for the test collections are plain texts encoded in ISO-8859-1, ISO-8859-2 and Windows-1250. We evaluated the quality of automatic detection manually. The causes of errors include the predominance of an improper language model and an incorrect switch to Windows-1250. We carried out further tests on more complex documents; a separate article is devoted to them.
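
To illustrate the n-gram method in general terms, the following is a minimal Python sketch of rank-based n-gram profile matching. It is not the GSDL implementation: the function names (ngram_profile, out_of_place, detect), the profile size, the maximum n-gram length and the "out-of-place" distance are assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the `top` most frequent character n-grams (lengths 1..n_max),
    ordered from most to least frequent. Words are padded with '_' so that
    word-initial and word-final n-grams are distinguishable."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences between a document profile and a model profile.
    An n-gram missing from the model costs a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect(text, models):
    """Return the label of the model whose profile is closest to the text."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda label: out_of_place(doc_profile, models[label]))


# Hypothetical toy models; real models are trained on large corpora.
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "cs": ngram_profile("příliš žluťoučký kůň úpěl ďábelské ódy"),
}
print(detect("příliš žluťoučký kůň úpěl ďábelské ódy", models))
```

In a scheme of this kind, the model labels could pair a language with an encoding, and a text would then be assigned to whichever pair lies closest. A poorly matched model that happens to score best is one way the "improper language model predominance" mentioned in the abstract could arise, although the article itself should be consulted for the actual causes.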
