Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text
Author(s) - William J. Teahan, Khaled M. Alhawiti
Publication year - 2015
Publication title - International Journal of Computer Science and Information Technology (Chennai. Print)
Language(s) - English
Resource type - Journals
eISSN - 0975-4660
pISSN - 0975-3826
DOI - 10.5121/ijcsit.2015.7204
Subject(s) - computer science, preprocessor, natural language processing, natural (archaeology), natural language, artificial intelligence, archaeology, history
In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially adjust the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded by the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbols streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
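
The bigraph substitution idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple scheme in which the most frequent two-byte sequences of the UTF-8 input are each replaced by one new symbol numbered from 256 upward, which expands the alphabet before a compressor such as PPM would be applied. The function names `build_bigraph_table`, `substitute` and `restore`, and the parameter `max_entries`, are hypothetical, and the PPM stage itself is not shown.

```python
from collections import Counter


def build_bigraph_table(data: bytes, max_entries: int = 100) -> dict:
    """Map the most frequent two-byte sequences to new symbol ids >= 256.

    Assumption: frequency is counted over all overlapping byte pairs.
    """
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    most_common = counts.most_common(max_entries)
    return {pair: 256 + k for k, (pair, _) in enumerate(most_common)}


def substitute(data: bytes, table: dict) -> list:
    """Greedily replace known byte pairs, scanning left to right."""
    out = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])  # one symbol stands for two bytes
            i += 2
        else:
            out.append(data[i])      # plain byte value, 0..255
            i += 1
    return out


def restore(symbols: list, table: dict) -> bytes:
    """Invert substitute(): expand each new symbol back to its byte pair."""
    inverse = {symbol: pair for pair, symbol in table.items()}
    out = bytearray()
    for s in symbols:
        if s >= 256:
            out += inverse[s]
        else:
            out.append(s)
    return bytes(out)
```

In this sketch the substitution table would have to be stored or transmitted alongside the symbol stream so that the decoder can call `restore`; how the paper handles that detail is not stated in the abstract.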
