Open Access

Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text

Author(s) - William J. Teahan, Khaled M. Alhawiti

Publication year - 2015

Publication title - International Journal of Computer Science and Information Technology

Language(s) - English

Resource type - Journals

eISSN - 0975-4660

pISSN - 0975-3826

DOI - 10.5121/ijcsit.2015.7204

Subject(s) - computer science, preprocessor, natural language processing, natural (archaeology), natural language, artificial intelligence, archaeology, history

In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially adjust the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded by the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbols streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
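
The abstract does not spell out how the bigraph substitution works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the k most frequent two-byte sequences in the UTF-8 input are each replaced by a new symbol numbered 256 or above, which expands the alphabet before a compressor such as PPM sees the text. The function names (frequent_bigraphs, substitute, restore), the greedy left-to-right matching, and the choice of k are all hypothetical.

```python
from collections import Counter


def frequent_bigraphs(data: bytes, k: int) -> list[bytes]:
    """Return the k most frequent overlapping two-byte sequences in data."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return [pair for pair, _ in counts.most_common(k)]


def substitute(data: bytes, bigraphs: list[bytes]) -> list[int]:
    """Replace listed bigraphs with symbols >= 256 (assumed expanded alphabet)."""
    code = {pair: 256 + i for i, pair in enumerate(bigraphs)}
    out = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if len(pair) == 2 and pair in code:
            out.append(code[pair])  # one symbol stands for two bytes
            i += 2
        else:
            out.append(data[i])     # ordinary byte, kept as-is
            i += 1
    return out


def restore(symbols: list[int], bigraphs: list[bytes]) -> bytes:
    """Invert substitute(): expand every symbol >= 256 back to its two bytes."""
    out = bytearray()
    for s in symbols:
        if s >= 256:
            out += bigraphs[s - 256]
        else:
            out.append(s)
    return bytes(out)


# Hypothetical usage on a short Russian sample.
text = "привет привет мир".encode("utf-8")
table = frequent_bigraphs(text, 4)
symbols = substitute(text, table)
```

In this sketch the symbol sequence would then be passed to the compressor in place of the raw bytes, and the bigraph table would have to be available to the decoder so that restore() can recover the original text exactly. Scripts such as Cyrillic or Arabic, where most characters occupy two bytes in UTF-8, are the case where a substitution of this kind shortens the symbol sequence the most.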
