Better Adaptive Text Compression Scheme
Author(s) -
Duha Amir Sultan
Publication year - 2018
Publication title -
mağallaẗ al-tarbiyaẗ wa-al-ʻilm
Language(s) - English
Resource type - Journals
eISSN - 2664-2530
pISSN - 1812-125X
DOI - 10.33899/edusj.2018.147575
Subject(s) - compression (physics) , compression ratio , computer science , scheme (mathematics) , data compression , process (computing) , algorithm , data compression ratio , mathematics , image compression , artificial intelligence , physics , mathematical analysis , image processing , image (mathematics) , thermodynamics , internal combustion engine , operating system
A data compression scheme suggested by Ziv and Lempel, LZ77, is applied to text compression. A slightly modified version suggested by Storer and Szymanski ,LZSS, is found to achieve compression ratios as good as most existing schemes for a wide range of texts. In these two methods and all other dynamic methods, a text file is searched from left to right to find the longest match between the lookahead buffer (the previously encoded text) and the characters to be encoded. The method suggested in this work depends the searching in two directions, from left to right and from right to left, although this process takes more time, better compression results were obtained. Introduction Data compression is the business of reducing the amount of space needed to store files on computers or of reducing the amount of time taken to transmit the information over a channel of given bandwidth [7]. Better Adaptive Text Compression Scheme Journal of Education and Science – Pure Sciences 49 Data stored on a computer falls into two groups: First: digital representation of data that is continuous in nature such as images, sounds, video sequences,...., bec use the form stored is lre dy the quantized version of the original, it is appropriate for further approximation to be permitted and lossy compression techniques can be used to obtain extremely compact representation. Second: data such as text, archival images of historical documents,.... Where the origin l source of d t must be capable of being reconstructed exactly, so lossless compression methods can be used to compact and save these kinds of data [17]. In this work, we concentrate on lossless compression, precisely on adaptive methods (explained in section 3), specially LZSS, which is an improvement of LZ77, a table of previous works (from 1977 to 2005) also presented in this section, a development is applied to LZSS generating a new scheme (named as LZD) with better compression results, the idea of LZD is explained in details in section 4. Finally, and in section 5, some experimental results on some files are showed followed by a simple comparison between LZSS and LZD methods. Types of Compression Methods Compression methods can be classified into many groups (Fig. 2.1), as they have been designed for a wide variety of types of information such as text, images, and sound. These usually call for quite different approaches to the problem because of the different types of information they contain. In general, compression methods are divided into two groups: Lossy and Lossless: Lossy, or irreversible, compression is used for digitized analogue signals such as speech and pictures. Lossless, reversible or noiseless, compression (where the original can be recovered exactly from it's compressed version) is particularly important for text, since in this situation errors are not exactly accepted [2]. Lossless compression methods, on the other hand, can be assorted to Online and Offline. Online methods accomplish the compression in one pass. Offline methods process the entire input string several times before the final encoding strategy is determined [25]. Another categorization can be made to static, semiadaptive, and adaptive compression schemes. In Static schemes the rules used to encode the string are kept fixed during the process. Semiadaptive schemes differ from the previous in using a different model for each encoded text. Better Adaptive Text Compression Scheme Journal of Education and Science – Pure Sciences 52 Adaptive (or dynamic) methods changes the rules according to the characteristics of the text, this is normally implemented as an online learning process [2]. Finally, an other assortment can be made, according to the entropy theory, (where the entropy is a measure of the information content of the text to give a limit of the best possible compression) to Entropy and Nonentropy methods. Entropy methods are used when the objective is to maximize compression. In Non-entropy methods speed of operation, economy of memory usage are desirable features beside the aim of compression [22]. More details can be found in [2], [25], [17], [7], [10], and [22] Compression methods groups Loss of compression encoding entropy Data passes rules theory Lossy lossless online offline static semiadaptive adaptive entropy non-entropy Fig. 2.1 Types of Compression Methods Adaptive Methods Almost all practical adaptive encoders are encompressed by a family of algorithms derived from the work of Ziv and Lempel. The essence is that the phrases are replaced with a pointer to where they have occurred earlier in the text. This family of schemes is called Ziv_lempel compression, abbreviated as LZ compression [6]. This method adapt quickly to a new topic, but it is also able to code short function words because they appear so frequently. Decoding a text that has been compressed in this manner is straight forward; the decoder simply replaces a pointer with the already decoded text that it points to. In practice, LZ coding achieves good compression, and an important feature is that decoding can be very fast. One form of a pointer is a pair (m,l) that represents the phrase of l characters starting at position m of the input string. The pointer is constructed from the earlier text of a predefined window. The window may be unrestricted (growing window) or it may restricted to a fixed_size window of the previous N characters, where N is typically several thousands [6]. The growing window offers better compression by making more substrings available. As the window becomes larger, however, the encoding may slow down because of the time taken to search for Better Adaptive Text Compression Scheme Journal of Education and Science – Pure Sciences 51 matching substrings; compression may get worse because pointer must be larger; and if memory runs out the window may have to be discarded; giving poor compression until it grows again. A fixed_size window avoids all these problems, but it has fewer substrings available as targets of pointers. Within the window chosen, limiting the set of substrings that may be the target of pointers makes the pointers smaller and encoding faster. The table below labels the previous works and the most significant variations of LZ compression, and summarizes the main distinguishing features among them:Table 2.1 Principal LZ Variations LZ77 [27] (1977) pointers and characters alternate, pointers indicates a substring in the previous N characters. LZR [18] (1981) pointers and characters alternate, pointers indicates a substring anywhere in the previous characters. LZSS [2] (1986) pointers and characters are distinguished by a flag bit, pointers indicate a substring in the previous N characters. LZH [3] (1987) same as LZSS, except Huffman coding is used for pointers on a second pass. LZ78 [28] (1978) pointers and characters alternate, pointers indicate a previously phrased substrings. LZW [24] (1984) the output contains pointers only, pointers indicates a previously phrased substring, pointers are of fixed size. LZC [20] (1985) the output contains pointers only, pointers indicate a previously phrased substring. LZT [21] (1987) same as LZC, but with phrases in a LRU list. LZMW [16] (1984) same as LZT, but phrases are built by concatenating the previous two phrases. LZJ [11] (1985) the output contains pointers only, pointers indicates a substring anywhere in the previous characters. LZFG [9] (1989) pointers select a node in a trie, strings in a trie are from a sliding window. LZRW [5] (1991) refers to variants of the LZ77 with an emphasis on improving compression speed through the use of hash table. LZX [5] (1998) it was publicly released as an Amiga file archiver. LZMA [5] (1998) uses a dictionary scheme similar to LZ77 with a variable size up to 4GB. LZWL [5] (2005) work with syllables Better Adaptive Text Compression Scheme Journal of Education and Science – Pure Sciences 52 3.1LZ77 Scheme:LZ77 was the first form of LZ compression to be published [27]. In this scheme, pointers denote phrases in a fixed-size window that precedes the coding position. There is a maximum length for substrings that may be replaced by a pointer , given by the parameter F (typically 10-20). These restrictions allow LZ77 to be implemented using a "sliding window" of N characters. Of these, the first N-F have already been encoded and the last F constitute a lookahead buffer. To encode a character, the first N-F characters of the window are reached to find the longest match with the lookahead buffer. The match may overlap with the buffer but obviously can not be the buffer itself. The longest match is then coded into the triple (i,j,a), where i is the offset of the longest match from the lookahead buffer, j is the length of the match, and a is the first character that did not match the substring in the window. The window is then shifted right j+1 characters, ready for another coding step. Attaching the explicit character to each pointer ensures that coding can proceed even if no match is found for the first character of the lookahead buffer [13]. From this notation, the string "a b b a a b b b a b a b" After completing the coding process the output would be: (0,0,a)(0,0,b)(2,1,a)(1,1,a)(1,3,b)(3,2,b)(8,3,null)
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom