Delta Encoding of Related Web Pages
Author(s) -
Zan Ouyang,
Nasir D. Memon,
Torsten Suel
Publication year - 2001
Language(s) - English
DOI - 10.1109/dcc.2001.10003
We examine the compression and transmission of related web pages using delta compression. Delta compression algorithms exploit the fact that the sender and receiver may both possess a reference file that is very similar to the file that has to be transmitted. Hence, transmitting only the difference between the two files will require a significantly smaller number of bits than transmitting the entire file. The potential applications that can benefit from our work include (1) storage of web pages in compressed form in main memory to reduce disk paging, (2) compression of a collection of web pages to facilitate faster downloads for off-line browsing, (3) compression of web pages to give a client more off-line content with the same storage capacity, and (4) archival of web sites. We develop a simple delta compression algorithm, wdelta, that allows for string matching to be done both within the target data and the reference data. wdelta uses previously seen text as a dictionary. For two files that are similar in structure, that is, where similar content is located in similar positions, we use a two-sliding window scheme that can efficiently match common strings and encode copy positions compactly. When the second file adds some bulk data or deletes bulk data from the reference file, we use a last-copy scheme that places a pointer in the position where the last match ended in the reference file, and encode the new match position corresponding to this pointer, thus efficiently encoding the copy positions. We present simulation results that show that wdelta outperforms a previously published delta compression algorithm named vdelta when applied to web pages that change in time. When applying a delta compression technique to a collection of related web pages, the goal is to efficiently represent the entire collection by means of appropriate deltas between the pages. The problem of finding an optimal delta encoding for a collection of web pages using pairwise deltas between pages can be formulated as the problem of obtaining a maximum weight branching in an appropriately defined weighted directed graph. It is known that a maximum weight branching can be found in O(jV j) time for a dense graph. For applications where this complexity is prohibitive we design appropriate heuristics. Our experiments indicate that the proposed techniques can give significant savings for collections of web pages that share sufficient common structure and/or content, such as web pages with the same URL but different versions in time, and web pages with similar structure located on the same server.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom