
EFFICIENT PREPROCESSING FOR WEB LOG COMPRESSION
Author(s) - Sebastian Deorowicz, Szymon Grabowski
Publication year - 2014
Publication title - Computing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.184
H-Index - 11
eISSN - 2312-5381
pISSN - 1727-6209
DOI - 10.47839/ijc.7.1.487
Subject(s) - web log analysis software, computer science, lossless compression, byte, data compression, transaction log, preprocessor, compression ratio, timestamp, upload, prefix, database, web server, data mining, operating system, the internet, algorithm, real time computing, artificial intelligence, web api, database transaction, automotive engineering, engineering, internal combustion engine, linguistics, philosophy
Web log files, which store user activity on a server, may grow by hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as this enables further analysis, e.g., detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on individual fields of log data (each storing information such as the client’s IP, date/time, requested file or query, download size in bytes, etc.) and uses compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show that the proposed transform improves the average compression ratios 2.70 times in the case of gzip and 1.86 times in the case of bzip2.
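To make the field-wise idea concrete, the following is a minimal Python sketch of two of the techniques named above: common-prefix extraction against the previous value in a field, and move-to-front coding of repeated values. It is an illustration under stated assumptions, not the authors' preprocessor: the function names, the naive space-based field splitting, the sample log lines, and the output encodings are all hypothetical choices made for this example.

```python
import os

def split_fields(lines):
    """Transpose log lines into per-field columns.

    Assumption: fields are split naively on single spaces and every
    line yields the same number of fields; a real Apache log parser
    would have to handle quoted strings and missing fields.
    """
    rows = [line.split(" ") for line in lines]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("lines have differing field counts")
    return [list(col) for col in zip(*rows)]

def prefix_encode(values):
    """Encode each value as (length of prefix shared with the previous value, rest)."""
    out, prev = [], ""
    for v in values:
        n = len(os.path.commonprefix([prev, v]))
        out.append((n, v[n:]))
        prev = v
    return out

def prefix_decode(pairs):
    """Invert prefix_encode."""
    out, prev = [], ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

def mtf_encode(values):
    """Move-to-front: emit the position of a value in a recency list,
    or (-1, value) the first time the value is seen."""
    recent, out = [], []
    for v in values:
        if v in recent:
            i = recent.index(v)
            out.append(i)
            recent.pop(i)
        else:
            out.append((-1, v))
        recent.insert(0, v)  # most recently used value goes to the front
    return out

def mtf_decode(codes):
    """Invert mtf_encode."""
    recent, out = [], []
    for c in codes:
        v = c[1] if isinstance(c, tuple) else recent.pop(c)
        recent.insert(0, v)
        out.append(v)
    return out

# Hypothetical sample lines in Apache Common Log Format.
sample = [
    '10.0.0.1 - - [01/Jan/2007:10:00:01 +0100] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [01/Jan/2007:10:00:02 +0100] "GET /img/logo.png HTTP/1.1" 200 512',
    '10.0.0.7 - - [01/Jan/2007:10:00:02 +0100] "GET /index.html HTTP/1.1" 200 1043',
]
columns = split_fields(sample)
ip_codes = prefix_encode(columns[0])   # client IP field
path_codes = mtf_encode(columns[6])    # requested path field
```

In this sketch, consecutive IP addresses and timestamps collapse to short suffixes, and recently requested paths become small integers. The transformed columns would then be passed to a general-purpose compressor such as gzip or bzip2; how the actual preprocessor serializes its output is not described in the abstract.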