Open Access
Document Analysis Systems: Theory and Practice
Author(s) -
Seong-Whan Lee,
Yasuaki Nakano
Publication year - 1999
Publication title -
lecture notes in computer science
Language(s) - English
Resource type - Book series
SCImago Journal Rank - 0.249
H-Index - 400
eISSN - 1611-3349
pISSN - 0302-9743
DOI - 10.1007/3-540-48172-9
Subject(s) - computer science, series (stratigraphy), software engineering, data science, paleontology, biology
Numerous approaches, including textual, structural, and featural, for detecting duplicate documents have been investigated. Since document images are usually stored and transmitted in compressed form, it is advantageous to perform document matching directly on the compressed data. A two-stage process for matching Group 4 compressed document images is presented. In the coarse matching stage, ranked hypotheses are generated based on compression bit profile correlations. These candidates are further evaluated using a feature set similar to the pass codes. Multiple descriptors based on the local arrangement of the feature points are constructed for efficient indexing into the database. Performance of the algorithm on the UW database is discussed.

1 Introduction

As electronic document images become prevalent, efficient retrieval methods become increasingly important. A common solution is to perform OCR followed by a text-based search. Recently, alternatives to the text-based approach have been developed by extracting features directly from images, with the goal of achieving efficiency and robustness over OCR. An example of such a feature is word length. Using sequences of word lengths in documents as indexes, Hull identifies similar documents by comparing the number of hits in each image generated by the query [4]. Spitz maps alphabetic characters to a small set of character shape codes (CSCs) which can be used to compile search keys for ASCII text retrieval [10]. CSCs can also be obtained from text images based on the relative positions of connected components to baselines and x-height lines, as used by Spitz for word spotting in document images [9]. Doermann et al. extend the application of CSCs to document duplicate detection by constructing multiple indexes from short sequences of CSCs extracted from the first line of text of sufficient length [2]. All of these methods are inherently text-line based: line, word, or even character segmentation needs to be performed. The duplicate detection mechanism in DocBrowse is based on horizontal projection profiles [1]. The distance between wavelet coefficient vectors of the profiles represents document similarity. It is noted that this method outperforms the text-based approach on degraded documents and documents with small amounts of text.

Since the majority of document images in databases are stored in compressed formats, it is advantageous to perform document matching on compressed files. Not only does this eliminate the need for decompression and recompression, but the reduced memory requirement also makes commercialization more feasible. Matching compressed files, of course, presents additional challenges. For CCITT Group 4 compressed files, pass codes have been shown to contain critical information for identifying similar documents. In Hull's work, pass codes extracted from a small text region are used with the Hausdorff distance metric to correctly identify 92.5% of duplicate documents [3]. However, calculation of the Hausdorff distance is computationally intensive, and the number of distance calculations scales linearly with the size of the database. The computational cost can be reduced by measuring global similarities of pass code distributions.
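The Hausdorff comparison referenced above reduces to a max-min computation over two point sets. The sketch below is only a rough illustration, not the cited implementation: it computes the symmetric Hausdorff distance between two hypothetical arrays of pass-code coordinates, and the point sets and names are made up for the example.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Directed Hausdorff distance h(A, B): the largest distance from a point
    in A to its nearest neighbour in B. `a` and `b` are (N, 2) and (M, 2)
    arrays of point coordinates, e.g. pass-code locations in a text region."""
    diff = a[:, None, :] - b[None, :, :]       # (N, M, 2) pairwise differences
    dists = np.sqrt((diff ** 2).sum(axis=-1))  # (N, M) Euclidean distances
    return dists.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical pass-code point sets from a query and a candidate document.
query_points = np.array([[10, 12], [48, 13], [92, 40], [130, 41]], dtype=float)
candidate_points = np.array([[11, 12], [50, 14], [91, 39], [128, 42]], dtype=float)
print(hausdorff(query_points, candidate_points))
```

The brute-force pairwise distance matrix grows with the product of the two set sizes, and the comparison must be repeated for every database entry, which is the cost the coarser global measures are meant to avoid.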
It has been shown that the number of pass codes inside the cells of a fixed grid can effectively retrieve visually similar documents and can be used as a preprocessing step for the Hausdorff measure [5]. In this paper, we present a two-stage algorithm for duplicate detection of Group 4 (G4) compressed documents. The first stage performs coarse matching based on document profile correlation. Global statistics such as line spacing and text height are used to confine the search space. If multiple candidates are generated, a set of endpoint features is extracted from the query document for detailed matching. Similar to pass codes, the endpoint features contain sufficient information for various levels of processing, including page skew and orientation estimation. In addition, endpoint features are stable, symmetric, and easily computable from Group 4 compressed files.

The rest of the paper is organized as follows. Details of the coarse matching process, including profile extraction, global statistics calculation, and feature robustness, are discussed in Section 2. Section 3 describes the detailed matching procedure, which includes endpoint feature extraction and generation of local descriptors. Section 4 discusses experimental results and suggests further improvements, followed by conclusions in Section 5.
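The coarse matching stage outlined above ranks documents by correlating profiles derived from the compressed files; the exact profile definition and correlation scheme are deferred to Section 2. The sketch below is therefore only a plausible reading, assuming each profile is a precomputed 1-D array of per-scanline compressed bit counts and that similarity is the best normalized correlation over a small range of vertical shifts. The function and variable names are hypothetical.

```python
import numpy as np

def profile_similarity(p, q, max_shift=16):
    """Best normalized correlation between two bit-count profiles, searched
    over a small range of vertical shifts to tolerate misalignment."""
    p = (p - p.mean()) / (p.std() + 1e-9)
    q = (q - q.mean()) / (q.std() + 1e-9)
    n = min(len(p), len(q))
    best = -1.0
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            a, b = p[s:n], q[:n - s]
        else:
            a, b = p[:n + s], q[-s:n]
        if len(a) > 0:
            best = max(best, float(np.dot(a, b)) / len(a))
    return best

def coarse_match(query_profile, database_profiles, top_k=5):
    """Rank database documents by profile correlation with the query; the
    top-k candidates would then pass to the detailed matching stage."""
    scores = [(doc_id, profile_similarity(query_profile, prof))
              for doc_id, prof in database_profiles.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]
```

In the paper's pipeline, a filter on global statistics such as line spacing and text height would confine the candidate set before any correlation is computed; that step is omitted here for brevity.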
