Structure recognition and information extraction from tabular documents
Author(s) - Chandran Surekha, Balasubramanian Sanjay, Gandhi Tarak, Prasad Arathi, Kasturi Rangachar, Chhabra Atul
Publication year - 1996
Publication title - International Journal of Imaging Systems and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.359
H-Index - 47
eISSN - 1098-1098
pISSN - 0899-9457
DOI - 10.1002/(sici)1098-1098(199624)7:4<289::aid-ima4>3.0.co;2-4
Subject(s) - table (database) , computer science , interpretation (philosophy) , information retrieval , information extraction , block (permutation group theory) , focus (optics) , document processing , character (mathematics) , image (mathematics) , artificial intelligence , horizontal and vertical , data mining , pattern recognition (psychology) , natural language processing , mathematics , programming language , physics , geometry , optics
We present a system for extracting the structural information of a table from its image. Following initial binarization and deskewing, the image is scanned to extract any horizontal and vertical lines that may be present. The table's dimensions are estimated from these lines. Unlike other systems, the procedure described here does not depend solely on the existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure for extracting the structural information of a tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after the basic structure of the drawing is extracted, the additional information is extracted and the cell block locations are obtained in order to develop a database representing the tabular document. The telephone company drawings are very large, resulting in images as large as 15,000 × 10,000 pixels. Thus, the efficiency and speed of the algorithms are important design criteria in this research. © 1996 John Wiley & Sons, Inc.
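The white-stream idea mentioned in the abstract can be illustrated with a projection profile: a run of consecutive columns (or rows) that contain no ink pixels may stand in for a missing demarcation line. The sketch below is only an illustration under stated assumptions and is not the authors' algorithm; the function name `white_streams`, the `min_width` threshold, and the convention that 1 means ink and 0 means background are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def white_streams(
    image: List[List[int]], axis: str = "vertical", min_width: int = 2
) -> List[Tuple[int, int]]:
    """Find runs of ink-free columns ("vertical") or rows ("horizontal").

    Assumptions (not from the source): `image` is an already binarized and
    deskewed raster given as rows of 0/1 values, where 1 = ink and
    0 = background. Returns inclusive (start, end) index ranges of runs
    that are at least `min_width` wide.
    """
    if not image:
        return []
    if axis == "vertical":
        # Ink count per column: transpose the rows, then sum each column.
        profile = [sum(col) for col in zip(*image)]
    elif axis == "horizontal":
        # Ink count per row.
        profile = [sum(row) for row in image]
    else:
        raise ValueError("axis must be 'vertical' or 'horizontal'")

    streams: List[Tuple[int, int]] = []
    start = None  # index where the current ink-free run began, if any
    for i, ink in enumerate(profile):
        if ink == 0:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at i - 1; keep it only if it is wide enough.
            if i - start >= min_width:
                streams.append((start, i - 1))
            start = None
    # Close a run that reaches the image border.
    if start is not None and len(profile) - start >= min_width:
        streams.append((start, len(profile) - 1))
    return streams


# Toy 3 x 8 raster: two ink regions separated by a three-column gap.
rows = ["11000110", "10000110", "11000100"]
img = [[int(c) for c in r] for r in rows]
print(white_streams(img, axis="vertical", min_width=2))
```

Note that this simple version also reports page margins as streams and scans the whole image; a practical system would presumably restrict the search to the estimated table bounds and tolerate a small amount of noise in each run.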