An alternative, layout‐driven approach to the clustering of documents
Author(s) - Loia Vincenzo, Senatore Sabrina
Publication year - 2008
Publication title - International Journal of Intelligent Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.291
H-Index - 87
eISSN - 1098-111X
pISSN - 0884-8173
DOI - 10.1002/int.20289
Subject(s) - computer science, parsing, document clustering, document management system, information retrieval, document layout analysis, document structure description, focus (optics), page layout, representation (politics), electronic document, document type definition, cluster analysis, well formed document, document processing, the internet, world wide web, artificial intelligence, xml, physics, politics, advertising, law, optics, business, image (mathematics), political science
The Internet has become a huge repository of information and knowledge, built on the sharing of electronic documents. Recent trends in knowledge management focus on knowledge representation based on document content. Indeed, most established approaches achieve document understanding by analyzing the “portions of information” in the document that describe its content, using text parsing and extraction techniques. This paper presents an alternative approach that departs from these established document management techniques and focuses on the logical structure of a PDF document as a discriminating source of document knowledge. The main idea rests on the observation that when a reader looks at a paper, the first perception is of the document's layout. Analyzing the layout, typesetting, pagination, and graphical arrangement of a document provides useful information for understanding its content; in general, documents in the same category share similar page layouts, fonts, and arrangements of figures. In this sense, this work presents an alternative way to approach document recognition and understanding, through the analysis of the layout of electronic PDF documents and their classification. © 2008 Wiley Periodicals, Inc.
