Open Access
Transforming paper documents into XML format with WISDOM++
Author(s) - O. Altamura, Floriana Esposito, Donato Malerba
Publication year - 2001
Publication title - International Journal on Document Analysis and Recognition (IJDAR)
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.331
H-Index - 50
eISSN - 1433-2833
pISSN - 1433-2825
DOI - 10.1007/pl00013569
Subject(s) - computer science , well formed document , information retrieval , xml , document structure description , document type definition , document processing , preprocessor , process (computing) , transformation (genetics) , simple api for xml , html element , document layout analysis , the internet , xml framework , benchmarking , xml validation , world wide web , web page , artificial intelligence , image (mathematics) , xml signature , programming language , biochemistry , chemistry , marketing , gene , business
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and text transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
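
The final step named in the abstract, using layout information to drive conversion to XML, can be illustrated with a minimal Python sketch. It is not the WISDOM++ implementation: the `Block` type, the `blocks_to_xml` function, the element names, and the top-to-bottom reading-order heuristic are all assumptions made for illustration. The sketch presumes that earlier steps have already produced a document class, a logical label per layout block, and OCR text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical layout block; not a WISDOM++ data structure."""
    label: str                        # logical label, e.g. "title" (assumed output of document understanding)
    text: str                         # text recognized by OCR for this block
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in page coordinates


def blocks_to_xml(doc_class: str, blocks: List[Block]) -> str:
    """Serialize labeled blocks as XML, one element per block."""
    root = ET.Element("document", {"class": doc_class})
    # Assumed reading order: top to bottom, then left to right.
    for block in sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0])):
        element = ET.SubElement(root, block.label)
        element.text = block.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = [
        Block("abstract", "The transformation of scanned paper documents ...", (40, 120, 560, 300)),
        Block("title", "Transforming paper documents into XML format with WISDOM++", (40, 30, 560, 70)),
    ]
    print(blocks_to_xml("scientific_paper", page))
```

Because each element is named after the logical label of its block, a query such as "find all titles" becomes a simple element lookup, which is the retrieval benefit the abstract attributes to XML over plain HTML.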
