Extracting the Tables of Contents from Images of Documents
Author(s) - Claudie Faure
Publication year - 2000
Language(s) - English
DOI - 10.5555/2856151.2856157
The conversion of paper documents into structured electronic data is necessary to make them available to the users of information systems, to enable a content-based information search adapted to the specific needs of each user, and to adopt the most appropriate layout for reading the retrieved documents on a computer screen. PixED is specialised in converting scientific papers taken from French conference proceedings. It simulates the task that a human reader performs to detect the organisation of a document by scanning it without a deep reading of the text. A generic descriptive model that could be used to associate logical labels with physical components does not exist, because the layout and the typographic style are document-dependent. The lack of a descriptive model is compensated for by accessing the symbolic content of the most visible lines. The extraction of the table of contents is described to show how and why the conversion process combines several kinds of information: spatial relationships, typography, symbolic contents, document style and domain knowledge. The results are given for 17 processed documents (110 pages).
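To make the idea of combining typography with symbolic content concrete, here is a minimal sketch in Python. It is not PixED's method, which the abstract does not detail. It assumes each text line has already been recognised and measured, and it treats a line as a heading only when it both starts with a section number and is taller than the median line. The names `Line`, `NUMBERING`, `heading_level` and `extract_toc` are hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str      # recognised text of the line (symbolic content)
    height: float  # character height in pixels (typographic cue)


# Assumed numbering style: "1.", "2.1", "3.2.1" followed by a title.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def heading_level(line, body_height):
    """Return the heading depth (1 = top level), or None for body text."""
    match = NUMBERING.match(line.text)
    if match is None:
        return None
    # Typographic cue: a heading is assumed to be taller than body text,
    # which filters out body lines that merely start with a number.
    if line.height <= body_height:
        return None
    return match.group(1).count(".") + 1


def extract_toc(lines):
    """Build a list of (level, text) pairs from the heading-like lines."""
    if not lines:
        return []
    heights = sorted(line.height for line in lines)
    body_height = heights[len(heights) // 2]  # median taken as body size
    toc = []
    for line in lines:
        level = heading_level(line, body_height)
        if level is not None:
            toc.append((level, line.text))
    return toc
```

Neither cue is reliable on its own in this sketch: a body sentence can begin with a digit, and a tall line can be a title or a caption rather than a section heading. Requiring both to agree is one simple way to reflect the abstract's point that several kinds of information have to be combined.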