Extracting the Tables of Contents from Images of Documents



Open Access


Author(s) - Claudie Faure

Publication year - 2000

Language(s) - English

DOI - 10.5555/2856151.2856157

The conversion of paper documents into structured electronic data is necessary to make them available to users of information systems, to enable content-based information search adapted to the specific needs of each user, and to adopt the most appropriate layout for reading the retrieved documents on a computer screen. PixED specialises in converting scientific papers taken from French conference proceedings. It simulates the task that a human reader performs to detect the organisation of a document by scanning it without a deep reading of the text. A generic descriptive model that could be used to associate logical labels with physical components does not exist, because the layout and the typographic style are document-dependent. The lack of a descriptive model is compensated for by accessing the symbolic content of the most visible lines. The extraction of the table of contents is described to show how and why the conversion process combines several kinds of information: spatial relationships, typography, symbolic content, document style and domain knowledge. Results are given for 17 processed documents (110 pages).
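
The abstract does not describe how PixED combines these cues, so the following is only a minimal illustrative sketch, not the author's method. It assumes each text line has already been recognised and carries a few hypothetical attributes (`font_size`, `bold`, `gap_above`), and it uses an arbitrary additive score to mix typography, spatial layout and symbolic content (section numbering) when picking heading candidates for a table of contents.

```python
import re
from dataclasses import dataclass


# Hypothetical line record; the fields are assumptions, not PixED's data model.
@dataclass
class Line:
    text: str          # recognised text of the line (symbolic content)
    font_size: float   # estimated character height in points (typography)
    bold: bool         # whether the line is set in bold (typography)
    gap_above: float   # blank space above the line, in line heights (spatial)


# Section numbering such as "1.", "2.3" or "2.3.1" at the start of a line.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def heading_level(line, body_size):
    """Return a heading level (1 = top level) or None for body text.

    The weights and the threshold are illustrative choices only.
    """
    match = NUMBERING.match(line.text)
    score = 0
    if line.font_size > body_size:   # larger than body text
        score += 1
    if line.bold:                    # emphasised typeface
        score += 1
    if line.gap_above >= 1.0:        # visually separated from the text above
        score += 1
    if match:                        # symbolic cue: explicit numbering
        score += 2
    if score < 3:
        return None
    # Numbering depth gives the level; unnumbered headings default to level 1.
    return match.group(1).count(".") + 1 if match else 1


def extract_toc(lines, body_size):
    """Collect (level, text) pairs for every line that looks like a heading."""
    toc = []
    for line in lines:
        level = heading_level(line, body_size)
        if level is not None:
            toc.append((level, line.text))
    return toc


sample = [
    Line("1. Introduction", 12, True, 2.0),
    Line("Les documents papier sont convertis en ...", 10, False, 0.0),
    Line("2.1 Segmentation", 11, True, 1.5),
    Line("1998 was the year of the first prototype.", 10, False, 0.0),
]
toc = extract_toc(sample, body_size=10)
```

In this sketch no single cue decides the outcome: a body line that happens to start with a number is rejected because it lacks typographic and spatial support, which is the kind of combination of evidence the abstract argues for.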
