Open Access
Semantic PDF Segmentation for Legacy Documents in Technical Documentation
Author(s) - Jan Oevermann
Publication year - 2018
Publication title - Procedia Computer Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.334
H-Index - 76
ISSN - 1877-0509
DOI - 10.1016/j.procs.2018.09.006
Subject(s) - computer science, disk formatting, metadata, information retrieval, documentation, xml, technical documentation, rdf, world wide web, segmentation, component (thermodynamics), semantic web, artificial intelligence, physics, thermodynamics, programming language, operating system
PDF is the most common format for storing and delivering technical documentation. However, because the format is unstructured, these documents are often excluded from granular semantic access. While more and more companies are implementing XML-based component content management systems that can deliver annotated, structured content, older legacy documents remain in their monolithic form. We developed a new approach that segments PDF documents into semantically related sections using classification knowledge gained from structured training content. This machine-learning-based approach is independent of any formatting information or visual cues. In this paper, we combine the results of several previous works into a holistic procedure model. We introduce a parameterizable range-finding algorithm to refine segment detection and provide an RDF-based format for exchanging the generated metadata, which can then be used to improve information retrieval for users.
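
The abstract does not spell out how the range-finding step works, so the following is only a rough sketch of the general idea, not the author's algorithm. It assumes that a classifier has already assigned each unit of the PDF (for example, a page) a label and a confidence score, and it groups consecutive units into labelled ranges. The names `Segment` and `find_ranges`, the parameters `min_confidence`, `max_gap`, and `min_length`, and the example labels are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    label: str
    start: int  # index of the first unit in the range (inclusive)
    end: int    # index of the last unit in the range (inclusive)


def find_ranges(
    predictions: List[Tuple[str, float]],
    min_confidence: float = 0.5,  # assumed parameter: below this, a prediction is ignored
    max_gap: int = 1,             # assumed parameter: tolerated run of low-confidence units
    min_length: int = 2,          # assumed parameter: shorter ranges are discarded
) -> List[Segment]:
    """Group per-unit (label, confidence) predictions into contiguous ranges."""
    segments: List[Segment] = []
    current: Optional[Segment] = None
    gap = 0

    def close() -> None:
        # Keep the open range only if it is long enough, then reset state.
        nonlocal current, gap
        if current is not None and current.end - current.start + 1 >= min_length:
            segments.append(current)
        current = None
        gap = 0

    for i, (label, conf) in enumerate(predictions):
        confident = conf >= min_confidence
        if confident and current is not None and label == current.label:
            # Same label again: extend the range across any tolerated gap.
            current.end = i
            gap = 0
        elif confident:
            # A confident prediction of a different label starts a new range.
            close()
            current = Segment(label, i, i)
        elif current is not None:
            # Low-confidence unit: tolerate it up to max_gap units in a row.
            gap += 1
            if gap > max_gap:
                close()
    close()
    return segments


# Hypothetical per-page predictions: (label, confidence).
pages = [
    ("safety", 0.9), ("safety", 0.8), ("task", 0.3), ("safety", 0.7),
    ("task", 0.9), ("task", 0.95), ("concept", 0.9),
]
ranges = find_ranges(pages)
```

With the default parameters, the low-confidence page at index 2 is bridged, and the single trailing "concept" page is dropped because it is shorter than `min_length`. Tightening `max_gap` or raising `min_confidence` trades fewer false merges against more fragmented segments.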
