Open Access
sOCRates - a post-OCR text correction method
Author(s) - Danny Suarez Vargas, Lucas Lima de Oliveira, Viviane Pereira Moreira, Guilherme Torresan Bazzo, Gustavo Acauan Lorentz
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/sbbd.2021.17866
Subject(s) - computer science, optical character recognition, information retrieval, natural language processing, socrates, artificial intelligence, word (group theory), classifier (uml), character (mathematics), image (mathematics), linguistics, philosophy, geometry, mathematics
A significant portion of the textual information of interest to an organization is stored in PDF files, which must be converted into plain text before an information retrieval or text mining system can process their contents. When the PDFs are scanned documents, optical character recognition (OCR) is typically used to extract the text. OCR errors can degrade the quality of information retrieval systems, since query terms will not match incorrectly extracted terms in the documents. This work introduces sOCRates, a post-OCR text correction method that relies on contextual word embeddings and on a classifier that uses format, semantic, and syntactic features. Our experimental evaluation on a test collection in Portuguese showed that sOCRates can accurately correct errors and improve retrieval results.
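The term mismatch described in the abstract can be illustrated with a minimal sketch. This is not the sOCRates method itself. It is a toy exact-match inverted index, and the documents, the OCR error (`0` in place of `o`), and the function names `build_index` and `search` are all invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased whitespace token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (exact match only)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical data: the same sentence, once clean and once with an invented OCR error.
clean_docs = {"d1": "monthly production report"}
ocr_docs = {"d1": "monthly producti0n report"}

clean_hits = search(build_index(clean_docs), "production")
ocr_hits = search(build_index(ocr_docs), "production")

print(clean_hits)  # the clean document is found
print(ocr_hits)    # the OCR-damaged document is missed
```

Because the index matches tokens exactly, a single misrecognized character is enough to hide the document from the query, which is the kind of error a post-OCR correction step aims to repair before indexing.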