How to Extract Good Knowledge from Bad Data: An Experiment with Eighteenth Century French Texts
Author(s) - François Dominic Laramée
Publication year - 2019
Publication title - digital studies / le champ numérique
Language(s) - English, French
Resource type - Journals
SCImago Journal Rank - 0.14
H-Index - 2
ISSN - 1918-3666
DOI - 10.16995/dscn.299
Subject(s) - computer science , spelling , optical character recognition , natural language processing , heuristic , artificial intelligence , information retrieval , linguistics , lexicon
From a digital historian’s point of view, Ancien Régime French texts suffer from obsolete grammar, unreliable spelling, and poor optical character recognition, making them ill-suited to digital analysis. This paper summarizes methodological experiments that have allowed the author to extract useful quantitative data from such unlikely source material. A discussion of the general characteristics of hand-keyed and OCR’ed historical corpora shows that they differ in scale of difficulty rather than in nature. The paper then explains the traits that make text mining certain eighteenth-century corpora particularly challenging: error clustering, a high cost of acquisition relative to salience, outlier hiding, and unpredictable patterns of error repetition. It then outlines a method that circumvents these challenges. This method relies on heuristic formulation of research questions during an initial phase of open-ended data exploration; selective correction of spelling and OCR errors, through application of Levenshtein’s algorithm, focused on a small set of keywords derived from the heuristic project design; and careful exploitation of the keywords and the corrected corpus, either as raw data for algorithms, as entry points from which to construct valuable data manually, or as focal points directing the scholar’s attention to a small subset of texts to read. Each step of the method is illustrated by examples drawn from the author’s research on the hand-keyed Encyclopédie and Bibliothèque bleue and on collections of periodicals obtained through optical character recognition.

Keywords: text mining; data mining; textometry; production of space; digital history; error correction
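The selective-correction step described in the abstract can be sketched in a few lines: rather than cleaning an entire noisy corpus, each token is compared only against a short list of research keywords, and tokens within a small Levenshtein (edit) distance of a keyword are treated as noisy variants of it. This is a minimal illustration of that idea, not the author's actual pipeline; the keyword list, sample text, and distance threshold below are illustrative assumptions.

```python
# A minimal sketch of selective keyword correction via Levenshtein
# distance. The keywords, sample OCR text, and max_dist threshold are
# hypothetical, chosen only to illustrate the technique.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def match_keywords(tokens, keywords, max_dist=2):
    """Map each keyword to the noisy token variants found in the text."""
    hits = {kw: [] for kw in keywords}
    for tok in tokens:
        for kw in keywords:
            if levenshtein(tok.lower(), kw) <= max_dist:
                hits[kw].append(tok)
    return hits

# Hypothetical OCR output with typical character-level errors:
tokens = "la frontiere de la Nouvelle Fraace et ses colonles".split()
print(match_keywords(tokens, ["france", "colonies"]))
# → {'france': ['Fraace'], 'colonies': ['colonles']}
```

Matching each token against only a handful of keywords keeps the cost linear in corpus size, which is what makes this approach practical on large, noisy corpora where full-text correction would be prohibitively expensive.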
