Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach
Author(s) -
Järvelin Anni,
Keskustalo Heikki,
Sormunen Eero,
Saastamoinen Miamaria,
Kettunen Kimmo
Publication year - 2016
Publication title -
Journal of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.903
H-Index - 145
eISSN - 2330-1643
pISSN - 2330-1635
DOI - 10.1002/asi.23379
Subject(s) - computer science , newspaper , query expansion , matching (statistics) , string searching algorithm , information retrieval , word (group theory) , string (physics) , index (typography) , pattern matching , natural language processing , string metric , artificial intelligence , mathematics , world wide web , statistics , geometry , advertising , business , mathematical physics
The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich with compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. Top index word variants were categorized to estimate the appropriate query expansion ranges in the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations was measured in a Cranfield-style test. Finally, a detailed topic-level analysis of test results was conducted. In the index of the historical newspaper collection, the occurrences of a word were typically spread across many linguistic and historical variants, along with optical character recognition (OCR) errors. All query expansion methods improved the baseline results. Extensive expansion of around 30 variants for each query word was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
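
The general idea of expanding a query term with its most similar index words can be illustrated with a minimal sketch. The abstract does not name the specific approximate string matching methods used in the study, so the character-bigram Dice coefficient below is an assumption chosen for illustration, as are the function names `bigrams`, `dice`, and `expand_query` and the toy index vocabulary. The default of `k=30` mirrors the expansion size of around 30 variants reported above.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Character bigrams of a lowercased word, padded so word edges count."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram multisets (1.0 = identical)."""
    ga, gb = bigrams(a), bigrams(b)
    overlap = sum((ga & gb).values())  # multiset intersection size
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0


def expand_query(term: str, index_words: list[str], k: int = 30) -> list[str]:
    """Return the k index words most similar to the query term."""
    ranked = sorted(index_words, key=lambda w: dice(term, w), reverse=True)
    return ranked[:k]


# Hypothetical index vocabulary (not from the study): an inflected form,
# an OCR-damaged form, a historical spelling, and unrelated words.
index_words = [
    "sanomalehti", "sanomalehden", "sanomalehtiä",
    "sanomalehtl", "kirjasto", "wapaus",
]
expansion = expand_query("sanomalehti", index_words, k=3)
```

In this toy vocabulary the expansion picks up both an inflected form and an OCR-damaged form of the query word. This is the kind of coverage across different variation types that the abstract credits for the advantage over inflectional forms alone.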