z-logo
open-access-imgOpen Access
Designing efficient algorithms for querying large corpora
Author(s) -
Paul Meurer
Publication year - 2021
Publication title -
oslo studies in language
Language(s) - English
Resource type - Journals
ISSN - 1890-9639
DOI - 10.5617/osla.8504
Subject(s) - computer science , suffix , regular expression , treebank , lexicon , set (abstract data type) , string (physics) , algorithm , matching (statistics) , artificial intelligence , expression (computer science) , automaton , string searching algorithm , natural language processing , pattern matching , parsing , mathematics , programming language , philosophy , linguistics , statistics , mathematical physics
I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here