Designing efficient algorithms for querying large corpora | Zendy

Paul Meurer | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Open Access

Designing efficient algorithms for querying large corpora

Author(s) -

Paul Meurer

Publication year - 2021

Publication title -

oslo studies in language

Language(s) - English

Resource type - Journals

ISSN - 1890-9639

DOI - 10.5617/osla.8504

Subject(s) - computer science , suffix , regular expression , treebank , lexicon , set (abstract data type) , string (physics) , algorithm , matching (statistics) , artificial intelligence , expression (computer science) , automaton , string searching algorithm , natural language processing , pattern matching , parsing , mathematics , programming language , philosophy , linguistics , statistics , mathematical physics

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.

Having issues? You can contact us here

Empowering knowledge with every search

About

About Careers Publisher Partners Contact Us

Learn

FAQs Blog Terms of Use Privacy Policy

About

Learn

Discover

Explore