Premium
Approximate string matching using withinword parallelism
Author(s) -
Wright Alden H.
Publication year - 1994
Publication title -
software: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/spe.4380240402
Subject(s) - substring , string searching algorithm , approximate string matching , string (physics) , computer science , edit distance , dynamic programming , parallelism (grammar) , pattern matching , alphabet , word (group theory) , substitution (logic) , longest common subsequence problem , algorithm , integer (computer science) , parallel computing , mathematics , data structure , geometry , mathematical physics , programming language , linguistics , philosophy
Given a text string, a pattern string, and an integer k , the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod‐4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21‐fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu‐Manber algorithm are given.