z-logo
open-access-imgOpen Access
N-Gram Similarity and Distance
Author(s) -
Grzegorz Kondrak
Publication year - 2005
Publication title -
lecture notes in computer science
Language(s) - English
Resource type - Book series
SCImago Journal Rank - 0.249
H-Index - 400
eISSN - 1611-3349
pISSN - 0302-9743
ISBN - 3-540-29740-5
DOI - 10.1007/11575832_13
Subject(s) - edit distance , longest common subsequence problem , similarity (geometry) , n gram , subsequence , computer science , string metric , alphabet , string (physics) , word (group theory) , distance measures , similitude , gram , artificial intelligence , theoretical computer science , algorithm , mathematics , string searching algorithm , pattern matching , language model , mathematical analysis , linguistics , philosophy , genetics , geometry , bounded function , biology , bacteria , image (mathematics) , mathematical physics
In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based are edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom