Statistical Measures of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition


Author(s) - Wu TieeJian, Hsieh YaChing, Li LungAn

Publication year - 2001

Publication title - Biometrics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 2.298

H-Index - 130

eISSN - 1541-0420

pISSN - 0006-341X

DOI - 10.1111/j.0006-341x.2001.00441.x

Subject(s) - Mahalanobis distance, Euclidean distance, Markov chain, distance measures, Hamming distance, mathematics, sequence (biology), similarity (geometry), base (topology), pattern recognition (psychology), Euclidean geometry, artificial intelligence, computer science, algorithm, statistics, genetics, biology, image (mathematics), mathematical analysis, geometry

Summary. In molecular biology, quantifying the similarity between two biological sequences is an important problem. Past research has shown that word‐based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431–1439) characterized a family of word‐based dissimilarity measures that defined the distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n‐words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback–Leibler discrepancy between the frequencies of all n‐words in the two sequences is introduced. Applications to real data demonstrate that Kullback–Leibler discrepancy performs better than Euclidean distance. Moreover, under a Markov chain model of order k̂_Q for base composition, where k̂_Q is the order estimated from the query sequence, standardized Euclidean distance performs very well: it performs as well as Mahalanobis distance and better than Kullback–Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order k̂_Q of base composition is generally recommended. However, if the user is very concerned with computational efficiency, then the use of Kullback–Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology for comparing large datasets of DNA sequences.
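
As a rough illustration of the word‐based comparison described in the summary, the Python sketch below counts overlapping n‐words in two sequences and compares their relative frequencies with Euclidean distance and a Kullback–Leibler type discrepancy. It is only a sketch under stated assumptions: the summary does not give the article's exact formulas, so the symmetrized form of the discrepancy, the smoothing constant `eps`, and all function names are hypothetical choices made for illustration. The standardized Euclidean and Mahalanobis distances are omitted because they require word‐count variances and covariances under the fitted Markov chain model, which the summary does not specify.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are ignored


def word_frequencies(seq, n):
    """Relative frequencies of all 4**n overlapping n-words in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    words = ["".join(w) for w in product(ALPHABET, repeat=n)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid n-words")
    return {w: counts[w] / total for w in words}


def euclidean_distance(f, g):
    """Plain Euclidean distance between two n-word frequency vectors."""
    return math.sqrt(sum((f[w] - g[w]) ** 2 for w in f))


def kl_discrepancy(f, g, eps=1e-9):
    """Symmetrized Kullback-Leibler type discrepancy (illustrative form).

    eps is an assumed smoothing constant that keeps the logarithm
    finite when a word is absent from one of the sequences.
    """
    return sum((f[w] - g[w]) * math.log((f[w] + eps) / (g[w] + eps))
               for w in f)


# Hypothetical toy sequences, not data from the article.
query = word_frequencies("ACGTACGTAC", 2)
target = word_frequencies("ACGTTTGTAC", 2)
print(euclidean_distance(query, target))
print(kl_discrepancy(query, target))
```

Both measures need only a single pass over the frequency vectors, which is consistent with the summary's remark that Kullback–Leibler discrepancy can be computed as fast as Euclidean distance.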
