
A Relative-Entropy Algorithm for Genomic Fingerprinting Captures Host-Phage Similarities
Author(s) - Harlan Robins, Michael Krasnitz, Hagar Barak, Arnold J. Levine
Publication year - 2005
Publication title - Journal of Bacteriology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.652
H-Index - 246
eISSN - 1067-8832
pISSN - 0021-9193
DOI - 10.1128/jb.187.24.8370-8374.2005
Subject(s) - biology, genetics, computational biology, DNA profiling, evolutionary biology, DNA
The degeneracy of codons allows a multitude of possible sequences to code for the same protein. Hidden within the particular choice of sequence for each organism are more than 100 previously undiscovered, biologically significant short oligonucleotides (2 to 7 nucleotides long). We present an information-theoretic algorithm that finds these novel signals. Applying this algorithm to the 209 sequenced bacterial genomes in the NCBI database, we determine a set of oligonucleotides for each bacterium that uniquely characterizes the organism. Some of these signals have known biological functions, such as restriction enzyme binding sites, but most are new. An accompanying scoring algorithm is introduced that assigns 100-kb sequences to their correct species, out of hundreds of candidates, with 92% accuracy. This algorithm also does far better than previous methods at relating phage genomes to their bacterial hosts, suggesting that the lists of oligonucleotides are “genomic fingerprints” that encode information about the effects of the cellular environment on DNA sequence. Our approach provides a novel basis for phylogeny and is potentially well suited for classifying the short DNA fragments obtained by environmental shotgun sequencing. The methods developed here can be readily extended to other problems in bioinformatics.
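
The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea of a relative-entropy fingerprint and should not be read as the authors' method. It assumes a simple background model in which nucleotides are independent, ranks each oligonucleotide by its contribution p·log2(p/q) to the Kullback-Leibler divergence between observed and background frequencies, and scores a fragment by its mean log-odds under a genome's oligonucleotide model. The function names (`kmer_freqs`, `background_freqs`, `relative_entropy_terms`, `fingerprint`, `score`) and the choice of background are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_freqs(seq, k):
    """Observed frequencies of overlapping k-mers made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = {w: c for w, c in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def background_freqs(seq, k):
    """Expected k-mer frequencies under an ASSUMED independent-nucleotide null."""
    base = kmer_freqs(seq, 1)
    bg = {}
    for letters in product("ACGT", repeat=k):
        p = 1.0
        for ch in letters:
            p *= base.get(ch, 0.0)
        bg["".join(letters)] = p
    return bg


def relative_entropy_terms(seq, k):
    """Per-oligonucleotide contributions p*log2(p/q); their sum is the KL divergence."""
    p = kmer_freqs(seq, k)
    q = background_freqs(seq, k)
    return {w: p[w] * log2(p[w] / q[w]) for w in p}


def fingerprint(seq, k, top=5):
    """The `top` k-mers with the largest absolute relative-entropy contribution."""
    terms = relative_entropy_terms(seq, k)
    return sorted(terms, key=lambda w: abs(terms[w]), reverse=True)[:top]


def score(fragment, genome, k):
    """Mean log-odds of the fragment's k-mers under the genome model.

    k-mers never seen in the genome are skipped, which is a simplification.
    """
    p = kmer_freqs(genome, k)
    q = background_freqs(genome, k)
    f = kmer_freqs(fragment, k)
    return sum(f[w] * log2(p[w] / q[w]) for w in f if w in p)
```

In use, one would compute `fingerprint(genome, k)` for each k from 2 to 7 and assign a fragment to the genome with the highest `score`. A background that preserves the encoded protein sequence, as the abstract's emphasis on codon degeneracy suggests, would require a more elaborate null model than the one assumed here.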