Reducing storage requirements for biological sequence comparison
Author(s) - Michael Roberts, Wayne B. Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke
Publication year - 2004
Publication title - Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/bth408
Subject(s) - substring, string (physics), string searching algorithm, computer science, sequence (biology), computation, matching (statistics), fraction (chemistry), process (computing), approximate string matching, simple (philosophy), pattern matching, genome, algorithm, theoretical computer science, data mining, biology, mathematics, data structure, artificial intelligence, genetics, statistics, gene, programming language, chemistry, philosophy, organic chemistry, mathematical physics, epistemology
Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process.
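The seed-and-extend idea described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: seeds are taken to be exact substrings of a fixed length `k`, extension is ungapped and exact, and the names `build_seed_index`, `extend`, `seed_and_extend` and `min_length` are hypothetical.

```python
from collections import defaultdict


def build_seed_index(database, k):
    """Map every length-k substring (seed) to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def extend(query, q_pos, target, t_pos, k):
    """Grow an exact seed match to the left and right while characters agree (no gaps)."""
    left = 0
    while (q_pos - left > 0 and t_pos - left > 0
           and query[q_pos - left - 1] == target[t_pos - left - 1]):
        left += 1
    right = k
    while (q_pos + right < len(query) and t_pos + right < len(target)
           and query[q_pos + right] == target[t_pos + right]):
        right += 1
    return q_pos - left, t_pos - left, left + right


def seed_and_extend(query, database, k, min_length):
    """Return sorted (sequence id, query start, target start, length) tuples for extended matches."""
    index = build_seed_index(database, k)
    matches = set()  # several seeds can extend to the same match, so deduplicate
    for q_pos in range(len(query) - k + 1):
        for seq_id, t_pos in index.get(query[q_pos:q_pos + k], ()):
            q_start, t_start, length = extend(query, q_pos, database[seq_id], t_pos, k)
            if length >= min_length:
                matches.add((seq_id, q_start, t_start, length))
    return sorted(matches)


database = ["ACGTACGTGA", "TTTTACGTGATT"]
print(seed_and_extend("GGACGTGACC", database, k=4, min_length=6))
```

Note that `build_seed_index` records an entry for every position of every database sequence. That is the storage cost the abstract refers to: the index grows with the total length of the database, which is why it can exceed RAM for genome-scale inputs.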