RSDB: representative protein sequence databases have high information content
Author(s) - Jong-Eun Park, Liisa Holm, Andreas Heger, C. Chothia
Publication year - 2000
Publication title - Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/16.5.458
Subject(s) - database, computer science, sequence database, sequence (biology), homology (biology), information retrieval, biological database, granularity, sequence homology, data mining, bioinformatics, biology, gene, peptide sequence, genetics, programming language
Biological sequence databases are highly redundant for two main reasons: (1) the various databanks hold many identical and nearly identical sequences, and (2) natural sequences often share high sequence identity because of gene duplication. We wanted to know how many sequences can be removed before the databases start to lose homology information. Can a database of sequences with mutual sequence identity of 50% or less provide the same amount of biological information as the original full database?
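To make the idea of a database with "mutual sequence identity of 50% or less" concrete, the following is a minimal sketch of greedy redundancy filtering. It is not the procedure used by the authors, which the abstract does not describe. The function names `percent_identity` and `representative_set` are hypothetical, and the identity measure is a crude ungapped position-by-position comparison standing in for a real alignment-based identity.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Crude ungapped identity: matching positions / longer length, in percent.

    Assumption: this is only a stand-in for an alignment-based identity
    measure; it does not handle insertions or deletions.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / max(len(a), len(b))


def representative_set(sequences: List[str], threshold: float = 50.0) -> List[str]:
    """Greedily keep sequences whose identity to every kept one is <= threshold."""
    kept: List[str] = []
    # Visit longer sequences first so that they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(percent_identity(seq, rep) <= threshold for rep in kept):
            kept.append(seq)
    return kept


# Illustrative toy input (made-up peptides, not data from the paper).
example = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "GGSLWPPHHE"]
representatives = representative_set(example, threshold=50.0)
```

In this toy input the exact duplicate and the one-residue variant are dropped, while the unrelated peptide is kept. A greedy pass like this guarantees that no two retained sequences exceed the threshold under the chosen identity measure, but the result depends on the visiting order.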