Comparison of sequence and structure‐based datasets for nonredundant structural data mining
Author(s) - Chu Carmen K., Feng Lina L., Wouters Merridee A.
Publication year - 2005
Publication title - Proteins: Structure, Function, and Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.699
H-Index - 191
eISSN - 1097-0134
pISSN - 0887-3585
DOI - 10.1002/prot.20505
Subject(s) - protein data bank (rcsb pdb) , protein data bank , pairwise comparison , structural classification of proteins database , computer science , sequence (biology) , data mining , protein structure database , protein structure , algorithm , artificial intelligence , sequence database , biology , genetics , biochemistry , gene
Structural data mining studies attempt to deduce general principles of protein structure from solved structures deposited in the Protein Data Bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB, while other folds are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. PDB_SELECT is a widely used representative subset of the PDB that has been derived by sequence comparison. Specifically, structures whose sequences exhibit a pairwise sequence identity above a threshold value are removed from the dataset. Although the length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence‐based and runs into problems in the twilight zone, possibly resulting in some folds being overrepresented. The value‐added structure databases SCOP and CATH are also a potential source of a nonredundant dataset. Here we compare the sequence‐derived dataset PDB_SELECT with the structural databases SCOP (Structural Classification Of Proteins) and CATH (Class‐Architecture‐Topology‐Homology). We show that some folds remain overrepresented in the PDB_SELECT dataset, while other folds are not represented at all. However, SCOP and CATH have their own problems, such as the labor‐intensiveness of the update process and the difficulty of determining whether all folds are equally or sufficiently distant. We discuss areas where further work is required. Proteins 2005. © 2005 Wiley‐Liss, Inc.
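
The sequence‐based pruning described in the abstract, and the way it can leave a fold overrepresented, can be illustrated with a minimal sketch. This is not the PDB_SELECT algorithm or the authors' method: it assumes the sequences are already aligned to equal length, uses a simple greedy pass, and takes an illustrative 25% threshold. The chain names, sequences, and fold labels (`fold_of`) are invented toy data, not entries from the PDB, SCOP, or CATH.

```python
from collections import Counter


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    # Only compare columns in which both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def filter_redundant(chains, threshold=25.0):
    """Greedy filter: keep a chain only if its identity to every kept chain is <= threshold."""
    kept = []
    for name, seq in chains:
        if all(percent_identity(seq, kept_seq) <= threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


def fold_counts(kept, fold_of):
    """Count how many retained chains fall into each fold label."""
    return Counter(fold_of[name] for name, _ in kept)


# Hypothetical toy data: chains "a" and "d" share no sequence identity
# but are labeled (by assumption) as having the same fold.
chains = [
    ("a", "ACDEFGHIKL"),
    ("b", "ACDEFGHIKV"),  # 90% identical to "a", so it is removed
    ("c", "MNPQRSTVWY"),
    ("d", "LKIHGFEDCA"),
]
fold_of = {"a": "fold1", "b": "fold1", "c": "fold2", "d": "fold1"}

kept = filter_redundant(chains, threshold=25.0)
counts = fold_counts(kept, fold_of)
print([name for name, _ in kept])
print(dict(counts))
```

In this toy case the filter removes the near‐duplicate chain but retains two sequence‐dissimilar chains with the same fold label, which is the kind of residual overrepresentation a purely sequence‐based filter cannot detect.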