Comparison of sequence and structure‐based datasets for nonredundant structural data mining




Author(s) - Chu Carmen K., Feng Lina L., Wouters Merridee A.

Publication year - 2005

Publication title - Proteins: Structure, Function, and Bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.699

H-Index - 191

eISSN - 1097-0134

pISSN - 0887-3585

DOI - 10.1002/prot.20505

Subject(s) - protein data bank (rcsb pdb), protein data bank, pairwise comparison, structural classification of proteins database, computer science, sequence (biology), data mining, protein structure database, protein structure, algorithm, artificial intelligence, sequence database, biology, genetics, biochemistry, gene

Structural data mining studies attempt to deduce general principles of protein structure from solved structures deposited in the Protein Data Bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB while other folds are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. PDB_SELECT is a well‐used representative subset of the PDB that has been deduced by sequence comparison. Specifically, structures with sequences that exhibit a pairwise sequence identity above a threshold value are weeded from the dataset. Although length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence‐based and runs into problems in the twilight zone, possibly resulting in some folds being overrepresented. The value‐added structure databases SCOP and CATH are also a potential source of a nonredundant dataset. Here we compare the sequence‐derived dataset PDB_SELECT with the structural databases SCOP (Structural Classification Of Proteins) and CATH (Class‐Architecture‐Topology‐Homology). We show that some folds remain overrepresented in the PDB_SELECT dataset while other folds are not represented at all. However, SCOP and CATH also have their own problems such as the labor‐intensiveness of the update process and the problem of determining whether all folds are equally or sufficiently distant. We discuss areas where further work is required. Proteins 2005. © 2005 Wiley‐Liss, Inc.
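The threshold-based pruning described in the abstract can be illustrated with a minimal sketch. This is not the PDB_SELECT algorithm itself. The identity measure here is a toy position-by-position comparison with no alignment and no length criteria. The 25% cutoff, the sequences, the fold labels, and the function names `percent_identity` and `filter_redundant` are all hypothetical, chosen only to show how a purely sequence-based filter can leave one fold represented more than once.

```python
from collections import Counter


def percent_identity(a: str, b: str) -> float:
    """Toy identity: percent of matching positions over the shorter sequence.

    Assumption: no alignment is performed; real pipelines use pairwise
    alignments and length-dependent thresholds.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / n


def filter_redundant(entries, threshold=25.0):
    """Greedily keep an entry only if it is at or below the identity
    threshold against every entry already kept."""
    kept = []
    for entry in entries:
        if all(percent_identity(entry["seq"], k["seq"]) <= threshold for k in kept):
            kept.append(entry)
    return kept


# Hypothetical toy data: ids, sequences, and fold labels are invented.
entries = [
    {"id": "A", "seq": "MKVLAAGIVG", "fold": "fold1"},
    {"id": "B", "seq": "MKVLAAGIVA", "fold": "fold1"},  # near-duplicate of A
    {"id": "C", "seq": "QWERTYPSDF", "fold": "fold1"},  # same fold, dissimilar sequence
    {"id": "D", "seq": "HNCHNCHNCH", "fold": "fold2"},
]

kept = filter_redundant(entries)
fold_counts = Counter(e["fold"] for e in kept)

print([e["id"] for e in kept])
print(dict(fold_counts))
```

In this toy case the near-duplicate B is removed, but C survives because its sequence differs from A even though both carry the same fold label, so that fold is still counted twice after filtering. A structure-based classification would be needed to detect that kind of redundancy.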
