Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization | Zendy

Libbrecht Maxwell W. | Zendy; Bilmes Jeffrey A. | Zendy; Noble William Stafford | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Premium

Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization

Author(s) -

Libbrecht Maxwell W.,

Bilmes Jeffrey A.,

Noble William Stafford

Publication year - 2018

Publication title -

proteins: structure, function, and bioinformatics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.699

H-Index - 191

eISSN - 1097-0134

pISSN - 0887-3585

DOI - 10.1002/prot.25461

Subject(s) - submodular set function , computer science , heuristic , optimization problem , sequence (biology) , set (abstract data type) , mathematical optimization , selection (genetic algorithm) , domain (mathematical analysis) , benchmark (surveying) , data mining , algorithm , mathematics , machine learning , artificial intelligence , biology , mathematical analysis , geodesy , genetics , programming language , geography

Abstract Selecting a non‐redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non‐redundant training sets for sequence and structural models or selection of “operational taxonomic units” from metagenomics data. Previous methods for this task, such as CD‐HIT, PISCES, and UCLUST, apply a heuristic threshold‐based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here

Empowering knowledge with every search

About

About Careers Publisher Partners Contact Us

Learn

FAQs Blog Terms of Use Privacy Policy

About

Learn

Discover

Explore