Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Samantha Petti; Sean R. Eddy

Open Access

Author(s) - Samantha Petti, Sean R. Eddy

Publication year - 2022

Publication title - PLOS Computational Biology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 2.628

H-Index - 182

eISSN - 1553-7358

pISSN - 1553-734X

DOI - 10.1371/journal.pcbi.1009492

Subject(s) - sequence (biology), benchmark (surveying), algorithm, set (abstract data type), benchmarking, computer science, test set, alignment free sequence analysis, sequence analysis, multiple sequence alignment, sequence alignment, graph, test (biology), mathematics, artificial intelligence, theoretical computer science, biology, genetics, peptide sequence, marketing, gene, business, programming language, geography, paleontology, geodesy

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms take a sequence family as input and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
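To make the split condition concrete, the following is a minimal Python sketch of a randomized greedy split that enforces the "less than p% identical" constraint between training and test sequences. It is an illustration only and is not claimed to be either of the two methods described in the abstract. The function names `percent_identity` and `greedy_split`, the `train_prob` parameter, and the identity measure (fraction of matching columns between pre-aligned, equal-length sequences) are all assumptions introduced for this example.

```python
import random


def percent_identity(a, b):
    """Percent of matching columns between two pre-aligned sequences.

    Assumption: sequences are already aligned and have equal length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_split(seqs, p, train_prob=0.5, seed=0):
    """Split seqs so every (train, test) pair is less than p% identical.

    Hypothetical greedy sketch: visit sequences in random order and place
    each one in a set only if it is dissimilar to every member of the
    other set. Sequences that fit neither set are discarded.
    """
    rng = random.Random(seed)
    order = list(seqs)
    rng.shuffle(order)
    train, test, discarded = [], [], []
    for s in order:
        # s may join train only if it is dissimilar to all test sequences,
        # and may join test only if it is dissimilar to all train sequences.
        ok_train = all(percent_identity(s, t) < p for t in test)
        ok_test = all(percent_identity(s, t) < p for t in train)
        if ok_train and (not ok_test or rng.random() < train_prob):
            train.append(s)
        elif ok_test:
            test.append(s)
        else:
            discarded.append(s)
    return train, test, discarded


if __name__ == "__main__":
    family = ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]
    train, test, discarded = greedy_split(family, p=50)
    print("train:", train)
    print("test:", test)
    print("discarded:", discarded)
```

In graph terms, sequences are vertices and an edge joins any pair that is at least p% identical; a valid split is one with no edge between the training and test sets. A sketch like this trades completeness for simplicity, since some sequences may be discarded to preserve the constraint.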
