z-logo
open-access-imgOpen Access
De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm
Author(s) -
Kristoffer Sahlin,
Paul Medvedev
Publication year - 2020
Publication title -
journal of computational biology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.585
H-Index - 95
eISSN - 1557-8666
pISSN - 1066-5277
DOI - 10.1089/cmb.2019.0299
Subject(s) - cluster analysis , bottleneck , computer science , scalability , nanopore sequencing , data mining , algorithm , greedy algorithm , dna sequencing , biology , machine learning , gene , genetics , database , embedded system
Long-read sequencing of transcripts with Pacific Biosciences (PacBio) Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (to scale) and makes use of quality values (to handle variable error rates). We test isONclust on three simulated and five biological data sets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large data sets.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here