Protein Sequence Similarity Search Suitable for Parallel Implementation

H. S. Mazumdar; Maulika Patel

Open Access

Author(s) - H. S. Mazumdar, Maulika Patel

Publication year - 2012

Publication title - International Journal of Computer Applications

Language(s) - English

Resource type - Journals

ISSN - 0975-8887

DOI - 10.5120/7935-1246

Subject(s) - computer science, similarity (geometry), sequence (biology), nearest neighbor search, information retrieval, data mining, artificial intelligence, genetics, biology, image (mathematics)

In the post-genomic era, a wealth of genomic and proteomic data is available, providing ample resources for applying computational and machine learning strategies to biologically relevant problems. Searching biological databases for similar or homologous sequences is a fundamental step in many bioinformatics tasks. On discovering a new protein sequence or drug, a biologist would like to confirm the discovery by comparing it against the largest available protein database. Alignment-based methods become too complex and time-consuming as the number of sequences grows, so alignment-free sequence comparison is often used as a filtering step before alignment is applied. A novel method of searching for similar sequences in a very large protein database is proposed. The method has two notable aspects. The first is a divide-and-conquer approach that uses a hashing-like scheme to index the large database; the index consists of the addresses of the 15-residue words in the UniRef100.fasta database. The second is the possibility of data parallelism, since the database is divided into m segments for indexing, which can further increase the efficiency of the algorithm. Creating the index is time-consuming, but the search time is constant and affordable. The method is particularly useful with large databases such as UniRef100.fasta, which contained 9,757,328 protein sequences as of May 2010. The index-based search algorithm is implemented in C#.
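The indexing idea described in the abstract can be illustrated with a small sketch. This is not the authors' C# implementation: it is a hypothetical Python illustration that assumes overlapping 15-residue words, a dictionary standing in for the "hashing-like scheme", round-robin assignment of sequences to the m segments, and ranking of hits by the number of shared words. The function names (`parse_fasta`, `index_segment`, `build_indexes`, `search`) are invented for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

WORD_LEN = 15  # word length stated in the abstract


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def index_segment(segment):
    """Map each 15-residue word to the (sequence id, offset) addresses where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in segment:
        # Assumption: words overlap, i.e. the window slides one residue at a time.
        for offset in range(len(seq) - WORD_LEN + 1):
            index[seq[offset:offset + WORD_LEN]].append((seq_id, offset))
    return dict(index)


def build_indexes(records, m):
    """Split the database into m segments and index each one independently."""
    numbered = list(enumerate(records))
    # Assumption: sequences are dealt round-robin into the m segments.
    segments = [[(i, seq) for i, (_, seq) in numbered[k::m]] for k in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        return list(pool.map(index_segment, segments))


def search(query, indexes):
    """Count shared 15-residue words per database sequence, best match first."""
    hits = defaultdict(int)
    for offset in range(len(query) - WORD_LEN + 1):
        word = query[offset:offset + WORD_LEN]
        for index in indexes:
            for seq_id, _ in index.get(word, ()):
                hits[seq_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because each segment is indexed without reference to the others, the per-segment work is independent, which is what makes data parallelism possible. The thread pool here only shows that structure; in CPython a `ProcessPoolExecutor`, or separate machines, would be needed for a real speedup. Each lookup of a query word is a hash-table access, so the search cost depends on the query length rather than on scanning the database.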
