Multiple sequence alignment -- the gateway to further analysis
Author(s) - Lisa Mullan
Publication year - 2002
Publication title - Briefings in Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.204
H-Index - 113
eISSN - 1477-4054
pISSN - 1467-5463
DOI - 10.1093/bib/3.3.303
Subject(s) - gateway (web page) , sequence (biology) , computer science , sequence analysis , world wide web , biology , genetics , gene
Whether the ultimate aim is a phylogenetic analysis of several orthologues, the identification of a pattern for a particular feature or motif, or the basis for structural modelling, multiple sequence alignments allow the researcher to gather more biological information than a single sequence can offer. Possibly the most popular method for comparing three or more sequences is the clustering algorithm used in the Clustal series of programs (ClustalW and ClustalX). It is by no means the only method of alignment, but it will be used to illustrate this text.

Initial clustering of sequence pairs reduces the computing time required to align multiple sequences, and it can be achieved using one of two methods. Slow clustering is the more rigorous of the two, but it is noticeably slower for approximately 20 or more sequences, or for fewer, longer regions. It uses the dynamic programming method of Needleman–Wunsch to align each sequence with another according to a weight matrix and gap penalties. The aim of the program is to achieve the highest score possible within the constraints it has been placed under.

Weight matrices have been developed using homologous sequences, and allocate a score to each residue or nucleotide base indicating the probability of its replacing a different residue or nucleotide base as a possible mutation. In the case of protein sequences, this has been done using several different methods for all 20 amino acid residues, together with the three ambiguity codes (B = Asp or Asn, Z = Glu or Gln, X = any residue). Nucleotide matrices have also been developed; in general they give a positive score for an identical match, and either no score or a negative one for a mismatch.
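The pairwise step described above can be illustrated with a minimal Python sketch of Needleman–Wunsch global alignment. It is an illustration under stated assumptions and not the Clustal implementation: the function name `needleman_wunsch`, the toy weight function (+1 for a match, -1 for a mismatch, in the spirit of the nucleotide matrices just mentioned) and the single linear gap penalty of -2 are all chosen here for brevity. Real programs read their scores from a full weight matrix and may use a more elaborate gap model.

```python
def needleman_wunsch(a, b, weight=None, gap=-2):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b).

    weight(x, y) plays the role of the weight matrix; gap is a linear
    per-position gap penalty. Both defaults are illustrative assumptions.
    """
    if weight is None:
        weight = lambda x, y: 1 if x == y else -1  # toy identity matrix

    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + weight(a[i - 1], b[j - 1]),  # pair residues
                F[i - 1][j] + gap,                              # gap in b
                F[i][j - 1] + gap,                              # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + weight(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table `F` has one cell for every pair of positions, so the work grows with the product of the two sequence lengths. Repeating this for every pair of sequences is why the slow method becomes expensive as the number or length of the sequences increases.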
Because nucleotide sequences are built from only four common bases, more information for the alignment can be obtained by using protein sequences, and it often makes sense to translate regions of coding DNA into protein sequence before aligning them.

Once a high score has been achieved for each of the sequence pairs in the alignment, the sequences are clustered in accordance with their relative scores, using the neighbour-joining method to link the closest pairings together and less similar sequences more remotely. This information is stored in a dendrogram file as a series of numerical distances arranged by means of nested brackets. This file is in no way representative of evolutionary distances, and should not be presented as such. It merely represents the proximity of each sequence within a cluster, and of each cluster to another, and is used to form the final alignment. The information in the dendrogram file may be kept and used to align other multiple sequence sets.

Larger sequence volumes may be compared using a faster method in order to reduce computing time. This is based on the algorithm of Wilbur and Lipman, and is quicker but less accurate than the dynamic programming method of the slow comparison. It involves definition of
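To make the clustering step and the nested-bracket dendrogram more concrete, here is a minimal Python sketch of the textbook neighbour-joining procedure. It is not the Clustal code: the function name `neighbour_joining`, the sequence labels and the distance values are invented for illustration, and the bracketed output is only assumed to resemble the layout of a dendrogram file. The sketch also assumes that pairwise alignment scores have already been converted into distances, with smaller values meaning more similar sequences.

```python
def neighbour_joining(names, dist):
    """Cluster sequences from a symmetric distance matrix.

    names: sequence labels; dist: list of lists of pairwise distances.
    Returns the tree as nested brackets with branch lengths.
    """
    if len(names) < 3:
        raise ValueError("need at least three sequences")
    labels = list(names)
    d = [list(row) for row in dist]

    while len(labels) > 3:
        n = len(labels)
        totals = [sum(row) for row in d]
        # Choose the pair that minimises the neighbour-joining criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best

        # Branch lengths from the two joined nodes to their new parent.
        len_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i

        # Distances from the new parent node to every remaining node.
        keep = [k for k in range(n) if k != i and k != j]
        to_new = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]

        # Rebuild the matrix with the joined pair replaced by the parent.
        d = [[d[r][c] for c in keep] + [to_new[x]] for x, r in enumerate(keep)]
        d.append(to_new + [0.0])
        joined = f"({labels[i]}:{len_i:.3f},{labels[j]}:{len_j:.3f})"
        labels = [labels[k] for k in keep] + [joined]

    # Connect the last three nodes at a single central point.
    la = 0.5 * (d[0][1] + d[0][2] - d[1][2])
    lb = 0.5 * (d[0][1] + d[1][2] - d[0][2])
    lc = 0.5 * (d[0][2] + d[1][2] - d[0][1])
    return f"({labels[0]}:{la:.3f},{labels[1]}:{lb:.3f},{labels[2]}:{lc:.3f});"


# Invented distances between five hypothetical sequences a to e.
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbour_joining(example_names, example_dist)
print(tree)
```

The closest pairs end up inside the innermost brackets and more distant sequences are attached further out. As the text stresses, a tree built this way only guides the order of alignment and should not be read as a statement about evolutionary distance.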
