Elementary Sequence Analysis


Sequence Alignment

edited by B.Golding, Jan 1996

Contents

Return to index


Dot Plots

The comparison of sequences can be done in many different ways. The most direct method is to make this comparison via a visual means and this is what DOT PLOTS attempt to do. Dot plots are a group of methods that visually compare two sequences and look for regions of close similarity between them.

The Exact Way

The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two sequences are identical a dot is placed (i.e. at the intersection of every row and column that have the same letter in both sequences). A diagonal stretch of dots will indicate regions where the two sequences are similar. Done in this fashion a dot plot as shown in Figure 1 will be obtained. This is a dot plot of the globin intergenic region in chimpanzees plotted against itself (bases 1 to 400 vs. 1 to 300). As can be seen this dot plot is not very useful unless applied to protein sequences (where the background is much less dense), however some statistical methods can be applied to the results (Gibbs and McIntyre 1970, Eur J Biochem 16:1).

Maizel and Lenk (1981, PNAS 78:7665) popularized the dot plot and suggested the use of a filter to reduce the noise demonstrated in Figure 1. This noise is caused by matches that have occurred by chance and are not a true reflection of the similarities between the sequences but rather reflect the limited number of bases permitted in DNA sequences. There are a wide variety of filters that can be used, indeed they are only limited by your imagination. The one suggested by Maizel and Lenk was to place a dot only when a specified proportion of a small group of successive bases match. In Figure 2 the same dot plot is reproduced with a filter such that a window of 10 bases is highlighted only if 6 of these 10 bases match. In Figure 3 the same plot is again shown with a filter of 8 out of 10 matches. Note that these plots highlight the complete window while other programs might highlight a single point centered by the window. Another common way to filter the matches is to give them a weight according to their chemical similarity (Staden 1982, Nuc Acids Res 10:2951).

Figure 1: Dot Blot - without filtering.

Figure 2: Dot Blot - filtered 6 of 10.

Figure 3: Dot Blot - filtered 8 of 10.

The algorithm that generated these figures is included on the attached diskette (files DOTPLOT.PAS and DOTPLOT.EXE). I do not provide any warrantee that the enclosed programs will actually do anything (they are provided only "as is"). The source code of the program is provided and is written in very basic level Pascal and compiled with Borland's Turbo Pascal. The graphic parts of this program use Borland's Graphical Interface. This is a series of commands that automatically detect the type of computer (within limitations) and automatically load the correct graphical drivers. These drivers are located in the files with the file extensions "*.BGI" and these must be located in the directory from which the program is run. Currently the program is limited to DNA sequences but to permit amino acid sequences as well requires only a minor change in the subroutine that reads the sequences and does not require any change to the algorithm itself.

Note that the computational work involved with the generation of these matrices can be quite time consuming. The algorithm that is included on the attached diskette was written by me and hence makes no use of some of the available techniques to speed the computations. Note that if you are comparing a sequence of length N with another sequence of length M, then the total number of windows for which matches must be found is N x M. Hence the amount of work increases with the square of the sequence length. This rapidly becomes a large number. For example with N=700 and M=400, N x M = 280,000.

Identity Blocks

There is another way in which dot plots can be generated very quickly. This involves a computer method commonly known as "hashing" (list-sorting). These methods are incorporated into the FASTA algorithms. Basically, the idea is that instead of taking the complete matrix and calculating points for every entry in that matrix, a great saving can be made if the algorithm searches only for exact matches. Hence, this method looks for blocks of perfect identity. The computational complexity of this algorithm grows linearly with increasing N.

The algorithm simply sub-divides the sequence into all "words" of a user specified block size. The same is done for the alternate sequence. In addition, for both sequences the location of each word is also recorded. These arrays of "words" are then sorted alphabetically and the arrays of locations are sorted in parallel with the "words". Then, by comparing the sorted array from one sequence with that from the other sequence immediately gives the location of all identical "words".

Figure 4: Identity Dot Blot.

Figure 5: Identities of length 6bp.

Chimpanzee hemoglobin intergenic DNA against itself.

An algorithm which does this is included on the attached diskette (files IDENTITY.PAS and IDENTITY.EXE). It can be used to generate the dot plots shown in Figure 4 for identity blocks of length 5. The rapidity of this method compared to the exact method can be demonstrated by the dot plot shown in Figure 5 (with identity blocks of length 6). This figure extends the sequences compared in the chimp globin intergenic region from (1-400 vs 1-300) up to (1-4000 vs 1-3000). The length of time required by DOTPLOT to analyze the small region is not significantly shorter than the length of time it takes IDENTITY takes to do a 100 fold larger matrix.

Figure 6:Identities of length 6bp.

Chimpanzee hemoglobin intergenic DNA against spider monkey.

The beauty of this method is demonstrated in Figure 6. This is a plot of all identities of length 6 between the chimpanzee and spider monkey sequences in the same region. The evolutionary homology - relatedness between these sequences is easily discernible despite the approx. 60 million years that separate these two groups. Further more, this is intergenic DNA with no known function to selectively maintain this homology (modulo an even more ancient eta-globin pseudogene). The insertion of some DNA is easily observed within chimpanzee sequence and then a corresponding deletion further down. These correspond to the insertion of an Alu element in the chimp (and human and other ape) sequences and then the presence of a truncated L1 element in the spider monkey that is not present in the great apes. These events are difficult to find by a simple inspection of the actual sequence code but are readily found by a visual inspection.

A more distant similarity can be seen in Figure 7. This is a plot of the identities of length 6 between the same region of the chimpanzee haemoglobin intergenic region and another intergenic region from the spider monkey. Note the similarity in the circled region. This region of similarity corresponds to the location of another Alu element in the chimpanzee sequence.

Figure 7: Identity dot plot. Chimpanzee hemoglobin intergenic

region vs. Spider Monkey unrelated intergenic region.

Some other interesting dotplots are comparisons of the calmodulin protein against itself and the human epidermal growth factor against itself. Both show internal repetitive elements. The neatest dot plot that I have yet seen is the human zeta globin region and if you zero in on the intergenic region the plot gets fantastic (try to interpret this dot plot).


Alignments

The dot plots provide a ready way to visualize the sequences being compared. They are not very useful however in providing an actual alignment between the two sequences. To do this, other algorithms are required.

A word on terminology. People in the field shudder when the terms similarity and homology are used indiscriminately. Similarity simply means that sequences are in some sense similar and has no evolutionary connotations, whereas homology refers to evolutionarily related sequences stemming from a common ancestor.

To have a correct alignment can be crucial to many types of studies. It may, for example, alter from which part of a gene one segment was duplicated, it may alter the inferred number of point mutations, it may alter the inferred location of deletions / insertions, alter the inferred distance between species, and may alter the inferred phylogeny of the sequences along with whatever evolutionary hypotheses are dependent on these phylogenies.

An explicit and precise algorithm is also required. For example one paper in the prestigious journal NATURE stated that the alignment


------CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT---------

was optimal in the sense that gaps were inserted to maximize the number of base matches (the base matches are underlined - this is only part of the alignment). They obviously did this alignment by eye and did not use an explicit algorithm. An alternate alignment (due to 84219728 Fitch 1984, Nature 309:410) is


CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT

This alignment not only increases the number of base matches by 133 per cent, but also decreases the number of gaps by 50 per cent and reduces the number of gapped residues by 80 per cent. Hence, if the number of base matches can be increased by reducing the number of gaps, then clearly the original author's insertion of gaps did not maximize that number. Fitch recommends that the authors change their statement to the assertion that gaps were introduced to increase the number of base matches (rather than to maximize them). More generally this example shows the importance of i) using a well defined algorithm and ii) of using a computer based algorithm to perform these calculations. Even alignments that may appear simple and straightforward, if given to the computer, might yield alternatives that you did not consider.


The Needleman and Wunsch Algorithm

The most basic algorithm to align two sequences was developed by S.A. Needleman and C.D. Wunsch (1970, J. Mol. Biol. 48:443). The algorithm is a simple and beautiful way to find an alignment that maximizes a particular score. (The score can be calculated in a variety of methods - as will be indicated below). The overall method is reminiscent of the dot plot (though this was not popular in the 1970's and so there is probably no connection). The first step is to place the two sequences along the margins of a matrix as shown in Table 1.

In this first step, simply place a 1 anywhere the two sequences match and a 0 elsewhere. If done on a larger scale than is shown in Table 1, this would exactly recreate the dot plot shown in Figure 1. In this case however, we wish to find a path through this matrix which would define a more conventional alignment. For example, proceeding along the diagonal with no deviations would imply an alignment without any gaps. The introduction of a gap (either by an insertion or a deletion - an indel) in either sequence would correspond to moving either above or below the main diagonal.

To find the best route, Needleman and Wunsch suggested that you modify the matrix to represent this idea of tracing different pathways through the matrix. However, you want to include all possible pathways and from among these choose only that one which is best (in the sense of maximizing some score). Their method consists of two passes through the matrix. The first traces a score for all possible routes and moves right to left, bottom to top. Once the score for all possible routes are found, the maximum can be chosen (it will be somewhere on the topmost row or leftmost column) and a second pass can be carried out, this time running left to right, top to bottom to find that alignment that gives the maximum score.

The way to trace a score for all possible paths is shown in Table 2. For each element in the matrix you perform the following operation.

where k is any integer larger than i and l is any integer larger than j. In words, alter the matrix by adding to each element the largest element from the row just below and to the right of that element and from the column just to the right and below the element of interest. This row and column for one element are shown in Table 2 by boxes. The number contained in each cell of the matrix, after this operation is completed, is the largest number of identical pairs that can be found if that element is the origin for a pathway which proceeds to the upper left.

We wish to have an alignment which covers the entire sequence. Hence, we can find on the upper row or on the left column the element of the matrix with maximum value. An alignment must begin at this point and can then proceed to the lower right. This is the second pass through the matrix. At each step of this pass, starting from the maximum, one moves one row and column to the lower right and finds the maximum in this row or column. The alignment must proceed through this point.

Continuing in this fashion one eventually hits either the bottom row or the rightmost column and the alignment is finished. This tracing pattern is shown in Table 3. Note that in this case the optimal alignment is not unique. There are two alignments and both give the optimal score of 8 matches. These are ...

These two alignments can be written in more familiar form as either


               ABCNJ-RQCLCR-PM
               * * * * * ** *
               AJC-JNR-CKCRBP-


                    or as 


               ABC-NJRQCLCR-PM
               * * * * * ** *
               AJCJN-R-CKCRBP-

both with 8 asterisks to denote the 8 matches. Note that in this particular case, gaps are given the same penalty as a mismatch. They simply do not add to the score.

A program has been provided for you on the enclosed diskette to carry out a Needleman-Wunsch type of alignment. The program is called NWALIGN. It is severely limited by the size of arrays permitted and hence is limited to sequences of length 100 residues(hence it is better for protein rather than DNA sequences). The code is provided and this limitation can be removed. Also provided in this subdirectory is a program ALIGN (this is a PC version of ALIGN1 - see below). While NWALIGN is less capable than ALIGN, it will give you a better idea of what is happening in a local region of any alignment that you want. Importantly, NWALIGN lets you change the way in which the alignment is done by permitting you to give different weights to different types of mismatches and matches.

The Smith-Waterman Algorithm

The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to take all of one sequence and align it with all of a second sequence. Short and highly similar subsequences may be missed in the alignment because they are outweighed by the rest of the sequence. Hence, one would like to create a locally optimal alignment. The Smith-Waterman (1981, J.Mol.Biol. 147:195-197) finds an alignment that determines the longest/best subsequence pair that give the maximum degree of similarity between the two original sequences. This means that not all of the sequences might be aligned together.

Only minimal changes to the Needleman-Wunsch algorithm are required. These are

  1. A negative score/weight must be given to mismatches.
  2. Zero must be the minimum score recorded in the matrix.
  3. The beginning and end of an optimal path may be found anywhere in the matrix - not just the last row or column.

The first point is required to cause the score to drop as more and more mismatches are added. Hence, the score will rise in a region of high similarity and then fall outside of this region. If there are two segments of high similarity then these must be close enough to a path between them to be linked by a gap or they will be left as independent segments of local similarity.

The second point is so that each pathway begins fresh at its beginning. Thus each short segment of similarity should begin with a score of zero. The third point indicates that the entire matrix must be searched for regions with high local similarity.

Again, for each element in the matrix you perform the following operation.

But in this case it is easier to go left to right, top to bottom in the matrix - so here k is any integer smaller than i and l is any integer smaller than j. Also, for a local alignment Mi,j must be some negative penalty if residue i is not the same as residue j.

As an example the previous alignment can be reproduced with a penalty of -0.5 for each mismatch. The matrix will then be as given in Table 4.

In this case the same alignment is found. However, the Smith-Waterman algorithm does not include the final M/P mismatch in its path as it is not part of the locally optimal solution. More generally, large chucks of each sequence may be missing from the local alignment (as in the alignment presented by BLAST).

It is not always the case that these two methods give the same answer. For example, try a global and a local alignment of TTGACACCCTCCCAATTGTA versus ACCCCAGGCTTTACACAT.

If the sequences are not known to be homologous throughout their entire length, a local alignment should be the method of choice. Often the two methods will give similar answers (as above) but if the homology is distant, a local alignment will be more likely to find the remaining patches of homology.


Testing Significance

These algorithms are trying to find the best way to match up two sequences. This does not mean that they will find anything profound. For example, if I take these two sentences (and delete spaces), they can be aligned.


     THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
     :: :.. . ..  ...:   :    ::::..       ::  .  : ...
     THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------

Depending on your intuition you may or may not think this to be a pretty good alignment particularly at the NH2 terminus. There are a total of 12 exact matches and 14 conservative substitutions. But there is obviously no homology between these two "protein"sequences. How do we test whether or not an alignment is significant?

As a more biological example, consider the alignment of human alpha haemoglobin and human myoglobin. If you remember your basic biology, you should remember that these two proteins do similar functions of transporting oxygen in the blood and muscle respectively. But are they evolutionarily related? An alignment of the two looks like ...


Human alpha haemoglobin (141 aa) vs. Human myoglobin (153 aa)


     VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQ
      :: ..   : ..::::.:.  ..:.:.: :.: . :.: . : .: .:.     ..:..
     GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED



     VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
     .: :: .: .::.. . . ..  .....:.. :: : ..    ....:.:.. .:... :
     LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP


     AEFTPAVHASLDKFLASVSTVLTSKYR------
     ..:.........: :.  .. ..:.:.      
     GDFGADAQGAMNKALELFRKDMASNYKELGFQG

Again looks like a reasonably good alignment. Or how about chicken lysozyme and bovine ribonuclease. An alignment of these gives


Chicken lysozyme (129 aa) vs. Bovine ribonuclease (124 aa)

     KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS
     :  .    :: ..:. .:.  . . ..  :.....:.  :..  . ... .. .. ....
     KETA----AAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQA


     RWWCNDGRTP--GSRNLCNIPCSALLSSDITASVNCAKKIVSDGDGMNAWVAWRNRCKGT
        :.. ...  .... :  ..:..  .:  .. ...:   .. .. .:       :.:.
     V--CSQKNVACKNGQTNCYQSYSTMSITDCRET-GSSKYPNCAYKTTQANKHIIVACEGN


     DVQAWIRGCRL
        .   . ..
     PYVPVHFDASV

Again a reasonable alignment (?).

A common test to determine if the alignment of two sequences is statistically significant is to carry out a jumble or permutation test. This consists of

  1. Randomly rearrange the order of one or both sequences
  2. Align the permuted sequences
  3. Record the score for this alignment
  4. Repeat Step 1-3 a large number of times.

Doing this say 10000 times, gives a distribution of alignment scores that could be expected for random sequences with a similar amino acid content. If the actual alignment has a score much higher than that of the permuted sequences, then you know that they must be homologous to some extent.

A plot of 10,000 alignment scores for the human myoglobin and human alpha haemoglobin sequences are shown below.

Figure 8: Histogram of alignment scores

The permuted scores range from 14 to 75 but most are less than 50. Also note the skewness of the distribution - statistics based on a normal distribution would be strongly biased. The skew is expected since in each case the alignment algorithm is trying to maximize the score. The score for the alignment of the two actual sequences is 179 (indicated by the arrow). Obviously, myoglobin and haemoglobin are evolutionarily related and still retain many features of their homology. The significance of this alignment is < 0.0001.

A plot of 10,000 alignment scores for the chick lysozyme and bovine ribonuclease sequences gives ...

Figure 9: Histogram of alignment scores

Again note, the skew and note that this "random" distribution is somewhat different from the haemoglobin "random" distribution. This is due to the differential effects of amino acid composition in these proteins. The permuted scores range from 14 to 72. The actual score for the proteins is 30 (indicated by the arrow). Obviously whatever homology that once existed between these proteins has been completely destroyed by time.

These two examples are clear cut. There is a large grey area where the tests may be uncertain of the degree of homology between sequences. For protein sequences Doolittle's rule of thumb is that greater than 25% identity will suggest homology, less than 15% is doubtful and for those cases between 15-25% identity a strong statistical argument is required. Personally, I would prefer the statistical test in all cases, since things like repeats and unusual amino acid compositions can sometimes confuse the picture.


Gaps and Indels

Note that gaps were rather freely permitted in the Needleman-Wunsch and Smith-Waterman alignments. What would happen if you feel that gaps should be rarer events in your particular protein? It is possible to assign a different weight to gaps. This can be done by subtracting from the score, some predetermined value every time a gap is required. In this algorithm you can define a weight as

where k is the length of the gap. Hence you can control whether many short gaps occur or whether long gaps occur but more infrequently. Deletions do occur but when they occur it is usually not in many small, short deletions but rather fewer and longer deletions.

How do you choose a gap penalty? Unfortunately, there is little knowledge to help here. Most of the tests done so far work from an empirical basis designed to achieve some end. For example, Smith and Fitch have derived (by exhaustive search) gap penalties that will best align distantly related haemoglobin genes. But there is no guarantee that these values would work well for the protein or (worse) the nucleotide sequence that you are interested in. Typical values are

                      0.5 < a < 5.0

0.05 < b < 1.0

but there is nothing special about these values other than the fact that they seem to work well for some of the common comparisons. Note that in general a > b. This corresponds with biological knowledge of how gaps are generated - it is easier to generate one gap of two residues rather than two gaps of one residue since the former can be created by a single mutational event.

Many programs provide their own gap penalties and the user does not have the opportunity to alter these. This is the case for program ALIGN1 on the diskette. In general, the authors of these algorithms have tested them and find the best empirical values for their particular implementation.


"Natural" Gap Weights - Thorne, Kishino & Felsenstein

In a series of papers Thorne, Kishino and Felsenstein (JME 33:114, 1991), and (JME 34:3, 1992) have developed a method to find maximum likelihood estimates of the gap penalties (and transition/transversion rates) while doing an alignment. This eliminates the necessity for choosing an arbitrary gap penalty.

They develop a maximum likelihood method to examine the possible paths of descent from a common ancestor for two sequences. The creation of gaps is modeled as a birth - death process with separate parameters for birth rate and death rate. The model then finds the likelihood of particular paths through the matrix given the transition parameters. It then examines alternative parameters and chooses that path and parameter set with the highest likelihood.

The big difficulty with this method is the enormous computer time required to carry out the calculations.

A related question is the assignment of weights to individual differences in nucleotide or protein sequences (more on this later). There have also been advances in methods to try to find statistically bounded sets of alignments. That is, the set of alignments that are within 95% confidence limits of some best answer.

Again another fertile area were many significant improvements are being made.


It is important to realize that an optimal alignment is optimal only for the particular values chosen for the mismatch and gap weights. When any of these are altered, the optimal alignment will also change. Also be aware of the fact that nature is seldom mathematically optimized. Fitch and Smith (1983, PNAS 80:1382) have derived a set of "rules of thumb" a subset of which are given in Table 5. Even with the very best programs it still requires some degree of experience to draw the right conclusions from the results produced and a good grasp of the biology of the problem is as essential as one's familiarity with computers.

Some rules of thumb for alignments
1. A gap and its length are distinct quantities. Different weights should be applied to each.
2. Weights for different mismatches should be permitted. A transition is more likely than a transversion; a Ile-Val more likely than Ile-Arg change.
3. If the two sequences have no obvious relationship at their right and left ends, then end gaps should not be penalized.
4. Unless two sequences are known to be homologous over their entire length, a local alignment is preferable to a global alignment.
5. An optimal alignment is by no means necessarily statistically significant. One must make some estimate of the probability that a given alignment is due to chance.
6. An alignment demonstrates similarity, not necessarily, homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance, convergence or analogy.


Implementations

There are two alignment programs on the diskette. These are ALIGN1 and ALIGN2. The former is a better, more rigorous and a more recent algorithm but is slower. The latter is an older and less sensitive program but a more rapid algorithm because it makes greater use of simple hash techniques.

Align1 - Myers and Miller

This command can be run by simply typing

ALIGN1

This algorithm is due to E.W. Myers and W. Miller (1988, CABIOS 4:11). Basically it is a speeded and significantly improved implementation of a Gotoh's (1982, J. Mol.Biol. 162:705) method for aligning sequences. The program uses a PAM matrix for protein sequences. Again a ":" indicates an exact match while a "." indicates a conservative replacement. After typing "align1", type in the names of the files containing the two sequences and wait a few minutes. To save the output remember to use the redirection abilities of DOS.

Align2 - Lipman and Wilbur

This algorithm uses a mixture of a Smith-Waterman type algorithm and pure hashing to provide a method that is very rapid but not too sensitive. To run the program type

ALIGN2

The program will prompt you for the information it requires. The K-tuple requested by the program provides "words" of identity that are used to search for the likeliest regions of similarity. The window parameter defines the distances between words that are searched for the best local alignments. A complete description of the program can be found in Wilbur and Lipman (1983, PNAS 80:726). This program was distributed with the older versions of the FASTP program packages.

Multiple Sequences

Conceptually, there is no reason why a Needleman-Wunsch algorithm can not be performed with more than two sequences. The matrix simply becomes multi-dimensional and the algorithm would work successively through each dimension. There are however, significant practical problems with this approach. In this case instead of growing as an problem, the computational time will grow as , where m is the number of species. Hence, even for only 100 nucleotides from 5 species, this is or the equivalent of doing an alignment for two sequences each 100,000 nucleotides long. Obviously different methods need to be employed. In general these require more assumptions and are not as precise nor "all-encompassing" as the Needleman-Wunsch or Smith-Waterman algorithm.

If desired, simple scores of similarity can be readily found using "hashing" techniques. These can operate on many sequences simultaneously and rapidly. But to take these local regions of identity and turn out an alignment is somewhat more complicated.

Bains (1986, Nuc. Acids Res. 14:159) suggested an iterative method which involves successive applications of the standard algorithms. It begins with a trial consensus alignment (say the alignment between sequences 1 and 2). Then the third sequence is aligned against the consensus sequence and a new consensus emerges. This continues until the consensus alignment converges to a global consensus.

One of the most popular multiple alignment programs works on this principle and is called Clustal. It was written by Higgins and Sharp (1989, CABIOS 5:151). It is able to handle 100 sequences up to a maximum length of 5000 residues. The alignments are done in three steps. In the first step, all pairwise similarity scores are calculated. As I understand it, this is done using hashing and a constant gap penalty. The second step is to create a similarity matrix and then to cluster the sequences based on this similarity using UPGMA cluster analysis (see the next section). The final step is the multiple alignment step. This is performed by gradually aligning groups of sequences, according to their branching order in the dendrogram. The algorithm used attempts to minimize the distance (via the similarity score) between groups of sequences. At this stage variable gap penalties are permitted.

Other methods make use of a multiple dimensional dot plot and then look for dots that are common to each group (Vingron & Argos 1991 J.Mol.Biol. 218:33-43). Still others rely heavily on user input as does the popular windows program MACAW ( Schuler, Altschul & Lipman, 1991 Proteins Struct. Func. Genet. 9: 180-190). This whole area is ripe for major theoretical advances and for the creation of better interface programs.

One obvious extension of these algorithms is to construct an alignment and a phylogeny for sequences all at the same time. This is because the alignment will affect distances between sequences and this will affect the inferred phylogeny. Similarly a different phylogeny will imply a different alignment of the sequences. Now you are talking about a real problem! Never-the-less, some progress has been made in this area. Jotun Hein has come up with a program TREEALIGN which will do exactly this. It is a complicated program not suited for PC's (but hey a 586 is powerful machine). It is available in the list from EMBL software given in the last section. This also leads us into the next section on phylogenies.

_______________________________

Return to index