Sequence Alignment
Contents
The comparison of sequences can be done in many different ways. The most direct method is to make this comparison via a visual means and this is what DOT PLOTS attempt to do. Dot plots are a group of methods that visually compare two sequences and look for regions of close similarity between them.
The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two sequences are identical a dot is placed (i.e. at the intersection of every row and column that have the same letter in both sequences). A diagonal stretch of dots will indicate regions where the two sequences are similar. Done in this fashion a dot plot as shown in Figure 1 will be obtained. This is a dot plot of the globin intergenic region in chimpanzees plotted against itself (bases 1 to 400 vs. 1 to 300). As can be seen this dot plot is not very useful unless applied to protein sequences (where the background is much less dense), however some statistical methods can be applied to the results (Gibbs and McIntyre 1970, Eur J Biochem 16:1).
Maizel and Lenk (1981, PNAS 78:7665) popularized the dot plot and suggested the use of a filter to reduce the noise demonstrated in Figure 1. This noise is caused by matches that have occurred by chance and are not a true reflection of the similarities between the sequences but rather reflect the limited number of bases permitted in DNA sequences. There are a wide variety of filters that can be used, indeed they are only limited by your imagination. The one suggested by Maizel and Lenk was to place a dot only when a specified proportion of a small group of successive bases match. In Figure 2 the same dot plot is reproduced with a filter such that a window of 10 bases is highlighted only if 6 of these 10 bases match. In Figure 3 the same plot is again shown with a filter of 8 out of 10 matches. Note that these plots highlight the complete window while other programs might highlight a single point centered by the window. Another common way to filter the matches is to give them a weight according to their chemical similarity (Staden 1982, Nuc Acids Res 10:2951).
The algorithm that generated these figures is included on the
attached diskette (files DOTPLOT.PAS and DOTPLOT.EXE). I do not
provide any warrantee that the enclosed programs will actually do
anything (they are provided only "as is"). The source code of the
program is provided and is written in very basic level Pascal and
compiled with Borland's Turbo Pascal. The graphic parts of this
program use Borland's Graphical Interface. This is a series of
commands that automatically detect the type of computer (within
limitations) and automatically load the correct graphical drivers.
These drivers are located in the files with the file extensions "*.BGI"
and these must be located in the directory from which the program is
run. Currently the program is limited to DNA sequences but to permit
amino acid sequences as well requires only a minor change in the
subroutine that reads the sequences and does not require any change to
the algorithm itself.
Note that the computational work involved with the generation
of these matrices can be quite time consuming. The algorithm that
is included on the attached diskette was written by me and hence
makes no use of some of the available techniques to speed the
computations. Note that if you are comparing a sequence of length
N with another sequence of length M, then the total number of
windows for which matches must be found is N x M. Hence the amount
of work increases with the square of the sequence length. This
rapidly becomes a large number. For example with N=700 and M=400,
N x M = 280,000.
There is another way in which dot plots can be generated very
quickly. This involves a computer method commonly known as
"hashing" (list-sorting). These methods are incorporated into the
FASTA algorithms. Basically, the idea is that instead of taking
the complete matrix and calculating points for every entry in that
matrix, a great saving can be made if the algorithm searches only
for exact matches. Hence, this method looks for blocks of perfect
identity. The computational complexity of this algorithm grows
linearly with increasing N.
The algorithm simply sub-divides the sequence into all "words"
of a user specified block size. The same is done for the alternate
sequence. In addition, for both sequences the location of each
word is also recorded. These arrays of "words" are then sorted
alphabetically and the arrays of locations are sorted in parallel
with the "words". Then, by comparing the sorted array from one
sequence with that from the other sequence immediately gives the
location of all identical "words".
Identity Blocks
Chimpanzee hemoglobin intergenic DNA against itself.
An algorithm which does this is included on the attached diskette
(files IDENTITY.PAS and IDENTITY.EXE). It can be used to generate the
dot plots shown in Figure 4 for identity blocks of length 5. The
rapidity of this method compared to the exact method can be
demonstrated by the dot plot shown in Figure 5 (with identity blocks of
length 6). This figure extends the sequences compared in the chimp
globin intergenic region from (1-400 vs 1-300) up to (1-4000 vs
1-3000). The length of time required by DOTPLOT to analyze the small
region is not significantly shorter than the length of time it takes
IDENTITY takes to do a 100 fold larger matrix.
Chimpanzee hemoglobin intergenic DNA against spider monkey.
The beauty of this method is demonstrated in Figure 6. This is a
plot of all identities of length 6 between the chimpanzee and spider
monkey sequences in the same region. The evolutionary homology -
relatedness between these sequences is easily discernible despite the
approx. 60 million years that separate these two groups. Further more,
this is intergenic DNA with no known function to selectively maintain
this homology (modulo an even more ancient eta-globin pseudogene). The
insertion of some DNA is easily observed within chimpanzee sequence and
then a corresponding deletion further down. These correspond to the
insertion of an Alu element in the chimp (and human and other ape)
sequences and then the presence of a truncated L1 element in the spider
monkey that is not present in the great apes. These events are
difficult to find by a simple inspection of the actual sequence code
but are readily found by a visual inspection.
A more distant similarity can be seen in Figure 7. This is a
plot of the identities of length 6 between the same region of the
chimpanzee haemoglobin intergenic region and another intergenic
region from the spider monkey. Note the similarity in the circled
region. This region of similarity corresponds to the location of
another Alu element in the chimpanzee sequence.
region vs. Spider Monkey unrelated intergenic region.
Some other interesting dotplots are comparisons of the calmodulin protein against itself and the
human epidermal growth factor against itself.
Both show internal repetitive elements. The neatest dot plot that I
have yet seen is the human zeta globin region
and if you zero in on the intergenic
region the plot
gets fantastic (try to interpret this dot plot).
The dot plots provide a ready way to visualize the sequences
being compared. They are not very useful however in providing an
actual alignment between the two sequences. To do this, other
algorithms are required.
A word on terminology. People in the field shudder when the
terms similarity and homology are used indiscriminately.
Similarity simply means that sequences are in some sense similar
and has no evolutionary connotations, whereas homology refers to
evolutionarily related sequences stemming from a common ancestor.
To have a correct alignment can be crucial to many types of
studies. It may, for example, alter from which part of a gene one
segment was duplicated, it may alter the inferred number of point
mutations, it may alter the inferred location of deletions /
insertions, alter the inferred distance between species, and may
alter the inferred phylogeny of the sequences along with whatever
evolutionary hypotheses are dependent on these phylogenies.
An explicit and precise algorithm is also required. For
example one paper in the prestigious journal NATURE stated that the
alignment
Alignments
------CCTTCAGAATACAGAATAGGGACATAGAGA ATCCCACCCAGCCCCCTGGACCTGTAT---------
was optimal in the sense that gaps were inserted to maximize the number of base matches (the base matches are underlined - this is only part of the alignment). They obviously did this alignment by eye and did not use an explicit algorithm. An alternate alignment (due to 84219728 Fitch 1984, Nature 309:410) is
CCTTCAGAATACAGAATAGGGACATAGAGA ATCCCA---CCCAGCCCCCTGGACCTGTAT
This alignment not only increases the number of base matches by 133 per cent, but also decreases the number of gaps by 50 per cent and reduces the number of gapped residues by 80 per cent. Hence, if the number of base matches can be increased by reducing the number of gaps, then clearly the original author's insertion of gaps did not maximize that number. Fitch recommends that the authors change their statement to the assertion that gaps were introduced to increase the number of base matches (rather than to maximize them). More generally this example shows the importance of i) using a well defined algorithm and ii) of using a computer based algorithm to perform these calculations. Even alignments that may appear simple and straightforward, if given to the computer, might yield alternatives that you did not consider.
The most basic algorithm to align two sequences was developed by S.A. Needleman and C.D. Wunsch (1970, J. Mol. Biol. 48:443). The algorithm is a simple and beautiful way to find an alignment that maximizes a particular score. (The score can be calculated in a variety of methods - as will be indicated below). The overall method is reminiscent of the dot plot (though this was not popular in the 1970's and so there is probably no connection). The first step is to place the two sequences along the margins of a matrix as shown in Table 1.
In this first step, simply place a 1 anywhere the two sequences match and a 0 elsewhere. If done on a larger scale than is shown in Table 1, this would exactly recreate the dot plot shown in Figure 1. In this case however, we wish to find a path through this matrix which would define a more conventional alignment. For example, proceeding along the diagonal with no deviations would imply an alignment without any gaps. The introduction of a gap (either by an insertion or a deletion - an indel) in either sequence would correspond to moving either above or below the main diagonal.
To find the best route, Needleman and Wunsch suggested that you modify the matrix to represent this idea of tracing different pathways through the matrix. However, you want to include all possible pathways and from among these choose only that one which is best (in the sense of maximizing some score). Their method consists of two passes through the matrix. The first traces a score for all possible routes and moves right to left, bottom to top. Once the score for all possible routes are found, the maximum can be chosen (it will be somewhere on the topmost row or leftmost column) and a second pass can be carried out, this time running left to right, top to bottom to find that alignment that gives the maximum score.

The way to trace a score for all possible paths is shown in Table 2. For each element in the matrix you perform the following operation.

where k is any integer larger than i and l is any integer larger than j. In words, alter the matrix by adding to each element the largest element from the row just below and to the right of that element and from the column just to the right and below the element of interest. This row and column for one element are shown in Table 2 by boxes. The number contained in each cell of the matrix, after this operation is completed, is the largest number of identical pairs that can be found if that element is the origin for a pathway which proceeds to the upper left.

We wish to have an alignment which covers the entire sequence. Hence, we can find on the upper row or on the left column the element of the matrix with maximum value. An alignment must begin at this point and can then proceed to the lower right. This is the second pass through the matrix. At each step of this pass, starting from the maximum, one moves one row and column to the lower right and finds the maximum in this row or column. The alignment must proceed through this point.
Continuing in this fashion one eventually hits either the bottom row or the rightmost column and the alignment is finished. This tracing pattern is shown in Table 3. Note that in this case the optimal alignment is not unique. There are two alignments and both give the optimal score of 8 matches. These are ...

These two alignments can be written in more familiar form as either
ABCNJ-RQCLCR-PM
* * * * * ** *
AJC-JNR-CKCRBP-
or as
ABC-NJRQCLCR-PM
* * * * * ** *
AJCJN-R-CKCRBP-
both with 8 asterisks to denote the 8 matches. Note that in this
particular case, gaps are given the same penalty as a mismatch.
They simply do not add to the score.
A program has been provided for you on the enclosed diskette to carry out a Needleman-Wunsch type of alignment. The program is called NWALIGN. It is severely limited by the size of arrays permitted and hence is limited to sequences of length 100 residues(hence it is better for protein rather than DNA sequences). The code is provided and this limitation can be removed. Also provided in this subdirectory is a program ALIGN (this is a PC version of ALIGN1 - see below). While NWALIGN is less capable than ALIGN, it will give you a better idea of what is happening in a local region of any alignment that you want. Importantly, NWALIGN lets you change the way in which the alignment is done by permitting you to give different weights to different types of mismatches and matches.
The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to take all of one sequence and align it with all of a second sequence. Short and highly similar subsequences may be missed in the alignment because they are outweighed by the rest of the sequence. Hence, one would like to create a locally optimal alignment. The Smith-Waterman (1981, J.Mol.Biol. 147:195-197) finds an alignment that determines the longest/best subsequence pair that give the maximum degree of similarity between the two original sequences. This means that not all of the sequences might be aligned together.
Only minimal changes to the Needleman-Wunsch algorithm are required. These are
The first point is required to cause the score to drop as more and more mismatches are added. Hence, the score will rise in a region of high similarity and then fall outside of this region. If there are two segments of high similarity then these must be close enough to a path between them to be linked by a gap or they will be left as independent segments of local similarity.
The second point is so that each pathway begins fresh at its beginning. Thus each short segment of similarity should begin with a score of zero. The third point indicates that the entire matrix must be searched for regions with high local similarity.
Again, for each element in the matrix you perform the following operation.

But in this case it is easier to go left to right, top to bottom in the matrix - so here k is any integer smaller than i and l is any integer smaller than j. Also, for a local alignment Mi,j must be some negative penalty if residue i is not the same as residue j.
As an example the previous alignment can be reproduced with a penalty of -0.5 for each mismatch. The matrix will then be as given in Table 4.

In this case the same alignment is found. However, the Smith-Waterman algorithm does not include the final M/P mismatch in its path as it is not part of the locally optimal solution. More generally, large chucks of each sequence may be missing from the local alignment (as in the alignment presented by BLAST).
It is not always the case that these two methods give the same answer. For example, try a global and a local alignment of TTGACACCCTCCCAATTGTA versus ACCCCAGGCTTTACACAT.
If the sequences are not known to be homologous throughout their entire length, a local alignment should be the method of choice. Often the two methods will give similar answers (as above) but if the homology is distant, a local alignment will be more likely to find the remaining patches of homology.
These algorithms are trying to find the best way to match up two sequences. This does not mean that they will find anything profound. For example, if I take these two sentences (and delete spaces), they can be aligned.
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
Depending on your intuition you may or may not think this to be a pretty good alignment particularly at the NH2 terminus. There are a total of 12 exact matches and 14 conservative substitutions. But there is obviously no homology between these two "protein"sequences. How do we test whether or not an alignment is significant?
As a more biological example, consider the alignment of human alpha haemoglobin and human myoglobin. If you remember your basic biology, you should remember that these two proteins do similar functions of transporting oxygen in the blood and muscle respectively. But are they evolutionarily related? An alignment of the two looks like ...
Human alpha haemoglobin (141 aa) vs. Human myoglobin (153 aa)
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQ
:: .. : ..::::.:. ..:.:.: :.: . :.: . : .: .:. ..:..
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
.: :: .: .::.. . . .. .....:.. :: : .. ....:.:.. .:... :
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP
AEFTPAVHASLDKFLASVSTVLTSKYR------
..:.........: :. .. ..:.:.
GDFGADAQGAMNKALELFRKDMASNYKELGFQG
Again looks like a reasonably good alignment. Or how about chicken lysozyme and bovine ribonuclease. An alignment of these gives
Chicken lysozyme (129 aa) vs. Bovine ribonuclease (124 aa)
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS
: . :: ..:. .:. . . .. :.....:. :.. . ... .. .. ....
KETA----AAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQA
RWWCNDGRTP--GSRNLCNIPCSALLSSDITASVNCAKKIVSDGDGMNAWVAWRNRCKGT
:.. ... .... : ..:.. .: .. ...: .. .. .: :.:.
V--CSQKNVACKNGQTNCYQSYSTMSITDCRET-GSSKYPNCAYKTTQANKHIIVACEGN
DVQAWIRGCRL
. . ..
PYVPVHFDASV
Again a reasonable alignment (?).
A common test to determine if the alignment of two sequences is statistically significant is to carry out a jumble or permutation test. This consists of
Doing this say 10000 times, gives a distribution of alignment scores that could be expected for random sequences with a similar amino acid content. If the actual alignment has a score much higher than that of the permuted sequences, then you know that they must be homologous to some extent.
A plot of 10,000 alignment scores for the human myoglobin and human alpha haemoglobin sequences are shown below.
The permuted scores range from 14 to 75 but most are less than 50.
Also note the skewness of the distribution - statistics based on a
normal distribution would be strongly biased. The skew is expected
since in each case the alignment algorithm is trying to maximize
the score. The score for the alignment of the two actual sequences
is 179 (indicated by the arrow). Obviously, myoglobin and
haemoglobin are evolutionarily related and still retain many
features of their homology. The significance of this alignment is
< 0.0001.
A plot of 10,000 alignment scores for the chick lysozyme and
bovine ribonuclease sequences gives ...
Again note, the skew and note that this "random" distribution is
somewhat different from the haemoglobin "random" distribution.
This is due to the differential effects of amino acid composition
in these proteins. The permuted scores range from 14 to 72. The
actual score for the proteins is 30 (indicated by the arrow).
Obviously whatever homology that once existed between these
proteins has been completely destroyed by time.
These two examples are clear cut. There is a large grey area
where the tests may be uncertain of the degree of homology between
sequences. For protein sequences Doolittle's rule of thumb is that
greater than 25% identity will suggest homology, less than 15% is
doubtful and for those cases between 15-25% identity a strong
statistical argument is required. Personally, I would prefer the
statistical test in all cases, since things like repeats and
unusual amino acid compositions can sometimes confuse the picture.
Note that gaps were rather freely permitted in the Needleman-Wunsch and
Smith-Waterman alignments. What would happen if you feel that gaps
should be rarer events in your particular protein? It is possible to
assign a different weight to gaps. This can be done by subtracting
from the score, some predetermined value every time a gap is required.
In this algorithm you can define a weight as
where k is the length of the gap. Hence you can control whether
many short gaps occur or whether long gaps occur but more
infrequently. Deletions do occur but when they occur it is
usually not in many small, short deletions but rather fewer and
longer deletions.
How do you choose a gap penalty? Unfortunately, there is little
knowledge to help here. Most of the tests done so far work from an
empirical basis designed to achieve some end. For example, Smith and
Fitch have derived (by exhaustive search) gap penalties that will best
align distantly related haemoglobin genes. But there is no guarantee
that these values would work well for the protein or (worse) the
nucleotide sequence that you are interested in. Typical values are
0.05 < b < 1.0
but there is nothing special about these values other than the fact
that they seem to work well for some of the common comparisons. Note
that in general a > b. This corresponds with biological knowledge of
how gaps are generated - it is easier to generate one gap of two
residues rather than two gaps of one residue since the former can be
created by a single mutational event.
Many programs provide their own gap penalties and the user does not
have the opportunity to alter these. This is the case for program
ALIGN1 on the diskette. In general, the authors of these algorithms
have tested them and find the best empirical values for their
particular implementation.
In a series of papers Thorne, Kishino and Felsenstein (JME 33:114, 1991), and (JME 34:3, 1992) have developed a
method to find maximum likelihood estimates of the gap penalties (and
transition/transversion rates) while doing an alignment. This
eliminates the necessity for choosing an arbitrary gap penalty.
They develop a maximum likelihood method to examine the possible paths
of descent from a common ancestor for two sequences. The creation of
gaps is modeled as a birth - death process with separate parameters
for birth rate and death rate. The model then finds the likelihood of
particular paths through the matrix given the transition parameters.
It then examines alternative parameters and chooses that path and
parameter set with the highest likelihood.
The big difficulty with this method is the enormous computer time
required to carry out the calculations.
A related question is the assignment of weights to individual
differences in nucleotide or protein sequences (more on this later).
There have also been advances in methods to try to find statistically
bounded sets of alignments. That is, the set of alignments that are
within 95% confidence limits of some best answer.
Again another fertile area were many significant improvements are being
made.
It is important to realize that an optimal alignment is optimal only
for the particular values chosen for the mismatch and gap weights.
When any of these are altered, the optimal alignment will also change.
Also be aware of the fact that nature is seldom mathematically
optimized. Fitch and Smith (1983, PNAS 80:1382) have derived a set of
"rules of thumb" a subset of which are given in Table 5. Even with the
very best programs it still requires some degree of experience to draw
the right conclusions from the results produced and a good grasp of the
biology of the problem is as essential as one's familiarity with
computers.
Gaps and Indels

0.5 < a < 5.0
"Natural" Gap Weights - Thorne, Kishino & Felsenstein
| Some rules of thumb for alignments | |
|---|---|
| 1. | A gap and its length are distinct quantities. Different weights should be applied to each. |
| 2. | Weights for different mismatches should be permitted. A transition is more likely than a transversion; a Ile-Val more likely than Ile-Arg change. |
| 3. | If the two sequences have no obvious relationship at their right and left ends, then end gaps should not be penalized. |
| 4. | Unless two sequences are known to be homologous over their entire length, a local alignment is preferable to a global alignment. |
| 5. | An optimal alignment is by no means necessarily statistically significant. One must make some estimate of the probability that a given alignment is due to chance. |
| 6. | An alignment demonstrates similarity, not necessarily, homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance, convergence or analogy. |
There are two alignment programs on the diskette. These are ALIGN1 and ALIGN2. The former is a better, more rigorous and a more recent algorithm but is slower. The latter is an older and less sensitive program but a more rapid algorithm because it makes greater use of simple hash techniques.
This command can be run by simply typing
ALIGN1
This algorithm is due to E.W. Myers and W. Miller (1988, CABIOS 4:11). Basically it is a speeded and significantly improved implementation of a Gotoh's (1982, J. Mol.Biol. 162:705) method for aligning sequences. The program uses a PAM matrix for protein sequences. Again a ":" indicates an exact match while a "." indicates a conservative replacement. After typing "align1", type in the names of the files containing the two sequences and wait a few minutes. To save the output remember to use the redirection abilities of DOS.
This algorithm uses a mixture of a Smith-Waterman type algorithm and pure hashing to provide a method that is very rapid but not too sensitive. To run the program type
ALIGN2
The program will prompt you for the information it requires. The K-tuple requested by the program provides "words" of identity that are used to search for the likeliest regions of similarity. The window parameter defines the distances between words that are searched for the best local alignments. A complete description of the program can be found in Wilbur and Lipman (1983, PNAS 80:726). This program was distributed with the older versions of the FASTP program packages.
Conceptually, there is no reason why a Needleman-Wunsch algorithm can
not be performed with more than two sequences. The matrix simply
becomes multi-dimensional and the algorithm would work successively
through each dimension. There are however, significant practical
problems with this approach. In this case instead of growing as an
problem, the
computational time will grow as
, where m is the number of species. Hence, even for
only 100 nucleotides from 5 species, this is
or the equivalent of doing an
alignment for two sequences each 100,000 nucleotides long.
Obviously different methods need to be employed. In general these
require more assumptions and are not as precise nor
"all-encompassing" as the Needleman-Wunsch or Smith-Waterman
algorithm.
If desired, simple scores of similarity can be readily found using "hashing" techniques. These can operate on many sequences simultaneously and rapidly. But to take these local regions of identity and turn out an alignment is somewhat more complicated.
Bains (1986, Nuc. Acids Res. 14:159) suggested an iterative method which involves successive applications of the standard algorithms. It begins with a trial consensus alignment (say the alignment between sequences 1 and 2). Then the third sequence is aligned against the consensus sequence and a new consensus emerges. This continues until the consensus alignment converges to a global consensus.
One of the most popular multiple alignment programs works on this principle and is called Clustal. It was written by Higgins and Sharp (1989, CABIOS 5:151). It is able to handle 100 sequences up to a maximum length of 5000 residues. The alignments are done in three steps. In the first step, all pairwise similarity scores are calculated. As I understand it, this is done using hashing and a constant gap penalty. The second step is to create a similarity matrix and then to cluster the sequences based on this similarity using UPGMA cluster analysis (see the next section). The final step is the multiple alignment step. This is performed by gradually aligning groups of sequences, according to their branching order in the dendrogram. The algorithm used attempts to minimize the distance (via the similarity score) between groups of sequences. At this stage variable gap penalties are permitted.
Other methods make use of a multiple dimensional dot plot and then look for dots that are common to each group (Vingron & Argos 1991 J.Mol.Biol. 218:33-43). Still others rely heavily on user input as does the popular windows program MACAW ( Schuler, Altschul & Lipman, 1991 Proteins Struct. Func. Genet. 9: 180-190). This whole area is ripe for major theoretical advances and for the creation of better interface programs.
One obvious extension of these algorithms is to construct an alignment and a phylogeny for sequences all at the same time. This is because the alignment will affect distances between sequences and this will affect the inferred phylogeny. Similarly a different phylogeny will imply a different alignment of the sequences. Now you are talking about a real problem! Never-the-less, some progress has been made in this area. Jotun Hein has come up with a program TREEALIGN which will do exactly this. It is a complicated program not suited for PC's (but hey a 586 is powerful machine). It is available in the list from EMBL software given in the last section. This also leads us into the next section on phylogenies.
_______________________________