Sequences can be compared in many different ways. The most direct is a visual comparison, and this is what ``dot plots'' attempt to provide. Dot plots are a group of methods that visually compare two sequences and look for regions of close similarity between them.
The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two sequences are identical a dot is placed (i.e. at the intersection of every row and column that have the same letter in both sequences). A diagonal stretch of dots indicates a region where the two sequences are similar. Done in this fashion, a dot plot like the unfiltered one in the figures below is obtained. This is a dot plot of the globin intergenic region in chimpanzees plotted against itself (bases 1 to 400 vs. 1 to 300). The solid line on the main diagonal reflects the fact that every base of the sequence is trivially identical to itself.
As can be seen, an unfiltered dot plot of this kind is not very useful unless applied to protein sequences (where the background is much less dense); however, some statistical methods can still be applied to the results (Gibbs and McIntyre 1970).
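The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_matrix` and `render` and the toy sequence are assumptions made here, not part of any published program.

```python
def dot_matrix(seq_a, seq_b):
    """Return a matrix m where m[i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Toy sequence (hypothetical), plotted against itself.
seq = "GATTACA"
plot = dot_matrix(seq, seq)
print(render(plot))
```

Even on this short sequence, the main diagonal is fully dotted and every repeated letter adds off-diagonal dots, which is the background noise discussed in the text.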
Maizel and Lenk (1981) popularized the dot plot and suggested the use of a filter to reduce the noise seen in the unfiltered plot. This noise is caused by matches that have occurred by chance. Because only four different nucleotides are possible, a nucleotide will often match nucleotides elsewhere in the sequence without any homology being present. Such matches are not a true reflection of the similarities between the sequences but rather reflect the limited number of bases permitted in DNA sequences. A wide variety of filters can be used; indeed, they are limited only by your imagination. The one suggested by Maizel and Lenk was to place a dot only when a specified proportion of a small group of successive bases match. In the second figure below, the same dot plot is reproduced with a filter such that a window of 10 bases is highlighted only if at least 6 of these 10 bases match. In the third figure, the same plot is shown again with a filter of 8 out of 10 matches. Note that these plots highlight the complete window, while other programs might highlight a single point at the center of the window. Another common way to filter the matches is to give them a weight according to their chemical similarity (Staden 1982, Nuc. Acids Res. 10:2951).
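A window filter of the kind described above might be sketched as follows. This is a minimal sketch, not the Maizel and Lenk program: the function name `filtered_dot_plot`, its parameters and the toy sequences are assumptions for illustration. It highlights the complete window, as the plots in this section do.

```python
def filtered_dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Mark every cell of a diagonal window when at least `min_matches`
    of its `window` aligned positions are identical."""
    n, m = len(seq_a), len(seq_b)
    plot = [[False] * m for _ in range(n)]
    for i in range(n - window + 1):
        for j in range(m - window + 1):
            # Count identities along the diagonal starting at (i, j).
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                # Highlight the complete window, not only its center.
                for k in range(window):
                    plot[i + k][j + k] = True
    return plot


# Toy sequences (hypothetical) that differ at positions 4 and 11.
seq_a = "ACGTACGTACGT"
seq_b = "ACGTTCGTACGA"
plot = filtered_dot_plot(seq_a, seq_b, window=10, min_matches=8)
for row in plot:
    print("".join("*" if cell else "." for cell in row))
```

Because whole windows are highlighted, a mismatched position can still be marked when it lies inside a window that passes the threshold. A program that marks only the window center would draw a thinner line.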
Figure: Dot Plot - without filtering.
Figure: Dot Plot - filtered 6 of 10.
Figure: Dot Plot - filtered 8 of 10.
The computational work involved with the generation of these matrices can be quite time consuming. If you are comparing a sequence of length N with another sequence of length M, then the total number of windows for which matches must be calculated is roughly $N \times M$. Hence the amount of work increases with the square of the sequence length. This rapidly becomes a large number. For example, with N=700 and M=400, $N \times M = 280{,}000$.
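As a rough check on this count, assume a window of length $w$ (a symbol introduced here only for illustration) that must fit entirely inside both sequences. The number of window positions is then slightly below the simple product of the lengths, for example with $w = 10$:

```latex
\[
(N - w + 1)(M - w + 1) = (700 - 10 + 1)(400 - 10 + 1) = 691 \times 391 = 270{,}181 \approx N \times M = 280{,}000 .
\]
```

Each of these windows also requires $w$ base comparisons if it is evaluated naively, so the total work is larger still.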
There is another way in which dot plots can be generated very quickly. This involves a computer method commonly known as ``hashing'' (list-sorting). As mentioned previously, these methods are incorporated into the FASTA algorithms. Basically, the idea is that instead of taking the complete matrix and calculating points for every entry in that matrix, a great saving can be made if the algorithm searches only for exact matches. Hence, this method looks only for blocks of perfect identity. The computational complexity of this algorithm grows roughly linearly with increasing N (the sorting step adds only a logarithmic factor).
The algorithm simply subdivides the sequence into all ``words'' of a user-specified block size. The same is done for the other sequence. In addition, for both sequences the location of each word is recorded. These arrays of ``words'' are then sorted alphabetically, and the arrays of locations are sorted in parallel with the ``words''. Comparing the sorted array from one sequence with that from the other then immediately gives the locations of all identical ``words''.
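The word-sorting steps just described can be sketched as below. This is an illustrative sketch under stated assumptions, not the FASTA implementation: the names `sorted_words` and `identical_words` are hypothetical, and each word is kept together with its location in one tuple so that a single sort keeps words and locations in parallel.

```python
def sorted_words(seq, k):
    """All (word, position) pairs for words of length k, sorted by word."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def identical_words(seq_a, seq_b, k):
    """Return sorted (i, j) pairs with seq_a[i:i+k] == seq_b[j:j+k],
    found by walking the two sorted word lists side by side."""
    words_a = sorted_words(seq_a, k)
    words_b = sorted_words(seq_b, k)
    hits = []
    ia = ib = 0
    while ia < len(words_a) and ib < len(words_b):
        wa, wb = words_a[ia][0], words_b[ib][0]
        if wa < wb:
            ia += 1
        elif wa > wb:
            ib += 1
        else:
            # Find the run of this word in each list and pair all locations.
            ea = ia
            while ea < len(words_a) and words_a[ea][0] == wa:
                ea += 1
            eb = ib
            while eb < len(words_b) and words_b[eb][0] == wa:
                eb += 1
            hits.extend(
                (pa, pb)
                for _, pa in words_a[ia:ea]
                for _, pb in words_b[ib:eb]
            )
            ia, ib = ea, eb
    return sorted(hits)


# Toy sequences (hypothetical) sharing the stretch "TTACA".
hits = identical_words("GATTACA", "TTACAGA", 3)
print(hits)  # [(2, 0), (3, 1), (4, 2)]
```

Each hit is the start of one identical word, so consecutive hits on the same diagonal (here the positions differ by 2 throughout) form the diagonal lines seen in the identity plots.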
Figure: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against itself.
An algorithm that does this can be used to generate dot plots like those shown in the figures, for example for identity blocks of length 5. The rapidity of this method compared to the exact method can be demonstrated by the dot plot of the chimpanzee sequence against itself (with identity blocks of length 6). This figure extends the sequences compared in the chimpanzee globin intergenic region from (1-400 vs 1-300) up to (1-4000 vs 1-3000). The time required for a plot of the small region is not significantly shorter than the time it takes to calculate short identities on a 100-fold larger matrix.
Figure: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against spider monkey.
The beauty of this method is demonstrated in the plot of the chimpanzee sequence against the spider monkey sequence. This is a plot of all identities of length 6 between the chimpanzee and spider monkey sequences in the same region. The evolutionary homology between these sequences is easily discernible from the solid lines along the main diagonal, despite the approx. 60 million years that separate these two groups. Furthermore, this is intergenic DNA with no known function to selectively maintain this homology (modulo an even more ancient eta-globin pseudogene). The insertion of some DNA is easily observed within the chimpanzee sequence, and then a corresponding deletion further down. These correspond to the insertion of an Alu element in the chimpanzee (and human and other ape) sequences (at approx. bp 1000) and then the presence of a truncated L1 element in the spider monkey (inserted at approx. bp 2600) that is not present in the great apes. These events are difficult to find by a simple inspection of the actual sequence code but are readily found by a visual inspection.
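The way an insertion shifts the diagonal can also be seen numerically by counting word identities per diagonal offset. The sketch below is only an illustration under stated assumptions: the function name `diagonal_counts` and the toy sequences are hypothetical, and for brevity it looks words up in a dictionary rather than sorting the word lists as described earlier.

```python
from collections import Counter, defaultdict


def diagonal_counts(seq_a, seq_b, k=6):
    """Count identical words of length k on each diagonal (offset i - j)."""
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        # .get avoids creating empty entries for words absent from seq_b.
        for j in index.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


# Toy data (hypothetical): one sequence carries an 8 base insertion.
left, right, inserted = "ACGGTCATTGCA", "TTAGCCGATACG", "GGGGGGGG"
seq_without = left + right
seq_with_insert = left + inserted + right
counts = diagonal_counts(seq_with_insert, seq_without, k=6)
print(counts.most_common(2))
```

Words before the insertion fall on offset 0 and words after it fall on an offset equal to the insertion length, which is the jump from one diagonal to another that the eye picks out in the plot.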
A more distant similarity can be seen in the plot of the chimpanzee region against an unrelated spider monkey intergenic region. This is a plot of the identities of length 6 between the same chimpanzee hemoglobin intergenic region and another intergenic region from the spider monkey. Note the similarity (the short diagonal line) in the circled region. This region of similarity corresponds to the location of another Alu element in the chimpanzee sequence.
Figure: Identity dot plot. Chimpanzee hemoglobin intergenic region
vs. Spider Monkey unrelated intergenic region.
There are many programs freely available to make dot plots. One which is particularly fast and interactive is the dotter program. Some other interesting dot plots are comparisons of the calmodulin protein against itself and the human epidermal growth factor against itself. Both show internal repetitive elements. The neatest dot plot that I have yet seen is the human zeta globin region and if you zero in on the intergenic region the plot becomes fantastic (try to interpret this dot plot).