
Dot Plots

Sequences can be compared in many different ways. The most direct is to compare them visually, and this is what "dot plots" attempt to do. Dot plots are a group of methods that visually compare two sequences and look for regions of close similarity between them.

The Exact Way

The sequences to be compared are arranged along the margins of a matrix. A dot is placed at every point in the matrix where the two sequences are identical (i.e. at the intersection of every row and column that carry the same letter). A diagonal stretch of dots indicates a region where the two sequences are similar. Done in this fashion, a dot plot like the unfiltered one in the first figure below is obtained. This is a dot plot of the globin intergenic region in chimpanzees plotted against itself (bases 1 to 400 vs. 1 to 300). The solid line on the main diagonal reflects the fact that every base of the sequence is trivially identical to itself. As can be seen, this dot plot is not very useful unless applied to protein sequences (where the background is much less dense); however, some statistical methods can still be applied to the results (Gibbs and McIntyre 1970).
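The construction just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used to draw the figures; the function names `exact_dot_plot` and `render` and the toy sequence are assumptions made for the example.

```python
def exact_dot_plot(seq_a, seq_b):
    """Return a matrix (list of rows) that is True wherever seq_a[i] == seq_b[j].

    Rows follow seq_a and columns follow seq_b, as on the margins of the matrix.
    """
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# A toy sequence plotted against itself (hypothetical data, not the globin region).
toy = "GATTACA"
print(render(exact_dot_plot(toy, toy)))
```

Even on this seven-letter example the main diagonal is solid and chance matches appear off the diagonal, which is the background noise discussed next.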

Maizel and Lenk (1981) popularized the dot plot and suggested the use of a filter to reduce the noise seen in the unfiltered figure below. This noise is caused by matches that have occurred by chance. Because only four different nucleotides are possible, a nucleotide will often match nucleotides elsewhere in the sequence without any homology being present. Such matches do not reflect true similarity between the sequences but only the limited number of bases permitted in DNA. A wide variety of filters can be used; indeed, they are limited only by your imagination. The one suggested by Maizel and Lenk was to place a dot only when a specified proportion of a small group of successive bases match. In the second figure below, the same dot plot is reproduced with a filter such that a window of 10 bases is highlighted only if at least 6 of these 10 bases match. In the third figure, the same plot is shown again with a filter of 8 out of 10 matches. Note that these plots highlight the complete window, while other programs might highlight a single point at the center of the window. Another common way to filter the matches is to give them a weight according to their chemical similarity (Staden 1982, Nuc. Acids Res. 10:2951).
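A minimal sketch of a window filter of this kind follows. It assumes the window runs along a diagonal (base i+t of one sequence against base j+t of the other) and that a window passes when at least `min_matches` of its `window` positions agree. The function name and parameters are hypothetical, and the sketch returns window start positions rather than highlighting whole windows as the figures do.

```python
def filtered_dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Return (i, j) start positions of windows that pass the filter.

    A window starting at i in seq_a and j in seq_b compares seq_a[i + t]
    with seq_b[j + t] for t = 0 .. window - 1 and passes when at least
    min_matches of those positions are identical.
    """
    hits = []
    # range() is empty when a sequence is shorter than the window.
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + t] == seq_b[j + t] for t in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


# Two toy 10-mers that differ at a single position (hypothetical data).
a = "ACGTACGTAC"
b = "ACGTTCGTAC"
print(filtered_dot_plot(a, b, window=10, min_matches=8))
```

This direct version recounts every window from scratch, so it does about `window` comparisons per cell; a faster variant could slide along each diagonal and update the count incrementally.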

Figure: Dot plot - without filtering.

Figure: Dot plot - filtered 6 of 10.

Figure: Dot plot - filtered 8 of 10.

Generating these matrices can be quite time consuming. If you are comparing a sequence of length N with another sequence of length M, then the total number of windows for which matches must be calculated is approximately N × M. Hence, for sequences of similar length, the amount of work increases with the square of the sequence length. This rapidly becomes a large number. For example, with N=700 and M=400, N × M = 280,000.
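The arithmetic can be checked with a short sketch. The helper `window_pairs` is a hypothetical name; it assumes that a window of length w can start at N - w + 1 positions in a sequence of length N, which is why the count is slightly below N × M for windows longer than one base.

```python
def window_pairs(n, m, window=1):
    """Number of window pairs to score for sequences of length n and m."""
    return max(n - window + 1, 0) * max(m - window + 1, 0)


print(window_pairs(700, 400))      # single-base comparisons: 280000
print(window_pairs(700, 400, 10))  # windows of 10 bases: 270181
print(window_pairs(7000, 4000))    # ten-fold longer sequences: 28000000
```

Lengthening both sequences ten-fold multiplies the work by one hundred, which is the quadratic growth described above.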

Identity Blocks

There is another way in which dot plots can be generated very quickly. This involves a computer method commonly known as "hashing" (list-sorting). As mentioned previously, these methods are incorporated into the FASTA algorithms. The idea is that, instead of calculating a point for every entry in the complete matrix, a great saving can be made if the algorithm searches only for exact matches. Hence, this method looks only for blocks of perfect identity. The computational cost of this algorithm grows roughly linearly with increasing N.

The algorithm simply subdivides the sequence into all "words" of a user-specified block size. The same is done for the other sequence. For both sequences, the location of each word is also recorded. These arrays of "words" are then sorted alphabetically, and the arrays of locations are sorted in parallel with the "words". Comparing the sorted array from one sequence with that from the other then immediately gives the locations of all identical "words".
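The steps above can be sketched as follows. This is an illustrative reading of the description, not the code of FASTA or of the program used for the figures; the names `sorted_words` and `identity_blocks` are assumptions. Each word is stored together with its location, so sorting the pairs keeps the locations in step with the words.

```python
def sorted_words(seq, k):
    """All words of length k in seq, each paired with its location, sorted."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def identity_blocks(seq_a, seq_b, k):
    """Return (pos_a, pos_b) for every word of length k shared by both sequences."""
    words_a = sorted_words(seq_a, k)
    words_b = sorted_words(seq_b, k)
    hits = []
    i = j = 0
    # Walk both sorted lists in step, as when merging two sorted arrays.
    while i < len(words_a) and j < len(words_b):
        word_a, word_b = words_a[i][0], words_b[j][0]
        if word_a < word_b:
            i += 1
        elif word_a > word_b:
            j += 1
        else:
            # Find the run of this word in each list; a word may occur many times.
            i_end = i
            while i_end < len(words_a) and words_a[i_end][0] == word_a:
                i_end += 1
            j_end = j
            while j_end < len(words_b) and words_b[j_end][0] == word_a:
                j_end += 1
            for _, pos_a in words_a[i:i_end]:
                for _, pos_b in words_b[j:j_end]:
                    hits.append((pos_a, pos_b))
            i, j = i_end, j_end
    return hits


# Toy sequences (hypothetical data) sharing the words TTA, TAC and ACA.
print(sorted(identity_blocks("GATTACA", "TTACAGA", 3)))
```

In this sketch the sort contributes a logarithmic factor, and highly repeated words can produce many dots, so the running time is close to linear only when most words are rare. A dictionary keyed by word would avoid the sort while finding the same matches.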

Figure: Identity dot plot.

Figure: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against itself.

An algorithm that does this was used to generate the dot plot shown in the first of the two figures above, for identity blocks of length 5. The speed of this method compared to the exact method is demonstrated by the second figure above (identity blocks of length 6). That figure extends the sequences compared in the chimpanzee globin intergenic region from (1-400 vs 1-300) up to (1-4000 vs 1-3000). Plotting the small region by the exact method takes not much less time than calculating short identities on this 100-fold larger matrix.

Figure: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against spider monkey.

The beauty of this method is demonstrated in the figure above. This is a plot of all identities of length 6 between the chimpanzee and spider monkey sequences in the same region. The evolutionary homology between these sequences is easily discernible from the solid lines along the main diagonal, despite the approx. 60 million years that separate these two groups. Furthermore, this is intergenic DNA with no known function to selectively maintain this homology (modulo an even more ancient eta-globin pseudogene). An insertion of some DNA is easily observed within the chimpanzee sequence, and then a corresponding deletion further down. These correspond to the insertion of an Alu element in the chimpanzee (and human and other ape) sequences (at approx. bp 1000) and to the presence of a truncated L1 element in the spider monkey (inserted at approx. bp 2600) that is not present in the great apes. These events are difficult to find by simple inspection of the actual sequence but are readily found by visual inspection of the plot.

A more distant similarity can be seen in the figure below. This is a plot of the identities of length 6 between the same chimpanzee hemoglobin intergenic region and another, unrelated intergenic region from the spider monkey. Note the similarity (the short diagonal line) in the circled region. This region of similarity corresponds to the location of another Alu element in the chimpanzee sequence.

Figure: Identity dot plot. Chimpanzee hemoglobin intergenic region vs. spider monkey unrelated intergenic region.

There are many programs freely available to make dot plots. One which is particularly fast and interactive is the dotter program. Some other interesting dot plots are comparisons of the calmodulin protein against itself and of the human epidermal growth factor against itself. Both show internal repetitive elements. The neatest dot plot that I have yet seen is of the human zeta globin region, and if you zero in on the intergenic region the plot becomes fantastic (try to interpret this dot plot).

