Answers are due in hard copy (paper version) to be put into the drop
box on the first floor of the Life Science Building (number 75-78)
by Thursday Nov 9, 5:00pm. Assignments will not be accepted after
this time and the boxes will not be checked again until the following
Wednesday.
----------------------------------------------------------------------
1. In lectures the K2p distance was given when P=2/100, Q=20/100 and
when P=11/100, Q=11/100.
a) Calculate the distance for the symmetrical result, P=20/100 and
Q=2/100. Is the distance estimate the same as when P=2/100,
Q=20/100? If not, why is it larger/smaller? If so, why did you
observe this value?
# No. With P=0.2, Q=0.02 the K2p distance is 0.2826, while for P=0.02,
# Q=0.2 it was 0.2649. It is larger in this case because transversions
# (Q) are expected to be more common than transitions (P) when
# substitutions occur at random, so observing a larger number of
# transitions implies that many more substitutions are "hidden" within
# the transitions.
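# The two values can be checked with a minimal sketch of the K2p
# formula, d = -(1/2)ln(1-2P-Q) - (1/4)ln(1-2Q). The function name
# k2p_distance is only illustrative and is not part of dnadist.

```python
import math


def k2p_distance(P, Q):
    """Kimura two-parameter distance from the observed proportion of
    transitions (P) and transversions (Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


print(round(k2p_distance(0.20, 0.02), 4))  # 0.2826
print(round(k2p_distance(0.02, 0.20), 4))  # 0.2649
```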
b) At what values of P,Q would the K2p distance be equal to the
JC69 distance? Explain, in words, why these values are
appropriate.
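# If transitions and transversions occur at random, you expect to see
# twice as many transversions as transitions, because each nucleotide
# can change to one transition partner and two transversion partners.
# In this case P = 0.5 Q, so
#   -(1/2)ln(1-2P-Q) - (1/4)ln(1-2Q)
#       = -(1/2)ln(1-2Q) - (1/4)ln(1-2Q)
#       = -(3/4)ln(1-2Q)
#       = -(3/4)ln(1-(4/3)D)
# where D = P+Q = (3/2)Q is the total proportion of differences. The
# last line is the JC69 distance.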
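# As a numerical check of this equality, the sketch below evaluates
# both formulas for a few values of Q with P = 0.5 Q. The function
# names and the chosen Q values are illustrative assumptions.

```python
import math


def k2p_distance(P, Q):
    """K2p distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


def jc69_distance(D):
    """JC69 distance from the total proportion of differences D."""
    return -0.75 * math.log(1 - 4 * D / 3)


for Q in (0.02, 0.10, 0.20):
    P = 0.5 * Q  # twice as many transversions as transitions
    print(Q, k2p_distance(P, Q), jc69_distance(P + Q))
```

# Each row should show the same value (up to rounding error) in the
# last two columns.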
----------------------------------------------------------------------
2. Here are four sequences in fasta format with increasing levels of
difference between them. Each is twenty bp long.
> Seq1
TATATATATATATATATATA
> Seq2
AATATATATATATATATATA
> Seq3
ATATAATATATATATATATA
> Seq4
CTCTCTATATTATATATATA
Compare just the pairwise distance of Seq1-Seq2, Seq1-Seq3, Seq1-Seq4.
Find the Hamming distance (by hand) and use the program dnadist to
calculate the JC and K2p distances. Lastly, assuming that alpha=1.0,
also calculate the JC distance with a gamma correction.
For which pair of sequences (1-2, 1-3, or 1-4) do the different methods
make the biggest difference in estimates of genetic distance?
# Pair  Hamming  JC      K2p     JCgamma
# 1-2   0.05     0.0517  0.0531  0.0536
# 1-3   0.25     0.3041  0.3616  0.3750
# 1-4   0.50     0.8240  1.1376  1.5000
#
# The biggest difference is for pair 1-4, where the distance is large.
# For small distances (pair 1-2), the corrections make almost no
# difference.
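# The Hamming, JC and JCgamma columns can be recomputed by hand or with
# a short sketch like the one below. It assumes the usual closed forms,
# JC = -(3/4)ln(1-(4/3)p) and, with gamma shape alpha,
# JCgamma = (3/4)alpha[(1-(4/3)p)^(-1/alpha) - 1]. The K2p column comes
# from dnadist and is not reproduced here. All function names are
# illustrative.

```python
import math

seqs = {
    "Seq1": "TATATATATATATATATATA",
    "Seq2": "AATATATATATATATATATA",
    "Seq3": "ATATAATATATATATATATA",
    "Seq4": "CTCTCTATATTATATATATA",
}


def hamming_proportion(a, b):
    """Proportion of aligned sites that differ (Hamming distance / length)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69(p):
    """Jukes Cantor distance from the proportion of differing sites p."""
    return -0.75 * math.log(1 - 4 * p / 3)


def jc69_gamma(p, alpha=1.0):
    """Jukes Cantor distance with gamma-distributed rates (shape alpha)."""
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


for name in ("Seq2", "Seq3", "Seq4"):
    p = hamming_proportion(seqs["Seq1"], seqs[name])
    print("Seq1-" + name, p, round(jc69(p), 4), round(jc69_gamma(p), 4))
```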
----------------------------------------------------------------------
3. You have two aligned codon sequences below. Use dnadist in
http://evol.mcmaster.ca/p3S03.html to answer the following
questions.
>seq1
AAG GTC TTT GAA AGG TGT GAG TTG GCC AGA ACT CTG AAA AGA TTG GGA ATG
GAT GGC TAC
>seq2
AAC GAC TTG GAT AGC TGT GAG TTG GCT AGA ACT CTG AGA AGA TTG GGA ATC
GAT GGC TAC
a) Calculate the genetic distance between the sequences based on the
nucleotides in the third codon position. Make sure to use the
default parameters in dnadist.
# The alignment based on the third codon position is ...
#
# >seq1
# GCTAGTGGCATGAAGAGTCC
# >seq2
# CCGTCTGGTATGAAGACTCC
#
# Then, you can calculate the genetic distance between the two sequences
# based on the new alignment.
#
# 2
# seq1 0.000000 0.450805
# seq2 0.450805 0.000000
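# The third-position alignment can be built by hand or with a short
# sketch such as the following. The helper name third_positions is an
# illustrative assumption. The distance itself still comes from dnadist.

```python
seq1 = ("AAG GTC TTT GAA AGG TGT GAG TTG GCC AGA ACT CTG AAA AGA TTG GGA ATG "
        "GAT GGC TAC")
seq2 = ("AAC GAC TTG GAT AGC TGT GAG TTG GCT AGA ACT CTG AGA AGA TTG GGA ATC "
        "GAT GGC TAC")


def third_positions(codon_string):
    """Return the third base of every codon in a space-separated codon string."""
    bases = codon_string.replace(" ", "")
    return bases[2::3]  # start at index 2, then take every third base


print(">seq1")
print(third_positions(seq1))
print(">seq2")
print(third_positions(seq2))
```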
b) Calculate the genetic distance between the two sequences based
on the whole alignment. Make sure to use the default parameters
in dnadist.
# 2
# seq1 0.000000 0.155471
# seq2 0.155471 0.000000
c) Which distance is larger? Why is it larger?
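# The distance based on nucleotides in the third position is larger
# (0.4508 versus 0.1555). The first and second positions in codons are
# under stronger purifying selection because changes there usually
# alter the amino acid, whereas many third-position changes are
# synonymous. The first two positions are therefore less variable than
# the third position.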
----------------------------------------------------------------------
4. The following shows the first 10 amino acids when aligning NOX1 from
human with NOX1 from rhesus monkey, mouse and opossum. (a) What is the
raw score using the BLOSUM62 matrix for human and rhesus (use the
matrix on page 134 of your notes; Table 7.2)? (b) What is the score
between human and mouse, and (c) what is the score between human and
opossum? (Don't forget to show your work.)
MGNWVVNHWF (human)
MGNWVVNHWF (rhesus)
MGNWLVNHWL (mouse)
METWVVNHWF (opossum)
# a) The raw score between human and rhesus is the sum over the ten
#    perfect matches in those first 10 amino acids. It is
#      5+6+6+11+4+4+6+8+11+6 = 67
#      M G N W  V V N H W  F
#
# b) The human-mouse alignment has a V-L and an F-L mismatch. The V-L
#    mismatch decreases the score by s(VV) - s(VL), where s(XY) is the
#    score for aligning X and Y in BLOSUM62. We find s(VV) = 4 and
#    s(VL) = 1, so the V-L mismatch decreases S=67 by 4-1 = 3.
#    Similarly, the F-L mismatch decreases the score by
#    s(FF) - s(FL) = 6 - 0 = 6. The total decrease is then 3+6 = 9, so
#    the human-mouse score is S = 67-9 = 58.
# c) The human-opossum mismatches are G-E and N-T, which cost
#    s(GG) - s(GE) = 6 - (-2) = 8 and s(NN) - s(NT) = 6 - 0 = 6, so the
#    human-opossum score is S = 67 - (8 + 6) = 67 - 14 = 53.
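# The three scores can also be checked by summing over the alignment
# columns directly. The sketch below holds only the BLOSUM62 entries
# quoted in this answer, not the full matrix. The names DIAGONAL,
# OFF_DIAGONAL and raw_score are illustrative assumptions.

```python
# BLOSUM62 entries used in this question only (taken from the answer above).
DIAGONAL = {"M": 5, "G": 6, "N": 6, "W": 11, "V": 4, "H": 8, "F": 6}
OFF_DIAGONAL = {
    frozenset("VL"): 1,
    frozenset("FL"): 0,
    frozenset("GE"): -2,
    frozenset("NT"): 0,
}


def raw_score(a, b):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += DIAGONAL[x]
        else:
            total += OFF_DIAGONAL[frozenset((x, y))]
    return total


human = "MGNWVVNHWF"
rhesus = "MGNWVVNHWF"
mouse = "MGNWLVNHWL"
opossum = "METWVVNHWF"

print(raw_score(human, rhesus))   # 67
print(raw_score(human, mouse))    # 58
print(raw_score(human, opossum))  # 53
```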
----------------------------------------------------------------------
5. In the PAM matrix presented in lecture [October 31, slide 116] there
are three amino acid exchanges that are more common than the rest
(i.e. look at the three most common exchanges). What is unique about
the amino acids involved in these three exchanges that may help
explain why these exchanges occur the most frequently?
# Each of these exchanges can occur through a single nucleotide
# substitution (using the common/standard codon table). Also, for one
# of these exchanges (Asn <-> Asp) that nucleotide substitution is a
# transition and thus more likely.
If we were assuming a Jukes Cantor model, what would this matrix look
like?
# All the off-diagonal numbers in the table would be the same.
Consider an alignment of two sequences of amino acids. Propose a method
for determining if the total alignment score using the PAM method
(summation of log-odds; hint: see slide 130 from the Oct 31 lecture) is
different enough from the expectation under the Jukes Cantor model to be
able to reject the null hypothesis that the protein changes are
occurring under the assumptions of the Jukes Cantor model. (Propose a
method; it is not necessary to do it.)
# Calculate the total score (slide 130, Oct 31 lecture) for a randomly
# generated matrix, assuming all substitutions are equally likely. Do
# this, say, 10,000 times. Compare the observed score to the
# distribution of scores from the 10,000 random matrices. Determine
# what proportion of them produce a score > the observed score, and
# compare that proportion to 0.05.
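# One possible reading of this procedure is sketched below, under
# several loudly stated assumptions: the null data are simulated as
# aligned sequence pairs in which every amino acid change is equally
# likely, p_change is a hypothetical per-site change probability, and
# toy_score is a hypothetical stand-in for the PAM log-odds sum (the
# lecture's matrix is not reproduced here). Only the resampling logic
# is the point.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical stand-in for the PAM log-odds sum: +1 match, -1 mismatch."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))


def simulate_pair(length, p_change, rng):
    """Simulate one aligned pair under an equal-rates null model: each site
    changes with probability p_change to any other residue, all equally likely."""
    a = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    b = []
    for x in a:
        if rng.random() < p_change:
            b.append(rng.choice([r for r in AMINO_ACIDS if r != x]))
        else:
            b.append(x)
    return "".join(a), "".join(b)


def empirical_p_value(observed, length, p_change,
                      score=toy_score, n_reps=10_000, seed=0):
    """Proportion of simulated null scores strictly greater than observed."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_reps):
        a, b = simulate_pair(length, p_change, rng)
        if score(a, b) > observed:
            exceed += 1
    return exceed / n_reps


p = empirical_p_value(observed=6, length=10, p_change=0.3)
print(p, "reject the null" if p < 0.05 else "do not reject the null")
```

# In practice the score argument would be replaced by a function that
# sums the PAM log-odds values over the alignment columns.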
----------------------------------------------------------------------