Answers are due in hard copy (paper version) and must be put into the drop
box on the first floor of the Life Science Building (number 75-78)
by Wednesday Nov 9, 5:00pm. Assignments will not be accepted after
this time and the boxes will not be checked again until the following
Wednesday.
----------------------------------------------------------------------
1. a) Taq polymerase is used in PCR because the enzyme is resistant to
denaturation at high temperature (needed to denature the
double-stranded DNA templates). Find the name of the organism that Taq was
isolated from and find the GC content of its genome. Another
organism with heat stable enzymes is Aquifex aeolicus. Find the GC
content of Aquifex aeolicus (for both provide the strain name for
the GC content that you are quoting). What kind of environment do
both of these organisms live in? Do they have the same GC content?
# Searching the NCBI genome database yields ...
# Thermus aquaticus Y51MC23 with 68% GC content.
# Aquifex aeolicus VF5 with 43% GC content.
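# Both are thermophilic bacteria that live in hot environments such as
# hot springs and hydrothermal vents (Aquifex aeolicus is a
# hyperthermophile). Their GC contents are clearly not the same.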
#
# Originally the high GC was suggested as an adaptation to high
# temperature since the GC pairs provide three as opposed to two
# hydrogen bonds ... hence greater stability. But Aquifex and other
# genomes show that this is not required and probably not the cause.
b) Assume that the frequencies of the four nucleotides are T: 0.1, A:
0.1, G: 0.4 and C: 0.4. We will assume that these are the
equilibrium frequencies of the nucleotides such that given a large
amount of mutation the frequencies would converge to these values (some
thermophilic bacteria have a highly skewed GC content, apparently to
aid their adaptation to these extreme temperatures: presumably the
three hydrogen bonds of a G/C pair are more stable than the two of an
A/T pair).
If two sequences in these two beasts diverged for an infinite
amount of time, what would be the raw sequence distance expected
(i.e. a count as in D=k/n)? What would this be normally in the
absence of a GC content bias? What would Jukes-Cantor (JC) report
as the degree of divergence for this bias and to what value would
JC approach as the estimated divergence in the absence of this bias?
Hence, JC should be corrected for ...?
# If the two sequences are random then you can calculate that if you
# have X in one sequence the choice in the second sequence is
# random but constrained by the frequencies. So an A across from a T
# should occur with pure random chance as 0.1 * 0.1. An A across
# from a G with 0.1 * 0.4 ... and so on.
#
# It is easier to find the chance of being the same and then to
# calculate the chance of being different, take 1 - 'chance of being
# the same'.
#
# Will have identical A's, 0.1*0.1; identical T's, 0.1*0.1; identical
# C's 0.4*0.4; identical G's 0.4*0.4. So total identical is
# 0.01+0.01+0.16+0.16 = 0.34. So total different is 0.66.
#
# D = 0.66 with the GC bias versus D = 0.75 without it.
#
# K = -(3/4) ln(1-(4/3)D)
# = -(3/4) ln(1-(4/3)(0.66))
# = 1.59 with the bias
# otherwise (D = 0.75) K -> infinity
#
# Hence all measures of distance should be (and usually are) corrected
# for base frequency bias.
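#
# The arithmetic above can be checked with a short Python sketch. The
# function names are made up for illustration; the only inputs are the
# base frequencies given in the question and the JC formula quoted above.

```python
import math

# Equilibrium base frequencies from part b) of the question.
freqs = {"T": 0.1, "A": 0.1, "G": 0.4, "C": 0.4}

def expected_raw_distance(freqs):
    """Chance that two unrelated sites differ: 1 - sum of squared frequencies."""
    return 1.0 - sum(f * f for f in freqs.values())

def jukes_cantor(d):
    """K = -(3/4) ln(1 - (4/3) D); reported as infinite once D reaches 0.75."""
    arg = 1.0 - 4.0 * d / 3.0
    if arg <= 0.0:
        return math.inf
    return -0.75 * math.log(arg)

d_biased = expected_raw_distance(freqs)
d_uniform = expected_raw_distance({base: 0.25 for base in "TAGC"})

print(round(d_biased, 2), round(jukes_cantor(d_biased), 2))  # 0.66 1.59
print(d_uniform, jukes_cantor(d_uniform))                    # 0.75 inf
```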
----------------------------------------------------------------------
2. Take the sequences below and calculate the genetic distance according
to the program dnadist (advanced form: using method "Jukes-Cantor",
all other parameters default) on http://evol.mcmaster.ca/p3S03.html
(be careful of input format and that your pasted data is read
correctly).
ACTTATATATACCGGAGACTATATGAGA
ACTTTTATATACCGGAGGCTATACGAGA
Now calculate the genetic distance for
AC--TTATATATACCGGAGACTATTTATGAGA
ACAATTTTATATACCGGAGGCTA--TACGAGA
How do the sequence pairs differ and how do the genetic distances
differ? How has this program treated the additional differences that
exist between the second pair of sequences?
# Distance is 0.115613 for both pairs of sequences.
#
# The program has ignored the indels.
#
# As mentioned in class this is common for most programs because we
# still don't know how to accurately model indels.
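#
# A minimal Python sketch of the same calculation, assuming that columns
# containing a gap are simply dropped before counting differences. This
# mimics the behaviour described above; it is not dnadist's own code, and
# the function name is hypothetical.

```python
import math

def jc_distance_ignoring_gaps(seq_a, seq_b):
    """Jukes-Cantor distance over columns where neither sequence has a gap."""
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    mismatches = sum(1 for a, b in columns if a != b)
    d = mismatches / len(columns)  # raw distance D = k/n
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

pair1 = ("ACTTATATATACCGGAGACTATATGAGA",
         "ACTTTTATATACCGGAGGCTATACGAGA")
pair2 = ("AC--TTATATATACCGGAGACTATTTATGAGA",
         "ACAATTTTATATACCGGAGGCTA--TACGAGA")

dist1 = jc_distance_ignoring_gaps(*pair1)
dist2 = jc_distance_ignoring_gaps(*pair2)
print(round(dist1, 6), round(dist2, 6))  # 0.115613 0.115613
```

# Both pairs have 3 mismatches over 28 gap-free columns, which is why the
# two distances agree.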
----------------------------------------------------------------------
3. You have two aligned codon sequences below. Use dnadist in
http://evol.mcmaster.ca/p3S03.html to answer the following
questions.
>seq1
AAG GTC TTT GAA AGG TGT GAG TTG GCC AGA ACT CTG AAA AGA TTG GGA ATG
GAT GGC TAC
>seq2
AAC GAC TTG GAT AGC TGT GAG TTG GCT AGA ACT CTG AGA AGA TTG GGA ATC
GAT GGC TAC
a) Calculate the genetic distance between the sequences based on the
nucleotides in the third codon position. Make sure to use the
default parameters in dnadist.
# The alignment based on the third codon position is ...
#
# >seq1
# GCTAGTGGCATGAAGAGTCC
# >seq2
# CCGTCTGGTATGAAGACTCC
#
# Then, you can calculate the genetic distance between the two sequences
# based on the new alignment.
#
# 2
# seq1 0.000000 0.450805
# seq2 0.450805 0.000000
b) Calculate the genetic distance between the two sequences based
on the whole alignment. Make sure to use the default parameters
in dnadist.
# 2
# seq1 0.000000 0.155471
# seq2 0.155471 0.000000
c) Which distance is larger? Why is it larger?
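# The distance based on nucleotides in the third position is larger. The
# first and second positions in codons are under stronger purifying
# selection because most changes there alter the amino acid, whereas many
# third-position changes are synonymous. They are therefore less variable
# than the third position.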
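#
# The third-position alignment in part a) can be extracted with a few
# lines of Python, and the raw proportion of differences (D = k/n) can be
# compared across codon positions. This is only a sketch with hypothetical
# helper names. The raw proportions are uncorrected, so they are not
# expected to match the dnadist values above, which come from the
# program's default substitution model.

```python
seq1 = ("AAG GTC TTT GAA AGG TGT GAG TTG GCC AGA ACT CTG AAA AGA TTG GGA ATG "
        "GAT GGC TAC")
seq2 = ("AAC GAC TTG GAT AGC TGT GAG TTG GCT AGA ACT CTG AGA AGA TTG GGA ATC "
        "GAT GGC TAC")

def codon_position(seq, position):
    """Return the bases at one codon position (1, 2 or 3) of an in-frame sequence."""
    bases = seq.replace(" ", "")
    return bases[position - 1::3]

def proportion_different(a, b):
    """Raw (uncorrected) proportion of differing sites, D = k/n."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

third1 = codon_position(seq1, 3)
third2 = codon_position(seq2, 3)
print(third1)  # GCTAGTGGCATGAAGAGTCC
print(third2)  # CCGTCTGGTATGAAGACTCC

# Raw proportion of differences at each codon position.
by_position = {
    pos: proportion_different(codon_position(seq1, pos), codon_position(seq2, pos))
    for pos in (1, 2, 3)
}
print(by_position)
```

# All but two of the eight differences between the sequences fall on the
# third codon position, in line with the answer to part c).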
----------------------------------------------------------------------
4. You are examining a number of genes in your favorite organism and
calculate the synonymous rate to be 0.021 and the non-synonymous rate to
be 0.053. You decide to use PAML to determine what kind of selection, if
any, these genes are undergoing.
a) Where would the parameter 'w' fall with respect to the number 1
(above/equal/below)?
# 'w' would be larger than 1 because the non-synonymous rate is larger
# than the synonymous rate.
b) Are these genes likely undergoing selection and how do you know
this? If so, what kind?
# Yes, these genes are undergoing positive selection because 'w' > 1.
c) How could you determine why these genes are, or are not, undergoing
selection?
# For example: we could use the KEGG database to determine how these
# genes may be involved in similar or the same pathway. This could
# suggest that the entire pathway is under positive selection and
# possibly plays a crucial role in allowing the organism to adapt to
# different environments. Or any other reasonable answer.
d) Repeat parts a-c with a synonymous rate of 0.021 and a non-
synonymous rate of 0.005.
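# The comparison of 'w' with 1 is simple arithmetic, sketched below in
# Python for both pairs of rates in this question. The function names and
# the three labels are illustrative assumptions. PAML estimates 'w' by
# maximum likelihood and a real analysis would also test significance,
# which this sketch does not attempt.

```python
def omega(dn, ds):
    """Ratio of the non-synonymous rate to the synonymous rate ('w')."""
    return dn / ds

def selection_regime(w, tolerance=1e-9):
    """Textbook reading of w: above 1 positive, below 1 purifying, 1 neutral."""
    if abs(w - 1.0) <= tolerance:
        return "neutral"
    return "positive" if w > 1.0 else "purifying"

# (synonymous rate, non-synonymous rate) for parts a-c and for part d.
rate_pairs = [(0.021, 0.053), (0.021, 0.005)]
results = [(omega(dn, ds), selection_regime(omega(dn, ds))) for ds, dn in rate_pairs]

for w, regime in results:
    print(round(w, 2), regime)
```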
----------------------------------------------------------------------
5. a) With a higher mutation rate, you will calculate a higher genetic
distance. However, this relationship is not linear with the Hamming
distance. Why is that?
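# Higher mutation rates mean more changes that you do not see: multiple
# substitutions at the same site (including back mutations) are counted
# at most once by the Hamming distance.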
b) At equilibrium (i.e. all base frequencies equal to 0.25, 1 transition
for every 2 transversions, both transitions of equal frequency, etc.),
will JC, K2P, TN93, and T92 all estimate the same genetic distance
(yes/no) for a pair of sequences? Why?
# Yes. Each model assigns individual parameters to subsets of the data
# (e.g. K2P allows transitions and transversions to have separate
# rates). At equilibrium, these subsets will be exactly as expected
# (e.g. all base frequencies equal 0.25), and accounting for them will
# not change the estimate.
c) Assuming you have real data (i.e. not at equilibrium), order the
nucleotide distance models below from the one that will give you the
smallest estimate of distance to the one that will give you the largest
(hint: you will not need to do math).
- GTR, JC, Tamura & Nei (TN93), K2P, Tamura (T92)
# JC < K2P < T92 < TN93 < GTR
d) Why are they ordered in this way? What is it about the relationships of
these models that allows you to make this inference?
# These models are nested, meaning that simpler models are a special
# case of more complex models (e.g. the JC is the same as the K2P when
# transition/transversion rates are not considered separately).
# Accounting for more subsets will increase estimates because it is
# accounting for more variation that you can't see.
e) Explain, in one short sentence, why we use the Gamma distribution
in calculating genetic distance.
# We use it to correct for variation in mutation rate among sites.
f) In the gamma distribution, alpha reflects the shape of the curve.
Biologically, what does alpha represent? In words, what would an
exponential shaped curve (decreasing exponential, not growth,
say alpha = 0.1 in the figure presented in class) represent in terms
of mutation rates per site?
# Alpha represents how much variance there is in the mutation rate among
# sites. An exponential shaped curve would indicate that most sites have
# a very low mutation rate, and a few have a very high mutation rate.
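#
# Two of the answers above (parts b and f) can be illustrated numerically.
# The Python sketch below rests on assumptions not stated in the
# assignment: it uses the standard textbook K2P formula, an arbitrary raw
# distance of 0.3, and gamma-distributed relative rates scaled to a mean
# of 1. All function names are hypothetical.

```python
import math
import random

def jc(d):
    """Jukes-Cantor distance from the raw proportion of differences d."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

def k2p(p, q):
    """Kimura 2-parameter distance from transition (p) and transversion (q) proportions."""
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)

# Part b): with 1 transition for every 2 transversions, K2P reduces to JC.
d = 0.3
jc_estimate = jc(d)
k2p_estimate = k2p(d / 3.0, 2.0 * d / 3.0)
print(jc_estimate, k2p_estimate)

def gamma_rates(alpha, n_sites, seed=0):
    """Draw relative rates per site from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / alpha.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

# Part f): a small alpha gives many very slow sites and a few very fast ones.
rates = gamma_rates(alpha=0.1, n_sites=10000)
fraction_slow = sum(r < 0.1 for r in rates) / len(rates)
print(fraction_slow, max(rates))
```

# With alpha = 0.1 most simulated sites evolve at less than a tenth of the
# average rate, while a small number evolve many times faster than average.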
----------------------------------------------------------------------