Answers are due in PDF format to AVENUE before Wednesday Oct 31,
5:00pm. Assignments will not be accepted after this time.
----------------------------------------------------------------------
1. The following is an alignment of 4 amino acids along with the
codons that encode them. Assuming you do not know the true
nucleotide sequence:
a) count the minimum number of nucleotide site changes needed to produce this alignment
b) count the maximum number of nucleotide site changes needed to produce this alignment
c) calculate the Jukes-Cantor distance for a)
d) calculate the Jukes-Cantor distance for b)
*** Round all answers to 6 decimal places ***
ATG GAA ACT CCT
    GAG ACC CCC
        ACA CCA
        ACG CCG
Met Glu Thr Pro   (corresponding codons above)
Met Lys Ala Pro   (corresponding codons below)
ATG AAA GCT CCT
    AAG GCC CCC
        GCA CCA
        GCG CCG
# 1st codon - 0 differences (must be identical)
# 2nd codon - minimum 1 difference (1st position)
# - maximum 2 differences (1st + 3rd positions)
# 3rd codon - minimum 1 difference (1st position)
# - maximum 2 differences (1st + 3rd positions)
# 4th codon - minimum 0 differences
# - maximum 1 difference (3rd position)
# a) min = 2
# b) max = 5
#
# Therefore,
# D_jc = -3/4 * ln( 1 - 4/3 * D ), where D = k/n is the proportion of differing sites
# c) D_jc (min) = -3/4 * ln( 1 - 4/3 * 2/12 ) = 0.188486 (k = 2, n = 12)
# d) D_jc (max) = -3/4 * ln( 1 - 4/3 * 5/12 ) = 0.608198 (k = 5, n = 12)
#
# This is a huge range. You need the DNA sequence to get an accurate
# result.
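# The counts and distances above can be checked with a short script. This is
# only a sketch: the helper names (hamming, jukes_cantor) and the list layout
# are choices made here for illustration, and the codon lists are copied from
# the alignment in the question.

```python
from itertools import product
from math import log

# Candidate codons for each aligned amino acid, as listed in the question.
top = [["ATG"], ["GAA", "GAG"], ["ACT", "ACC", "ACA", "ACG"], ["CCT", "CCC", "CCA", "CCG"]]
bottom = [["ATG"], ["AAA", "AAG"], ["GCT", "GCC", "GCA", "GCG"], ["CCT", "CCC", "CCA", "CCG"]]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def jukes_cantor(k, n):
    """Jukes-Cantor distance from k differences over n sites."""
    return -0.75 * log(1 - (4 / 3) * (k / n))

# For each alignment column, compare every possible codon pair.
per_column = [
    [hamming(a, b) for a, b in product(t_codons, b_codons)]
    for t_codons, b_codons in zip(top, bottom)
]
k_min = sum(min(col) for col in per_column)  # fewest changes consistent with the proteins
k_max = sum(max(col) for col in per_column)  # most changes consistent with the proteins
n = 3 * len(top)                             # 4 codons = 12 nucleotide sites

print("min, max differences:", k_min, k_max)
print(f"D_jc (min) = {jukes_cantor(k_min, n):.6f}")
print(f"D_jc (max) = {jukes_cantor(k_max, n):.6f}")
```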
----------------------------------------------------------------------
2. Take the sequences below and calculate the genetic distance according
to the program dnadist (advanced form: using method "Jukes-Cantor",
all other parameters default) on http://evol.mcmaster.ca/p3S03.html
(be careful of input format and that your pasted data is read
correctly).
ACTTATATATACCGGAGACTATATGAGA
ACTTTTATATACCGGAGGCTATACGAGA
Now calculate the genetic distance for
AC--TTATATATACCGGAGACTATTTATGAGA
ACAATTTTATATACCGGAGGCTA--TACGAGA
How do the sequence pairs differ and how do the genetic distances
differ? How has this program treated the additional differences that
exist between the second pair of sequences?
# The second pair contains two indels, each of length 2.
# The distance is 0.115613 for both pairs of sequences.
#
# The program has ignored the indels: the result is the same as if the
# gap-containing columns had been removed, leaving the same 3 differences
# over 28 sites.
#
# As mentioned in class, this is common for most programs because
# modelling indels is very difficult.
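# A minimal sketch of this behaviour, assuming that gap-containing columns are
# simply dropped before counting differences. This reproduces the reported
# value but is not dnadist's actual code; the function name
# jc_distance_ignoring_gaps is made up for this illustration.

```python
from math import log

def jc_distance_ignoring_gaps(seq1, seq2, gap="-"):
    """Return (differences, compared sites, Jukes-Cantor distance), skipping gap columns."""
    # Keep only the aligned columns in which neither sequence has a gap.
    columns = [(a, b) for a, b in zip(seq1, seq2) if gap not in (a, b)]
    k = sum(a != b for a, b in columns)
    n = len(columns)
    return k, n, -0.75 * log(1 - (4 / 3) * (k / n))

pair1 = ("ACTTATATATACCGGAGACTATATGAGA",
         "ACTTTTATATACCGGAGGCTATACGAGA")
pair2 = ("AC--TTATATATACCGGAGACTATTTATGAGA",
         "ACAATTTTATATACCGGAGGCTA--TACGAGA")

for s1, s2 in (pair1, pair2):
    k, n, d = jc_distance_ignoring_gaps(s1, s2)
    print(f"{k} differences over {n} sites, D_jc = {d:.6f}")
```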
----------------------------------------------------------------------
3. Saba is studying HLA-DRB genes using RNAseq. Here are two
haplotypes that she has sequenced from the blood sample of one
adult human (the sequences are given in groups of three bases
corresponding to codons):
>HLA-DRB haplotype1
CCC CCC AAG ACT CAT ATG ACC CAC CAC
>HLA-DRB haplotype2
CCC CCC CAA GAC ACC TAT GAC CCT CCT
a) Go to the web site listed on page 2 of your notes,
http://evol.mcmaster.ca/p3S03.html, and use protdist to calculate
the protein distance between the haplotypes. Be sure to set the
protdist options to use a Dayhoff PAM matrix as the distance model.
Which of the observed amino-acid substitutions are unlikely
according to the PAM250 matrix shown in table 7.1 of the
course textbook?
# The protein distance is 1.957405 PAM units.
# Two of the amino-acid substitutions are observed less often than expected
# by chance alone (negative log odds): His and Thr (log odds = -1),
# Met and Tyr (log odds = -2).
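# To see which amino-acid substitutions are being scored, the two haplotypes
# can be translated and compared position by position. This is a sketch that
# assumes the standard genetic code; the names CODON_TABLE and translate are
# introduced here for illustration. It does not compute the PAM distance, for
# which protdist is still needed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(codons):
    """Translate a space-separated string of codons into one-letter amino acids."""
    return "".join(CODON_TABLE[c] for c in codons.split())

hap1 = translate("CCC CCC AAG ACT CAT ATG ACC CAC CAC")
hap2 = translate("CCC CCC CAA GAC ACC TAT GAC CCT CCT")

# List every position (1-based) at which the two proteins differ.
substitutions = [(i + 1, a, b) for i, (a, b) in enumerate(zip(hap1, hap2)) if a != b]

print(hap1)
print(hap2)
for pos, a, b in substitutions:
    print(f"position {pos}: {a} -> {b}")
```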
b) Explain why one adult individual can have such different protein
sequences for HLA-DRB. (Cite your sources.)
# Reasonable answers incorporate any of the following ideas:
# There are 4 functional genes (HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5)
# that encode this family of proteins. So the different haplotypes may
# correspond to mRNAs from different genes.
#
# The gene HLA-DRB1 has one of the highest levels of protein-coding
# allelic diversity observed among all genes in the human genome. It is
# therefore very common for individuals to be heterozygous at this gene.
# So the different haplotypes may correspond to mRNAs from different
# alleles.
#
# In mammals, HLA-DRB genes undergo recombination that generates
# genetic variation among the somatic cells of a single individual.
# So the different haplotypes may correspond to mRNAs from
# different recombinant products of the same allele found in
# different somatic cells.
c) Saba wants to "eyeball" if positive selection is acting. A rough
estimate can be achieved by getting the substitution rate at 3rd
codon positions (which *tend* to be synonymous) and comparing it to
the substitution rate at 1st and 2nd codon positions (which *tend*
to be non-synonymous). Please calculate this crude estimate of the
non-synonymous vs synonymous rate; be sure to state any assumptions
used to get the estimate.
What conclusion should Saba make about selection?
# Assuming that there have been no multiple substitutions at any site, use the Hamming distance.
# synonymous distance ~ 0.666667 (6/9), non-synonymous distance ~ 0.611111 (11/18)
# ratio of non-synonymous/synonymous = 0.916667
#
# Assuming that all nucleotide substitutions are equally likely, use the Jukes-Cantor distance.
# synonymous distance ~ 1.647918, non-synonymous distance ~ 1.264799
# ratio of non-synonymous/synonymous = 0.767513
#
# (Trying to use the Kimura-2-parameter model or F84 model with default
# values in dnadist gives differences at non-synonymous sites too large
# to estimate distance.)
#
# An estimate of the non-synonymous vs synonymous rate < 1 indicates
# negative selection. However, the estimate is close to 1, so it is also
# consistent with neutral evolution or with a weak signal of positive
# selection.
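# The crude estimate can be reproduced with the sketch below. It assumes, as
# the question does, that 3rd codon positions stand in for synonymous sites and
# 1st + 2nd positions for non-synonymous sites; the helper names p_distance and
# jukes_cantor are illustrative.

```python
from math import log

hap1 = "CCC CCC AAG ACT CAT ATG ACC CAC CAC".replace(" ", "")
hap2 = "CCC CCC CAA GAC ACC TAT GAC CCT CCT".replace(" ", "")

def p_distance(s1, s2, positions):
    """Proportion of differing sites among the given codon positions (0, 1 or 2)."""
    sites = [(a, b) for i, (a, b) in enumerate(zip(s1, s2)) if i % 3 in positions]
    return sum(a != b for a, b in sites) / len(sites)

def jukes_cantor(p):
    """Jukes-Cantor correction of an observed proportion of differences."""
    return -0.75 * log(1 - (4 / 3) * p)

p_syn = p_distance(hap1, hap2, {2})     # 3rd codon positions
p_non = p_distance(hap1, hap2, {0, 1})  # 1st and 2nd codon positions

print(f"Hamming:      syn = {p_syn:.6f}, non-syn = {p_non:.6f}, "
      f"non-syn/syn = {p_non / p_syn:.6f}")
print(f"Jukes-Cantor: syn = {jukes_cantor(p_syn):.6f}, non-syn = {jukes_cantor(p_non):.6f}, "
      f"non-syn/syn = {jukes_cantor(p_non) / jukes_cantor(p_syn):.6f}")
```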
----------------------------------------------------------------------
4. a) Substitutions are a result of mutation plus selection acting on
them. Is it possible to estimate the number of substitutions per site
between two sequences if you do not have the mutation rate? Explain
your answer.
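# Yes! You can use Jukes-Cantor. The Jukes-Cantor distance expresses the
# distance as the expected number of substitutions per site, estimated
# directly from the observed proportion of differing sites. It assumes
# equal substitution probabilities between nucleotides, and the rate and
# the divergence time enter only through their product. Therefore, you
# do not need the mutation rate to calculate JC.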
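# A short sketch of why the rate is not needed, using the standard
# Jukes-Cantor derivation. The symbols \alpha, t, p and d are introduced here
# for illustration and do not appear in the question.

```latex
% \alpha: rate of change from one nucleotide to each specific other nucleotide
% t: time since the two sequences diverged
% p: expected proportion of differing sites; d: substitutions per site
p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right), \qquad d = 2 \cdot 3\alpha t = 6\alpha t
\;\Longrightarrow\;
p = \frac{3}{4}\left(1 - e^{-4d/3}\right)
\;\Longrightarrow\;
d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,p\right)
```

# Only the product \alpha t appears, so d can be recovered from the observed
# p alone; \alpha and t cannot be separated without extra information.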
b) The following table summarizes distances calculated between two
sequences. The Ratio column is the ratio between expected nucleotide
differences (Jukes-Cantor) and observed number of nucleotide
differences.
Length of    Observed # of    Jukes-Cantor    Ratio
Sequence     differences      Distance
100bp        5                5.17            1.03
100bp        10               10.7            1.07
100bp        20               23.3            1.16
100bp        30               38.3            1.28
Explain why the ratio column increases as the number of observed
nucleotide differences increases.
# The higher the observed proportion of different nucleotides, the larger
# the discrepancy between the observed number of differences and the
# expected number of differences (Jukes-Cantor), and so the larger the
# values in the ratio column.
#
# Jukes-Cantor accounts for multiple substitutions happening at the
# same site.
#
# For example, if two sequences are AAAA and ACAA the second sequence
# could have changed from AAAA to ATAA to AGAA and then to ACAA. This
# means that three changes have occurred, although only one difference
# is observed, and Jukes-Cantor tries to estimate the changes that are
# hidden in this way. Getting three changes is more
# likely if the rate of substitution is high. If the rate of
# substitution is low, then only one substitution is likely at any site
# and this will be observed as such; the Jukes-Cantor estimate is then
# much closer to the number of changes actually observed.
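# The table values can be regenerated with the sketch below, which assumes the
# Jukes-Cantor column is the distance multiplied by the sequence length (that
# is, the expected number of differences). The function name
# jc_expected_differences is illustrative.

```python
from math import log

def jc_expected_differences(observed, length):
    """Expected number of substitutions under Jukes-Cantor for a sequence of this length."""
    p = observed / length                 # observed proportion of differing sites
    d = -0.75 * log(1 - (4 / 3) * p)      # substitutions per site
    return d * length

for observed in (5, 10, 20, 30):
    expected = jc_expected_differences(observed, 100)
    # Three significant figures for the estimate, two decimals for the ratio.
    print(f"{observed:>3}  {expected:.3g}  {expected / observed:.2f}")
```

# The correction grows faster than linearly because the term inside the
# logarithm approaches 0 as the observed proportion approaches 3/4.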
----------------------------------------------------------------------
5. 16S is a common marker in phylogeographic studies.
a) From NCBI (https://www.ncbi.nlm.nih.gov/), get 16S sequences from
3 frog species using the accession numbers provided:
- Xenopus tropicalis: KU166819.1
- Hymenochirus curtipes: KY080143.1
- Pipa parva: KU495453.1
After downloading these sequences in FASTA format, align your 3 sequences
using clustalw and muscle (default parameters)
from the course website (http://evol.mcmaster.ca/p3S03.html).
Provide us with both alignments. Are they the same? What are the main
differences between the two programs?
# Not exactly.
# ClustalW builds a progressive alignment, whereas Muscle refines its
# alignment iteratively.
b) The phylogenetic relationship between these species is still not
clear. Using the output from Muscle, calculate the genetic distances
between the 3 species (default parameters, using dnadist from the
course website). Which species seem to be most closely related?
# KU166819.1 (Xenopus tropicalis)       0.000000  0.180606  0.202530
# KY080143.1 (Hymenochirus curtipes)    0.180606  0.000000  0.240442
# KU495453.1 (Pipa parva)               0.202530  0.240442  0.000000
#
# Xenopus tropicalis and Hymenochirus curtipes have the smallest
# distance (0.180606), so they seem to be the most closely related.
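# Picking the closest pair from a distance matrix can be sketched as follows.
# The values are copied from the dnadist output above; the variable names are
# illustrative.

```python
from itertools import combinations

labels = ["Xenopus tropicalis", "Hymenochirus curtipes", "Pipa parva"]
dist = [
    [0.000000, 0.180606, 0.202530],
    [0.180606, 0.000000, 0.240442],
    [0.202530, 0.240442, 0.000000],
]

# Look at each unordered pair once and keep the one with the smallest distance.
i, j = min(combinations(range(len(labels)), 2), key=lambda ij: dist[ij[0]][ij[1]])
closest = (labels[i], labels[j])

print(f"Closest pair: {closest[0]} and {closest[1]} (distance {dist[i][j]:.6f})")
```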
----------------------------------------------------------------------