An important characteristic of exons is codon usage. Not only synonymous but also non-synonymous codon choices are highly patterned. Base composition is one factor that influences codon usage. Organisms, especially bacteria, have a wide range of GC content, and this is reflected in the amino acids they use and in the codons they use for those amino acids. A mutational bias in the genome is thought to play a major role in determining overall base composition. Other influences, such as selection for compact genome size, have also been suggested. Mutational bias could reflect replication error, repair efficiency, nucleotide pools and possibly other unidentified factors. The causes of high, low or intermediate GC content among organisms are not known. Neither are the causes of variation among genes within species. Amino acid composition is an obvious possibility, but even with constant composition, GC content can vary. Substitution of functionally similar amino acid residues allows an additional range of GC content. The problem of GC content and amino acid composition is a chicken-or-egg situation. They are correlated, but which is driving which, and what are the underlying forces?
Codon Tables. Fifteen codon tables are used to conceptually translate GenBank DNA sequences. The universal genetic code (Figure, p. 9) is used by most nuclear genomes; the bacterial table differs from it only in having additional start codons. The universal genetic code, like the other tables, has been patterned by evolution. An obvious feature is that most third position transitions (R↔R or Y↔Y) are synonymous. Third position transversions (R↔Y) and substitutions in the first or second positions are not so forgiving. The origin of third position synonymity is likely the phenomenon of wobble base pairing. Non-Watson-Crick base pairs are not only allowed but common in codon-anticodon interactions. Specifically, G-U is the major wobble base pair. This means that with wobble base pairing, NN(U/C) [anticodon G pairs with C or U] and NN(A/G) [anticodon U pairs with A or G] will be translated as the same amino acid. Thus, the third position must be synonymous or errors will occur. Furthermore, post-transcriptional modification of the first anticodon position in tRNA can alter wobble base pairing. The nucleotide A rarely occurs in the first anticodon position; it is usually modified to inosine, which pairs with C, U, or A. Naturally this only occurs in 4-fold degenerate groups such as the arginine codons.
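The claim that third position transitions are mostly synonymous can be checked by direct enumeration. The following is a minimal sketch, assuming the standard code (NCBI translation table 1) written out as a compact string; the names `CODE` and `third_position_counts` are illustrative and not part of the original text. A stop-to-stop change (TAA/TAG) is counted as synonymous here, which is a choice of this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)


def third_position_counts():
    """Return {kind: (synonymous pairs, all pairs)} for third-position changes."""
    counts = {"transition": [0, 0], "transversion": [0, 0]}
    for prefix in product(BASES, repeat=2):
        p = "".join(prefix)
        # Every unordered pair of distinct bases at the third position.
        for i, x in enumerate(BASES):
            for y in BASES[i + 1:]:
                kind = "transition" if is_transition(x, y) else "transversion"
                counts[kind][1] += 1
                if CODE[p + x] == CODE[p + y]:
                    counts[kind][0] += 1
    return {k: tuple(v) for k, v in counts.items()}


print(third_position_counts())
# {'transition': (30, 32), 'transversion': (34, 64)}
```

Under these assumptions, 30 of the 32 possible third position transitions are synonymous (the exceptions are AUA/AUG and UGA/UGG), against 34 of 64 transversions.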
Figure: Nucleotide fractions at codon positions in R. meliloti (275 entries), E. coli (13,007 entries), and C. perfringens (104 entries). Codon tables from the Codon Usage Database (http://www.dna.affrc.go.jp/nakamura/codon.html).
Amino acid and GC-content play important roles in the choice of nucleotides at the three codon positions (c1, c2, c3). Figure 9.4 (p. 11) shows nucleotide choices for three bacteria, one with high GC (R. meliloti), one with nearly 50% GC (E. coli) and one with high AT (C. perfringens). The cumulative average frequencies on the sense strands of sequenced genes are shown on the right (vertical bars) of histograms showing deviations from this average at each position. A number of workers have pointed out major features of this figure. In all organisms, G is preferred in the first position. T and, less obviously, A are avoided. The second position is less consistent, but A is often preferred, especially at moderate or high GC. The third position shows most clearly the effect of variable GC content. G/C are preferred in the third position at high GC-content and avoided at high AT. A/T are preferred at high AT and avoided at high GC. Preferences/avoidances for G/C or A/T are not equal. C is preferred over G at high GC and T over A at high AT. Or, more concisely, pyrimidines are preferred to purines at extreme base contents. This, of course, is associated with a Chargaff asymmetry.
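The per-position nucleotide fractions discussed above can be computed for any coding sequence. This is a minimal sketch, assuming an in-frame sense-strand sequence given as a plain string; the function name `position_fractions` is hypothetical, and the published figure was built from cumulative codon tables rather than from this code.

```python
from collections import Counter


def position_fractions(cds):
    """Return a list of three dicts giving A/C/G/T fractions at c1, c2, c3.

    Assumes `cds` is in frame; trailing bases that do not complete a codon
    are ignored, and symbols other than A, C, G, T are left out of the totals.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    fractions = []
    for pos in range(3):
        counts = Counter(seq[pos:usable:3])
        total = sum(counts[b] for b in "ACGT")
        fractions.append({b: counts[b] / total for b in "ACGT"})
    return fractions


# Toy example: codons ATG GCT GCC TAA
for name, frac in zip(("c1", "c2", "c3"), position_fractions("ATGGCTGCCTAA")):
    print(name, frac)
```

Subtracting the overall base frequencies of the sequence from each of the three dictionaries gives deviations comparable to those plotted in the histograms.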
G at C1. The preference for G and, more weakly, A at c1, together with the pyrimidine preference at c3, is the basis for the original, prototype RNY codon proposed by Shepherd (1982. Cold Spring Harbor Symp. Quant. Biol. 46:3618-3622), who suggested that the genetic code originated as an RNY code and that codons were only later assigned more specifically to different amino acids. The fact that the abiotically formed amino acids (i.e., those most abundant in meteorites: glycine, alanine, valine, aspartate and glutamate) have codons beginning with G may be connected with the RNY codon and/or the frequency of G at c1 (see Andersson and Kurland 1990). Trifonov (1987) suggested that a pattern of GHNGHNGHN... (H = non-G) is used to monitor the reading frame of the message. The 16S rRNA molecule exposes a sequence of (nnC) at the surface that interacts with the mRNA molecule. Furthermore, he presented data indicating that frameshift hotspots can occur where this pattern is disrupted.
Figure: The relative content of U and A at the second position of 3180 E. coli genes
A at C2. The second position shows a strong pattern of amino acid hydrophobicity in the universal code (Figure). T at c2 is confined to hydrophobic amino acids, while A at c2 is confined to hydrophilic ones. The situation for G or C at c2 is somewhat mixed. This causes a strong difference in second position codon bias between proteins that are essentially hydrophilic (e.g., globular) and those that are essentially hydrophobic (e.g., membrane proteins). This is clearly shown for proteins of the E. coli genome (Figure). When the frequency of U at c2 for a protein is divided by the total of U+A at c2, two types of protein are revealed. Most proteins have a value of about 0.5, characteristic of water-soluble proteins. About 20% have U/(U+A) approximately equal to 0.7, characteristic of membrane-bound proteins.
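The U/(U+A) ratio at c2 is simple to compute per gene. The sketch below assumes an in-frame coding sequence given as DNA or RNA text; the function name `u_fraction_c2` is hypothetical, and the thresholds of about 0.5 and 0.7 are the values quoted in the text, not something the code derives.

```python
def u_fraction_c2(cds):
    """Return U/(U+A) over second codon positions, or None if neither occurs."""
    seq = cds.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    second = seq[1:usable:3]          # bases at c2 of each complete codon
    u, a = second.count("U"), second.count("A")
    if u + a == 0:
        return None
    return u / (u + a)


# Toy example: codons AUG CUG GAA UUU have U, U, A, U at c2.
print(u_fraction_c2("AUGCUGGAAUUU"))  # 0.75
```

Applied to every gene in a genome, a histogram of this ratio would be expected to show the two groups of proteins described above.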
GC-content in a genome affects the amino acid choices made by genes. Much of this influence is exactly what is expected from random choice of nucleotides at each position according to the genome GC-content (Lobry 1997); however, significant deviations are found for each amino acid.
The effect of GC-content on charged amino acids is illustrated in Fig. 9.6. Arginine, whose codons are relatively GC-rich, is a decreasing fraction of the total codons as the GC-content decreases (AT increases). Lysine, on the other hand, whose codons are AT-rich, is an increasing fraction of total codons as the AT-content increases (GC decreases). Since both are positively charged amino acids and can therefore to some extent replace one another, their sum is more constant with changes in genome nucleotide content; however, lysine dominates. Glutamate and aspartate, on the other hand, do not show as strong an effect of GC-content. Their codons are evenly balanced between AT and GC (provided the synonymous position is randomly chosen). There is a strong correlation between the regression slope of the AT-content effect for an amino acid and the average AT-content of the codons for that amino acid (assuming the synonymous position is randomly determined). That is, genomes tend to prefer amino acids whose codons fit their nucleotide content.
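The average GC-content of the codons for each charged amino acid can be worked out from the code table. This is a minimal sketch, assuming the standard code and assuming that "randomly determined" means all synonymous codons of an amino acid are used equally; `CODE` and `mean_codon_gc` are illustrative names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def mean_codon_gc(aa):
    """Mean GC fraction over all codons of one amino acid (one-letter code)."""
    codons = [c for c, a in CODE.items() if a == aa]
    gc = sum(c.count("G") + c.count("C") for c in codons)
    return gc / (3 * len(codons))


for aa, name in (("R", "Arg"), ("K", "Lys"), ("D", "Asp"), ("E", "Glu")):
    print(name, round(mean_codon_gc(aa), 3))
# Arg 0.722
# Lys 0.167
# Asp 0.5
# Glu 0.5
```

Under this assumption arginine codons are GC-rich, lysine codons are AT-rich, and aspartate and glutamate codons are evenly balanced, which matches the direction of the trends described above.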
Figure: Codon usage of charged amino acids across 120 organisms
Figure: The arginine content of 3154 E. coli genes is plotted against their total AT-content (all codon positions)
A similar effect is seen among the genes within a genome. The figure shows the arginine content of a gene plotted against the gene's AT-content for 3180 E. coli genes. Although only a very small part of the variation in arginine content is explained by gene AT-content, that fraction is highly significant. More importantly, the trend is the same as across organisms: arginine content decreases with increasing AT-content to about the same degree in each case.
As indicated by the figure showing nucleotide fractions at codon positions, the major effect of nucleotide content is at the third position. Arginine and lysine are used as an example in Fig. 9.8 (p. 15) to show changes at c3 with AT-content. For lysine, the synonymous choice is between A and G, so it is not surprising that A is preferred at high AT (could it be otherwise?). For arginine, a six-fold degenerate amino acid, the situation is more complex. The use of codons shifts from the 4-fold component (CGN, at high GC) to the 2-fold component (AGR, at high AT) as AT-content increases across organisms. AGA becomes the predominant codon at high AT. However, organisms fall into distinct clusters in their use of arginine and lysine codons. Especially interesting is the pattern of lysine codons for organisms with approximately 50% GC. Some organisms appear to use on average about 75% AAA and 25% AAG (E. coli is an example), while others have nearly (but not exactly) 50% of each.
Obviously codon choice is biased at extremes of nucleotide content, but is it biased given that the nucleotide content is extreme? Furthermore, what about the codon choices of genes within the genome of a single species? How can variation be assessed and are these choices biased, and if so, what forces determine them? A number of measures of codon bias have been proposed to answer these questions. None are entirely satisfactory. Furthermore, it is not always clear what null hypothesis should be used to assess synonymous codon bias. Is it equal use of synonymous codons, or random nucleotide use according to genome frequencies, or according to genome c3 frequencies?
Ikemura (1981) showed that abundant proteins of E. coli use synonymous codons corresponding to the anticodons of abundant tRNA species. Dong et al. (1996) updated and revised earlier data and showed that tRNA abundances in rapidly growing E. coli are correlated with the pool of available codons in a way that optimizes translation rate. It is evident that either the synonymous codon choices of highly expressed genes have evolved to match tRNA expression or tRNA expression has evolved to match a biased set of highly expressed genes. Lacking any explanation for why highly expressed genes should have different synonymous choices, most people prefer the former explanation. This assumes that there are selective differences among synonymous codons that become stronger for proteins that are expressed more often.
Figure: Relative codon use for arginine and lysine codons across 120 organisms (cumulative codon tables)
Figure: Example calculation of relative fitnesses ($w$) for the isoleucine codon group using observed cumulative codon numbers in the reference set of highly expressed E. coli genes
Figure: Distribution of CAI for 3180 E. coli genes
Codon Adaptation Index. Sharp and Li (1987) developed an index of codon choice that measures how closely the codon use of a specific gene matches the optimum codons used in a set of reference genes. They used a set of highly expressed genes in E. coli as their reference set, the idea being to determine whether codon use was correlated with gene expression. They showed that CAI was in fact correlated with expression level in E. coli, and it has become customary to use it as an indicator of gene expression in the absence of direct measurements. CAI is calculated as $\mathrm{CAI} = \left(\prod_{k=1}^{L} w_k\right)^{1/L}$, where $w_k$ is a measure of the fitness of codon k relative to the reference set (see Figure). CAI is the geometric average of the fitness values of the codons used by the gene. Thus, for example, if the gene uses 10 AUU codons for isoleucine, there are 10 terms of 0.229 in the product. L is the total number of codons in the gene (the stop codon, Met and Trp are not counted). The value of CAI reaches 1.0 only if the gene uses exclusively the codons from the reference set which have $w_k = 1$ (e.g. AUC in the case of the E. coli highly expressed genes). Thus, even the highly expressed genes that were used to form the reference set do not have CAI values greater than about 0.75. The average E. coli gene has CAI = 0.35 (Figure).
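The CAI calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: the relative fitness of a codon is taken as its reference count divided by the largest count in its synonymous group, the reference counts in the example are invented, and the names `relative_fitness` and `cai` are hypothetical. The sketch assumes every codon in the reference table has a positive count, since a zero would make the geometric mean undefined.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_fitness(ref_counts):
    """Map codon -> count / (largest count in its synonymous group)."""
    groups = defaultdict(dict)
    for codon, n in ref_counts.items():
        groups[CODE[codon]][codon] = n
    w = {}
    for group in groups.values():
        top = max(group.values())
        for codon, n in group.items():
            w[codon] = n / top
    return w


def cai(cds, w):
    """Geometric mean of w over the scored codons of an in-frame sequence."""
    seq = cds.upper().replace("U", "T")
    log_sum, length = 0.0, 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # Stop codons, Met and Trp are not counted; codons missing from the
        # reference table are skipped in this sketch.
        if CODE.get(codon) in (None, "*", "M", "W") or codon not in w:
            continue
        log_sum += math.log(w[codon])
        length += 1
    return math.exp(log_sum / length) if length else None


# Invented isoleucine counts, for illustration only.
w = relative_fitness({"ATT": 25, "ATC": 100, "ATA": 5})
print(cai("ATCATCATC", w))   # 1.0
print(cai("ATGATTATCTGG", w))  # about 0.5: sqrt(0.25 * 1.0), Met and Trp skipped
```

Summing logarithms rather than multiplying the fitness values directly avoids numerical underflow for long genes.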
A difficulty with CAI is that it measures synonymous codon correspondence to a biased set of reference genes, but not codon bias itself. Thus, genes with a low CAI may be highly biased, but simply biased towards codons which are not most frequent in the reference set.
Effective Number of Codons. A more direct measure of synonymous codon bias was proposed by Wright (1990). ENC measures the effective number of codons by a method used in population genetics to determine the effective number of alleles segregating in a population. Each amino acid group is analogous to a locus and the synonymous codons are analogous to alleles. A quantity analogous to the homozygosity, $F_j = \left(n \sum_i p_i^2 - 1\right)/(n - 1)$, is calculated for each codon group (j). In the calculation of $F_j$, n is the number of codons used in the gene for the amino acid and $p_i$ is the frequency (within the amino acid group) of the i-th codon. ENC is then obtained as $\mathrm{ENC} = 2 + 9/\bar{F}_2 + 1/\bar{F}_3 + 5/\bar{F}_4 + 3/\bar{F}_6$, where $\bar{F}_d$ is the average of $F_j$ over the amino acids with d synonymous codons. ENC is somewhat dependent on the protein's amino acid composition if certain amino acids are rarely used. It is also subject to extreme statistical fluctuation for short genes.
Figure: Effective number of codons for 46 D. melanogaster genes on the X chromosome
ENC measures bias in an intuitively clear manner. The larger the variety of synonymous codons used by a gene, the larger is ENC. The minimum expected value is 20 and the maximum is 61 (though actual values sometimes fluctuate outside these limits). Random synonymous codon usage should lead to ENC values close to 60. Here random means choosing among the possible synonymous codons with equal probability. ENC will deviate from its maximum value if there is selection among synonymous codons (e.g., according to tRNA availability). It may also deviate if there is a mutation bias creating unequal nucleotide frequencies at the third codon position. ENC depends on the nucleotide composition at c3, since only if G=C=A=T=25% is it possible for ENC to reach its maximum value. Of course, the nucleotide content at c3 and codon bias are inseparably related. The figure shows the distribution of ENC for 46 D. melanogaster genes on the X chromosome. ENC is highly correlated with GC-content because biased genes of Drosophila prefer C at c3 (nucleotide frequencies at each codon position are highly correlated). Only AT-rich genes approach random synonymous codon use.
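The ENC calculation can be sketched as follows, using the commonly cited form of Wright's (1990) formula: F = (n Σ p² − 1)/(n − 1) per amino acid, and ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where each F is averaged over the amino acids of that degeneracy. Two simplifications are assumptions of this sketch and not part of the published method: amino acids with fewer than two codons in the gene, or with F not greater than zero, are left out of the averages, and the function returns None when a whole degeneracy class is missing. The name `enc` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon groups, one per amino acid (stop codons excluded).
FAMILIES = defaultdict(list)
for _codon, _aa in CODE.items():
    if _aa != "*":
        FAMILIES[_aa].append(_codon)


def enc(cds):
    """Effective number of codons for an in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    f_by_degeneracy = defaultdict(list)
    for codons in FAMILIES.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met, Trp, or too few codons to estimate F
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            f_by_degeneracy[k].append(f)
    total = 2.0  # Met and Trp each contribute exactly one codon
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        fs = f_by_degeneracy.get(k)
        if not fs:
            return None  # simplification: no estimate for this class
        total += n_amino_acids / (sum(fs) / len(fs))
    return total
```

A gene that uses a single codon for every amino acid gives F = 1 in every group and therefore ENC = 20. A gene that uses all synonymous codons equally gives a value slightly above 61 at finite length, because F is corrected for sample size; this is one way values can fall outside the nominal limits.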
Chi Square-like Statistics. Indices based on a chi square statistic have been proposed. This statistic measures the agreement between observed codon use (within amino acid groups) and that expected according to a null hypothesis. A difficulty is that it is usually not clear what null hypothesis should be used. Medigue et al. (1991) used a chi square statistic for each amino acid to cluster E. coli genes by factorial correspondence analysis. They claimed that genes could be divided into three groups on this basis. One (26%) contained many highly expressed genes having large CAI values. A second group (58%) had more random use of synonymous codons. They proposed that the third group (16%) consisted of genes transferred into E. coli from other organisms.
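A chi square statistic for one amino acid group can be sketched as below. The null hypothesis used here, equal use of all synonymous codons, is only one of the possible choices discussed in this section and is an assumption of the sketch; it is not the procedure of Medigue et al. The name `codon_chi_square` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_chi_square(cds, aa):
    """Chi square of observed vs. equal use of the codons for amino acid `aa`.

    Returns None if the amino acid does not occur in the sequence.
    """
    seq = cds.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    codons = [c for c, a in CODE.items() if a == aa]
    observed = [counts[c] for c in codons]
    n = sum(observed)
    if n == 0:
        return None
    expected = n / len(codons)  # equal-use null hypothesis
    return sum((o - expected) ** 2 / expected for o in observed)


# Toy example: three AAA and one AAG, expected two of each.
print(codon_chi_square("AAAAAAAAAAAG", "K"))  # 1.0
```

Replacing `expected` with values derived from genome or c3 nucleotide frequencies would give the statistic under the other null hypotheses mentioned in the text.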
The degree of codon bias of a gene is correlated with its rate of synonymous substitution. Sharp and Li (1987) compared genes from E. coli and S. typhimurium. There was a strong negative correlation between the synonymous substitution rate and CAI. That is, more biased genes (large CAI) have low synonymous substitution rates. The correlation between replacement substitutions and CAI, though of the same sign, was not significant. However, several studies have indicated correlations between synonymous and replacement substitution rates. Thus, genes that are more biased in their choice of synonymous codons tend to be more conserved. This is attributed to purifying selection on synonymous sites due to translational differences among synonymous codons.