
Codon Usage

An important characteristic of exons is codon usage. Both synonymous and non-synonymous codon choices are highly patterned. Base composition is one factor that influences codon usage. Organisms, especially bacteria, span a wide range of GC content, and this is reflected in the amino acids they use and in the codons they use for those amino acids. A mutational bias in the genome is thought to play a major role in determining overall base composition. Other influences, such as selection for compact genome size, have also been suggested. Mutational bias could reflect replication error, repair efficiency, nucleotide pools and possibly other unidentified factors. The causes of high, low or intermediate GC content among organisms are not known. Neither are the causes of variation among genes within species. Amino acid composition is an obvious possibility, but even with constant composition, GC content can vary. Substitution of functionally similar amino acid residues allows an additional range of GC content. The problem of GC content and amino acid composition is a chicken-or-egg situation: the two are correlated, but which is driving which, and what are the underlying forces?

Figure: Codon table

Codon Tables. Fifteen codon tables are used to conceptually translate GenBank DNA sequences. The universal genetic code (see the codon table figure above) is used by most nuclear genomes and differs from the bacterial table only in that the bacterial table has additional start codons. The universal genetic code, like the other tables, has been patterned by evolution. An obvious feature is that most third position transitions (R ↔ R or Y ↔ Y) are synonymous. Third position transversions (R ↔ Y) and substitutions in the first or second positions are not so forgiving. The origin of third position synonymity is likely the phenomenon of wobble base pairing. Non-Watson-Crick base pairs are not only allowed but common in codon-anticodon interactions. Specifically, G-U is the major wobble base pair. This means that with wobble base pairing, NN(U/C) [anticodon G pairs with C or U] and NN(A/G) [anticodon U pairs with A or G] will be translated as the same amino acid. Thus, the third position must be synonymous or errors will occur. Furthermore, post-transcriptional modification of the first anticodon position in tRNA can alter wobble base pairing. The nucleotide A rarely occurs in the first anticodon position, but is usually modified to inosine, which pairs with C, U, or A. Naturally this only occurs in 4-fold degenerate groups such as the arginine codons.
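The claim that most third position transitions are synonymous can be checked by direct enumeration. The following is a minimal sketch, assuming the standard (universal) code written in TCAG order; the names `CODE` and `third_position_synonymy` are illustrative and do not come from the text. It changes the third base of every sense codon and tallies how often the amino acid is unchanged.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third position varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

PURINES = {"A", "G"}


def third_position_synonymy():
    """Tally third-position changes of sense codons as [synonymous, total]."""
    counts = {"transition": [0, 0], "transversion": [0, 0]}
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # only sense codons are mutated
        for base in BASES:
            if base == codon[2]:
                continue
            # A transition stays within purines (R) or within pyrimidines (Y).
            same_class = (base in PURINES) == (codon[2] in PURINES)
            kind = "transition" if same_class else "transversion"
            counts[kind][1] += 1
            if CODE[codon[:2] + base] == aa:
                counts[kind][0] += 1
    return counts


if __name__ == "__main__":
    for kind, (syn, total) in third_position_synonymy().items():
        print(f"{kind}: {syn}/{total} synonymous")
```

Changes that turn a sense codon into a stop codon are counted as non-synonymous here, which is a choice of this sketch.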

Codon Pattern by Position

Figure: Nucleotide fractions at codon positions in R. meliloti (275 entries), E. coli (13,007 entries), and C. perfringens (104 entries). Codon tables from the Codon Usage Database (http://www.dna.affrc.go.jp/nakamura/codon.html).

Amino acid composition and GC content both play important roles in the choice of nucleotides at the three codon positions (c1, c2, c3). Figure 9.4 shows nucleotide choices for three bacteria, one with high GC (R. meliloti), one with nearly 50% GC (E. coli) and one with high AT (C. perfringens). For each organism, the cumulative average nucleotide frequencies on the sense strands of sequenced genes are shown as vertical bars on the right, next to histograms showing the deviations from this average at each codon position. A number of workers have pointed out the major features of this figure. In all three organisms, G is preferred in the first position. T and, less obviously, A are avoided. The second position is less consistent, but A is often preferred, especially at moderate or high GC. The third position shows most clearly the effect of variable GC content. G/C are preferred in the third position at high GC content and avoided at high AT. A/T are preferred at high AT and avoided at high GC. Preferences and avoidances for G/C or A/T are not equal: C is preferred over G at high GC, and T over A at high AT. More concisely, pyrimidines are preferred to purines at extreme base contents. This, of course, is associated with a Chargaff asymmetry.
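Per-position nucleotide fractions of the kind plotted in the figure can be tabulated from any in-frame coding sequence. The sketch below is illustrative only: the function name `codon_position_fractions` and the toy sequence are assumptions, and a real analysis would pool many genes, as the Codon Usage Database tables do.

```python
from collections import Counter


def codon_position_fractions(cds):
    """Return three dicts with the A/C/G/T fractions at c1, c2 and c3.

    `cds` is assumed to be an in-frame coding sequence; a trailing
    partial codon is ignored and U is treated as T.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    fractions = []
    for pos in range(3):
        counts = Counter(cds[pos:usable:3])  # every third base from offset pos
        total = sum(counts[b] for b in "ACGT")
        if total == 0:
            raise ValueError("sequence contains no complete codon")
        fractions.append({b: counts[b] / total for b in "ACGT"})
    return fractions


if __name__ == "__main__":
    toy_gene = "ATGGCTGAAGGTAAACGT"  # hypothetical sequence, not from the text
    for name, frac in zip(("c1", "c2", "c3"), codon_position_fractions(toy_gene)):
        print(name, {b: round(f, 2) for b, f in frac.items()})
```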

G at C1. The preference for G and, more weakly, A at c1, together with the pyrimidine preference at c3, is the basis for the prototype RNY codon proposed by Shepherd (1982. Cold Spring Harbor Symp. Quant. Biol. 46:3618-3622), who suggested that the genetic code originated as an RNY code and that codons were only later assigned more specifically to different amino acids. The fact that the abiotically formed amino acids (i.e., those most abundant in meteorites: glycine, alanine, valine, aspartate and glutamate) have codons beginning with G may be connected with the RNY codon and/or the frequency of G at c1 (see Andersson and Kurland 1990). Trifonov (1987) suggested that a pattern of GHNGHNGHN... (H = non-G) is used to monitor the reading frame of the message. In his proposal, the 16S rRNA molecule exposes a sequence of (nnC) at its surface that interacts with the mRNA molecule. Furthermore, he presented data indicating that frameshift hotspots can occur where this pattern is disrupted.

Figure: The relative content of U and A at the second position of 3180 E. coli genes

A at C2. The second position shows a strong pattern of amino acid hydrophobicity in the universal code (see the codon table figure). T at c2 is confined to hydrophobic amino acids, while A at c2 is confined to hydrophilic ones. The situation for G or C at c2 is somewhat mixed. This causes a strong difference in second-position codon bias between proteins that are essentially hydrophilic (e.g., globular) and those that are essentially hydrophobic (e.g., membrane proteins). This is clearly shown for proteins of the E. coli genome (see the preceding figure). When the frequency of U at c2 for a protein is divided by the total of U+A at c2, two types of protein are revealed. Most proteins have a value of about 0.5, characteristic of water-soluble proteins. About 20% have [U/(U+A)] approximately equal to 0.7, characteristic of membrane-bound proteins.
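The U/(U+A) ratio at the second codon position is simple to compute for a single gene. The following is a minimal sketch; the function name and the toy sequence are hypothetical, and the values of about 0.5 and 0.7 quoted above are the text's reference points, not thresholds built into the code.

```python
def u_fraction_at_c2(cds):
    """Return U/(U+A) at the second codon position, or None if neither occurs.

    `cds` is assumed to be in frame; T is treated as U and a trailing
    partial codon is ignored.
    """
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3
    second = cds[1:usable:3]  # second base of every complete codon
    u, a = second.count("U"), second.count("A")
    if u + a == 0:
        return None
    return u / (u + a)


if __name__ == "__main__":
    toy_gene = "AUGCUGAAAGUUGAC"  # hypothetical sequence, not from the text
    print(u_fraction_at_c2(toy_gene))
```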

GC content in a genome affects the amino acid choices made by genes. Much of this influence is exactly what is expected from random choice of nucleotides at each position according to the genome GC content (Lobry 1997); however, significant deviations are found for each amino acid.

The effect of GC content on charged amino acids is illustrated in Fig. 9.6. Arginine, whose codons are relatively GC-rich, is a decreasing fraction of the total codons as the GC content decreases (AT increases). Lysine, on the other hand, whose codons are AT-rich, is an increasing fraction of total codons as the AT content increases (GC decreases). Since both are positively charged amino acids and therefore can to some extent replace one another, their sum is more constant with changes in genome nucleotide content; the lysine trend, however, dominates. Glutamate and aspartate, on the other hand, do not show as strong an effect of GC content. Their codons are evenly balanced between AT and GC (provided the synonymous position is randomly chosen). There is a strong correlation between the regression slope of the AT-content effect for an amino acid and the average AT content of the codons for that amino acid (assuming the synonymous position is randomly determined). That is, genomes tend to prefer amino acids whose codons fit their nucleotide content.
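The average nucleotide content of the codons for an amino acid can be worked out from the codon table. The sketch below assumes the standard code and weights every synonymous codon equally, which is used here as a simple stand-in for "the synonymous position is randomly determined"; the function name `mean_codon_gc` is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third position varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def mean_codon_gc(amino_acid):
    """Mean GC fraction over the codons of one amino acid (one-letter code)."""
    codons = [c for c, aa in CODE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    # Each codon contributes its own GC fraction; codons are weighted equally.
    per_codon = [sum(base in "GC" for base in codon) / 3 for codon in codons]
    return sum(per_codon) / len(per_codon)


if __name__ == "__main__":
    for name, letter in (("Arg", "R"), ("Lys", "K"), ("Glu", "E"), ("Asp", "D")):
        gc = mean_codon_gc(letter)
        print(f"{name}: GC = {gc:.3f}, AT = {1 - gc:.3f}")
```

Under these assumptions arginine codons come out GC-rich, lysine codons AT-rich, and glutamate and aspartate codons balanced, in line with the trends described above.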

Figure: Codon usage of charged amino acids across 120 organisms

Figure: The arginine content of 3154 E. coli genes is plotted against their total AT-content (all codon positions)

A similar effect is seen among the genes within a genome. The preceding figure shows the arginine content of each gene plotted against the gene's AT content for 3180 E. coli genes. Although only a very small part of the variation in arginine content is explained by gene AT content, that fraction is highly significant. More importantly, the trend is the same as across organisms: arginine content decreases with increasing AT content to about the same degree in each case.

Synonymous Codon Choices and Codon Bias

As indicated by the figure of nucleotide fractions at codon positions, the major effect of nucleotide content is at the third position. Arginine and lysine are used as an example in Fig. 9.8 to show changes at c3 with AT content. For lysine, the synonymous choice is between A and G, so it is not surprising that A is preferred at high AT (could it be otherwise?). For arginine, a six-fold degenerate amino acid, the situation is more complex. The use of codons shifts from the 4-fold component (high GC) to the 2-fold component (high AT) as AT content increases across organisms. AGA becomes the predominant codon at high AT. However, organisms fall into distinct clusters in their use of arginine and lysine codons. Especially interesting is the pattern of lysine codons for organisms with approximately 50% GC. Some organisms use on average about 75% AAA and 25% AAG (E. coli is an example), while others use nearly (but not exactly) 50% of each.

Obviously codon choice is biased at extremes of nucleotide content, but is it biased given that the nucleotide content is extreme? Furthermore, what about the codon choices of genes within the genome of a single species? How can variation be assessed and are these choices biased, and if so, what forces determine them? A number of measures of codon bias have been proposed to answer these questions. None are entirely satisfactory. Furthermore, it is not always clear what null hypothesis should be used to assess synonymous codon bias. Is it equal use of synonymous codons, or random nucleotide use according to genome frequencies, or according to genome c3 frequencies?

Ikemura (1981) showed that abundant proteins of E. coli use synonymous codons corresponding to the anticodons of abundant tRNA species. Dong et al. (1996) updated and revised earlier data and showed that tRNA abundances in rapidly growing E. coli are correlated with the pool of available codons in a way that optimizes translation rate. It is evident that either the synonymous codon choices of highly expressed genes have evolved to match tRNA expression, or tRNA expression has evolved to match a biased set of highly expressed genes. Lacking any explanation for why highly expressed genes should otherwise have different synonymous choices, most people prefer the former explanation. This assumes that there are selective differences among synonymous codons that become stronger for proteins that are expressed more often.

Figure: Relative codon use for arginine and lysine codons across 120 organisms (cumulative codon tables)

Figure: Example calculation of relative fitnesses ($w_k$) for the isoleucine codon group using observed cumulative codon numbers in the reference set of highly expressed E. coli genes

Figure: Distribution of CAI for 3180 E. coli genes

Codon Adaptation Index.

Sharp and Li (1987) developed the codon adaptation index (CAI), an index of codon choice that measures how closely the codon use of a specific gene matches the optimum codons used in a set of reference genes. They used a set of highly expressed genes in E. coli as their reference set, the idea being to determine whether codon use was correlated with gene expression. They showed that CAI was in fact correlated with expression level in E. coli, and it has become customary to use it as an indicator of gene expression in the absence of direct measurements. CAI is calculated as

\[ \mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L} \]

where $w_k$ is a measure of the fitness of the k-th codon of the gene relative to the reference set (see the isoleucine example figure above). CAI is the geometric average of the fitness values of the codons used by the gene. Thus, for example, if the gene uses 10 AUU codons for isoleucine, there are 10 terms of 0.229 in the product. L is the total number of codons in the gene (the stop codon, Met and Trp are not counted). The value of CAI reaches 1.0 only if the gene uses exclusively the codons from the reference set that have $w_k = 1$ (e.g. AUC in the case of the E. coli highly expressed genes). In practice, even the highly expressed genes that were used to form the reference set do not have CAI values greater than about 0.75. The average E. coli gene has CAI = 0.35 (see the CAI distribution figure above).
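The geometric-average definition of CAI can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the relative fitness of a codon is taken as its reference count divided by the largest count in its amino acid group, the reference counts in the usage example are invented, and the `pseudocount` used for codons absent from the reference set is a choice of this sketch. Logarithms are summed to avoid underflow in long genes.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third position varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
EXCLUDED = {"M", "W", "*"}  # Met, Trp and stop codons are not counted


def relative_fitness(reference_counts, pseudocount=0.5):
    """Map each counted codon to w = count / (largest count in its group)."""
    w = {}
    for aa in set(CODE.values()) - EXCLUDED:
        group = [c for c in CODE if CODE[c] == aa]
        # Assumption: a zero or missing count is replaced by a small pseudocount.
        counts = {c: reference_counts.get(c, 0) or pseudocount for c in group}
        top = max(counts.values())
        for c in group:
            w[c] = counts[c] / top
    return w


def cai(cds, w):
    """Geometric mean of w over the counted codons of an in-frame sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no countable codons in sequence")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    hypothetical_reference = {"ATT": 100, "ATC": 400, "ATA": 4}  # invented counts
    weights = relative_fitness(hypothetical_reference)
    print(cai("ATGATTATCATC", weights))
```

Because only codons present in `w` contribute, Met, Trp, stop codons and any triplet containing an ambiguous base are skipped and do not count towards L.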

A difficulty with CAI is that it measures synonymous codon correspondence to a biased set of reference genes, but not codon bias itself. Thus, genes with a low CAI may be highly biased, but simply biased towards codons which are not most frequent in the reference set.

Effective Number of Codons. A more direct measure of synonymous codon bias was proposed by Wright (1990). ENC measures the effective number of codons by a method used in population genetics to determine the effective number of alleles segregating in a population. Each amino acid group is analogous to a locus and the synonymous codons are analogous to alleles. ENC values are calculated as

\begin{eqnarray}
\mathrm{ENC} &=& 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6} \\
F_j &=& \frac{n \sum_{i} p_i^2 - 1}{n - 1}
\end{eqnarray}

A quantity analogous to the homozygosity ($F_j$) is calculated for each codon group (j), and $\bar{F}_d$ is the average of $F_j$ over the amino acids that have d synonymous codons. In the calculation of $F_j$, n is the number of codons used in the gene for the amino acid and $p_i$ is the frequency (within the amino acid group) of the i-th codon ($p_i = n_i/n$, where $n_i$ is the number of times that codon is used). ENC is somewhat dependent on the protein's amino acid composition if certain amino acids are rarely used. It is also subject to extreme statistical fluctuation for short genes.
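The ENC calculation can be sketched as follows. This assumes the usual statement of Wright's (1990) estimator, ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where each F term is the average of F = (n Σp² − 1)/(n − 1) over the amino acids of that degeneracy class. The function name `enc` and the handling of sparse data (amino acids with fewer than two codons or with F = 0 are skipped, and a missing class raises an error) are simplifications of this sketch rather than part of the text.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third position varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
# Number of synonymous codons per amino acid; Met, Trp and stop are left out.
DEGENERACY = Counter(aa for aa in CODE.values() if aa not in "MW*")


def enc(cds):
    """Effective number of codons for an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if CODE.get(c, "*") not in "MW*")

    per_aa = defaultdict(list)  # amino acid -> counts of its observed codons
    for codon, n_i in counts.items():
        per_aa[CODE[codon]].append(n_i)

    f_by_class = defaultdict(list)  # degeneracy class -> list of F values
    for aa, ns in per_aa.items():
        n = sum(ns)
        if n < 2:
            continue  # F is undefined for a single observation
        f = (n * sum((n_i / n) ** 2 for n_i in ns) - 1) / (n - 1)
        if f > 0:
            f_by_class[DEGENERACY[aa]].append(f)

    total = 2.0  # Met and Trp contribute one codon each
    for degeneracy, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class[degeneracy]
        if not values:
            raise ValueError(f"no usable data for {degeneracy}-fold amino acids")
        total += n_amino_acids / (sum(values) / len(values))
    return total
```

No upper or lower cap is applied, so values for small samples can fall outside the range of 20 to 61.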

Figure: Effective number of codons for 46 D. melanogaster genes on the X chromosome

ENC measures bias in an intuitively clear manner. The larger the variety of synonymous codons used by a gene, the larger is ENC. The minimum expected value is 20 and the maximum is 61 (though actual values sometimes fluctuate outside these limits). Random synonymous codon usage should lead to ENC values close to 60. Here random means choosing among the possible synonymous codons with equal probability. ENC will deviate from its maximum value if there is selection among synonymous codons (e.g., according to tRNA availability). It may also deviate if there is a mutation bias creating unequal nucleotide frequencies at the third codon position. ENC depends on the nucleotide composition at c3, since it can reach its maximum value of approximately 60 only if G=C=A=T=25% at that position. Of course, the nucleotide content at c3 and codon bias are inseparably related. The figure above shows the distribution of ENC for 46 D. melanogaster genes on the X chromosome. ENC is highly correlated with GC content because biased genes of Drosophila prefer C at c3 (nucleotide frequencies at each codon position are highly correlated). Only AT-rich genes approach random synonymous codon use.

Chi Square-like Statistics. Indices based on a chi square statistic have been proposed. This statistic measures the agreement between observed codon use (within amino acid groups) and that expected according to a null hypothesis. A difficulty is that it is usually not clear what null hypothesis should be used. Medigue et al. (1991) used a chi square statistic for each amino acid to cluster E. coli genes by factorial correspondence analysis. They claimed that genes could be divided into three groups on this basis. One (26%) contained many highly expressed genes having large CAI values. A second group (58%) had more random use of synonymous codons. They proposed that the third group (16%) consisted of genes transferred into E. coli from other organisms.
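A chi square-like index depends entirely on the null hypothesis chosen. The sketch below uses the simplest candidate, equal use of the synonymous codons within each amino acid group, purely as an illustration. It is not the per-amino-acid statistic of Medigue et al., and the function name `codon_chi_square` is hypothetical. The result is not normalized for gene length.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third position varies fastest); '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_chi_square(cds):
    """Sum of (observed - expected)^2 / expected over all synonymous codons.

    Null hypothesis (an assumption of this sketch): within each amino acid
    group, every synonymous codon is used equally often.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    chi2 = 0.0
    for aa in set(CODE.values()) - {"M", "W", "*"}:
        group = [c for c in CODE if CODE[c] == aa]
        n = sum(counts[c] for c in group)
        if n == 0:
            continue  # amino acid not used by this gene
        expected = n / len(group)
        chi2 += sum((counts[c] - expected) ** 2 / expected for c in group)
    return chi2


if __name__ == "__main__":
    print(codon_chi_square("AAAAAAAAAAAG"))  # three AAA and one AAG for lysine
```

Replacing `expected` with counts derived from genome-wide or third-position nucleotide frequencies would give the alternative null hypotheses discussed earlier in this section.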

Synonymous Codon Bias and the Divergence Rate of Genes

The degree of codon bias of a gene is correlated with its rate of synonymous substitution. Sharp and Li (1987) compared genes from E. coli and S. typhimurium. There was a strong negative correlation between the synonymous substitution rate ($K_s$) and CAI. That is, more biased genes (large CAI) have low $K_s$. The correlation between replacement substitutions and CAI, though of the same sign, was not significant. However, several studies have indicated correlations between $K_s$ and the replacement substitution rate ($K_a$). Thus, genes that are more biased in their choice of synonymous codons tend to be more conserved. This is attributed to purifying selection on synonymous sites due to translational differences.

