
Open Reading Frames

The primary tool for locating protein-coding exons is the open reading frame (ORF). Many programs determine open reading frames, among them Translate on the ExPASy Molecular Biology Server.
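As a minimal sketch of what an ORF search involves, the Python below scans all six reading frames for runs of codons uninterrupted by a stop. The function names and the `min_codons` parameter are illustrative assumptions, not taken from Translate or any other program named here. No start codon is required, since an internal exon need not begin with ATG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end) for stop-free runs of codons.

    start and end are 0-based, end-exclusive offsets on the strand that
    was scanned. min_codons is an illustrative cutoff (an assumption).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    # A stop codon closes the current run.
                    if (pos - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, pos
                    run_start = pos + 3
            # Close the run that reaches the end of the sequence.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, last
```

A call such as `list(find_orfs(sequence, min_codons=50))` returns the candidate ORFs on both strands. Runs that reach the end of the sequence are reported as open, which matters when an exon extends beyond the sequenced region.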

There are a number of difficulties in determining if an ORF is actually used by the organism to code for protein. A powerful, but not often available, indication is conservation across related species. Potential problems are:

  1. Sequencing errors. If a sequencing error creates a stop codon, it is difficult to determine if an exon is present since most methods are based on continuous ORFs.
  2. Short exons. Short exons may be difficult to distinguish from intron DNA because they lack sufficient identifying features.
  3. The non-coding strand of exons often contains statistically significant ORFs. That is, the reverse complements of stop codons (i.e., TTA [TAA], CTA [TAG], TCA [TGA]) are often avoided, creating ORFs on the complementary strand.
  4. Multiple transcripts and multiple proteins. Certain exons may be used in only a subset of transcripts. The D. melanogaster Adh gene, for example, has different transcripts during the larval and adult phases of growth. In this case, the protein-coding part of the exon is identical, so the difficulty lies in identifying the mRNA leaders for the two situations. Other genes (the D. melanogaster per gene, for example) code for multiple proteins that are specified by differently spliced transcripts.
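The third problem in the list can be made concrete with a few lines of Python. This is only an illustration under simple assumptions: the helper names are hypothetical, and the count considers just the coding frame, which lines up with one of the three frames on the opposite strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


STOP_CODONS = ("TAA", "TAG", "TGA")

# Sense-strand triplets that read as stop codons on the opposite strand,
# mapped to the stop codon they produce there.
ANTISENSE_STOPS = {reverse_complement(c): c for c in STOP_CODONS}


def antisense_stop_count(coding_seq):
    """Count in-frame sense codons that are stops on the other strand."""
    coding_seq = coding_seq.upper()
    return sum(
        coding_seq[i:i + 3] in ANTISENSE_STOPS
        for i in range(0, len(coding_seq) - 2, 3)
    )
```

If a coding region rarely uses the codons TTA, CTA and TCA, this count stays low and the aligned frame on the complementary strand remains open over long stretches.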

Some of these problems can be resolved if all the available sequence information is combined to make a model of the entire gene. Information about possible promoters, transcription initiation (cap sites), translation signals (initiation, termination), splice signals, and transcription terminators can all be incorporated in order to reject unlikely ORFs and include likely ORFs in a consistent gene model.

Gene Recognition

Two general approaches are used to recognize genes within a DNA sequence. The global approach calculates a vector that estimates the protein-coding capacity of a window within the sequence. This vector is simply a one-dimensional array of numbers that incorporates the various features the algorithm will use to determine protein-coding capacity. The measured vector is compared to an expected one obtained from a set of standard genes. The local approach attempts to identify gene signals (such as promoters, splice sites, stop codons, polyA sites) that surround a region identified as containing an ORF. In either case, an overall statistic is calculated and a decision is made as to whether or not to present the model as a potential gene. Neural network methods are often used, in which the algorithm is trained on a set of known genes and learns what weights should be assigned to the various measures in order to give the best discrimination.

The ability of various approaches to identify protein-coding genes has been assessed by Fickett and Tung (1992). They identified several measures that are particularly useful.

  1. Codon usage. A codon usage vector (frequencies of the 64 possible codons) for a potential exon is compared to that of a reference set of genes, preferably from the same or a closely related organism. Methods differ in how the reference set is obtained and how the measure of fit is calculated. Reference sets that incorporate information about the amino acid composition of the potential gene are superior to those that do not.
  2. In-phase words. A vector similar to the codon vector is calculated for longer words (oligonucleotides of length n). Hexamers have proven useful. These take into account tendencies of codon use to be correlated over short ranges (e.g., a codon ending in G tends not to be followed by one beginning in G).
  3. The presence of STOP codons. Most methods only consider ORFs. However, it is possible to incorporate stop codons into a measure of amino acid content.
  4. Amino acid content. Measures of protein function, such as vectors of amino acids, dipeptides and hydrophobicity, can be obtained for a potential exon. Like the codon usage vectors, these are compared to a reference set. This, however, may limit identification to particular types of protein-coding genes.
  5. Nucleotide periodicity. Nucleotides do not appear at random in coding sequences (nor in non-coding ones). One property of valid exons is a tendency to have G in the first codon position. More general is the statistical average codon, RNY. Periodicity vectors are calculated for potential exons (e.g., using Fourier transforms or autocorrelation functions).
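Two of the measures listed above, codon usage and the RNY pattern, can be sketched in Python as follows. The function names are hypothetical, and Euclidean distance is chosen only as one possible measure of fit; the published methods differ on exactly this point.

```python
import math
from collections import Counter
from itertools import product

# The 64 codons in a fixed order, so that usage vectors are comparable.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_usage(seq):
    """Return the 64 codon frequencies of seq read in frame 0."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in CODONS)
    return [counts[c] / total if total else 0.0 for c in CODONS]


def usage_distance(u, v):
    """Euclidean distance between two usage vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rny_fraction(seq):
    """Fraction of in-frame codons matching RNY (purine, any, pyrimidine)."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    hits = sum(c[0] in "AG" and c[2] in "CT" for c in codons)
    return hits / len(codons) if codons else 0.0
```

In use, `usage_distance(codon_usage(candidate), reference_vector)` would be computed for each candidate exon, where `reference_vector` comes from a set of known genes of the same or a closely related organism. Comparing `rny_fraction` across the three frames of a window gives a crude indication of which frame, if any, is coding.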

Software

Many programs, and versions thereof, are available to build gene models, such as FGENEH, GENMARK, GRAIL, GeneParser, etc. Burset and Guigo (1996) compared many of them and found that their accuracy is often overrated because it had been evaluated on genes similar to the training set used to build the discriminant functions. I have tried several programs on Drosophila sequences.

FGENEH was written by Victor Solovyev's group. It can be accessed at the CGG (Sanger Centre) web site, where the version available (Mar. 1998) is called FGENES. Other versions exist, for example at the Baylor College of Medicine; however, this one strips N's from the sequence. The Sanger site gives a nicer presentation (complete with cartoons). FGENEH is designed to identify exons and piece them together to form a single gene model, so it will not perform well if there are several genes within the sequence. If the entire gene is not within the sequence searched, one of its component programs should be used instead: FEXH for flanking 5′ and 3′ exons, or HEXON for internal exons. Both are combined into FEX at the Sanger site.

Figure: The exon-intron boundaries of the D. melanogaster Adh gene

FGENEH relies on an algorithm that identifies exon donor and acceptor splice sites, as described by Solovyev et al. (1994). Flanking (5′ and 3′) and internal exons are treated with separate algorithms. The dinucleotides ..GT and AG... form the splice sites of almost all known exon-intron junctions (see the figure of the Adh exon-intron boundaries; note that the alternate larval splice site of Adh is non-canonical). The program examines each ORF that terminates in a GT or begins with AG and calculates a linear discriminant function for it. The discriminant function classifies an exon as valid (5′, internal or 3′, depending on the program) when the discriminant value (z) is above some critical value determined from the analysis of test (learning) data.

$$ z = \sum_{i} w_i x_i $$

The measures ($x_i$) and weights ($w_i$) in this function are chosen to best discriminate between the sets of positive and negative learning exons. The power of the analysis is indicated by the distance between the two classes in the learning set. If there is a large distance between them, then the discriminant function is able to classify sequences of this type with few errors. The measures in the discriminant function used by FGENEH are oligonucleotide (triplet) frequencies at the exon-intron boundaries. Weights were computed from a learning set of ORFs bounded by GT or AG. Because triplet frequencies are organism dependent, discriminant function weights must be obtained for each species, or those of a closely related species used instead.
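The classification step of a linear discriminant can be sketched in a few lines of Python. This is a generic illustration, not FGENEH's code: the names `measures`, `weights` and `critical_value` are assumptions, and `midpoint_threshold` is just one simple way a critical value might be derived from learning data.

```python
def discriminant_score(measures, weights):
    """Return z, the weighted sum of the measures for one candidate exon."""
    if len(measures) != len(weights):
        raise ValueError("measures and weights must have the same length")
    return sum(w * x for w, x in zip(weights, measures))


def classify_exon(measures, weights, critical_value):
    """Accept the candidate exon when z exceeds the critical value."""
    return discriminant_score(measures, weights) > critical_value


def midpoint_threshold(positive_scores, negative_scores):
    """Illustrative critical value: midpoint between the two class means.

    This rule is an assumption made for the sketch; a real program may
    derive its cutoff from the learning set in a different way.
    """
    pos_mean = sum(positive_scores) / len(positive_scores)
    neg_mean = sum(negative_scores) / len(negative_scores)
    return (pos_mean + neg_mean) / 2
```

The further apart the mean scores of the positive and negative learning exons lie, relative to their spread, the fewer candidates fall on the wrong side of the cutoff.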

Figure: Gene models of the D. melanogaster Adh and ase regions

The gene models figure shows how FGENEH performed on the D. melanogaster Adh region (Sanger site, Drosophila option). All three protein-coding exons were precisely defined and combined to give the Adh gene. The correct amino acid sequence of ADH was deduced. The adult promoter was not located (perhaps because it is too far from the first protein-coding exon), but the larval promoter was found, as well as several unknown promoters. Neither the portion of the outspread exon at the beginning of the sequence nor the adh-dup exons at its end were located. This was expected because FGENEH returns only one gene model for the sequence. In order to see how well the program locates other exons, FEX was used, and these results are also shown in the same figure. In addition to the 3 exons of Adh, 17 other, mostly short (5 - 80 amino acids) exons were proposed. None corresponded to known exons in this region. The outspread and adh-dup exons extend beyond the boundaries of the sequence, which is presumably why they were not found.

FGENEH was also tested on a D. melanogaster gene lacking introns, ase (see the gene models figure; sequence accession: X52892). The protein-coding portion of the exon was correctly defined and the 486 amino acid gene product returned. FEX located 7 additional potential exons (7 - 69 amino acids), none of which are part of any known gene. Considering that genes are known which lie within exons (e.g., Adh), it may be premature to say that there are no other valid exons in the ase DNA sequence.

GENIE is a program written by the Computational Biology Group at the University of California, Santa Cruz and the Genomic Informatics Group at LBNL. It uses what is called a Generalized Hidden Markov Model to provide a gene model for a DNA sequence. Basically this means that conditional nucleotide frequencies are calculated within windows of the sequence, for example the frequency of A followed by T, of G followed by C, etc. Higher order words are also used (e.g., AT followed by GC, ATC followed by GTT). Expected probabilities are called transition probabilities. The distance between observed and expected matrices is used to obtain a score analogous to the discriminant function score. The model is specifically applied to finding potential exons having splice sites, which are then combined into a gene. Transition probability matrices are determined by optimizing the program with learning sequences. GENIE located the three Adh introns in the correct location. Its performance is expected to be similar to that of FGENEH, as it uses similar information about exon-intron splice site boundaries.
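The idea of transition probabilities can be illustrated with a first-order model in Python. This is a deliberately reduced sketch and not GENIE's Generalized Hidden Markov Model: the function names, the pseudocount and the log-odds scoring against a non-coding model are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def transition_probabilities(training_seqs, pseudocount=1.0):
    """Estimate P(next base | current base) from training sequences."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in training_seqs:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            # Skip pairs containing ambiguity codes such as N.
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    probs = {}
    for a in BASES:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in BASES}
    return probs


def log_odds(seq, coding, noncoding):
    """Sum log(P_coding / P_noncoding) over adjacent base pairs of seq."""
    seq = seq.upper()
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a in coding and b in coding[a]:
            score += math.log(coding[a][b] / noncoding[a][b])
    return score
```

A positive `log_odds` value means the window resembles the coding training set more than the non-coding one. Extending the conditioning context from one base to two or three gives the higher order words mentioned above, at the cost of needing more training data.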

GENSCAN is a more recent program developed by Burge and Karlin (1997). It has several improvements over previous programs and, although designed for human genes, it performs well on vertebrate genes, satisfactorily on Drosophila genes, and has been modified to deal with plants. Unlike FGENEH and GENIE, it can find more than one gene within the sequence and looks on both strands. It incorporates a number of features of genes to build its model:

  1. Transcriptional and translational signals. These are evaluated by weight matrices. Potential signals are: polyadenylation, cap site, promoter (both TATA (15 bp) and TATA-less promoters are allowed with variable distance to the cap site), translational start sites (6 bp prior to start codons) and stop sites (3 bp following stop codons).
  2. Splice signals. A modified weight matrix method is used to examine potential splice sites (3 bp in exon, 6 bp in intron). The modified method takes into account correlations between positions.
  3. Exon models. Potential coding portions of exons are evaluated using a Markov model. This computes transition probability matrices for hexamers ending at each codon position. Scores are dependent on similarity between the GC-content of the training sequences and the sequence to be evaluated. GENSCAN uses one of two sets of expected transition probabilities that were generated from training sets having either GC<43% or GC>43%.
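A plain weight matrix, of the kind described for the transcriptional and translational signals, can be sketched in Python as below. It treats positions independently, so it is not the modified method GENSCAN applies to splice sites. The uniform background, the pseudocount, the function names and the handling of a GC fraction of exactly 43% are assumptions of the sketch; only the 43% boundary itself comes from the text.

```python
import math


def build_weight_matrix(aligned_sites, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned signal sequences.

    The background is assumed uniform (0.25 per base).
    """
    length = len(aligned_sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            b: math.log(((column.count(b) + pseudocount) / total) / 0.25)
            for b in "ACGT"
        })
    return matrix


def score_site(matrix, site):
    """Sum the per-position log-odds weights for a candidate site."""
    return sum(col[base] for col, base in zip(matrix, site))


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def choose_parameter_set(seq, cutoff=0.43):
    """Pick which of two training-derived parameter sets to apply."""
    return "high_gc" if gc_fraction(seq) > cutoff else "low_gc"
```

Sliding `score_site` along a sequence and keeping positions above a cutoff yields candidate signals. `choose_parameter_set` mirrors the idea that exon scores should be computed with parameters trained on sequences of similar GC content.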

GENIE, FGENEH and GENSCAN derive their parameters from analyzing a group of training genes and can be expected to perform best when the target gene is similar to them. A large, non-redundant set of human genes (containing 1492 exons and 1254 introns) was used to develop GENSCAN. It has been tested successfully on other vertebrate and Drosophila sequences. The internet site has versions for other groups of organisms. GENSCAN was applied to the D. melanogaster Adh region (see the gene models figure). The Adh gene and larval promoter were correctly identified. The polyA site was incorrectly located in the 3′ mRNA trailer; however, it is possible that other, shorter transcripts exist that use this site. GENSCAN was unable to locate the outspread exon. Like FGENEH, it performs poorly at the boundaries of sequences. It did, however, make a good attempt at the adh-dup exons, locating the beginning of the second exon correctly, but not the first. Interestingly, GENSCAN identified a potential exon at nucleotide positions 1388-1566 (not found by FGENEH, presumably because it does not have a potential splice site). This region of the sequence has high complexity and GC content.

Summary

Gene location programs perform well when applied to sequences containing genes similar to those in the training set. They cannot locate transcribed but non-translated regions, nor do they give reliable predictions of promoter and polyA sites. We still need to do experiments to find these. They also raise interesting questions, such as what to make of the obviously complex, possibly protein-coding region at around position 1400 in the Drosophila Adh sequence.

