The primary tool for locating protein-coding exons is the open reading frame (ORF). Many programs determine open reading frames, among them Translate on the ExPASy Molecular Biology Server.
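To make the idea of an open reading frame concrete, here is a minimal sketch of a six-frame ORF scan. It is not the method used by Translate or any other program named here; the function names (`find_orfs`, `reverse_complement`) and the `min_codons` cutoff are assumptions made for illustration, and an ORF is taken to be an ATG followed in frame by the first stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """List ORFs as (strand, start, end, frame).

    Coordinates are 0-based, end-exclusive, and given on the strand that
    was scanned (for "-" they refer to the reverse complement).
    `min_codons` counts codons from the ATG up to, but not including,
    the stop codon. Hypothetical helper, for illustration only.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, frame))
                    start = None                   # look for the next ATG
    return orfs


if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(demo, min_codons=3))
```

An ORF found this way is only a candidate: the scan says nothing about whether the organism actually uses it, which is the difficulty the following paragraphs address.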
There are a number of difficulties in determining if an ORF is actually used by the organism to code for protein. A powerful, but not often available, indication is conservation across related species. Potential problems are:
Some of these problems can be resolved if all the available sequence information is combined to make a model of the entire gene. Information about possible promoters, transcription initiation sites (cap sites), translation signals (initiation, termination), splice signals, and transcription terminators can all be incorporated in order to reject unlikely ORFs and include likely ORFs in a consistent gene model.
Two general approaches are used to recognize genes within a DNA sequence. The global approach calculates a vector that estimates the protein-coding capacity of a window within the sequence. This vector is simply a one-dimensional array of numbers that incorporates the various features the algorithm uses to determine protein-coding capacity. The measured vector is compared to an expected one obtained from a set of standard genes. The local approach attempts to identify gene signals (such as promoters, splice sites, stop codons, polyA sites) that surround a region identified as containing an ORF. In either case, an overall statistic is calculated and a decision is made as to whether or not to present the model as a potential gene. Neural network methods are often used, in which the algorithm is trained on a set of training genes and learns what weights should be assigned to the various measures in order to give the best discrimination.
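The global approach can be sketched as follows. This is only an illustration under stated assumptions: the feature vector is taken to be the 64 codon frequencies of a window, and the comparison with the expected vector is a cosine similarity. Neither choice is taken from any of the programs discussed here, and the names `codon_vector`, `cosine` and `window_scores` are hypothetical.

```python
from itertools import product
from math import sqrt

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_vector(seq, frame=0):
    """Relative frequency of each of the 64 codons, read in one frame."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:            # skips codons containing N, etc.
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def window_scores(seq, expected, window=120, step=30):
    """Slide a window along seq; score each window by its best-frame
    similarity to the expected codon vector (assumed measure)."""
    seq = seq.upper()
    scores = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        best = max(cosine(codon_vector(chunk, f), expected) for f in range(3))
        scores.append((start, best))
    return scores


if __name__ == "__main__":
    # The "standard genes" are stood in for by a toy repeat.
    expected = codon_vector("GCTGAA" * 20)
    print(window_scores("GCTGAA" * 20 + "AT" * 60, expected))
```

In a real program the expected vector would come from a curated set of known genes, and the per-window scores would feed into the overall statistic described above.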
The ability of various approaches to identify protein-coding genes has been assessed by Fickett and Tung (1992). They identified several measures which are particularly useful.
Many programs and versions thereof are available to build gene models, such as FGENEH, GENMARK, GRAIL, GeneParser, etc. Burset and Guigo (1996) compared many of them and found that their accuracy is often overrated because it had been evaluated on genes similar to the test set used to build the discrimination functions. I have tried several programs on Drosophila sequences.
FGENEH was written by Victor Solovyev's group. It can be accessed at the CGG (Sanger Centre) web site, where the version available (Mar. 1998) is called FGENES. Other versions exist, for example at the Baylor College of Medicine; however, that one strips N's from the sequence. The Sanger site gives a nicer presentation (complete with cartoons). FGENEH is designed to identify exons and piece them together to form a single gene model. FGENEH will not perform well if there are several genes within the sequence. If the entire gene is not within the sequence searched, one of its component programs should be used instead: FEXH for flanking 5' and 3' exons, or HEXON for internal exons (the two are combined into FEX at the Sanger site).
Figure: The EXON-INTRON boundaries of the D. melanogaster Adh gene
FGENEH relies on an algorithm that identifies exon donor and acceptor splice sites, as described by Solovyev et al. (1994). Flanking (5' and 3') and internal exons are treated with separate algorithms. The dinucleotides ..GT and AG... form the splice sites of almost all known exon-intron junctions (see the figure of the Adh exon-intron boundaries; note that the alternate larval splice site of Adh is non-canonical). The program examines each ORF that is bounded by a GT or an AG and calculates a linear discriminant function. The discriminant function classifies an exon as valid (5', internal or 3', depending on the program) when the discriminant value (z) is above some critical value determined from the analysis of test (learning) data.
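The first step described above, collecting open stretches bounded by AG and GT, might be sketched like this. It is a simplified illustration and not FGENEH's actual procedure: the function names and the `min_len` / `max_len` limits are assumptions, only internal exon candidates are enumerated, and every candidate would still have to be scored by the discriminant function.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment):
    """True if at least one of the three frames contains no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(c in STOPS for c in codons):
            return True
    return False


def candidate_internal_exons(seq, min_len=30, max_len=600):
    """Return (start, end) pairs, 0-based and end-exclusive, for segments
    preceded by AG (acceptor) and followed by GT (donor) that are open in
    at least one frame. Length limits are illustrative assumptions."""
    seq = seq.upper()
    # Position of the first exon base, immediately after each AG.
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    # Position of each GT, i.e. one past the last exon base.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    candidates = []
    for a in acceptors:
        for d in donors:
            if min_len <= d - a <= max_len and has_open_frame(seq[a:d]):
                candidates.append((a, d))
    return candidates


if __name__ == "__main__":
    demo = "TTTAG" + "GCC" * 12 + "GTAAGT"
    print(candidate_internal_exons(demo))
```

Because AG and GT are so common, a scan like this proposes far more candidates than there are real exons, which is why a trained scoring function is needed to separate them.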
The measures and weights in the discriminant function are chosen to best discriminate between the sets of positive and negative learning exons. The power of the analysis is indicated by the distance between the two classes in the learning set: if the distance is large, the discriminant function is able to classify sequences of this type with few errors. The measures in the discriminant function used by FGENEH are oligonucleotide (triplet) frequencies at the exon-intron boundaries. Weights were computed from a learning set of ORFs bounded by GT or AG. Because triplet frequencies are organism-dependent, discriminant function weights must be obtained for each species, or those of a closely related species used instead.
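A linear discriminant has the general form z = a_1 x_1 + ... + a_n x_n, where the x_i are the measures and the a_i their weights (these symbol names are assumed here, since the original notation is not preserved). The sketch below shows the classification step together with a deliberately simplified way of fitting the weights: the difference of the two class means, which coincides with Fisher's discriminant only when the measures are uncorrelated with equal variance. It is not the fitting procedure of Solovyev et al., and the toy measure vectors are invented for illustration.

```python
def discriminant(measures, weights):
    """z = sum_i a_i * x_i for measures x and weights a."""
    return sum(a * x for a, x in zip(weights, measures))


def fit_mean_difference(positives, negatives):
    """Simplified fit (an assumption, not the published method):
    weights are the difference of the class means, and the critical
    value is z evaluated at the midpoint between the two means."""
    n = len(positives[0])
    mean_pos = [sum(v[i] for v in positives) / len(positives) for i in range(n)]
    mean_neg = [sum(v[i] for v in negatives) / len(negatives) for i in range(n)]
    weights = [p - q for p, q in zip(mean_pos, mean_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mean_pos, mean_neg)]
    return weights, discriminant(midpoint, weights)


def classify(measures, weights, threshold):
    """Accept a candidate exon when z exceeds the critical value."""
    return discriminant(measures, weights) > threshold


if __name__ == "__main__":
    # Toy two-measure learning sets (invented values).
    positives = [[0.8, 0.1], [0.9, 0.2]]
    negatives = [[0.2, 0.7], [0.1, 0.8]]
    weights, threshold = fit_mean_difference(positives, negatives)
    print(classify([0.7, 0.2], weights, threshold))
    print(classify([0.2, 0.6], weights, threshold))
```

The further apart the two learning classes lie along z, the fewer candidates fall on the wrong side of the critical value, which is the "distance" referred to above.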
Figure: Gene models of the D. melanogaster Adh and ase regions
The figure of the Adh and ase gene models shows how FGENEH performed on the D. melanogaster Adh region (Sanger site, Drosophila option). All three protein-coding exons were precisely defined and combined to give the Adh gene. The correct amino acid sequence of ADH was deduced. The adult promoter was not located (perhaps because it is too far from the first protein-coding exon), but the larval promoter was found, as well as several unknown promoters. Neither the portion of the outspread exon at the beginning of the sequence nor the adh-dup exons at its end were located. This was expected because FGENEH returns only one gene model for the sequence. In order to see how well the program locates other exons, FEX was used, and these results are also shown in the figure. In addition to the 3 exons of Adh, 17 other, mostly short (5 - 80 amino acids) exons were proposed. None corresponded to known exons in this region. The outspread and adh-dup exons extend beyond the boundaries of the sequence, which is presumably why they were not found.
FGENEH was also tested on a D. melanogaster gene lacking any introns, ase (see the figure of the Adh and ase gene models; sequence accession: X52892). The protein-coding portion of the exon was correctly defined and the 486 amino acid gene product returned. FEX located 7 additional potential exons (7 - 69 amino acids), none of which are part of any known gene. Considering that genes are known which lie within exons (e.g., adh), it may be premature to say that there are no other valid exons in the ase DNA sequence.
GENIE is a program written by the Computational Biology Group at the University of California, Santa Cruz and the Genomic Informatics Group at LBNL. It uses what is called a Generalized Hidden Markov Model to provide a gene model for a DNA sequence. Basically, this means that conditional nucleotide frequencies are calculated within windows of the sequence: for example, the frequency of A followed by T, of G followed by C, etc. Higher-order words are also used (e.g., AT followed by GC, ATC followed by GTT). Expected probabilities are called transition probabilities. The distance between observed and expected matrices is used to obtain a score analogous to the discriminant function score. The model is specifically applied to finding potential exons having splice sites, which are then combined into a gene. Transition probability matrices are determined by optimizing the program with learning sequences. GENIE located the three Adh introns in the correct locations. Its performance is expected to be similar to that of FGENEH, as it uses similar information about exon-intron splice site boundaries.
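The conditional frequencies mentioned above can be illustrated with a small sketch. This is not GENIE's implementation and covers only the simplest ingredient of a hidden Markov model: estimating the probability of the next base given the preceding `order` bases, then scoring a sequence by a log-odds ratio against a background model. The function names, the pseudocount, and the fallback probability of 0.25 for unseen contexts are all assumptions.

```python
from collections import Counter
from math import log


def transition_probs(seq, order=1, pseudocount=1.0):
    """Estimate P(next base | preceding `order` bases) from one sequence.

    Returns a dict keyed by (context, next_base). Only contexts actually
    observed in seq are included.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - order):
        context, nxt = seq[i:i + order], seq[i + order]
        if set(context + nxt) <= set("ACGT"):      # skip N and other symbols
            counts[(context, nxt)] += 1
    probs = {}
    for context in {c for c, _ in counts}:
        total = sum(counts[(context, b)] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            probs[(context, b)] = (counts[(context, b)] + pseudocount) / total
    return probs


def log_odds(seq, model, background, order=1):
    """Sum of log(P_model / P_background) over all transitions in seq.
    Unseen transitions fall back to a uniform 0.25 (an assumption)."""
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - order):
        key = (seq[i:i + order], seq[i + order])
        score += log(model.get(key, 0.25) / background.get(key, 0.25))
    return score


if __name__ == "__main__":
    model = transition_probs("ATATATAT")
    print(model[("A", "T")])
    print(log_odds("ATAT", model, background={}))
```

Setting `order=2` or `order=3` gives the higher-order words described in the text (for example, AT followed by G).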
GENSCAN is a more recent program developed by Burge and Karlin (1997). It has several improvements over previous programs and, although designed for human genes, it performs well on vertebrate genes and satisfactorily on Drosophila genes, and has been modified to deal with plants. Unlike FGENEH and GENIE, it can find more than one gene within the sequence and looks on both strands. It incorporates a number of features of genes to build its model:
GENIE, FGENEH and GENSCAN derive their parameters from analyzing a group of training genes and can be expected to perform best when the target gene is similar to them. A large, non-redundant set of human genes (containing 1492 exons and 1254 introns) was used to develop GENSCAN. It has been tested successfully on other vertebrate and Drosophila sequences. The internet site has versions for other groups of organisms.
GENSCAN was applied to the D. melanogaster Adh region (see figure). The Adh gene and larval promoter were correctly identified. The polyA site was incorrectly located in the 3' mRNA trailer; however, it is possible that other, shorter transcripts exist that use this site. GENSCAN was unable to locate the outspread exon. Like FGENEH, it performs poorly at the boundaries of sequences. It did, however, make a good attempt at the adh-dup exons, locating the beginning of the second exon correctly, but not the first. Interestingly, GENSCAN identified a potential exon at nucleotide pos. 1388-1566 (not found by FGENEH, presumably because it does not have a potential splice site). This region of the sequence has high complexity and GC content.
Gene location programs perform well when applied to sequences containing genes similar to those in the training set. They cannot locate transcribed but non-translated regions, nor do they give reliable predictions of promoter and polyA sites. We still need to do experiments to find these. They also raise interesting questions, such as what to make of the obviously complex, possibly protein-coding region at approximately position 1400 in the Drosophila Adh sequence.