Eukaryotic Genes

Because many eukaryotic genes are interrupted by introns, it can be difficult to identify the protein sequence encoded by a gene. Furthermore, programs designed to recognize intron/exon boundaries for a particular organism or group of organisms may not recognize all intron/exon boundaries.

No single site should be relied upon; instead, take a combinatorial approach when studying eukaryotic genes, incorporating BLAST and the programs outlined below.
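One simple way to combine results is to keep only the exons that several programs agree on. The sketch below assumes you have already copied each program's exon coordinates into a list of (start, end) pairs; the program names, the coordinates and the `min_support` threshold are hypothetical and only illustrate the idea.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Return exons (start, end) reported by at least min_support programs.

    predictions maps a program name to a list of (start, end) exon
    coordinates on the same sequence (1-based, inclusive).
    """
    counts = Counter()
    for exons in predictions.values():
        # Use a set so one program cannot vote twice for the same exon.
        counts.update(set(exons))
    return sorted(exon for exon, n in counts.items() if n >= min_support)


# Hypothetical coordinates, for illustration only.
predictions = {
    "program_a": [(101, 250), (400, 520), (800, 950)],
    "program_b": [(101, 250), (410, 520), (800, 950)],
    "program_c": [(101, 250), (400, 520)],
}
print(consensus_exons(predictions))
```

Exact-match voting is deliberately strict: two programs that disagree by a few bases at one boundary count as two different exons, which is often exactly the kind of discrepancy worth inspecting by hand.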

The following programs identify intron-exon boundaries. To help you assess the relative merits of each site, I have attached GenBank files containing human, plant and Drosophila gene sequences in which the submitters have designated the intron and exon sequences and the protein product.
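In a GenBank file the annotated exons usually appear as a location string on the CDS or mRNA feature, for example `join(201..350,480..612)`. As a minimal sketch, assuming you have copied such a string out of the file, the following turns it into exon and intron coordinates that can be compared with a program's output. The location shown is hypothetical, and single-base segments (such as `join(5,10..20)`) are not handled.

```python
import re


def exons_from_location(location):
    """Extract (start, end) exon ranges from a GenBank location string.

    Handles plain ranges, join(...) and complement(...); the '<' and '>'
    partial markers are ignored. Ranges are returned in ascending order.
    """
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return sorted((int(start), int(end)) for start, end in pairs)


def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) ranges."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


# Hypothetical location string, for illustration only.
location = "join(201..350,480..612,900..1040)"
exons = exons_from_location(location)
print(exons)
print(introns_between(exons))
```

For a feature on the complementary strand the coordinates are the same, but the exons are transcribed in the reverse of the ascending order returned here.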

DNA functional site miner (DNAFSMiner) - contains two software tools: TIS Miner, which predicts translation initiation sites (TIS) in vertebrate mRNA, cDNA, or DNA sequences, and Poly(A) Signal Miner, which predicts polyadenylation (poly(A)) signals in human DNA sequences (Reference: H. Liu et al. 2005. Bioinformatics 21: 671-673).

FirstEF (R. Davuluri, I. Grosse, & M. Zhang, Cold Spring Harbor Laboratory, U.S.A.) - a first-exon and promoter prediction program for human DNA.

AUGUSTUS - predicts genes in eukaryotic (Human, Drosophila, Arabidopsis, Brugia, Aedes, Coprinus, & Tribolium) sequences using a generalized hidden Markov model, a probabilistic model of a sequence and its gene structure. The web server allows the user to impose constraints on the predicted gene structure (Reference: M. Stanke & B. Morgenstern. 2005. Nucl. Acids Res. 33: W465-W467).

TWINSCAN (M. Brent, Computational Genomics Group, Washington University, U.S.A.) - I am particularly impressed with the output, which provides the location of the exons, a trimmed ORF and the protein sequence linked to BLASTN and BLASTP searches. Can be used to analyze mammalian (human and non-human), plant (Arabidopsis and non-Arabidopsis dicots), worm (Caenorhabditis elegans) and fungus (Cryptococcus neoformans) genes.

GENSCAN (C. Burge, Massachusetts Institute of Technology, U.S.A.) - can be accessed via the MIT Server or the Pasteur Institute Server.
GenomeScan (C. Burge, MIT, U.S.A.) - the newer version of GENSCAN; it can be used to predict vertebrate, Arabidopsis & maize genes.

GeneMark (European Bioinformatics Institute, United Kingdom) - offers plenty of model systems (A. thaliana, C. elegans, D. melanogaster, G. gallus, H. sapiens, F. rubripes, M. musculus, R. norvegicus and S. cerevisiae) with many options.

GlimmerM (The Institute for Genomic Research) - provides the following model organisms: Arabidopsis, Aspergillus, Brugia, Cryptococcus, Entamoeba, Oryza, Plasmodium, Schistosoma and Theileria. Accepts sequences of up to 31 kb by pasting and 200 kb by uploading.

Genome Analysis Pipeline (Genome Analysis and System Modeling Group, Life Sciences Division, Oak Ridge National Laboratory, U.S.A.) - offers genome analysis for prokaryotes (Generation and Glimmer) as well as for higher eukaryotes and Saccharomyces (GrailEXP and Genscan). One also has the option of choosing "Select all services". Very nice Java and HTML presentation of the analysis results.

FGENESH (SoftBerry) - very fast HMM-based gene structure prediction program.

geneid (Genome Informatics Research Lab, Universitat Pompeu Fabra, Spain) - Prediction of human & Drosophila genes.

 GenLang (University of Pennsylvania, Computational Biology and Informatics Laboratory, U.S.A.) - Prediction of vertebrate, Drosophila and dicot genes.

 HMMgene (Anders Krogh, Center for Biological Sequence Analysis, Denmark) -  Prediction of vertebrate and C. elegans genes.

 NetPlantGene (Center for Biological Sequence Analysis, Denmark) - neural network predictions of splice sites in Arabidopsis thaliana DNA.

Genie (Berkeley Drosophila Genome Project, U.S.A.) - Gene finder based upon generalized hidden Markov models. Human & Drosophila genes.

 GeneSeqer (Bioinformatics, Computational Biology, and Biological Statistics, Iowa State University, U.S.A.) - select either human, mouse, rat, chicken, Drosophila, nematode, yeast, Aspergillus, Arabidopsis [default], or maize.

 GrailEXP (Computational Biology Program, Oak Ridge National Laboratory, U.S.A.) - human, mouse, Arabidopsis, maize plus many options.

 GeneWalker (Japan Science and Technology Corporation)

FrameD - a noise-resistant gene finder for prokaryotic and matured eukaryotic sequences, offering considerable flexibility in search strategies and output format (Reference: Schiex, T. et al. 2003. Nucl. Acids Res. 31: 3738-374).

SplicePort - An interactive splice-site analysis tool (Reference: R.I. Dogan et al. 2007. Nucl. Acids Res. 35: W285-W291). You might want to try the site out with Mus musculus PLRLC-C mRNA for myosin light chain 2 (X65981).

An alternative:

GeneMachine (I. Makalowska, J. F. Ryan, & A.D. Baxevanis; Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, U.S.A.) is a suite of Perl programs and modules used to run MZEF, GENSCAN, GRAIL2, FGENES, RepeatMasker, Sputnik, BLASTX and BLASTN. N.B. You have to register for a free account.

If you want to express a gene in an organism with a different codon usage:

JCat - Codon Adapter Tool - offers a complete range of eukaryotic & prokaryotic cells, and the ability to select against rho-independent terminators and restriction sites (Reference: A. Grote et al. 2005. Nucl. Acids Res. 33: W526-W531).
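To illustrate the basic idea behind codon adaptation, here is a minimal sketch that swaps each codon for a synonymous codon preferred by the host while leaving the encoded protein unchanged. It is not JCat's algorithm and does no screening for terminators or restriction sites. The `preferred` table and the example sequence are hypothetical; in practice the preferences would come from a codon usage table for the host organism. The standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence into one-letter amino acids."""
    dna = dna.upper()
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna, preferred):
    """Replace each codon with the host's preferred synonymous codon."""
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    dna = dna.upper()
    out = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        aa = CODON_TO_AA[codon]
        # Keep the original codon when no preference is given for this residue.
        new_codon = preferred.get(aa, codon)
        if CODON_TO_AA[new_codon] != aa:
            raise ValueError(f"{new_codon} does not encode {aa}")
        out.append(new_codon)
    return "".join(out)


# Hypothetical preferences, for illustration only (not a measured usage table).
preferred = {"L": "CTG", "R": "CGT", "G": "GGC"}
original = "ATGTTAAGAGGATAA"  # encodes M L R G stop
adapted = recode(original, preferred)
print(adapted)
print(translate(adapted) == translate(original))
```

Always using the single most preferred codon is the crudest possible strategy; it is shown here only because it makes the synonymous-substitution step easy to follow.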