Eukaryotic Genes

Because many genes in eukaryotes are interrupted by introns, it can be difficult to identify the protein sequence of the gene. Furthermore, programs designed to recognize intron/exon boundaries for a particular organism or group of organisms may not recognize all intron/exon boundaries.

No single site should be relied upon; when studying eukaryotic genes, take a combinatorial approach that incorporates BLAST and the programs outlined below.
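One way to act on this advice is to compare the exon boundaries reported by several programs and keep those that more than one program agrees on. The following is only a minimal sketch: the tool names and coordinates are made up, and the function names and the `min_tools` threshold are my own assumptions, not part of any of the servers listed here.

```python
from collections import Counter

def boundary_support(predictions):
    """Count how many tools report each exon boundary.

    predictions maps a tool name to a list of (start, end) exon intervals.
    Returns a Counter keyed by (kind, coordinate), kind being "start" or "end".
    """
    support = Counter()
    for exons in predictions.values():
        # Collect into a set so one tool cannot vote twice for a boundary.
        boundaries = set()
        for start, end in exons:
            boundaries.add(("start", start))
            boundaries.add(("end", end))
        support.update(boundaries)
    return support

def consensus_boundaries(predictions, min_tools=2):
    """Return boundaries reported by at least min_tools tools, by coordinate."""
    support = boundary_support(predictions)
    kept = [b for b, n in support.items() if n >= min_tools]
    return sorted(kept, key=lambda b: (b[1], b[0]))

# Hypothetical predictions; tool names and coordinates are illustrative only.
example = {
    "tool_a": [(100, 250), (400, 520)],
    "tool_b": [(100, 250), (410, 520)],
    "tool_c": [(120, 250), (400, 520)],
}

if __name__ == "__main__":
    for kind, position in consensus_boundaries(example):
        print(kind, position)
```

Boundaries supported by only one program are not necessarily wrong; they are simply the ones worth checking by hand, for example against a BLAST alignment.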

The following programs identify intron-exon boundaries. To help you assess the relative merits of each site, I have attached GenBank files containing human, plant and Drosophila gene sequences in which the submitters have designated the intron and exon sequences and the protein product.
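In a GenBank file the submitter's exon structure is usually recorded in the location of a CDS or mRNA feature, for example `join(10..50,101..200,301..350)`. As a rough sketch of how to turn such a location into exon and intron coordinates for comparison with a program's output (the function names are hypothetical, and strand and remote references are deliberately not handled):

```python
import re

def parse_exons(location):
    """Extract (start, end) exon intervals from a GenBank location string.

    Handles plain ranges and join(...). Coordinates are 1-based and
    inclusive, as in GenBank feature tables. Partial markers (< and >) are
    skipped, and complement(...) is not interpreted in this sketch.
    """
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return sorted((int(a), int(b)) for a, b in pairs)

def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) intervals."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - prev_end > 1
    ]

if __name__ == "__main__":
    exons = parse_exons("join(10..50,101..200,301..350)")
    print("exons:  ", exons)
    print("introns:", introns_between(exons))
```

For anything beyond a quick check, a full GenBank parser is the safer choice, since real records also contain complement strands and single-base locations.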

DNA functional site miner (DNAFSMiner) - contains two software tools: TIS Miner, which predicts translation initiation sites (TIS) in vertebrate mRNA, cDNA, or DNA sequences; and Poly(A) Signal Miner, which predicts polyadenylation (poly(A)) signals in human DNA sequences (Reference: H. Liu, et al. (2005) Bioinformatics 21: 671-673).

AUGUSTUS - predicts genes in eukaryotic (Human, Drosophila, Arabidopsis, Brugia, Aedes, Coprinus, & Tribolium) sequences using a generalized hidden Markov model, a probabilistic model of a sequence and its gene structure. The web server allows the user to impose constraints on the predicted gene structure (Reference: M. Stanke & B. Morgenstern. 2005. Nucl. Acids Res. 33: W465-W467). WebAUGUSTUS is an updated version which provides an interface for training AUGUSTUS to predict genes in genomes of novel species. It also enables you to predict genes in a genome sequence with already trained parameters (Reference: K.J. Hoff & M. Stanke. 2013. Nucl. Acids Res. 41(Web Server issue): W123-8).

GENSCAN (C. Burge, Massachusetts Institute of Technology, U.S.A.)
 GenomeScan (C. Burge, MIT, U.S.A.) - The newer version of GENSCAN; it can be used to predict vertebrate, Arabidopsis & maize genes.

 GeneMark (Georgia Institute of Technology, U.S.A.) - For several species, pre-trained model parameters are available through the GeneMark.hmm page. For metagenomic analysis use MetaGeneMark (Reference: Zhu, W. et al. 2010. Nucleic Acids Research; 38: e132).

Softberry Tools (SoftBerry) - FGENES (pattern-based human gene structure prediction (multiple genes, both chains)); Fgenesh-M (prediction of multiple (alternative splicing) variants of potential genes in genomic DNA); and FGENESH_GC (HMM-based human gene prediction that can predict genes containing minor variants of donor splice sites (GC sites)).
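The "GC sites" mentioned for FGENESH_GC are introns that begin with GC rather than the usual GT donor dinucleotide. A small sketch of how you might check which kind of donor a predicted intron has is given below; the function name and the example sequence are made up for illustration, and the coordinates are assumed to be 1-based, inclusive, and on the forward strand.

```python
def classify_intron(genomic, intron_start, intron_end):
    """Classify an intron by its first and last two nucleotides.

    intron_start and intron_end are 1-based inclusive positions of the
    intron on the forward strand of the genomic sequence.
    """
    intron = genomic[intron_start - 1:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have donor and acceptor sites")
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "GC-AG variant"
    return f"other ({donor}-{acceptor})"

if __name__ == "__main__":
    # Made-up sequence: 6 nt exon, 12 nt intron, 6 nt exon.
    toy = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
    print(classify_intron(toy, 7, 18))
```

A predicted intron that falls into the "other" class is a reasonable candidate for closer inspection, since it may reflect a misplaced boundary.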

 geneid (Genome Informatics Research Lab, Universitat Pompeu Fabra, Spain) -  Prediction of human & Drosophila genes.

 HMMgene (Anders Krogh, Center for Biological Sequence Analysis, Denmark) -  Prediction of vertebrate and C. elegans genes.

 NetPlantGene (Center for Biological Sequence Analysis, Denmark) - neural network predictions of splice sites in Arabidopsis thaliana DNA.

 Genie (Berkeley Drosophila Genome Project, U.S.A.) - Gene finder based upon generalized Hidden Markov Models. Human & Drosophila genes.

 SplicePort: An Interactive Splice Site Analysis Tool - for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters and frequency plot WebLogos can be generated. (Reference: Dogan, R.I. et al. 2007. Nucl. Acids Res. 35(Web Server issue): W285–W291).  

 SpliceRover - is a predictive deep learning approach that outperforms the state-of-the-art in splice site prediction. SpliceRover uses convolutional neural networks (CNNs), which have been shown to obtain cutting edge performance on a wide variety of prediction tasks. We adapted this approach to deal with genomic sequence inputs, and show it consistently outperforms already existing approaches, with relative improvements in prediction effectiveness of up to 80.9% when measured in terms of false discovery rate. However, a major criticism of CNNs concerns their 'black box' nature, as mechanisms to obtain insight into their reasoning processes are limited. To facilitate interpretability of the SpliceRover models, we introduce an approach to visualize the biologically relevant information learnt. (Reference: Zuallaert J et al. (2018) Bioinformatics; 34(24): 4180-4188).

 iSS-PC (identifying splicing sites via physical-chemical properties using deep sparse auto-encoder) - incorporates twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. (Reference: Chen W et al. Biomed Research International 2014: 623149).

 HSF 3.0 Human Splicing Finder (Aix Marseille Université, France) - this system combines 12 different algorithms to identify and predict the effects of mutations on splicing motifs, including the acceptor and donor splice sites, the branch point and auxiliary sequences known to either enhance or repress splicing: Exonic Splicing Enhancers (ESE) and Exonic Splicing Silencers (ESS). These algorithms are based on either PWM matrices, the Maximum Entropy principle or the Motif Comparison method. It is a tool to predict the effects of mutations on splicing signals or to identify splicing motifs in any human sequence. It contains all available matrices for auxiliary sequence prediction as well as new ones for binding sites of the 9G8 and Tra2-beta Serine-Arginine proteins and the hnRNP A1 ribonucleoprotein. We also developed new Position Weight Matrices to assess the strength of 5' and 3' splice sites and branch points. (Reference: FO Desmet et al. 2009. Nucleic Acids Research 37:e67).
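To give a feel for what a position weight matrix (PWM) does, here is a minimal sketch that builds a log-odds matrix from a handful of aligned sites and scans a sequence for the best-scoring window. This is not HSF's code or its matrices: the four "donor-like" training sites are invented, the background is assumed uniform (0.25 per base), and the pseudocount of 1 is an arbitrary choice.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Build a log2-odds PWM from equal-length aligned sites (A/C/G/T only)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for a window as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_window(pwm, seq):
    """Return (score, 0-based offset) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )

# Invented training sites, loosely resembling a 5' splice site; toy data only.
TOY_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]

if __name__ == "__main__":
    pwm = build_pwm(TOY_SITES)
    best_score, offset = best_window(pwm, "TTTTCAGGTAAGTCCCC")
    print(f"best window starts at offset {offset}, score {best_score:.2f}")
```

Comparing the score of a wild-type window with that of the same window carrying a mutation is the basic idea behind using such matrices to judge whether a change weakens a splice site.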

 RegRNA 2.0 - A Regulatory RNA Motifs and Element Finder (Reference: Chang TH et al. BMC bioinformatics 2013, 14 Suppl 2:S4).

 ASSEDA (Automated Splice Site and Exon Definition Analyses) - is a tool to predict the effects of sequence changes that alter mRNA splicing in human diseases. We designed the system to evaluate changes in splice site strength based on information theory-based models of donor and acceptor splice sites. N.B. You need to register.

 NetGene2 - produces neural network predictions of splice sites in human, C. elegans and A. thaliana DNA. Restrictions: at most one sequence per submission, no shorter than 200 and no longer than 100,000 nucleotides. (Reference: S.M. Hebsgaard et al. 1996. Nucl. Acids Res. 24:3439-3452).
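If you are preparing many sequences, it can save time to check the restrictions quoted above before pasting into the server. The sketch below assumes the input is FASTA text; the function names are hypothetical, and only the limits stated above (one sequence, 200 to 100,000 nucleotides) are checked.

```python
MIN_LEN = 200        # lower limit quoted for NetGene2
MAX_LEN = 100_000    # upper limit quoted for NetGene2

def read_single_fasta(text):
    """Return (header, sequence) from FASTA text holding exactly one record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1 or not lines[0].startswith(">"):
        raise ValueError("expected exactly one FASTA record")
    return lines[0][1:], "".join(lines[1:]).upper()

def check_submission(text):
    """Raise ValueError if the record breaks the limits; else return its size."""
    header, seq = read_single_fasta(text)
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError(
            f"{header}: {len(seq)} nt is outside {MIN_LEN}-{MAX_LEN} nt"
        )
    return header, len(seq)

if __name__ == "__main__":
    record = ">toy_sequence\n" + "ACGT" * 50   # made-up 200 nt sequence
    print(check_submission(record))
```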

 Spliceman - predicts how likely distant mutations around annotated splice sites are to disrupt splicing. Spliceman takes a set of DNA sequences with point mutations and returns a ranked list to predict the effects of point mutations on pre-mRNA splicing. The current implementation includes analyses of 11 genomes: human, chimp, rhesus, mouse, rat, dog, cat, chicken, guinea pig, frog and zebrafish. (Reference: Lim, K.H. & Fairbrother, W. 2012. Bioinformatics 28: 1031-1032). Version 2 can be found here.

If you want to express a gene in an organism having different codon usage:

JCat - Codon Adapter Tool - offers a wide range of eukaryotic & prokaryotic organisms, and the ability to select against rho-independent terminators and restriction sites. (Reference: A. Grote et al. 2005. Nucl. Acids Res. 33: W526-W531).
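The basic idea of codon adaptation can be sketched as follows: find which synonymous codon the host uses most often, then rewrite the gene with those codons while leaving the protein unchanged. This is a deliberately naive illustration and not JCat's algorithm; the "host" reference sequence is made up, the function names are my own, and the code assumes in-frame sequences containing only A, C, G and T.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codons(seq):
    """Split an in-frame sequence into codons, dropping a trailing partial one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def translate(seq):
    """Translate an in-frame sequence with the standard genetic code."""
    return "".join(CODON_TABLE[codon] for codon in codons(seq))

def preferred_codons(reference_cds):
    """Map each amino acid to its most frequent codon in a host reference CDS."""
    counts = defaultdict(Counter)
    for codon in codons(reference_cds):
        counts[CODON_TABLE[codon]][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}

def recode(cds, preferred):
    """Swap each codon for the host's preferred synonym, if one is known."""
    return "".join(preferred.get(CODON_TABLE[c], c) for c in codons(cds))

if __name__ == "__main__":
    host_reference = "ATGGCCGCCGCTAAA"   # made-up host coding sequence
    gene = "ATGGCTGCAAAG"                # made-up gene to adapt
    adapted = recode(gene, preferred_codons(host_reference))
    print(adapted, translate(adapted) == translate(gene))
```

In practice you would derive the preferences from a large set of highly expressed host genes or a published codon usage table, and a dedicated tool also screens the result for unwanted features such as restriction sites.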