Molecular Biology Java and Perl Programs

A. DNA sequence analyses
B. Genomic analyses
C. Primer design
D. Microarray analyses
E. Protein analyses
F. Alignments
G. Motifs
H. Phylogeny
I. Miscellaneous
J. Graphic packages

DNA sequence analysis:

FastQC (Simon Andrews, Bioinformatics Group, Babraham Institute, UK) - aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are to (a) import data from BAM, SAM or FastQ files (any variant), (b) provide a quick overview to tell you in which areas there may be problems, and (c) present summary graphs and tables to quickly assess your data.
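To illustrate the kind of check described above, here is a minimal sketch that computes the mean Phred quality at each read position of a FASTQ stream. It is not FastQC's implementation: the function name is hypothetical, and it assumes unwrapped four-line FASTQ records with Phred+33 quality encoding.

```python
from io import StringIO


def mean_quality_per_position(handle, offset=33):
    """Return the mean Phred score at each read position of a FASTQ stream.

    Assumes four lines per record and Phred+33 encoding (both assumptions).
    """
    totals, counts = [], []
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # only the 4th line of a record holds qualities
            continue
        for i, char in enumerate(line.rstrip("\n")):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5I!\n")
print(mean_quality_per_position(example))  # [30.0, 40.0, 20.0, 40.0]
```

A falling mean towards the 3' end of the reads is the sort of pattern such a summary is meant to reveal.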

Sequence Manipulation Suite  - is an incredible set of programs for manipulating DNA and protein sequences. (Reference: P. Stothard. (2000).  Biotechniques 28: 1102-1104)

SeWeR (SEquence analysis using WEb Resources)  is an integrated portal to common web-based services in bioinformatics. (Reference: M.K. Basu. (2001). Bioinformatics. 17: 577-578)

JEMBOSS (Java version of the European Molecular Biology Open Source Software Suite) (Open Bio Foundation, England). (Reference: T. Carver & A. Bleasby. 2003. Bioinformatics 19: 1837-1843). Can also be downloaded.

 GEMBASSY is an EMBOSS associated software (EMBASSY) package of 53 tools implemented with methods from G-language GAE. The tools can perform powerful whole genome analyses such as prediction of replication origins and termini, estimation of gene expression from codon usage, and genome information visualization. GEMBASSY is available for installation from source and ready for use through the EMBOSS Explorer interface.

 StarORF (Massachusetts Institute of Technology, U.S.A.) - the open reading frame finder.

prfectBLAST is a multiplatform (MS Windows, Mac OS X, Linux) graphical user interface (GUI) for the stand-alone BLAST+ suite of applications. It allows researchers to run nucleotide or amino acid sequence similarity searches against locally stored public (or user-customized) databases. It has no dependencies, requires no installation, and can be run from a portable flash drive. (Reference: Santiago-Sotelo P, Ramirez-Prado JH. 2012. Biotechniques. 53(5):299-300).

SSTAR, a Stand-Alone Easy-To-Use Antimicrobial Resistance Gene Predictor - combines a locally executed BLASTN search against a customizable database with an intuitive graphical user interface for identifying antimicrobial resistance (AR) genes from genomic data. Although the database is initially populated from a public repository of acquired resistance determinants (i.e., ARG-ANNOT), it can be customized for particular pathogen groups and resistance mechanisms. (Reference: T.J.B. de Man & B.M. Limbago. mSphere 10.1128/mSphere.00050-15)

 SynteView allows a fast and easy visualization of conservation of gene adjacency in many prokaryotic genomes for which orthology and neighbourhood data have been computed and stored in SynteBase, a dedicated relational database. (Reference: Lemoine, F. et al. BMC Bioinformatics, 2008, 9: 536).

Genome Viewers:

 GeneViTo (Genome Visualization Tool) (Reference: Vernikos GS et al. 2003. BMC Bioinformatics 4:53).

IGV (Integrative Genomics Viewer) - is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. It is available in multiple forms, including the original IGV, a Java desktop application, and IGV-Web, a web application. (Reference: Thorvaldsdóttir H et al. (2013) Briefings in Bioinformatics 14: 178-192).

BugView - is a genome browser for comparing the arrangement of genes on a pair of related genomes, and can also be used to view individual genomes. (Reference: D.P. Leader. (2004) Bioinformatics 20: 129-130).

CGView - this program uses files such as an NCBI ptt file to generate high quality, zoomable maps of circular genomes. CGView converts the input into a graphical map (PNG, JPG, or Scalable Vector Graphics format), complete with labels, a title, legends, and footnotes. In addition to the default full view map, the program can generate a series of hyperlinked maps showing expanded views. The linked maps can be explored using any web browser, allowing rapid genome browsing, and facilitating data sharing. The feature labels in maps can be hyperlinked to external resources, allowing CGView maps to be integrated with existing web site content or databases. (Reference: P. Stothard, & D.S. Wishart. (2005). Bioinformatics 21: 537 - 539) .

progressiveMauve - Multiple Genome Alignments - this program is designed for efficient multiple genome alignment. It is ideally suited for closely related genomes in which large scale events such as rearrangements and deletions have occurred. (Reference: A.E. Darling et al. 2010. PLoS ONE 5: e11147)

ACT - Artemis Comparison Tool (The Wellcome Trust Sanger Institute, United Kingdom) - allows an interactive visualization of comparisons between complete genome sequences and associated annotations. This brilliant tool is based on Artemis. (Reference: T. Carver et al. 2005. Bioinformatics 21: 3422-3423)

 Gepard (GEnome PAir - Rapid Dotter) allows the calculation of dotplots even for large sequences like chromosomes or bacterial genomes (Reference: J. Krumsiek et al. 2007. Bioinformatics 23: 1026-1028).
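For readers unfamiliar with dotplots, the idea can be sketched as follows: every position pair (i, j) at which the two sequences share an identical word becomes one dot. This naive word-index version is only an illustration and is not Gepard's algorithm; the function name and the default word length are assumptions.

```python
def dotplot(seq_a, seq_b, word=3):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            hits.append((i, j))
    return hits


print(dotplot("ACGTAC", "GTACGT"))  # [(0, 2), (1, 3), (2, 0), (3, 1)]
```

Runs of dots along a diagonal indicate a shared segment; plotting the pairs with any graphics library gives the familiar picture.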

J-Circos - Circos plots are graphical outputs that display three dimensional chromosomal interactions and fusion transcripts. However, the Circos plot tool is not an interactive visualization tool, but rather a figure generator. This team has developed a Circos plot tool (J-Circos) that is an interactive visualization tool: it can plot Circos figures, dynamically add data to the figure, and provide information for specific data points using mouse hover display and zoom in/out functions. Users can input data into J-Circos using flat data formats, as well as from the GUI. (Reference: An J et al. 2015. Bioinformatics 31:1463-1465).

Apollo Genome Browser (collaborative project between the Berkeley Drosophila Genome Project and Ensembl) - Display of genomic sequence and any associated start and stop codons; annotations can be created and edited; zoomable and scrollable feature display down to sequence level optimized for display of large regions of genome; Searchable for feature names or sequence string.

AnnotationSketch drawing library module is a versatile and efficient C-based drawing library for GFF3-compatible genomic annotations. It is included in the GenomeTools distribution. (Reference: S. Steinbiss et al. Bioinformatics 2009, 25(4): 533–534).

 GenoViz - is an open source, Java-based framework designed for rapid assembly of visualization software applications for genomics. The Genoviz SDK framework provides a mechanism for incorporating adaptive, dynamic zooming into applications, a desirable feature of genome viewers. Visualization capabilities of the Genoviz SDK include automated layout of features along genetic or genomic axes; support for user interactions with graphical elements (Glyphs) in a map; a variety of Glyph sub-classes that promote experimentation with new ways of representing data in graphical formats. (Reference: G.A. Helt et al. 2009. BMC Bioinformatics 10:266)

DNAPlotter - is an interactive Java application for generating circular and linear representations of genomes. Making use of the Artemis libraries to provide a user-friendly method of loading in sequence files (EMBL, GenBank, GFF) as well as data from relational databases, it filters features of interest to display on separate user-definable tracks. It can be used to produce publication quality images for papers or web pages. (Reference: Carver, T. et al. 2008. Bioinformatics 25:119-120)

PCR primer design:

PerlPrimer - calculates primer melting temperature using J. SantaLucia's extensive nearest-neighbour thermodynamic parameters. To adjust for the salt conditions of the PCR, PerlPrimer uses the empirical formula derived by von Ahsen, et al. (2001) and allows the user to specify the concentration of Mg2+, dNTPs and primers, or use standard PCR conditions. The result is a highly accurate prediction of primer melting temperature, giving rise to a maximum yield of product when amplified. It checks for possible primer-dimers and allows BLAST searches at NCBI or on a local server. In addition, results can be saved or optionally exported in a tab-delimited format that is compatible with most spreadsheet applications. (Reference: O.J. Marshall (2004) Bioinformatics 20: 2471-2472).
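The nearest-neighbour approach mentioned above can be sketched roughly as below. This is an illustration, not PerlPrimer's code: it uses the unified parameters of SantaLucia (1998) at 1 M NaCl as transcribed here (check them against the paper before relying on them), it omits the von Ahsen salt correction entirely, and it assumes a non-self-complementary oligo whose total strand concentration is given by the hypothetical `strand_conc` parameter.

```python
import math

# Unified nearest-neighbour parameters (SantaLucia 1998), 1 M NaCl:
# (dH in kcal/mol, dS in cal/(mol*K)), keyed by the 5'->3' dinucleotide step.
NN = {
    "AA": (-7.9, -22.2), "TT": (-7.9, -22.2),
    "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "TG": (-8.5, -22.7),
    "GT": (-8.4, -22.4), "AC": (-8.4, -22.4),
    "CT": (-7.8, -21.0), "AG": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "TC": (-8.2, -22.2),
    "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9), "CC": (-8.0, -19.9),
}
# Initiation terms, applied once for each terminal base pair.
INIT = {"G": (0.1, -2.8), "C": (0.1, -2.8), "A": (2.3, 4.1), "T": (2.3, 4.1)}
R = 1.987  # gas constant in cal/(mol*K)


def nearest_neighbor_tm(seq, strand_conc=2e-7):
    """Estimate Tm in degrees Celsius; no salt or symmetry correction applied."""
    seq = seq.upper()
    dh = ds = 0.0
    for end in (seq[0], seq[-1]):
        h, s = INIT[end]
        dh += h
        ds += s
    for i in range(len(seq) - 1):
        h, s = NN[seq[i:i + 2]]
        dh += h
        ds += s
    # Two-state model: Tm = dH / (dS + R ln(Ct/4)), with dH converted to cal/mol.
    return 1000 * dh / (ds + R * math.log(strand_conc / 4)) - 273.15


print(round(nearest_neighbor_tm("AAATTTAAATTTAAATTTGG"), 1))
```

Because the result refers to 1 M NaCl, it overestimates the melting temperature under typical PCR buffer conditions; a salt correction such as the one PerlPrimer applies would be needed for practical use.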

Picky is an oligo microarray design program that identifies probes that are unique and specific to the input sequences. These calculations are based on parameters supplied by the user, including optimal probe length, ideal percentage of guanine and cytosine content, target melting temperature, salt concentration and the maximum length over which a target sequence may match any non-target sequence. (Reference: H.-H. Chou et al. (2004) Bioinformatics 20: 2893-2902). Download genome *.ffn files from GenBank for use with this program. N.B. Unfortunately these files do not include the gene names, only their coordinates.

Microarray analysis:

MAExplorer (MicroArray Explorer) - is a tool for data mining gene expression patterns. (Reference: P.F. Lemkin et al. (2000). Nucleic Acids Research 28: 4452-4459).

 MAGIC Tool (MicroArray Genome Imaging & Clustering Tool) - A teaching resource developed at Davidson College (U.S.A.) by Laurie Heyer and her undergraduate students. (Reference: L. J. Heyer et al. 2005. Bioinformatics 21: 2114 - 2115).

 VAMPIRE microarray suite is a collection of Java tools designed to perform Bayesian statistical analysis of gene expression array data. (Reference: Hsiao, A et al. 2005. Nucleic Acids Res. 33: W627-32).

Protein analysis:

GelScape -  allows analysis of standard 1D and 2D protein gels.  It  uses advanced concepts in "network computing" enabling one to upload, download, save, print, view, annotate, edit, label, compare and spot mark just about any gel image.  GelScape also allows one to calculate spot intensity, prepare HTML image maps and archive annotated gels to a public database (GelBank) - all using an easy-to-use, intuitive browser interface.  (Reference: N. Young et al. (2004). Bioinformatics 20: 976 - 978)

FLICKER - is an open-source stand-alone computer program for visually comparing 2D gel images. (Reference: P.F. Lemkin and G. Thornwall. (2002). In: J. Walker (ed.), The Protein Protocols Handbook, Second edition; Humana Press, Totowa, NJ)

TMRPres2D (TransMembrane protein Re-Presentation in 2 Dimensions tool) - takes data from a variety of protein folding servers and creates uniform, two-dimensional, high analysis graphical images/models of alpha-helical or beta-barrel transmembrane proteins. (Reference: I.C. Spyropoulos et al. (2004) Bioinformatics 20: 3258-3260).

MPEx (Membrane Protein Explorer) (Stephen White Laboratory, University of California Irvine, U.S.A.) - is a tool for exploring the topology and other features of membrane proteins by means of  hydropathy plots using thermodynamic principles. MPEx can also be installed on your computer as a stand-alone or Web Start application. 

BALLView - is a molecular viewer and modeling tool which combines state-of-the-art visualization capabilities with powerful modeling functionality including implementations of force field methods and continuum electrostatics models. (Reference: Moll et al. 2006. Bioinformatics 22: 365-366).

JMV (Java Molecular Viewer) - JMV is a molecular viewer written in Java and Java3D. JMV is designed to be an easy-to-use platform neutral molecular visualization tool, which can be used standalone or integrated into other programs. JMV provides several molecular representations, multiple coloring styles, lighting controls, and stereoscopic rendering capabilities. JMV loads PDB format molecular structure files over the web, from the RCSB protein databank, from BioCoRE filesystems, and from local filesystems.

QuteMol is an open source (GPL), interactive, high quality molecular visualization system which exploits current GPU capabilities through OpenGL shaders to offer an array of innovative visual effects (Ball and Sticks, Space-Fill and Liquorice visualization modes & Depth Aware Silhouette Enhancement). Its visualization techniques are aimed at improving clarity and giving an easier understanding of the 3D shape and structure of large molecules or complex proteins. (Reference: M. Tarini et al. 2006. IEEE Transactions on Visualization and Computer Graphics 12[5]).

Sequence Alignments:


PFAAT Protein Family Alignment Annotation Tool (Neogenesis Drug Discovery and Pfizer, Inc.) is a protein sequence alignment application designed to facilitate the analysis, curation, and annotation of large protein sequence families. Key features include the ability to align collections of sequences, cluster and/or group sequences into subfamilies, analyze sequences based on a number of similarity criteria, visualize protein structure, and annotate sequences and specific residue positions with text descriptions. (Reference: J.M. Johnson et al. (2003) Bioinformatics 19: 544-545).  

Jalview - Analysis and Manipulation of Multiple Sequence Alignments (M. Clamp, J. Cuff, S. Searle & G. Barton. EMBL-EBI, United Kingdom)

JAligner (Ahmed Moustafa) - is an open source Java implementation of the dynamic programming algorithm Smith-Waterman for biological local pairwise sequence alignment.
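As a reminder of how Smith-Waterman local alignment works, the sketch below fills the dynamic programming matrix row by row and returns only the best local score. It is not JAligner's code: the scoring values are arbitrary illustrative choices, and a simple linear gap penalty is used here for brevity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is always 0 in local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6 (the shared "ACG")
```

Recovering the alignment itself, rather than just its score, requires keeping the full matrix and tracing back from the highest-scoring cell.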

 ISHAN - is a flexible platform for performing fast homology analysis and molecular phylogenetic studies on proteins and DNA sequences, by bringing together all the relevant tools under a single package. Since the framework facilitates speedy alignments and compilation of data, evolutionary tracing of proteins and genes can be carried out in a faster way using ISHAN. (Reference: P. Shil et al. 2006. In Silico Biol. 6: 0035)

BigFoot - extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. It implements an MCMC sampling approach to jointly estimate a DNA multiple sequence alignment and the locations of slowly evolving regions that may represent subsequences undergoing purifying selection. While the insertion-deletion model is fixed, the program can pair it with any implemented substitution model. (Reference: R. Satija et al. 2009. BMC Evolutionary Biol. 9:217)

 SynteBase/SynteView - allows a fast and easy visualization of conservation of gene adjacency in many prokaryotic genomes for which orthology and neighbourhood data have been computed and stored in SynteBase, a dedicated relational database. (Reference: Lemoine, F. et al. 2008. BMC Bioinformatics, 9: 536)

Phylogeny:

OrthoANIu - Average Nucleotide Identity (ANI) is widely used to classify and identify bacteria; OrthoANI was developed to overcome the large differences in reciprocal ANI values associated with the ANI algorithm. The OrthoANIu tool employs USEARCH instead of BLAST for its OrthoANI calculations, which substantially decreases computational time and so makes larger comparative studies feasible. (Reference: Yoon, S. H. et al. (2017). Antonie van Leeuwenhoek. 110:1281–1286).

Orthologous Average Nucleotide Identity Tool (OAT) - OAT uses OrthoANI to measure the overall similarity between two genome sequences. ANI and OrthoANI are comparable algorithms: they share the same species demarcation cut-off at 95-96% and large comparison studies have shown that both algorithms produce near identical reciprocal similarities. Details of the OrthoANI algorithm are given in Lee et al. (2015). OAT provides an easy-to-follow Graphical User Interface that allows researchers to calculate OrthoANI values between genomes of interest without working in an unfamiliar command-line environment. (Reference: Lee, I. et al. (2015). Int J Syst Evol Microbiol. 66: 1100-1103).

 SNAP Workbench - this program manages and coordinates a series of analysis programs for making inferences on population processes. It allows the user to customize the implementation of complex console programs and functions for the purpose of automating and enhancing data exploration. The workbench facilitates population parameter estimation by ensuring that the assumptions and program limitations of each analysis method are met and by providing a step-by-step methodology to effectively integrate both summary-statistic methods and coalescent-based population genetic models. (Reference: E.W. Price & I. Carbone (2005) Bioinformatics 21: 402-404).  It is now online here.

Phylogenetic Tree Reconciler - permits finding gene duplications in phylogenetic trees, in order to improve gene function inferences. The algorithm is applicable to realistic data, especially n-ary species trees and unrooted phylogenetic trees. The algorithm also takes branch lengths into account. (Reference: J-F. Dufayard et al. 2005. Bioinformatics 21: 2596 - 2603).

Pintail - is a tool for identifying anomalies such as chimeras within 16S rDNA sequences.  In essence, the program works by comparing evolutionary distances between a query and subject sequence over the length of the 16S rRNA gene (small subunit rRNA), by employing a sampling window of specified size, progressing a fixed number of bases at a time along the length of the gene. (Reference: K.E. Ashelford et al. 2005. Appl. Environ. Microbiol. 71: 7724-7736).
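The sliding-window idea described above can be sketched as follows. This is a simplification rather than Pintail's method: it reports the plain proportion of differing sites (p-distance) per window instead of the evolutionary distances and expected-difference comparison that Pintail uses, it assumes the two sequences are already aligned to equal length, and the function name and default window and step sizes are arbitrary.

```python
def windowed_p_distance(query, subject, window=100, step=10):
    """Return (start, proportion of mismatches) for each window of an alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        # Compare only columns where neither sequence has a gap.
        pairs = [
            (q, s)
            for q, s in zip(query[start:start + window],
                            subject[start:start + window])
            if q != "-" and s != "-"
        ]
        diffs = sum(q != s for q, s in pairs)
        results.append((start, diffs / len(pairs) if pairs else 0.0))
    return results


print(windowed_p_distance("AAAACCCC", "AAAACCGG", window=4, step=2))
# [(0, 0.0), (2, 0.0), (4, 0.5)]
```

A chimeric query would be expected to show windows where the difference from the subject changes abruptly along the gene, which is the kind of anomaly such a profile is meant to expose.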

TreeIllustrator - is a program for displaying and manipulating phylogenetic trees. It gives you powerful means to customise your phylogenetic trees and compare them with the current classification of organisms. It handles trees with up to thousands of leaves; imports NEXUS (PAUP*) and NEWICK (PHYLIP) files; permits different tree shapes: radial, radial logarithmic, phylogram, rectangular cladogram, radial cladogram and slanted cladogram; and allows export to bitmap (JPEG) and vector (PostScript) formats. (Reference: G. Trooskens et al. 2005. Bioinformatics 21: 3801-3802).

CTree - CTree has been designed for the quantification of clusters within viral phylogenetic tree topologies. Clusters are stored as individual data structures from which statistical data, such as the Subtype Diversity Ratio (SDR), Subtype Diversity Variance (SDV) and pairwise distances can be extracted. (Reference: Archer J. & Robertson DL. 2007. Bioinformatics. 23: 2952-2953)

JSpecies is an easy to use, biologist-centric software designed to measure the probability that two genomes belong to the same species. (Reference: Richter M, & Rosselló-Móra R. 2009. Proc Natl Acad Sci U S A. 106:19126-31).

PareTree 1.0.2 - This command-line Java program allows users to ‘pare’ down their tree by either removing unwanted leaves (tip-nodes), removing bootstrap information from the tree, or removing branch lengths from the tree – or any combination. All of these functions can be accomplished in languages like R or Perl, but Java allows very large trees to be pared down quickly, efficiently, and easily!
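Two of the paring operations described above can be sketched with regular expressions on a Newick string. This is an illustration and not PareTree's implementation; the function names are hypothetical, and the patterns assume plain unquoted labels with bootstrap values written as internal node labels directly after a closing parenthesis.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':<number>' branch lengths from a Newick string."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


def strip_internal_labels(newick):
    """Remove labels that follow ')' (where bootstrap values usually sit)."""
    return re.sub(r"\)[^:;,()]+", ")", newick)


tree = "((A:0.1,B:0.2)95:0.3,C:0.4);"
print(strip_branch_lengths(tree))                         # ((A,B)95,C);
print(strip_internal_labels(tree))                        # ((A:0.1,B:0.2):0.3,C:0.4);
print(strip_internal_labels(strip_branch_lengths(tree)))  # ((A,B),C);
```

Removing leaves is harder to do safely with regular expressions, since the parent node may need to be collapsed afterwards; a proper tree parser is the better choice for that step.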

 SplitsTree5 is the leading application for computing unrooted phylogenetic networks from molecular sequence data. Given an alignment of sequences, a distance matrix or a set of trees, the program will compute a phylogenetic tree or network using methods such as split decomposition, neighbor-net, consensus network, super networks methods or methods for computing hybridization or simple recombination networks. (Reference: Huson DH. Bioinformatics. 1998;14(1): 68-73).

 Dendroscope 3 - is a program for working with rooted phylogenetic trees and networks. It provides a number of methods for drawing and comparing rooted phylogenetic networks, and for computing them from rooted trees.  (Reference: Huson DH & Scornavacca C (2012) Syst. Biol. 61(6):1061–1067).

 TreeGraph 2  (Reference: Stöver BC & Müller KF. BMC Bioinformatics 2010, 11:7).

Other:

JavaScript DNA Translator - A small JavaScript program written by William L Perry III that permits one to layer amino acid sequence on DNA sequence.
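Layering an amino acid sequence on a DNA sequence amounts to translating each codon and printing the residue under its first base. The sketch below shows one way to do this with the standard genetic code; it is unrelated to the program's own source, the function names are hypothetical, and only the forward strand in frame 1 is handled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO)
}


def translate(dna):
    """Translate complete codons in frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def layer(dna):
    """Return the DNA with its translation aligned underneath, one residue per codon."""
    protein_line = "".join(aa + "  " for aa in translate(dna)).rstrip()
    return dna + "\n" + protein_line


print(layer("ATGGCCTAA"))
# ATGGCCTAA
# M  A  *
```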

RNAknot - is a new method for predicting RNA secondary structure that contains the following components: stems, hairpin loops, multi-branched loops or multi-loops, bulge loops, and internal loops, in addition to two types of pseudoknots, H-type pseudoknot and Hairpin kissing. RNAknot is based on a genetic algorithm and Greedy Randomized Adaptive Search Procedure (GRASP), and it uses the free energy as the fitness function to evaluate the obtained structures. To validate the performance of the method, 131 tests were performed using two datasets of 26 and 105 RNA sequences, taken from the databases RNAstrand and Pseudobase respectively. (Reference: El Fatmi A et al. (2019) J Bioinform Comput Biol. 17(5): 1950031).

Motifs:

Two Sample Logo - detects and displays statistically significant differences in position-specific symbol compositions between two sets of multiple sequence alignments. In a typical scenario, two groups of aligned sequences will share a common motif but will differ in their functional annotation. Also available as an online tool. (Reference: Vacic, V. et al. 2006. Bioinformatics 22: 1536-1537).

CiiiDER - is a user-friendly tool for predicting and analysing transcription factor binding sites, designed with biologists in mind. CiiiDER predicts transcription factor binding sites (TFBSs) across regulatory regions of interest, such as promoters and enhancers derived from any species. It can perform an enrichment analysis to identify TFs that are significantly over- or under-represented in comparison to a bespoke background set and thereby elucidate pathways regulating sets of genes of pathophysiological importance. (Reference: Gearing LJ et al. (2019) PLoS ONE doi.org/10.1371/journal.pone.0215495).

Graphic packages:

ImageJ - is a public domain Java image processing program inspired by NIH Image. It can display, edit, analyze, process, save and print 8-bit, 16-bit and 32-bit images. It can read many image formats including TIFF, GIF, JPEG, etc. It can calculate area and pixel value statistics of user-defined selections. It can measure distances and angles. It can create density histograms and line profile plots. It supports standard image processing functions such as contrast manipulation, sharpening, smoothing, edge detection and median filtering.