N.B. Many of the tools that one needs for the analysis of genomes can be found in the DNA Sequence Analysis section. Listed here are tools specific to genomic analysis which do not fit easily in that section.
1. DNA sequencing
2. Sequencing errors
3. Genome annotation
4. Correcting genome annotations
5. Specialized annotation - general (inteins, plasmids, typing, vaccine candidates)
6. Two-component and other regulatory proteins
7. Orthologous genes/proteins
8. Specialized annotation - CRISPR
9. Specialized annotation - virulence determinants
10. Specialized annotation - Genomic Islands
11. Genome comparisons and synteny
12. Phylogeny (AAI and ANI)
13. Genome visualization
14. Synthetic genes
16. Meta sites
DNA Sequence Quality - Phred - provides base calling, chromatogram display and high quality sequence region evaluation and presentation for up to five sequences simultaneously.
Sequence assembly - you don't need your own contig assembly program when you can use:
EGassembler - aligns and merges sequence fragments resulting from shotgun sequencing or gene transcript (EST) fragments in order to reconstruct the original segment or gene (Reference: A. Masoudi-Nejad et al. 2006. Nucl. Acids Res. 34: W459-462).
CAP3 (PBIL, France ), (Reference: Huang,X. & Madan A. 1999. Genome Res. 9: 868-877).
CAP EST Assembler (Istituto FIRC di Oncologia Molecolare, Italy) - Maximum sequence length for each sequence is 30 kb - Maximum number of sequences 10 kb
Divide-and-Conquer Multiple Sequence Alignment (Universitat Bielefeld, Germany)
Sequencing errors: - if your DNA sequence doesn't match the expected protein sequence you can check for errors at GeneWise (EMBL-EBI), which compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. Other programs include:
GENIO/frame (Genio/scan N. Mache)
FrameD (Reference: T. Schiex et al. 2003. Nucl. Acids Res. 31: 3738-3741)
Shift - a web server for hidden stops in frameshifted translation
path :: protein back-translation and alignment - addresses the problem of finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions. Given two input protein sequences, the method implicitly aligns all the possible pairs of DNA sequences that encode them, by manipulating memory-efficient graph representations of the complete set of putative DNA sequences for each protein. (Reference: Gîrdea M et al. 2010. Algorithms for Molecular Biology 5:)
In-silico.com (Dr. Joseba Bikandi & co-workers, Faculty of Pharmacy, University of the Basque Country) - allows in silico experiments including theoretical PCR amplification, AFLP-PCR, restriction analysis and pulsed field gel electrophoresis [PFGE] with bacterial & archaeal genomes found in the public database.
NCBI Prokaryotic Genomes Automatic Annotation Pipeline. This will completely annotate your bacterial genome and provide you with a Sequin submission file. N.B. an NCBI Phage Automatic Annotation Pipeline is in development.
IGS Prokaryotic Annotation Pipeline (Institute for Genome Sciences, University of Maryland, U.S.A.) IGS has developed a comprehensive automated pipeline for use with Bacteria and Archaea. The pipeline predicts protein-coding genes as well as non-coding RNAs. Similarity evidence is collected for predicted proteins with a variety of methods including pairwise alignments, HMM searches, and multiple motif prediction tools. A hierarchical rule-based system is used to assign annotation to each protein based on the highest quality available evidence. Results are loaded into a relational database and can be viewed using the Manatee annotation visualization and curation tool. Results are also available in multiple standard flat file formats. In the case of IGS and NCBI, one has to contact them by email to request their services.
RAST (Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. Requires registration. (Reference: Aziz, RK et al. 2008. BMC Genomics 9:75). See also MyRAST under Molecular Biology Freeware for Windows.
BASys Bacterial Annotation Tool - this incredible tool supports automated, in-depth annotation of bacterial genomic sequences. It accepts raw DNA sequence data and an optional list of gene identification information (Glimmer) and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. (Reference: G.H. Van Domselaar et al. 2005. Nucl. Acids Res. 33(Web Server issue):W455-W459).
MAKER Web Annotation Service (MWAS) is an easily configurable web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (i.e. BAC clones, small whole genomes, preliminary sequencing data, etc.) to independently annotate and analyse their data and produce output that can be loaded into a genome database. (Reference: Holt, C. & Yandell, M. 2011. BMC Bioinformatics 12:491).
xBASE bacterial genome annotation service - To use, submit a file containing one or more FASTA-formatted nucleotide sequences (contigs produced by a whole genome assembler, for example). To guide gene prediction and annotation, select the closest complete reference sequence for your genome. (Reference: Chaudhuri RR, et al. 2008. Nucleic Acids Res.36(Database issue): D543-546.)
MITOS - a pipeline designed to provide consistent and high quality de novo annotation of metazoan mitochondrial genome sequences. The authors show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality, while avoiding the biases, inconsistencies of nomenclature, and typos that originate from manual curation strategies. (Reference: M. Bernt et al. 2013. Molecular Phylogenetics & Evolution 69:313-319).
GenSAS - Genome Sequence Annotation Server provides a one-stop website with a single graphical interface for running multiple structural and functional annotation tools, enabling visualization and manual curation of genome sequences. Users can upload sequences into their account and run gene prediction programs, protein homology searches, map ESTs, identify repeats, ORFs and SSRs with custom parameter settings. Each analysis is displayed on separate tracks of the graphical interface with custom editable tracks to select final annotation of features and create gff3 files for upload to genome browsers such as GBrowse. Additional programs can be easily added using this Drupal based software.
Viral Genome ORF Reader (VIGOR) - supports high throughput feature prediction and annotation. VIGOR employs an extrinsic strategy and, according to its authors, achieves sensitivity and specificity greater than 98% for the RNA viral genomes tested. Genome-specific features identified by VIGOR include frameshifts, ribosomal slippage, RNA editing, stop codon read-through, overlapping genes, embedded genes, and mature peptide cleavage sites. Genotyping capability for influenza and rotavirus is built into the program. (Reference: S. Wang et al. 2010. BMC Bioinformatics 11:451).
FLAN (FLu ANnotation) is an NCBI web server for genome annotation of user-provided influenza A virus or influenza B virus sequences. It can validate and predict protein sequences encoded by an input flu sequence. (Reference: Y. Bao et al. 2007. Nucleic Acids Res. 35 (Web Server issue): W280-W284).
CpGAVAS (Chloroplast Genome Annotation, Visualization, Analysis and GenBank Submission Tool) - allows accurate chloroplast genome annotation, the generation of circular maps, the provision of useful analysis results of the annotated genome, the creation of files that can be submitted to GenBank directly. (Reference: C. Liu et al. 2012. BMC Genomics 13: 715)
Genome Annotation Transfer Utility (GATU) annotates a genome based on a very closely related reference genome. The proteins/mature peptides of the reference genome are BLASTed against the genome to be annotated in order to find the genes/mature peptides in the genome to be annotated (Reference: T. Tcherepanov et al. 2006. BMC Genomics 7:150.)
BioGPS (The Scripps Research Institute, USA) - is a one-stop, customizable gene annotation portal that emphasizes user-customizability and community-extensibility, and a complete resource for learning about gene and protein function.
BAGEL (Groningen Biomolecular Sciences and Biotechnology Institute, Haren, the Netherlands) - will determine from an existing or non-submitted GenBank file the presence of bacteriocins, based on a database containing information on known bacteriocins and adjacent genes involved in bacteriocin activity. An alternative site for bacteriocins is BACTIBASE, which is a data repository of bacteriocin natural antimicrobial peptides.
MICheck (MIcrobial genome Checker) - enables rapid verification of sets of annotated genes and frameshifts in previously published bacterial genomes, or genomes for which the user has a *.gbk file. This tool can be seen as a preliminary step before the functional re-annotation step to check quickly for missing or wrongly annotated genes. It worked nicely with phage genomes from 43-135kb. (Reference: S. Cruveiller et al. 2005. Nucl. Acids Res. 33: W471- W479).
WebGeSTer - Genome Scanner for Terminators - my favourite terminator search program is finally web enabled. Please note that if you want to analyze data from a *.gbk file you need to use their conversion program "GenBank2GeSTer" first. A complete description of each terminator including a diagram is produced by this program. This site is linked to an extensive database of transcriptional terminators in bacterial genomes (WebGeSTer DB) (Reference: Mitra A. et al. 2011.
RibEx: Riboswitch Explorer - scans <40kb DNA for potential genes (which are linked to BLASTP) and several hundred regulatory elements, including riboswitches. If you click on "search for attenuators" it finds terminators and antiterminators. It presents the calculated genes and permits BLAST analysis at NCBI (Reference: C. Abreu-Goodger & E. Merino. 2005. Nucl. Acids Res. 33: W690-W692).
tRNAs: tRNAscan-SE - is incredibly sensitive & also provides secondary structure diagrams of the tRNA molecules (Reference: Schattner, P. et al. 2005. Nucleic Acids Res. 33: W686-689). Alternatively use ARAGORN (Reference: Laslett, D. & Canback, B. 2004. Nucleic Acids Research 32:11-16).
LTR_Finder - is an efficient program for finding full-length LTR retrotransposons in genome sequences. The size of the input file is now limited to 50MB (Reference: Z. Xu & H. Wang. 2007. Nucl. Acids Res. 35(Web Server issue): W265-W268).
RTAnalyzer - finds retrotransposons and detects L1 retrotransposition signatures (Reference: J-F. Lucier et al. 2007. Nucl. Acids Res. 35(Web Server issue):W269-W274).
FancyGene - is a fast and user-friendly web-based tool for producing images of one or more genes directly on the corresponding genomic locus. Starting from a variety of input formats, FancyGene rebuilds the basic components of a gene (UTRs, intron, exons). Once the initial representation is obtained, the user can superimpose additional features—such as protein domains and/or a variety of biological markers—in specific positions. (Reference: D. Rambaldi & F.D. Ciccarelli. 2009. Bioinformatics 25: 2281-2282).
Ori-Finder finds oriCs in bacterial genomes based on an integrated method comprising the analysis of base composition asymmetry using the Z-curve method, distribution of DnaA boxes, and the occurrence of genes frequently close to oriCs. (Reference: F. Gao & C.-T. Zhang. 2008. BMC Bioinformatics. 9:79).
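The base composition asymmetry that Ori-Finder analyses can be pictured with the three cumulative Z-curve components: purine versus pyrimidine (x), amino versus keto (y), and weak versus strong hydrogen bonding (z). The sketch below computes only these components; it is an illustrative assumption about the first step of such an analysis and does not reproduce Ori-Finder's DnaA box search or gene-neighbourhood scoring.

```python
# Contribution of each base to the (x, y, z) Z-curve components:
#   x = (A + G) - (C + T)   purine vs pyrimidine
#   y = (A + C) - (G + T)   amino vs keto
#   z = (A + T) - (G + C)   weak vs strong hydrogen bonds
STEPS = {
    "A": (1, 1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq: str) -> list:
    """Return the cumulative (x, y, z) point after each base of `seq`.

    Ambiguous bases (e.g. N) leave the running totals unchanged.
    """
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEPS.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    print(z_curve("AGCT"))
```

Plotting the components along a bacterial chromosome typically shows sharp turning points, and candidate origins are usually sought near such extremes before other evidence is taken into account.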
MG-RAST (Metagenome Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating metagenome samples. It provides annotation of sequence fragments, their phylogenetic classification and an initial metabolic reconstruction. The service also provides means for comparing phylogenetic classifications and metabolic reconstructions of metagenomes (Reference: F. Meyer et al. 2008. BMC Bioinformatics 9: 386).
Correcting genome annotations:
gbk2tbl (Andre Villegas, Public Health Ontario) - One of the problems with GenBank is that scientists do not update their submission data nor correct errors. In part this is due to laziness; but is also due to the fact that GenBank is, in most cases, unwilling to accept a new version of the Sequin file. Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank but, from my perspective, it is not easy to use. Gbk2tbl will generate a five-column table of the genome features, which can be easily edited in Notepad.
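To show what a five-column feature table looks like, here is a minimal sketch that writes one from a list of feature records. The record layout (dictionaries with `start`, `end`, `strand`, `key` and `qualifiers`) and the function name are assumptions made for this example; gbk2tbl itself reads a GenBank file and may differ in detail.

```python
def to_feature_table(seq_id: str, features: list) -> str:
    """Format features as a tab-delimited five-column feature table.

    Columns 1-3 hold start, stop and feature key; qualifier lines leave
    columns 1-3 empty and put the qualifier name and value in columns 4-5.
    Minus-strand features are written with start greater than stop.
    """
    lines = [f">Feature {seq_id}"]
    for feat in features:
        start, stop = feat["start"], feat["end"]
        if feat.get("strand", 1) < 0:
            start, stop = stop, start
        lines.append(f"{start}\t{stop}\t{feat['key']}")
        for name, value in feat.get("qualifiers", {}).items():
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [
        {
            "start": 100,
            "end": 400,
            "strand": -1,
            "key": "CDS",
            "qualifiers": {"product": "hypothetical protein"},
        }
    ]
    print(to_feature_table("contig1", example), end="")
```

Because every line is plain tab-separated text, corrections such as a new product name can be made in any text editor before the table is passed back to the submission software.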
Specialized annotation - general
ResFinder 2.1 (Acquired antimicrobial resistance gene finder) - uses BLAST for identification of acquired antimicrobial resistance genes in whole-genome data. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. Tested with 1411 different resistance genes with 100% identity. (Reference: Zankari E et al. 2012. J Antimicrob Chemother. 67:2640-2644)
PlasmidFinder 1.3 - identifies plasmids in totally or partially sequenced isolates of bacteria. The method uses BLAST for identification of replicons of plasmids belonging to the major incompatibility (Inc) groups of Enterobacteriaceae. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. See also pMLST (Reference: Carattoli A et al. 2014. Antimicrob. Agents Chemother. 58: 3895-903)
PHACTS can be used to quickly classify the lifestyle of a phage (temperate or lytic). All that is needed is the proteome of the phage to be classified and PHACTS will predict the lifestyle of that phage and return a confidence value for that prediction. (Reference: K. McNair et al. 2012. Bioinformatics 28: 614-618).
SpeciesFinder 1.0 (Danish Technical University) - predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the 16S rRNA gene.
CSI Phylogeny 1.1 (Call SNPs & Infer Phylogeny) - calls SNPs, filters the SNPs, does site validation and infers a phylogeny based on the concatenated alignment of the high quality SNPs. (Reference: Kaas, R.S. et al. PLoS ONE 2014; 9: e104984.)
KmerFinder 2.0 – predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data, in this case 16-mers) between the genomes of reference bacteria in a database and the genome provided by the user. (Reference: Hasman H et al. 2013. J Clin Microbiol. 52:139-146)
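The k-mer counting principle described for KmerFinder can be sketched as follows: break the query and each reference genome into overlapping k-mers, then rank the references by how many distinct k-mers they share with the query. The function names and the tiny example sequences are hypothetical, and details of the real service (k-mer subsampling, strand handling, scoring statistics) are not modelled here.

```python
def kmers(seq: str, k: int = 16) -> set:
    """Return the set of distinct overlapping k-mers in `seq`."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(query: str, references: dict, k: int = 16) -> list:
    """Rank reference genomes by the number of k-mers shared with the query.

    `references` maps a reference name to its sequence. The result is a
    list of (name, shared_kmer_count) pairs, best match first.
    """
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(ref, k))
        for name, ref in references.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    refs = {"species_a": "ACGTACGTAC", "species_b": "TTTTTTTTTT"}
    # k=4 only because the toy sequences are short; the text above uses 16-mers.
    print(rank_references("ACGTACGTAC", refs, k=4))
```

With 16-mers, chance matches between unrelated genomes are rare, which is why a simple count of shared k-mers can already separate closely related species.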
VIOLIN: Vaccine Investigation and Online Information Network - allows easy curation, comparison and analysis of vaccine-related research data across various human pathogens. VIOLIN is expected to become a centralized source of vaccine information and to provide investigators in basic and clinical sciences with curated data and bioinformatics tools for vaccine research and development. VBLAST: Customized BLAST Search for Vaccine Research allows various search strategies against 77 genomes of 34 pathogens. (Reference: He, Y. et al. 2014. Nucleic Acids Res. 42 (Database issue):D1124-32).
MLST 1.8 (MultiLocus Sequence Typing) - currently only works with assembled genomes and contigs (Reference: Larsen MV et al. 2012. J. Clin. Microbiol. 50: 1355-1361).
SISTR: Salmonella In Silico Typing Resource - (Public Health Agency of Canada, Laboratory for Foodborne Zoonoses) is a bioinformatics resource for rapidly interpreting in silico data for multiple Salmonella subtyping methods from draft bacterial genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). Google Chrome is recommended; Firefox is also supported but the SVG visualizations within this app may not be as responsive. Internet Explorer is unsupported.
FSFinder2 (Frameshift Signal Finder) - Programmed ribosomal frameshifting is involved in the expression of certain genes from a wide range of organisms such as viruses, bacteria and eukaryotes including humans. In programmed frameshifting, the ribosome switches to an alternative frame at a specific site in response to a special signal in a messenger RNA. Programmed frameshifting plays a role in viral particle morphogenesis, autogenous control, and alternative enzymatic activities. The most common frameshift is a -1 frameshift, in which the ribosome shifts a single nucleotide in the upstream direction. The major elements of -1 frameshifting consist of a slippery site, where the ribosome changes reading frames, and a stimulatory RNA structure such as a pseudoknot or stem-loop located a few nucleotides downstream. +1 frameshifts are much less common than -1 frameshifts but are observed in diverse organisms.
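As a rough illustration of the slippery-site component, the sketch below scans a sequence for the heptamer pattern X XXY YYZ that is commonly described for -1 frameshifting (X any nucleotide, Y either A or U, Z any nucleotide except G). The pattern definition and function name are assumptions for this example; FSFinder2's own signal models are not given in the description above, and the downstream pseudoknot or stem-loop is not checked here.

```python
import re

# Lookahead so that overlapping candidate heptamers are all reported.
# Group 2 is X (three identical bases), group 3 is Y (three identical A or T),
# and the final base Z may be A, C or T.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2([AT])\3\3[ACT]))")


def find_slippery_sites(seq: str) -> list:
    """Return (0-based position, heptamer) for each candidate slippery site.

    RNA input is accepted; U is converted to T before scanning.
    """
    dna = seq.upper().replace("U", "T")
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(dna)]


if __name__ == "__main__":
    print(find_slippery_sites("GGCUUUAAACGG"))
```

A heptamer match on its own is weak evidence; candidate sites are normally kept only when a stable RNA structure can be predicted a few nucleotides downstream.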
InBase, The Intein Database and Registry - Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein host protein and the free intein (Perler 1994). Protein splicing results in a native peptide bond between the ligated exteins. This is a database site which permits BLAST analysis. (Reference: Perler, F.B. 2002. Nucleic Acids Res. 30: 383-384).
Two-component and other regulatory proteins:
P2RP (Predicted Prokaryotic Regulatory Proteins) - users can input amino acid or genomic DNA sequences, and predicted proteins therein are scanned for the possession of DNA-binding domains and/or two-component system domains. RPs identified in this manner are categorised into families, unambiguously annotated. (Reference: Barakat M, et al. 2013. BMC Genomics 14:269).
P2CS (Prokaryotic 2-Component Systems) is a comprehensive resource for the analysis of Prokaryotic Two-Component Systems (TCSs). TCSs are comprised of a receptor histidine kinase (HK) and a partner response regulator (RR) and control important prokaryotic behaviors. It can be searched using BLASTP. (Reference: P. Ortet et al. 2015. Nucl. Acids Res. 43 (D1): D536-D541).
COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR). Sites which offer this analysis include:
WebMGA (Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST (Reference: Aziz RK et al. 2008. BMC Genomics 9:75), BASys (Bacterial Annotation System; Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459) and JGI IMG (Integrated Microbial Genomes; Reference: Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567).
Discover EggNOG 4.1 - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. (Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239).
OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours. (Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics; Chapter 6:Unit 6.12.1-19).
arCOGs (Archaeal Clusters of Orthologous Genes) - can be used to classify genes and provide improved functional annotation specific to archaeal genomes. (Reference: Makarova KS et al. 2007. Biology Direct 2:33).
KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. (Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).
PSP - Prokaryotic Selection Pressure - is an easy-to-use web tool for rapid identification of orthologous genes under positive selection from a set of multiple, closely related prokaryotic genomes. It provides several functions for in-depth analysis of evolutionary selection: retrieving the orthologous groups, removing the effect of gene recombination, generating codon-delimited alignments, building phylogenetic trees and estimating ω under different models. It also facilitates efficient exploration of the identified orthologous genes with positive selection at the metabolic-pathway level by enrichment of KEGG Orthology and/or COG. (Reference: Su, F. et al. 2013. BMC Genomics 14:924).
Specialized annotation - CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats):
CRISPRfinder - enables the easy detection of CRISPRs in locally-produced data and consultation of CRISPRs present in the database. It also gives information on the presence of CRISPR-associated (cas) genes when they have been annotated as such. (Reference: I. Grissa et al. 2007. Nucl. Acids Res. 35 (Web Server issue): W52-W57).
CRISPRmap - provides a quick and detailed insight into repeat conservation and diversity of both bacterial and archaeal systems. It comprises the largest dataset of CRISPRs to date and enables comprehensive independent clustering analyses to determine conserved sequence families, potential structure motifs for endoribonucleases, and evolutionary relationships. (Reference: S.J. Lange et al. 2013. Nucleic Acids Research, 41: 8034-8044).
CRISPI: a CRISPR Interactive database - includes a complete repertory of CRISPR-associated genes (CAS). A user-friendly web interface with many graphical tools and functions allows users to extract results, find CRISPRs in personal sequences or calculate sequence similarity with spacers. (Reference: Rousseau C et al. 2009. Bioinformatics. 25: 3317–3318).
CRISPRTarget - predicts the most likely targets of CRISPR RNAs. This can be used to discover targets in newly sequenced genomic or metagenomic data. (Reference: Biswas A et al. 2013. RNA Biol. 10:817-827)
Specialized annotation - virulence determinants: This is of particular interest to those working on bacteriophages for therapy
VirulenceFinder (Danish Technical University) – identification of virulence genes. The method uses BLAST for identification of known virulence genes in Escherichia coli. The method is being extended to also include virulence genes for Enterococcus and Staphylococcus aureus. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms.
ClanTox: a classifier of short animal toxins - predicts whether each sequence is toxin-like and provides a ranked list of positively predicted candidates according to statistical confidence. For each protein, additional information is presented including the presence of a signal peptide, the number of cysteine residues and the associated functional annotations. (Reference: G. Naamati et al. 2009. Nucleic Acids Res. 37(Web Server issue): W363–W368).
BTXpred - aims to predict bacterial toxins and their functions from primary amino acid sequence using SVM, HMM and PSI-Blast. It allows users to (a) predict bacterial toxins with 96.07% accuracy; (b) classify bacterial toxins into exotoxins and endotoxins with an accuracy of 95.71%; and (c) classify exotoxins into seven different functions depending upon their molecular targets: i) activate adenylate cyclase, ii) activate guanylate cyclase, iii) food poisoning, iv) neurotoxins, v) macrophage cytotoxin, vi) vacuolating cytotoxin and vii) thiol activated cytotoxin, with 100% overall accuracy. (Reference: Saha S & Raghava GP. 2007. In Silico Biol. 7: 405-412).
t3db the Toxin and Toxin Target Database - combines detailed toxin data with comprehensive toxin target information. The database currently houses 3,053 toxins which are linked to 1,670 corresponding toxin target records. Each toxin record (ToxCard) contains over 50 data fields and holds information such as chemical properties and descriptors, toxicity values, molecular and cellular interactions, and medical information. (Reference: Lim E et al. 2010. Nucleic Acids Res. 38(Database issue): D781-786).
TADB: a web-based resource for Type 2 toxin-antitoxin loci in Bacteria and Archaea (Reference: Y. Shao et al. 2011. Nucleic Acids Research 39: D606-D611). TAfinder: a web-based tool to identify Type II toxin-antitoxin loci in bacterial genomes.
DBETH Database of Bacterial ExoToxins for Humans is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins, from 26 different human pathogenic bacterial genera. All toxins are classified into 24 different toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. (Reference: Chakraborty A et al. 2012. Nucleic Acids Res. 40(Database issue): D615-620).
VFDB - is an integrated and comprehensive database of virulence factors for bacterial pathogens (also including Chlamydia and Mycoplasma). (Reference: L.H. Chen et al. 2012. Nucleic Acids Res. 40(Database issue): D641-D645).
MvirDB - LLNL Virulence Database - allows BLAST searching against a database of sequences representing known toxins, virulence factors, and antibiotic resistance genes (Reference: C. E. Zhou et al. 2007. Nucl. Acids Res. 35 (suppl 1): D391-D394).
PAIDB (Pathogenicity Island Database) - Pathogenicity islands (PAIs) and resistance islands (REIs) are key to the evolution of pathogens and appear to play complementary roles in the process of bacterial infection. While PAIs promote disease development, REIs give a fitness advantage to the host against multiple antimicrobial agents. An ancillary program, PAI Finder, identifies PAI-like regions or REI-like regions in a multi-sequence query. (Reference: S.H Yoon et al. 2015. Nucl. Acids Res. 43 (D1): D624-D630).
IslandViewer - includes a new interactive genome visualization tool, IslandPlot, and expanded virulence factor, antimicrobial resistance gene, and pathogen-associated gene annotations, as well as homologs of these genes in closely related genomes. Notably, incomplete genomes are accepted as input in IslandViewer 3, though they strongly urge users to use complete genomes whenever possible. (Reference: B.K. Dhillon et al. 2015. Nucl. Acids Res. 43 (W1): W104-W108).
Gypsy Database - an open editable database about the evolutionary relationship of viruses, mobile genetic elements (MGEs; Ty3/Gypsy, Retroviridae, Ty1/Copia and Bel/Pao LTR retroelements and the Caulimoviridae pararetroviruses of plants) and other genomic repeats. Equipped for BLAST and HMM searches. (Reference: Llorens, C et al. 2011. Nucl. Acids Res. 39(suppl 1): D70-D74).
PanDaTox (Pan Genomic Database for Genomic Elements Toxic to Bacteria) - is a database of genes and intergenic regions that are unclonable in E. coli, to aid in the discovery of new antibiotics and biotechnologically beneficial functional genes. It is also designed to improve the efficiency of metabolic engineering. BLAST Search feature included. (Reference: Amitai G & Sorek R. 2012. Bioengineered, 3: 218-221.)
PathogenFinder (predicts pathogenic potential) – Based on complete genomes from 513 bacteria annotated as human non-pathogens and 372 bacteria annotated as human pathogens, a database of protein families, which are either mainly associated with non-pathogens or with pathogens, has been created. This database is then used for predicting the pathogenic potential of bacteria. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. (Reference: Cosentino S et al. 2013. PLoS ONE 8: e77302)
VirulentPred - is an SVM based method to predict bacterial virulent protein sequences, which can be used to screen for virulent proteins in proteomes. Together with experimentally verified virulent proteins, several putative, non-annotated and hypothetical protein sequences have been predicted to be high scoring virulent proteins by the prediction method. (Reference: Garg A & Gupta G. 2008. BMC Bioinformatics 9: 62).
The Type III secretion system (T3SS) is an essential mechanism for host-pathogen interaction in the infection process. The proteins secreted through the T3SS machinery of many Gram-negative bacteria are known as T3SS effectors (T3SEs). These can either be localized subcellularly in the host, or be part of the needle tip of the T3SS that interacts directly with the host membrane to bring other effectors into the target cell. T3SEdb is an effort to assemble a comprehensive database of all experimentally determined and putative T3SEs into a web-accessible site. BLAST search is available. (Reference: Tay DM et al. 2010. BMC Bioinformatics. 11 Suppl 7:S4).
Effective (University of Vienna, Austria & Technical University of Munich, Germany) - Bacterial protein secretion is the key virulence mechanism of symbiotic and pathogenic bacteria. Effector proteins are transported from the bacterial cytosol into the extracellular medium or directly into the eukaryotic host cell. The Effective portal provides precalculated predictions of bacterial effectors in all publicly available pathogenic and symbiotic genomes, as well as the possibility for users to predict effectors in their own protein sequence data.
SIEVE Server is a public web tool for prediction of type III secreted effectors. The SIEVE Server scores potential secreted effectors from genomes of bacterial pathogens with type III secretion systems using a model learned from known secreted proteins. The SIEVE Server requires only protein sequences of proteins to be screened and returns a conservative probability that each input protein is a type III secreted effector. (Reference: McDermott JE et al. 2011. Infect Immun. 79:23-32).
T3SS - Type III secretion system effector prediction (Reference: Löwer M, & Schneider G. 2009. PLoS One. 4:e5917. Erratum in: PLoS One. 2009;4(7)).
Specialized annotation - Genomic Islands:
MobilomeFINDER: web-based tools for in silico and experimental discovery of bacterial genomic islands (Reference: H-Y. Ou et al. 2007. Nucl. Acids Res. 35 (Web Server issue): W97-W104).
Phage_Finder - was created to identify prophage regions in completed bacterial genomes. Using a test dataset of 42 bacterial genomes whose prophages have been manually identified, Phage_Finder found 91% of the regions, resulting in 7% false positive and 9% false negative prophages. A search of 302 complete bacterial genomes predicted 403 putative prophage regions, accounting for 2.7% of the total bacterial DNA. Analysis of the 285 putative attachment sites revealed tRNAs are targets for integration slightly more frequently (33%) than intergenic (31%) or intragenic (28%) regions, while tmRNAs were targeted in 8% of the regions. (Reference: D.E. Fouts. 2006. Nucleic Acids Res. 34: 5839–5851).
Prophinder - is a tool for detecting prophages in bacterial genomes. Select a GenBank formatted file as input.
PHAST (PHAge Search Tool) - is designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage “cornerstone” feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It also provides richly annotated tables on prophage features and prophage “quality” and distinguishes between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. Furthermore, tests indicate that PHAST is as accurate or slightly more accurate than all available phage finding tools, with sensitivity of 85.4% and positive predictive value of 94.2%. (Reference: Zhou, Y. et al. 2011. Nucl. Acids Res. 39(suppl 2): W347-W352).
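For readers comparing prophage finders, the two accuracy figures quoted above are computed from counts of true positives (TP), false negatives (FN) and false positives (FP). The sketch below shows the arithmetic; the counts in the example are hypothetical and are not taken from the PHAST evaluation.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real prophages that the tool found: TP / (TP + FN)."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of predicted prophages that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical benchmark: 9 prophages found, 3 missed, 1 false prediction.
    print(sensitivity(9, 3))
    print(positive_predictive_value(9, 1))
```

A tool can raise its sensitivity by predicting more regions, but this usually lowers the positive predictive value, so the two numbers are best read together.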
IslandViewer - integrates two sequence composition GI prediction methods SIGI-HMM and IslandPath-DIMOB, and a single comparative GI prediction method IslandPick (Reference: Langille et al. 2008. BMC Bioinformatics 9: 329).
PAIDB (PAthogenicity Island DataBase) has made an effort to collect known PAIs and to detect the potential PAI regions in the prokaryotic complete genomes. Pathogenicity islands (PAIs) are distinct genetic elements of pathogens encoding various virulence factors. (Reference: Yoon SH et al. 2007. Nucleic Acids Res. 35 (Database Issue): D395-D400).
Genome comparisons and synteny:
SyntTax - is a web server linking synteny to prokaryotic taxonomy. SyntTax incorporates a full hierarchical taxonomic tree allowing intuitive access to all completely sequenced prokaryotes (Archaea and Bacteria). Single or multiple organisms can be chosen on the basis of their lineage by selecting the corresponding rank nodes in the tree. This is my favourite among the synteny programs (Reference: Oberto J. 2013. BMC Bioinformatics. 14:4). The results below were generated using the heat-shock sigma factor (RpoH) from Salmonella Typhimurium against the Pseudomonadales.
Cinteny Server for Synteny Identification and Analysis of Genome Rearrangement (A. U. Sinha & J. Meller, University of Cincinnati, USA) - this server can be used for finding regions syntenic across multiple genomes and measuring the extent of genome rearrangement using reversal distance as a measure. You may create a project and upload your own data or work with pre-loaded prokaryote or eukaryote data.
AutoGRAPH is an integrated web server for multi-species comparative genomic analysis. It is designed for constructing and visualizing synteny maps between two or three species, determination and display of macrosynteny and microsynteny relationships among species, and for highlighting evolutionary breakpoints.
The web server constructs synteny maps by pairwise comparison of marker/anchor orders between a reference chromosome and one or two tested genome(s). It permits users to visualize and characterize several features: Conserved segments (CS), Conserved Segments Ordered (CSO) and breakpoints. (Reference: Derrien T et al. 2007. Bioinformatics 23:498-499).
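To make the idea of conserved segments and breakpoints concrete, here is a minimal sketch that compares marker orders between a reference and a tested genome. It is a simplification under stated assumptions (each marker occurs once per genome, orientation is ignored, and a segment is a run of markers that are also adjacent in the reference); it is not AutoGRAPH's own algorithm, and the function name is hypothetical.

```python
def conserved_segments(reference, tested):
    """Split the tested marker order into maximal runs whose markers are
    also neighbours (forward or reversed) in the reference order.
    Markers missing from the reference are skipped."""
    position = {marker: i for i, marker in enumerate(reference)}
    shared = [m for m in tested if m in position]
    segments = []
    for marker in shared:
        previous = segments[-1][-1] if segments else None
        if previous is not None and abs(position[marker] - position[previous]) == 1:
            segments[-1].append(marker)   # adjacency conserved: extend the run
        else:
            segments.append([marker])     # adjacency broken: start a new segment
    return segments


reference = list("ABCDEFGH")
tested = list("ABCFEDGH")                 # the block D-E-F is inverted
segments = conserved_segments(reference, tested)
breakpoints = len(segments) - 1           # one breakpoint between consecutive segments
print(segments, breakpoints)
```

In this toy case the inversion produces three segments and two breakpoints, one at each end of the inverted block.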
Sibelia (University of California San Diego, USA) - is a tool for finding synteny blocks in multiple closely related microbial genomes using iterative de Bruijn graphs. Unlike most other tools, Sibelia can find synteny blocks that are repeated within genomes as well as blocks shared by multiple genomes. It represents synteny blocks in a hierarchical structure with multiple layers, each representing a different level of granularity.
Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parameters for each, and export a publication-ready vector image. Incredibly easy to use - here are the results for a BLASTN comparison of Escherichia phages T1 (query) and ADB-2. (Reference: Wintersinger JA et al. 2015. Bioinformatics 31:1305-1306).
GeneOrder 2.0 (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is ideal for comparing small GenBank genomes (up to 0.25 Mb), while GeneOrder 3.0 extends the limits to approx. 2.0 Mb. Each gene from the Query sequence is compared to all of the genes from the Reference database using BLASTP. There are two display formats: graphical and tabular. Currently the graph is an applet and must be saved as a "SCREEN SHOT".
CoreGenes (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is designed to analyze two to five genomes simultaneously, generating a table of related genes - orthologs and putative orthologs. These entries are linked to their GenBank data. It has a limit of 0.35 Mb, while the newer version CoreGenes 2.0 extends the limit to approx. 2.0 Mb. If your data is not present in GenBank use this site.
CoreGenes 3 (D. Seto & P. Mahadevan, Bioinformatics & Computational Biology, George Mason Univ., U.S.A) - tallies the total number of genes in common between the two genomes being compared; displays the percent value of genes in common with a specific genome; determines the unique genes contained in a pair of proteomes. CoreGenes 3.5 is the batch CoreGenes server. I have made extensive use of this suite of programs in the classification of bacterial viruses.
WebACT - this is the web version of ACT (Artemis Comparison Tool) a DNA sequence comparison viewer based on Artemis (Reference:
WebGMAP - is a public web service for annotating and mapping individual cDNA sequences to the genomes of many eukaryote species, currently including Arabidopsis thaliana, Chlamydomonas reinhardtii, Glycine max, Oryza sativa, Physcomitrella patens and Populus trichocarpa. (Reference: C. Liang et al. 2009. Nucl. Acids Res. 37(Web Server issue):W77-W83)
Panseq (Chad Laing, Public Health Agency of Canada) - a group of tools for the analysis of the 'pan-genome' of a group of genomic sequences. The pan-genome of a bacterial species consists of a core genome and an accessory gene pool, the latter of which allows subpopulations of the organism to adapt to specific environments. The tools include Novel Region Finder, which finds sequences that are unique to a strain or group of strains with respect to another strain or group of strains; Pan-genome Analysis, which identifies the pan-genome among your sequences, finds SNPs in the core genome and determines the distribution of accessory genomic regions; and Loci Selector, which identifies loci that offer the best discrimination among your dataset. (Reference: Laing, C. et al. 2010. BMC Bioinformatics. 11: 461).
PARIGA - enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST outputs in a database of serialized Python objects, results can be filtered according to several parameters in real time, without re-running the search and without additional programming effort. (Reference: Orsini M. et al. 2013. PLoS One 8(5):e62224).
EDGAR (Efficient Database framework for comparative Genome Analyses using BLAST score Ratios) - EDGAR is designed to automatically perform genome comparisons in a high throughput approach and can be used for core genome, pan genome and singleton analysis, and Venn diagram construction. (Reference: Blom J. et al. 2009. BMC Bioinformatics 10: 154).
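The terms used here (score ratio, core genome, pan genome, singletons) can be illustrated with a short sketch. It assumes that genes have already been grouped into families, for example by applying a score ratio cutoff, and that each genome is represented as a set of family identifiers. The function names, family labels and bit scores are hypothetical and do not reflect EDGAR's internal implementation or its cutoff.

```python
def score_ratio(hit_bits: float, self_bits: float) -> float:
    """BLAST score ratio: bit score of the hit divided by the bit score
    of the query aligned against itself (1.0 means an identical match)."""
    return hit_bits / self_bits


def classify_families(presence):
    """presence maps genome name -> set of gene family identifiers.
    Returns (core, pan, singletons) as sets of family identifiers."""
    genomes = list(presence.values())
    pan = set().union(*genomes)               # families found in any genome
    core = set.intersection(*genomes)         # families found in every genome
    singletons = {f for f in pan
                  if sum(f in g for g in genomes) == 1}  # found in exactly one
    return core, pan, singletons


# Hypothetical input, for illustration only
presence = {
    "genome_A": {"f1", "f2", "f3"},
    "genome_B": {"f1", "f2", "f4"},
    "genome_C": {"f1", "f5"},
}
core, pan, singletons = classify_families(presence)
print(sorted(core), sorted(pan), sorted(singletons))
print(score_ratio(180.0, 200.0))
```

The same three sets are what a Venn diagram of the genomes displays: the core is the central intersection and the singletons are the outermost, unshared regions.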
OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species. It provides coverage of vertebrates, metazoa, protists, fungi, plants and bacteria for the comparison of orthologous clusters and also supports uploading of customized protein sequences from user-defined species. An interactive Venn diagram, summary counts, and functional summaries of the disjunction and intersection of clusters shared between species are displayed as part of the OrthoVenn result. OrthoVenn also includes in-depth views of the clusters using various sequence analysis tools. Furthermore, it identifies orthologous clusters of single copy genes and allows for a customized search of clusters of specific genes through key words or BLAST. (Reference: Y. Yang et al. 2015. Nucl. Acids Res. 43 (W1): W78-W84). Also found here.
Phylogeny (AAI and ANI)
ANI (Average Nucleotide Identity) calculator - estimates the average nucleotide identity using both best hits (one-way ANI) and reciprocal best hits (two-way ANI) between two genomic datasets. Typically, the ANI values between genomes of the same species are above 95% (e.g., Escherichia coli). Values below 75% are not to be trusted, and AAI should be used instead. This tool supports both complete and draft genomes (multi-fasta). (Reference: Goris J et al. 2007. Int J Syst Evol Microbiol. 57(Pt 1): 81-91).
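The difference between one-way and two-way ANI can be sketched as follows. The example assumes the genome fragments have already been searched against each other and that each direction is summarised as a dictionary mapping a query fragment to its best hit and percent identity. These data structures, the fragment names and the choice to average the two directions' identities for a reciprocal pair are assumptions made for illustration, not the calculator's actual implementation.

```python
from statistics import mean


def one_way_ani(best_hits):
    """Mean percent identity over every query fragment's best hit."""
    return mean(identity for _, identity in best_hits.values())


def two_way_ani(a_to_b, b_to_a):
    """Mean percent identity over reciprocal best hits only: fragment a's
    best hit is b, and b's best hit is a. Raises StatisticsError if no
    reciprocal pair exists."""
    identities = []
    for a_frag, (b_frag, identity) in a_to_b.items():
        back = b_to_a.get(b_frag)
        if back is not None and back[0] == a_frag:
            identities.append((identity + back[1]) / 2)
    return mean(identities)


# Hypothetical best-hit tables: fragment -> (best hit, % identity)
a_to_b = {"a1": ("b1", 98.0), "a2": ("b2", 96.0), "a3": ("b2", 80.0)}
b_to_a = {"b1": ("a1", 98.0), "b2": ("a2", 96.0), "b3": ("a3", 90.0)}

print(round(one_way_ani(a_to_b), 2))
print(two_way_ani(a_to_b, b_to_a))
```

Fragment a3 contributes to the one-way value but not to the two-way value, because its best hit b2 points back to a2 rather than to a3.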
GGDC (Genome-To-Genome Distance Calculator) - provides methods for inferring whole-genome distances which are well able to mimic DNA-DNA hybridization (DDH). Values calculated with GGDC yield a somewhat better correlation with wet-lab DDH values than alternative approaches such as "ANI". These distance functions can also cope with heavily reduced genomes and repetitive sequence regions. Some of them are also very robust against missing fractions of genomic information (due to incomplete genome sequencing). Thus, this web service can be used for genome-based species delineation. (Reference: Meier-Kolthoff JP et al. 2013. BMC Bioinformatics 14: 60).
POGO-DB - Based on computationally intensive whole-genome BLASTs, POGO-DB provides several metrics for pairwise genome comparisons: (a) Average Amino Acid Identity of all bi-directional best BLAST hits that covered at least 70% of the sequence and had at least 30% sequence identity; (b) Genomic Fluidity, which estimates the similarity in gene content between two genomes; (c) Number of orthologs shared between two genomes (as defined by two criteria); (d) Pairwise identity of the most similar 16S rRNA genes; (e) Pairwise identity of 73 additional globally-conserved marker genes (which were determined by the database authors to exist in at least 90% of all the genomes). (Reference: Lan Y et al. 2014. Nucl. Acids Res. 42 (D1): D625-D632).
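Metrics (a) and (b) can be sketched in a few lines. The AAI function applies the coverage and identity thresholds quoted above to a list of hits that are assumed to be bi-directional best hits already. The fluidity function assumes a common pairwise definition (gene families unique to either genome divided by the total families in both), which is not spelled out in the POGO-DB description; under that definition higher values mean less shared gene content. All names and numbers are hypothetical.

```python
def average_amino_acid_identity(hits, min_coverage=0.70, min_identity=30.0):
    """Mean percent identity of hits passing both thresholds.
    Each hit is a dict with 'identity' (percent) and 'coverage' (fraction).
    Returns None when no hit passes."""
    kept = [h["identity"] for h in hits
            if h["coverage"] >= min_coverage and h["identity"] >= min_identity]
    return sum(kept) / len(kept) if kept else None


def genomic_fluidity(families_a, families_b):
    """Assumed pairwise definition: families unique to either genome,
    divided by the sum of the family counts of both genomes."""
    unique = len(families_a ^ families_b)      # symmetric difference
    return unique / (len(families_a) + len(families_b))


# Hypothetical bi-directional best hits, for illustration only
hits = [
    {"identity": 80.0, "coverage": 0.95},
    {"identity": 60.0, "coverage": 0.75},
    {"identity": 90.0, "coverage": 0.50},   # fails the coverage threshold
    {"identity": 25.0, "coverage": 0.99},   # fails the identity threshold
]
print(average_amino_acid_identity(hits))
print(genomic_fluidity({"f1", "f2", "f3", "f4"}, {"f1", "f2", "f5"}))
```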
CGView Server - is a comparative genomics tool for circular genomes that allows sequence feature information to be visualized in the context of sequence analysis results. A genome sequence is supplied to the program in FASTA, GenBank, EMBL or raw format. Up to three comparison sequences (or sequence sets) in FASTA format can also be submitted. The CGView Server uses BLAST to compare the genome sequence to the comparison sequences, and then converts the results and any available feature information (from the GenBank, EMBL or optional GFF file) or analysis information (from an optional GFF file) into a high-quality graphical map showing the entire genome sequence, or a zoomed view of a region of interest. Several options are available for specifying how the BLAST comparisons are conducted, and for controlling how results are displayed. (Reference: Grant JR & Stothard P. 2008. Nucleic Acids Res. 36(Web Server issue): W181-184)
Jena Prokaryotic Genome Viewer (JPGV) - generates linear or circular plots from a GenBank flatfile (*.gbk); if desired, GC content, GC skew, purine excess and keto excess can also be displayed. Also allows BLAST analysis against related genomes. Requires free registration.
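For readers unfamiliar with the plotted quantities, GC content and GC skew are usually computed in windows along the sequence, with GC skew defined as (G - C) / (G + C). The sketch below shows that calculation under simple assumptions (plain uppercase or lowercase DNA string, fixed window and step, incomplete trailing window ignored); the function name and the toy sequence are hypothetical and this is not JPGV's own code.

```python
def gc_profile(seq: str, window: int, step: int):
    """Return (start, gc_content, gc_skew) for each full window.
    gc_skew = (G - C) / (G + C), or 0.0 when the window has no G or C."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if g + c else 0.0
        rows.append((start, gc_content, gc_skew))
    return rows


# Toy sequence, for illustration only
profile = gc_profile("GGGCATATCCCG", window=6, step=6)
for start, content, skew in profile:
    print(start, round(content, 3), skew)
```

In bacterial chromosomes the sign of the GC skew typically changes near the origin and terminus of replication, which is why it is a popular track on circular genome plots.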
GenomeVx - makes editable, publication-quality maps of mitochondrial and chloroplast genomes and of large plasmids. These maps show the location of genes and chromosomal features as well as a position scale. The program takes as input either raw feature positions or GenBank records. In the latter case, features are automatically extracted and colored, an example of which is given. Output is in the Adobe Portable Document Format (PDF) and can be edited by programs such as Adobe Illustrator. (Reference: G. Conant & K. Wolfe. 2008. Bioinformatics 24:861-862)
DNAPlotter - is an interactive Java application for generating circular and linear representations of genomes. Making use of the Artemis libraries to provide a user-friendly method of loading in sequence files (EMBL, GenBank, GFF) as well as data from relational databases, it filters features of interest to display on separate user-definable tracks. It can be used to produce publication quality images for papers or web pages. (Reference: Carver, T. et al. 2008. Bioinformatics 25:119-120)
GeneWiz (Center for Biological Sequence Analysis, Danish Technical University) produces linear or circular genome atlases such as the one below. Ready-made atlases are available for most bacteria, but by uploading custom data in GenBank format (.gbk) you can make your own diagram showing the genetic and physical properties of your genome.
OrganellarGenomeDRAW - is a suite of software tools that enable users to create high-quality visual representations of both circular and linear annotated genome sequences provided as GenBank files or accession numbers. Although all types of DNA sequences are accepted as input, the software has been specifically optimized to properly depict features of organellar genomes. A recent extension facilitates the plotting of quantitative gene expression data, such as transcript or protein abundance data, directly onto the genome map.
PlasmaDNA - Starting with a primary DNA sequence, PlasmaDNA looks for restriction sites, open reading frames, primer annealing sequences, and various common domains. The databases are easily expandable by users to fit their most common cloning needs. PlasmaDNA can manage and graphically represent multiple sequences at the same time, and keeps in memory the overhangs at the ends of the sequences, if any. This means that it is possible to virtually digest fragments, to add the digestion products to the project, and to ligate together fragments with compatible ends to generate new sequences. Excellent package for plasmids. (Reference: Angers-Loustau A et al. 2007. BMC Mol Biol. 8:77).
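A virtual digest of the kind described can be sketched very simply. The example below cuts a linear sequence at every EcoRI site (recognition sequence GAATTC, cut after the first G on the top strand) and returns the top-strand fragments. It is a minimal illustration under stated assumptions (linear molecule, one enzyme, overhangs not modelled) and is not PlasmaDNA's implementation; the function name and test sequence are hypothetical.

```python
def digest_linear(seq: str, site: str = "GAATTC", cut_offset: int = 1):
    """Cut a linear sequence at every occurrence of `site`.
    `cut_offset` is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts G^AATTC). Top strand only."""
    seq = seq.upper()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)
        i = seq.find(site, i + 1)      # keep searching after this site
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy sequence containing two EcoRI sites, for illustration only
fragments = digest_linear("AAGAATTCTTTTGAATTCGG")
print(fragments)
print([len(f) for f in fragments])
```

Each internal fragment begins with AATTC, the top-strand trace of the four-base 5' overhang that makes EcoRI ends compatible for ligation.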
Gene Structure Display Server (GSDS2.0) is designed for the visualization of gene features, such as the composition and position of exons, introns, and conserved elements, etc. The input can be sequences, GenBank accession numbers (or GIs), or features in BED/GTF/GFF3 formats. After inputting gene features, a high-quality image can be generated. The shape and color of features can be customized by users, and further functions for modifying figures are provided. To facilitate evolutionary analysis, a phylogenetic tree can be uploaded and added to the figure. (Reference: B. Hu et al. 2015. Bioinformatics 31:1296-1297).
GeneDesign - is an excellent resource for designing synthetic genes. It includes tools for codon optimization and removal of restriction sites (Reference: Richardson, S.M. et al. 2006. Genome Research 16:550-556)
Orphelia - is a metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin. Orphelia is based on a two-stage machine learning approach introduced by its authors. After the initial extraction of ORFs, linear discriminants are used to extract features from those ORFs. Subsequently, an artificial neural network combines the features and computes a gene probability for each ORF in a fragment. A greedy strategy computes a likely combination of high scoring ORFs with an overlap constraint. (Reference: K.J. Hoff et al. 2009. Nucl. Acids Res. 37(Web Server issue): W101-W105)
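The final greedy step can be sketched as follows: candidate ORFs are visited in order of decreasing gene probability and accepted only if they do not overlap an already accepted ORF by more than a set amount. The overlap limit, the probability cutoff, the tuple layout and the example values are assumptions for illustration; the values actually used by Orphelia are not given here.

```python
def overlap(a, b):
    """Number of shared positions between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def greedy_select(orfs, max_overlap=60, min_prob=0.5):
    """orfs: list of (start, end, probability) tuples.
    Accept ORFs from highest to lowest probability, skipping any that
    overlaps an accepted ORF by more than max_overlap bases."""
    chosen = []
    for orf in sorted(orfs, key=lambda o: o[2], reverse=True):
        if orf[2] < min_prob:
            break                         # remaining ORFs score even lower
        if all(overlap(orf, kept) <= max_overlap for kept in chosen):
            chosen.append(orf)
    return sorted(chosen)                 # report in genomic order


# Hypothetical candidates on one fragment, for illustration only
candidates = [(0, 300, 0.95), (250, 600, 0.80), (100, 400, 0.90), (580, 700, 0.40)]
selected = greedy_select(candidates)
print(selected)
```

Here the ORF at 100-400 is rejected despite its high score because it overlaps the best ORF by 200 bases, while the ORF at 250-600 is kept because its 50-base overlap is within the limit.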
DNAATLAS (DNA2.0 Inc., U.S.A.) - A place for all your sequences. Easily import all your constructs including GenBank, Gene Designer, Excel, Word, and nearly any text-based format. DNA Atlas immediately parses your upload files and infers whether each sequence is a feature, construct, primer, DNA or amino acid. Upload features and primers to see them annotated in your sequences. Instantly view constructs annotated with our curated list of over 1000 features, or add your own. Use the BLAST-based sequence search to quickly align and compare your sequences. Keep track of your sequences, features, and primers. Categorize them using tags - from freezer locations to characterization data. (requires registration).
Genome Compiler (Genome Compiler Corp., U.S.A.) - An all-in-one online software platform for DNA design & visualization, data management, collaboration and seamless DNA ordering, which is suitable for cloning, primer design, sequence alignment and annotation, virtual digest, gel simulation, etc.
SuperPhy (Chad Laing & Vic Gannon, Public Health Agency of Canada) is an online tool for the predictive genomics of Escherichia coli. The platform integrates analysis tools and genome sequence data for all publicly available E. coli genomes and facilitates the upload of new genome sequences from users under public or private settings. SuperPhy provides real-time analyses of thousands of genome sequences based on strain metadata, including geospatial and phylogenetic context.