DNA MOTIFS

Although one can search a DNA sequence against established lists of motifs, one can also discover motifs directly. To do this one has to derive a consensus sequence or probability matrix. For bacterial proteins whose binding sites have been determined, a good place to start is the E. coli DNA-Binding Site Matrices (A.M. McGuire, Harvard University, U.S.A.). The following sites provide one with a training set which can be used to derive a Gibbs screening matrix.
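As a rough illustration of what deriving a probability matrix involves, the sketch below turns a handful of aligned binding sites into a position probability matrix. The example sites and the pseudocount of 0.5 are assumptions made for illustration; they are not taken from any of the tools or data sets listed on this page.

```python
from collections import Counter

BASES = "ACGT"

def probability_matrix(sites, pseudocount=0.5):
    """Return one {base: probability} dict per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # The denominator includes one pseudocount per base (assumed value).
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix

# Hypothetical aligned sites, invented for this example.
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
ppm = probability_matrix(sites)

# Print the matrix with one row per base, one column per position.
for base in BASES:
    print(base, " ".join(f"{col[base]:.2f}" for col in ppm))
```

The pseudocount keeps a base that never appears in a column from getting a probability of exactly zero, which matters when the training set is small.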

See additional pages on Promoters, Terminators, and Transcriptional Factors.

An assessment of a set of motif identifiers can be found in Nature Biotechnology, 2005, 23(1):137-144.

Gibbs Motif Sampler Homepage (E.C. Rouchka and B. Thompson, Bioinformatics Laboratory of Wadsworth Center, U.S.A.) - I have linked to the prokaryotic DNA default setting page. On the next page I have presented data for the IHF-binding site (consensus: WWWTCAA[N4]TTR).
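A consensus written in IUPAC codes, such as the IHF consensus above, can be turned into a regular expression and used to scan a sequence directly. The sketch below is one way to do that; the function names and the test sequence are made up for this example, and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate a consensus such as WWWTCAA[N4]TTR into a regex string."""
    # Expand the bracketed repeat shorthand, e.g. [N4] -> NNNN.
    expanded = re.sub(
        r"\[([A-Z])(\d+)\]",
        lambda m: m.group(1) * int(m.group(2)),
        consensus,
    )
    return "".join(IUPAC[ch] for ch in expanded)

def find_sites(consensus, sequence):
    """Return (0-based start, matched text) for every hit, overlaps included."""
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

# Hypothetical test sequence with one planted site.
sequence = "GGCCAATTCAAGCGCTTAGGCC"
hits = find_sites("WWWTCAA[N4]TTR", sequence)
print(hits)
```

To search both strands one would also scan the reverse complement of the sequence and convert the reported positions back to forward-strand coordinates.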

RSA-tools - Gibbs (A. Neuwald & Jacques van Helden, Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, Belgium) - type in the matrix size desired and deselect "add reverse complement strand."  After running the program once I would delete those sequences from the discovery set which align imperfectly.

 BindGene (C. Lockwood, University of Manchester, United Kingdom) - I have found this site particularly useful.  If your sequence is less than 2kb use the default settings.  If you paste 10kb of data, I suggest changing the "shuffled matrices" to "0."

 TESS - String Search Page (Center for Bioinformatics, University of Pennsylvania, U.S.A.) - this site requires that one enter the motif consensus sequence (Search My Site Strings), and is limited to 2000 nt per search. N.B. This site also permits searching TRANSFAC Strings. Choose "small Java applets" to view the results of the search.

Create Matrix File (J. Zheng, Queen's University, Canada) - creates a matrix from a DNA Clustal alignment and also presents the consensus:

Number of sequences: 11
Length of alignment: 29
Consensus sequence representing: 80% matching base(s)

A  0  9 11 10  0  4  1  1  2  1  0  1  2  5  1  2  2  2  1  2  1  0  0 11  0 10  8  3  4
C  0  1  0  0  2  0  1  2  2  5  7  5  3  1  4  4  6  3  1  0 10 11  0  0  1  0  0  4  3
G  0  0  0  1  8  1  0  0  3  3  1  2  4  2  1  2  3  3  3  9  0  0 11  0  0  1  1  0  2
T 11  1  0  0  1  6  9  8  4  2  3  3  2  3  5  3  0  3  6  0  0  0  0  0 10  0  2  4  2

   T  A  A  A  S  W  T  Y  D  B  Y  B  V  D  Y  H  S  B  K  G  C  C  G  A  T  A  W  H  V
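The consensus line above is consistent with a simple rule: in each column, add bases in order of decreasing count until they cover at least 80% of the sequences, then write the IUPAC code for that set. The sketch below applies that rule to the count matrix shown. The rule itself and the alphabetical tie-breaking are assumptions inferred from the output, not the documented algorithm of Create Matrix File.

```python
# IUPAC code for each set of bases (key: bases in alphabetical order).
IUPAC_BY_SET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def consensus_from_counts(counts, threshold=0.8):
    """Build an IUPAC consensus from per-base count rows of equal length."""
    bases = sorted(counts)
    length = len(counts[bases[0]])
    result = []
    for i in range(length):
        column = {b: counts[b][i] for b in bases}
        total = sum(column.values())
        # Most frequent base first; ties broken alphabetically (an assumption).
        ranked = sorted(column, key=lambda b: (-column[b], b))
        chosen, covered = [], 0
        for b in ranked:
            chosen.append(b)
            covered += column[b]
            if covered >= threshold * total:
                break
        result.append(IUPAC_BY_SET["".join(sorted(chosen))])
    return "".join(result)

# Count matrix copied from the example output above (11 sequences, 29 columns).
counts = {
    "A": [0, 9, 11, 10, 0, 4, 1, 1, 2, 1, 0, 1, 2, 5, 1,
          2, 2, 2, 1, 2, 1, 0, 0, 11, 0, 10, 8, 3, 4],
    "C": [0, 1, 0, 0, 2, 0, 1, 2, 2, 5, 7, 5, 3, 1, 4,
          4, 6, 3, 1, 0, 10, 11, 0, 0, 1, 0, 0, 4, 3],
    "G": [0, 0, 0, 1, 8, 1, 0, 0, 3, 3, 1, 2, 4, 2, 1,
          2, 3, 3, 3, 9, 0, 0, 11, 0, 0, 1, 1, 0, 2],
    "T": [11, 1, 0, 0, 1, 6, 9, 8, 4, 2, 3, 3, 2, 3, 5,
          3, 0, 3, 6, 0, 0, 0, 0, 0, 10, 0, 2, 4, 2],
}

consensus = consensus_from_counts(counts)
print(consensus)
```

Lowering the threshold gives a more specific consensus with fewer degenerate codes; raising it gives a more permissive one.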

DNA Motifs Gibbs Sampler - SeSiMCMC (Sequence Similarities by Markov Chain Monte-Carlo) finds DNA motifs of unknown length and complicated structure in a set of unaligned DNA sequences. It uses an improved motif length estimator and a careful Bayesian analysis of the possibility that a site is absent from a sequence. Reference: A.V. Favorov et al. 2005. Bioinformatics 21: 2240-2245.

PromScan (D.J. Studholme & R. Dixon. 2003. J. Bacteriol. 185: 1757-1767; as modified by S. Richards, Queen's University, Canada) - scans small genomes for potential factor-binding sites, including IHF-binding sites. If a *.ptt file is included, the results will indicate the position of the promoter relative to the nearest gene.

FindTerm (Softberry Inc.) - only two tools exist on the internet for mapping rho-independent terminators: FindTerm and TransTerm. You might consider using the advanced feature options and minimally increasing the default energy threshold to -12.0.

TransTerm (Michael Nuhn, Nano+Bio-Center) - TransTerm searches for rho-independent terminators in the vicinity of annotated genes. This TIGR program can be accessed online in two ways. If you have the genome in GenBank format, use "Sequence Analysis" and choose TransTerm, since it will only look for terminators in the vicinity of the annotated genes. If the genome has not been annotated, choose "Annotation" and Glimmer2.02, RBSfinder & TransTerm. The latter site combines Glimmer and RBSfinder with TransTerm.

Tools to find motif clusters in DNA sequences - one should probably start at ZLAB (Dr. Zhiping Weng, Boston University, U.S.A.), which has developed a wide range of tools to study the interaction between regulatory proteins and their DNA/RNA target sites, including:

 Cluster-Buster
 Comet
 Cister

Find short split motifs in DNA sequences with YMF (Reference: Sinha, S. & Tompa, M. 2002. Nucleic Acids Research

Motif Sampler - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to "uncheck" the appropriate box if you don't want the complementary strand included in the analysis. (Reference: G. Thijs et al. 2002. J. Comput. Biol. 9: 447-464.)

Melina - Motif Elucidator in Nucleotide Sequence Assembly (Human Genome Center, University of Tokyo, Japan) - helps one extract a set of common motifs shared by functionally-related DNA sequences. It utilizes CONSENSUS, GIBBS DNA, MEME and Coresearch, which are considered to be the most progressive motif search algorithms. Each algorithm is supplied with an impressive set of selection parameters.

BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes (Stanford University AI Lab, U.S.A.) - uses a Gibbs sampling strategy to examine the upstream regions of genes in the same gene expression pattern group and looks for regulatory sequence motifs. BioProspector uses a Markov background to model the base dependencies of non-motif bases, which greatly improves the specificity of the reported motifs.
BioOptimizer - an algorithm designed to clean up the motifs found by BioProspector, Consensus, AlignACE & MEME by finding the configuration of motif start sites that maximizes a scoring function (Reference: S.T. Jensen & J.S. Liu. 2004. Bioinformatics 20: 1557-1564).
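To give a feel for what a Markov background means in this context, the sketch below estimates a first-order model (the probability of each base given the preceding base) from background sequences and scores a short word under it. This is a generic illustration, not BioProspector's code; the model order, the pseudocount, the uniform probability for the first base and the example sequences are all assumptions.

```python
import math

BASES = "ACGT"

def first_order_background(sequences, pseudocount=1.0):
    """Estimate P(next base | previous base) from background sequences."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            # Skip pairs containing ambiguous characters such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }

def log_probability(word, transitions, start=0.25):
    """Log-probability of a word; the first base is assumed uniform."""
    logp = math.log(start)
    for prev, nxt in zip(word, word[1:]):
        logp += math.log(transitions[prev][nxt])
    return logp

# Hypothetical AT-rich background, invented for this example.
background = ["ATATATATAT", "ATATGCAT"]
transitions = first_order_background(background)

print(log_probability("ATAT", transitions))
print(log_probability("AAAA", transitions))
```

In this background "ATAT" is far more probable than "AAAA", so a candidate motif made of alternating A and T would be less surprising, and would score lower against the background, than a run of A's.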

SCOPE (Suite for Computational identification Of Promoter Elements) - an ensemble of programs aimed at identifying novel cis-regulatory elements from groups of upstream sequences. (Reference: J.M. Carlson et al. 2007. Nucl. Acids Res. 35: W259-W264.)