DNA MOTIFS

Although one can search a DNA sequence against established lists of motifs, one can also discover motifs directly. To do this one has to derive a consensus sequence or probability matrix. For bacterial proteins whose binding sites have been determined, good places to start are the E. coli DNA-Binding Site Matrices (A.M. McGuire, Harvard University, U.S.A.) and DBTBS: a database of transcriptional regulation in Bacillus subtilis (University of Tokyo, Japan). The following sites provide one with a training set which can be used to derive a Gibbs screening matrix.

See additional pages on Promoters, Terminators, and Transcriptional Factors.

An assessment of a set of motif identifiers can be found in Nature Biotechnology, 2005, 23(1):137-144.

Gibbs Motif Sampler Homepage (E.C. Rouchka and B. Thompson, Bioinformatics Laboratory of Wadsworth Center, U.S.A.) - I have linked to the prokaryotic DNA default setting page. On the next page I have presented data for the IHF-binding site (consensus: WWWTCAA[N4]TTR).

RSA-tools - info-Gibbs (A. Neuwald & Jacques van Helden, Service de Conformation des Macromolécules Biologiques et de Bioinformatique, Université Libre de Bruxelles, Belgium) - type in the matrix size desired and deselect "add reverse complement strand."  After running the program once I would delete those sequences from the discovery set which align imperfectly.

TESS - String Search Page (Center for Bioinformatics, University of Pennsylvania, U.S.A.) - this site requires that one enter the motif consensus sequence (Search My Site Strings), and is limited to 2000 nt per search. N.B. This site also permits searching TRANSFAC Strings. Choose "small Java applets" to view the results of the search.
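
A consensus string search of this kind can be sketched in a few lines of Python. This is only an illustration, not the method TESS uses: the function names are hypothetical, the [N4] repeat notation is taken from the IHF consensus quoted above, and only the strand supplied is searched.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate a consensus such as WWWTCAA[N4]TTR into a regex string."""
    # Expand the bracketed repeat notation first, e.g. [N4] -> NNNN.
    expanded = re.sub(
        r"\[([A-Z])(\d+)\]",
        lambda m: m.group(1) * int(m.group(2)),
        consensus.upper(),
    )
    return "".join(IUPAC[base] for base in expanded)

def find_sites(sequence, consensus):
    """Return (0-based start, matched text) for every hit, overlaps included."""
    # Wrapping the pattern in a lookahead lets overlapping hits be reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]
```

To cover both strands one would run the same search on the reverse complement of the sequence as well.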

Create Matrix File (J. Zheng, Queen's University, Canada) - creates a matrix from a DNA Clustal alignment and also presents the consensus:

Number of sequences: 11
Length of alignment: 29
Consensus sequence representing: 80% matching base(s)

A  0  9 11 10  0  4  1  1  2  1  0  1  2  5  1  2  2  2  1  2  1  0  0 11  0 10  8  3  4
C  0  1  0  0  2  0  1  2  2  5  7  5  3  1  4  4  6  3  1  0 10 11  0  0  1  0  0  4  3
G  0  0  0  1  8  1  0  0  3  3  1  2  4  2  1  2  3  3  3  9  0  0 11  0  0  1  1  0  2
T 11  1  0  0  1  6  9  8  4  2  3  3  2  3  5  3  0  3  6  0  0  0  0  0 10  0  2  4  2

   T  A  A  A  S  W  T  Y  D  B  Y  B  V  D  Y  H  S  B  K  G  C  C  G  A  T  A  W  H  V
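
The relationship between the count matrix and the 80% consensus can be illustrated with a short Python sketch. The rule used here is an assumption, not a description of how the Create Matrix File program works: for each column, bases are added in order of decreasing count (ties kept in A, C, G, T order) until they cover at least 80% of the sequences, and the resulting set is written as an IUPAC code. The function name is hypothetical.

```python
# Count matrix copied from the output above (11 sequences, 29 columns).
COUNTS = {
    "A": [0, 9, 11, 10, 0, 4, 1, 1, 2, 1, 0, 1, 2, 5, 1, 2, 2, 2, 1, 2, 1, 0, 0, 11, 0, 10, 8, 3, 4],
    "C": [0, 1, 0, 0, 2, 0, 1, 2, 2, 5, 7, 5, 3, 1, 4, 4, 6, 3, 1, 0, 10, 11, 0, 0, 1, 0, 0, 4, 3],
    "G": [0, 0, 0, 1, 8, 1, 0, 0, 3, 3, 1, 2, 4, 2, 1, 2, 3, 3, 3, 9, 0, 0, 11, 0, 0, 1, 1, 0, 2],
    "T": [11, 1, 0, 0, 1, 6, 9, 8, 4, 2, 3, 3, 2, 3, 5, 3, 0, 3, 6, 0, 0, 0, 0, 0, 10, 0, 2, 4, 2],
}

# IUPAC ambiguity code for every non-empty set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def consensus_from_counts(counts, threshold_percent=80):
    """Return an IUPAC consensus covering threshold_percent of each column."""
    consensus = []
    for i in range(len(counts["A"])):
        column = [(base, counts[base][i]) for base in "ACGT"]
        total = sum(n for _, n in column)
        # sorted() is stable, so tied bases stay in A, C, G, T order (assumed rule).
        ranked = sorted(column, key=lambda item: -item[1])
        chosen, covered = set(), 0
        for base, n in ranked:
            chosen.add(base)
            covered += n
            # Integer comparison avoids floating-point rounding at the threshold.
            if covered * 100 >= threshold_percent * total:
                break
        consensus.append(IUPAC[frozenset(chosen)])
    return "".join(consensus)
```

Applied to the matrix above, consensus_from_counts(COUNTS) returns TAAASWTYDBYBVDYHSBKGCCGATAWHV, which agrees with the consensus line printed by the program.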

DNA Motifs Gibbs Sampler - SeSiMCMC - the Sequence Similarities by Markov Chain Monte-Carlo algorithm finds DNA motifs of unknown length and complicated structure in a set of unaligned DNA sequences. It uses an improved motif length estimator and careful Bayesian analysis of the possibility of a site absence in a sequence. Reference: A.V. Favorov et al. 2005. Bioinformatics 21: 2240-2245.

You may also want to consider the MEME Suite.

FindTerm (Softberry Inc.) - only two tools exist on the internet for mapping rho-independent terminators: FindTerm and TransTerm. You might consider using the advanced feature options and increasing the default energy threshold slightly, to -12.0.

Tools to find motif clusters in DNA sequences - one should probably start at ZLAB (Dr. Zhiping Weng, Boston University, U.S.A.), which has developed a wide range of tools to study the interaction between regulatory proteins and their DNA/RNA target sites, including:

 Cluster-Buster
 Comet
 Cister

Find short split motifs in DNA sequences with YMF (Reference: Sinha, S. & Tompa, M. 2002. Nucl. Acids Res.)

Motif Sampler - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to "uncheck" the appropriate box if you don't want the complementary strand included in the analysis. (Reference: G. Thijs et al. 2002. J. Comput. Biol. 9: 447-464.)
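
To make the idea of a position probability matrix concrete, here is a minimal Python sketch. It does not perform Gibbs sampling and is not taken from Motif Sampler. It only shows how such a matrix can be built from sites that are already aligned and how it can then score a sequence against a uniform background. The function names, the pseudocount of 1 and the background of 0.25 are all assumptions made for illustration, and only A, C, G and T are handled.

```python
import math

def position_probability_matrix(sites, pseudocount=1.0):
    """Build per-position base probabilities from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        # The pseudocount keeps unseen bases from getting probability zero.
        matrix.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return matrix

def log_odds_score(window, matrix, background=0.25):
    """Sum of log2(probability / background) over the positions of one window."""
    return sum(math.log2(matrix[i][base] / background) for i, base in enumerate(window))

def best_hit(sequence, matrix):
    """Return (score, 0-based start) of the highest-scoring window."""
    width = len(matrix)
    scores = [
        (log_odds_score(sequence[i:i + width], matrix), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)
```

For example, with the hypothetical sites TCAA, TCAA, TCAT and ACAA, the first column gives T a probability of 0.5, and best_hit("GGGGTCAAGGGG", matrix) places the best window at position 4.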

extractUpStreamDNA (A. Villegas, Public Health Agency of Canada, Laboratory for Foodborne Zoonoses) - takes a GenBank flatfile (*.gbk) as input and, for every CDS that it finds, extracts a pre-determined length of upstream DNA (the length is supplied as an argument, and will include 3 nt for the initiation codon). The output is an FFN file of these upstream DNA sequences. N.B. this only works for prokaryotic sequences because it does not handle the splits or joins found in eukaryotic sequences. These data can then be analyzed with programs such as MEME.
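
The extraction step can be sketched in Python as follows. This is not the extractUpStreamDNA program: it skips GenBank parsing and assumes the CDS coordinates are already available as (name, start, end, strand) tuples with 0-based, end-exclusive positions. Whether the 3 nt of the initiation codon count toward the requested length is not clear from the description, so this sketch adds them on top of it. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_region(genome, start, end, strand, length, codon_nt=3):
    """Return `length` nt upstream of a CDS plus its first `codon_nt` nt.

    start and end are 0-based, end-exclusive forward-strand coordinates;
    strand is +1 or -1. Regions that run off the sequence are truncated.
    """
    if strand == 1:
        return genome[max(0, start - length):start + codon_nt]
    # On the reverse strand the initiation codon sits at the right-hand end.
    region = genome[end - codon_nt:min(len(genome), end + length)]
    return reverse_complement(region)

def upstream_fasta(genome, features, length):
    """Format the upstream regions of all features as FASTA text."""
    records = []
    for name, start, end, strand in features:
        records.append(">" + name + "\n" + upstream_region(genome, start, end, strand, length))
    return "\n".join(records)
```

Simple start and end coordinates are enough here only because prokaryotic CDS features are contiguous. A feature described by a join would need each segment handled separately, which is the limitation noted above.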


MelinaII - Motif Elucidator in Nucleotide Sequence Assembly (Human Genome Center, University of Tokyo, Japan) - helps one extract a set of common motifs shared by functionally related DNA sequences. It utilizes CONSENSUS, GIBBS DNA, MEME and Coresearch, which are considered to be the most progressive motif search algorithms. Each algorithm is supplied with an impressive set of selection parameters.

SCOPE (Suite for Computational identification Of Promoter Elements) - an ensemble of programs aimed at identifying novel cis-regulatory elements from groups of upstream sequences. (Reference: J.M. Carlson et al. 2007. Nucl. Acids Res. 35: W259-W264)

 

Updated: October 28, 2012