Although you can search a DNA sequence against established lists of motifs, you can also discover motifs directly. To do this, you first derive a consensus sequence or a probability matrix. For bacterial proteins whose binding sites have been determined, good places to start are the E. coli DNA-Binding Site Matrices (A.M. McGuire, Harvard University, U.S.A.) and DBTBS: a database of transcriptional regulation in Bacillus subtilis (University of Tokyo, Japan). These sites provide a training set that can be used to derive a Gibbs screening matrix.
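As a rough illustration of deriving a probability matrix and a consensus sequence from a training set, here is a minimal Python sketch. The aligned binding sites are made up for the example, and the function names and the pseudocount of 1 are assumptions, not taken from any of the resources listed on this page.

```python
from collections import Counter

def position_probability_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Return one {base: probability} dict per column of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for i in range(length):
        # Count the bases observed in column i across all sites.
        counts = Counter(site[i].upper() for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return matrix

def consensus(matrix):
    """Pick the most probable base in each column."""
    return "".join(max(column, key=column.get) for column in matrix)

# Hypothetical aligned binding sites (illustration only).
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
ppm = position_probability_matrix(sites)
print(consensus(ppm))  # prints TTGACA
```

The pseudocount keeps every probability above zero, which matters when the matrix is later used to score sequences containing bases never seen at a given position in the training set.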
See additional pages on Promoters, Terminators, and Transcriptional Factors. Recent Review of Different Sequence Motif Finding Algorithms (Reference: Hashim FA et al. Avicenna J Med Biotechnol. 2019; 11(2):130-148).
RSAT (Regulatory Sequence Analysis Tools) - is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. (Reference: Santana-Garcia W et al. Nucleic Acids Res. 2022. 50(Web Server issue):W670–W676). This provides links to the following specialized sites:
RSAT Fungi
RSAT Prokaryotes
RSAT Metazoa
RSAT Protists
RSAT Plants
You may also want to consider the MEME Suite.
Motif Sampler - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to "uncheck" the appropriate box if you don't want the complementary strand included in the analysis. (Reference: Thijs G et al. 2002. J. Comput. Biol. 9: 447-464).
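The general idea of Gibbs sampling for motif discovery can be sketched in a few lines of Python. This is a simplified site sampler under stated assumptions: exactly one motif occurrence per sequence, a fixed motif width, no background model, and no reverse-strand search. It is not the Motif Sampler implementation, and the function name, parameters, and test sequences are hypothetical.

```python
import random

def gibbs_motif_search(seqs, width, iterations=2000, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Begin with a random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        held_out = rng.randrange(len(seqs))
        # Build a position probability matrix from all other sequences.
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != held_out]
        total = len(others) + pseudocount * len(alphabet)
        profile = []
        for col in range(width):
            column = [site[col] for site in others]
            profile.append({b: (column.count(b) + pseudocount) / total
                            for b in alphabet})
        # Weight every window of the held-out sequence by its probability
        # under the current profile, then sample a new start position.
        seq = seqs[held_out]
        weights = []
        for p in range(len(seq) - width + 1):
            w = 1.0
            for col in range(width):
                w *= profile[col][seq[p + col]]
            weights.append(w)
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Made-up upstream regions, each containing a planted TTGACA site.
seqs = ["ACGTTTGACAGGCT", "GGTTGACATCCAGA", "CATTGACAGGTACC", "TTGACACGGATCGA"]
starts = gibbs_motif_search(seqs, width=6)
for s, p in zip(seqs, starts):
    print(p, s[p:p + 6])
```

Because the procedure is stochastic, different seeds can settle on different sites; in practice a sampler is usually run several times and the best-scoring matrix is kept.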
BaMM offers four tools, including (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, and (iii) searching with an input motif for similar motifs in the BaMM database, which holds motifs for >1000 transcription factors trained from the GTRD ChIP-seq database. (Reference: Kiesel A et al. 2018. Nucleic Acids Res. 46: W215–W220).
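Scanning sequences for motif occurrences can be illustrated with a simple log-odds scan. The sketch below uses a zero-order position probability matrix and a uniform background, which is much simpler than the higher-order models BaMM uses; the matrix values, function name, and threshold are assumptions chosen for illustration.

```python
import math

def log_odds_scan(sequence, ppm, background=None, threshold=0.0):
    """Return (start, window, score) for windows scoring at or above threshold."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(ppm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum per-position log2 ratios of motif vs. background probability.
        score = sum(math.log2(ppm[i][base] / background[base])
                    for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical 3-column matrix favouring the site TGA.
ppm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
hits = log_odds_scan("CCTGACC", ppm)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

A positive score means the window is more likely under the motif model than under the background; raising the threshold trades sensitivity for fewer false positives.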
STAMP: a web tool for exploring DNA-binding motif similarities (Reference: Mahony S & Benos PV. 2007. Nucl Acids Res. 35: W253–W258).
P2RP (Predicted Prokaryotic Regulatory Proteins) - identifies regulatory proteins, including transcription factors (TFs) and two-component systems (TCSs), based upon analysis of DNA or protein sequences. (Reference: Barakat M et al. 2013. BMC Genomics 14: 269).
kmer analysis - K-mers are short DNA sequences (a substring of length k) that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment.
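To make the definition concrete, here is a minimal Python sketch that counts every overlapping k-mer in a sequence. The function name and the example sequence are illustrative only.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count all overlapping substrings of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ATGATGA", 3)
print(counts.most_common())
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the 7-base example yields five 3-mers in total.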
KmerFinder 3.2 – predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data, in this case 16-mers) between the genomes of reference bacteria in a database and the genome provided by the user. (Reference: Hasman H et al. 2013. J Clin Microbiol. 52:139-146)
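The idea of ranking reference genomes by the number of k-mers they share with a query can be sketched as follows. This is not KmerFinder's actual scoring scheme: the function names, the toy sequences, and the use of k = 4 (instead of 16-mers) are assumptions made to keep the example small.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in the sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def rank_references(query, references, k=16):
    """Rank reference names by how many distinct k-mers they share with the query."""
    query_kmers = kmer_set(query, k)
    scores = {name: len(query_kmers & kmer_set(seq, k))
              for name, seq in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy "genomes" (hypothetical); real use would involve whole genomes and k = 16.
references = {
    "species_A": "ATGCGTACGTTAGC",
    "species_B": "TTTTCCCCGGGGAAAA",
}
ranking = rank_references("GCGTACGTTA", references, k=4)
print(ranking)  # prints [('species_A', 7), ('species_B', 0)]
```

Using sets counts each distinct k-mer once, so repeated regions in a genome do not inflate the score.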
kpLogo - Motifs of only 1–4 letters can play important roles when present at key locations within macromolecules. Because existing motif-discovery tools typically miss these position-specific short motifs, kpLogo was developed as a probability-based logo tool for integrated detection and visualization of position-specific ultra-short motifs from a set of aligned sequences. (Reference: Wu X & Bartel DP. 2017. Nucleic Acids Res. 45 (Issue W1): W534–W538).
KmerKeys: a web resource for searching indexed genome assemblies and variants (Reference: Pavlichin DS et al. 2022. Nucleic Acids Res. 50 (Issue W1): W448–W453).
UPDATED: July, 2025