MOTIFS: USING DATABASES & CREATING YOUR OWN
SEARCHING MOTIF DATABASES
BACKGROUND INFORMATION: Proteins having related functions may not show high overall sequence similarity yet may contain short stretches of amino acid residues that are highly conserved. For background information on this see PROSITE at ExPASy. N.B. I recommend that you check your protein sequence with at least two different search engines. Alternatively, use a meta site such as MOTIF (GenomeNet, Institute for Chemical Research, Kyoto University, Japan) to simultaneously carry out Prosite, Blocks, ProDom, Prints and Pfam searches.
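Short conserved stretches of this kind are often written down in PROSITE pattern notation. As a rough illustration only (this is not how ScanProsite or any other server works internally), the Python sketch below converts a simple PROSITE-style pattern into a regular expression and reports where it matches. The function names prosite_to_regex and scan are hypothetical, the test sequence is made up, and only the basic syntax is handled: x, [..], {..}, repeat counts, and the < and > anchors outside brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):        # '<' anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                    # x = any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} = any residue except these
        # [..] sets and single letters are already valid regex.
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def scan(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# P-loop (Walker A) style pattern against a made-up sequence.
print(scan("[AG]-x(4)-G-K-[ST].", "MSTGLVGAPGSGKSTLLR"))
```

A hit from a pattern this short is only a hint; the profile- and HMM-based searches listed here are far more sensitive and specific.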
Several great sites are listed below; the first four are meta sites:
Motif Scan – (MyHits, SIB, Switzerland) includes Prosite, Pfam and HAMAP profiles.
InterPro Search (European Bioinformatics Institute, United Kingdom) includes BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, HAMAP, patternScan, SuperFamily, SignalPHMM, TMHMM, HMMPanther & Gene3D (or a subset). This service is also available here or here.
MOTIF (GenomeNet, Japan) - I recommend this for protein analysis; I have tried phage genomes against the DNA motif database without success. Offers 6 motif databases and the possibility of using your own.
CDD or CD-Search (Conserved Domain Databases) - (NCBI) includes Smart and Pfam and is invoked when one uses BLASTP. This tool can also be accessed here.
Pfam - (Sanger Institute) or here (Howard Hughes Medical Institute, Janelia Farm)
ScanProsite – (ExPASy)
Block Searcher – (Fred Hutchinson Cancer Research Center, U.S.A.)
PRINTS for all information on the motifs and the scan engines. FingerPRINTScan is now a subset of P-val FingerPRINTScan which offers access to MULScan, GRAPHScan, FPScan and FPScan_fam.
ProDom (INRA)
SMART Simple Modular Architecture Research Tool (EMBL, Universität Heidelberg) - searches a sequence for the domains/sequences listed on the homepage. Try selecting/deselecting the default settings.
iProClass (Protein Information Resource, Georgetown University Medical Centre, U.S.A.) - is an integrated resource that provides comprehensive family relationships and structural/functional features of proteins.
PipeAlign (Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, France) offers an integrated approach to protein family analysis through a cascade of five different sequence analysis programs: BALLAST; the DbClustal multiple alignment program; Rascal alignment analysis; NorMD, which removes any sequences that do not belong to the protein family; and Secator or DPC, which cluster the remaining sequences into potential functional subfamilies. Reference: F. Plewniak et al. 2003. Nucleic Acids Research, 31: 3829-3832.
MEROPS BLAST - permits one to screen protein sequences against an extensive database of characterized peptidases (Rawlings, N.D., O'Brien, E. A. & Barrett, A.J. (2002) MEROPS: the protease database. Nucleic Acids Res. 30, 343-346).
For specific protein modifications or site detection consult the following sites:
DNA binding - motifs: (A good tutorial resource can be found here )
GYM - the most recent program for analysis of helix-turn-helix motifs in proteins. N.B. the next site dates from 1990. (Reference: Narasimhan, G. et al. 2002. J. Computational Biol. 9:707-720)
Helix-turn-Helix Motif Prediction - (Institut de Biologie et Chemie des Proteines, Lyon, France)
HTHQuery - is another site for detection of the DNA-binding helix-turn-helix motif. This program has a true positive rate of 83.5% and a false positive rate of 0.8%. Unfortunately it only takes a pdb file. (Reference: C. Ferrer-Costa et al. 2005. Bioinformatics 21: 3679-3680).
Leucine zippers – (Deutsches Krebsforschungszentrum, Germany) or 2ZIP - described by Bornberg-Bauer,E. et al. (1998) Nucleic Acids Res. 26:2740-2746.
DBD: Transcription factor prediction database - permits searching against a database of different DNA-binding proteins. Please note that the searches often last several minutes.
Glycosylation:
NetOGlyc (Center for Biological Sequence Analysis, Technical University of Denmark) - produces neural network predictions of mucin type GalNAc O-glycosylation sites in mammalian proteins. SignalP is automatically run on all sequences. A warning is displayed if a signal peptide is not detected. In transmembrane proteins, only extracellular domains may be O-glycosylated with mucin-type GalNAc.
NetNGlyc (Center for Biological Sequence Analysis, Technical University of Denmark) - predicts N-glycosylation sites in human proteins using artificial neural networks that examine the sequence context of Asn-Xaa-Ser/Thr sequons.
YinOYang (Center for Biological Sequence Analysis, Technical University of Denmark) - produces neural network predictions for O-β-GlcNAc attachment sites in eukaryotic protein sequences. This server can also use NetPhos to mark possible phosphorylated sites and hence identify "Yin-Yang" sites.
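Before submitting a sequence to NetNGlyc it can be useful to see where the Asn-Xaa-Ser/Thr sequons actually are. The Python sketch below simply lists them; it is a plain text search, not a reimplementation of the neural network, and a sequon is necessary but not sufficient for glycosylation. Two assumptions are mine rather than from the text above: Xaa is taken to exclude proline (the usual convention), and the name find_sequons and the example sequence are made up.

```python
import re

# Lookahead so that overlapping sequons are all reported.
# Assumption: Xaa is any residue except proline.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return (1-based position of Asn, sequon) for each N-X-S/T in sequence."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence.upper())]


# Made-up example: Asn-6 is followed by Pro and is therefore skipped.
print(find_sequons("MNGTLNPSANVSQ"))  # [(2, 'NGT'), (10, 'NVS')]
```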
Fatty acylation:
LipoP 1.0 (Center for Biological Sequence Analysis, Technical University of Denmark) - predicts where signal peptidases I & II from Gram-negative bacteria will cleave a protein.
NMT - The MYR Predictor (IMP [Research Institute of Molecular Pathology] Bioinformatics Group, Austria) - predicts N-terminal N-myristoylation. Generally, the enzyme NMT requires an N-terminal glycine (leading methionines are cleaved prior to myristoylation). However, internal glycines may also become N-terminal as a result of proteolytic processing of proproteins.
Myristoylator (ExPASy, Switzerland) - predicts N-terminal myristoylation of proteins by neural networks. Only N-terminal glycines are myristoylated (leading methionines are cleaved prior to myristoylation).
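The N-terminal glycine requirement described for both predictors can be checked before submitting anything. The Python sketch below is only that pre-filter, not a predictor: it assumes that a single initiator methionine is removed, and the function name is hypothetical. It deliberately ignores internal glycines exposed by proteolytic processing.

```python
def has_candidate_nterminal_glycine(sequence):
    """True if the mature N-terminus would be Gly (necessary, not sufficient)."""
    seq = "".join(sequence.split()).upper()
    if seq.startswith("M"):
        seq = seq[1:]          # assumption: one leading Met is cleaved off
    return seq.startswith("G")


for s in ("MGCTLSAEDK", "GAQFSKT", "MAGKSTL"):
    print(s, has_candidate_nterminal_glycine(s))
```

Sequences that fail this check need not be sent to the servers as full-length proteins; those that pass still need a real prediction.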
Phosphorylation:
GPS (Group-based Phosphorylation Scoring method) - prediction encompasses 71 Protein Kinase (PK) families/PK groups (Reference: Y. Xue et al. 2005. Nucl. Acids Res. 33: W184-W187).
KinasePhos - this method is purported to have higher accuracy and provides not only the location of the phosphorylation sites, but also the corresponding catalytic protein kinases. (Reference: H.-D. Huang et al. 2005. Nucl. Acids Res. 33: W226-W229).
NetPhos (Center for Biological Sequence Analysis, Technical University of Denmark) - predicts Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins.
Sulfation:
The Sulfinator (ExPASy, Switzerland) predicts tyrosine sulfation sites in protein sequences.
DISCOVER YOUR OWN MOTIFS:
The MEME Suite - Motif-based sequence analysis tools (University of California at San Diego, CA, U.S.A.). N.B. After doing a BLASTP search create a FASTA-formatted document containing three or four of the most homologous proteins (training set) and submit to MEME (Multiple EM for Motif Elicitation) or GLAM2 (Gapped Local Alignments of Motifs). This service is additionally available at National Biomedical Computation Resource (NBCR).
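If the homologous proteins are at hand as plain sequences, the FASTA-formatted training set can be assembled with a few lines of Python. This is a minimal sketch: the function name to_fasta, the record names and the sequences are all placeholders, and the 60-residue line width is just a common convention.

```python
def to_fasta(records, width=60):
    """Build FASTA text from (name, sequence) pairs, wrapping sequence lines."""
    lines = []
    for name, sequence in records:
        sequence = "".join(sequence.split()).upper()   # drop spaces and newlines
        lines.append(">" + name)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Placeholder training set; replace with the top BLASTP hits.
training_set = [
    ("homolog_1", "MSTGLVGAPGSGKSTLLRAL"),
    ("homolog_2", "MTTGIVGPPGSGKSTLARVL"),
    ("homolog_3", "MSKGLIGAPGAGKSTLLKAI"),
]

with open("training_set.fasta", "w") as handle:
    handle.write(to_fasta(training_set))
```

The resulting file can be uploaded, or its contents pasted, into the submission form.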
In the case of MEME I usually specify 5 as the "Maximum number of motifs" to find. You will receive a message by E-mail entitled "MEME Submission Information (job app......)," which verifies that UCSD received and is processing your request. If you click on the hyperlink "You can view your job results at: http://meme..." you will see:
The "MAST output as HTML" provides the motifs, a motif alignment graphic and the alignment of the motifs with the individual sequences in the training set. The "MEME output as HTML" file contains a detailed analysis of each of the motifs plus their Sequence Logos.
At the top of the file is a button labelled "Search sequence databases for the best combined matches with these motifs using MAST." This will take you to the MAST (Motif Alignment and Search Tool) submission form. Click on the NCBI nonredundant protein database. You will receive an E-mail entitled "MAST Submission Information (job app ...)."
Use great caution before printing: the second set of data can be >20 pages (Reference: Bailey, T.L. et al. 2009. Nucl. Acids Res. 37(Web Server issue): W202-W208).
Block Maker - (Fred Hutchinson Cancer Research Center, U.S.A.) Block Maker finds conserved blocks in a group of two or more unaligned related protein sequences. One can use the blocks in database searches, and to construct trees and web logos.
WebLogo - a great graphical way of representing and visualizing consensus sequence data, using the sequence logo method developed by Tom Schneider and Mike Stephens. For nucleotide logos see RNA Structure Logo (The Technical University of Denmark).
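What a sequence logo displays is the information content of each alignment column, in bits. The Python sketch below computes that value for a small made-up DNA alignment. It is a simplified version of the Schneider and Stephens measure: the small-sample correction is omitted and a gap character would be counted as just another symbol. The function name is hypothetical.

```python
import math
from collections import Counter


def column_information(aligned, alphabet_size=20):
    """Information content (bits) per column: log2(alphabet) minus entropy."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    max_bits = math.log2(alphabet_size)    # 2 bits for DNA, about 4.32 for protein
    bits = []
    for column in zip(*aligned):           # iterate over alignment columns
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(column).values())
        bits.append(max_bits - entropy)
    return bits


# Made-up DNA alignment; a fully conserved column scores the full 2 bits.
alignment = ["ACGT", "ACGA", "ACCA", "ATCA"]
for position, value in enumerate(column_information(alignment, alphabet_size=4), 1):
    print(position, round(value, 2))
```

In a logo the height of each letter is its frequency in the column multiplied by this value, so conserved positions stand tall and variable ones nearly vanish.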
LogoMat-M (Sanger Institute, United Kingdom) - this tool (also available here) is for visualization of profile HMMs and creates so-called HMM Logos, which are related to Sequence Logos. In order to get this program to work one needs a *.hmm formatted file. This can be generated from an *.aln ClustalW file using HMMER: hmmbuild at the Pasteur Institute. From this URL select "A Scientist" and "Software for biology." Under "Sequences Alignments and Comparisons" select "Structural alignment" and then "HMMER." The output of LogoMat-M is worth the effort in getting it. For an easier approach paste your ClustalW alignment (& title) into HMMBUILD (Pôle Bioinformatique Lyonnais, France) and use the output (saved as a *.hmm file) at LogoMat-M.
Two Sample Logo - detects and displays statistically significant differences in position-specific symbol compositions between two sets of multiple sequence alignments. In a typical scenario, two groups of aligned sequences will share a common motif but will differ in their functional annotation. Also available as a Java tool. (Reference: Vacic, V. et al. 2006. Bioinformatics 22: 1536-1537)
NUCLEIC ACID MOTIFS: (See also here)
Rfam (Wellcome Trust Sanger Institute, England) - permits one to analyze 2 kb of DNA for 36 structural or functional RNAs such as 5S rRNA, tRNA, tmRNA, group I & II catalytic introns, hammerhead ribozymes and signal recognition particles.