Downloading Selected Sequences from GenBank

A. Whole Genomes

This can be accomplished in several ways:

1. Downloading a single file

  1. On the NCBI home page choose "Nucleotide" or "Genome" and paste in the accession number. Alternatively, type in the name and search. It is sometimes useful to use the Advanced Search Builder.
  2. Select the record you are interested in, then click the format you want ("FASTA" or "GenBank").
  3. There are two main ways of downloading the data:
    • (a) "Copy" and "Paste" into Notepad or a text editor and save in the appropriate file format; or,
    • (b) Choose "Send to:" → "Choose Destination" → "File" and select either GenBank (full) or FASTA from the "Format" list.
    Click "Create File" to download a sequence.gb or sequence.fasta file, respectively.

    N.B. The same can be done from the FASTA document in NCBI.
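
For readers who prefer a script to the web page, a single record can also be requested through the NCBI E-utilities "efetch" service. The following is a minimal sketch, not part of the procedure above. It assumes the Nucleotide database (db=nuccore) and that the "gbwithparts" return type corresponds to the "GenBank (full)" option; the accession AB123456 is only a placeholder, and the helper names efetch_url and download are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public E-utilities endpoint for retrieving records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, fmt="fasta", db="nuccore"):
    """Build the efetch URL for one accession ('fasta' or 'genbank')."""
    # Assumption: 'gbwithparts' matches the web page's "GenBank (full)".
    rettype = {"fasta": "fasta", "genbank": "gbwithparts"}[fmt]
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def download(accession, fmt="fasta", db="nuccore"):
    """Save the record next to the script and return the file name."""
    extension = {"fasta": "fasta", "genbank": "gb"}[fmt]
    path = f"{accession}.{extension}"
    with urlopen(efetch_url(accession, fmt, db)) as response, open(path, "wb") as out:
        out.write(response.read())
    return path


if __name__ == "__main__":
    # Placeholder accession; replace it with a real one before downloading.
    print(efetch_url("AB123456", fmt="genbank"))
```

Calling download("AB123456", fmt="genbank") would write AB123456.gb, which avoids the generic sequence.gb name produced by the web page.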

2. Downloading multiple files

  1. On the NCBI home page choose "Nucleotide" or "Genome" and paste in the required accession numbers (limit 100). Alternatively, type in the name and search. Again, the Advanced Search Builder can be useful.
  2. You can use the "Send to:" command as in A.1.3(b) to save all the sequences.
  3. Alternatively, you can click on the boxes associated with the desired records to download a selected subset of data.
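
When the list of accession numbers is long, it can be split into groups before downloading. The sketch below only builds one E-utilities efetch URL per group; it assumes that a group size of 100 (mirroring the limit quoted above for the web search box) is also a sensible size for efetch, which is an assumption rather than a documented rule. The helper names batches and batch_urls are hypothetical.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(accessions, size=100):
    """Yield consecutive groups of at most `size` accession numbers."""
    for start in range(0, len(accessions), size):
        yield accessions[start:start + size]


def batch_urls(accessions, rettype="fasta", db="nuccore", size=100):
    """Return one efetch URL per group, with the IDs comma-separated."""
    urls = []
    for group in batches(accessions, size):
        params = {
            "db": db,
            "id": ",".join(group),
            "rettype": rettype,
            "retmode": "text",
        }
        # safe="," keeps the commas readable instead of encoding them as %2C.
        urls.append(f"{EFETCH}?{urlencode(params, safe=',')}")
    return urls


if __name__ == "__main__":
    # Placeholder accession numbers, for illustration only.
    demo = [f"AB{number:06d}" for number in range(250)]
    for url in batch_urls(demo):
        print(url[:100], "...")
```

Each URL returns the records of its group as one multi-record file, comparable to saving all search results with "Send to:".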

B. Selected Proteins

  1. You can select and download a specific protein, as a GenBank flatfile (*.gbk) or in FASTA format, in much the same way as described above for genomes and nucleotides. Alternatively, you can go to the Protein database and make your selection.
  2. You may want to download the results of a BLASTp search for subsequent phylogenetic analysis. This can be accomplished within the BLASTp search results by choosing "Select All" or selecting the boxes for the sequences you are interested in.
  3. Select "Download" → choose "FASTA (complete sequence)" → click "Continue." The downloaded file will be called seqdump.txt.
  4. Open the file in Notepad or another text editor.
  5. The data may look like the following:
    (a) Single record:
    
    >ANJ65251.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Rexella]
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCA
    EADLVDYNVSLGIFIVRCTISNDVQEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTA
    RMVKNEWRNLDKRKEGETQADFMKRKKAFEKYDSTARDVMEVLHKVSDTFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG
    
    
    (b) Multi-hit record:
    
    >YP_009286151.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen]
    >ANJ65154.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen]
    >ANJ65337.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Gutmeister]
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCA
    EADLVDYNVSLGIFIVRCTISNDVQEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTA
    RMVKNEWRNLDKRKEGETQADFIKRKKAFEKYDSTARDVMEVLHKVSDTFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG
              
  6. Use your text editor to repeat the sequence of the multi-hit record under each of its headers and remove nonessential text, so that you end up with a poly-fasta document in one of the following formats:
    > ANJ65251
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
    
    
    > ANJ65154
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
    
    
    > ANJ65337
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
              
    OR
    > Erwinia phage vB_EamP_Rexella
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
    
    
    > Erwinia phage vB_EamP_Frozen
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
    
    
    > Erwinia phage vB_EamP_Gutmeister
    
    MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
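
The manual editing in steps 5 and 6 can be approximated with a short script. The following is a minimal sketch under stated assumptions: the headers of a multi-hit record appear either on consecutive lines (as shown above) or joined on one line by " >", and the sequence follows the headers. The function name expand_multi_hit_fasta and its label parameter are hypothetical. It writes standard FASTA (no space after ">" and no blank lines), which differs slightly from the layout shown above.

```python
import re


def expand_multi_hit_fasta(text, label="accession"):
    """Return poly-FASTA text with one record per header.

    label="accession" names records by accession without the version
    suffix; label="organism" uses the name inside the square brackets.
    """
    records = []                # list of (headers, sequence_lines)
    headers, seq = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq:             # a sequence just ended, so close that record
                records.append((headers, seq))
                headers, seq = [], []
            # Headers joined on one line are separated by " >".
            parts = re.split(r"\s>", line[1:])
            headers.extend(part.strip() for part in parts if part.strip())
        else:
            seq.append(line)
    if headers:
        records.append((headers, seq))

    out = []
    for headers, seq in records:
        sequence = "".join(seq)
        for header in headers:
            if label == "organism":
                match = re.search(r"\[([^\]]+)\]\s*$", header)
                name = match.group(1) if match else header
            else:
                name = header.split()[0].split(".")[0]
            out.append(f">{name}\n{sequence}\n")
    return "".join(out)


if __name__ == "__main__":
    with open("seqdump.txt") as handle:
        print(expand_multi_hit_fasta(handle.read()), end="")
```

Note that naming by organism gives two records the same name when two accessions come from the same phage, as with the two vB_EamP_Frozen entries above, so naming by accession is the safer default.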
              

C. Using BioEdit to Edit File Formats

  1. Run BioEdit and open sequence.gb.
  2. The GenBank DEFINITION line will appear in the left column, and the associated sequence appears in the right column.
  3. You can change this to give the ACCESSION number by choosing "Edit" → "Select all sequences" → under "Sequence" select "rename" → "with ACCESSION."

    N.B. If you get more than one accession number, this may indicate that one of the sequences has been replaced in GenBank.

  4. In either case, place the cursor in the left column and, under "Edit", choose "Select All Sequences."
  5. To Export to Excel:
    Choose "Edit" → "Export" → "tab-delimited text" → save as *.tab.
    Then open the file in Excel → "Delimited" → "Next" → "Finish."
    This will give Column A = names, Column B = sequences.
  6. To Export as Individual Files:
    Under "File" choose "Export" → "split into individual fasta files."
    Save the files in a new folder.

    Warning: If names are too similar (example: Acinetobacter phage YMC11/12/R1215), files may not be generated.
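
The two-column table of step 5 can also be produced without BioEdit. The sketch below is an illustration only, assuming the input is a FASTA file such as sequence.fasta; the function names read_fasta and fasta_to_tab are hypothetical. It writes one line per record, with the name in the first column and the sequence in the second.

```python
import csv
import io


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_tab(text):
    """Return tab-delimited text: column 1 = names, column 2 = sequences."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(read_fasta(text))
    return buffer.getvalue()


if __name__ == "__main__":
    with open("sequence.fasta") as source:
        table = fasta_to_tab(source.read())
    with open("sequence.tab", "w") as target:
        target.write(table)
```

The resulting sequence.tab file can be opened in Excel with the same "Delimited" import described in step 5.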

D. Splitting Poly-Fasta Protein Files using EMBOSS Explorer seqretsplit

  1. EMBOSS seqretsplit can be accessed here or here.
  2. Upload your seqdump.txt or sequence.fasta file and leave "Output sequence format" as default "Pearson FASTA".
  3. When the results come up in your Internet browser, search for the fasta symbol (>) and right-click to download the separate files. These are identified by their accession numbers. Unfortunately, the file names are written in lowercase (e.g., ab123456.fasta rather than AB123456.fasta).
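
If the lowercase file names are a problem, a poly-fasta file can be split locally instead. The following is a minimal sketch of that alternative, not of seqretsplit itself. It names each file after the first word of the header, keeps the original case, and replaces characters that are unsafe in file names (such as "/") with an underscore. The function names safe_name and split_fasta and the output folder split_files are hypothetical.

```python
import re
from pathlib import Path


def safe_name(header):
    """Turn the first word of a FASTA header into a usable file name."""
    words = header.split()
    token = words[0] if words else "unnamed"
    return re.sub(r"[^A-Za-z0-9._-]", "_", token)


def split_fasta(text, out_dir):
    """Write one .fasta file per record and return the file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Each record starts at a ">" that begins a line.
    for record in re.split(r"(?m)^(?=>)", text):
        if not record.startswith(">"):
            continue            # skip anything before the first header
        header = record.splitlines()[0][1:]
        path = out_dir / f"{safe_name(header)}.fasta"
        path.write_text(record.rstrip() + "\n")
        written.append(path.name)
    return written


if __name__ == "__main__":
    with open("seqdump.txt") as handle:
        for name in split_fasta(handle.read(), "split_files"):
            print(name)
```

Records whose headers begin with the same word overwrite one another in this sketch, so headers should be made unique first.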

E. Extract Protein Sequences from GenBank Flatfiles

  1. Use the Rocap GenBank/EMBL to FASTA Conversion Tool to convert a GenBank flat file (*.gbk) into a fasta-formatted amino acid sequence file (*.faa).
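
To show what such a conversion involves, the sketch below pulls the /translation qualifier out of each CDS feature of a GenBank flat file and writes the results in FASTA format. It is not the Rocap tool and makes simplifying assumptions: feature keys start in column 6, records are named by /protein_id, and CDS features without a translation are skipped. A dedicated parser such as Biopython would be more robust. The function names cds_translations and to_faa and the file name sequence.gbk are hypothetical.

```python
import re


def cds_translations(genbank_text):
    """Yield (protein_id, sequence) for each CDS with a /translation."""
    # A feature starts with exactly five spaces followed by its key;
    # qualifier lines are indented further and stay inside the block.
    for block in re.split(r"\n {5}(?=\S)", genbank_text):
        if not block.startswith("CDS"):
            continue
        translation = re.search(r'/translation="([^"]+)"', block)
        if not translation:
            continue            # e.g. pseudogenes carry no translation
        protein_id = re.search(r'/protein_id="([^"]+)"', block)
        name = protein_id.group(1) if protein_id else "unknown"
        # The translation wraps over several lines; remove the whitespace.
        yield name, re.sub(r"\s+", "", translation.group(1))


def to_faa(genbank_text):
    """Return the CDS translations as FASTA-formatted text."""
    return "".join(
        f">{name}\n{sequence}\n" for name, sequence in cds_translations(genbank_text)
    )


if __name__ == "__main__":
    with open("sequence.gbk") as source:
        faa = to_faa(source.read())
    with open("sequence.faa", "w") as target:
        target.write(faa)
```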

Updated: November, 2025