Downloading Selected Sequences from GenBank
A. Whole Genomes
This can be accomplished in several ways:
1. Downloading a single file
- On the NCBI home page choose "Nucleotide" or "Genome" and paste in the accession number. Alternatively, type in the name and search. It is sometimes useful to use the Advanced Search Builder.
- Choose the item that you are interested in, select its format ("FASTA" or "GenBank"), and click on it.
- There are two main ways of downloading the data:
  - (a) "Copy" and "Paste" into Notepad or a text editor and save in the appropriate file format; or,
  - (b) Choose "send to:" → "Choose Destination" → "File" and select either GenBank (full) or FASTA from the "Format" list. These are saved as sequence.gb and sequence.fasta files, respectively.

  N.B. The same can be done from the FASTA document in NCBI.
2. Downloading multiple files
- On the NCBI home page choose "Nucleotide" or "Genome" and paste in the required accession numbers (limit 100). Alternatively, type in the name and search. Again, the Advanced Search Builder can be useful.
- You can use the "send to:" command as in A.1(b) to save all the sequences.
- Alternatively, you can click on the boxes associated with the desired records to download a selected subset of data.
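The same downloads can also be scripted rather than done through the web page. The following is a minimal sketch, assuming the NCBI E-utilities efetch endpoint; the helper name efetch_urls, the placeholder accession numbers, and the batch size of 100 (chosen only to mirror the web-page limit mentioned above) are illustrative assumptions, not part of the original protocol.

```python
from urllib.parse import urlencode

# NCBI E-utilities endpoint for fetching records by accession number.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_urls(accessions, rettype="fasta", db="nuccore", batch_size=100):
    """Return one efetch URL per batch of accession numbers.

    rettype="fasta" requests FASTA; rettype="gb" requests a GenBank flatfile.
    batch_size is an assumption made for this sketch.
    """
    urls = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        query = urlencode({
            "db": db,
            "id": ",".join(batch),   # comma-separated accession list
            "rettype": rettype,
            "retmode": "text",
        })
        urls.append(f"{EFETCH}?{query}")
    return urls


if __name__ == "__main__":
    # Placeholder accession numbers; substitute your own.
    for url in efetch_urls(["AB123456", "AB123457"], rettype="gb"):
        print(url)
```

Each URL can then be saved to disk, for example with urllib.request.urlretrieve(url, "sequence.gb"). If you script many requests, check NCBI's current usage guidelines first.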
B. Selected Proteins
- You can select a specific protein for downloading, as a GenBank flatfile (*.gbk) or in FASTA format, in much the same manner as described above for genomes and nucleotides. Alternatively, you can go to the Protein database and make your selection.
- You may want to download the results of a BLASTp search for subsequent phylogenetic analysis. This can be accomplished within the BLASTp search results by choosing "Select All" or selecting the boxes for the sequences you are interested in.
- Select "Download" → choose "FASTA (complete sequence)" → click "Continue." The downloaded file will be called seqdump.txt.
- Open the file in Notepad or another text editor.
- The data may look like the following:

  (a) Single record:

  >ANJ65251.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Rexella]
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCA
  EADLVDYNVSLGIFIVRCTISNDVQEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTA
  RMVKNEWRNLDKRKEGETQADFMKRKKAFEKYDSTARDVMEVLHKVSDTFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG

  (b) Multi-hit record:

  >YP_009286151.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen] >ANJ65154.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Frozen] >ANJ65337.1 putative RNA polymerase 1 [Erwinia phage vB_EamP_Gutmeister]
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDVKFGIDLLVQMALHKRCDLQTLVGTLRHHCESAQEVVNNILKCA
  EADLVDYNVSLGIFIVRCTISNDVQEELDRFQYPLPMVVEPKKITNNKQSGYLLNNKSIILKDNHHEDDVCLDHINRLNKIKFRINFDTA
  RMVKNEWRNLDKRKEGETQADFIKRKKAFEKYDSTARDVMEVLHKVSDTFHLTHSYDKRLRTYAQGYHVNYQGTAWNKAVIEFAEEEVTNG

- Use your text editor to duplicate the latter record and remove nonessential text so that you end up with a poly-fasta document in the following format:

  > ANJ65251
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
  > ANJ65154
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
  > ANJ65337
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc

  OR

  > Erwinia phage vB_EamP_Rexella
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
  > Erwinia phage vB_EamP_Frozen
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
  > Erwinia phage vB_EamP_Gutmeister
  MDQLTEHQTRLEELFSNNQLMPRMRKEFTECESFDFTKYLEHKAIDV etc
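For large BLASTp downloads, the manual duplication step above can be tedious. The following is a minimal sketch of the same clean-up in Python, assuming that multi-hit headers in seqdump.txt join their titles with " >" as in the example shown; the function names parse_fasta and expand_multi_hit and the label parameter are hypothetical names introduced here. Headers are written without a space after ">", since some FASTA parsers expect the identifier to follow immediately.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)          # join wrapped sequence lines
    if header is not None:
        yield header, "".join(chunks)


def expand_multi_hit(text, label="accession"):
    """Write one record per protein title found in each header.

    label="accession" keeps the accession without its version suffix;
    any other value keeps the organism name found in square brackets.
    """
    out = []
    for header, seq in parse_fasta(text):
        # Multi-hit headers look like "title1 >title2 >title3".
        for title in re.split(r"\s+>", header):
            title = title.strip()
            if not title:
                continue
            if label == "accession":
                name = title.split()[0].split(".")[0]   # ANJ65251.1 -> ANJ65251
            else:
                match = re.search(r"\[([^\]]+)\]\s*$", title)
                name = match.group(1) if match else title
            out.append(f">{name}\n{seq}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    with open("seqdump.txt") as handle:
        print(expand_multi_hit(handle.read()), end="")
```

Note that this sketch keeps every title in a multi-hit header, including RefSeq entries such as YP_009286151.1, so you may still want to delete redundant records by hand.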
C. Using BioEdit to Edit File Formats
- Run BioEdit and open sequence.gb.
- The GenBank DEFINITION line will appear in the left column, and the associated sequence appears in the right column.
- You can change this to give the ACCESSION number by choosing "Edit" → "Select all sequences" → under "Sequence" select "rename" → "with ACCESSION."

  N.B. If you get more than one accession number, this may indicate that one of the sequences has been replaced in GenBank.
- In either of these cases, place the cursor in the left column and under "Edit" choose "Select All Sequences."
- To Export to Excel: Choose "Edit" → "Export" → "tab-delimited text" → save as *.tab. Then open the file in Excel → "Delimited" → "Next" → "Finish." This will give Column A = names, Column B = sequences.
- To Export as Individual Files: Under "File" choose "Export" → "split into individual fasta files." Save the files in a new folder.

  Warning: If names are too similar (example: Acinetobacter phage YMC11/12/R1215), files may not be generated.
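If BioEdit is not available, the tab-delimited layout described above (Column A = names, Column B = sequences) can be produced from a FASTA file directly. This is a minimal sketch using only the Python standard library; the function name fasta_to_tab is a hypothetical name introduced here, and the sketch is not a reimplementation of BioEdit's exporter.

```python
import csv
import io


def fasta_to_tab(fasta_text):
    """Return tab-delimited text with one 'name<TAB>sequence' row per record."""
    rows = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            rows.append([line[1:].strip(), ""])   # new record: name, empty sequence
        elif line and rows:
            rows[-1][1] += line                   # join wrapped sequence lines
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    with open("sequence.fasta") as handle:
        table = fasta_to_tab(handle.read())
    with open("sequence.tab", "w") as handle:
        handle.write(table)
```

The resulting sequence.tab file can be opened in Excel with the same "Delimited" import steps.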
D. Splitting Poly-Fasta Protein Files using EMBOSS Explorer seqretsplit
- EMBOSS seqretsplit can be accessed here or here.
- Upload your seqdump.txt or sequence.fasta file and leave "Output sequence format" as default "Pearson FASTA".
- When the results come up in your Internet browser, search for the fasta symbol (>) and right click to download the separate files. These will be identified by their accession numbers. Unfortunately, the latter are named ab123456.fasta and not AB123456.fasta.
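As a local alternative that keeps the original capitalisation of the accession numbers, a poly-fasta file can be split with a short script. This is a minimal sketch, not a reimplementation of seqretsplit; the function name split_fasta and the rules for building file names (replace unsafe characters such as "/" with "_", and number repeated names) are assumptions made for illustration.

```python
import re
from pathlib import Path


def split_fasta(fasta_text, out_dir):
    """Write each record to its own .fasta file; return the file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group lines into records, one list per '>' header line.
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line.strip()])
        elif line.strip() and records:
            records[-1].append(line.strip())

    written, counts = [], {}
    for record in records:
        words = record[0][1:].split()
        ident = words[0] if words else "unnamed"
        # Replace characters that are unsafe in file names; case is kept.
        stem = re.sub(r"[^A-Za-z0-9._-]", "_", ident)
        counts[stem] = counts.get(stem, 0) + 1
        if counts[stem] > 1:
            stem = f"{stem}_{counts[stem]}"       # avoid overwriting repeats
        path = out_dir / f"{stem}.fasta"
        path.write_text("\n".join(record) + "\n")
        written.append(path.name)
    return written


if __name__ == "__main__":
    with open("seqdump.txt") as handle:
        for name in split_fasta(handle.read(), "split_files"):
            print(name)
```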
E. Extract Protein Sequences from GenBank Flatfiles
- Use the Rocap GenBank/EMBL to FASTA Conversion Tool to convert a GenBank flat file (*.gbk) into a FASTA-formatted amino acid sequence file (*.faa).
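The same kind of conversion can be sketched locally. The example below is an independent, simplified illustration and not the Rocap tool: it assumes a standard GenBank flatfile layout in which each CDS feature carries /protein_id and /translation qualifiers, and the function name genbank_to_faa is a hypothetical name introduced here. A dedicated parser (for example Biopython's SeqIO) is likely to be more robust on unusual files.

```python
import re


def genbank_to_faa(gb_text):
    """Return FASTA-formatted protein sequences for CDS features with a translation."""
    records = []
    # In the feature table, each feature key starts after exactly five spaces.
    blocks = re.split(r"\n {5}(?=\S)", gb_text)
    for number, block in enumerate(blocks, start=1):
        if not re.match(r"CDS\s", block):
            continue
        translation = re.search(r'/translation="([^"]+)"', block)
        if translation is None:
            continue                      # e.g. pseudogenes without a translation
        protein_id = re.search(r'/protein_id="([^"]+)"', block)
        # Fall back to a placeholder name if no protein_id qualifier is present.
        name = protein_id.group(1) if protein_id else f"feature_{number}"
        sequence = re.sub(r"\s+", "", translation.group(1))   # undo line wrapping
        records.append(f">{name}\n{sequence}\n")
    return "".join(records)


if __name__ == "__main__":
    with open("sequence.gb") as handle:
        faa = genbank_to_faa(handle.read())
    with open("sequence.faa", "w") as handle:
        handle.write(faa)
```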
Updated: November, 2025