Dear all, I have many lists of GTDB taxonomy names of bacterial species that I am interested in (e.g. d__Archaea;p__Methanobacteriota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter;s__Methanobrevibacter sp900314635), and every list contains at least 20 species. How can I obtain the corresponding reference genome sequences in bulk from these lists of GTDB taxonomy names?
You can download all 85,205 GTDB species representative genomes here (~75 GB):
You can also use the Advanced Search feature to search for a given GTDB taxon and then use the “Genomes” button with a down arrow to download a script that will pull the genome assembly files from NCBI, e.g. for Archaea:
Thank you for your answer! Just to confirm: if I need the GTDB representative sequences, I should download the full set and pick out my matches, and if I need genome assemblies from NCBI, I should use the Advanced Search to retrieve them, right?
You can use the Advanced Search and the “Genomes” download button to get different types of files from NCBI (e.g. genome assembly, GFF, CDS). This is the most flexible way to obtain genomes and associated data.
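If you would rather start from your lists than search taxon by taxon, here is a rough sketch of turning a list of taxonomy strings into genome accessions. It assumes you have a tab-separated taxonomy table with the genome accession in the first column and the full GTDB taxonomy string in the second; that layout is my assumption, so check it against the file you actually have. The function name lookup_accessions is made up for illustration.

```shell
# lookup_accessions is a hypothetical helper, not part of GTDB.
# $1: taxonomy table, ASSUMED layout: accession<TAB>full taxonomy string
# $2: your list, one full GTDB taxonomy string per line
lookup_accessions() {
    awk -F'\t' '
        # First file (the list): remember every non-empty taxonomy string.
        FILENAME == ARGV[1] { if ($0 != "") wanted[$0] = 1; next }
        # Second file (the table): print accessions whose taxonomy is wanted.
        ($2 in wanted) { print $1 }
    ' "$2" "$1"
}
```

You could then call it as lookup_accessions taxonomy.tsv my_species.txt > accessions.txt (both file names are placeholders). Matching is exact on the whole taxonomy string, so a species with several genomes in the table will yield several accessions.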
Thanks for this crucial discussion. I downloaded a text file with information on my genomes of interest using the Advanced Search tool. How do I now use this file to download the genomes (.fa files)?
The file generated by the “Genomes” download button is a shell script. You can run it using ./gtdb-adv-search-genomes.sh, which will download the data you requested from NCBI.
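If the shell answers with "Permission denied", the downloaded file most likely lacks the executable bit. A minimal sketch of a wrapper that sets the bit and then runs the script is below; the name run_download_script is hypothetical, and I am assuming the script only needs to be made executable and started from a terminal.

```shell
# run_download_script is a hypothetical wrapper, not something GTDB provides.
# $1: path to the script saved from the "Genomes" download button
run_download_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "script not found: $script" >&2
        return 1
    fi
    # A bare file name is not looked up in the current directory,
    # so prefix it with ./ unless it already contains a slash.
    case "$script" in
        */*) ;;
        *) script="./$script" ;;
    esac
    # Downloaded files usually lack the executable bit.
    chmod +x "$script" || return 1
    "$script"
}
```

For example: run_download_script gtdb-adv-search-genomes.sh, run from the directory where you want the files to end up.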
I appreciate your kind response. Thank you