GTDB Forum

194,600 genomes r95 download

Are all of the 194,600 genomes that went into the making of R95 available for download somewhere?

For R95, I believe they are all available through GenBank/RefSeq.

(For R89, some of them were not available in the archives, but I believe that all of those were/are available through the GTDB-Tk database download.)

(For large-scale downloads from NCBI, we’ve been trying out genome_updater - our mostly positive experiences here)

Hi,

All genomes are available from NCBI so we have elected not to replicate the data on the GTDB. The assembly report files provided on the NCBI FTP site give the URL for the root directory of the genome data at NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

These are the files we use to download and rsync data with NCBI.
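As a rough illustration of how those assembly summary files can be turned into download URLs, here is a minimal Python sketch. It assumes the file is tab-separated, that its column-name line is a `#` comment containing `assembly_accession` and `ftp_path`, and that the genomic FASTA sits at `<ftp_path>/<directory name>_genomic.fna.gz`. The function name `fasta_urls` is made up for this example; this is not the script the GTDB team uses.

```python
import csv
import posixpath


def fasta_urls(lines):
    """Map assembly accession -> URL of its genomic FASTA file.

    `lines` is any iterable of text lines from an assembly_summary file.
    Assumption: the column-name line is a '#' comment that includes
    'assembly_accession' and 'ftp_path'.
    """
    header = None
    urls = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        if row[0].startswith("#"):
            # Comment line: keep it only if it carries the column names.
            names = [c.lstrip("# ").strip() for c in row]
            if "assembly_accession" in names:
                header = names
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        rec = dict(zip(header, row))
        ftp_path = rec.get("ftp_path", "na").rstrip("/")
        if ftp_path in ("", "na"):
            # No genome directory is listed for this assembly.
            continue
        base = posixpath.basename(ftp_path)
        urls[rec["assembly_accession"]] = f"{ftp_path}/{base}_genomic.fna.gz"
    return urls
```

To use it, open a downloaded copy of `assembly_summary_genbank.txt` in text mode and pass the file object to `fasta_urls`; the resulting URLs can then be handed to whichever download tool you prefer.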

Cheers,
Donovan

Thanks, Titus, most helpful!

Thanks, Donovan, this is very helpful

Hi. I want to download all the genomes from GTDB R202 and found this discussion very useful. Thank you for the suggestion to use genome_updater for this, but I have a question about which parameters to use.

Currently, I have set the options so that the organism groups are bacteria and archaea, from both the RefSeq and GenBank databases. I also used the -z flag, which will only download records from the latest GTDB release.

While doing this, the download summary showed that 251K entries (251,749 to be exact) are being downloaded. When compared to the statistics for this release of GTDB, it falls short by about 7K (258,406 is the figure from the stats).

Where am I losing these 7K entries? Is there an option I am missing in the genome_updater tool? I am mainly trying to replicate this process in order to get the entire set of genomes. Thanks in advance!

Hi. I am not familiar with the genome_updater tool, so I can’t speak directly to it. One possibility is that these genomes have been deprecated at NCBI since the release of GTDB R202. That is possible, though 7K seems like a lot. Confirming that some of the missing genomes are still at NCBI would be a good start. You can also download genomes from NCBI using the URLs provided in the ASSEMBLY_REPORTS file:
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
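
One way to start on that check is to list exactly which accessions are missing from the download. The sketch below is only an illustration under a few assumptions: that the GTDB accessions carry an `RS_` or `GB_` prefix in front of the NCBI accession, and that downloaded files are named in the NCBI style `<accession>_<assembly name>_genomic.fna.gz`. The function names are hypothetical and are not part of genome_updater or any GTDB tool.

```python
def normalize(accession):
    """Strip an assumed GTDB 'RS_'/'GB_' prefix to get the NCBI accession."""
    acc = accession.strip()
    if acc.startswith(("RS_", "GB_")):
        acc = acc[3:]
    return acc


def accession_from_filename(name):
    """'GCF_000005845.2_ASM584v2_genomic.fna.gz' -> 'GCF_000005845.2'."""
    # The accession is the first two underscore-separated fields.
    return "_".join(name.split("_")[:2])


def missing_accessions(gtdb_accessions, downloaded_filenames):
    """Return the sorted GTDB accessions that have no downloaded file."""
    wanted = {normalize(a) for a in gtdb_accessions}
    have = {accession_from_filename(f) for f in downloaded_filenames}
    return sorted(wanted - have)
```

Looking up a handful of the returned accessions at NCBI should show whether they were suppressed or replaced after the release, or were simply skipped by the download settings.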