194,600 genomes r95 download

abandla · October 5, 2020, 5:05pm

Are all 194,600 genomes that went into the making of r95 available for download somewhere?

ctb · October 5, 2020, 5:39pm

For R95, I believe they are all available through GenBank/RefSeq.

(For R89, some of them were not available in the archives, but I believe that all of those were/are available through the GTDB-Tk database download.)

ctb · October 5, 2020, 5:41pm

(for large scale download from NCBI, we’ve been trying out genome_updater - our mostly positive experiences here)

donovan.parks · October 5, 2020, 11:23pm

Hi,

All genomes are available from NCBI so we have elected not to replicate the data on the GTDB. The assembly report files provided on the NCBI FTP site give the URL for the root directory of the genome data at NCBI:
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

These are the files we use to download and rsync data with NCBI.

Cheers,
Donovan
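
A minimal Python sketch of how the assembly summary files mentioned above could be turned into per-genome download URLs. It assumes the tab-separated layout with a header line naming `assembly_accession` and an `ftp_path` column, and the usual NCBI convention that the FASTA file is named after the directory's basename plus `_genomic.fna.gz`. The function names are illustrative, and the sample text is an abbreviated stand-in with only three columns, not real file content.

```python
def parse_assembly_summary(text):
    """Map assembly accession -> ftp_path from assembly_summary_*.txt content."""
    columns = None
    paths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            # The header is the comment line whose first field names the accession column.
            first = fields[0].lstrip("#").strip()
            if first == "assembly_accession":
                columns = [first] + fields[1:]
            continue
        if columns is None:
            raise ValueError("no header line found before data rows")
        record = dict(zip(columns, fields))
        ftp_path = record.get("ftp_path", "na")
        if ftp_path != "na":  # "na" is treated here as "no directory listed"
            paths[record["assembly_accession"]] = ftp_path
    return paths


def genomic_fasta_url(ftp_path):
    """Build the genomic FASTA URL inside an assembly's root directory."""
    base = ftp_path.rstrip("/")
    name = base.rsplit("/", 1)[-1]
    return f"{base}/{name}_genomic.fna.gz"


# Abbreviated sample: the real files have many more columns.
SAMPLE = (
    "## comment line\n"
    "#assembly_accession\torganism_name\tftp_path\n"
    "GCA_000024745.1\tplaceholder\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/024/745/GCA_000024745.1_ASM2474v1\n"
)
```

Calling `genomic_fasta_url` on each value returned by `parse_assembly_summary` gives a list of URLs that can be handed to any downloader.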

abandla · October 6, 2020, 1:16am

Thanks, Titus, most helpful!

abandla · October 6, 2020, 1:16am

Thanks, Donovan, this is very helpful

Ashwin_S_Sudarshan · June 6, 2021, 6:14pm

Hi. I want to download all the genomes from GTDB R202 and found this discussion very useful. Thank you for suggesting genome_updater for this, but I have a question about which parameters to use.

Currently, I have set the organism group to bacteria and archaea, drawing from both the refseq and genbank databases. I also used the -z flag, which downloads only records from the latest GTDB release.

The download summary showed 251,749 entries being downloaded. The statistics for this GTDB release list 258,406 genomes, so the download falls short by about 7K (6,657 to be exact).

Where am I losing these 7K entries? Is there a genome_updater option I am missing? I am mainly trying to replicate this process to get the entire set of genomes. Thanks in advance!
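
One way to narrow down a shortfall like this is to compare accession lists directly. The sketch below assumes two inputs: the accessions listed by GTDB (which, in the GTDB metadata files, carry an `RS_` or `GB_` prefix) and the accessions that were actually downloaded. The function names are illustrative and not part of any tool.

```python
def strip_gtdb_prefix(accession):
    """Drop the RS_/GB_ prefix that GTDB metadata puts in front of NCBI accessions."""
    for prefix in ("RS_", "GB_"):
        if accession.startswith(prefix):
            return accession[len(prefix):]
    return accession


def missing_accessions(gtdb_accessions, downloaded_accessions):
    """Return the sorted accessions that GTDB lists but the download lacks."""
    wanted = {strip_gtdb_prefix(a.strip()) for a in gtdb_accessions if a.strip()}
    have = {a.strip() for a in downloaded_accessions if a.strip()}
    return sorted(wanted - have)


# Size of the gap, using the two figures quoted in the post above.
expected, downloaded = 258_406, 251_749
shortfall = expected - downloaded
```

Inspecting a handful of the returned accessions by hand at NCBI shows whether they were updated, suppressed, or simply filtered out by the download options.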

donovan.parks · June 9, 2021, 12:42am

Hi. I am not familiar with the genome_updater tool, so I can’t speak directly to it. One possibility is that these genomes have been deprecated at NCBI since the release of GTDB R202, though 7K seems like a lot. Confirming that some of the missing genomes are still at NCBI would be a good start. You can also download genomes from NCBI using the URLs provided in the ASSEMBLY_REPORTS file:
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

Bhavay_Aggarwal · October 27, 2021, 6:34am

@donovan.parks I am facing a similar issue: I was able to download the files from NCBI, but some of them are missing. I found some under gbrs_paired_asm and extracted their identical genomes, but about 2,175 files are still missing. Some (like GCF_000024745.1) are nowhere to be found, either in the NCBI files or on the website, whereas others (like GCA_014647455.1) are listed as GCA_014647455.2, and I am unsure whether to download them.

donovan.parks · October 27, 2021, 2:56pm

Hi,

My understanding is that NCBI archives all genome assemblies. However, assemblies do get updated, which is indicated by the version number after the point increasing (e.g. GCA_014647455.1 to GCA_014647455.2). Assemblies are also occasionally suppressed for various reasons (e.g. GCF_000024745.1). These genomes are still available, though:
https://www.ncbi.nlm.nih.gov/assembly/GCF_000024745.1/
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/024/745/GCA_000024745.1_ASM2474v1/

GTDB uses the genomes available at NCBI during each bi-annual update and will thus necessarily use genomes that later get updated or suppressed.

In general, I’d recommend using the latest version of a genome assembly and ignoring all suppressed genomes. This is the policy used by GTDB during each update.

Cheers,
Donovan
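
The version and path conventions described above can be sketched in Python. This assumes accessions of the form `GCA_`/`GCF_` plus nine digits and a version, and that the `genomes/all` tree splits those nine digits into three directory levels, as in the URL quoted in the post. The function names are illustrative. The directory returned holds one subdirectory per assembly version, whose full name includes an assembly name that cannot be derived from the accession alone.

```python
import re

ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def split_accession(accession):
    """Split e.g. 'GCA_014647455.2' into ('GCA', '014647455', 2)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return match.group(1), match.group(2), int(match.group(3))


def ftp_directory(accession):
    """Parent directory on the NCBI site that holds all versions of an assembly."""
    prefix, digits, _version = split_accession(accession)
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "https://ftp.ncbi.nlm.nih.gov/genomes/all/" + "/".join([prefix] + parts) + "/"


def classify(accession, current_accessions):
    """Return ('current', acc), ('updated', newest_acc) or ('not listed', None)."""
    current = set(current_accessions)
    if accession in current:
        return ("current", accession)
    prefix, digits, version = split_accession(accession)
    newer = []
    for other in current:
        try:
            o_prefix, o_digits, o_version = split_accession(other)
        except ValueError:
            continue  # skip anything that is not a well-formed accession
        if (o_prefix, o_digits) == (prefix, digits) and o_version > version:
            newer.append((o_version, other))
    if newer:
        return ("updated", max(newer)[1])
    return ("not listed", None)
```

An accession that comes back as `not listed` may be suppressed, so it is worth checking by hand before deciding whether to include it.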

pirovc · May 12, 2022, 2:36pm

Hi, genome_updater dev here. This may be too late for you, but it may help others looking at this thread. I just released a new genome_updater version (0.5.0) with better overall GTDB support, which can properly get all GTDB assemblies from the current release, plus taxonomy integration. As mentioned by @donovan.parks, many entries were suppressed or updated and so did not show up in the assembly_summary.txt files, which caused the difference. They can still be found in the assembly_summary_historical.txt files (e.g. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq_historical.txt).

To get all GTDB assemblies with genome_updater v0.5.0:

./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_complete" -M "gtdb" -t 12 -m

Filters are also possible based on GTDB nodes. For example:

./genome_updater.sh -d "refseq,genbank" -g "bacteria" -T "g__Escherichia" -f "genomic.fna.gz" -o "GTDB_Escherichia" -M "gtdb" -t 12 -m
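
For readers who want to check by hand where a given accession ended up, here is a rough Python sketch that looks it up first in a current assembly summary and then in the historical one. It assumes both files share the same tab-separated layout with `assembly_accession` and `ftp_path` columns. It is not how genome_updater works internally, and the function names are illustrative.

```python
def read_summary(text):
    """Map accession -> ftp_path for one assembly summary file's content."""
    columns = None
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            first = fields[0].lstrip("#").strip()
            if first == "assembly_accession":  # this comment line is the header
                columns = [first] + fields[1:]
            continue
        if columns is None:
            continue  # ignore data seen before any header
        row = dict(zip(columns, fields))
        index[row["assembly_accession"]] = row.get("ftp_path", "na")
    return index


def locate(wanted, current_text, historical_text):
    """Report, per accession, whether it is current, historical, or missing."""
    current = read_summary(current_text)
    historical = read_summary(historical_text)
    result = {}
    for accession in wanted:
        if accession in current:
            result[accession] = ("current", current[accession])
        elif accession in historical:
            result[accession] = ("historical", historical[accession])
        else:
            result[accession] = ("missing", None)
    return result
```

The current file takes precedence here, so an accession is reported as historical only when it no longer appears in the current summary.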