All genomes in GTDB v207 (not just representatives, but all 317,542 genomes)

Dear GTDB team,

Where can I find all GTDB v207 genomes, not just the representatives? I want to cluster at a higher ANI threshold, such as 99%, for testing purposes.

Thanks

Jianshu

Hi Jianshu,

We don’t provide genome assemblies through GTDB as all genomes are available from NCBI.

Cheers,
Donovan

Hello Donovan,

Is there a fast and efficient way to download all NCBI genome assemblies, preferably in parallel? I remember there are serial Python tools, but as far as I know none of them are parallelized, which makes them very slow for so many genome assemblies.

Thanks,

Jianshu

Hi Jianshu,

We rsync with NCBI. I don’t have any personal experience, but you could try this:

Cheers,
Donovan

Hello Donovan,

Thanks for the suggestions. This is very helpful.

Thanks,

Jianshu

Hello Donovan,

The tool works nicely and is very fast thanks to process-level parallelism (about 3 hours to download all NCBI genomes, RefSeq + GenBank). I now have 315,686 genomes; however, according to the GTDB website, there are 317,542. Are the additional genomes archaea that were generated by the GTDB team and not submitted to NCBI? Where can I find them (not for publication, I just want a full collection for testing)?

Thanks,

Jianshu

Hi,

GTDB only has bacterial and archaeal genomes that are at NCBI. NCBI is an ever-changing repository, though. Some genomes in GTDB R207 may have been deprecated at NCBI. I’m very surprised that there aren’t more genomes at NCBI than what is in GTDB R207, since this release is almost a year old at this point.

Can you figure out which genomes you are missing and go to their genome assembly records at NCBI? I suspect they may be deprecated genomes, and I recommend you set up your pipeline/investigation to be robust to such missing genomes.
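
One possible way to find the missing genomes is a simple set difference between the accessions GTDB lists and the accessions you actually downloaded. The Python below is only a rough sketch under a few assumptions: the function names are hypothetical, the accessions are made-up placeholders rather than real assemblies, and it assumes you have already collected both lists (for example from the GTDB metadata files and from your download directory). GTDB prefixes NCBI accessions with `RS_` or `GB_`, so the prefix is stripped before comparing.

```python
def strip_gtdb_prefix(accession: str) -> str:
    """Drop the 'RS_' or 'GB_' prefix that GTDB puts before NCBI accessions."""
    if accession.startswith(("RS_", "GB_")):
        return accession[3:]
    return accession


def find_missing(expected, downloaded):
    """Return the sorted accessions present in `expected` but not in `downloaded`."""
    want = {strip_gtdb_prefix(a.strip()) for a in expected if a.strip()}
    have = {strip_gtdb_prefix(a.strip()) for a in downloaded if a.strip()}
    return sorted(want - have)


# Placeholder accessions for illustration only (not real assemblies).
expected = ["RS_GCF_000000001.1", "GB_GCA_000000002.1", "GB_GCA_000000003.2"]
downloaded = ["GCF_000000001.1", "GCA_000000003.2"]

missing = find_missing(expected, downloaded)
print(missing)  # ['GCA_000000002.1']
```

Note that this compares full accessions including the version suffix, so a genome that was replaced by a newer version at NCBI would also show up as missing. The resulting list can then be checked against the assembly records at NCBI.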

Cheers,
Donovan