GTDB does give an NCBI taxonomy string, and there are also files comparing the two taxonomies. However, I am unable to find which NCBI taxonomy version (or taxdump version, if you wish) each GTDB release uses. I especially need this information for the NCBI taxonomy string in the metadata files of releases r202 and r207. I hope you can help me.
I don’t believe the NCBI Taxonomy is versioned. We base each GTDB release on a specific RefSeq release (e.g., GTDB Release 07-RS207 uses the genomes and the NCBI Taxonomy as they were shortly after the release of RefSeq 207).
You can get the NCBI Taxonomy string for each GTDB genome in the GTDB metadata files. Unfortunately, I’m not sure there is an easy way to get the exact taxdump files used for a given GTDB release.
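As a rough illustration of pulling the NCBI information out of a metadata file, here is a minimal Python sketch. It assumes the file is tab-separated with a header row containing the columns `accession`, `ncbi_taxid` and `ncbi_taxonomy`; please check the header of the release you are using, since the function name and the column names here are assumptions rather than a documented interface.

```python
import csv


def read_ncbi_taxonomy(metadata_path):
    """Map each genome accession to its (NCBI taxid, NCBI taxonomy string).

    Assumes a tab-separated file with a header row that contains the
    columns 'accession', 'ncbi_taxid' and 'ncbi_taxonomy'.
    """
    result = {}
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps stray quote characters in free-text columns
        # from being interpreted as field delimiters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            result[row["accession"]] = (row["ncbi_taxid"], row["ncbi_taxonomy"])
    return result
```

Calling `read_ncbi_taxonomy("path/to/metadata.tsv")` (a placeholder path) returns a dictionary keyed by accession, which can then be compared against whichever taxdump you choose.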
What is your use case that requires this information?
The use case is that I initially used the taxonomy string from GTDB, plus the marker genes, for a marker-gene-based taxonomic profiler. When I went to benchmark the profiler (and others) with the CAMI data, I ran into the problem that all of them used different versions of the taxonomy, and in the end I opted for adjusting the “gold standard reference” to the taxonomy each individual tool uses. I then realized that I do not know which NCBI taxonomy version was used for the taxonomy string in the metadata file. I solved the problem by taking the lowest available NCBI taxid for each representative genome and adding these taxids to a custom NCBI taxdump, so that I had control over the version.
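For anyone attempting something similar, a first step could be checking which taxids from the metadata file are still present in a chosen taxdump. The sketch below is only one possible approach, not the exact procedure I used: it assumes the standard `nodes.dmp` and `merged.dmp` files from an NCBI taxdump (fields separated by a tab, a pipe and a tab), and the function names are hypothetical.

```python
def read_dmp_pairs(path):
    """Yield the first two fields of every row of an NCBI .dmp file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Each field is terminated by a tab followed by a pipe.
            fields = [field.strip() for field in line.rstrip("\n").split("\t|")]
            if len(fields) >= 2 and fields[0]:
                yield fields[0], fields[1]


def classify_taxids(taxids, nodes_path, merged_path):
    """Label each taxid as current, merged into another taxid, or missing."""
    known = {taxid for taxid, _parent in read_dmp_pairs(nodes_path)}
    merged = dict(read_dmp_pairs(merged_path))  # old taxid -> new taxid
    status = {}
    for taxid in taxids:
        if taxid in known:
            status[taxid] = "current"
        elif taxid in merged:
            status[taxid] = "merged into " + merged[taxid]
        else:
            status[taxid] = "missing"
    return status
```

Taxids reported as "missing" are the ones that would need to be added to (or remapped in) a custom taxdump.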
For the future, it might be helpful for GTDB releases to use one of the monthly taxdumps from NCBI and then report which one was used.
Thanks for your quick reply and have a nice weekend!
Don’t worry about which NCBI taxdump version was used. We can simply convert all profiles into new ones based on a specified taxdump version using taxonkit profile2cami before comparing them.
The taxonomic profiles of the gold standards and all taxonomic profiles were converted to CAMI format with the NCBI taxonomy dump file of Dec 6, 2021 using the TaxonKit
I’ve flagged the need for the NCBI taxdump file used to generate a given GTDB release. In the future, we will include the actual taxdump file we used and indicate the date at which we downloaded it from NCBI.