When GTDB metadata doesn't agree with NCBI

Niolo9 · April 6, 2021, 9:08am

Dear authors,

I have been using GTDB to create Kraken2 databases. For this, I use the information in the metadata file https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz to download genomes.

I noticed the following: occasionally the information in the columns titled ‘ncbi_*’ doesn’t agree with the latest NCBI information found online. An example is GB_GCA_002290415.1.

According to the metadata file, this genome has ncbi_organism_name “Shigella flexneri” and ncbi_taxid 623. However, when I use the online NCBI Taxonomy Browser, GCA_002290415.1 is classified as Escherichia coli.

Since only a few genomes are affected, this seems insignificant. However, when I automate my database building process with the Struo tool, this E. coli genome is put in the database under the name Shigella flexneri. If I then try to classify reads belonging to E. coli, they are classified as Shigella flexneri.

Since I want to remove these genomes from my database building process, I was wondering how this disagreement came to exist. Could it, for example, be a version issue, where older NCBI information is used for the metadata file?

Thanks in advance

donovan.parks · April 7, 2021, 10:06pm

Hi. Yes, this is a version (history) issue. Data at NCBI, including species classifications, are not static. When GTDB R95 was being put together, NCBI had the metadata shown in bac120_metadata.tar.gz.
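If you want to flag the affected genomes before building a database, one option is to compare the `ncbi_taxid` recorded in the metadata file against a current accession-to-taxid mapping that you build yourself (for example from NCBI's assembly summary files). The following is only a rough sketch under some assumptions: the metadata is tab-separated with columns named `accession`, `ncbi_organism_name` and `ncbi_taxid`, GTDB accessions carry a `GB_` or `RS_` prefix, and the helper names `strip_gtdb_prefix` and `find_taxid_mismatches` are made up for illustration. The second row in the toy data is a placeholder, not a real genome.

```python
import csv
import io


def strip_gtdb_prefix(accession):
    """Drop the assumed 'GB_' / 'RS_' prefix so the accession matches NCBI's form."""
    if accession.startswith(("GB_", "RS_")):
        return accession[3:]
    return accession


def find_taxid_mismatches(gtdb_tsv, current_taxids):
    """Return rows whose recorded ncbi_taxid differs from the current one.

    gtdb_tsv: file-like object with a tab-separated header row that includes
              'accession', 'ncbi_organism_name' and 'ncbi_taxid' (assumed names).
    current_taxids: dict mapping NCBI accession -> current taxid (as strings).
    """
    mismatches = []
    for row in csv.DictReader(gtdb_tsv, delimiter="\t"):
        ncbi_accession = strip_gtdb_prefix(row["accession"])
        current = current_taxids.get(ncbi_accession)
        # Skip genomes we have no current information for.
        if current is not None and current != row["ncbi_taxid"]:
            mismatches.append(
                (row["accession"], row["ncbi_organism_name"], row["ncbi_taxid"], current)
            )
    return mismatches


# Toy input: the genome from this thread plus one placeholder row that agrees.
gtdb_tsv = io.StringIO(
    "accession\tncbi_organism_name\tncbi_taxid\n"
    "GB_GCA_002290415.1\tShigella flexneri\t623\n"
    "RS_GCF_000000000.1\ttoy organism\t1\n"
)
# Hypothetical up-to-date mapping; 562 is the NCBI taxid for Escherichia coli.
current_taxids = {"GCA_002290415.1": "562", "GCF_000000000.1": "1"}

mismatches = find_taxid_mismatches(gtdb_tsv, current_taxids)
for accession, name, old_taxid, new_taxid in mismatches:
    print(f"{accession}: metadata says {name} ({old_taxid}), NCBI now says taxid {new_taxid}")
```

The accessions returned this way could then be excluded from the genome list before it is passed on to the database build.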

Niolo9 · April 8, 2021, 8:10am

Hi Donovan, thanks for the answer. Problem solved.