I have been using GTDB to create Kraken2 databases. For this, I use the information in the metadata file https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz to download genomes.
I noticed the following: occasionally, the information in the columns titled ‘ncbi_*’ doesn’t agree with the latest NCBI information found online. An example is GB_GCA_002290415.1.
According to the metadata file, this genome has ncbi_organism_name “shigella flexneri” and ncbi_taxid 623. However, when I use the online NCBI Taxonomy Browser, GCA_002290415.1 is classified as Escherichia coli.
Since only a few genomes are affected, this seems insignificant. However, when I automate my database building process with the Struo tool, this E. coli genome is put in the database under the name Shigella flexneri. If I then try to classify reads belonging to E. coli, they are classified as Shigella flexneri.
Since I want to remove these genomes from my database building process, I was wondering how this disagreement came to exist. Could it, for example, be a version issue, where older NCBI information is used for the metadata file?
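For reference, here is a minimal sketch of how I would flag such genomes before building the database. It rests on several assumptions: the metadata table is tab-separated with a column named "accession" next to ncbi_taxid, the GB_/RS_ prefix can simply be stripped to get the NCBI accession, and I already have a lookup of current NCBI taxids per accession from some other source. The function names and the lookup dictionary are hypothetical, and the taxid 562 (species-level Escherichia coli) is only an illustrative value, not something I verified for this assembly.

```python
import csv
import io


def strip_gtdb_prefix(accession):
    """Turn e.g. 'GB_GCA_002290415.1' into 'GCA_002290415.1'."""
    for prefix in ("GB_", "RS_"):
        if accession.startswith(prefix):
            return accession[len(prefix):]
    return accession


def find_taxid_mismatches(metadata_handle, current_taxids):
    """Return (accession, metadata_taxid, current_taxid) for disagreeing rows.

    metadata_handle: open text handle of the tab-separated metadata table
                     (assumed columns: 'accession', 'ncbi_taxid').
    current_taxids:  hypothetical dict mapping NCBI accession -> current taxid,
                     obtained separately by the user.
    """
    mismatches = []
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    for row in reader:
        ncbi_accession = strip_gtdb_prefix(row["accession"])
        current = current_taxids.get(ncbi_accession)
        # Only flag genomes for which a current taxid is known and differs.
        if current is not None and str(current) != row["ncbi_taxid"]:
            mismatches.append((row["accession"], row["ncbi_taxid"], str(current)))
    return mismatches


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real metadata file.
    demo = "\n".join([
        "\t".join(["accession", "ncbi_organism_name", "ncbi_taxid"]),
        "\t".join(["GB_GCA_002290415.1", "shigella flexneri", "623"]),
    ])
    # Illustrative lookup only; the real current taxid must be looked up.
    lookup = {"GCA_002290415.1": 562}
    for hit in find_taxid_mismatches(io.StringIO(demo), lookup):
        print(hit)
```

The flagged accessions could then be dropped from the genome list before it is handed to Struo. This only detects the mismatch; it does not explain where it comes from.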
Thanks in advance