Misannotation in the GTDB226 Ref Database

Hi,

Apologies in advance for the long post. I recently tried a GTDB226 classifier on PacBio full-length 16S data, using qiime2-amplicon-2025.7. Following this tutorial - https://forum.qiime2.org/t/how-to-train-a-gtdb-ssu-classifier-using-rescript/25725 - I generated a GTDB226 reference database that is compatible with qiime2-amplicon-2025.7.

After running the classification, I noticed around 697 ASVs that were classified only as “d__Bacteria”. That in itself does not surprise me, but one of these ASVs, 8559b4a519962a4a022d270379a16656, has an overall abundance of around 72,000 reads across all samples. Its length is 1,434 bases.

When I BLASTed the sequence against the core_nt database, it came up as genus Akkermansia. So I was confused about why this sequence was not annotated even to the family or genus level. I looked inside the reference database and found two identical sequences in the GTDB 226 reference database:

  1. RS_GCF_002885135.1~NZ_PJKM01000003.1 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia massiliensis [location=109183..110691] [ssu_len=1509] [contig_len=357201]
  2. GB_GCA_937980385.1~CALJPV010000086.1 d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 [location=5203..6711] [ssu_len=1509] [contig_len=7155]

Both of these sequences are 100% identical to my ASV sequence. It seems pretty clear that, because the database contains two identical sequences assigned to different phyla, the classifier cannot resolve the conflict and does not call anything below the domain level. Also, the d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 entry looks like a MAG-based annotation and does not seem reliable. I believe this reference should be removed from the GTDB226 database.
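For anyone who wants to check their own copy for the same kind of conflict, here is a minimal sketch of how it could be done. It assumes the raw GTDB SSU FASTA with headers in the format shown above (ID, then the `d__...;p__...` lineage), and it only catches sequences that are exactly identical over their full length. The function names and the toy records are made up for illustration; this is not part of GTDB or QIIME 2.

```python
import re
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def phylum_of(header):
    """Extract the phylum name from a GTDB-style lineage in the header."""
    match = re.search(r"p__([^;]*)", header)
    return match.group(1) if match else None


def conflicting_duplicates(records):
    """Map each sequence seen under more than one phylum to its headers."""
    by_seq = defaultdict(list)
    for header, seq in records:
        by_seq[seq.upper()].append(header)
    return {
        seq: headers
        for seq, headers in by_seq.items()
        if len({phylum_of(h) for h in headers}) > 1
    }


# Toy data only: hypothetical IDs and placeholder sequences.
toy_fasta = """\
>RS_A~c1 d__Bacteria;p__Verrucomicrobiota;g__Akkermansia [ssu_len=8]
ACGTACGT
>GB_B~c2 d__Bacteria;p__Actinomycetota;g__Collinsella [ssu_len=8]
ACGT
ACGT
>RS_C~c3 d__Bacteria;p__Bacillota;g__Bacillus [ssu_len=8]
TTTTACGT
"""

conflicts = conflicting_duplicates(parse_fasta(toy_fasta.splitlines()))
for seq, headers in conflicts.items():
    print(f"{len(seq)} bp sequence shared across phyla:")
    for h in headers:
        print("  ", h.split()[0], phylum_of(h))
```

On the real file, `parse_fasta` could be fed an open file handle instead of the toy string. Grouping by exact sequence is the simplest choice; near-identical sequences would need a clustering or alignment step instead.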

I also raised this issue on the QIIME2 forum - QIIME2-amplicon-2025.7 16S PacBio classification issues/discussion - Technical Support - QIIME 2 Forum

Please let me know if my understanding is correct. Thank you.

Hi,

Thank you for your email. Unfortunately, the 16S rRNA sequences provided on the GTDB FTP site are not QC’ed in any manner. This means one should expect such errors to exist, as incorrect binning of 16S sequences is not uncommon. We are working on a tool to QC large marker gene DBs that will address this issue, but it is several months away from being completed. In the meantime, you might consider using a 3rd party 16S DB with GTDB annotations, such as the one provided by IDTAXA: DECIPHER - Downloads.

Cheers,
Donovan

Hi Donovan,

Thank you for your response. Based on the checks we did, would it be safe to remove the GB_GCA_937980385.1~CALJPV010000086.1 entry? I ask because:

  1. The sequences are exactly identical.
  2. We expect Akkermansia in the environment we are trying to analyze.
  3. When we BLAST the representative sequence at NCBI, it does not match anything in the phylum Actinomycetota.
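For reference, the removal I have in mind could be sketched roughly as below. It assumes the reference is still a plain FASTA whose record ID is the first whitespace-delimited token of the header; `drop_records` is a hypothetical helper, and the sequences in the toy input are placeholders, not the real 16S sequences.

```python
def drop_records(fasta_text, ids_to_drop):
    """Return FASTA text without the records whose ID is in ids_to_drop."""
    kept = []
    skipping = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record ID is the first token after ">".
            parts = line[1:].split(None, 1)
            record_id = parts[0] if parts else ""
            skipping = record_id in ids_to_drop
        if not skipping:
            kept.append(line)
    return "".join(line + "\n" for line in kept)


# Toy input: real IDs from this thread, placeholder sequences.
toy = (
    ">RS_GCF_002885135.1~NZ_PJKM01000003.1 d__Bacteria;p__Verrucomicrobiota\n"
    "ACGTACGT\n"
    ">GB_GCA_937980385.1~CALJPV010000086.1 d__Bacteria;p__Actinomycetota\n"
    "ACGTACGT\n"
)

filtered = drop_records(toy, {"GB_GCA_937980385.1~CALJPV010000086.1"})
print(filtered, end="")
```

If the sequences and taxonomy have already been split into separate files, the same ID would presumably have to be dropped from both, and the classifier retrained on the filtered reference.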

Thank you.

Hi. Yes, but you should fully expect that there are other misclassified 16S sequences.

Yes, I totally understand that. I BLAST every sequence that was not annotated even to the phylum level against the core_nt database. That gives me a rough idea of whether we are missing something big. Most of the time, the unannotated sequences turn out to be uncultured bacterium or host sequences. I am taking this one sequence at a time. :grinning_face: Thank you for your support.