Hi,
Apologies in advance for the long post. I recently tried a GTDB226 classifier on PacBio full-length 16S data, using qiime2-amplicon-2025.7. Following this reference - https://forum.qiime2.org/t/how-to-train-a-gtdb-ssu-classifier-using-rescript/25725, I generated a GTDB226 reference database that is compatible with qiime2-amplicon-2025.7.
After running the classification, I noticed around 697 ASVs that were classified only as “d__Bacteria”. That in itself does not surprise me much, but one ASV, 8559b4a519962a4a022d270379a16656, had an overall abundance of around 72,000 reads across all samples. Its length is 1,434 bases.
When I BLASTed the sequence against the core-nt database, it matched the genus Akkermansia. So I was confused about why this sequence was not annotated even to the family or genus level. I looked into the reference database and found two identical sequences in the GTDB226 reference database:
- RS_GCF_002885135.1~NZ_PJKM01000003.1 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia massiliensis [location=109183..110691] [ssu_len=1509] [contig_len=357201]
- GB_GCA_937980385.1~CALJPV010000086.1 d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 [location=5203..6711] [ssu_len=1509] [contig_len=7155]
Both of these sequences are 100% identical to my ASV sequence. It seems clear that because the database contains two identical sequences that belong to different phyla, the classifier cannot choose between them and does not make a call below the domain level. Also, the d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 record looks like a MAG-based annotation and not a reliable one. I believe this reference should be removed from the GTDB226 database.
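In case it helps anyone check for other cases like this, here is a minimal Python sketch of how such conflicts could be listed. It assumes a FASTA file whose headers follow the format quoted above (`<id> <lineage> [location=...] ...`) and it only catches records whose sequences are exactly identical as strings. The function names and the toy sequences are my own placeholders, not part of GTDB or QIIME 2, and the "shared lineage" is just the deepest rank on which the records agree, not a model of how the classifier computes its confidence.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def shared_lineage(lineages):
    """Return the leading ranks on which every lineage agrees."""
    split = [lineage.split(";") for lineage in lineages]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def conflicting_duplicates(fasta_lines):
    """Group records by exact sequence and report groups with more than one lineage."""
    groups = defaultdict(list)
    for header, seq in read_fasta(fasta_lines):
        seq_id, _, rest = header.partition(" ")
        # Assumed header layout: "<id> <lineage> [location=...] [ssu_len=...] ..."
        lineage = rest.split(" [", 1)[0]
        groups[seq].append((seq_id, lineage))
    report = []
    for records in groups.values():
        lineages = {lineage for _, lineage in records}
        if len(lineages) > 1:
            report.append((records, shared_lineage(lineages)))
    return report


# Toy input: the two headers are the ones quoted above; the sequences are
# short placeholders, and the third record is hypothetical.
DEMO = [
    ">RS_GCF_002885135.1~NZ_PJKM01000003.1 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia massiliensis [location=109183..110691] [ssu_len=1509] [contig_len=357201]",
    "ACGTACGTAC",
    ">GB_GCA_937980385.1~CALJPV010000086.1 d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 [location=5203..6711] [ssu_len=1509] [contig_len=7155]",
    "ACGTACGTAC",
    ">toy_record_3 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia massiliensis [location=1..10] [ssu_len=10] [contig_len=10]",
    "TTGACCGTAA",
]

for records, shared in conflicting_duplicates(DEMO):
    print([seq_id for seq_id, _ in records], "agree only down to:", shared)
```

On the toy input, the two quoted records are reported together and the only rank they share is d__Bacteria, which matches the assignment my ASV received. To run it on a real reference file, an open file handle for the FASTA could be passed in place of `DEMO`.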
I also raised this issue on the QIIME 2 forum, in the Technical Support thread "QIIME2-amplicon-2025.7 16S PacBio classification issues/discussion".
Please let me know if my understanding is correct. Thank you.