Wrong bins in the Thermosynechococcaceae (cyanobacteria)

Hello devs!

I appreciate the work you do on GTDB.

I recently had some trouble with GTDB-Tk mislabeling some bins at the order level. In the process of debugging, I discovered two entries in the database that are either misplaced or highly questionable.

CADCWO01 (GCA_902806435.1) - This entry is to blame for the mislabeling in my dataset. At first sight, it appears to be a high-quality MAG based on its completeness/contamination scores. However, looking at the actual record in NCBI (whether in the graphical view or by comparing the proteome size to the genome size), it becomes clear that this is not the case. Roughly a third to a half of all the genes are annotated as pseudogenes, which makes it a poor approximation of a real genome, even by MAG standards. (Presumably some NCBI quality check failed for this to be accepted.) I believe the high divergence of the pseudogenes is to blame for its (wrong) classification as Thermosynechococcaceae, since when I build genome trees, my bins of interest branch correctly (Leptolyngbyales) when this bin is absent, and wrongly (Thermosynechococcaceae) when it is present. This assignment of the bins of interest is supported by AAI calculations (64-68% when compared to various Leptolyngbyales, <64% for Thermosynechococcales). I would also suggest that all 96 bins from the same BioProject (PRJEB36534) be checked for quality issues: when I randomly picked 3 of them and viewed the first 3-4 contigs of each assembly, I noticed a rather large number of pseudogenes (but I had no time to go about it systematically).
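
For anyone who wants to run that check systematically, here is a minimal sketch of how the pseudogene fraction of an assembly could be estimated from its GFF3 annotation. It assumes that pseudogenes are marked on gene-level features by a `pseudogene` feature type, a `pseudo=true` attribute, or `gene_biotype=pseudogene`. That convention should be verified against the actual files. The function name `pseudogene_fraction` and the toy records are made up for illustration and are not taken from the real assembly.

```python
def pseudogene_fraction(gff_lines):
    """Count gene-level features and how many of them are flagged as pseudogenes.

    Assumption: pseudogenes carry feature type 'pseudogene', or the attribute
    pseudo=true, or gene_biotype=pseudogene (check this against your files).
    Returns (n_pseudo, n_genes, fraction).
    """
    n_genes = 0
    n_pseudo = 0
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        feature_type = cols[2]
        if feature_type not in ("gene", "pseudogene"):
            continue  # ignore CDS, exon, tRNA, ... child features
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        n_genes += 1
        if (
            feature_type == "pseudogene"
            or attrs.get("pseudo") == "true"
            or attrs.get("gene_biotype") == "pseudogene"
        ):
            n_pseudo += 1
    fraction = n_pseudo / n_genes if n_genes else 0.0
    return n_pseudo, n_genes, fraction


# Toy records (hypothetical, for illustration only).
TOY_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["contig1", "src", "gene", "1", "900", ".", "+", ".",
               "ID=gene0;gene_biotype=protein_coding"]),
    "\t".join(["contig1", "src", "CDS", "1", "900", ".", "+", "0",
               "ID=cds0;Parent=gene0"]),
    "\t".join(["contig1", "src", "gene", "1000", "1500", ".", "-", ".",
               "ID=gene1;gene_biotype=pseudogene;pseudo=true"]),
    "\t".join(["contig1", "src", "gene", "1600", "2000", ".", "+", ".",
               "ID=gene2;gene_biotype=protein_coding"]),
])

n_pseudo, n_genes, fraction = pseudogene_fraction(TOY_GFF.splitlines())
print(f"{n_pseudo}/{n_genes} genes flagged as pseudogenes ({fraction:.0%})")
```

Running the same function over the annotation file of each bin in the BioProject would give a per-assembly fraction that can be compared across bins, rather than relying on a visual check of a few contigs.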

Altericista sp. CCNU0014 (GCF_044923475.1) - I have no issues with the quality of this genome (it is even complete!), but I do believe it to be mislabeled as Thermosynechococcaceae when it should, in fact, be somewhere in the Elainellales. I assume that the mislabeling was caused by the high divergence of the taxon. There is little sequence data in NCBI assigned to this genus (https://www.ncbi.nlm.nih.gov/nuccore/?term=altericista). However, the genes that do exist (16S rRNA, rbcL, and a couple of genes involved in a light acclimation response) unambiguously point towards the Elainellales.

Thank you for your time!

Hi Laura,

Thanks for bringing this to our attention!

Re GCA_902806435.1:

It is very concerning to hear about the misclassification at the order level. However, this needs to be investigated further to rule out other potential issues that may not be linked to the high pseudogene count in that assembly. This genome has been consistently placed within the order ‘Thermosynechococcales’ in the last six releases of GTDB. It does not appear to be on a long branch in our reference trees, and there is nothing suspicious from a phylogenetic point of view. It is the sole representative of its genus, and while it sounds like none of the other genomes caused issues in your tree, it is difficult for us to assess why this specific genome may be causing misclassifications without seeing your dataset and trees. It would be helpful if you could provide further details, such as which version of GTDB-Tk was used, how many representatives were included in the trees, and which query genomes were included.

We do not include pseudogene counts in our quality checks for various reasons: the majority of pseudogenes will be legitimate, and we expect only low numbers of them to end up in the MSA built from the concatenation of the core marker genes used to infer the phylogeny. There are also no reliable ways to distinguish true pseudogenes from assembly artifacts or annotation errors. Based on the recent literature, 3.6% of RefSeq assemblies appear to suffer from significantly elevated pseudogene counts resulting from assembly errors (https://doi.org/10.1186/s12864-024-10137-0), but this does not necessarily mean that the phylogeny will be affected. Since there is only one genome in this cluster, it is difficult to determine whether the elevated count stems from assembly artifacts. In summary, this issue needs to be investigated systematically, and if you can provide the details indicated above, we will try to look into it when time permits.

P.S. Below is our latest reference tree (R232) showing the coding density of the genome GCA_902806435.1, which seems to fall within the range of its neighbours.

Re GCF_044923475.1:

This is a new genome in our database, and while individual gene trees could place it within a different order, there are well-documented reasons why this can be misleading: both 16S rRNA and functional genes can conflict with genome-based phylogenies (consider multiple gene copies, HGT, and low phylogenetic signal). We infer our trees from concatenated alignments of single-copy orthologous genes that are strictly vertically inherited, which we consider the most reliable evidence for stable taxonomy. Note that this genome is also a representative of a genus containing 9 other closely related species, all of which are placed with 100% bootstrap support within the family Thermosynechococcaceae, making its placement even more reliable.

Hope this helps!
Please don’t hesitate to ask any further questions and we will do our best to help.

Best wishes,

Masha