I recently noticed that some classifications in GTDB appear to be skewed by long branch attraction (LBA).
For example, the highly reduced endosymbiont Legionella polyplacis, shown to be a member of the genus Legionella by Říhová et al. 2017 (10.1093/gbe/evx217), now forms a separate order (o__G002776555) in r226. Manual examination of the GTDB tree revealed that L. polyplacis groups with a large number of highly reduced insect symbionts. Some other small genomes thought to belong to various taxonomic groups, such as Azoamicus (proposed to be a representative of order UBA6186 by Speth et al. 2024; 10.1038/s41467-024-54047-x), also fall within the same grouping. Together, this suggests an LBA artifact.
Are there any plans to address this issue in GTDB?
Thank you for reaching out and for your observations about potential LBA artifacts in our trees.
We are well aware of this issue, and we appreciate you bringing specific examples to our attention. The cases you mention, Ca. Legionella polyplacis and Azoamicus, are good illustrations of the challenges we face when dealing with highly reduced genomes. Note, however, that Ca. L. polyplacis has never been assigned to the genus Legionella itself in GTDB, although it was classified within the same family in previous releases.
You are absolutely right that LBA is problematic, and we are actively working on improving our methods to handle it better. This remains one of the more technically challenging aspects of large-scale phylogenies, but we hope to improve our tree reconstruction method to mitigate these issues in the next release in 2026.
Thank you again for using GTDB, and for taking the time to provide this feedback!
Thank you for the explanation; it is good to know that this issue is being addressed.
If I may make a suggestion, would you consider mentioning this in the user documentation? Users may not think about it until they find genomes missing from the expected taxa, since GTDB is otherwise often considered a gold standard in taxonomy.
I’ll just add that we use a conservative approach to assigning names to the reference tree to avoid regions of known or suspected LBA. That’s served us well to date, but as Masha said, we are trying to improve the reference tree to address the issue more directly.