Marker gene phylogeny vs ANI and AF

jfritscher · June 6, 2024, 3:51pm

Hey,

I have some mild confusion about the marker gene phylogeny vs ANI and AF species clusters.

In the first publication, the tree was described as being built from the reduced alignment of the 120 marker proteins of all dereplicated genomes. However, in the next one (I think), you introduced species clusters based on ANI and AF. So, if I understand correctly, GTDB-Tk uses both whole-genome similarity and marker-gene-based placement in the tree.

My questions:

Are my assumptions correct?
How different is the marker gene tree placement from the whole genome ANI placement?
Where can I find more information about this?

I apologize if I just missed the information in one of the publications. I was just not able to find where the ANI-based classification meets the marker gene tree and how they work together for genomes that are not species representatives.

Best,
Jogi

donovan.parks · June 8, 2024, 3:09pm

Hi Jogi,

GTDB-Tk uses ANI/AF to assign your genomes to species clusters. If a genome can’t be assigned to an existing species cluster, phylogenetically informative marker genes are used to place it into the GTDB reference tree.
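The two-step logic above can be sketched roughly as follows. This is only an illustration, not GTDB-Tk's actual code: the `AniHit` type, the `classify` function, and the default thresholds (`ani_min`, `af_min`) are hypothetical names and values chosen for the example, so please check the Methods page for the cutoffs used in the current release.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AniHit:
    """ANI/AF of a query genome against one species representative (hypothetical type)."""
    species: str  # species cluster of the representative genome
    ani: float    # average nucleotide identity, in percent
    af: float     # alignment fraction, between 0 and 1


def classify(hits: List[AniHit],
             ani_min: float = 95.0,  # assumed cutoff, for illustration only
             af_min: float = 0.5     # assumed cutoff, for illustration only
             ) -> Tuple[str, Optional[str]]:
    """Return (method, species): ANI/AF assignment first, tree placement as fallback."""
    # Keep only representatives that satisfy both the ANI and the AF criterion.
    passing = [h for h in hits if h.ani >= ani_min and h.af >= af_min]
    if passing:
        # Assign to the closest representative among those that pass.
        best = max(passing, key=lambda h: (h.ani, h.af))
        return ("ani_af", best.species)
    # No existing species cluster fits: the genome would instead be placed
    # into the reference tree using the marker genes (not modelled here).
    return ("marker_gene_placement", None)


hits = [
    AniHit("s__Bacteroides ovatus", ani=97.2, af=0.81),        # made-up values
    AniHit("s__Bacteroides xylanisolvens", ani=93.0, af=0.70),  # made-up values
]
print(classify(hits))
```

With these made-up values, the query is assigned to s__Bacteroides ovatus by ANI/AF, and the marker gene tree is never consulted for the species-level call.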

Information about GTDB and GTDB-Tk can be found in the citations listed on the GTDB About page (GTDB - About). The latest information about the GTDB methodology can be found on the Methods page (GTDB - Methods). The latest information about GTDB-Tk can be found on the GitHub page (GitHub - Ecogenomics/GTDBTk: GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes).

Cheers,
Donovan

jfritscher · June 12, 2024, 3:32pm

Hi Donovan,

Thanks for the reply. Unfortunately, the resources were not sufficient to clear up my confusion. Let me rephrase my question and add some detail:

I see that parts of marker gene sequences from members of species A align with higher similarity to the marker gene sequences of the species representatives of neighbouring species, say B or C. Is this because the species clustering based on the phylogeny of the marker proteins differs from the ANI- and AF-based clustering?
For example, a lot of reads from marker genes of members of s__Bacteroides ovatus align better to the marker genes of the s__Bacteroides xylanisolvens species representative than to those of the s__Bacteroides ovatus species representative. I also see this in a lot of Collinsella species, where the species clusters are not clearly delineated at the level of individual marker genes.
Is this a biological signal, or a technical issue with marker gene similarity not directly reflecting whole-genome ANI? Or could this be an issue with the genome alignments of the isolates or MAGs?

I hope this is clearer. I would not ask if I had managed to get this information out of the publications.

Best,
Jogi