Why is Bartonella classified as Bartonella_A and Bartonella_B?

Juliana_B · October 21, 2024, 9:32am

Dear GTDB Team,

I hope this message finds you well.

I am reaching out with a question regarding the genus Bartonella in the GTDB. I noticed that it has been subdivided into three groups: g_Bartonella_A, g_Bartonella_B, and g_Bartonella. Could you kindly explain the reasoning behind this subdivision, rather than using just g_Bartonella?

Thank you very much for your time and assistance. I look forward to your response.

Best regards,
Juliana

donovan.parks · October 30, 2024, 1:02pm

Hi Juliana,

Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.
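
As a small illustration of this naming convention, the sketch below splits a genus label into its base name and alphabetic suffix. The function name `split_genus_suffix` is hypothetical, and the pattern assumes suffixes are one or more uppercase letters after a final underscore; it also tolerates the single-underscore `g_` prefix as written in this thread. It is not part of any GTDB tool.

```python
import re

# Assumption: a placeholder suffix is a final underscore followed by uppercase letters.
SUFFIX_RE = re.compile(r"^(?P<base>.+)_(?P<suffix>[A-Z]+)$")


def split_genus_suffix(genus):
    """Return (base_name, suffix); suffix is None when the name has no suffix."""
    # Drop a leading rank prefix such as "g__" (or "g_") if present.
    name = re.sub(r"^g__?", "", genus)
    match = SUFFIX_RE.match(name)
    if match:
        return match.group("base"), match.group("suffix")
    return name, None


for label in ("g__Bartonella", "g__Bartonella_A", "g__Bartonella_B"):
    print(label, split_genus_suffix(label))
```

Genera that share a base name but differ in suffix are treated as distinct genera in the GTDB taxonomy.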

Cheers,
Donovan

Juliana_B · October 30, 2024, 2:08pm

Dear @donovan.parks

Thank you very much for your response.

I am currently analyzing genomes within the Rhizobiales order and am particularly interested in understanding the RED values associated with these genomes. I noticed that the classify_wf_output/classify/intermediate_results folder contains a set of RED values, and I would like to better understand what these values represent.

While I have the RED dictionary indicating a genus RED value of 0.92, I see some specific RED values, such as:

Brucella melitensis: 0.86

I expected the RED value for Brucella melitensis to be higher, especially since this bacterium belongs to a well-established genus. Could you clarify whether the RED values in this classification differ from those generated by (for example) PhyloRank, or whether another factor is influencing this classification? I also noticed that the pplacer taxonomy differs from the GTDB taxonomy.

I would like to use PhyloRank, but I have issues with the number of genomes that I need to use to create a tree. You mentioned needing at least two phyla, with at least two classes within each phylum. Is there a recommended number of genomes per class, or are there additional diversity guidelines to improve the accuracy of the resulting tree?

Thank you very much in advance.

donovan.parks · October 30, 2024, 2:45pm

Hi Juliana,

I would recommend not using data in the intermediate_results directory. This data is not intended for general use, the file formats are not formally specified, and all this data is subject to change with each GTDB-Tk release.

GTDB-Tk uses pplacer to place your genomes into the GTDB reference tree. The pplacer_taxonomy field indicates the taxonomic classification of your genomes based purely on their placement in this tree. The classification field also accounts for RED and is the recommended GTDB taxonomic classification of your genomes.
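
To see where the two fields disagree, a minimal sketch along these lines could be used. It assumes a tab-separated GTDB-Tk summary file with columns named `user_genome`, `classification`, and `pplacer_taxonomy`; check the header of your own file, since column names can change between releases. The function name and the toy rows are illustrative only and are not real GTDB-Tk output.

```python
import csv
import io


def differing_classifications(handle):
    """Return (genome, pplacer_taxonomy, classification) where the two fields differ."""
    reader = csv.DictReader(handle, delimiter="\t")
    differences = []
    for row in reader:
        placement = row["pplacer_taxonomy"]
        # Assumption: genomes without a tree placement carry an empty or "N/A" value.
        if placement in ("", "N/A"):
            continue
        if placement != row["classification"]:
            differences.append((row["user_genome"], placement, row["classification"]))
    return differences


# Toy data for illustration only; taxonomy strings are shortened placeholders.
toy_summary = io.StringIO(
    "user_genome\tclassification\tpplacer_taxonomy\n"
    "genome_1\td__Bacteria;g__GenusX_A\td__Bacteria;g__GenusX\n"
    "genome_2\td__Bacteria;g__GenusY\td__Bacteria;g__GenusY\n"
)
for genome, placement, final in differing_classifications(toy_summary):
    print(genome, placement, "->", final)
```

With a real run, you would pass an open file handle for the summary file instead of the in-memory toy table.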

The recommended way to obtain RED values is PhyloRank. The more fleshed out the tree you provide to PhyloRank, the better. I appreciate that the computational cost of inferring large trees is a challenge. I’d recommend at least two genomes from each GTDB-defined class at a minimum; two genomes from each GTDB family is much better, and using all GTDB species representatives is best.
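
One way to assemble such a subset is sketched below. It assumes a GTDB-style taxonomy table with one genome per line, a genome ID and a semicolon-separated taxonomy string (`d__...;p__...;c__...`) separated by a tab. The function name `pick_per_taxon` and the toy IDs are hypothetical, and the sketch simply keeps the first genomes it encounters per taxon rather than choosing by genome quality.

```python
from collections import defaultdict

RANK_PREFIXES = {"phylum": "p__", "class": "c__", "order": "o__",
                 "family": "f__", "genus": "g__"}


def pick_per_taxon(lines, rank="class", per_taxon=2):
    """Keep up to `per_taxon` genome IDs for every taxon at the given rank."""
    prefix = RANK_PREFIXES[rank]
    chosen = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        genome_id, taxonomy = line.split("\t")
        # Find the label for the requested rank, e.g. "c__Alphaproteobacteria".
        taxon = next((t for t in taxonomy.split(";") if t.startswith(prefix)), None)
        if taxon is None:
            continue
        if len(chosen[taxon]) < per_taxon:
            chosen[taxon].append(genome_id)
    return dict(chosen)


# Toy lines for illustration only; IDs and taxa are placeholders.
toy_lines = [
    "id_1\td__D;p__P1;c__C1;o__O1;f__F1;g__G1;s__S1",
    "id_2\td__D;p__P1;c__C1;o__O1;f__F2;g__G2;s__S2",
    "id_3\td__D;p__P1;c__C1;o__O2;f__F3;g__G3;s__S3",
    "id_4\td__D;p__P2;c__C2;o__O3;f__F4;g__G4;s__S4",
]
print(pick_per_taxon(toy_lines, rank="class"))
print(pick_per_taxon(toy_lines, rank="family"))
```

Taxa with fewer genomes than requested simply contribute what they have, so it is worth checking which classes or families end up with a single genome.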

Cheers,
Donovan