Table of Rank by RED

Hi, is there a table somewhere that ranks each genome in GTDB by relative evolutionary divergence (RED)?

Many thanks,

Vincent

Hi Vincent,

Relative evolutionary divergence (RED) is a relative measure between taxa that only makes sense on a specific tree. As such, we don’t currently have a public table indicating this information. If you are looking to classify genomes relative to the GTDB, you can use GTDB-Tk (GitHub: Ecogenomics/GTDBTk), a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes. You can calculate RED values on your own trees using PhyloRank (GitHub: dparks1134/PhyloRank), which assigns taxonomic ranks based on evolutionary divergence.

Cheers,
Donovan

Hi Donovan,

Thank you for the information and for the quick reply. I am looking for such a table so that I can extract “highly divergent” genomes from GTDB for a benchmarking analysis. I tried the CAMI dataset but it doesn’t really check the right boxes with respect to what I need in terms of a dataset.

Is there any way to pull out the most divergent genomes from GTDB? Or the most divergent genomes within the tree of each individual phylum?

Many thanks,

Vincent

Hi Vincent,

I don’t think there is a direct way to do this. My suggestion would be to use the GTDB bacterial and archaeal reference trees and look for genomes that have an atypically large patristic distance (sum of branch lengths) to their closest neighbour. This would get you divergent species. You could do something similar for other ranks (e.g. looking at the patristic distance from one genus to its closest neighbouring genus).

If you are a Python programmer, we generally use DendroPy for exploring trees.
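
To make the nearest-neighbour idea concrete, here is a minimal sketch that uses only the Python standard library on a toy Newick string. The function names and the toy tree are made up for illustration. The parser only handles plain labels and branch lengths, and the search is naively pairwise, so for the real reference trees (which are large and may carry quoted internal labels) a proper tree library such as DendroPy would be the better choice.

```python
import re
from itertools import combinations


def parse_newick(text):
    """Parse a simple Newick string (no quoted labels, no comments).

    Returns a list of node dicts with 'parent' (index or None),
    'length' (branch length to the parent), 'name' and 'leaf'.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    nodes = [{"parent": None, "length": 0.0, "name": "", "leaf": True}]
    current = 0
    for tok in tokens:
        if not tok.strip():
            continue
        if tok == "(":
            # Current node is internal; start its first child.
            nodes[current]["leaf"] = False
            nodes.append({"parent": current, "length": 0.0, "name": "", "leaf": True})
            current = len(nodes) - 1
        elif tok == ",":
            # Start a sibling under the same parent.
            parent = nodes[current]["parent"]
            nodes.append({"parent": parent, "length": 0.0, "name": "", "leaf": True})
            current = len(nodes) - 1
        elif tok == ")":
            current = nodes[current]["parent"]
        elif tok == ";":
            break
        else:
            # "name:length", "name" or ":length"
            name, _, length = tok.partition(":")
            nodes[current]["name"] = name.strip()
            nodes[current]["length"] = float(length) if length.strip() else 0.0
    return nodes


def nearest_neighbour_distances(newick):
    """Map each leaf name to (closest other leaf, patristic distance)."""
    nodes = parse_newick(newick)
    # For every leaf, record the cumulative branch length up to each ancestor.
    paths = {}
    for idx, node in enumerate(nodes):
        if not node["leaf"]:
            continue
        dist, cur, path = 0.0, idx, {}
        while nodes[cur]["parent"] is not None:
            dist += nodes[cur]["length"]
            cur = nodes[cur]["parent"]
            path[cur] = dist
        paths[node["name"]] = path
    nearest = {name: (None, float("inf")) for name in paths}
    for a, b in combinations(paths, 2):
        # The smallest summed distance over shared ancestors is the path
        # through the most recent common ancestor, i.e. the patristic distance.
        shared = paths[a].keys() & paths[b].keys()
        d = min(paths[a][x] + paths[b][x] for x in shared)
        if d < nearest[a][1]:
            nearest[a] = (b, d)
        if d < nearest[b][1]:
            nearest[b] = (a, d)
    return nearest


# Toy tree (hypothetical): leaf D sits on a long branch.
toy = "((A:0.1,B:0.1):0.2,(C:0.15,D:1.5):0.2);"
for leaf, (other, dist) in sorted(
    nearest_neighbour_distances(toy).items(), key=lambda kv: -kv[1][1]
):
    print(leaf, other, round(dist, 3))
```

Sorting by the nearest-neighbour distance puts the most isolated leaves first. What counts as "atypically large" is a judgment call, for instance a cut-off on the upper tail of that distribution.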

Cheers,
Donovan

Hi again,

Another (easier) approach would be to use the GTDB taxonomy to establish the divergence of genomes. For example, a genome that is the only representative of an order is highly divergent. A genome that is the only representative of a genus is also divergent, but far less so than the previous genome. This works within the GTDB taxonomy because the ranks are normalized by RED, so all taxa at a given rank (genera, families, and so on) are roughly equivalent in terms of time of divergence.
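
A rough sketch of the taxonomy-based approach, for anyone who wants to script it. It assumes a two-column, tab-separated taxonomy file (genome accession, then a semicolon-separated lineage from domain to species, e.g. `d__...;p__...;...;s__...`). The function names and the toy records are hypothetical, and whether to feed in all genomes or only species representatives is left open here.

```python
import csv
from collections import Counter

# Rank order of the seven fields in a lineage string (assumed layout).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def read_taxonomy(path):
    """Yield (genome_id, lineage_string) pairs from a two-column TSV file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]


def highest_singleton_rank(rows):
    """For each genome, return the highest rank at which it is the sole
    member of its taxon (None if it shares every taxon with another genome)."""
    rows = [(g, [t.strip() for t in tax.split(";")]) for g, tax in rows]
    # Count how many genomes fall under each taxon name.
    counts = Counter(taxon for _, lineage in rows for taxon in lineage)
    result = {}
    for genome, lineage in rows:
        result[genome] = next(
            (rank for rank, taxon in zip(RANKS, lineage) if counts[taxon] == 1),
            None,
        )
    return result


# Toy records (made up) in the assumed format.
toy_rows = [
    ("g1", "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 a"),
    ("g2", "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 b"),
    ("g3", "d__Bacteria;p__P1;c__C1;o__O2;f__F2;g__G2;s__G2 a"),
    ("g4", "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__G3;s__G3 a"),
]
for genome, rank in highest_singleton_rank(toy_rows).items():
    print(genome, rank)
```

Genomes whose highest singleton rank is order, class or phylum would be the "highly divergent" candidates; those that are singletons only at genus or species level are less so.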

Cheers,
Donovan

Hi Donovan,

Thanks for this. I don’t have much experience with Python, so I think I’ll go with the second suggestion. Thanks again for your quick replies and helpful suggestions.

All the best,

Vincent

Vincent, in case this is of use –

we’ve generated a subsampled GTDB data set where we’ve chosen leaf nodes (species) and then systematically selected genomes that share genus, family, order, class, and phylum. I think the latest version of the code is here (from Dr. Tessa Pierce) and we’d be happy to answer questions and do some light customization.

Drop us a note at ntpierce@gmail.com (tessa) and ctbrown@ucdavis.edu if you want to chat!

(I would not recommend using this code as-is without talking to us, because there may be some confounding assumptions baked in. But of course you are welcome to, as long as you don’t blame us for any infelicities :wink:)

–titus