Table of Rank by RED

Hi, is there a table somewhere that ranks each genome in GTDB by the relative evolutionary distance?

Many thanks,

Vincent

Hi Vincent,

Relative evolutionary divergence (RED) is a relative measure between taxa that only makes sense on a specific tree. As such, we don’t currently have a public table indicating this information. If you are looking to classify genomes relative to the GTDB, you can use GTDB-Tk (GitHub: Ecogenomics/GTDBTk). You can calculate RED values on your own trees using PhyloRank (GitHub: dparks1134/PhyloRank).

Cheers,
Donovan

Hi Donovan,

Thank you for the information and for the quick reply. I am looking for such a table so that I can extract “highly divergent” genomes from GTDB for a benchmarking analysis. I tried the CAMI dataset but it doesn’t really check the right boxes with respect to what I need in terms of a dataset.

Is there any way to pull out the most divergent genomes from GTDB? Or the most divergent genomes within the tree of each individual phylum?

Many thanks,

Vincent

Hi Vincent,

I don’t think there is a direct way to do this. My suggestion would be to use the GTDB bacterial and archaeal reference trees and look for genomes that have an atypically large patristic distance (sum of branch lengths) to their closest neighbour. This would get you divergent species. You could do something similar for other ranks (i.e. looking at the patristic distance from one genus to its closest neighbouring genus).

If you are a Python programmer, we generally use DendroPy for exploring trees.
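As a rough illustration of the nearest-neighbour idea, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy tree, the genome names A to D, and the helper names are hypothetical and not GTDB data. In practice you would load the GTDB reference tree with a library such as DendroPy rather than hard-coding a parent map.

```python
from itertools import combinations

# Toy rooted tree stored as child -> (parent, branch length).
# Names and lengths are hypothetical, not taken from GTDB.
TREE = {
    "n1": ("root", 0.1),
    "n2": ("root", 0.2),
    "A": ("n1", 0.1),
    "B": ("n1", 0.2),
    "C": ("n2", 0.3),
    "D": ("n2", 0.9),
}


def path_to_root(node, tree):
    """Map `node` and each of its ancestors to their distance from `node`."""
    dist = 0.0
    path = {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        dist += length
        path[parent] = dist
        node = parent
    return path


def patristic(a, b, tree):
    """Sum of branch lengths on the path between leaves a and b."""
    pa, pb = path_to_root(a, tree), path_to_root(b, tree)
    # Among shared ancestors, the most recent one gives the smallest sum.
    return min(pa[n] + pb[n] for n in pa if n in pb)


def nearest_neighbour_distances(tree):
    """For every leaf, the patristic distance to its closest other leaf."""
    parents = {parent for parent, _ in tree.values()}
    leaves = [n for n in tree if n not in parents]
    nearest = {leaf: float("inf") for leaf in leaves}
    for a, b in combinations(leaves, 2):
        d = patristic(a, b, tree)
        nearest[a] = min(nearest[a], d)
        nearest[b] = min(nearest[b], d)
    return nearest


nearest = nearest_neighbour_distances(TREE)
# Most isolated leaves first.
ranked = sorted(nearest, key=nearest.get, reverse=True)
print(ranked)
```

The all-pairs loop is fine for a toy tree but would be far too slow on the full reference trees, so for real data you would want to restrict the comparison (for example, to one phylum at a time).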

Cheers,
Donovan

Hi again,

Another (easier) approach would be to use the GTDB taxonomy to establish the divergence of genomes. For example, a genome that is the only representative of an order is highly divergent. A genome that is the only representative of a genus is also divergent, but far less so than the previous genome. This works within the GTDB taxonomy because the taxonomic ranks are normalized by RED, so all taxa at a given rank (genera, families, …) are roughly equivalent in terms of time of divergence.
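A minimal sketch of this taxonomy-based approach, assuming you have a mapping from genome ID to a semicolon-delimited GTDB lineage string. The genome IDs and taxon names below are hypothetical placeholders, and the function name is my own.

```python
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Hypothetical genomes and lineages in GTDB's "d__;p__;...;s__" style.
TAXONOMY = {
    "genome_1": "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__Ga;s__Ga sp1",
    "genome_2": "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__Ga;s__Ga sp2",
    "genome_3": "d__Bacteria;p__P1;c__C1;o__O1;f__F2;g__Gb;s__Gb sp1",
    "genome_4": "d__Bacteria;p__P1;c__C1;o__O2;f__F3;g__Gc;s__Gc sp1",
}


def highest_singleton_rank(taxonomy):
    """For each genome, the highest rank at which it is the sole member."""
    lineages = {g: t.split(";") for g, t in taxonomy.items()}
    # Count how many genomes fall under each taxon name.
    counts = Counter(taxon for lin in lineages.values() for taxon in lin)
    result = {}
    for genome, lineage in lineages.items():
        result[genome] = None
        # Walk from domain down; stop at the first taxon with one genome.
        for rank, taxon in zip(RANKS, lineage):
            if counts[taxon] == 1:
                result[genome] = rank
                break
    return result


singletons = highest_singleton_rank(TAXONOMY)
print(singletons)
```

Here genome_4 comes out as the only representative of its order and genome_3 as the only representative of its family, so they would be the more divergent picks. Run it over species representatives only, otherwise multiple genomes from one species will hide singleton taxa.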

Cheers,
Donovan

Hi Donovan,

Thanks for this. I don’t have much experience with Python, so I think I’ll go with the second suggestion. Thanks again for your quick replies and helpful suggestions.

All the best,

Vincent

Vincent, in case this is of use –

we’ve generated a subsampled GTDB data set where we’ve chosen leaf nodes (species) and then systematically selected genomes that share genus, family, order, class, and phylum. I think the latest version of the code is here (from Dr. Tessa Pierce) and we’d be happy to answer questions and do some light customization.

Drop us a note at ntpierce@gmail.com (tessa) and ctbrown@ucdavis.edu if you want to chat!

(I would not recommend using this code as-is without talking to us, because there may be some confounding assumptions baked in. But of course you are welcome to, as long as you don’t blame us for any infelicities!)

–titus

@donovan.parks Are the RED values in the Relative Evolutionary Divergence section of the stats page computed using the GTDB bacterial and archaeal reference trees? If so, are those RED values available anywhere? Relatedly, where can I find the reference trees?
Somewhat similar to xtremmicrobe, I’m hoping to run comparisons between SCG proteins from different HQ GTDB representatives and compare the results to their relatedness (same phylum, same class, same order, etc.). I was planning on simply grouping the scores by the highest shared rank, but would using RED scores be valid in this case?
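For concreteness, the grouping by highest shared rank could look something like the sketch below. It assumes both genomes come with full seven-rank GTDB lineage strings; the lineages and the function name are hypothetical.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def highest_shared_rank(lineage_a, lineage_b):
    """Return the most specific rank at which two lineages still agree."""
    shared = None
    pairs = zip(RANKS, lineage_a.split(";"), lineage_b.split(";"))
    for rank, taxon_a, taxon_b in pairs:
        if taxon_a != taxon_b:
            break
        shared = rank
    return shared


# Hypothetical lineages that diverge below the order level.
a = "d__Bacteria;p__P1;c__C1;o__O1;f__F1;g__Ga;s__Ga sp1"
b = "d__Bacteria;p__P1;c__C1;o__O1;f__F2;g__Gb;s__Gb sp1"
print(highest_shared_rank(a, b))
```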

Hi Calvin,

Yes, the RED values on the stats page are computed using the bacterial and archaeal reference trees. You can find these trees as part of the GTDB data for each release: GTDB Data - /releases/release226/226.0/. Specifically, ar53_r226.tree.gz and bac120_r226.tree.gz. These are NOT normalized by the RED criterion. You can normalize them using the PhyloRank outliers command.

RED values are specific to a given tree. In general, it is not meaningful to compare RED values on the GTDB domain-specific reference trees to RED values from other trees. That said, there are reasonable situations where comparing RED values across trees may be insightful. One should take care that the trees are inferred across the same set of extant taxa and that equivalent internal nodes are being compared across trees.

Thank you so much! I understand that inter-tree RED values aren’t meaningful, but intra-tree comparisons are fine, yes? Or is it just the case that RED values are useful for defining ranks, but quantifying “divergence” between taxa in a tree is best done by branch length?

Intra-tree RED values are most certainly meaningful and are a comparison of transformed branch lengths that aim to estimate divergence times.
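To make "transformed branch lengths" a bit more concrete, here is a toy sketch of the RED transform as I understand it. The formula is not stated in this thread, so treat it as an assumption and check it against PhyloRank: the root gets RED = 0, leaves get RED = 1, and an internal node gets p + (d / u) * (1 - p), where p is the parent's RED, d is the branch length to the parent, and u is the mean distance from the parent to the leaves below the node. The tree and function names are hypothetical.

```python
# Toy rooted tree: node -> list of (child, branch length). Hypothetical values.
CHILDREN = {
    "root": [("n1", 0.1), ("n2", 0.2)],
    "n1": [("A", 0.1), ("B", 0.2)],
    "n2": [("C", 0.3), ("D", 0.9)],
}


def leaf_distances(node, children):
    """Distances from `node` to every leaf below it (0.0 for a leaf)."""
    if node not in children:
        return [0.0]
    return [
        length + d
        for child, length in children[node]
        for d in leaf_distances(child, children)
    ]


def red_values(children, root="root"):
    """Assign RED to every node, visiting parents before their children."""
    red = {root: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, length in children.get(parent, []):
            if child not in children:
                red[child] = 1.0  # leaves (extant taxa) are fixed at 1
                continue
            dists = leaf_distances(child, children)
            # Mean distance from the parent to the leaves below `child`.
            u = length + sum(dists) / len(dists)
            p = red[parent]
            red[child] = p + (length / u) * (1 - p) if u > 0 else p
            stack.append(child)
    return red


red = red_values(CHILDREN)
print(red)
```

On this toy tree the two internal nodes get different RED values even though both sit one branch below the root, which is the sense in which RED rescales raw branch lengths within a single tree.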