I am using the bac120.tree downloaded from GTDB. It has 1132 leaves, which I assumed would correspond to the genomes of the reference species, one genome per GTDB species cluster. However, when I download the reference species (sp_clusters.tsv, for example), I get more than 40k genomes.
How can I get a list of the genomes that are represented in the tree? And how are these 1132 chosen from the more than 40k reference genomes of the species clusters?
Thank you
Hi,
GTDB provides an archaeal and a bacterial reference tree, each of which contains the reference genome for every GTDB species cluster. In the latest GTDB release (R06-RS202), the archaeal tree has 2,339 leaves and the bacterial tree has 45,555 leaves. The latest trees can always be downloaded from:
https://data.gtdb.ecogenomic.org/releases/latest/
We recommend viewing the trees in Dendroscope. If you program in Python, DendroPy can parse these trees; it is what we use ourselves.
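To list the genomes in a tree and check them against the species-cluster representatives, something along these lines may help. It is only a sketch: the function names are made up for illustration, and it assumes the leaf labels in the tree are written the same way as the representative genome IDs you take from sp_clusters.tsv.

```python
def read_leaf_labels(tree_path):
    """Return the leaf labels of a Newick tree, parsed with DendroPy."""
    # Imported here so the comparison helper below works without DendroPy.
    import dendropy

    tree = dendropy.Tree.get(
        path=tree_path,
        schema="newick",
        # Keep underscores in labels instead of converting them to spaces.
        preserve_underscores=True,
    )
    return [
        leaf.taxon.label
        for leaf in tree.leaf_node_iter()
        if leaf.taxon is not None
    ]


def compare_with_representatives(leaf_labels, representative_ids):
    """Split IDs into those shared by both sources and those unique to one."""
    leaves = set(leaf_labels)
    reps = set(representative_ids)
    return {
        "in_both": sorted(leaves & reps),
        "only_in_tree": sorted(leaves - reps),
        "only_in_clusters": sorted(reps - leaves),
    }


# Example usage (file name taken from the question above):
# labels = read_leaf_labels("bac120.tree")
# print(len(labels))
```

The number of labels returned should match the leaf count of the tree you downloaded, which is a quick way to confirm you have the file from the release you expect.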
Cheers,
Donovan