Multiple gtdbtk classify_wf output treefile

mengzhao · May 11, 2022, 11:50am

Hi,
Sorry for the dumb question, it is the first time I use GTDB-tk: I used gtdbtk classify_wf command, and got multiple treefiles in the output, and I could not figure out why, especially what is “gtdbtk.high.bac120.classify.tree”?

I got:
gtdbtk.bac120.classify.tree.11.tree
gtdbtk.bac120.classify.tree.6.tree
gtdbtk.high.bac120.classify.tree
gtdbtk.bac120.classify.tree.13.tree
gtdbtk.bac120.classify.tree.19.tree

Thanks in advance for the explanation.

p.chaumeil · May 16, 2022, 12:25am

Hello,
Starting with version 2.0.0, gtdbtk classify_wf pipeline uses the divide-and-conquer approach by default.
With this approach the placement of genomes is now divided in 2 steps:

The first one is to place the user genomes onto a backbone tree representing one genome per family in GTDB (gtdbtk.high.bac120.classify.tree) .
The genomes classified with a class level (order level for GTDB-Tk v2.0.0) on the backbone tree are then placed in a second tree (class-level subtrees) Those trees are made by a subsets of classes and are indexed for 1 to n ( gtdbtk.bac120.classify.tree..tree) .
This second placement will finalise the taxonomy of the user genome down to species level ( if possible)

We are preparing a manuscript with the details.

Please note, from version 2.1.0, gtdbtk.high.bac120.classify.tree has been renamed to gtdbtk.backbone.bac120.classify.tree

Shradha_Sharma · August 4, 2023, 4:51am

Hello,
Thanks for the explanation. I was struggling with the same problem. i have one more doubt though.
I want to visualise the tree generated. in my tree.mapping.tsv file, i got 3 organisms that were mapped to classify.6.tree

when i am visualising i am getting following problems:
the tree generated has about 60 organisms while my file of classify.6.tree should have just 3.i don’t understand which one is mine.
my data is very new. recently sequenced.
thanks in advance.

p.chaumeil · August 4, 2023, 5:09am

Hi,
Can you attached the classify.6.tree and the mapping file in this thread?

Thanks

Shradha_Sharma · August 4, 2023, 5:33am

Hi,
i am attaching drive link. it has three files. one is summary of all the bacteria classified. another is the tree mapping file. and other one is classify.7.tree.

https://drive.google.com/drive/folders/1olbJI8l_dZRd_AqEaqMEltaqUuCdmYkR?usp=sharing

Shradha_Sharma · August 4, 2023, 5:38am

this is the tree that iam getting

p.chaumeil · August 4, 2023, 5:50am

subset.7.tree (same as subset.6.tree) covers a subset of classes from the full GTDB reference tree. It consists of 9711 (10509 in subset.6.tree) reference genomes. so the output files will cover 9718 genomes for classify.7.tree ( 9711 + 7 user genomes based on the mapping file) and 10512 genomes for classify.6.tree ( 10509 + 3 user genomes based on the mapping file). Attached is a screenshot I have from classify7.tree where I highlight one of your genome (K_3_bin_13).

I would recommend using a tool like Dendroscope to open this tree files.

Kalonji_Abondance · August 16, 2023, 12:54pm

Hello I have a similar Question it’s my first time using GTDB-TK. i run the classify_wf on my MGAs. i have 82 of them i would like create a tree that only show my MAGs.

I would like to create something similar to what is done on this paper:
Fig. 2 | Scientific Data (nature.com)

here is my Classify_WF output:

kutti_biotech · August 21, 2024, 6:07am

i also have some doubt… this is the output tsv file from gtdbtk classify_wf.

bin	closest_genome_reference
bin.1.strict	GCA_000435755.1
bin.2.permissive	GCF_025147485.1
bin.4.strict	GCF_025148445.1
bin.5.strict	GCF_002834235.1
bin.6.permissive	GCA_000432435.1
bin.7.permissive	GCF_003269275.1
bin.8.strict	GCA_003514385.1

In the tree, the IDs are something different, and I want to rename them based on the organism. How can I do this? The IDs shown in the trees are GB_GCA_027858855.1, GB_GCA_020721905.1, GB_GCA_001829495.1, GB_GCA_016934735.1, GB_GCA_029974015.1, GB_GCA_029562645.1, and GB_GCA_004297825.1."