Sorry I stepped out of the convo for the last few weeks due to some travel and deadlines. I still don’t think my issue is clear, or maybe I’m not clear on the expected files and output. But quickly: @rothmanj, which file are you referring to as the decorated tree? Just to be clear.
I’ll provide some screenshots and examples as best I can…
I installed in conda/bioconda and use the conda environment to run…
Here is my workflow call (forcing pplacer to use 1 cpu otherwise it errors and terminates):
# 4. Run the classify workflow
conda activate gtdbtk
gtdbtk classify_wf \
--genome_dir $WKDIR/gtdb_analysis/refseq_fnas_461 \
--out_dir $WKDIR/gtdb_analysis/classify_out \
-x gz \
--cpus 8 \
--pplacer_cpus 1
Here is the directory output. Based on the GTDB documentation, I understand that the “classify.tree” (or the “gtdbtk.bac120.classify.tree” as it is) is the tree of query sequences placed by pplacer.
I would therefore expect this tree to contain the names of my original queries, but it has only the GTDB IDs (GB_GCA_002347485.1, etc.).
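A quick way to check this is to search the tree file for each query name. The following is only a minimal sketch: `check_queries_in_tree` is a hypothetical helper, not part of GTDB-Tk, and it assumes that a query's label in the tree is its file name with the `-x` extension removed (e.g. `mygenome1.fna.gz` becoming `mygenome1.fna`). It also uses plain substring matching, so a name that is a prefix of another name can give a false "found".

```shell
# Hypothetical helper (not a GTDB-Tk command): report which query genome
# names occur anywhere in a Newick tree file.
# Usage: check_queries_in_tree <tree_file> <genome_dir> <extension>
check_queries_in_tree() {
    tree_file=$1
    genome_dir=$2
    ext=$3
    for f in "$genome_dir"/*."$ext"; do
        # Skip the unexpanded glob when the directory has no matching files
        [ -e "$f" ] || continue
        # Assumed label: file name minus the trailing ".<extension>"
        name=$(basename "$f" ."$ext")
        # -F: fixed-string match, so dots in the name are not regex wildcards
        if grep -qF -- "$name" "$tree_file"; then
            echo "found   $name"
        else
            echo "missing $name"
        fi
    done
}
```

With the paths from the call above, this would be run as something like `check_queries_in_tree "$WKDIR/gtdb_analysis/classify_out/gtdbtk.bac120.classify.tree" "$WKDIR/gtdb_analysis/refseq_fnas_461" gz`. The exact location of the tree file inside the output directory is an assumption and may need adjusting.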
Maybe this is expected? Maybe this is just the bac120 reference genomes tree? But then I do not understand how that is helpful. I’m also still looking for the association key between my query genomes and the GTDB placement taxa name/ID (i.e. mygenome1 = GCA_123456789; mygenome2 = GCA_987654321; etc.). I don’t see this provided in the summary.tsv (or “gtdbtk.bac120.summary.tsv” as it is).
I thought the identification and association to the new set of references provided by GTDB was a major motivation of this tool, so I assume I’m completely misunderstanding something about the output files here.
Thanks for your attention and time in helping.