GTDB-Tk speed on multiple CPUs

m.bernt · March 5, 2024, 8:51pm

I’m trying to get GTDB-Tk running on multiple CPUs, but its CPU efficiency is always only a little more than 10%. I looked at top while it was running and noticed, for instance, that pplacer never uses more than 100% CPU (even though it’s called with -j …).

I’m using the latest version from biocontainers Quay (i.e. bioconda).

The input is 20 sequences with a total length of 300M.

Any ideas what I could try?

donovan.parks · March 6, 2024, 4:39pm

Hi. Are you running GTDB-Tk through a batch queueing system? We have noted odd behaviour from pplacer in the past in such cases:
https://ecogenomics.github.io/GTDBTk/faq.html?highlight=pplacer#gtdb-tk-reaches-the-memory-limit-pplacer-crashes

m.bernt · March 6, 2024, 6:03pm

I tried both. I ran it directly and via SLURM with the same effect. In both cases there was no memory problem (the program successfully finished).

I tried it again and have now realized that more than one core is used only in step 8 or 9 of pplacer’s 9 steps. For my data, the earlier pplacer steps, which use only one core, take a few minutes, while the parallel step takes a few seconds. Overall, this explains the low efficiency I observed.

Is this typical? Or maybe a problem with IO…

Maybe this excerpt from the output gives a clue:

[2024-03-06 18:55:29] WARNING: 2 of 5 genomes have a warning (see summary file).
[2024-03-06 18:55:29] TASK: Placing 10 bacterial genomes into backbone reference tree with pplacer using 4 CPUs (be patient).
[2024-03-06 18:55:29] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on output_dir/align/gtdbtk.bac120.user_msa.fasta.gz
==> Step 2 of 9: Pre-masking sequences.
[2024-03-06 18:57:46] INFO: Calculating RED values based on reference tree.
[2024-03-06 18:57:47] INFO: 9 out of 10 have an class assignments. Those genomes will be reclassified.
[2024-03-06 18:57:47] TASK: Placing 2 bacterial genomes into class-level reference tree 7 (1/6) with pplacer using 4 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on output_dir/classify/intermediate_results/pplacer
==> Step 2 of 9: Pre-masking sequences.
[2024-03-06 19:02:52] INFO: Calculating RED values based on reference tree.
[2024-03-06 19:02:54] TASK: Traversing tree to determine classification method.
[2024-03-06 19:02:54] INFO: Completed 2 genomes in 0.00 seconds (14,926.35 genomes/second).
[2024-03-06 19:02:54] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2024-03-06 19:02:56] INFO: Completed 4 comparisons in 1.81 seconds (2.21 comparisons/second).
[2024-03-06 19:02:56] INFO: 0 genome(s) have been classified using FastANI and pplacer.
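
For anyone wanting to see where the wall time goes, the timestamps in the excerpt above can be turned into per-step durations. The following is only a rough sketch: the `step_durations` helper and the regular expression are made up for illustration (they are not part of GTDB-Tk), and the log lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Timestamped lines copied from the excerpt above.
LOG = """\
[2024-03-06 18:55:29] TASK: Placing 10 bacterial genomes into backbone reference tree with pplacer using 4 CPUs (be patient).
[2024-03-06 18:57:46] INFO: Calculating RED values based on reference tree.
[2024-03-06 18:57:47] TASK: Placing 2 bacterial genomes into class-level reference tree 7 (1/6) with pplacer using 4 CPUs (be patient).
[2024-03-06 19:02:52] INFO: Calculating RED values based on reference tree.
[2024-03-06 19:02:54] TASK: Traversing tree to determine classification method.
"""

# Hypothetical pattern for "[YYYY-MM-DD HH:MM:SS] LEVEL: message" lines.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+): (.*)")


def step_durations(log_text):
    """Return (seconds, message) pairs: time from each stamped line to the next."""
    entries = []
    for match in STAMP.finditer(log_text):
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        entries.append((when, match.group(3)))
    return [
        (int((nxt[0] - cur[0]).total_seconds()), cur[1])
        for cur, nxt in zip(entries, entries[1:])
    ]


for seconds, message in step_durations(LOG):
    print(f"{seconds:>5d} s  {message[:60]}")
```

In this excerpt, the two pplacer placement tasks account for almost all of the elapsed time between the first and last timestamp.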

donovan.parks · March 6, 2024, 8:18pm

Hi,

I think this is expected. The first thing pplacer does is load the reference tree and data into memory. This is single-threaded and can take a substantial amount of time. This isn’t too much of an issue when processing large numbers of genomes, but, as you are observing, it is likely the rate-limiting step when processing fewer genomes.
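
To illustrate why a fixed single-threaded loading phase dominates for small inputs, here is a minimal back-of-the-envelope sketch. It assumes the placement work after loading scales perfectly across cores, which is an idealization; the function name and all numbers are hypothetical and not measured from GTDB-Tk or pplacer.

```python
def cpu_efficiency(load_s, per_genome_cpu_s, genomes, cpus):
    """Fraction of the allocated CPU time that is actually used.

    load_s:            single-threaded loading time in seconds (fixed cost)
    per_genome_cpu_s:  CPU seconds of parallelizable work per genome (assumed)
    genomes:           number of genomes processed
    cpus:              number of allocated cores
    """
    parallel_wall = per_genome_cpu_s * genomes / cpus  # assumes perfect scaling
    wall = load_s + parallel_wall                      # total elapsed time
    busy = load_s + per_genome_cpu_s * genomes         # CPU seconds actually used
    return busy / (wall * cpus)


# Hypothetical numbers: 5 minutes of loading, 2 CPU seconds per genome, 4 cores.
for n in (10, 100, 1000, 10000):
    print(n, round(cpu_efficiency(300, 2, n, 4), 3))
```

Under these assumptions, efficiency is low for a handful of genomes and approaches 100% only once the parallel work is large compared with the loading time.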

Cheers,
Donovan

m.bernt · March 7, 2024, 6:49am

Thanks for the explanations. This helped a lot.