How to estimate resource requirements for classifying 60k genomes?

I have 60k prokaryotic genomes, many of which were classified using different GTDB database releases. I’d like to propose reclassifying them all against the most up-to-date release (r214.1) using the ANI prescreen, but I would need a resource estimate first.

Is there any way I can estimate a lower and upper bound on the memory/time needed? Maybe an average per genome for when the ANI screen succeeds versus when a genome has to go through the full classify_wf workflow. Then I could estimate how much it would cost and whether it would be worth it.

Hi,

The approach you proposed sounds reasonable. I’d probably run a few small sets (say 50 to 100 genomes) and extrapolate from those, assuming the proportion of genomes caught by the ANI screen will be similar across the entire dataset.
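
Something like the back-of-the-envelope calculation below is what I have in mind; every number in it is a made-up placeholder you would replace with what you actually measure on your 50 to 100 genome pilot runs (just a sketch, not GTDB-Tk code):

```python
# Extrapolate a 60k-genome estimate from a small pilot batch.
# All values below are placeholders to be replaced with measured numbers.

PILOT_SIZE = 100          # genomes in the pilot run
PILOT_ANI_HITS = 70       # genomes resolved by the ANI prescreen in the pilot
PILOT_ANI_MIN = 0.5       # avg minutes per ANI-resolved genome (measured)
PILOT_FULL_MIN = 4.0      # avg minutes per genome needing the full classify_wf (measured)

TOTAL_GENOMES = 60_000

ani_fraction = PILOT_ANI_HITS / PILOT_SIZE
ani_genomes = TOTAL_GENOMES * ani_fraction
full_genomes = TOTAL_GENOMES - ani_genomes

total_hours = (ani_genomes * PILOT_ANI_MIN + full_genomes * PILOT_FULL_MIN) / 60
print(f"{ani_fraction:.0%} resolved by ANI prescreen; "
      f"estimated wall-clock time ~{total_hours:.0f} h")
```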

Cheers,
Donovan

Here it says 1 hr per 1000 genomes using 64 threads: Installing GTDB-Tk — GTDB-Tk 2.3.2 documentation

Does that include pplacer threads, or was pplacer set to 1 here to keep the memory down to ~65 GB?

I think what I’m going to do is run mash to get an estimate of how many genomes will be screened out and then do the calculation based on the remainder.
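
As a sanity check, I’ll probably do the arithmetic along these lines; the screened-out fraction and the cost of the ANI-only path below are placeholder assumptions until the mash run gives me real numbers, and the 1 h per 1,000 genomes figure is the one from the docs linked above:

```python
# Rough cost estimate for 60k genomes; placeholder assumptions marked below.

TOTAL_GENOMES = 60_000
SCREENED_FRACTION = 0.6      # placeholder: fraction I expect the ANI screen to resolve
FULL_H_PER_1000 = 1.0        # ~1 h per 1,000 genomes at 64 threads (from the GTDB-Tk docs)
ANI_H_PER_1000 = 0.1         # placeholder: assume the ANI-only path is ~10x cheaper

screened = TOTAL_GENOMES * SCREENED_FRACTION
remaining = TOTAL_GENOMES - screened

hours = (remaining / 1000) * FULL_H_PER_1000 + (screened / 1000) * ANI_H_PER_1000
print(f"{remaining:.0f} genomes through the full classify_wf, "
      f"~{hours:.0f} h of wall-clock time at 64 threads")
```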

Is there a way to use --genes without skipping the ANI screening, so that genes aren’t called again if they already exist?

Hi,

Does that include pplacer threads, or was pplacer set to 1 here to keep the memory down to ~65 GB?
The "--pplacer_threads" flag was set to 64 for this test.

Is there a way to use --genes without skipping the ANI screening, so that genes aren’t called again if they already exist?
Not currently. We assume that users do not use full genomes as input for GTDB-Tk when using the --genes option; as a result, GTDB-Tk is unable to calculate the Average Nucleotide Identity (ANI) against the representative genomes.

Cheers,
Pierre