Hi all,
Apologies if this is a duplicate question; none of the keywords I tried when searching turned up any hits. I think I found someone asking something similar on the GTDB-Tk GitHub, but the original request was a bit confusing, so I want to make sure I've understood the (negative) answer correctly. Sorry too if this is a bit verbose, but hopefully that makes it more searchable for others in the future.
Essentially, we include the GTDB-Tk `classify_wf` tool as a particular 'module' in our pipeline (nf-core/mag, for more context).
Currently, we are unable to test the GTDB-Tk part of the pipeline on the GitHub Actions CI runners we use during development, because the 110 GB database required by `classify_wf` (pointed to by the `GTDBTK_DATA_PATH` shell environment variable) is much too big for the runners' disk space (about 14 GB).
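For context, this is roughly how the module wires things up (the paths here are placeholders, and the exact flags in our module may differ slightly):

```shell
# GTDB-Tk locates its reference data via this environment variable,
# which must point at the unpacked release tarball (~110 GB on disk):
export GTDBTK_DATA_PATH=/path/to/gtdbtk_release_data

# The module then invokes classify_wf along these lines (shown as a
# comment, since it only works with the full database present):
# gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_results --cpus 8

echo "GTDBTK_DATA_PATH=${GTDBTK_DATA_PATH}"
```

So the CI problem is purely the size of whatever `GTDBTK_DATA_PATH` points at, not the tool invocation itself.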
I would like to know whether there exists, or whether it would be possible to create, a 'tiny'/dummy version of the GTDB release tarballs (e.g. containing just two or three genomes) that we could use to properly 'simulate' running GTDB-Tk in our tests.
Alternatively, a description of the minimal required files/structure (and how to generate them) needed to replicate the database would also be helpful.
I see that there are unit tests within the GTDB-Tk repo, but I can't really follow how they incorporate the files specified in `GTDBTK_DATA_PATH`.
I note that having such a tiny/mini/dummy/nonsense database would likely be a boon for many pipeline developers (which I think is what the original poster of the GitHub issue was pointing towards).
I hope the question is clear (even if verbose), and sorry again if this has been asked before.
Cheers,
James