Existence of (or how to create) a 'tiny' version of GTDB for use in GTDB-Tk testing

Hi all,

Apologies if this is a duplicate question; none of the keywords I tried when searching turned up any hits. I think I may have found someone asking something similar on the GTDB-Tk GitHub, but the original request was a bit confusing, so I want to make sure I have understood the (negative) answer correctly. Sorry also if this is a bit verbose, but that might help future searchability for others.

Essentially, we run the GTDB-Tk classify_wf tool as a particular ‘module’ in our pipeline (nf-core/mag, for more context).

Currently, we are unable to test the GTDB-Tk part of the pipeline on the GitHub Actions CI runners we use during development, because the ~110 GB database required by classify_wf (pointed to by the GTDBTK_DATA_PATH environment variable) is far too big for the runner’s hard drive (about 14 GB of free space).
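For context, the module effectively boils down to something like the following (a simplified Python sketch of the underlying call; the real invocation is generated by Nextflow, the paths are placeholders, and the exact flags may differ between GTDB-Tk versions):

```python
# Simplified sketch of how our pipeline step invokes classify_wf.
# Paths are placeholders; flags may vary by GTDB-Tk version.
import os
import subprocess

env = dict(os.environ)
# classify_wf finds its reference data via this environment variable;
# a full release is ~110 GB, which is what breaks our CI runners.
env["GTDBTK_DATA_PATH"] = "/path/to/gtdbtk_data"

subprocess.run(
    [
        "gtdbtk", "classify_wf",
        "--genome_dir", "bins/",
        "--out_dir", "gtdbtk_results/",
        "--extension", "fa",
        "--cpus", "4",
    ],
    env=env,
    check=True,
)
```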

I would like to know whether there exists, or whether it would be possible to create, a ‘tiny’/dummy version of the GTDB release tarballs (e.g. containing just two or three genomes) that we could use to properly ‘simulate’ running GTDB-Tk in our tests.

Alternatively, a description of the minimal set of files/directory structure required (and how to generate them) to replicate the database would also be helpful.
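To make concrete what I mean by a scaffold, here is a rough sketch of generating a dummy layout. The subdirectory names are only my guess from poking around an unpacked full release tarball, and I don’t know which of them classify_wf strictly requires (or what files it expects inside each), which is exactly what I’m hoping someone can clarify:

```python
# Sketch of a hypothetical 'tiny' GTDB-Tk data scaffold.
# Subdirectory names are assumptions based on the unpacked release tarball;
# the actual required contents are the open question.
from pathlib import Path

def make_dummy_gtdbtk_data(root: str) -> None:
    root_path = Path(root)
    for subdir in [
        "fastani", "markers", "masks", "metadata",
        "msa", "pplacer", "radii", "taxonomy",
    ]:
        (root_path / subdir).mkdir(parents=True, exist_ok=True)

make_dummy_gtdbtk_data("tiny_gtdbtk_data")
```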

I see that there are unit tests within the GTDB-Tk repo, but I can’t really follow how these incorporate the files specified in GTDBTK_DATA_PATH.
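What we would like to be able to do in CI is roughly the following (a hypothetical pytest-style sketch; run_gtdbtk_step is a placeholder for our pipeline code, not a real GTDB-Tk function):

```python
# Hypothetical test: point GTDBTK_DATA_PATH at a tiny mock database
# and exercise the pipeline's GTDB-Tk step against it.
import os

def test_classify_wf_with_tiny_db(tmp_path, monkeypatch):
    tiny_db = tmp_path / "tiny_gtdbtk_data"
    tiny_db.mkdir()
    monkeypatch.setenv("GTDBTK_DATA_PATH", str(tiny_db))
    assert os.environ["GTDBTK_DATA_PATH"] == str(tiny_db)
    # run_gtdbtk_step(...) would go here once a tiny database exists
```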

I note that having such a tiny/mini/dummy/nonsense database would likely be a boon for a lot of pipeline developers (which I think is what the original poster of the GitHub issue was pointing towards).

I hope the question is clear (even if verbose), and sorry again if this has been asked before.

Cheers,
James