Hi all,
Apologies if this is a duplicate question; none of the keywords I tried when searching turned up any hits. I think I found someone asking something similar on the GTDB-Tk GitHub, but the original request was a bit confusing, so I want to make sure I've understood the (negative) answer correctly. Sorry too if this is a bit verbose, but hopefully that makes it more searchable for others in the future.
Essentially, we include the GTDB-Tk `classify_wf` tool as a particular 'module' in our pipeline (nf-core/mag, for more context).
Currently, we are unable to test the GTDB-Tk part of the pipeline on the GitHub Actions CI runners we use during development, because the 110 GB database required by `classify_wf` (pointed to by the `GTDBTK_DATA_PATH` shell environment variable) is much too big for the runners' disk space (about 14 GB).
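For context, this is roughly how the module wires things up (the paths here are placeholders, and the exact flags in our module may differ slightly):

```shell
# GTDB-Tk locates its reference data via this environment variable,
# which must point at the unpacked release tarball (~110 GB on disk):
export GTDBTK_DATA_PATH=/path/to/gtdbtk_release_data

# The module then invokes classify_wf along these lines (shown as a
# comment, since it only works with the full database present):
# gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_results --cpus 8

echo "GTDBTK_DATA_PATH=${GTDBTK_DATA_PATH}"
```

So the CI problem is purely the size of whatever `GTDBTK_DATA_PATH` points at, not the tool invocation itself.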
I would like to know whether there exists, or whether it would be possible to create, a 'tiny'/dummy version of the GTDB release tarballs (e.g. containing just two or three genomes) that we could use to properly 'simulate' running GTDB-Tk in our tests.
Alternatively, a description of the minimal required files/structure (and how to generate them) needed to replicate the database would also be helpful.
I see that there are unit tests within the GTDB-Tk repo, but I can't really follow how they incorporate the files specified in `GTDBTK_DATA_PATH`.
I note that having such a tiny/mini/dummy/nonsense database would likely be a boon for many pipeline developers (which I think is what the original poster of the GitHub issue was pointing towards).
I hope the question is clear (even if verbose), and sorry again if this has been asked before.
Cheers,
James