Is it possible to provide a test database?

Hello,

I am currently working on a data manager for the Galaxy Project (https://usegalaxy.org). Its purpose is to store the complete reference data on the server so that every user can use it without downloading it for each tool run. Because the database needs a lot of storage, testing the data manager, or any tool that needs the database, is hard. I therefore wanted to ask whether it is possible to provide a small mock database (around 50 MB) with data that can be used to write good test cases for tools such as GTDB-Tk.

Hi,

You can use the check_install command to verify installation of GTDB-Tk. If you need to verify GTDB-Tk is operating as expected, I would suggest using the E. coli str. K-12 genome assembly from NCBI and verifying GTDB-Tk classifies this as E. coli.
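A minimal sketch of that classification check in Python, assuming the result is read from a tab-separated summary file with `user_genome` and `classification` columns and a semicolon-separated taxonomy string. The function name, the genome ID, and the example row are made up for illustration and are not real GTDB-Tk output:

```python
import csv
import io


def species_from_summary(tsv_text, genome_id):
    """Return the species label (without the 's__' prefix) for genome_id, or None."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["user_genome"] != genome_id:
            continue
        # Taxonomy string assumed to look like 'd__...;p__...;...;s__Genus species'.
        for rank in row["classification"].split(";"):
            if rank.startswith("s__"):
                return rank[3:]
    return None


# Illustrative summary content (assumed layout, hand-written for this example).
example = (
    "user_genome\tclassification\n"
    "ecoli_k12\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
    "s__Escherichia coli\n"
)

species = species_from_summary(example, "ecoli_k12")
```

In a real check, `tsv_text` would be the contents of the summary file written by the run, and the assertion would compare `species` with `"Escherichia coli"`.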

Cheers,
Donovan

Hi,

First, thank you for the answer, but this was not my question. I have worked with GTDB-Tk and it ran fine. The problem is that implementing a tool in Galaxy requires writing tests, and the files used in those tests should not be larger than 1 MB. That is why I opened this thread: to ask whether anyone on the GTDB team could create a test database that can be used for a case like this. The database does not need to be functional, or it could work with a specific genome only.

Maybe this helps to understand it better: I tried to mimic a database myself, but I do not know how the database is built, so I could not create one.

Cheers,
Santino

Hi Santino,

We do not have a reduced set of GTDB-Tk data files for testing purposes. This should be possible, but isn’t something we have considered. For now, I would recommend running check_install to confirm GTDB-Tk can find all relevant dependencies, and then spoofing the output of GTDB-Tk for integration testing in pipelines (e.g. run E. coli K-12 once, save the GTDB-Tk output, and use this saved output instead of running GTDB-Tk during integration tests). Not ideal, but neither is running GTDB-Tk with a reduced set of data files, which won’t necessarily indicate it is operating as expected when using the full set of reference data files.
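The spoofing approach could look roughly like the Python sketch below. It assumes the output of one real run has been saved in a directory, and the test copies that directory into place instead of invoking GTDB-Tk. The helper name `provide_gtdbtk_output`, the directory names, and the stand-in file are hypothetical and only serve the illustration:

```python
import shutil
import tempfile
from pathlib import Path


def provide_gtdbtk_output(saved_dir, out_dir):
    """Copy previously saved results into out_dir instead of running GTDB-Tk."""
    saved_dir, out_dir = Path(saved_dir), Path(out_dir)
    if not saved_dir.is_dir():
        raise FileNotFoundError(f"no saved GTDB-Tk output at {saved_dir}")
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(saved_dir, out_dir, dirs_exist_ok=True)
    return sorted(p.name for p in out_dir.iterdir())


# Demonstration with a stand-in file; a real fixture would hold the output
# saved from one genuine run on the E. coli K-12 assembly.
with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "saved_output"
    saved.mkdir()
    (saved / "gtdbtk.bac120.summary.tsv").write_text("user_genome\tclassification\n")
    copied = provide_gtdbtk_output(saved, Path(tmp) / "test_run")
```

Downstream pipeline steps can then be tested against the copied files, which stay small enough for a test suite because no reference data is involved.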

Cheers,
Donovan