How can I use the --mash_db option of classify_wf?

I just updated to 2.1.1 and it says I need to either select --skip_ani_screen or provide a --mash_db.

What is the suggested database for --mash_db?

Where can I download said database (e.g., from the "Supporting Data" page of the Mash 2.0 documentation)?

Heya, @jolespin,

That’s created on the first run if it doesn’t exist yet (assuming you’ve already set up the standard GTDB reference DB, e.g. with download-db.sh). So you can just provide the path where you want it to go, e.g. --mash_db gtdb-tk-r207.msh or whatever you’d like (that’s what the “path to save/read (if exists)” part of the help text is about).

Providing some argument there should let the process kick off as intended and create that mash file where you specified :+1:
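For example, something along these lines (the directory names and the .msh filename here are just placeholders, and the sketch file gets created on that first run):

gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdb-tk-output/ --extension fa --mash_db gtdb-tk-r207.msh --cpus 8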

For me, however, I’m currently getting an error after that’s generated (with v2.2.3 at least), telling me the gtdbtk.failed_genomes.tsv file it’s looking for doesn’t exist:

[2023-02-16 19:38:57] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-02-16 19:38:58] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: FileNotFoundError
  MESSAGE: [Errno 2] No such file or directory: 'gtdb-tk-output/identify/gtdbtk.failed_genomes.tsv'
________________________________________________________________________________

I can’t really look into this any further right now, so I’m not posting an issue or anything yet (I just found your post here while doing a quick search around), but I’d appreciate hearing back if you happen to hit that same error in the v2.1.1 you’re using, or if it works for you once you provide the --mash_db argument.

Thank you! Do you know what command is run to create the mash database in the backend? I’m trying to set up all the databases prior to running GTDB-Tk for my VEBA package (GitHub: jolespin/veba, a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes) so everything works out of the box!

Sure thing :+1:

I don’t know without digging, but the easiest thing would probably be to just run it once yourself and save the mash DB produced, then incorporate that however you’re incorporating the larger DB into your package. I’m pretty sure it was just that one mash file that we named, and it’s only about 1.5 GB with the currently included genomes, if I’m remembering right.
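Something like this, roughly (the paths are just examples for wherever your package keeps its reference data):

# after a first run has generated the sketch, stash it with your bundled databases
cp gtdb-tk-r207.msh /path/to/veba/databases/gtdb/
# later runs can then point at the pre-built sketch directly
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdb-tk-output/ --extension fa --mash_db /path/to/veba/databases/gtdb/gtdb-tk-r207.msh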

That’s a good idea. I looked at some of the internals, and from my very brief search it looks like there’s a Python function that creates the DB.
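If it’s just wrapping mash under the hood, something like this might reproduce it directly (the k-mer size and sketch size here are my guesses at GTDB-Tk’s defaults, so I’d double-check them against the source before baking this into anything):

# genome_paths.txt lists one reference genome FASTA per line
# produces gtdb-tk-r207.msh
mash sketch -l genome_paths.txt -o gtdb-tk-r207 -k 16 -s 5000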

Did you say that --mash_db throws an error for you?

Not that flag specifically, but later in the processing (after the mash database is created successfully), whenever mash is used at all, as described above with v2.2.3. I’d like to know if a run finishes successfully for you when using mash. It works as expected for me when I run it without the ANI/mash screen, i.e. going the --skip_ani_screen route.