Hey everyone,
This is more a suggestion than a direct issue. With the number of genomes increasing, and likely to keep growing at a fast pace, the GTDB-Tk database continues to grow not only in size but also in file count. Not everyone is aware of this, but some cluster configurations place strict limits on the number of files allowed on the filesystem. The current database contains more than 200K files, mostly genomic files. That’s a lot. I wonder if it would be possible to drastically reduce the number of files, for example by concatenating the genomes and using a fasta index tool instead.
Thanks for considering the option,
JSG
I’ve noticed this as well. I’ve been going back and forth with my engineering team: they want more traditional databases, and I keep explaining that flat files are the norm in bioinformatics. I wonder what other solutions there could be, like loading the genomes into a SQL DB or similar?
Hi,
Thanks for the comment, and we fully appreciate the concern. Unfortunately, there isn’t an easy solution since, as jolespin indicated, most bioinformatics software requires flat files. As such, even if we put the genomes in a DB, we would need to dump them back to disk before running many of the 3rd-party dependencies on which GTDB-Tk relies.
Cheers,
Donovan
There is no need for a database: as indicated in my first message, you can query a large fasta file with samtools faidx or other tools, producing flat files on the fly quickly and easily. This would, however, require a bit of scripting and possibly reformatting the fasta files before concatenation, but nothing impossible.
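To make the idea concrete, here is a minimal sketch of the offset-index approach that tools like samtools faidx use: scan a concatenated fasta once to record where each record starts, then seek directly to any genome and write it out on the fly. The function names and the record-per-genome layout are my own illustration, not part of GTDB-Tk; in practice you would likely just use samtools faidx itself rather than hand-rolled code.

```python
def build_index(fasta_path):
    """One pass over a concatenated fasta: map each sequence name
    (first word of the header) to the byte offset of its '>' line."""
    index = {}
    offset = 0
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = offset
            offset += len(line)
    return index

def fetch(fasta_path, index, name):
    """Extract one record on the fly, analogous to
    `samtools faidx concatenated.fa name`."""
    with open(fasta_path, "rb") as fh:
        fh.seek(index[name])
        record = [fh.readline()]          # header line
        for line in fh:
            if line.startswith(b">"):     # next record begins
                break
            record.append(line)
    return b"".join(record).decode()
```

The index itself can be persisted as a single small file, so the whole collection becomes two files instead of 200K, at the cost of one seek per extraction.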