The interpretation of genomic, transcriptomic and other microbial ‘omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/
proGenomes and GTDB both draw their genome sequences from NCBI, and each independently dereplicates the data to species representatives, so effectively GTDB already incorporates proGenomes and vice versa.
Note that proGenomes also provides links to the GTDB genome pages.
That’s awesome! I just saw this publication when I was searching for GTDB-Tk references earlier and thought I would relay it over. I just updated to 2.3.0 and r214.1, using the ANI screen with Mash.
I noticed there is a mash subdirectory under the GTDB database. Is there supposed to be a Mash file in there? That’s where I put my Mash database.
You can place the Mash database anywhere on your server; you do not have to use the mash subdirectory. Once the database is created, you will need to provide the path to the sketch file using –