Question about genome QC methods

Dear GTDB team,

I have some genomes of interest that I would like to quality-check using the same criteria that were applied to GTDB. From what I have seen in the methods section, most of the statistics for these criteria can be obtained with your CheckM program. However, I am not sure how criterion iv (contain >40% of the bac120 or ar53 marker genes) is calculated, or whether it can be calculated with CheckM or needs another program. Likewise, for criterion vi (have an N50 >5 kb), do you use N50_contigs or N50_scaffolds?

Could you tell me in detail how you get this statistic?

Thank you very much in advance for any help you can give me.
Best regards, Sam

Hi Sam,

We used the contig N50. Criterion iv is specific to GTDB and is used because we require a sufficient number of these marker genes for phylogenetic inference. For general MAG QC I would not recommend this criterion. If you really want to use this criterion let me know and I can see if GTDB-Tk can produce this number for you.
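For reference, a minimal sketch of how a contig N50 can be computed and checked against the >5 kb cutoff. This is an illustration, not the GTDB or CheckM implementation: the function names are hypothetical, and the rule of splitting scaffolds into contigs at runs of 10 or more `N` characters is an assumption that may differ from what a given tool does.

```python
import re


def contigs_from_scaffold(seq, min_n_run=10):
    """Split a scaffold into contigs at runs of N (assumed rule: >= min_n_run Ns)."""
    pattern = "[Nn]{%d,}" % min_n_run
    return [piece for piece in re.split(pattern, seq) if piece]


def contig_n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def passes_n50(lengths, min_n50=5000):
    """Check the 'N50 > 5 kb' criterion from the thread."""
    return contig_n50(lengths) > min_n50
```

Given a list of scaffold sequences, the contig lengths would be `[len(c) for s in scaffolds for c in contigs_from_scaffold(s)]`, which can then be passed to `passes_n50`.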

Cheers,
Donovan

Hello Donovan,

Thank you very much for your reply. Whether I need this criterion actually depends on another question.

I have been able to classify this set of MAGs of interest using GTDB-Tk 2.1.1. About half were classified to species level, the rest were classified only to genus level, and a few only to family or order. My goal is to generate a database based on GTDB and enrich it with these missing genomes.

After some research I understand that I have two options to do this:

Thank you very much in advance for any help you can give me.
Best regards, Sam

Hi Sam,

I would not recommend using the “GTDB Species Clusters Toolkit”. It is meant for internal use and has a large number of software and data dependencies that are not easily satisfied by external users.

My recommendation would be de novo ANI clustering with a tool like dRep or CoverM. These can produce ANI-based species clusters that are functionally equivalent to those produced by the GTDB Species Clusters Toolkit.
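To illustrate the general idea of ANI-based species clustering, here is a minimal sketch of greedy, representative-based clustering over precomputed pairwise ANI values. It is not how dRep or CoverM are implemented; the function name, the input layout, and the 95% threshold are assumptions chosen for illustration, and real tools also take factors such as alignment fraction and genome quality into account.

```python
def greedy_ani_clusters(genomes, ani, threshold=95.0):
    """Greedy representative-based clustering (illustrative only).

    genomes:   list of genome IDs, ordered by priority (best candidate first).
    ani:       dict mapping frozenset({a, b}) -> ANI in percent.
    threshold: assumed species-level ANI cutoff.

    Returns a dict mapping each representative to its cluster members.
    """
    clusters = {}
    for genome in genomes:
        for rep in clusters:
            # Missing pairs are treated as 0% ANI (no detectable similarity).
            if ani.get(frozenset((genome, rep)), 0.0) >= threshold:
                clusters[rep].append(genome)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters


# Hypothetical ANI values for three MAGs.
example_ani = {
    frozenset(("mag_A", "mag_B")): 97.2,
    frozenset(("mag_A", "mag_C")): 82.0,
    frozenset(("mag_B", "mag_C")): 81.5,
}
example_clusters = greedy_ani_clusters(["mag_A", "mag_B", "mag_C"], example_ani)
```

In this toy input, `mag_A` and `mag_B` fall into one cluster and `mag_C` forms its own, so `mag_C` would be a candidate novel species relative to the other two.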

Cheers,
Donovan

Thank you very much for all your help and quick response!

Best regards, Sam