Reduced/symbiont genomes that pass CheckM2 but not CheckM1

If I am reading the R226 release notes correctly, GTDB seems to filter on both CheckM1 and CheckM2, so what it accepts is the intersection of what the two tools would accept. This seems somewhat at odds with this part of the CheckM2 README:

As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

Indeed, there are genomes that score well on CheckM2 but not on CheckM1 and are rejected by GTDB, including:

I think the good CheckM2 completeness score and the contig count (1) provide enough confidence that the genome in question is indeed complete enough. Perhaps there should be a rule under which GTDB accepts genomes that score “particularly well” on CheckM2 (perhaps >70% completeness and <5% contamination), regardless of what CheckM1 says. I wanted to suggest a higher completeness bar, but CheckM2 still seems to have some trouble with Vidania.
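To make the suggestion concrete, here is a minimal sketch of the proposed override. The names `QualityEstimate` and `passes_proposed_checkm2_override` are hypothetical, and the 70% / 5% cutoffs are only the tentative values floated above, not anything GTDB implements.

```python
from dataclasses import dataclass


@dataclass
class QualityEstimate:
    """Hypothetical container for one tool's estimate (both in percent)."""
    completeness: float
    contamination: float


def passes_proposed_checkm2_override(
    checkm2: QualityEstimate,
    min_completeness: float = 70.0,   # tentative value from the post above
    max_contamination: float = 5.0,   # tentative value from the post above
) -> bool:
    """Return True if the CheckM2 estimate alone would be considered good
    enough under the proposed rule, regardless of the CheckM1 estimate."""
    return (
        checkm2.completeness > min_completeness
        and checkm2.contamination < max_contamination
    )
```

Under this sketch, a genome would only need to clear the CheckM2 bar; the CheckM1 estimate is deliberately not an input.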

Hi,

Thanks for the question. The criteria we have implemented make an exception for genome assemblies consisting of <10 contigs. Such genomes pass QC if the genome quality estimates are satisfied by either CheckM v1 or v2. Unfortunately, highly reduced symbiont genomes are a challenge, as they typically lack a sufficient number of phylogenetic marker genes to be robustly placed in the GTDB reference tree. Currently, we require genomes to have >40% of the marker genes we use for tree inference, and the genomes from the species you have flagged fail this criterion. You can find the full QC criteria used by GTDB in the GTDB FAQ.

Cheers,

Donovan
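For readers following along, the rule described in this reply can be sketched roughly as below. This is an illustration only, not GTDB's code: the `Genome` class, its field names, and `qc_failures` are hypothetical, the per-tool completeness/contamination thresholds are collapsed into two booleans because they are not stated in this thread, and the "both tools must pass" branch for assemblies with 10 or more contigs is an assumption based on the reading of the release notes in the first post.

```python
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical record; field names are illustrative only."""
    contig_count: int
    checkm1_ok: bool       # CheckM v1 estimates satisfy the quality criteria
    checkm2_ok: bool       # CheckM v2 estimates satisfy the quality criteria
    marker_percent: float  # percent of tree-inference marker genes present


def qc_failures(genome: Genome) -> list[str]:
    """Return the reasons a genome fails QC (empty list means it passes)."""
    failures = []

    if genome.contig_count < 10:
        # Exception from the reply: either tool is enough for <10 contigs.
        quality_ok = genome.checkm1_ok or genome.checkm2_ok
    else:
        # Assumption: otherwise both tools must agree.
        quality_ok = genome.checkm1_ok and genome.checkm2_ok
    if not quality_ok:
        failures.append("genome quality estimates")

    # Stated in the reply: more than 40% of the marker genes are required.
    if genome.marker_percent <= 40.0:
        failures.append("marker gene percentage")

    return failures


# Single-contig genome that only CheckM2 rates well but has few markers.
# The marker_percent value is made up for illustration.
reduced = Genome(contig_count=1, checkm1_ok=False, checkm2_ok=True,
                 marker_percent=30.0)
print(qc_failures(reduced))
```

The point of the sketch is that the marker gene check is independent of the CheckM exception, so a single-contig genome can satisfy the quality criteria and still be excluded.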

I see. To make this kind of failure reason more obvious, perhaps the “Genome Characteristics” section of the genome info page could include a row for “marker gene count/coverage %” or something similar?

Agreed. We are working on adding this information.

Hi,

Marker count information is now provided on each genome page.
