Reduced/symbiont genomes that pass CheckM2 but not CheckM1

Artoria2e5 · February 22, 2026, 4:32am

If I am reading the R226 release notes correctly, GTDB filters on both CheckM1 and CheckM2, so it accepts only the intersection of what the two tools would accept. This seems to run against this part of the CheckM2 README:

As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

Indeed there are genomes that score well on CheckM2 but not CheckM1 that are rejected by GTDB, including:

GTDB - Loading... (cicada symbiont, Candidatus Hodgkinia)
GTDB - Loading... (planthopper symbiont, Vidania)
GTDB - Loading... (also an insect symbiont, Nasuia)

I think the good CheckM2 completeness scores and the contig count (1) provide enough confidence that the genomes in question are indeed complete enough. Perhaps GTDB could add a rule that accepts genomes scoring “particularly well” on CheckM2 (perhaps >70% complete and <5% contamination; I wanted to suggest a higher completeness bar, but Vidania still seems to confuse CheckM2 a little), regardless of what CheckM1 says.
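
The suggested override can be sketched as a small predicate. This only illustrates the proposal above and is not anything GTDB implements: the function name and the `current_rule_pass` flag are hypothetical, and the 70% / 5% cutoffs are the tentative values floated in this post.

```python
def passes_proposed_rule(current_rule_pass: bool,
                         checkm2_completeness: float,
                         checkm2_contamination: float,
                         min_completeness: float = 70.0,
                         max_contamination: float = 5.0) -> bool:
    """Hypothetical rule: a strong CheckM2 estimate is accepted
    regardless of what CheckM1 says.

    current_rule_pass: outcome of whatever quality filter is already
    in place (assumed to be computed elsewhere).
    Completeness and contamination are percentages as reported by CheckM2.
    """
    strong_checkm2 = (checkm2_completeness > min_completeness
                      and checkm2_contamination < max_contamination)
    # Accept if the existing filter passes, or if CheckM2 alone is convincing.
    return current_rule_pass or strong_checkm2
```

Keeping the cutoffs as parameters makes it easy to try a stricter completeness bar later.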

donovan.parks · February 22, 2026, 11:53pm

Hi,

Thanks for the question. The criteria we have implemented make an exception for genome assemblies consisting of <10 contigs. Such genomes pass QC if the genome quality estimates are satisfied by either CheckM v1 or v2. Unfortunately, highly reduced symbiont genomes are a challenge, as they typically lack a sufficient number of phylogenetic marker genes to be placed robustly in the GTDB reference tree. Currently, we require genomes to have >40% of the marker genes we use for tree inference, and the genomes from the species you have flagged fail this criterion. You can find the full QC criteria used by GTDB here: GTDB - FAQ.

Cheers,

Donovan
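
For readers following along, the logic described in this reply can be sketched roughly as below. This is a simplified reading of the thread, not GTDB's actual code: the function and parameter names are hypothetical, the quality thresholds themselves are left to the caller as booleans, the "both tools must pass" branch for larger assemblies is inferred from the first post, and the remaining criteria listed in the FAQ are omitted.

```python
def passes_qc_sketch(checkm1_ok: bool,
                     checkm2_ok: bool,
                     contig_count: int,
                     marker_fraction: float) -> bool:
    """Simplified sketch of the QC decision discussed in this thread.

    checkm1_ok / checkm2_ok: whether each tool's quality estimates
    satisfy the quality criteria (thresholds not modelled here).
    marker_fraction: fraction (0..1) of the tree-inference marker
    genes found in the genome.
    """
    if contig_count < 10:
        # Exception for assemblies with <10 contigs: either tool suffices.
        quality_ok = checkm1_ok or checkm2_ok
    else:
        # Assumed from the first post: otherwise both tools must agree.
        quality_ok = checkm1_ok and checkm2_ok
    # The marker-gene requirement applies regardless of the CheckM results.
    return quality_ok and marker_fraction > 0.40
```

Under this sketch, a single-contig genome that passes only CheckM2 still fails when its marker fraction is at or below 0.40, which matches the explanation given for the flagged symbiont genomes.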

Artoria2e5 · February 24, 2026, 1:54pm

I see. To make this kind of failure reason more obvious, perhaps the “Genome Characteristics” section of the genome info page can add a row for “marker gene count/coverage%” or something similar?

donovan.parks · March 2, 2026, 5:49pm

Agreed. We are working on adding this information.

donovan.parks · March 3, 2026, 2:43pm

Hi,

Marker count information is now provided on each genome page: