MAG or isolate and ambiguous CheckM scores

Hey,

First, would it be possible to reveal which combination of fields in the metadata file determines whether a genome is displayed as a MAG or an isolate in the online browser? This topic has been discussed in "Assess if a genome is a MAG with *_metadata_r95.tsv files", but I find the information there incomplete.

As an example: GCF_016902615.1, s__Collinsella tanakei_A. How does GTDB derive that this is an isolate?

I looked up the ASM for this entry and found that they report the following there:
[image]

However, GTDB reports a completeness of 100% and a contamination of 0%, which is frankly quite confusing.

It would be great if you could shed some light on why that is. I am sure there is an explanation that I am not aware of.

Best,
Jogi

PS: I was gonna put the links in, but for new users there is a restriction on the number of links I can post.

Hi Jogi,

Whether a genome is labelled an isolate, MAG, or SAG is determined by metadata from NCBI, which is stored in the ncbi_genome_category field of the GTDB metadata files. In general, NCBI makes this determination based on the submitter indicating that a genome assembly is an isolate, MAG, or SAG. This determination is independent of any assessment of the quality of the genome assembly.
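
If you want to check this field yourself, here is a minimal sketch that reads a GTDB metadata TSV and returns the ncbi_genome_category value for one genome. It assumes the file has an `accession` column whose IDs carry an `RS_` or `GB_` prefix in front of the NCBI accession. The function name and the file name in the usage comment are only illustrative.

```python
import csv

def genome_category(lines, ncbi_accession):
    """Return ncbi_genome_category for one genome, or None if not found.

    `lines` is any iterable of TSV lines, e.g. an open file handle.
    Assumes an `accession` column with GTDB-style IDs such as
    RS_GCF_... or GB_GCA_... (an assumption about the file layout).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        acc = row["accession"]
        # Strip the assumed RS_/GB_ prefix before comparing.
        if acc.startswith(("RS_", "GB_")):
            acc = acc[3:]
        if acc == ncbi_accession:
            return row.get("ncbi_genome_category")
    return None

# Usage (hypothetical file name, adjust to your release):
# with open("bac120_metadata_r95.tsv", newline="") as fh:
#     print(genome_category(fh, "GCF_016902615.1"))
```

The function stops at the first matching row, so it does not load the whole metadata file into memory.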

NCBI uses the CheckM v1 “taxonomy based” marker set that best aligns with the NCBI classification of a genome when estimating genome quality. GTDB uses the CheckM v1 “lineage based” marker set, which is determined by placing a genome into a reference tree. These marker sets can differ, and the resulting quality estimates will then differ as well. GTDB also now provides CheckM v2 quality estimates.
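
To see the GTDB-side estimates for a genome next to each other, something like the sketch below may help. The column names are my assumption about the metadata layout, and the CheckM v2 columns may be missing in older releases, so the function only returns the columns that are actually present. NCBI's own estimate is not assumed to be in this file and would need to be looked up on the NCBI assembly page.

```python
import csv

# Assumed column names; check them against the header of your release.
QUALITY_COLUMNS = [
    "checkm_completeness",
    "checkm_contamination",
    "checkm_marker_lineage",
    "checkm2_completeness",
    "checkm2_contamination",
]

def quality_estimates(lines, gtdb_accession):
    """Return the available quality columns for one genome as a dict.

    `lines` is any iterable of TSV lines, e.g. an open file handle.
    `gtdb_accession` is the ID exactly as written in the assumed
    `accession` column. Returns None if the genome is not found.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row.get("accession") == gtdb_accession:
            # Keep only the quality columns that exist in this file.
            return {c: row[c] for c in QUALITY_COLUMNS if c in row}
    return None
```

Values are returned as strings exactly as they appear in the file, so convert them with float() before doing any arithmetic.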

Cheers,
Donovan