"undefined name" entries

Hi all,

I’ve recently come across a few entries with “Undefined (Failed Quality Check)” in the GTDB taxonomy field (such as: GTDB - GCA_003648795.1), and I’m somewhat confused about their status.

Are entries like this particular one included in the GTDB set, or have they failed one or more of the criteria for inclusion? (My cited example looks like it fails the N50 criterion.)

As I’m formulating the question I think I get what’s going on, but it would still be good to confirm. Are all assemblies in RefSeq trawled and given a web entry, with only those that pass all quality criteria included in the non-redundant species representative set? I.e., are there more web entries than genomes in the GTDB?

Thanks!

Hi Dan,

Such genomes have failed our QC criteria. As such, they are never assigned to a GTDB species cluster or given a GTDB Taxonomy string.

We organize all genomes that pass QC into ANI-based species clusters (A complete domain-to-species taxonomy for Bacteria and Archaea - PubMed). Our reference tree contains only one genome (the “representative”) for each species cluster. Web entries are created for all these genomes, and many of the auxiliary data files provided on the GTDB website span all genomes passing QC.
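
To make the relationship between these sets concrete, here is a minimal Python sketch. It is purely illustrative: the `Genome` fields, the `EXAMPLE_*` accessions and the cluster label are hypothetical and do not reflect the actual GTDB schema or pipeline. The only details taken from this thread are the accession GCA_003648795.1 and the “Undefined (Failed Quality Check)” label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Genome:
    accession: str
    passed_qc: bool
    species_cluster: Optional[str] = None  # stays None when QC failed
    is_representative: bool = False


def displayed_taxonomy(genome: Genome) -> str:
    """Label shown in the taxonomy field of a web entry (illustrative)."""
    if not genome.passed_qc:
        return "Undefined (Failed Quality Check)"
    return genome.species_cluster or "Unclassified"


def summarize(genomes: List[Genome]) -> dict:
    """Count the three nested sets discussed in this thread."""
    with_web_entry = list(genomes)  # web entries exist even for QC failures
    clustered = [g for g in genomes if g.passed_qc]
    in_tree = [g for g in clustered if g.is_representative]
    return {
        "web_entries": len(with_web_entry),
        "passing_qc": len(clustered),
        "in_reference_tree": len(in_tree),
    }


# Hypothetical records: only the first accession comes from this thread.
genomes = [
    Genome("GCA_003648795.1", passed_qc=False),
    Genome("EXAMPLE_0001", True, "s__Hypothetical species", True),
    Genome("EXAMPLE_0002", True, "s__Hypothetical species", False),
]

summary = summarize(genomes)
for name, count in summary.items():
    print(f"{name}: {count}")
```

In this toy data set every genome has a web entry, two of the three pass QC and share one species cluster, and only the representative of that cluster would appear in the reference tree.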

Cheers,
Donovan

Hi Donovan,

Thanks for getting back to me quickly, much appreciated! I had assumed that genomes that fail the QC just wouldn’t be included anywhere in the GTDB ecosystem (including the web interface), and hadn’t realized that the web interface does have entries for them.
For what it’s worth, your approach is the more elegant way to deal with RefSeq entries not passing QC, as it makes clear that their parent datasets are included.

best,
Daan