Newly formed ECMA0423 out of Escherichia coli and representative choice

Hello GTDB team,

Thanks for this new release, I can only appreciate the amount of work and curation that goes into this essential resource!!

I was puzzled to notice that one of my isolates, previously assigned to the Escherichia coli species cluster in r226, is now assigned to the newly formed ECMA0423 species cluster. Given the large sizes of this cluster (13k genomes) and of the remaining Escherichia coli cluster (36k), I was surprised that this split was not documented in the release notes.

I was especially concerned as to why a MAG of relatively medium quality was selected as the representative of this species cluster (among 13k genomes, including some isolates).

In particular, I spotted these two genomes from isolates with few contigs and perfect Checkm scores:

I can only imagine that clustering such a large quantity of genomes must come with compromises, but I fail to understand the choices for this one. I would be very grateful for pointers to understand the decisions. Please find below the code to get the data and to reproduce the figure.

All the best,

Charlie

PS: can’t add all the links as a new user sorry.

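# Shell: download the r232 bacterial metadata, then keep the genomes whose representative is GB_GCA_047199055.1
aria2c -c -s 16 -x 16 -k 1M -j 1 https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/bac120_metadata_r232.tsv.gz
csvtk filter2 -t -o subset_ECMA0423.tsv -f '$gtdb_genome_representative=="GB_GCA_047199055.1"' bac120_metadata_r232.tsv.gz
# R: plot CheckM2 completeness against contamination for the ECMA0423 cluster
library(readr)
library(ggplot2)
library(cowplot)
ecma <- read_tsv("/data/subset_ECMA0423.tsv")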
p <- ggplot(ecma, aes(x = checkm2_completeness, y = checkm2_contamination))+
  geom_point(aes(color =gtdb_representative, shape = gtdb_representative ), alpha = 0.7, size = 3)+
  scale_shape_manual(values = c("TRUE"=1, "FALSE"=5))+
  scale_color_manual(values = c("TRUE"="#FF69B4", "FALSE"="black"))+
  geom_rug(alpha=0.1)+
  geom_hline(yintercept = 5, linetype = "dashed")+
  geom_vline(xintercept = 90, linetype = "dashed")+
  labs(x="Completeness (Checkm2)", y = "Contamination (Checkm2)", color = "ECMA0423 representative",
       shape = "ECMA0423 representative")+
  theme_cowplot()+theme(legend.position = "bottom")

ggsave("/data/plot_ECMA0423.png",p, width = 5, height = 5, units = "in", bg = "white")

Hi Charlie,

Apologies for the slow reply. May is a busy month for the GTDB team.

GTDB uses a strict ANI-based definition for defining species. You can read about this in the following two manuscripts:

This works well in general, but can lead to situations that are less than ideal. This has clearly happened here for E. coli, where the genome GCA_047199055.1 has 94.9% ANI to the type strain of E. coli (GCF_003697165.2). GTDB considers genomes with <95% ANI to be from different species. As such, the GCA_047199055.1 genome was selected as a type genome for a new species cluster, which was given the name s__ECMA0423 sp047199055. GTDB also requires (guarantees) that genomes be assigned to the closest type genome. As such, a large number of genomes previously classified as E. coli were reassigned to this new species cluster.
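To make the two rules above concrete, here is a minimal R sketch of how they interact. It is only a toy illustration and not GTDB's actual pipeline: the ANI values, the genome names and the `assign_species` helper are all made up for the example, and it ignores everything except the 95% threshold and the closest-genome rule.

```r
# Toy illustration only: hypothetical ANI values (%), not real GTDB output.
# Rows are query genomes, columns are the two species representatives.
ani <- rbind(
  genome_A = c(E_coli = 98.7, ECMA0423 = 95.2),
  genome_B = c(E_coli = 95.4, ECMA0423 = 98.9),
  genome_C = c(E_coli = 94.1, ECMA0423 = 93.8)
)

ani_threshold <- 95  # species boundary mentioned in the reply

# Hypothetical helper: assign each genome to the representative with the
# highest ANI, or flag it when even the best hit is below the threshold.
assign_species <- function(ani, threshold) {
  best <- apply(ani, 1, which.max)                    # column index of closest representative
  best_ani <- ani[cbind(seq_len(nrow(ani)), best)]    # ANI to that representative
  data.frame(
    genome = rownames(ani),
    closest_rep = colnames(ani)[best],
    ani = best_ani,
    assigned = ifelse(best_ani >= threshold, colnames(ani)[best], "new cluster"),
    stringsAsFactors = FALSE
  )
}

print(assign_species(ani, ani_threshold))
```

In this toy table, genome_B is above 95% ANI to the E. coli representative but is still assigned to ECMA0423 because that representative is closer, which is the kind of reassignment described above.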

I agree this is not ideal and we are looking to improve how species clusters are created and updated in GTDB. This will take some time as any such improvements need to be applicable across the GTDB in a largely (ideally entirely) automated fashion to allow for yearly updates.

Cheers,

Donovan