GTDB R207 has been released and consists of 317,542 genomes organized into 65,703 species clusters. Additional statistics for this release are available on the GTDB Statistics page.
This release introduces the following changes:
The archaeal marker set has been updated from 122 markers (ar.122) to 53 markers (ar.53). The 53 markers are a subset of the “top-ranked marker proteins” from a recent evaluation based on minimizing horizontal gene transfer and optimizing the recovery of monophyletic lineages (Dombrowski et al., 2020). This change was made to improve the marker coverage of archaeal lineages and the robustness of the genome phylogeny.
Shigella species have been reclassified as heterotypic synonyms of E. coli (Parks et al., 2021).
The alignment fraction (AF) criterion used when forming GTDB species clusters has been changed from 65% to 50%. This change was made to accommodate the growing number of MAGs within the GTDB, many of which are incomplete. Comparing partial genomes can result in artificially low AF values. This change was evaluated on R202, where 274 species representatives would have been merged if the AF criterion had been lowered to 50%. These genome pairs had an average ANI of 97.5% and, as expected, an average completeness of only 71%.
To facilitate the AF change, the 548 species clusters with representatives that had an ANI >=95% and an AF meeting the reduced 50% criterion were all disbanded in R207. This allows these genomes to form new de novo genome clusters that follow the 50% AF criterion while ensuring selection of the most appropriate genomes to act as species representatives.
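The effect of lowering the AF criterion can be illustrated with a minimal sketch. It only checks whether a pair of genomes passes the ANI >=95% and AF thresholds stated above; the function and constant names are hypothetical, and this is not the GTDB clustering pipeline itself, which involves more than a single pairwise test.

```python
# Thresholds taken from the release notes; all names here are illustrative.
ANI_THRESHOLD = 95.0       # percent
AF_THRESHOLD_OLD = 65.0    # percent, criterion before R207
AF_THRESHOLD_R207 = 50.0   # percent, criterion from R207 onward


def meets_species_criteria(ani: float, af: float,
                           af_threshold: float = AF_THRESHOLD_R207) -> bool:
    """Return True if a genome pair passes both the ANI and AF thresholds."""
    return ani >= ANI_THRESHOLD and af >= af_threshold


# A hypothetical pair of partial genomes: high ANI, but an AF of only 58%.
ani, af = 97.5, 58.0
same_species_old = meets_species_criteria(ani, af, AF_THRESHOLD_OLD)
same_species_new = meets_species_criteria(ani, af, AF_THRESHOLD_R207)
```

In this made-up example the pair fails the old 65% criterion but passes the new 50% criterion, which is the kind of case the change is intended to capture.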
Added NCBI metadata regarding frameshifted proteins as an additional criterion for determining the score of a genome assembly. This score is used to determine which genome to use as a GTDB representative (Parks et al., 2022):
- Specifically, the score of a genome is reduced by 25 if the NCBI metadata flags it as having “many frameshifted proteins”.
- This change was tested on R202 and resulted in 172 representatives having a reduced assembly quality score. However, only 43 representatives were replaced with a new representative considered to be of better quality. These 43 new representatives had an average ANI of 99.3% and AF of 94.7% to the previous representative.
- This change was made in response to the GTDB Forum post by Florian Plaza Oñate.
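The frameshift penalty described above can be sketched as follows. Only the penalty of 25 comes from this announcement; the `Genome` fields, the base scores, and the rule of picking the highest-scoring genome are simplifying assumptions for illustration and do not reproduce the full GTDB scoring scheme.

```python
from dataclasses import dataclass

FRAMESHIFT_PENALTY = 25  # value stated in the release notes


@dataclass
class Genome:
    name: str                 # hypothetical identifier
    base_score: float         # hypothetical score before the penalty
    many_frameshifted: bool   # assumed flag derived from NCBI metadata


def assembly_score(genome: Genome) -> float:
    """Apply the frameshift penalty to a genome's base score."""
    if genome.many_frameshifted:
        return genome.base_score - FRAMESHIFT_PENALTY
    return genome.base_score


def pick_representative(genomes: list[Genome]) -> Genome:
    """Pick the genome with the highest penalized score (simplified rule)."""
    return max(genomes, key=assembly_score)
```

With made-up inputs, a flagged genome with a base score of 100 would drop to 75 and lose to an unflagged genome scoring 90, which mirrors how a representative can be replaced under this criterion.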