How can I classify my own genomes with the GTDB?
What is the GTDB versioning scheme?
The GTDB version indicates both the GTDB and RefSeq release numbers. For example, R05-RS95 designates the fifth release of the GTDB and indicates reference genomes were obtained from RefSeq release 95.
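As a small illustration, a version label of this form can be split into its two release numbers. This is a minimal sketch that assumes labels always follow the `R<GTDB release>-RS<RefSeq release>` pattern shown in the example; the function name is hypothetical.

```python
import re


def parse_gtdb_version(version: str) -> tuple[int, int]:
    """Split a label such as 'R05-RS95' into (GTDB release, RefSeq release)."""
    # Assumed pattern: 'R', digits, '-RS', digits (taken from the example above).
    match = re.fullmatch(r"R(\d+)-RS(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised GTDB version label: {version!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_gtdb_version("R05-RS95"))  # (5, 95)
```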
Why has the suffix of phyla names been changed to -ota?
This is based on a Whitman et al. (2018) proposal to normalise the suffix of the rank of phylum as is done with other ranks. See the Microbiology Society website.
Why are some genus names formed from a strain identifier?
A strain identifier is used as a placeholder for the genus name when there is no existing genus name and no binomially named representative genome. For example, the genome GCF_000318095.2 has the NCBI organism name Prevotella sp. oral taxon 473 str. F0040 and is assigned to the genus Alloprevotella in NCBI.
However, in GTDB this genome is not assigned to Prevotella, Alloprevotella, or the closely related genus Prevotellamassilia. Consequently, we assign it to the placeholder genus g__F0040. If the organism had been assigned a binomial species name such as Prevotella oralitaxus str. F0040 and was not part of the true Prevotella in GTDB, we would assign it to the placeholder genus g__Prevotella_A to indicate that it is not a true Prevotella species, but that there are representative genomes that have been assigned to a species.
Why do some genus and species names end with an alphabetic suffix?
Genus names ending with an alphabetic suffix indicate genera that are i) polyphyletic according to the current GTDB reference tree, or ii) subdivided based on taxonomic rank normalisation according to the current GTDB reference tree.
Species names end with an alphabetic suffix if the GTDB species cluster is (or was previously) associated with a species name, but the correct application of this name is ambiguous or the name is assigned to a different GTDB species cluster based on the presence of type material or via majority voting.
The lineage or species cluster containing the nomenclature type or, in case of species, satisfying the majority vote criteria retains the unsuffixed name and all other lineages/clusters are given alphabetic suffixes, indicating that they are placeholder names that need to be replaced in due course. A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
Why do some family and higher rank names end with an alphabetic suffix?
Taxon names above the rank of genus appended with an alphabetic suffix indicate groups that fall into one of the following categories: i) groups that are not monophyletic in the GTDB reference tree, but for which there exists alternative evidence that they are monophyletic groups; ii) groups whose placement is unstable between releases.
A best effort is made to retain the same alphabetical suffix for a taxon between GTDB releases, but this is not guaranteed.
What criteria are used to select genomes for inclusion in the GTDB?
Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:
- CheckM completeness estimate >50%
- CheckM contamination estimate <10%
- quality score, defined as completeness - 5*contamination, >50
- contain >40% of the bac120 or ar122 marker genes
- contain <1000 contigs
- have an N50 >5kb
- contain <100,000 ambiguous bases
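The criteria above can be expressed as a single predicate. The following is a minimal sketch, not the GTDB pipeline itself: the `GenomeStats` record and its field names are hypothetical, and it assumes the statistics (CheckM estimates, marker percentage, assembly metrics) have already been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Hypothetical container for precomputed per-genome statistics."""
    completeness: float     # CheckM completeness estimate (%)
    contamination: float    # CheckM contamination estimate (%)
    marker_percent: float   # % of bac120 or ar122 marker genes present
    contig_count: int
    n50: int                # N50 in base pairs
    ambiguous_bases: int


def passes_gtdb_qc(g: GenomeStats) -> bool:
    """Apply the listed inclusion criteria using strict inequalities."""
    quality = g.completeness - 5 * g.contamination
    return (
        g.completeness > 50
        and g.contamination < 10
        and quality > 50
        and g.marker_percent > 40
        and g.contig_count < 1000
        and g.n50 > 5_000          # 5 kb
        and g.ambiguous_bases < 100_000
    )
```

Note that the quality score can exclude a genome that passes the completeness and contamination thresholds individually: a genome that is 90% complete with 9% contamination has a quality score of 45 and is rejected.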
How are the bacterial and archaeal multiple sequence alignments constructed?
Bacterial and archaeal multiple sequence alignments (MSAs) are formed from the concatenation of 120 or 122 phylogenetically informative markers, respectively. These marker sets are referred to as bac120 and ar122, respectively, and comprise proteins or protein domains specified in the Pfam v27 or TIGRFAMs v15.0 databases. Details on these markers are available for download (here). Gene calling is performed with Prodigal v2.6.3, and markers are identified and aligned using HMMER v3.1b1. Columns in the MSA with >50% gaps or with a single amino acid spanning <25% or >95% of taxa are removed. To reduce computational requirements, 42 amino acids per marker are randomly selected from the remaining columns to produce MSAs of ~5,000 columns. The final masks applied to the concatenated MSAs are available for download (here), and the identical filtering approach is implemented in GTDB-Tk.
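The column filtering described above can be sketched for a single marker alignment. This is an illustrative approximation rather than the GTDB or GTDB-Tk implementation: it assumes that "a single amino acid" refers to the most common residue in a column, that frequencies are computed over all taxa (gaps included in the denominator), and that gaps are written as `-`. The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def keep_column(column: str, max_gap: float = 0.5,
                min_cons: float = 0.25, max_cons: float = 0.95) -> bool:
    """Return True if an alignment column survives the gap and consensus filters."""
    n_taxa = len(column)
    if column.count("-") / n_taxa > max_gap:
        return False  # more than 50% gaps
    residues = Counter(c for c in column if c != "-")
    if not residues:
        return False
    # Fraction of taxa carrying the most common amino acid (assumed interpretation).
    top_fraction = residues.most_common(1)[0][1] / n_taxa
    return min_cons <= top_fraction <= max_cons


def select_columns(alignment: list[str], per_marker: int = 42,
                   seed: int = 0) -> list[int]:
    """Filter one marker's columns, then randomly keep at most `per_marker`."""
    n_cols = len(alignment[0])
    kept = [i for i in range(n_cols)
            if keep_column("".join(seq[i] for seq in alignment))]
    if len(kept) > per_marker:
        kept = sorted(random.Random(seed).sample(kept, per_marker))
    return kept


toy = ["AAC-", "AGC-", "ATCA", "ACCA"]
print(select_columns(toy))  # [1, 3]
```

In the toy alignment, columns 0 and 2 are invariant (above the 95% limit) and are dropped, while columns 1 and 3 pass. With 42 columns retained for each of 120 markers, the concatenated bacterial alignment has at most 5,040 columns, consistent with the ~5,000 stated above.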
How are the bacterial and archaeal reference trees inferred?
Bacterial and archaeal reference trees are inferred from the filtered bac120 and ar122 multiple sequence alignments, respectively. Reference trees contain 1 genome per GTDB species cluster. The bacterial reference tree is inferred with FastTree v2.1.10 under the WAG model. The archaeal reference tree is inferred with IQ-Tree v1.6.9 under the PMSF model, a rapid approximation of the C10 mixture model (LG+C10+F+G), using FastTree v2.1.10 to infer an initial guide tree. Both trees contain non-parametric bootstrap support values.
How are GTDB species clusters formed?
The full methodology used to establish species clusters is described in:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
Briefly, species clusters are formed as follows:
- Identify a GTDB representative genome for each validly or effectively published species with one or more genomes passing quality control. In most cases this will be a genome sequenced from the type strain of the species. When this is not possible, the representative genome is selected based on its quality and with consideration to additional metadata (e.g., NCBI reference or representative genome, genome assembled from type strain of subspecies).
- Assign genomes to selected GTDB representative genomes using average nucleotide identity (ANI) and alignment fraction (AF) criteria. GTDB uses an ANI circumscription radius of 95%, though permits this to be as high as 97% in order to retain a larger number of existing species names. Species with an ANI >97% are synonyms within the GTDB. Species assignments use an AF of 65%. ANI and AF values are calculated with FastANI v1.1.
- Remaining genomes are formed into de novo species clusters using a greedy clustering approach that emphasizes selecting representative genomes of high quality. This clustering consists of 3 steps: i) sort remaining genomes by their estimated genome quality, ii) select the highest-quality genome to form a new species cluster, and iii) assign genomes to this species cluster using the ANI and AF criteria. These steps are repeated until all genomes have been assigned to a species.
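The greedy de novo clustering in the last step can be sketched as follows. This is a simplified illustration under stated assumptions: pairwise ANI and AF values are supplied as a precomputed lookup (for example, parsed from FastANI output), thresholds are applied as "greater than or equal to", and a fixed 95% ANI radius is used. The function and parameter names are hypothetical and the published method includes details not shown here.

```python
def greedy_species_clusters(quality, ani_af, ani_min=95.0, af_min=65.0):
    """Cluster genomes greedily, seeding each cluster with the best remaining genome.

    quality: dict mapping genome ID -> estimated genome quality
    ani_af:  dict mapping (genome_a, genome_b) -> (ANI %, AF %)
    Returns a dict mapping representative ID -> list of member IDs.
    """
    def same_species(a, b):
        # Look up the pair in either orientation; missing pairs never match.
        ani, af = ani_af.get((a, b)) or ani_af.get((b, a)) or (0.0, 0.0)
        return ani >= ani_min and af >= af_min

    # i) sort remaining genomes by estimated quality, best first
    remaining = sorted(quality, key=quality.get, reverse=True)
    clusters = {}
    while remaining:
        # ii) the highest-quality genome seeds a new species cluster
        rep = remaining.pop(0)
        # iii) assign genomes meeting the ANI and AF criteria to this cluster
        members = [g for g in remaining if same_species(rep, g)]
        clusters[rep] = [rep] + members
        remaining = [g for g in remaining if g not in members]
    return clusters


quality = {"g1": 90.0, "g2": 80.0, "g3": 70.0}
ani_af = {("g1", "g2"): (97.0, 80.0), ("g1", "g3"): (90.0, 70.0)}
print(greedy_species_clusters(quality, ani_af))
# {'g1': ['g1', 'g2'], 'g3': ['g3']}
```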
How are placeholder genus names formed?
An internal node representing a genus without any descendant genomes with validly or effectively published genus names is assigned a placeholder name. This placeholder genus name is generally derived from the oldest representative genome within the lineage and formed, in priority order, from the:
- NCBI organism name,
- NCBI infraspecific/strain ID,
- NCBI WGS identifier, or
- NCBI genome assembly ID
Many of these placeholder names have been automatically generated with manual inspection used to modify names to more suitable, human-readable names when appropriate.
How is the specific name of novel GTDB species clusters formed?
GTDB species clusters without any validly or effectively published specific name are assigned a placeholder name which is formed from the NCBI accession number of the GTDB representative genome of the species. For example, if GCF_000192635.1 is the representative genome of a species cluster within the genus Agrobacterium the cluster will be named Agrobacterium sp000192635. Representative genomes of a GTDB species cluster are updated between releases when genomes of sufficiently higher quality become available, but placeholder names are not updated as preference is given to the stability of names. As a consequence, the placeholder name of GTDB species clusters may not reflect the current representative genome.
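The naming pattern in the example can be reproduced with a short helper. This sketch assumes NCBI assembly accessions of the form `GCA_`/`GCF_` followed by nine digits and a version number; the function name is hypothetical.

```python
import re


def placeholder_species_name(genus: str, accession: str) -> str:
    """Build a placeholder species name from a representative's accession."""
    # Assumed accession format: GCA_ or GCF_, nine digits, '.', version number.
    match = re.fullmatch(r"GC[AF]_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f"{genus} sp{match.group(1)}"


print(placeholder_species_name("Agrobacterium", "GCF_000192635.1"))
# Agrobacterium sp000192635
```

Because placeholder names are kept stable between releases, applying this helper to the current representative of a cluster will not always reproduce the cluster's published name.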
How are the number of taxa at each rank counted?
Each taxon at the rank of species or genus is counted, including those with an alphabetic suffix. For ranks higher than genus, suffixed names are collapsed and counted once (e.g. Firmicutes, Firmicutes_A, Firmicutes_B, ... are counted as a single phylum).
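The counting rule can be illustrated with a short function. This is a sketch only: it assumes an alphabetic suffix is an underscore followed by one or more uppercase letters at the end of the name, and the function name and rank labels are hypothetical.

```python
import re


def count_taxa(names: list[str], rank: str) -> int:
    """Count distinct taxa, collapsing alphabetic suffixes above the rank of genus."""
    if rank in ("genus", "species"):
        return len(set(names))  # suffixed names count as separate taxa
    # Assumed suffix pattern: '_' plus uppercase letters, e.g. 'Firmicutes_A'.
    return len({re.sub(r"_[A-Z]+$", "", name) for name in names})


phyla = ["Firmicutes", "Firmicutes_A", "Firmicutes_B", "Proteobacteria"]
print(count_taxa(phyla, "phylum"))                         # 2
print(count_taxa(["Prevotella", "Prevotella_A"], "genus"))  # 2
```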
How are GTDB species representatives updated with each release?
Each GTDB species is defined by a single representative genome, and species assignments are established by considering the ANI and AF to these representative genomes (Parks et al., Nature Biotechnology, 2020). Species representatives are re-evaluated each GTDB release with an emphasis placed on retaining representatives so they can serve as effective nomenclatural type material. However, the goal of stable representatives must be balanced with the desire to use high-quality genomes as representatives, the incorporation of changing taxonomic opinion, and identified errors in genome classification or assembly.
GTDB representatives are updated according to two primary principles: i) representatives should be assembled from the type strain of a species whenever possible, and ii) representatives should only be replaced by assemblies of suitably higher overall quality. These two principles are quantitatively defined by the balanced ANI score (BAS), which is 0.5 * (ANI score) + 0.5 * (quality score), where the ANI score is 100 - 20 * (100 - ANI to current representative) and the quality score is defined by the criteria given in Table 1. An existing representative is only replaced if the candidate genome has a BAS at least 10 higher than the BAS of the current representative. Intuitively, the BAS achieves the goal of stable representatives by requiring a new representative to be of increasingly higher quality (as defined by the quality score) the more dissimilar it is from the current representative (as defined by the ANI score).
Representatives are also updated to account for genome assemblies being removed from NCBI and representatives are updated whenever the underlying assembly is updated at NCBI.
Table 1. Criteria used to establish quality score of an assembly
|Criterion|Score|
|---|---|
|Type species of genome|100,000|
|Effective type strain of species according to NCBI|10,000|
|NCBI representative of species|1,000|
|CheckM quality estimate|completeness - 5*contamination|
|MAG or SAG|-100|
|Contig count|-5 * (no. contigs/100)|
|Undetermined bases|-5 * (no. undetermined bases/10,000)|
|Full length 16S rRNA gene|10|
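The balanced ANI score and the Table 1 quality score can be worked through in code. This is a minimal sketch under stated assumptions: the `Assembly` record and its field names are hypothetical, the Table 1 terms are assumed to be additive, and the current representative is assumed to have an ANI of 100 to itself.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical record holding the inputs to the Table 1 quality score."""
    completeness: float
    contamination: float
    contigs: int
    undetermined_bases: int
    is_type: bool = False             # first row of Table 1
    is_ncbi_effective_type: bool = False
    is_ncbi_representative: bool = False
    is_mag_or_sag: bool = False
    has_full_16s: bool = False


def quality_score(a: Assembly) -> float:
    """Sum the Table 1 criteria (assumed to be additive)."""
    score = a.completeness - 5 * a.contamination
    score += 100_000 if a.is_type else 0
    score += 10_000 if a.is_ncbi_effective_type else 0
    score += 1_000 if a.is_ncbi_representative else 0
    score -= 100 if a.is_mag_or_sag else 0
    score -= 5 * (a.contigs / 100)
    score -= 5 * (a.undetermined_bases / 10_000)
    score += 10 if a.has_full_16s else 0
    return score


def balanced_ani_score(a: Assembly, ani_to_current: float) -> float:
    """BAS = 0.5 * (ANI score) + 0.5 * (quality score)."""
    ani_score = 100 - 20 * (100 - ani_to_current)
    return 0.5 * ani_score + 0.5 * quality_score(a)


def should_replace(current: Assembly, candidate: Assembly, ani: float) -> bool:
    """Replace only if the candidate's BAS is at least 10 above the current BAS."""
    return balanced_ani_score(candidate, ani) >= balanced_ani_score(current, 100.0) + 10
```

For example, a current representative with 95% completeness, 1% contamination, and 200 contigs has a quality score of 80 and a BAS of 90. A candidate at 99% ANI with 100% completeness, 100 contigs, and a full-length 16S rRNA gene has a quality score of 105 but an ANI score of only 80, giving a BAS of 92.5, so it does not displace the current representative.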
How are the names of GTDB species clusters updated with each release?
The names assigned to GTDB species clusters are re-evaluated each GTDB release with an emphasis placed on nomenclature stability.
However, names are changed in some cases to reflect changes in taxonomic opinions and/or to correct identified errors in GTDB or NCBI assignments. Species clusters containing one or more genomes assembled from the type strain of a species are named after the species with nomenclatural priority (Parker et al., 2019), with the generic and specific names changed as necessary to reflect any genus-level reclassifications in the GTDB. Species names identified as synonyms are provided as a separate file in the GTDB repository and updated each release.
Species clusters without a type strain genome are assigned via a majority voting approach based on NCBI species assignments regarded as correct under the GTDB framework.
A genome is considered to have an erroneous NCBI species assignment if a genome assembled from the type strain of this species exists and resides in a different GTDB species cluster. A cluster is assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI name are from a single species and >50% of all genomes with this species classification are in the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin suffixed placeholder name.
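The majority voting rule can be sketched as a function of two inputs. This is an illustration under stated assumptions: the caller is assumed to have already discarded erroneous NCBI assignments, so only GTDB-validated names are passed in, and the function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_vote_name(cluster_names: list[str],
                       total_per_species: dict[str, int]) -> Optional[str]:
    """Return the species name supported by majority voting, or None.

    cluster_names:     validated NCBI species name of each named genome in the cluster
    total_per_species: number of genomes with each validated species name across
                       all clusters
    """
    if not cluster_names:
        return None
    species, in_cluster = Counter(cluster_names).most_common(1)[0]
    # Condition 1: >50% of the cluster's validated genomes share this name.
    majority_in_cluster = in_cluster / len(cluster_names) > 0.5
    # Condition 2: >50% of all genomes with this name are in this cluster.
    majority_of_species = in_cluster / total_per_species[species] > 0.5
    return species if majority_in_cluster and majority_of_species else None


names = ["Bacillus subtilis"] * 3 + ["Bacillus velezensis"]
print(majority_vote_name(names, {"Bacillus subtilis": 4, "Bacillus velezensis": 10}))
# Bacillus subtilis
```

A return value of `None` corresponds to the case where the cluster receives a placeholder name instead.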
In order to maximize the stability of GTDB names, placeholder names are not updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa) even if an updated placeholder name might better reflect the current classification of genomes within a cluster.
Species clusters containing an assembly from the type strain of a subspecies or a subspecies satisfying the majority voting criteria will have the subspecies name promoted to the specific name of the cluster in cases where a placeholder name would otherwise be required.
Oren A, et al. (2015). Proposal to include the rank of phylum in the international code of nomenclature of prokaryotes. Int J Syst Evol Microbiol 65, 4284-4287.
Parker CT, et al. (2019). International Code of Nomenclature of Prokaryotes. Int J Syst Evol Microbiol 69, S1-S111.
Whitman WB, et al. (2018). Proposal of the suffix -ota to denote phyla. Addendum to 'Proposal to include the rank of phylum in the International Code of Nomenclature of Prokaryotes'. Int J Syst Evol Microbiol 68, 967-969.