Taxonomy name changes/mapping between R220 and R226

Qiaofan_Li · November 28, 2025, 8:29am

Dear GTDB Team and Community,

I am currently working on expanding the MetaPhlAn 4 database with custom SGBs. I have annotated my genomes using GTDB-Tk R226. However, the current MetaPhlAn 4 taxonomy is built on GTDB release R220.

To avoid taxonomic splitting due to synonym changes (e.g., the Firmicutes vs. Bacillota transition or other phylum- or class-level reclassifications), I need to harmonize my R226 annotations with the R220 taxonomy structure.

Could you please advise if there is a recommended way to retrieve a list of taxon name changes between Release 220 and Release 226?

Specifically, I am looking for:

A changelog or mapping file that highlights reclassified taxa (renaming, merging, or splitting) between these two versions.

Or, advice on the best practice to “back-map” R226 taxonomies to R220 for backward compatibility.

Thank you very much for your time!

Best regards,
Qiaofan Li

jtclaypool · December 11, 2025, 9:10pm

I think there’s a way to extract it from the API:

curl -X 'GET' \
  'https://gtdb-api.ecogenomic.org/sankey?taxon=s__Ruminococcus_D%20bicirculans&releaseFrom=R80&releaseTo=R226&filterRank=s__' \
  -H 'accept: application/json'

If you get to a solution first, let me know. Otherwise, I’ll try something myself. I think there’s a nuance: the mapping is genome dependent, so a genome could be reclassified to a new species or removed from a species classification.

A visual of the above

jtclaypool · December 15, 2025, 5:18pm

Here is my code for querying the API. I highly recommend a target list of species so you’re not hitting the API forever.

species	R214	R220	R226
Bifidobacterium longum	s__Bifidobacterium longum	s__Bifidobacterium longum	s__Bifidobacterium longum
Bifidobacterium longum	Not Present	Not Present	s__Bifidobacterium infantis

import sys
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def generate_request(species):
    api_url = f'https://gtdb-api.ecogenomic.org/sankey?taxon=s__{species}&releaseFrom=R214&releaseTo=R226&filterRank=s__'
    try:
        # verify=False skips TLS certificate verification; drop it if your environment can validate the certificate
        response = requests.get(api_url, verify=False, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        return response.json()  # Parse the JSON response
    except requests.exceptions.RequestException as e:
        print(f"Error making API call: {e}")
        return None  # Signal a failed request to the caller

def read_csv(csv_path):
    results = {}
    with open(csv_path, "rt") as csv_file:
        csv_file.readline()  # Skip the header row
        for line in csv_file:
            data = line.strip().split(",")
            if len(data) < 9:
                continue  # Skip blank or short lines
            # The species name is in the 9th column; drop any surrounding quotes
            species = data[8].strip('"')
            print(species)
            results[species] = generate_request(species=species)

    return results

def main():
    results = read_csv(sys.argv[1])
    with open('gtdb_changes.tsv', 'w') as changefile:
        changefile.write('species\tR214\tR220\tR226\n')
        for k, v in results.items():
            # A failed API call yields None; skip that species instead of crashing
            if not v:
                continue
            r214 = []
            r220 = []
            r226 = []
            for node in v['nodes']:
                # removeprefix() drops the exact label; lstrip() would strip a character set
                match node['col']:
                    case "Release 214":
                        r214.append(node['name'].removeprefix("R214: "))
                    case "Release 220":
                        r220.append(node['name'].removeprefix("R220: "))
                    case "Release 226":
                        r226.append(node['name'].removeprefix("R226: "))

            # One row per node position; shorter columns are padded with empty strings
            for i in range(max(len(r214), len(r220), len(r226))):
                row = [col[i] if i < len(col) else "" for col in (r214, r220, r226)]
                changefile.write(f'{k}\t' + '\t'.join(row) + '\n')

if __name__ == "__main__":
    main()

I saved the code above as gtdb_species.py and then ran it like this (it needs Python 3.10 or newer because of the match statement):

python gtdb_species.py species_list.csv

My species_list.csv had the species name in the 9th column, stripped of taxonomic prefixes such as s__.
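Splitting each line on commas breaks as soon as a quoted field contains a comma. Here is a minimal sketch of reading that 9th column with Python's csv module instead. The function name read_species_column and its column argument are made up for illustration and are not part of the script above. It also drops duplicate names, which keeps the number of API calls down.

```python
import csv


def read_species_column(path, column=8):
    """Return the unique species names found in one column of a CSV file.

    Assumes the file has a header row. `column` is zero-based, so the
    default of 8 is the 9th column, as in the species_list.csv above.
    """
    species = []
    seen = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) <= column:
                continue  # skip blank or short rows
            name = row[column].strip()
            if name and name not in seen:
                seen.add(name)
                species.append(name)  # keep first-seen order
    return species
```

The returned list could then be passed name by name to generate_request.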

donovan.parks · December 30, 2025, 7:16pm

Hi,

You can get flat TSV files (e.g. bac120_taxonomy_r226.tsv.gz) indicating the classification of each GTDB genome for each release: Index of /public/gtdb/data/releases. This removes the need to use the API and lets you determine how the classification of each genome has changed between releases. There is no clear answer for how individual taxa have changed between releases, since a given taxon can be split, merged, and/or reclassified in complicated ways. In general, it is much easier to take a “genome centric” view of classification than to track the history of a given taxon name. That said, we do have a tool to help visualize how taxon names have changed over time: GTDB - Taxon History.
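As a rough illustration of the genome centric approach, the sketch below loads two taxonomy files and, for every species name in the newer release, counts which older-release species names its genomes carried. It assumes each file has two tab-separated columns (genome accession, then a semicolon-delimited lineage ending in the s__ rank). The function names load_species and back_map are made up for this example.

```python
import gzip
from collections import Counter, defaultdict


def load_species(path):
    """Map genome accession -> species name from a taxonomy TSV.

    Assumes two tab-separated columns: accession, then a lineage such as
    "d__...;p__...;...;s__Genus species". Gzipped files are detected by
    their .gz suffix.
    """
    opener = gzip.open if path.endswith(".gz") else open
    species = {}
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            accession, lineage = line.split("\t")[:2]
            species[accession] = lineage.split(";")[-1]
    return species


def back_map(new_species, old_species):
    """For each species name in the newer release, count the older-release
    species names of the genomes that are present in both releases."""
    votes = defaultdict(Counter)
    for accession, new_name in new_species.items():
        old_name = old_species.get(accession)
        if old_name is not None:  # genome exists in both releases
            votes[new_name][old_name] += 1
    return votes
```

Calling back_map(load_species("bac120_taxonomy_r226.tsv.gz"), load_species("bac120_taxonomy_r220.tsv.gz")) would give, for each R226 species, a Counter of R220 names. A Counter with more than one key points to a merge or a partial reclassification. Species made up only of genomes added after R220 get no entry at all, so they need separate handling.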

Cheers,
Donovan

pirovc · March 30, 2026, 3:49pm

I implemented a simple “genome centric” conversion between GTDB nodes in the MultiTax package. The script is also available in the repo:

$ git clone https://github.com/pirovc/multitax.git
$ multitax/data/gtdb/convert_gtdb_version.py 95 226 "s__Ruminococcus_A sp003011855" "s__Bact-08 sp003520315" "g__JOSHI-001"
95: s__Ruminococcus_A sp003011855 -> 226: s__Oliverpabstia intestinalis
95: s__Bact-08 sp003520315 -> 226: 
95: g__JOSHI-001 -> 226: g__Aquabacterium_A, g__AHLZ01
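The output above shows the three possible outcomes of a genome centric conversion: one target, no target, or several targets. The sketch below illustrates the general idea in plain Python. It is not MultiTax's implementation, and the function name convert_taxon and the toy accessions and lineages are invented for the example.

```python
def convert_taxon(taxon, from_lineages, to_lineages):
    """Return the taxa, at the same rank as `taxon`, that the genomes of
    `taxon` are assigned to in the target release.

    Both arguments map genome accession -> lineage string, e.g.
    "d__Bacteria;...;g__Genus;s__Genus species".
    """
    prefix = taxon[:3]  # rank prefix such as "g__" or "s__"
    targets = set()
    for accession, lineage in from_lineages.items():
        if taxon not in lineage.split(";"):
            continue  # genome is not a member of the source taxon
        target_lineage = to_lineages.get(accession)
        if target_lineage is None:
            continue  # genome is absent from the target release
        for name in target_lineage.split(";"):
            if name.startswith(prefix):
                targets.add(name)
    return sorted(targets)


# Toy data (invented): g__Old is split in two, and g__Gone has no
# genome left in the newer release.
old = {
    "G1": "d__Bacteria;g__Old;s__Old a",
    "G2": "d__Bacteria;g__Old;s__Old b",
    "G3": "d__Bacteria;g__Gone;s__Gone c",
}
new = {
    "G1": "d__Bacteria;g__NewA;s__NewA a",
    "G2": "d__Bacteria;g__NewB;s__NewB b",
}
print(convert_taxon("g__Old", old, new))     # ['g__NewA', 'g__NewB']
print(convert_taxon("s__Gone c", old, new))  # []
```

An empty result means no genome of the source taxon exists in the target release, so there is nothing to anchor a mapping on.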