Taxonomy name changes/mapping between R220 and R226

Dear GTDB Team and Community,

I am currently working on expanding the MetaPhlAn 4 database with custom SGBs. I have annotated my genomes using GTDB-Tk R226. However, the current MetaPhlAn 4 taxonomy is built on GTDB release R220.

To avoid taxonomic splitting due to synonym changes (e.g., handling the Firmicutes vs. Bacillota transition or other Phylum/Class level reclassifications), I need to harmonize my R226 annotations with the R220 taxonomy structure.

Could you please advise if there is a recommended way to retrieve a list of taxon name changes between Release 220 and Release 226?

Specifically, I am looking for:

  1. A changelog or mapping file that highlights reclassified taxa (renaming, merging, or splitting) between these two versions.
  2. Or, advice on the best practice to "back-map" R226 taxonomies to R220 for backward compatibility.

Thank you very much for your time!

Best regards,
Qiaofan Li


I think there’s a way to extract it from the API:

curl -X 'GET' \
  'https://gtdb-api.ecogenomic.org/sankey?taxon=s__Ruminococcus_D%20bicirculans&releaseFrom=R80&releaseTo=R226&filterRank=s__' \
  -H 'accept: application/json'


If you get to a solution first, let me know; otherwise, I’ll try something myself. One nuance is that the mapping is genome dependent: a genome can be reclassified into a new species or removed from a species classification altogether.

A visual of the above

Here is my code for querying the API. I highly recommend a target list of species so you’re not hitting the API forever.

| species | R214 | R220 | R226 |
|---|---|---|---|
| Bifidobacterium longum | s__Bifidobacterium longum | s__Bifidobacterium longum | s__Bifidobacterium longum |
| Bifidobacterium longum | Not Present | Not Present | s__Bifidobacterium infantis |

import csv
import sys
from itertools import zip_longest
from urllib.parse import quote

import requests
import urllib3

# TLS verification is disabled below (verify=False), so silence the resulting warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

API_URL = "https://gtdb-api.ecogenomic.org/sankey"
RELEASES = ("R214", "R220", "R226")


def generate_request(species):
    """Query the sankey endpoint for one species; return the parsed JSON, or None on failure."""
    api_url = (
        f"{API_URL}?taxon=s__{quote(species)}"
        f"&releaseFrom={RELEASES[0]}&releaseTo={RELEASES[-1]}&filterRank=s__"
    )
    try:
        response = requests.get(api_url, verify=False, timeout=60)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        return response.json()  # Parse the JSON response
    except requests.exceptions.RequestException as e:
        print(f"Error making API call for {species}: {e}", file=sys.stderr)
        return None


def read_csv(csv_path):
    """Query the API for each species name in the 9th column of the CSV (header row skipped)."""
    results = {}
    with open(csv_path, "rt", newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # Skip the header row
        for row in reader:
            if len(row) < 9:
                continue  # Row has no 9th column
            species = row[8].strip()
            if species and species not in results:  # Query each species only once
                results[species] = generate_request(species)
    return results


def main():
    results = read_csv(sys.argv[1])
    with open("gtdb_changes.tsv", "w") as changefile:
        changefile.write("species\t" + "\t".join(RELEASES) + "\n")
        for species, data in results.items():
            if not data:
                continue  # The request failed, so there is nothing to write
            # Collect the node names found in each release column
            names = {release: [] for release in RELEASES}
            for node in data.get("nodes", []):
                release = node["col"].replace("Release ", "R")
                if release in names:
                    # removeprefix drops the exact "R214: " label;
                    # lstrip("R214: ") would strip a set of characters instead
                    names[release].append(node["name"].removeprefix(f"{release}: "))
            # One row per node, padding the shorter columns with empty strings
            for row in zip_longest(*names.values(), fillvalue=""):
                changefile.write(species + "\t" + "\t".join(row) + "\n")


if __name__ == "__main__":
    main()

I saved the code above as gtdb_species.py and then ran it like this:

python gtdb_species.py species_list.csv

In my species_list.csv, the 9th column holds the species name with the rank prefix (e.g. s__) stripped.

Hi,

You can get flat TSV files (e.g. bac120_taxonomy_r226.tsv.gz) indicating the classification of each GTDB genome for each release: Index of /public/gtdb/data/releases. This removes the need to use the API. This would let you determine how the classification for each genome has changed between releases. There is no clear answer for how individual taxa have changed between releases since a given taxon can be split, merged, and/or reclassified in complicated ways. In general, it is much easier to take a "genome centric" view of classification instead of trying to track the history of how a given taxon name has changed. That said, we do have a tool to help visualize how taxon names have changed over time: GTDB - Taxon History.

Cheers,
Donovan
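
For anyone wanting to try the genome-centric route, here is a minimal sketch of how the two flat taxonomy files could be compared. It assumes each file is a two-column TSV (genome accession, then a semicolon-separated lineage such as d__...;p__...;s__...), and that an R220 file exists alongside the R226 one with an analogous name. The function names (load_taxonomy, rank_of, back_map) are made up for this example and are not part of any GTDB tool.

```python
import gzip
from collections import Counter, defaultdict


def load_taxonomy(path):
    """Return {genome accession: lineage string} from a two-column TSV (optionally gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    taxonomy = {}
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            accession, lineage = line.split("\t", 1)
            taxonomy[accession] = lineage
    return taxonomy


def rank_of(lineage, prefix="s__"):
    """Return the taxon at the given rank prefix (e.g. 'p__', 's__'), or '' if absent."""
    for taxon in lineage.split(";"):
        if taxon.startswith(prefix):
            return taxon
    return ""


def back_map(old, new, prefix="s__"):
    """For each taxon name in the new release, count the old-release names
    carried by the genomes that are present in both releases."""
    votes = defaultdict(Counter)
    for accession, new_lineage in new.items():
        old_lineage = old.get(accession)
        if old_lineage is None:
            continue  # Genome is new in this release, so it has no old name
        votes[rank_of(new_lineage, prefix)][rank_of(old_lineage, prefix)] += 1
    return votes


# Hypothetical usage (file names assumed, adjust to what you downloaded):
#   r220 = load_taxonomy("bac120_taxonomy_r220.tsv.gz")
#   r226 = load_taxonomy("bac120_taxonomy_r226.tsv.gz")
#   for new_name, old_names in back_map(r220, r226, prefix="s__").items():
#       if set(old_names) != {new_name}:
#           print(new_name, dict(old_names), sep="\t")
```

A new name whose genomes all came from one old name looks like a plain rename. A new name with several old names suggests a merge, and an old name appearing under several new names suggests a split. Those cases probably need a manual decision rather than an automatic back-map. Passing prefix="p__" gives the same comparison at phylum level.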