GTDB release95 genome name prefixes

Hi Everyone,

I am reaching out with a question about the file gtdb_genomes_reps_r95.tar.gz from GTDB release95.

I would like to look up the taxonomy of some GTDB genomes from gtdb_genomes_reps_r95.tar.gz in the associated bacteria and archaea taxonomy files (e.g. bac120_taxonomy_r95.tsv). I noticed that every name in the taxonomy files carries a prefix (GB_* or RS_*) that the corresponding genome name in gtdb_genomes_reps_r95.tar.gz does not have.

If I remove these prefixes, will I be able to match up all the genome names from gtdb_genomes_reps_r95.tar.gz to bac120_taxonomy_r95.tsv? I assume they do match but just wanted to double-check. Also, what do the RS_* and GB_* prefixes indicate?

Thank you very much for your time and this invaluable resource.

Cheers,
Matt

Yes, that has worked for us with release 89.

GB refers to GenBank
RS refers to RefSeq

and there are some UBA ones in there too, right? At least in R89 there were.

(I have some parsing scripts in Python if you are interested :joy:)

As of GTDB R95, all genomes in the GTDB are in the NCBI Assembly database. No more UBA genomes! The RS and GB prefixes indicate RefSeq and GenBank, respectively, as noted by ctb. The prefixes are there for future-proofing in case we start to accept genomes from other repositories.
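
For anyone who wants to try the matching described in this thread, here is a minimal Python sketch rather than an official GTDB tool. It assumes the taxonomy file is tab-separated with the prefixed accession in the first column and the taxonomy string in the second, and that each genome file name contains an NCBI assembly accession of the form GCA_/GCF_ followed by digits and a version (e.g. GCF_000005845.2_genomic.fna.gz). The function names are made up for illustration, so check both assumptions against your own copy of the files.

```python
import re

# Prefixes as described in this thread: RS_ = RefSeq, GB_ = GenBank.
PREFIX_RE = re.compile(r"^(?:RS|GB)_")
# Assumed accession pattern, e.g. GCF_000005845.2 or GCA_000005845.2.
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")


def strip_prefix(gtdb_id):
    """Drop a leading RS_ or GB_ from a GTDB genome identifier."""
    return PREFIX_RE.sub("", gtdb_id, count=1)


def load_taxonomy(lines):
    """Map un-prefixed accession -> taxonomy string.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes two tab-separated columns: identifier, taxonomy.
    """
    taxonomy = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        taxonomy[strip_prefix(fields[0])] = fields[1]
    return taxonomy


def accession_from_filename(name):
    """Return the first accession-like substring in a file name, or None."""
    match = ACCESSION_RE.search(name)
    return match.group(0) if match else None


def match_genomes(filenames, taxonomy):
    """Split file names into those with a taxonomy entry and those without."""
    matched, unmatched = {}, []
    for name in filenames:
        accession = accession_from_filename(name)
        if accession in taxonomy:
            matched[name] = taxonomy[accession]
        else:
            unmatched.append(name)
    return matched, unmatched
```

To use it, open bac120_taxonomy_r95.tsv, pass the handle to load_taxonomy, and pass the list of genome file names to match_genomes. Anything that ends up in the unmatched list is worth checking by hand, for example against the archaeal taxonomy file.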


Glad to hear it and thanks for the info!

I appreciate the offer but this is a great opportunity to improve my Python skills :wink: