Hey,
I’m working with gtdb_proteins_aa_reps.tar.gz and am curious: is there an easy way to get a taxonomy label for the genome each protein is from? Or would this involve downloading the corresponding genomes and running them through GTDB-Tk? Or am I misunderstanding something about this data?
Thanks!
Hello,
You should be able to map the protein file names from gtdb_proteins_aa_reps to the taxonomy files ar53_taxonomy_<release_number>.tsv and bac120_taxonomy_<release_number>.tsv provided on the download page.
Regards,
Pierre
Thanks Pierre. That’s what I thought, but it doesn’t seem to be the case. For example, see the IDs below. At least some of these are MAGs/incomplete, which may explain it?
BADC01000283.1
LJZT01000005.1
LDXQ01000011.1
LSKJ01000330.1
LWRO01000005.1
MEMN01000233.1
MEMX01000041.1
MEOK01000033.1
MESE01000094.1
MHEA01000019.1
Hi feargair,
The name of the file indicates the genome (e.g. GB_GCA_021813265.1_protein.faa.gz has the proteins for the genome GB_GCA_021813265.1). You can look up the taxonomic classification for this genome in the taxonomy files flagged by Pierre above. The names of the proteins in this file come from the genomic FASTA file for this genome, which was obtained from the NCBI Assembly database.
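As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the taxonomy files are two-column, tab-separated (genome accession, then lineage string) and that protein files end in _protein.faa.gz or _protein.faa. The function names and the demo lineage are hypothetical placeholders, not part of any GTDB tool or the real classification.

```python
import csv
import io
from pathlib import Path

# Assumed file-name endings for per-genome protein files.
PROTEIN_SUFFIXES = ("_protein.faa.gz", "_protein.faa")


def genome_id_from_protein_file(path):
    """Return the genome accession encoded in a protein file name."""
    name = Path(path).name
    for suffix in PROTEIN_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"unrecognised protein file name: {name}")


def load_taxonomy(handle):
    """Read a taxonomy TSV (assumed: accession <TAB> lineage) into a dict."""
    taxonomy = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            taxonomy[row[0]] = row[1]
    return taxonomy


# Demo with an in-memory stand-in for a taxonomy file.
# The lineage below is a placeholder, NOT the real classification.
demo_tsv = io.StringIO("GB_GCA_021813265.1\td__Placeholder;p__Placeholder\n")
taxonomy = load_taxonomy(demo_tsv)

genome_id = genome_id_from_protein_file("GB_GCA_021813265.1_protein.faa.gz")
print(genome_id, taxonomy.get(genome_id, "not found"))
```

For real data, open the ar53 and bac120 taxonomy files in turn, pass each handle to load_taxonomy, and merge the two dictionaries before looking up genome IDs.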
Cheers,
Donovan