Your metadata files list the ncbi_translation_table, but the value is often “none”, even for species representatives, and in other cases the value NCBI gives is wrong. I’m trying to use a consensus from taxonomic relatives to assign all translation tables, but I’m not sure it will work (usage is mixed within groups because of the NCBI mistakes). You’ve translated all representatives, so what do you do when the table value is “none”? Do you just use table 11 as the default and accept possibly truncated proteins? Or do you have a method for determining the translation table, and if so, could you add that field to your metadata or sp_clusters files?
Just to give an example of a mistake at NCBI, they call assembly GCA_012031035.1 a Gracilibacteria species with translation table 4, but according to GTDB it is the representative for a Hydrococcus species in cyanobacteria, so should be table 11. I haven’t checked whether the proteins you’re using for its treeing are shortened at UGA codons.
OK, now I’m seeing the new Codetta paper from Sean Eddy’s lab, which has this table of inferred genetic codes: https://cdn.elifesciences.org/articles/71402/elife-71402-supp1-v3.csv. It covers 246599 bacterial assemblies and 4972 archaeal assemblies.
The Shulgina/Eddy data treated 42093 species and 11679 genera of GTDB release 202, summarized as follows.
If these preliminary adjustments to GTDB taxonomy are allowed:
- from g__UBA7642, split off new g__UBA4682 (sp900549915,sp002404995,sp002297045)
- from g__UBA4855, split off new g__UBA6877 (sp002451465)
- from o__Mycoplasmatales, split off new o__UBA3375 (f__UBA3375, which is basal to the rest and doesn’t have the recoding tRNA)
Then all tested genera use the standard code, except:
UGA>W : o__Mycoplasmatales, g__Zinderia, g__Stammera
UGA>G : o__BD1-5
AGG>M : g__UBA4682
CGG>Q : g__Peptacetobacter
CGG>W : g__Anaerococcus, g__UBA4855
UGA>G,CGA>W,CGG>W : o__Absconditabacterales
These rules left 31 Codetta-untreated species with suspected non-standard codes. These were all validated either directly by finding the recoding tRNAs (not always expected to be found in these incomplete genomes) or indirectly by being in genera with validated members. Of the 15 untreated Mycoplasmatales spp., 2 were missing the recoding tRNA. Of the 7 untreated BD1-5 spp., 4 were missing the recoding tRNA. Of the 9 untreated Absconditabacterales spp., 3 were missing the UGA-recoding tRNA. All of the species that were not directly validated were in genera with validated members.
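For anyone who wants to apply the exceptions listed above programmatically, here is a minimal Python sketch of them as a lookup keyed on GTDB rank labels. The function name `recodings_for`, the dictionary layout, and the example lineages are my own assumptions for illustration, and the lookup only makes sense if the lineage already reflects the three preliminary splits (g__UBA4682, g__UBA6877, o__UBA3375).

```python
# Codon reassignments by taxon, copied from the list above.
# Assumes the adjusted taxonomy (the three splits) has been applied.
RECODINGS = {
    "o__Mycoplasmatales": {"UGA": "W"},
    "g__Zinderia": {"UGA": "W"},
    "g__Stammera": {"UGA": "W"},
    "o__BD1-5": {"UGA": "G"},
    "g__UBA4682": {"AGG": "M"},
    "g__Peptacetobacter": {"CGG": "Q"},
    "g__Anaerococcus": {"CGG": "W"},
    "g__UBA4855": {"CGG": "W"},
    "o__Absconditabacterales": {"UGA": "G", "CGA": "W", "CGG": "W"},
}


def recodings_for(lineage: str) -> dict:
    """Return the codon reassignments for a semicolon-separated lineage.

    An empty dict means the standard code. The lineage format
    ("d__...;p__...;...;g__...;s__...") is an assumption of this sketch.
    """
    for rank in lineage.split(";"):
        hit = RECODINGS.get(rank.strip())
        if hit is not None:
            return dict(hit)  # copy so callers cannot mutate the table
    return {}


# Illustrative, partial lineages (placeholder family/genus names):
print(recodings_for("o__BD1-5;f__example;g__example"))
print(recodings_for("o__UBA3375;f__UBA3375;g__example"))
```

The second call returns an empty dict, because o__UBA3375 is treated as a separate order that keeps the standard code.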
We use the translation table specified at NCBI when given. In cases where NCBI does not indicate a translation table we use the following heuristic:
- infer proteins using Prodigal with translation tables 4 and 11
- use table 4 if the coding density is >=5% higher under table 4 and the coding density is >=70% under table 4
- otherwise, use table 11
I realize this is far from perfect since it doesn’t distinguish between tables 4 and 25, but it is pragmatic since I don’t know of a computationally efficient method for predicting translation tables. It is certainly an area we would like to explore more.
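The decision rule above can be sketched in a few lines of Python. This is only an illustration, not the actual pipeline code: the function names are hypothetical, the coding densities are assumed to be percentages computed from the two Prodigal runs, and "5% higher" is read here as 5 percentage points (an assumption, since a relative increase is another possible reading).

```python
def coding_density(coding_bases: int, genome_length: int) -> float:
    """Percentage of the genome covered by predicted coding sequence."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return 100.0 * coding_bases / genome_length


def choose_translation_table(
    density_t4: float,
    density_t11: float,
    min_gain: float = 5.0,      # assumed to mean percentage points
    min_density: float = 70.0,  # minimum coding density under table 4
) -> int:
    """Pick table 4 or 11 from coding densities (in percent).

    Table 4 is chosen only if it raises the coding density by at least
    min_gain and the table 4 density is at least min_density.
    Otherwise the default is table 11.
    """
    if density_t4 - density_t11 >= min_gain and density_t4 >= min_density:
        return 4
    return 11


# Example with made-up densities:
print(choose_translation_table(80.0, 74.0))  # gain of 6 points, density 80
print(choose_translation_table(76.0, 74.0))  # gain of only 2 points
```

As noted, a rule of this shape cannot separate table 4 from table 25, since both read UGA as a sense codon and would give similar coding densities.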