GTDB R08-RS214 has been released and consists of 402,709 genomes organized into 85,205 species clusters. Additional statistics for this release are available on the GTDB Statistics page.
This release introduces the following changes:
We thank Jan Mares for his help in curating the Cyanobacteria
Fixed an issue with SSU files where sequences started 2 bp after the correct start and stopped 1 bp after the correct end of the sequence. Thanks to CX for bringing this issue to our attention: 16S, 23S and ssu_all_r207 - #2 by donovan.parks
SSU files now provide sequences in their 5′ to 3′ orientation
Changed QC criterion to use ar53 instead of ar122 marker set. The impact of this change was
evaluated on the 353,569 genomes (~6,100 archaeal) considered for GTDB R207:
- only 1 additional genome passed QC
- only 21 additional genomes failed QC, which included the following species representatives:
  - s__Methanoregula sp002497485
  - s__Methanobrevibacter_A sp017634055
  - s__Methanosphaera sp003266165
  - s__MGIIa-L1 sp002688825
  - s__MGIIb-N2 sp002503665
  - s__MGIIa-L2 sp002692685
  - s__MGIIb-O3 sp002730445
  - s__DTDI01 sp011334935
  - s__Methanosphaera sp017652595
  - s__Nitrosopelagicus sp902606945
  - s__Methanolinea sp002501965
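For anyone mixing these SSU files with older files or their own sequences, putting a minus-strand sequence into 5′ to 3′ orientation amounts to reverse-complementing it. The following is only a minimal sketch of that operation, not GTDB's actual pipeline; the `strand` label and the function names are hypothetical.

```python
# Complement table covering the IUPAC nucleotide codes, upper and lower case.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_five_prime_orientation(seq, strand):
    """Return seq in 5' to 3' orientation.

    `strand` is a hypothetical label: "-" means the sequence was extracted
    from the reverse strand and must be reverse-complemented.
    """
    return reverse_complement(seq) if strand == "-" else seq


forward = to_five_prime_orientation("ATGC", "+")
flipped = to_five_prime_orientation("ATGC", "-")
```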
Have there been any other changes to the processing of the 16S dataset apart from putting all sequences in the same orientation and the very small change in the start and stop positions? Has there, for example, been some culling of short sequences or those with ambiguities?
Despite a larger number of 16S sequences in R08-RS214 compared to the previous release, I'm seeing a small reduction in the proportion of my Archaea and Bacteria OTU sequences which have matches in GTDB.
No changes have been made to the 16S processing outside those you have indicated. The set of genomes covered by GTDB does vary slightly between releases, as genomes can be flagged as deprecated at NCBI. However, this isn't likely a large enough change to explain what you are seeing. Are you able to provide us with the 16S sequence from the previous release that was previously matching one of your OTUs?
I was wondering whether the FASTA files for all 402,709 genomes in RS214 are available to download? I can find the 85,205 representative FASTA files on the data website but not the complete set. If they are not available, is there a list of the GCA/GCF accessions of the genomes?
We do not provide all genomic FASTA files. We have made the decision to leave this to the INSDC (NCBI, EMBL-EBI, DDBJ) resources. You can get a list of all genomes covered by GTDB in the taxonomy files:
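As a rough illustration, the accession list can be pulled out of a taxonomy file with a few lines of Python. This sketch assumes the usual layout of the GTDB taxonomy TSV files: two tab-separated columns (genome identifier, taxonomy string), with the identifier carrying an `RS_` (RefSeq) or `GB_` (GenBank) prefix. The two example rows are made up, and the function name is hypothetical.

```python
import io


def read_gtdb_accessions(handle):
    """Yield (accession, taxonomy) pairs from a GTDB-style taxonomy TSV.

    Assumes each line is "<genome id><TAB><taxonomy string>", where the
    genome id is an NCBI accession prefixed with "RS_" or "GB_".
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        genome_id, taxonomy = line.split("\t", 1)
        # Drop the database prefix to recover the plain GCF_/GCA_ accession.
        if genome_id.startswith(("RS_", "GB_")):
            genome_id = genome_id[3:]
        yield genome_id, taxonomy


# Made-up rows for illustration only; a real file would be opened with open().
example = io.StringIO(
    "RS_GCF_000000001.1\td__Bacteria;p__P1;c__C1;o__O1;f__F1;g__G1;s__G1 example\n"
    "GB_GCA_000000002.1\td__Archaea;p__P2;c__C2;o__O2;f__F2;g__G2;s__G2 example\n"
)
accessions = [acc for acc, _ in read_gtdb_accessions(example)]
```

The resulting accessions can then be passed to whichever INSDC download tool you prefer.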
That is really helpful.
I've got to the bottom of my issue, which results from the 16S sequences containing a mix of full-length and partial 16S sequences. A hit to a newly added sequence that has greater coverage of a query can have a lower percentage similarity over the matched length but be a better match when judged by its E-value, because it is longer. Quality of the best match judged by E-values should improve monotonically with each release of the database, but the best match can show a reduction in percentage sequence similarity from one release to the next.
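To make the effect concrete, here is a small sketch using the Karlin-Altschul relation between alignment score and E-value. Everything numeric in it is an illustrative assumption: the +1/-2 match/mismatch scoring, the λ = 1.28 and K = 0.46 parameters, the alignment lengths and identities, and the database size. Gaps are ignored, and this is not a reimplementation of BLAST.

```python
import math

# Assumed Karlin-Altschul parameters for a +1/-2 nucleotide scoring scheme.
LAMBDA = 1.28
K = 0.46


def raw_score(aligned_len, pct_identity, match=1, mismatch=-2):
    """Ungapped raw score for an alignment of given length and identity."""
    matches = round(aligned_len * pct_identity / 100)
    mismatches = aligned_len - matches
    return matches * match + mismatches * mismatch


def bit_score(score):
    """Convert a raw score into a bit score."""
    return (LAMBDA * score - math.log(K)) / math.log(2)


def log10_evalue(score, query_len, db_len):
    """log10 of E = query_len * db_len * 2**(-bit score).

    Working in log space avoids floating-point underflow for strong hits.
    """
    return (
        math.log10(query_len)
        + math.log10(db_len)
        - bit_score(score) * math.log10(2)
    )


QUERY_LEN = 1400   # hypothetical near full-length query
DB_LEN = 10**8     # hypothetical total database length in bp

partial_hit = raw_score(500, 99.0)    # partial 16S subject, higher identity
full_hit = raw_score(1400, 97.0)      # full-length subject, lower identity

partial_log_e = log10_evalue(partial_hit, QUERY_LEN, DB_LEN)
full_log_e = log10_evalue(full_hit, QUERY_LEN, DB_LEN)
```

Under these assumptions the full-length hit has the lower percentage identity but the much smaller E-value. Note that the E-value also scales with database size, so the bit score is the size-independent quantity to compare across releases.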
Alastair