16S, 23S and ssu_all_r207

Hello GTDB team,

First-time user here.

  1. Is there a straightforward way to retrieve full-length 16S and 23S rRNAs from specific taxa from GTDB, including introns? Some taxa appear to have 23S rRNAs identified, but it is unclear how we can access them.

  2. I am interested in rRNA sequences from Archaea. What is the difference between sequences in ssu_all_r207.tar.gz and those in ar53_ssu_reps.tar.gz?

  3. I can extract the sequences from the ssu_all_r207.tar.gz file without problems, but I note discrepancies in the identified positions (and lengths) of the sequences in this file relative to the original sequences on NCBI (example below).

From ssu_all_r207 - the sequence length is 202, which is 1 short of the 203 bp expected for positions 11802-12004:

>GB_GCA_019058055.1~JAHLWG010000214.1 d__Archaea;p__Asgardarchaeota;c__Lokiarchaeia;o__CR-4;f__SOKP01;g__Loki-b32;s__Loki-b32 sp019057865 [location=11802..12004] [ssu_len=202] [contig_len=12006]
GAGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTTACGACTTCTCCCTCCTCGCATACTAGAAACTCGATATGACCAGTCTGACCATACCTCATTTTTAGCACACTCGGATGGAGCGACGGGCGGTGTGTGCAAGGAGCAGAGACGTATTCACCGTGCGATGATGACACACGATTACTAGGGATTCCACGTTCATGT

From NCBI GenBank of the same sequence region based on the annotated positions (MAG: Candidatus Lokiarchaeota archaeon isolate 3H5_20 k141_57026, whol - Nucleotide - NCBI) - the length is 203 as expected:

>JAHLWG010000214.1:11802-12004 MAG: Candidatus Lokiarchaeota archaeon isolate 3H5_20 k141_57026, whole genome shotgun sequence
AGGAGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTTACGACTTCTCCCTCCTCGCATACTAGAA
ACTCGATATGACCAGTCTGACCATACCTCATTTTTAGCACACTCGGATGGAGCGACGGGCGGTGTGTGCA
AGGAGCAGAGACGTATTCACCGTGCGATGATGACACACGATTACTAGGGATTCCACGTTCATG

I greatly appreciate your input and advice.

Many thanks,
CX

Hi CX,

  1. Unfortunately, we do not provide 23S rRNA sequences at this time.
  2. The ssu_all_r207.tar.gz file contains 16S rRNA sequences identified across all genomes in GTDB, while ar53_ssu_reps.tar.gz is restricted to just the archaeal genomes selected as GTDB representatives of a species. The FILE_DESCRIPTIONS file gives more information about what is contained in each file provided on the GTDB FTP site.
  3. We identify 16S rRNA genes de novo and thus our results may differ slightly from those at NCBI.
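
To illustrate point 2, here is a minimal Python sketch that pulls only the archaeal records out of an all-genome SSU FASTA by checking the taxonomy string in each header. It assumes the header layout shown in the example earlier in this thread (identifier, a space, then a taxonomy string starting with `d__Archaea;` or `d__Bacteria;`). The function names and the toy records are made up for illustration. Note that this keeps every archaeal genome, not only species representatives, so the result would still be a superset of what ar53_ssu_reps.tar.gz contains.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def archaeal_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Keep records whose taxonomy string (after the first space) is archaeal."""
    for header, seq in iter_fasta(handle):
        parts = header.split(" ", 1)
        if len(parts) == 2 and parts[1].startswith("d__Archaea;"):
            yield header, seq


# Toy input: identifiers and sequences are placeholders, not real GTDB records.
toy_fasta = io.StringIO(
    ">GB_GCA_000000001.1~CONTIG_A d__Archaea;p__Example [location=1..8] [ssu_len=8] [contig_len=100]\n"
    "ACGTACGT\n"
    ">RS_GCF_000000002.1~CONTIG_B d__Bacteria;p__Example [location=1..8] [ssu_len=8] [contig_len=100]\n"
    "TTTTAAAA\n"
)
kept = list(archaeal_records(toy_fasta))
print(len(kept))  # 1
```

In practice the handle would come from opening the FASTA file extracted from the tarball; the name of that file is not given in this thread, so it is left out here.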

Thank you for pointing out the discrepancy with the GCA_019058055.1 16S fragment. I will need to dig into this to determine why our results differ from those at NCBI. It seems we start the 16S fragment 2 bases later and terminate it 1 base later, and thus are 1 bp shorter.
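
For anyone who wants to screen other records for the same kind of mismatch, the check can be sketched in a few lines of Python. This assumes the header layout of the example record in this thread and that `location` is a 1-based, inclusive range (an assumption, not something the header itself states). `parse_ssu_header` is a hypothetical helper name.

```python
import re

# Pattern for headers shaped like the example in this thread (assumed layout).
HEADER_RE = re.compile(
    r"^>?(?P<genome>[^~\s]+)~(?P<contig>\S+)\s+"
    r"(?P<taxonomy>.*?)\s+"
    r"\[location=(?P<start>\d+)\.\.(?P<end>\d+)\]\s+"
    r"\[ssu_len=(?P<ssu_len>\d+)\]\s+"
    r"\[contig_len=(?P<contig_len>\d+)\]\s*$"
)


def parse_ssu_header(header: str) -> dict:
    """Split a header into genome, contig, taxonomy, and integer coordinates."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    rec = match.groupdict()
    for key in ("start", "end", "ssu_len", "contig_len"):
        rec[key] = int(rec[key])
    return rec


header = (
    ">GB_GCA_019058055.1~JAHLWG010000214.1 "
    "d__Archaea;p__Asgardarchaeota;c__Lokiarchaeia;o__CR-4;f__SOKP01;"
    "g__Loki-b32;s__Loki-b32 sp019057865 "
    "[location=11802..12004] [ssu_len=202] [contig_len=12006]"
)
rec = parse_ssu_header(header)

# Length implied by the coordinates, treating both ends as inclusive.
span = rec["end"] - rec["start"] + 1
print(span, rec["ssu_len"], span - rec["ssu_len"])  # 203 202 1
```

A non-zero difference in the last column flags a record whose stated length and stated coordinates disagree.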

Cheers,
Donovan

Hi CX,

I can confirm that we are incorrectly starting our SSU sequences 2 bp after the correct start and stopping 1 bp after the correct end. This will be fixed for the next GTDB release.
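
To make the offsets concrete, here is a small Python sketch that compares an extracted sequence against a reference region and reports how far the start and end have moved. It is only an illustration of the pattern described above, not the code GTDB uses. `boundary_shifts` is a hypothetical name, and the toy reference is just a short stretch chosen to mimic the example record.

```python
from typing import Optional, Tuple


def boundary_shifts(
    extracted: str, reference: str, max_shift: int = 5
) -> Optional[Tuple[int, int]]:
    """Return (start_shift, end_shift) of `extracted` relative to `reference`.

    Positive values mean the boundary lies after the reference boundary,
    negative values mean before it. Returns None if no shift within
    +/- max_shift makes the overlapping bases agree.
    """
    # Try the smallest shifts first so the simplest explanation wins.
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        if shift >= 0:
            ref_part, ext_part = reference[shift:], extracted
        else:
            ref_part, ext_part = reference, extracted[-shift:]
        n = min(len(ref_part), len(ext_part))
        if n > 0 and ref_part[:n] == ext_part[:n]:
            end_shift = shift + len(extracted) - len(reference)
            return shift, end_shift
    return None


# Toy data: the extracted sequence drops the first 2 reference bases and
# carries 1 extra base past the reference end.
reference = "AGGAGGTGATCCAGCC"
extracted = reference[2:] + "T"
print(boundary_shifts(extracted, reference))  # (2, 1)
```

With a start shift of +2 and an end shift of +1, the extracted sequence comes out 1 bp shorter than the reference, which matches the 202 versus 203 lengths reported above. For real data the sequences should be much longer than `max_shift`, otherwise a short overlap could match by chance.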

Thanks again for pointing out this issue.

Regards,
Donovan

Hi Donovan,

Thanks for your replies and your confirmation. I’ll use the GTDB sequence headers as a guide and retrieve the sequences directly from GenBank, then. I’ll find a way to retrieve the 23S rRNA sequences.
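
In case it helps others, a rough sketch of that retrieval in Python using only the standard library and the NCBI E-utilities `efetch` endpoint, which accepts `seq_start` and `seq_stop` for a sub-region. The helper names are my own, and the accession and coordinates are the ones from the example header above. `fetch_region` needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_region_url(accession: str, start: int, end: int) -> str:
    """Build an efetch URL for a 1-based, inclusive region of a nucleotide record."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": end,
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_region(accession: str, start: int, end: int) -> str:
    """Download the region as FASTA text (requires network access)."""
    with urlopen(efetch_region_url(accession, start, end), timeout=30) as resp:
        return resp.read().decode("utf-8")


# Contig accession and location taken from the GTDB header in this thread.
url = efetch_region_url("JAHLWG010000214.1", 11802, 12004)
print(url)
```

For many records it would be worth adding a pause between requests and an API key, in line with NCBI's usage guidelines.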

Cheers,
CX