Contamination in Pseudomonas sp000955805

Hi, I just want to report contamination in this genome on GTDB. The genome has perfect matches to 16 different RNA sequences in the ERCC RNA spike-in mix. The mix is artificially designed for quantification in RNAseq, so it shouldn’t match any bacterial genome at all, let alone with 16 different perfect matches. This is almost certainly a contamination issue. I came across it when analyzing the bacterial content of an RNAseq sample with ERCC spike-in.

I know that you guys are not responsible for cleaning up genomes, but maybe consider not using this genome as a representative species/strain, or not including it in your next release at all?

Thanks.

Thanks for the heads up. We’ll keep an eye out for this one in the next release.

Hi zxl124,

How much contamination do these 16 different RNA sequences represent? We do accept genomes with a small amount of contamination into GTDB so long as they pass our QC criteria:

We have good evidence that a small amount of contamination does not substantially impact taxonomic classification, so we have generally leaned towards capturing more biodiversity rather than aiming to have only contamination-free genomes:

That said, if the contamination here is substantial (>5% of bps) we can explicitly flag this as a genome that should fail QC.

Thanks,
Donovan
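
For anyone who wants to answer the "how much" question for their own hits, here is a minimal Python sketch of the arithmetic: merge the matched regions per contig, count the covered bases, and compare against the 5% figure mentioned above. The interval format, the function names, and the example numbers are all hypothetical and are not taken from GTDB's QC pipeline; the hit coordinates would come from whatever aligner you used.

```python
from typing import Iterable, Tuple

# Assumed (hypothetical) hit format: (contig name, start, end),
# 0-based, half-open coordinates on the genome.
Interval = Tuple[str, int, int]


def contaminated_bases(hits: Iterable[Interval]) -> int:
    """Count genome bases covered by at least one hit.

    Overlapping hits on the same contig are counted only once.
    """
    total = 0
    current_contig = None
    current_end = 0
    # Sorting groups hits by contig, then orders them by start position.
    for contig, start, end in sorted(hits):
        if contig != current_contig:
            current_contig, current_end = contig, 0
        # Skip the part of this hit that earlier hits already covered.
        start = max(start, current_end)
        if end > start:
            total += end - start
            current_end = end
    return total


def contamination_percent(hits: Iterable[Interval], genome_length: int) -> float:
    """Percentage of the genome's base pairs covered by contaminant hits."""
    return 100.0 * contaminated_bases(hits) / genome_length


def exceeds_threshold(hits: Iterable[Interval], genome_length: int,
                      threshold_percent: float = 5.0) -> bool:
    """True if contamination is above the threshold (5% of bps by default)."""
    return contamination_percent(hits, genome_length) > threshold_percent


# Made-up example: three hits, two of which overlap on contig_1.
example_hits = [
    ("contig_1", 100, 1100),
    ("contig_1", 900, 1500),
    ("contig_2", 0, 500),
]
example_genome_length = 6_000_000  # hypothetical genome size in bp

print(contaminated_bases(example_hits))
print(exceeds_threshold(example_hits, example_genome_length))
```

Merging overlaps before counting matters because two spike-in sequences can hit the same stretch of a contig, and summing raw hit lengths would then overstate the contaminated fraction.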