I was wondering how the quality standard for ambiguous nucleotides in scaffolds of genomes included in the GTDB is applied, and what could cause a genome to slip past it. Specifically, this entry has 1.7 million Ns in it: GTDB - GCA_003029985.1
I found out when the genome broke some work on the representative species set using anvio, and thought it’d be worth mentioning. The number of ambiguous bases listed is N/A, so that might be why it slipped through the <100000 ambiguous bases threshold?
Are you familiar with or working with this genome assembly, or did this case fall out of other work you were doing? It looks to me like the submitter knew the genome was Acidovorax avenae and had relatively low-coverage sequencing, resulting in a poor assembly. I’m guessing the assembled contigs were then mapped to a trusted A. avenae assembly, with the missing data indicated by N’s. Based on our pipeline this genome still should have failed QC, but it is an interesting case and I’m just wondering if you have any knowledge about this genome.
Nope, unfortunately I don’t. I keep a local copy of the GTDB species representatives as anvio dbs, and was updating that set after the release of anvio v7. anvio dbs are created from the nucleotide seqs, and that (at least the way I do it) relies on gene prediction using prodigal, and a prodigal-predicted protein of >100K amino acids broke hmmer 3.3.1. Following the breadcrumbs of failure led me to that genome.
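For anyone hitting the same failure, here is a minimal sketch of one possible workaround: set aside over-long predicted proteins before they reach the HMM search. This is not how anvio handles it. The names `read_fasta`, `split_by_length` and `MAX_PROTEIN_LENGTH` are made up for illustration, and the 100,000 cutoff is an assumption based on the failure described above.

```python
# Hypothetical cutoff, based on the ">100K amino acids" failure described above.
MAX_PROTEIN_LENGTH = 100_000


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, max_length=MAX_PROTEIN_LENGTH):
    """Split records into (kept, skipped) by protein length.

    A trailing '*' (stop-codon marker), if present, is not counted.
    """
    kept, skipped = [], []
    for header, seq in records:
        if len(seq.rstrip("*")) <= max_length:
            kept.append((header, seq))
        else:
            skipped.append((header, seq))
    return kept, skipped


# Tiny demo with a deliberately small cutoff.
demo = [">short", "MKT", "AA*", ">long", "M" * 12]
kept, skipped = split_by_length(read_fasta(demo), max_length=10)
print([h for h, _ in kept], [h for h, _ in skipped])
```

Logging the skipped headers would also point straight at the offending genome instead of leaving a trail of breadcrumbs.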
Seems like poor decision making on the submitter’s part, but that’s all I know.
This issue comes down to what one means by ambiguous bases. Unfortunately, the field has adopted the practice of using N’s to stitch contigs into scaffolds. It is common (though not universal) practice to use 10 or more N’s to indicate where contigs have been stitched together. We don’t consider these ambiguous bases for the purposes of QC’ing genomes in the GTDB. So, under this definition, GCA_003029985.1 actually has zero ambiguous bases and a lot of N’s for stitching together contigs. As such, this genome passes QC, and probably should pass QC, since apart from being incomplete the data is likely reliable. Fair to say it is a less than ideal genome assembly, but we have to balance filtering out genomes against providing GTDB taxonomy strings to as many genomes as possible.
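To make the distinction concrete, here is a rough sketch of how runs of N could be split into "stitching" gaps and ambiguous bases. It is an illustration only, not the actual GTDB QC code. The function name `classify_n_runs` and the constant `MIN_GAP_RUN` are hypothetical, and the cutoff of 10 is an assumption taken from the convention mentioned above. It also looks only at N and ignores other IUPAC ambiguity codes.

```python
import re

# Assumed cutoff: runs of this many N's or more are treated as scaffold gaps.
MIN_GAP_RUN = 10


def classify_n_runs(sequence, min_gap_run=MIN_GAP_RUN):
    """Return (gap_bases, ambiguous_bases) for one scaffold sequence.

    N's in runs of at least `min_gap_run` count as gap bases;
    N's in shorter runs count as ambiguous bases.
    """
    gap_bases = 0
    ambiguous_bases = 0
    for match in re.finditer(r"N+", sequence.upper()):
        run_length = match.end() - match.start()
        if run_length >= min_gap_run:
            gap_bases += run_length
        else:
            ambiguous_bases += run_length
    return gap_bases, ambiguous_bases


# Two long runs (25 and 10 N's) and one short run (2 N's).
scaffold = "ACGT" + "N" * 25 + "ACGNNTT" + "N" * 10 + "GG"
print(classify_n_runs(scaffold))  # (35, 2)
```

Under a rule like this, an assembly whose N's all sit in long runs reports zero ambiguous bases no matter how many N's it contains, which matches the case discussed here.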
Fair enough, I understand the trade-off. The issue will likely be addressed on the anvio end as well, so it won’t be a hassle going forward. For future users, is adding the percentage or number of Ns in the assembly as metadata possible, or does that add unwanted computational overhead to the process?
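For what it’s worth, the count itself looks like a single linear pass over the sequences, as in the sketch below. I can’t speak to what it would cost inside the GTDB pipeline, and the function name `n_content` is just a placeholder.

```python
def n_content(sequences):
    """Return (n_count, percent_n) over an iterable of sequence strings."""
    total_bases = 0
    n_count = 0
    for seq in sequences:
        total_bases += len(seq)
        n_count += seq.upper().count("N")
    # Guard against an empty assembly to avoid dividing by zero.
    percent_n = 100.0 * n_count / total_bases if total_bases else 0.0
    return n_count, percent_n


print(n_content(["ACGTNN", "NNAA"]))  # (4, 40.0)
```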