Scaffolds with lots of N nucleotides?

dspeth · January 13, 2021, 5:21am

Hi all,

I was wondering how the quality standard for ambiguous nucleotides in scaffolds of genomes included in the GTDB is applied, and what could cause a genome to slip past it. Specifically, this entry has 1.7 million Ns in it: GTDB - GCA_003029985.1

I found out when the genome broke some work on the representative species set using anvio, and thought it’d be worth mentioning. The number of ambiguous bases listed is N/A, so that might be why it slipped through the <100000 ambiguous bases threshold?

Thanks!
Daan

donovan.parks · January 13, 2021, 11:18pm

Hi Daan,

Thank you for bringing this to our attention. It does look like an issue on our end. You can find the GTDB QC criteria at: https://gtdb.ecogenomic.org/faq#gtdb_selection_criteria

We toss genomes with >100,000 ambiguous bases, though clearly something has gone wrong in this case. We are looking into it now.

Thanks,
Donovan

donovan.parks · January 13, 2021, 11:51pm

Hi Daan,

Are you familiar with or working with this genome assembly, or did this case fall out of other work you were doing? It looks to me like the submitter knew the genome was Acidovorax avenae and had relatively low coverage sequencing, resulting in a poor assembly. I’m guessing the assembled contigs were then mapped to a trusted A. avenae assembly, with the missing data indicated by N’s. Based on our pipeline this genome still should have failed QC, but it is an interesting case and I’m just wondering if you have any knowledge about this genome.

Cheers,
Donovan

dspeth · January 14, 2021, 12:12am

Hi Donovan,

Nope, I don’t unfortunately. I keep a local copy of the GTDB species representatives as anvio dbs, and was updating that set after the release of anvio v7. anvio dbs are created from the nucleotide seqs, and that (at least the way I do it) relies on gene prediction using prodigal, and a prodigal-predicted protein of >100K amino acids broke hmmer 3.3.1. Following the breadcrumbs of failure led me to that genome.
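For anyone who wants to catch this kind of thing before it reaches hmmer, here is a rough sketch of a pre-check on a protein FASTA. It is not part of anvio, prodigal or hmmer: the function names are made up for illustration, and the 100,000 residue cutoff is simply the figure mentioned above, not a documented limit.

```python
MAX_PROTEIN_LEN = 100_000  # cutoff taken from the post above, an assumption


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def oversized_proteins(lines, max_len=MAX_PROTEIN_LEN):
    """Return (sequence_id, length) for every protein longer than max_len."""
    hits = []
    for header, seq in read_fasta(lines):
        # Protein FASTA often ends each sequence with '*' for the stop codon.
        length = len(seq.rstrip("*"))
        if length > max_len:
            # Keep only the first token of the header as the sequence id.
            seq_id = (header.split() or [""])[0]
            hits.append((seq_id, length))
    return hits
```

Called as `oversized_proteins(open("proteins.faa"))` (file name hypothetical), it returns the offenders so they can be inspected or dropped before the HMM search.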

Seems like poor decision making on the submitter’s part, but that’s all I know.

sorry I can’t help you further
Daan

donovan.parks · January 14, 2021, 1:08am

Hi Daan,

This issue comes down to what one means by ambiguous bases. Unfortunately, the field has adopted the practice of using N’s to stitch contigs into scaffolds. It is common (though not universal) practice to use 10 or more N’s to indicate when you are stitching together contigs. We don’t consider these ambiguous bases for the purposes of QC’ing genomes in the GTDB. So, under this definition, GCA_003029985.1 actually has zero ambiguous bases and a lot of N’s for stitching together contigs. As such, this genome passes QC and probably should pass QC since, besides being incomplete, the data is likely reliable. Fair to say it is a less than ideal genome assembly, but we have to balance filtering out genomes and providing GTDB taxonomy strings to as many genomes as possible.
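To make the distinction concrete, here is a minimal sketch of how N’s could be split into scaffold gaps and ambiguous bases. This is an illustration only, not the GTDB pipeline’s code: the function name is hypothetical, it assumes that any run of 10 or more N’s is a gap and any shorter run is ambiguous, and it ignores other IUPAC ambiguity codes.

```python
import re

# Assumption: runs of at least 10 consecutive N's are scaffold gaps;
# shorter runs are treated as ambiguous bases.
MIN_GAP_RUN = 10


def classify_ns(seq, min_gap_run=MIN_GAP_RUN):
    """Return (gap_bases, ambiguous_bases) for one nucleotide sequence."""
    gap = ambiguous = 0
    for match in re.finditer(r"N+", seq.upper()):
        run = len(match.group())
        if run >= min_gap_run:
            gap += run        # long run: contigs stitched into a scaffold
        else:
            ambiguous += run  # short run: uncertain base calls
    return gap, ambiguous


example = "ACGT" + "N" * 25 + "ACNGT" + "NNN" + "AC"
print(classify_ns(example))  # (25, 4)
```

Under this kind of rule, a scaffold padded with very long N runs reports a large gap total but zero ambiguous bases.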

Cheers,
Donovan

dspeth · January 14, 2021, 7:43am

fair enough, I understand the trade off. The issue will likely be addressed on the anvio end as well, so it won’t be a hassle going forward. For future users, is adding percentage or number of Ns in the assembly as metadata possible, or does that add unwanted computational overhead to the process?

donovan.parks · January 14, 2021, 3:00pm

Hi Daan. I think you are after the “Total Gap Length” which is taken directly from NCBI.

dspeth · January 15, 2021, 5:27am

ah, excellent I did overlook that

edit: For others reading this later, there are 9 genomes in r95 with more than 1Mbp total gap length, so I tripped over a unicorn
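A count like this can be reproduced from a tab-separated metadata table with something along these lines. Treat it as a sketch: the column names `accession` and `ncbi_total_gap_length` are assumptions about the metadata file and should be checked against the actual header, and the toy rows below are made up.

```python
import csv
import io


def genomes_with_large_gaps(handle, min_gap=1_000_000,
                            id_col="accession",
                            gap_col="ncbi_total_gap_length"):
    """Return (genome_id, gap_length) for rows whose gap length exceeds min_gap."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            gap = int(row[gap_col])
        except (KeyError, TypeError, ValueError):
            continue  # column missing or value not numeric (e.g. "none")
        if gap > min_gap:
            hits.append((row.get(id_col, ""), gap))
    return hits


# Made-up rows, only to show the expected input shape.
toy = io.StringIO(
    "accession\tncbi_total_gap_length\n"
    "genome_A\t1700000\n"
    "genome_B\t250\n"
    "genome_C\tnone\n"
)
print(genomes_with_large_gaps(toy))  # [('genome_A', 1700000)]
```

For a real file, pass an open file handle instead of the `io.StringIO` object.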