First of all, thank you very much for GTDB! We are big fans here!
I couldn’t easily find this information, so I thought I’d try the forums! I was wondering whether GTDB includes MAGs from large-scale metagenomic assembly projects that focus on characterizing unknown diversity and that don’t seem to be depositing their sequences in RefSeq. I imagine that GTDB does not include anything that is not in RefSeq (but I’m not 100% sure, so please correct me if I am wrong). If they aren’t in GTDB, are there any plans to include them?
I am thinking of two studies in particular:
Pasolli et al. (2019) Cell 176:649–662 in which ~9500 metagenomes were assembled into ~150,000 MAGs corresponding to ~75% “unknown” species from repositories. The paper mentions that they are available on this website.
Almeida et al. (2019) Nature 568:499–504. Similarly, >92,000 MAGs were deposited (in the ENA repository), but I’m not sure if they are on RefSeq.
GTDB includes genomes from the RefSeq and GenBank Assembly databases. RefSeq is NCBI’s curated set of genome assemblies and generally excludes all MAGs and the majority (if not all) of SAGs.
Currently, we do not source genomes outside of these two databases. We encourage researchers running large-scale metagenomic assembly projects to submit at least representative genomes to the GenBank Assembly database (or the EMBL-EBI/DDBJ equivalent). Many researchers are doing this. For example, just over 1,000 MAGs from the “Unified Human Gastrointestinal Genome catalogue” were submitted to NCBI under BioProject PRJEB33885, representative MAGs from the Nayfach et al. manuscript were submitted under BioProject PRJEB31003, and representative MAGs from the Almeida et al. study are under BioProject PRJEB26432.
Unfortunately, GTDB R05-RS95 is based on genome assemblies in RefSeq/GenBank release 95, which was released on July 12, 2019, so it does not yet contain the MAGs from these studies.
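For anyone who wants to check a given release themselves, here is a rough sketch of counting genomes per BioProject in a tab-separated metadata table. It assumes the table has a column named `ncbi_bioproject`; that column name, the helper `count_genomes_by_bioproject`, and the toy rows are illustrative assumptions, so adjust them to match the metadata file you actually download.

```python
import csv
import io


def count_genomes_by_bioproject(metadata_tsv, bioprojects, column="ncbi_bioproject"):
    """Count rows per BioProject in a tab-separated metadata table.

    metadata_tsv: an open text file (or any iterable of lines) with a header row.
    bioprojects:  BioProject accessions to look for.
    column:       name of the BioProject column (an assumption; check your file).
    """
    counts = {bp: 0 for bp in bioprojects}
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    for row in reader:
        bp = row.get(column, "")
        if bp in counts:
            counts[bp] += 1
    return counts


# Toy table with made-up placeholder accessions, only to show the call.
toy_table = io.StringIO(
    "accession\tncbi_bioproject\n"
    "placeholder_genome_1\tPRJEB26432\n"
    "placeholder_genome_2\tPRJEB26432\n"
    "placeholder_genome_3\tPRJNA000000\n"
)

print(count_genomes_by_bioproject(toy_table, ["PRJEB26432", "PRJEB33885"]))
```

With a real metadata file, pass `open(path, newline="")` instead of the toy table. A count of zero for a BioProject would suggest that its assemblies are not in that release.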
I like the clarity and simplicity of basing GTDB on RefSeq/GenBank releases, so I agree with the decision to limit it to those genomes. But it’s a shame for so many MAGs to be left out, so the question is: how can we get them into GenBank? Should we be emailing the authors to nag them to make a BioProject? Maybe as an incentive the GTDB team can send a free ‘Make Taxonomy Great Again’ hat to any researcher who contributes 1,000 genomes or more to GenBank.