Such contamination of sequences is a potential major problem for those dealing with host-associated metagenomic sequences, especially when using read-binning approaches.
Rather than every user who is aware of the problem rolling their own solution, it would be really useful if GTDB provided a “cleaned” download. I expect you will be more nimble than NCBI in implementing such a change.
Thank you for the post. I wasn’t aware of Conterminator.
We are always considering ways to improve genome quality where it may impact the GTDB reference tree or ANI-based species clusters. We are currently prioritizing our resources on phylogenetic, taxonomic, and nomenclatural issues. Our current stance is that the GTDB should not become a genome repository beyond providing data directly relevant to the GTDB taxonomy, given that serving genomes is a primary role of the INSDC repositories (NCBI, EBI, DDBJ) and that other “value added” genome repositories already exist (e.g. IMG/M, MGnify).
I fully take your point, though, that a “cleaned” version of genomes (especially given the wealth of MAGs) would be appreciated by the community. Unfortunately, I don’t think we are in a position to take this on in the short to medium term, and we would probably only do so in the long term if it became clear that no more appropriate research group or existing genome repository was interested in providing this service.
This is really helpful to understand, and I totally get that you cannot do everything with limited time and resources. I assume NCBI are aware of the issue, but I wonder if they will take it on. Perhaps I’ll ask!