Hello, I found that GTDB released a new version named r226. Many old species have disappeared, even though some new species have been added. I checked the completeness and contamination of some discarded representative genomes and, if I’m not mistaken, r226 calculates the genome quality score based on CheckM2 v1.0.2 results, whereas earlier versions (e.g., r220) used CheckM 1 results? However, I did not find any assessment of the necessity and potential consequences of changing tools in the update log, and the database update has indeed affected our ongoing research.
It should also be noted that CheckM2 v1.0.2 has a problem: the genome quality score changes with the number of input genomes (Got different results of the same genome from the different runs of checkm2 · Issue #103 · chklovski/CheckM2 · GitHub), and the scores of some genomes also differ between CheckM2 v1.1.0 and v1.0.2 (The huge genome quality score differences between version 1.0.2 and 1.1.0, or checkm 1. · Issue #142 · chklovski/CheckM2 · GitHub).
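For anyone who wants to check how much their own genomes are affected, here is a minimal sketch that compares two CheckM2 reports (e.g., one from v1.0.2 and one from v1.1.0). It assumes the reports are tab-separated files with `Name`, `Completeness`, and `Contamination` columns, as in CheckM2's `quality_report.tsv`, and it assumes the quality score is completeness - 5 × contamination. The function names and the `min_delta` threshold are hypothetical and only for illustration.

```python
import csv
import io


def quality_score(completeness, contamination):
    # Assumed definition: completeness - 5 * contamination.
    return completeness - 5.0 * contamination


def read_report(handle):
    # Parse a tab-separated report into {genome name: quality score}.
    # Assumes the columns Name, Completeness and Contamination are present.
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Name"]: quality_score(
            float(row["Completeness"]), float(row["Contamination"])
        )
        for row in reader
    }


def compare_reports(old_scores, new_scores, min_delta=1.0):
    # List genomes present in both reports whose score differs by at
    # least min_delta (a hypothetical threshold), sorted by name.
    shared = sorted(set(old_scores) & set(new_scores))
    return [
        (name, old_scores[name], new_scores[name])
        for name in shared
        if abs(new_scores[name] - old_scores[name]) >= min_delta
    ]


# Tiny made-up example standing in for two real report files.
run_a = "Name\tCompleteness\tContamination\ngenomeA\t95.0\t1.0\ngenomeB\t80.0\t2.0\n"
run_b = "Name\tCompleteness\tContamination\ngenomeA\t95.0\t1.0\ngenomeB\t70.0\t4.0\n"

changed = compare_reports(
    read_report(io.StringIO(run_a)), read_report(io.StringIO(run_b))
)
for name, old, new in changed:
    print(f"{name}: {old:.1f} -> {new:.1f}")
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the two report files.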
Would you consider maintaining two versions of the database, one stable and one variable? I know GTDB has the ‘Taxon history’ function, but some useful modules are only available for the latest version (e.g., the ‘Taxonomy tree’ function). At the very least, the staff should fully evaluate the consequences of tool changes and inform users. I wouldn’t even have known that CheckM 1 had been replaced with CheckM2 if I hadn’t checked completeness and contamination.
Most importantly, is it OK to stick to a specific version of CheckM (e.g., CheckM2 v1.0.2) and of the database (e.g., GTDB r220) rather than chasing the latest versions? I am concerned that our previous results may be questioned by reviewers.
Sincerely.
Hi Liping Qu,
Thank you for your message. You can find the release notes for R226 on both the GTDB FTP site and on this forum:
The R226 release notes indicate that we are now using CheckM v1 and v2 in order to establish which genomes should be included in GTDB.
We do perform an assessment of any methodological changes before implementing them in GTDB. These are described either in the release notes or in manuscripts. Incorporation of CheckM v2 will be discussed in a NAR database manuscript we are currently preparing. In short, adoption of CheckM v2 quality estimates and the exception for genomes with <10 contigs resulted in 12,214 (11.1% increase) additional genomes failing QC and only 178 (0.023% increase) additional genomes passing QC, demonstrating that these changes largely result in more stringent QC.
We appreciate that changing QC standards does result in less stability in GTDB. However, we have to balance this with reflecting best practices in the community. CheckM v2 was published nearly 2 years ago at this point and appears to have been widely adopted, as evidenced by its citation count. For these reasons, we determined that it was suitable to start using CheckM v2 genome quality estimates as part of the GTDB QC.
We plan to use the latest versions of CheckM v2 as they become available. This would include any bug fixes and updates to its ML models.
I hope this clarifies our decision to incorporate CheckM v2 into the GTDB QC process.
Regards,
Donovan
Thanks a lot for your kind and timely reply. Sorry for the earlier misunderstanding: I had only checked the webpage and had not read the instruction file in the download directory. I have now read the update log and found the information I wanted.
In fact, I’m working on a study that used GTDB r220 and CheckM2 v1.0.2; the data analysis is almost finished and the manuscript is being written. However, the results and conclusions may change (e.g., the taxonomy of some MAGs, the number of new species) if I switch to r226. I would like to ask your advice: is it necessary to re-analyze our data using GTDB-Tk v2.4.1 together with r226? In addition, is the CheckM2 version used by GTDB r226 v1.0.2 or v1.1.0, given that the quality score of the same genome may differ between CheckM2 v1.0.2 and v1.1.0?
Sincerely,
Liping Qu
Hi Liping Qu,
GTDB and many other genomic resources are updated every year, so it is generally hard to complete a bioinformatic analysis and write a paper before these resources are updated. I wouldn’t worry about this; reviewers will generally understand that it takes time to run analyses, interpret results, and write the manuscript. The exception would be if your manuscript is predominantly focused on proposing new taxon names. In that case, I think it would be prudent to check that the new names haven’t already been proposed.
Cheers,
Donovan