Hi,
I was wondering if the GTDB can provide its output from skani in the auxiliary files? Even with skani’s memory efficiency, all-to-all comparisons between genomes require extensive compute resources that most people simply won’t have. It would be really nice to see how that underlying data contributes to the representative selection process at the whole database level.
Thanks!
Bryan
Hi Bryan,
We never perform an all-to-all comparison between all GTDB genomes. As you indicated, this is computationally expensive. We mitigate this by only calculating the ANI values required to establish the GTDB species clusters. The largest set of comparisons we end up performing is between the selected GTDB species representative genomes and the remaining genomes. For R226, this is a comparison between the 143,614 GTDB species representatives and the 588,861 remaining genomes. These results are tuned for our purposes in the sense that we run skani with --min-af set to 60 and -s set to 85 (i.e. limiting results to those with relatively high alignment fraction and percent identity). I’m happy to provide these results, but they are unsuitable for most applications and thus we have elected not to make them generally available.
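For a rough sense of scale, the R226 counts above can be turned into pair counts. This is only a back-of-the-envelope sketch: the variable names are illustrative, and both figures are upper bounds on candidate pairs, since the -s screen should discard many pairs before any alignment is attempted.

```python
# Genome counts quoted above for GTDB R226.
n_reps = 143_614        # GTDB species representatives
n_remaining = 588_861   # all other genomes
n_total = n_reps + n_remaining

# Candidate pairs when every remaining genome is compared to every representative.
rep_vs_rest = n_reps * n_remaining

# Unordered pairs in a full all-to-all comparison of the same genomes.
all_vs_all = n_total * (n_total - 1) // 2

print(f"representative vs. remaining pairs: {rep_vs_rest:,}")
print(f"all-to-all unordered pairs:         {all_vs_all:,}")
print(f"ratio: {all_vs_all / rep_vs_rest:.2f}")
```

Under these assumptions the representative-versus-remaining comparison covers roughly a third of the pairs an all-to-all run would.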
Cheers,
Donovan