Why do many GTDB representative genome pairs have ANI >95%?

Each representative genome is supposed to be the only genome chosen from its species, so any two representatives should have ANI < 95%, which is the species boundary. But I found many GTDB representative genome pairs with ANI >95%, some even >99%.

Here is part of the list (generated using GTDB R214 and fastANI):
|Genome1|Genome2|ANI|Mapped fragments|Total query fragments|
|GCA_017399555.1|GCA_017411665.1|99.6286|123|640|
|GCA_021851985.1|GCA_021803425.1|99.5409|114|229|
|GCA_021803425.1|GCA_021851985.1|99.2152|113|279|
|GCA_021836125.1|GCA_021807125.1|99.1863|104|247|
|GCA_002348425.1|GCA_002727395.1|99.1817|191|539|
|GCA_002690875.1|GCA_002171275.2|99.1221|58|419|
|GCA_021808875.1|GCA_021804785.1|99.0221|393|902|
|GCA_021804195.1|GCA_021831565.1|98.8212|201|676|
|GCA_021844725.1|GCA_021803345.1|98.737|397|938|
|GCA_902510755.1|GCA_902512585.1|98.7229|159|372|
|GCA_902512585.1|GCA_902510755.1|98.7166|159|334|
|GCA_021817415.1|GCA_023379555.1|98.6906|183|633|
|GCA_021785915.1|GCA_023379465.1|98.683|57|283|
|GCA_003519645.1|GCA_002410715.1|98.6693|376|800|
|GCA_002716645.1|GCA_009392775.1|98.3346|158|331|
|GCA_021791875.1|GCA_021799405.1|98.1231|618|1307|
|GCA_021799405.1|GCA_021846945.1|98.1058|761|1699|
|GCA_021799405.1|GCA_021791875.1|98.0913|623|1699|
|GCA_016777745.1|GCA_905182145.1|98.0772|118|333|
|GCA_021819225.1|GCA_023381355.1|98.0532|170|358|
|GCA_023381095.1|GCA_023488565.1|98.0014|159|408|
|GCF_018854775.1|GCA_021840785.1|97.8831|259|681|
|GCA_023488565.1|GCA_023381095.1|97.8819|160|546|
|GCA_023487645.1|GCA_021786105.1|97.8677|133|373|
|GCA_014801235.1|GCA_014801525.1|97.8168|215|568|
|GCA_021820245.1|GCA_021847855.1|97.7669|172|472|
|GCA_905182145.1|GCA_016777745.1|97.7347|121|244|
|GCA_021846945.1|GCA_021799405.1|97.6849|771|1590|
|GCA_021831565.1|GCA_021804195.1|97.6817|205|895|
|GCA_023474385.1|GCA_021793035.1|97.6804|63|148|
|GCA_002712565.1|GCA_002731735.1|97.6798|296|598|
|GCA_021810735.1|GCA_021791875.1|97.5543|354|1090|
|GCA_902564345.1|GCA_002716165.1|97.5511|194|558|
|GCA_900769635.1|GCA_900557675.1|97.5172|88|289|
|GCA_905181965.1|GCA_000383735.1|97.4201|269|867|
|GCA_021811505.1|GCA_023488885.1|97.3398|211|454|
|GCA_021793035.1|GCA_023474385.1|97.2654|66|175|
|GCA_021825785.1|GCA_021840255.1|97.2574|241|591|
|GCA_018691945.1|GCA_902512865.1|97.2327|91|218|
|GCA_021794075.1|GCA_021796055.1|97.1882|624|1393|
|GCA_013372875.1|GCF_900460535.1|97.1568|380|1106|
|GCA_021781655.1|GCA_021780795.1|97.1098|219|1151|
|GCA_021780415.1|GCA_021778845.1|97.099|328|656|
|GCF_006384875.1|GCF_000007825.1|97.0513|1623|1975|
|GCF_003935375.1|GCF_014764705.1|97.0034|1311|1530|
|GCF_021432085.1|GCF_014764705.1|97.0019|1368|1591|
|GCA_014802625.1|GCA_014801595.1|96.985|305|614|
|GCF_001612905.1|GCF_001612805.1|96.9719|1923|2301|
|GCA_002371695.1|GCA_002342665.1|96.9647|253|542|
|GCA_002243605.1|GCA_900450595.1|96.9643|928|1166|
|GCF_001457655.1|GCF_900475885.1|96.9609|560|630|
|GCF_000007825.1|GCF_002200015.1|96.9563|1504|1808|
|GCF_001457655.1|GCF_019703545.1|96.9516|570|630|
|GCA_003722315.1|GCF_000147695.2|96.9409|761|970|
|GCF_900114065.1|GCF_014764705.1|96.9406|1315|1535|
|GCA_021791875.1|GCA_021810735.1|96.9392|364|1307|
|GCF_002200015.1|GCF_000007825.1|96.9306|1523|1757|
|GCF_000147695.2|GCA_003722315.1|96.9275|752|928|
|GCF_001613005.1|GCF_001612845.1|96.924|2097|2513|
|GCF_003935375.1|GCF_002929225.1|96.8882|1344|1530|
|GCF_003788635.1|GCF_003732545.1|96.8739|1045|1198|
|GCF_000284075.1|GCF_001951015.1|96.8592|405|435|
|GCF_001521715.1|GCF_007035645.1|96.8565|1361|1599|
|GCF_001457655.1|GCA_923172585.1|96.8559|552|630|
|GCF_900105615.1|GCF_900105215.1|96.8556|1596|1766|
|GCF_000721105.1|GCA_000720805.1|96.8456|2702|3285|
|GCF_002982115.1|GCF_003049665.1|96.8438|1326|1560|
|GCF_000421285.1|GCF_000350545.1|96.8425|1295|1704|
|GCA_007713455.1|GCF_002929225.1|96.8367|1364|1548|
|GCF_000935215.1|GCF_002929225.1|96.8359|1272|1450|
|GCF_000336775.1|GCF_000336815.1|96.8304|1159|1338|
|GCF_000738675.1|GCF_017299535.1|96.8234|1573|1740|
|GCF_009741445.1|GCA_002291465.1|96.8054|1309|1652|
|GCA_002243605.1|GCF_001277805.1|96.8001|747|1166|
|GCF_000336815.1|GCF_000336775.1|96.7989|1159|1310|
|GCF_900101905.1|GCF_900106535.1|96.7759|2750|3049|
|GCF_001595725.1|GCF_002200015.1|96.7756|1601|2125|
|GCF_000312025.1|GCF_001517665.2|96.7724|496|582|
|GCF_900114065.1|GCF_002929225.1|96.7693|1385|1535|
|GCF_900105215.1|GCF_900105615.1|96.7634|1612|1976|
|GCF_000495915.1|GCF_002929225.1|96.7548|1357|1645|
|GCA_913041275.1|GCA_902516225.1|96.7543|143|356|
|GCF_900475885.1|GCF_019703545.1|96.753|551|664|
|GCA_007713455.1|GCF_014764705.1|96.7524|1311|1548|
|GCF_014873495.1|GCF_000383595.1|96.7417|2547|3498|
|GCF_900102635.1|GCF_900106015.1|96.7402|1519|1782|
|GCF_021432085.1|GCF_002929225.1|96.7381|1387|1591|
|GCF_000308575.1|GCF_001612985.1|96.7192|2099|2388|
|GCA_923172585.1|GCF_900475885.1|96.7175|545|661|
|GCA_021801615.1|GCA_021787255.1|96.7164|284|656|
|GCF_002929225.1|GCF_014764705.1|96.7124|1323|1622|
|GCF_000311745.1|GCF_000163295.1|96.6984|554|675|
|GCF_000495915.1|GCF_014764705.1|96.6944|1280|1645|
|GCF_900102635.1|GCF_015712065.1|96.6834|1531|1782|
|GCF_010998615.1|GCF_900106535.1|96.6758|2715|2932|
|GCF_019703545.1|GCF_900475885.1|96.6738|559|634|
|GCF_000719115.1|GCF_900100275.1|96.6656|2384|2843|
|GCF_002119445.1|GCF_000007825.1|96.6612|1626|2101|
|GCA_021778845.1|GCA_021780415.1|96.6568|326|1216|
|GCF_001507325.1|GCF_001420285.1|96.6446|1101|1335|
|GCF_900475885.1|GCA_923172585.1|96.644|552|664|
|GCF_002809955.1|GCA_900450595.1|96.6404|884|996|
|GCF_002208095.1|GCF_007035645.1|96.6396|1441|1795|
|GCF_014764705.1|GCF_002929225.1|96.6372|1338|1569|
|GCA_000496135.1|GCA_021845325.1|96.632|176|549|
|GCF_001420285.1|GCF_001507325.1|96.6305|1105|1263|
|GCF_001517665.2|GCF_000312025.1|96.6248|495|576|
|GCF_008364625.1|GCF_007035645.1|96.5795|1376|1490|
|GCF_000166295.1|GCF_002933295.1|96.579|1198|1549|
|GCF_000931445.1|GCF_001013905.1|96.5765|2233|2587|
|GCA_021802085.1|GCA_021810735.1|96.5682|389|980|
|GCF_900101905.1|GCF_013116825.1|96.5617|2583|3049|
|GCF_000260655.1|GCF_014054945.1|96.5569|738|772|
|GCF_005222125.1|GCF_001886855.1|96.5541|343|470|
|GCA_003488145.1|GCF_002929225.1|96.5378|859|965|
|GCF_007035645.1|GCF_008364625.1|96.5327|1380|1589|
|GCF_000235625.1|GCF_002933295.1|96.5121|1216|1471|
|GCF_001595725.1|GCF_000007825.1|96.5106|1626|2125|
|GCF_001886855.1|GCF_005222125.1|96.5103|353|545|
|GCF_022749495.1|GCF_014054945.1|96.5076|841|933|
|GCA_001704275.1|GCF_016917755.1|96.5028|2659|3271|
|GCA_003488145.1|GCF_014764705.1|96.5014|836|965|
|GCA_021809755.1|GCA_021790415.1|96.4926|228|606|

Hey there, @Huiguang_Yi,

The GTDB FAQ here notes that a species cluster radius of 95% ANI is targeted, but that it can be as high as 97% to retain existing species names, and that species with ANI >97% are synonyms within GTDB. So if the species clusters can have a radius above 95%, I imagine that shows up between their respective representative genomes too.

Hi @Huiguang_Yi,

Thank you for your message. As AstrobioMike indicated, the ANI can be as high as 97%. We have also observed that different versions of FastANI can give slightly different results. As such, we have put a 0.1 “fudge factor” in the code that effectively allows the ANI to be as high as 97.1% if two genomes were previously GTDB representatives, but now have an ANI slightly >97%. Not ideal, but sometimes we need to update FastANI to take advantage of new features and don’t want to change representatives due to small artifacts in the ANI and AF calculations.
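
To make the thresholds concrete, here is a minimal sketch of the ANI rule as described in this thread. It is not GTDB's actual code: the function name `exceeds_max_radius`, the constants, and the `both_previous_reps` flag are hypothetical names chosen for illustration, and the sketch ignores the alignment fraction criterion entirely.

```python
# Illustrative only: names and structure are assumptions, not GTDB source code.
MAX_ANI_RADIUS = 97.0  # upper bound on the species cluster radius (per the FAQ)
FUDGE_FACTOR = 0.1     # tolerance for small differences between FastANI versions


def exceeds_max_radius(ani: float, both_previous_reps: bool) -> bool:
    """Return True if the pair's ANI is above the allowed limit.

    The fudge factor is applied only when both genomes were
    representatives in a previous release.
    """
    limit = MAX_ANI_RADIUS + (FUDGE_FACTOR if both_previous_reps else 0.0)
    return ani > limit


print(exceeds_max_radius(97.05, both_previous_reps=True))
print(exceeds_max_radius(97.05, both_previous_reps=False))
print(exceeds_max_radius(97.30, both_previous_reps=True))
```

Under these assumptions, a pair at 97.05% ANI is tolerated only if both genomes were already representatives, while a pair at 97.30% is over the limit either way.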

The more important criterion to consider here is that GTDB species clusters also use a 50% alignment fraction (AF) cutoff. All the cases you listed where the ANI is extremely high have a low AF. It is entirely possible (even probable) that these genomes are highly similar and have a low AF only because they are incomplete MAGs. This is the reality of working with genomes of varying quality. Over time, we expect the representative genomes of species clusters to improve, which will mitigate this issue. As your analysis shows, this is an extreme edge case impacting only a handful of species.
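
As a rough check, the AF can be approximated from the fastANI columns in the list as mapped fragments divided by total query fragments. The following is a sketch under that assumption; GTDB's own AF calculation may differ in detail, and the helper names `parse_row` and `alignment_fraction` are made up for illustration. The three rows are copied from the list.

```python
# Sketch: approximate AF from fastANI output columns (an assumption, see above).
AF_CUTOFF = 0.5  # the 50% alignment fraction cutoff

rows = [
    "|GCA_017399555.1|GCA_017411665.1|99.6286|123|640|",
    "|GCA_021851985.1|GCA_021803425.1|99.5409|114|229|",
    "|GCF_001457655.1|GCF_900475885.1|96.9609|560|630|",
]


def parse_row(line):
    """Split one pipe-delimited row into typed fields."""
    query, ref, ani, mapped, total = line.strip().strip("|").split("|")
    return query, ref, float(ani), int(mapped), int(total)


def alignment_fraction(mapped, total):
    """Fraction of query fragments that mapped to the reference."""
    return mapped / total


results = []
for line in rows:
    query, ref, ani, mapped, total = parse_row(line)
    af = alignment_fraction(mapped, total)
    passes = af >= AF_CUTOFF
    results.append((query, ref, ani, af, passes))
    print(f"{query} vs {ref}: ANI={ani:.2f} AF={af:.3f} passes_AF={passes}")
```

With this approximation, the two pairs with ANI above 99% fall below the 50% cutoff, while the pair at about 97% ANI is well above it.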

Cheers,
Donovan

If these genomes are highly similar and have a low AF only because they are MAGs, I would suspect that they are misassembled rather than incomplete. In that case they are mosaic genomes assembled from multiple species, which do not exist in nature, and such representative genomes had better be removed from the database.

Hi,

MAGs can certainly be misassembled, and this may account for some of the low-AF cases. However, GTDB accepts MAGs that are only 50% complete, so another possibility is two partial MAGs that happen to have little overlap. Both of these MAGs could be perfectly fine other than being incomplete.
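
A toy model of this situation, with made-up numbers: treat the true genome as 1,000 fragments and let each MAG recover a different half of them. Both MAGs are error-free here, yet the fraction of one MAG that can align to the other is well below 50%. The fragment ranges are hypothetical and chosen only to illustrate the point.

```python
# Toy model: a genome of 1,000 fragments and two error-free but incomplete MAGs.
GENOME_FRAGMENTS = 1000

mag_a = set(range(0, 500))    # MAG A recovers fragments 0-499 (50% complete)
mag_b = set(range(400, 900))  # MAG B recovers fragments 400-899 (50% complete)

shared = mag_a & mag_b        # fragments present in both MAGs

completeness_a = len(mag_a) / GENOME_FRAGMENTS
completeness_b = len(mag_b) / GENOME_FRAGMENTS
af_a = len(shared) / len(mag_a)  # share of MAG A that can align to MAG B
af_b = len(shared) / len(mag_b)  # share of MAG B that can align to MAG A

print(f"completeness: A={completeness_a:.0%}, B={completeness_b:.0%}")
print(f"shared fragments: {len(shared)}")
print(f"AF: A->B={af_a:.0%}, B->A={af_b:.0%}")
```

In this example each MAG meets the 50% completeness bar, but only 100 fragments are shared, so the AF is 20% in both directions even though neither MAG contains any misassembled sequence.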

Cheers,
Donovan