Dear GTDB team,
I found that in the new R207 release there are only about 50 archaeal markers, and a majority of the 6000 archaeal genomes contain only about 1/3 of those 50 markers. In the old R202 release there were 120+ archaeal markers, and the majority of genomes contained 100+. How can you make sure that the overlap of the archaeal markers in R207 is sufficient to give a meaningful concatenated marker set? In the worst case, nothing is shared between two archaeal genomes: one contains 1/3 of the markers and the other contains a different 1/3.
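The worst case described above can be made concrete with a small pigeonhole calculation. This is only a sketch: the function name and the per-genome marker counts (18 and 43 out of 53) are hypothetical values chosen for illustration, not figures taken from GTDB.

```python
def min_shared_markers(n_markers: int, a: int, b: int) -> int:
    """Smallest number of markers two genomes must share (pigeonhole bound).

    n_markers: size of the marker set (e.g. 53 for the archaeal set)
    a, b:      number of markers found in each of the two genomes
    """
    if not (0 <= a <= n_markers and 0 <= b <= n_markers):
        raise ValueError("marker counts must lie between 0 and n_markers")
    # If a + b exceeds the set size, the excess must be shared markers.
    return max(0, a + b - n_markers)


# Hypothetical counts: two genomes with about 1/3 of the 53 markers each.
print(min_shared_markers(53, 18, 18))  # 0
# Hypothetical counts: two genomes with 43 of the 53 markers each.
print(min_shared_markers(53, 43, 43))  # 33
```

So two genomes that each carry only a third of the markers are not guaranteed to share any, whereas genomes carrying most of the set are forced to overlap substantially.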
The change in the number of markers (from 122 to 53) is explained in the release notes.
Hi,
You can find statistics regarding how often each of the 53 archaeal marker genes is identified across the archaeal GTDB representatives in ar53_msa_marker_info_r207.tsv:
https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/ar53_msa_marker_info_r207.tsv
Only 2 markers are identified in <50% of the archaeal GTDB representative genomes, and the average ubiquity across the 53 markers is 82%.
Note that there are 3,412 archaeal representatives in GTDB R207. Many non-representative genomes are partially complete MAGs, which may account for the low ubiquity you are seeing.
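If you want to recompute these summary figures yourself, something along the following lines should work once you have downloaded the file. This is a rough sketch: the column name "Ubiquity (%)" is an assumption about the file header, so check it against the actual TSV, and the toy table below uses made-up marker IDs and values purely to show the mechanics.

```python
import csv
import io
from statistics import mean


def summarize_ubiquity(tsv_text, column="Ubiquity (%)", threshold=50.0):
    """Summarize a per-marker ubiquity column from tab-separated text.

    `column` is an assumed header name; adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = [float(row[column]) for row in reader]
    return {
        "n_markers": len(values),
        "mean": mean(values),
        # Number of markers found in fewer than `threshold` percent of genomes.
        "below_threshold": sum(1 for v in values if v < threshold),
    }


# Toy input with hypothetical marker IDs and values (not real GTDB data).
toy_tsv = (
    "Marker Id\tUbiquity (%)\n"
    "markerA\t95.0\n"
    "markerB\t40.0\n"
    "markerC\t81.0\n"
)
summary = summarize_ubiquity(toy_tsv)
print(summary)
```

For the real file, read its contents into a string (for example with `open(path).read()`) and pass that in place of `toy_tsv`.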
Cheers,
Donovan