I have noticed that some genomes excluded from RefSeq because they have “many frameshifted proteins” are sometimes selected as GTDB representatives.
Yet, genomes of higher quality are sometimes available.
Below are some examples:
GCA_000614735.1 (excluded from RefSeq) -> GCF_001434975.1
GCA_001311765.1 (excluded from RefSeq) -> GCF_001434815.1
We have observed that NCBI indicates “many frameshifted proteins” even when this is not the case. For example, GCF_000006925.2 was flagged as such, and this label was subsequently removed after we contacted NCBI with evidence that the gene calling appears correct. As such, we are currently not using this annotation to help determine the best GTDB representative. Internally, we call genes using Prodigal, so we are not dependent on the quality of the gene calling provided with genome assemblies at NCBI.
However, we appreciate this might not be the most convenient for all GTDB users. Have you investigated the two genomes you flagged and found the gene calling to be problematic? We’d like to have some confirmed cases where genomes flagged as having frameshift errors have been verified.
GCA_000614735.1 and GCA_001311765.1 were generated with Ion PGM sequencers.
This technology is known to produce short indels at homopolymers.
I aligned the genes of GCF_001434975.1 and GCF_001434815.1 against the contigs of GCA_000614735.1 and GCA_001311765.1, respectively.
I found many short gaps at homopolymers that will cause frameshifts.
GCA_001311765.1: 1092 of 1843 genes (59%) with gaps.
GCA_000614735.1: 215 of 1904 genes (11%) with gaps.
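
For anyone wanting to reproduce this kind of check, here is a minimal Python sketch. It is not the script used for the numbers above. It assumes you already have a pairwise alignment of a reference gene against the contig as two equal-length gapped strings (from whatever aligner you prefer), and it reports gaps whose length is not a multiple of 3 and that fall inside a homopolymer run. The function names and the `min_run` threshold are hypothetical choices for illustration.

```python
def gap_runs(aligned):
    """Yield (start, end) half-open column ranges for each run of '-' in an aligned sequence."""
    i, n = 0, len(aligned)
    while i < n:
        if aligned[i] == "-":
            j = i
            while j < n and aligned[j] == "-":
                j += 1
            yield i, j
            i = j
        else:
            i += 1


def homopolymer_frameshift_gaps(ref_aln, ctg_aln, min_run=3):
    """Return sorted (start, end, base, run_length) tuples for frameshifting gaps at homopolymers.

    ref_aln / ctg_aln: gapped reference-gene and contig strings of equal length.
    min_run: assumed minimum homopolymer length (in alignment columns) to count as a hit.
    """
    if len(ref_aln) != len(ctg_aln):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    # Look at gaps in the contig (deletions) and gaps in the reference (insertions).
    for gapped, other in ((ctg_aln, ref_aln), (ref_aln, ctg_aln)):
        for start, end in gap_runs(gapped):
            if (end - start) % 3 == 0:
                continue  # in-frame indel, does not shift the reading frame
            bases = set(other[start:end].upper())
            if len(bases) != 1 or "-" in bases:
                continue  # the indel itself is not a single repeated base
            base = bases.pop()
            # Extend the run of the same base to both sides in the ungapped sequence.
            left, right = start, end
            while left > 0 and other[left - 1].upper() == base:
                left -= 1
            while right < len(other) and other[right].upper() == base:
                right += 1
            if right - left >= min_run:
                hits.append((start, end, base, right - left))
    return sorted(hits)
```

A gene would then count as "with gaps" if this returns at least one hit for its alignment. Run length is measured in alignment columns, so this is only an approximation when gaps in both sequences sit next to each other.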
Just following up on this suggestion. We will start considering NCBI assembly quality metadata when determining the most appropriate genome to use as the GTDB species representative. Unfortunately, we are too far along in the upcoming release (R06-RS202) to incorporate these changes. They will appear in the subsequent release (R07-RS205).
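
As a rough illustration only (this is not GTDB's actual selection code), the idea of taking the NCBI flag into account could be sketched as below. It assumes each candidate genome comes with the free-text exclusion reasons from NCBI's assembly metadata (for example, the `excluded_from_refseq` field of the assembly summary files), with multiple reasons separated by semicolons. The helper names are hypothetical.

```python
FRAMESHIFT_FLAG = "many frameshifted proteins"


def has_frameshift_flag(excluded_field):
    """True if the semicolon-separated exclusion reasons include the frameshift flag."""
    reasons = [r.strip().lower() for r in excluded_field.split(";")]
    return FRAMESHIFT_FLAG in reasons


def prefer_unflagged(candidates):
    """Given (accession, exclusion_reasons) pairs, keep unflagged genomes when any exist.

    Falls back to all candidates if every genome carries the flag, so a species
    is never left without a possible representative.
    """
    unflagged = [acc for acc, excl in candidates if not has_frameshift_flag(excl)]
    return unflagged or [acc for acc, _ in candidates]
```

Since the flag has been wrong at least once (see the GCF_000006925.2 case earlier in this thread), it seems safer to use it as a tie-breaker or penalty than as a hard filter.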