Incorrect annotation in the representative genomes

I found some genomes in which the plasmids are bigger than the chromosomes:

# how to create name.map and gtdb.files.txt
#     https://github.com/shenwei356/kmcp/blob/main/docs/database.md#gtdb

$ grep plasmid name.map
GCF_000163055.2 NC_022111.1 Prevotella sp. oral taxon 299 str. F0039 plasmid unnamed, complete sequence
GCF_000292525.1 NZ_AEYF01000045.1 Rhizobium sp. CCGE 510 plasmid pRspCCGE510d Contig45, whole genome shotgun sequence
GCF_000298315.2 NZ_AEYE02000035.1 Rhizobium grahamii CCGE 502 map unlocalized plasmid pRg502b contig0035, whole genome shotgun sequence
GCF_008274585.1 NZ_VTRT01000001.1 Pedobacter sp. BS3 map unlocalized plasmid unnamed1 BS3-1_scaffold1, whole genome shotgun sequence
GCF_008274625.1 NZ_VTRU01000001.1 Chryseobacterium sp. Gsoil 183 map unlocalized plasmid unnamed1 Gsoil183-2_scaffold1, whole genome shotgun sequence
GCF_016839145.1 NZ_CP069303.1 Shinella sp. PSBB067 plasmid unnamed1, complete sequence
GCF_900637865.1 NZ_LR134418.1 Legionella adelaidensis strain NCTC12735 plasmid 9, complete sequence
GCF_900660545.1 NZ_LR214986.1 Mycoplasma cynos strain NCTC10142 plasmid 13
GCA_001190015.1 LGTG01000643.1 Candidatus Burkholderia crenata strain UZHbot9 plasmid pBCRE02, whole genome shotgun sequence

# r207
$ seqkit stats --infile-list <(grep -f <(cat name.map | grep plasmid | cut -f 1) gtdb.files.txt)
file                                                                format  type  num_seqs    sum_len  min_len      avg_len    max_len
gtdb/gtdb_genomes_reps_r207/GCF/000/163/055/GCF_000163055.2.fna.gz  FASTA   DNA          2  2,480,269  709,850  1,240,134.5  1,770,419
gtdb/gtdb_genomes_reps_r207/GCF/000/292/525/GCF_000292525.1.fna.gz  FASTA   DNA        142  6,916,614      507     48,708.5    923,843
gtdb/gtdb_genomes_reps_r207/GCF/000/298/315/GCF_000298315.2.fna.gz  FASTA   DNA         80  7,146,037      257     89,325.5    607,513
gtdb/gtdb_genomes_reps_r207/GCF/008/274/585/GCF_008274585.1.fna.gz  FASTA   DNA         27  4,811,172      527    178,191.6  1,310,315
gtdb/gtdb_genomes_reps_r207/GCF/008/274/625/GCF_008274625.1.fna.gz  FASTA   DNA         17  4,983,760      676    293,162.4  2,346,872
gtdb/gtdb_genomes_reps_r207/GCF/016/839/145/GCF_016839145.1.fna.gz  FASTA   DNA          4  5,774,137  108,567  1,443,534.3  4,605,385
gtdb/gtdb_genomes_reps_r207/GCF/900/637/865/GCF_900637865.1.fna.gz  FASTA   DNA         29  2,094,483    1,153     72,223.6    451,287
gtdb/gtdb_genomes_reps_r207/GCF/900/660/545/GCF_900660545.1.fna.gz  FASTA   DNA         18  1,093,147    4,380     60,730.4    986,659
gtdb/gtdb_genomes_reps_r207/GCA/001/190/015/GCA_001190015.1.fna.gz  FASTA   DNA        643  2,843,741      500      4,422.6    103,016

# r202
$  seqkit stats --infile-list <(grep -f <(cat name.map | grep plasmid | cut -f 1) gtdb.files.txt)
file                                         format  type  num_seqs    sum_len  min_len      avg_len    max_len
gtdb/GCF/000/163/055/GCF_000163055.2.fna.gz  FASTA   DNA          2  2,480,269  709,850  1,240,134.5  1,770,419
gtdb/GCF/000/292/525/GCF_000292525.1.fna.gz  FASTA   DNA        142  6,916,614      507     48,708.5    923,843
gtdb/GCF/000/298/315/GCF_000298315.2.fna.gz  FASTA   DNA         80  7,146,037      257     89,325.5    607,513
gtdb/GCF/001/266/905/GCF_001266905.1.fna.gz  FASTA   DNA        101  3,174,715    1,109     31,432.8    170,794
gtdb/GCF/008/274/585/GCF_008274585.1.fna.gz  FASTA   DNA         27  4,811,172      527    178,191.6  1,310,315
gtdb/GCF/008/274/625/GCF_008274625.1.fna.gz  FASTA   DNA         17  4,983,760      676    293,162.4  2,346,872
gtdb/GCF/010/731/815/GCF_010731815.1.fna.gz  FASTA   DNA          2  6,451,210  434,050    3,225,605  6,017,160
gtdb/GCF/900/637/865/GCF_900637865.1.fna.gz  FASTA   DNA         29  2,094,483    1,153     72,223.6    451,287
gtdb/GCF/900/660/545/GCF_900660545.1.fna.gz  FASTA   DNA         18  1,093,147    4,380     60,730.4    986,659
gtdb/GCA/001/190/015/GCA_001190015.1.fna.gz  FASTA   DNA        643  2,843,741      500      4,422.6    103,016

Examples with only one chromosome:

$ seqkit fx2tab -l -n   gtdb/gtdb_genomes_reps_r207/GCF/000/163/055/GCF_000163055.2.fna.gz 
NC_022111.1 Prevotella sp. oral taxon 299 str. F0039 plasmid unnamed, complete sequence 1770419
NC_022124.1 Prevotella sp. oral taxon 299 str. F0039, complete sequence 709850

$ seqkit fx2tab -l -n   gtdb/gtdb_genomes_reps_r207/GCF/016/839/145/GCF_016839145.1.fna.gz
NZ_CP069302.1 Shinella sp. PSBB067 chromosome, complete genome  375573
NZ_CP069303.1 Shinella sp. PSBB067 plasmid unnamed1, complete sequence  4605385
NZ_CP069304.1 Shinella sp. PSBB067 plasmid unnamed2, complete sequence  684612
NZ_CP069305.1 Shinella sp. PSBB067 plasmid unnamed3, complete sequence  108567

I suspect this is due to incorrect annotation by the submitters.
It would be better to choose another, well-annotated genome as the representative.
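
A minimal sketch of how such cases could be flagged automatically, assuming the per-record names and lengths come from `seqkit fx2tab -n -l` (tab-separated, length in the last column) and that gtdb.files.txt lists one genome file per line. The function name `flag_largest_plasmid` is made up for illustration, and the check only catches genomes whose single longest record is labeled as a plasmid:

```shell
# Read "name<TAB>...<TAB>length" lines from stdin and print the longest
# record (name and length) only if its name contains "plasmid".
# flag_largest_plasmid is a hypothetical helper, not part of seqkit.
flag_largest_plasmid() {
    awk -F'\t' '
        # track the longest record seen so far (length is the last field)
        ($NF + 0) > max { max = $NF + 0; name = $1 }
        END {
            if (tolower(name) ~ /plasmid/)
                print name "\t" max
        }'
}

# Possible usage (requires seqkit; gtdb.files.txt is assumed to hold one path per line):
#   while read -r f; do
#       hit=$(seqkit fx2tab -n -l "$f" | flag_largest_plasmid)
#       [ -n "$hit" ] && printf '%s\t%s\n' "$f" "$hit"
#   done < gtdb.files.txt
```

Draft assemblies with many contigs may still need a manual look, since the longest contig being a plasmid contig does not by itself prove a labeling error.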

For GCF_010731815, the newer assembly version (GCF_010731815.2, used in r207) fixes this problem.

# r202
$ seqkit fx2tab -l -n gtdb/GCF/010/731/815/GCF_010731815.1.fna.gz
NZ_AP022592.1 Mycolicibacterium arabiense strain JCM 18538      434050
NZ_AP022593.1 Mycolicibacterium arabiense strain JCM 18538 plasmid pJCM18538, complete sequence 6017160

# r207
$ seqkit fx2tab -l -n gtdb/gtdb_genomes_reps_r207/GCF/010/731/815/GCF_010731815.2.fna.gz
NZ_AP022593.1 Mycolicibacterium arabiense strain JCM 18538 chromosome, complete genome  6017160
NZ_AP022592.1 Mycolicibacterium arabiense strain JCM 18538 plasmid pJCM18538, complete sequence 434050

PS: I filter out plasmids according to the sequence name.
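
For reference, name-based filtering of this kind can be sketched roughly as below. This is only an illustration under the assumption that any record whose header contains "plasmid" (case-insensitive) should be dropped; `drop_plasmids` is a made-up helper name. With seqkit, something like `seqkit grep -n -r -i -v -p plasmid` should behave similarly.

```shell
# Drop FASTA records whose header line contains "plasmid" (case-insensitive).
# Reads FASTA from stdin and writes the kept records to stdout.
# drop_plasmids is a hypothetical helper name used for illustration.
drop_plasmids() {
    awk '
        # on each header line, decide whether to keep the whole record
        /^>/ { keep = (tolower($0) !~ /plasmid/) }
        # print header and sequence lines of kept records
        keep'
}

# Possible usage:
#   gzip -dc genome.fna.gz | drop_plasmids > genome.noplasmid.fna
```

This is exactly why the mislabeled records matter: with a filter like this, a chromosome that is annotated as a plasmid gets removed, and a plasmid annotated as a chromosome is kept.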

Hi Wei Shen,

This is certainly an interesting issue. Have you brought this to the attention of NCBI? Ideally, this would be resolved at the source of the data. Perhaps these are all annotation errors that NCBI can correct as per GCF_010731815.

Interestingly, for a number of these cases there is only 1 genome in the GTDB species cluster so we can’t replace the representative. This might suggest the genome assembly is problematic, but it is hard to know. Hopefully NCBI could provide more insight and remove these if the assemblies can be determined to be erroneous or resolve any incorrect annotations.

Cheers,
Donovan

Sorry for the late reply. I have just written an email to gb-admin@ncbi.nlm.nih.gov, and I hope the records will be fixed soon.

I have another question: why don't you filter out the plasmids before any downstream analysis? Plasmids, as mobile elements, can be transferred between bacteria, for example by conjugation.

The “plasmid” record NC_022111.1 is also included in the RefSeq plasmid collection (/refseq/release/plasmid), so many other databases are likely affected.

Hi Wei Shen,

It is true that we don’t explicitly filter out plasmids. GTDB infers reference trees using genes that are predominantly single-copy and ubiquitous across either the bacterial or archaeal domain. These are (presumably) all chromosomal genes. Species delineation is done using ANI, where retention of plasmids isn’t critical given their relatively small size. I’d suggest that if two genomes have a shared plasmid, it should be included in the ANI calculation, though there certainly isn’t any consensus on this more nuanced issue. Again, in practice, this isn’t critical for ANI given the small size of plasmids.

Cheers,
Donovan

I got it. Thanks for your reply.