Scoring representative genomes

Hi,

Thank you for creating and maintaining such a great resource.

I am interested in dereplicating the full GTDB at different ANI levels. I was hoping to select representatives using the same scoring scheme as described in the documentation here.

Is there a straightforward way to map the column names from the metadata file (*metadata_r214.tsv.gz) to this scoring scheme, or are the scores for each genome in the database available somewhere?

Hi,

The scores are not saved anywhere, so you would need to recalculate them. You can find a description of each field in this file: https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/metadata_field_desc.tsv

I think the mapping should be clear, except perhaps for determining whether a genome is an NCBI representative/reference genome. That information is in the ncbi_refseq_category field.
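
A minimal sketch of that one check, in case it helps. It assumes the metadata TSV has an `accession` column and that `ncbi_refseq_category` holds values such as "representative genome", "reference genome", or "na"; please confirm both against the field description file and your release. The function names are only illustrative and this is not the code GTDB itself uses.

```python
import csv
import gzip
import io

# Assumed values of ncbi_refseq_category that mark an NCBI representative or
# reference genome; verify them against the values in your release.
NCBI_REP_CATEGORIES = {"representative genome", "reference genome"}


def is_ncbi_representative(row):
    """Return True if a metadata row is flagged as an NCBI rep/ref genome."""
    value = (row.get("ncbi_refseq_category") or "").strip().lower()
    return value in NCBI_REP_CATEGORIES


def flag_ncbi_representatives(handle):
    """Map each accession to its NCBI representative/reference flag."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession"]: is_ncbi_representative(row) for row in reader}


def flag_from_file(path):
    """Open a plain or gzipped metadata TSV and flag every genome."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return flag_ncbi_representatives(handle)


# Tiny in-memory example with made-up accessions (not real GTDB records).
example = io.StringIO(
    "accession\tncbi_refseq_category\n"
    "G1\trepresentative genome\n"
    "G2\tna\n"
    "G3\treference genome\n"
)
flags = flag_ncbi_representatives(example)
```

For the real file you would call `flag_from_file` with the path to your downloaded metadata file, then combine the resulting flag with the other fields when recalculating the score.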

Cheers,
Donovan

Thanks very much! I had missed that file.