Create new default trees #7948

@grantfitzsimmons

Description

We need to build a tool that creates Specify-ready taxonomy outputs from the GBIF backbone taxonomy dataset.

Workflow

The tool should support a simple end-to-end workflow:

  1. Get the current GBIF backbone dataset.
  2. Prepare it for local processing.
  3. Build a local taxonomy data store.
  4. Generate discipline-specific taxonomy CSV files.
  5. Generate companion mapping files and a manifest file.
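As a rough illustration of step 3, the sketch below loads tab-separated taxon rows into a local SQLite table. The column subset, the `taxon` table name, and the `build_store` helper are assumptions made for this example; the real backbone file may name or order its fields differently, and steps 1 and 2 (download and preparation) are not shown.

```python
import csv
import io
import sqlite3

# Assumed subset of columns; adjust to whatever the prepared backbone file provides.
COLUMNS = ["taxonID", "parentNameUsageID", "canonicalName", "taxonRank", "taxonomicStatus"]


def build_store(tsv_text: str, conn: sqlite3.Connection) -> int:
    """Load tab-separated taxon rows into a `taxon` table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxon ("
        "taxonID TEXT PRIMARY KEY, parentNameUsageID TEXT, "
        "canonicalName TEXT, taxonRank TEXT, taxonomicStatus TEXT)"
    )
    # QUOTE_NONE keeps stray quote characters in names from being treated as CSV quoting.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [tuple((row.get(col) or "") for col in COLUMNS) for row in reader]
    # INSERT OR REPLACE makes a re-run over the same data idempotent.
    conn.executemany("INSERT OR REPLACE INTO taxon VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM taxon").fetchone()[0]
```

Keying the table on `taxonID` means a repeated load overwrites rows rather than duplicating them, which may be convenient when the source data is refreshed.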

Scope

The tool should export a fixed set of discipline datasets that we determine, including zoology- and botany-focused outputs (for example: aves, ichthyology, herpetology, botany, fungi, mammalogy, and related groups).

Each discipline output should include only taxa that belong to that discipline's intended scope.
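One way to express a discipline's scope is as a list of root taxa, with a row counted as in scope when its lineage contains one of them. The sketch below assumes each row carries its higher ranks as plain fields (`kingdom`, `class`, and so on); the `DISCIPLINE_SCOPES` table and the `in_scope` helper are hypothetical, and the final list of disciplines and their roots is still to be decided.

```python
# Hypothetical scope table: discipline name -> list of (rank, name) root taxa.
DISCIPLINE_SCOPES = {
    "aves": [("class", "Aves")],
    "mammalogy": [("class", "Mammalia")],
    "botany": [("kingdom", "Plantae")],
    "fungi": [("kingdom", "Fungi")],
}


def in_scope(row: dict, discipline: str) -> bool:
    """Return True when the row falls under any root taxon of the discipline."""
    return any(row.get(rank) == name for rank, name in DISCIPLINE_SCOPES[discipline])
```

Using a list of roots per discipline leaves room for groups that span several classes or orders.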

Contents

The exported taxonomy should represent currently accepted, non-extinct taxa and preserve lineage context needed for tree-based import.
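A minimal sketch of both requirements follows: a filter for accepted, non-extinct rows, and a walk up the parent chain to recover lineage. The field names `taxonomicStatus`, `isExtinct`, `parentNameUsageID`, `taxonRank`, and `canonicalName` are assumptions about the prepared data, and treating a missing extinct flag as "not extinct" is a choice made for this example rather than a settled rule.

```python
TRUE_VALUES = {"true", "t", "1", "yes"}


def is_exportable(row: dict) -> bool:
    """Keep rows that are accepted and not flagged as extinct."""
    status = (row.get("taxonomicStatus") or "").strip().lower()
    extinct = (row.get("isExtinct") or "").strip().lower()
    # Assumption: a blank or missing extinct flag is treated as not extinct.
    return status == "accepted" and extinct not in TRUE_VALUES


def lineage(taxon_id: str, by_id: dict) -> list:
    """Return (rank, name) pairs from the root down to the given taxon."""
    chain = []
    seen = set()
    current = taxon_id
    # The `seen` set guards against cycles in malformed parent links.
    while current and current in by_id and current not in seen:
        seen.add(current)
        row = by_id[current]
        chain.append((row["taxonRank"], row["canonicalName"]))
        current = row.get("parentNameUsageID")
    return list(reversed(chain))
```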

Output

Each discipline should produce one CSV file that:

  1. Contains hierarchical taxonomy columns appropriate to the data.
  2. Includes key metadata fields needed by Specify.
  3. Avoids duplicate rows.
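The three requirements above can be sketched as a single export function: it keeps only the rank columns that are populated in the data, appends metadata columns, and skips rows it has already written. The `RANK_COLUMNS` and `META_COLUMNS` lists and the `rows_to_csv` helper are hypothetical; the metadata fields Specify actually needs are not specified in this issue.

```python
import csv
import io

RANK_COLUMNS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
META_COLUMNS = ["author", "source"]  # hypothetical metadata fields


def rows_to_csv(rows: list) -> str:
    """Serialize rows to CSV text, dropping empty rank columns and duplicate rows."""
    # Keep only the rank columns that hold a value in at least one row.
    used_ranks = [rank for rank in RANK_COLUMNS if any(row.get(rank) for row in rows)]
    header = used_ranks + META_COLUMNS
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    seen = set()
    for row in rows:
        record = tuple((row.get(col) or "") for col in header)
        if record in seen:
            continue  # identical row already written
        seen.add(record)
        writer.writerow(record)
    return out.getvalue()
```

Deduplicating on the full output record, rather than on an identifier, catches rows that differ in the source but collapse to the same exported values.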

For each CSV, the tool should create a mapping JSON file describing how columns map to taxonomy ranks and metadata fields.
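As an illustration only, a mapping could be derived from the CSV header by splitting it into rank columns and metadata columns. The `ranks` and `metadata` keys and the `build_mapping` helper are hypothetical placeholders; the real mapping files should follow whatever structure Specify's importer expects.

```python
import json


def build_mapping(header: list, rank_columns: list) -> dict:
    """Split a CSV header into rank columns and metadata columns, preserving order."""
    rank_set = set(rank_columns)
    return {
        "ranks": [col for col in header if col in rank_set],
        "metadata": [col for col in header if col not in rank_set],
    }


def mapping_to_json(mapping: dict) -> str:
    """Serialize the mapping with stable formatting so diffs stay readable."""
    return json.dumps(mapping, indent=2)
```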

The tool should also create a taxonfiles.json manifest that lists all generated datasets and their metadata so they can be published and consumed by downstream systems. The manifest should match the format of the hand-built manifests we are currently using.
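A minimal sketch of manifest generation is shown below. The `TaxonFile` fields (`discipline`, `csv_file`, `mapping_file`, `row_count`) are hypothetical stand-ins; the actual keys and layout should be copied from the existing hand-built taxonfiles.json rather than from this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaxonFile:
    # Hypothetical fields; replace with the keys used by the existing manifest.
    discipline: str
    csv_file: str
    mapping_file: str
    row_count: int


def build_manifest(files: list) -> str:
    """Serialize one entry per generated dataset, sorted by discipline for stable output."""
    entries = [asdict(f) for f in sorted(files, key=lambda f: f.discipline)]
    return json.dumps(entries, indent=2)
```

Sorting the entries keeps the manifest byte-for-byte stable between runs when nothing has changed, which makes published updates easier to review.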

Artifacts

After a full run, the project should contain:

  1. Downloaded source data.
  2. Prepared local working data.
  3. A local processed taxonomy SQL database.
  4. Discipline CSV outputs.
  5. Per-discipline mapping JSON files.
  6. A top-level manifest file for all generated taxon files.

These artifacts should then be uploaded to S3 and made available via https://files.specifysoftware.org/taxonfiles/beta/*

Labels: 2 - Trees (Issues that are related to the tree system and related functionalities.)
