We need to build a tool that creates Specify-ready taxonomy outputs from the GBIF backbone taxonomy dataset.
Workflow
The tool should support a simple end-to-end workflow:
- Get the current GBIF backbone dataset.
- Prepare it for local processing.
- Build a local taxonomy data store.
- Generate discipline-specific taxonomy CSV files.
- Generate companion mapping files and a manifest file.
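The "build a local taxonomy data store" step could be sketched as below. This is a minimal sketch, not a definitive implementation: it assumes the prepared data is a tab-separated file with Darwin Core style headers (`taxonID`, `parentNameUsageID`, `canonicalName`, `taxonRank`, `taxonomicStatus`), which should be checked against the actual download. The table layout and the `build_store` name are hypothetical, and the sample rows are illustrative only.

```python
import csv
import io
import sqlite3


def build_store(tsv_file, conn):
    """Load selected columns of a tab-separated taxon file into SQLite.

    The column names read here are assumptions about the source file.
    Returns the number of rows in the taxon table afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxon ("
        "taxon_id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "name TEXT, taxon_rank TEXT, status TEXT)"
    )
    # QUOTE_NONE: treat quote characters in names as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = (
        (
            int(r["taxonID"]),
            int(r["parentNameUsageID"]) if r["parentNameUsageID"] else None,
            r["canonicalName"],
            r["taxonRank"],
            r["taxonomicStatus"],
        )
        for r in reader
    )
    conn.executemany("INSERT OR REPLACE INTO taxon VALUES (?, ?, ?, ?, ?)", rows)
    # Index parent ids so child lookups stay fast on a large table.
    conn.execute("CREATE INDEX IF NOT EXISTS taxon_parent ON taxon(parent_id)")
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM taxon").fetchone()[0]


# Illustrative sample input standing in for the prepared source file.
sample = io.StringIO(
    "taxonID\tparentNameUsageID\tcanonicalName\ttaxonRank\ttaxonomicStatus\n"
    "1\t\tAnimalia\tkingdom\taccepted\n"
    "44\t1\tChordata\tphylum\taccepted\n"
    "212\t44\tAves\tclass\taccepted\n"
)
conn = sqlite3.connect(":memory:")
count = build_store(sample, conn)
```

Keeping the parent id on each row lets later steps rebuild lineage from the store without re-reading the source file.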
Scope
The tool should export a fixed set of discipline datasets, which we determine, including zoology- and botany-focused outputs (for example: aves, ichthyology, herpetology, botany, fungi, mammalogy, and related groups).
Each discipline output should include only taxa that belong to that discipline's intended scope.
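One way to express a discipline's scope is a table of root taxa, with a taxon counted as in scope when its lineage passes through one of them. The sketch below is only an illustration: the `DISCIPLINE_ROOTS` table, its entries, and the `in_scope` helper are hypothetical, and the real scope for each discipline still has to be decided.

```python
# Hypothetical scope table: discipline -> list of (rank, name) root taxa.
# The entries are examples, not the agreed scope definitions.
DISCIPLINE_ROOTS = {
    "aves": [("class", "Aves")],
    "mammalogy": [("class", "Mammalia")],
    "fungi": [("kingdom", "Fungi")],
    "botany": [("kingdom", "Plantae")],
}


def in_scope(discipline, lineage):
    """Return True if the lineage passes through a root taxon of the discipline.

    lineage is a dict mapping rank -> name for a taxon and its ancestors.
    """
    roots = DISCIPLINE_ROOTS[discipline]
    return any(lineage.get(rank) == name for rank, name in roots)


raven = {"kingdom": "Animalia", "class": "Aves", "genus": "Corvus"}
wolf = {"kingdom": "Animalia", "class": "Mammalia", "genus": "Canis"}
```

A discipline that spans several unrelated groups would simply list more than one root.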
Contents
The exported taxonomy should represent currently accepted, non-extinct taxa and preserve lineage context needed for tree-based import.
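The two requirements here, filtering to accepted non-extinct taxa and keeping lineage, can be sketched as follows. This assumes each record carries a parent id, a status string where `"accepted"` marks accepted names, and a boolean `extinct` flag; those field names are hypothetical and depend on how the local store is built.

```python
def lineage(taxon_id, taxa):
    """Return records from the root down to the taxon by following parent ids.

    taxa maps id -> record. The walk stops at a missing parent or a cycle.
    """
    chain = []
    seen = set()
    current = taxon_id
    while current is not None and current in taxa and current not in seen:
        seen.add(current)
        record = taxa[current]
        chain.append(record)
        current = record["parent_id"]
    return list(reversed(chain))


def is_exportable(record):
    """Keep only accepted, non-extinct taxa (field names are assumptions)."""
    return record["status"] == "accepted" and not record["extinct"]


# Illustrative records with made-up ids.
taxa = {
    1: {"name": "Aves", "parent_id": None, "status": "accepted", "extinct": False},
    2: {"name": "Corvus", "parent_id": 1, "status": "accepted", "extinct": False},
    3: {"name": "Corvus corax", "parent_id": 2, "status": "accepted", "extinct": False},
    4: {"name": "Raphus cucullatus", "parent_id": 1, "status": "accepted", "extinct": True},
}
exportable = [r["name"] for r in taxa.values() if is_exportable(r)]
raven_lineage = [r["name"] for r in lineage(3, taxa)]
```

The lineage is resolved per taxon so that each exported row can carry its full set of ancestor names.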
Output
Each discipline should produce one CSV file that:
- Contains hierarchical taxonomy columns appropriate to the data.
- Includes key metadata fields needed by Specify.
- Avoids duplicate rows.
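The CSV requirements above can be sketched with the standard `csv` module. This is a sketch under assumptions: the rank column names and the single `Author` metadata column are placeholders, since the exact columns Specify needs are not listed here, and `write_discipline_csv` is a hypothetical helper name.

```python
import csv
import io

# Placeholder rank columns; the real set should follow the data.
RANK_COLUMNS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def write_discipline_csv(rows, out, extra_columns=("Author",)):
    """Write rows as CSV, skipping exact duplicates. Returns rows written."""
    fieldnames = RANK_COLUMNS + list(extra_columns)
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    seen = set()
    written = 0
    for row in rows:
        # A row is a duplicate when every output column matches a prior row.
        key = tuple(row.get(name, "") for name in fieldnames)
        if key in seen:
            continue
        seen.add(key)
        writer.writerow(dict(zip(fieldnames, key)))
        written += 1
    return written


raven = {"Kingdom": "Animalia", "Class": "Aves", "Genus": "Corvus", "Species": "corax"}
crow = {"Kingdom": "Animalia", "Class": "Aves", "Genus": "Corvus", "Species": "corone"}
buffer = io.StringIO()
written = write_discipline_csv([raven, dict(raven), crow], buffer)
```

Deduplicating on the full output row, rather than on a source id, guards against two source records that collapse to the same exported values.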
For each CSV, the tool should create a mapping JSON file describing how columns map to taxonomy ranks and metadata fields.
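As a rough illustration of what a mapping file could hold, the sketch below pairs each CSV column with a rank or a metadata field. The JSON structure, its key names, and the `author` field name are all hypothetical; the real files should follow whatever mapping format Specify expects.

```python
import json


def build_mapping(rank_columns, metadata_columns):
    """Describe which CSV column feeds which rank or metadata field.

    Both arguments are lists of (target, column) pairs. Ranks are kept in
    the order given so the hierarchy is preserved.
    """
    return {
        "ranks": [{"rank": rank, "column": column} for rank, column in rank_columns],
        "metadata": [
            {"field": field, "column": column} for field, column in metadata_columns
        ],
    }


mapping = build_mapping(
    rank_columns=[("Kingdom", "Kingdom"), ("Class", "Class"), ("Genus", "Genus")],
    metadata_columns=[("author", "Author")],  # hypothetical field name
)
mapping_json = json.dumps(mapping, indent=2)
```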
The tool should also create a taxonfiles.json manifest that lists all generated datasets and their metadata so they can be published and consumed by downstream systems. Its format should match the manifests we've created by hand and are currently using.
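Generating the manifest could look roughly like the sketch below. The entry keys (`discipline`, `file`, `mappingFile`, `rows`) and the file naming pattern are assumptions made for illustration; the actual keys need to be taken from the existing hand-made manifest.

```python
import json


def build_manifest(datasets):
    """Return manifest JSON text with one entry per generated dataset.

    datasets is a list of dicts with 'discipline' and 'rows' keys.
    Entries are sorted by discipline so repeated runs give stable output.
    """
    entries = []
    for dataset in sorted(datasets, key=lambda d: d["discipline"]):
        name = dataset["discipline"]
        entries.append(
            {
                "discipline": name,
                "file": f"{name}.csv",  # hypothetical naming pattern
                "mappingFile": f"{name}_mapping.json",  # hypothetical naming pattern
                "rows": dataset["rows"],
            }
        )
    return json.dumps(entries, indent=2)


manifest_json = build_manifest(
    [{"discipline": "fungi", "rows": 2}, {"discipline": "aves", "rows": 5}]
)
```

Sorting the entries keeps diffs between runs small, which helps when comparing generated output with the current manifest.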
Artifacts
After a full run, the project should contain:
- Downloaded source data.
- Prepared local working data.
- A local processed taxonomy SQL database.
- Discipline CSV outputs.
- Per-discipline mapping JSON files.
- A top-level manifest file for all generated taxon files.
These should then be uploaded to S3 and made available via https://files.specifysoftware.org/taxonfiles/beta/*