
feat: dataset import framework #1254

@mihow


Motivation

Antenna has a clean, extensible export framework (ami/exports/) with a registry, a BaseExporter abstraction, a DataExport model, a DataExportJob runner, an API endpoint, and a UI wizard. Imports are fragmented: four management commands in different apps, one direct-upload API endpoint, one S3-sync path, and at least three unmerged branches that each re-invent parts of the problem. There is no shared DataImport model, no registry, no API endpoint, no UI, and no agreed-upon file format.

The export framework is the obvious template. This ticket proposes mirroring it for imports, and asks for explicit decisions on the open design questions before anyone writes production code. It consolidates threads from #1187, #1208, #933, #746, #1015, #871 and the feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates branches.

Current state (what exists today)

Export framework (the model to mirror)

| Layer | File | Notes |
| --- | --- | --- |
| Registry | ami/exports/registry.py:4-30 | ExportRegistry.register("format_key")(ExporterClass) |
| Base class | ami/exports/base.py:10-74 | get_queryset(), export(), progress hook |
| Model | ami/exports/models.py:23-144 | DataExport — format, filters, file_url, record_count |
| Job runner | ami/jobs/models.py:710-744 | DataExportJob.run() → data_export.run_export() |
| API | ami/exports/views.py:13-87 | POST /api/v2/exports/ creates DataExport + Job, enqueues |
| UI | ui/src/pages/project/exports/exports.tsx | Polls Job progress; shows download link |
| Formats | ami/exports/format_types.py | JSONExporter, CSVExporter; DwC-A in-flight on feat/dwca-export |
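
For orientation, the registration mechanic in ami/exports/registry.py that this proposal mirrors is roughly the following shape (a paraphrase of the decorator pattern noted in the table, not the verbatim source):

```python
# Approximate shape of the existing ExportRegistry (paraphrased, not verbatim):
# a class-level mapping from format key to exporter class, filled by a decorator.


class ExportRegistry:
    _exporters: dict[str, type] = {}

    @classmethod
    def register(cls, format_key: str):
        def decorator(exporter_cls: type) -> type:
            cls._exporters[format_key] = exporter_cls
            return exporter_cls

        return decorator

    @classmethod
    def get(cls, format_key: str) -> type:
        return cls._exporters[format_key]


# Usage, per the table above:
# @ExportRegistry.register("csv")
# class CSVExporter(BaseExporter): ...
```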

Existing import code (fragmented)

| Path | What it imports | Trigger | Status |
| --- | --- | --- | --- |
| ami/main/management/commands/import_taxa.py | Taxonomy CSV/JSON/Google Sheets | CLI only | merged, active |
| ami/main/management/commands/import_source_images.py + Deployment.sync_captures() | S3 bucket image inventory | CLI + API + Celery (sync_source_images) | merged, active |
| ami/main/management/commands/import_trapdata_project.py | Legacy AMI Data Manager JSON | CLI only | merged, legacy |
| ami/main/management/commands/create_demo_project.py | Synthetic fixtures | CLI only | merged |
| SourceImageUploadViewSet (ami/main/api/views.py:848-901) + create_source_image_from_upload() (models.py:1683-1741) | Single-image direct upload | API + UI | merged, active |
| ami/exports/management/commands/import_pipeline_results.py (286 lines) | PipelineResultsResponse-shaped JSON | CLI | on feat/adc-importer, unmerged |
| ami-data-companion PR #82 | ADC → PipelineResultsResponse export | CLI (ADC side) | unmerged, unreviewed |

Gaps

  1. No DataImport model, no ImportRegistry, no BaseImporter, no DataImportJob.
  2. No REST API endpoint for uploading a dataset file and asking Antenna to ingest it.
  3. No UI entry point for "import a CSV/JSON/DwC-A I have on my laptop."
  4. Taxonomy import lives in ami.main; occurrence/detection import lives on feat/adc-importer under ami.exports. No canonical home.
  5. No shared answers for dry-run, conflict resolution, category-map handling, validation reporting, rollback.

Proposal

Mirror the export framework, one component at a time. The following is the minimum-viable shape; format-specific work is follow-on tickets.

Core framework (one PR)

  • ami/imports/ app (parallel to ami/exports/).
  • ImportRegistry — same decorator pattern.
  • BaseImporter — abstract validate(), import_records(), plus a progress hook matching BaseExporter.update_job_progress() (see the sketch after this list).
  • DataImport model — fields: user, project, format, source (file upload or URL or S3 pointer), options (JSON — dry_run, conflict_strategy, etc.), record_counts (created/updated/skipped/errored), error_report_url, file (FileField for uploaded artifacts, stored in imports/ on default_storage).
  • DataImportJob subclass of Job, matching DataExportJob.
  • POST /api/v2/imports/ — accepts multipart file upload or a URL/S3 reference; creates DataImport + Job; enqueues; returns job ID.
  • ImportViewSet with retrieve, list, and a validate action for dry-run.
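
A minimal sketch of how these pieces could fit together, mirroring the export side. All names come from the bullets above; field types, method signatures, and foreign-key targets are illustrative assumptions, not a final design:

```python
# Sketch only — mirrors ami/exports/. Names are from the proposal above;
# field types, FK targets, and signatures are assumptions.
from django.db import models


class ImportRegistry:
    """Same decorator pattern as ExportRegistry."""

    _importers: dict[str, type] = {}

    @classmethod
    def register(cls, format_key: str):
        def decorator(importer_cls: type) -> type:
            cls._importers[format_key] = importer_cls
            return importer_cls

        return decorator


class BaseImporter:
    """Abstract counterpart to BaseExporter."""

    def __init__(self, data_import: "DataImport"):
        self.data_import = data_import

    def validate(self) -> list[dict]:
        """Return structured row-level errors without mutating any records."""
        raise NotImplementedError

    def import_records(self) -> dict:
        """Ingest the file; return created/updated/skipped/errored counts."""
        raise NotImplementedError

    def update_job_progress(self, fraction: float) -> None:
        """Progress hook matching BaseExporter.update_job_progress()."""


class DataImport(models.Model):
    # Fields from the bullet above; types are plausible guesses.
    user = models.ForeignKey("users.User", on_delete=models.SET_NULL, null=True)  # assumed FK target
    project = models.ForeignKey("main.Project", on_delete=models.CASCADE)  # assumed FK target
    format = models.CharField(max_length=64)        # key into ImportRegistry
    source = models.CharField(max_length=1024, blank=True)  # URL or S3 pointer, if not a direct upload
    file = models.FileField(upload_to="imports/", null=True)  # uploaded artifact on default_storage
    options = models.JSONField(default=dict)        # dry_run, conflict_strategy, ...
    record_counts = models.JSONField(default=dict)  # created/updated/skipped/errored
    # Called error_report_url in the bullet above; open question 6 below
    # suggests a second FileField for the per-row error report instead.
    error_report = models.FileField(upload_to="imports/errors/", null=True)
```

(DataImportJob is omitted here; it would subclass Job and invoke importer.import_records() the same way DataExportJob.run() calls data_export.run_export().)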

UI (follow-on PR)

  • /projects/:id/imports page modeled on the exports page.
  • Upload wizard: pick format → upload file or paste URL → set options (dry-run, conflict strategy) → submit → poll job progress → see record-count summary and error report link.
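
Under the hood, the wizard would drive the endpoint proposed above. A hypothetical client-side flow, to make the contract concrete (the endpoint exists only in this proposal; response field names such as job.id and record_counts are assumptions):

```python
# Hypothetical flow against the proposed POST /api/v2/imports/ endpoint.
# The endpoint and the response shape are assumptions from this proposal.
import time

import requests

BASE = "https://antenna.example.org/api/v2"  # placeholder host

# 1. Create the import: multipart upload (the small-file case from question 2 below).
with open("taxa.csv", "rb") as f:
    resp = requests.post(
        f"{BASE}/imports/",
        files={"file": f},
        data={
            "project": 42,
            "format": "taxa_csv",
            "options": '{"dry_run": false, "conflict_strategy": "upsert"}',
        },
    )
resp.raise_for_status()
job_id = resp.json()["job"]["id"]  # assumed response shape

# 2. Poll the job, the same way the exports page does.
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}/").json()
    if job["status"] in ("SUCCESS", "FAILURE"):
        break
    time.sleep(2)

# 3. Show the record-count summary and error-report link in the wizard's last step.
print(job.get("record_counts"), job.get("error_report_url"))
```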

Format modules (one per PR, in priority order)

  1. taxa_csv — absorb the logic from import_taxa.py into a BaseImporter, keeping the management command as a thin wrapper that calls the same code path. Addresses #1187 (management command for taxonomy export/import, cross-environment sync), #933 (document how to import a species list), #746 (method for importing species-of-interest lists), and #1015 (make it possible for users to manage taxa and taxa lists at scale). A sketch of a format module in this shape follows the list.
  2. source_images_s3_inventory — formalize what Deployment.sync_captures() already does, so it shows up in the same UI and the same Job history as other imports. Addresses threads in #1160 (local filesystem storage backend for zero-copy image ingestion) and #467 (support for SFTP data sources).
  3. occurrences_pipeline_results_json — the feat/adc-importer format (PipelineResultsResponse). Reuses pipeline.save_results() for the actual record creation.
  4. occurrences_simple_csv — roundtrip with the existing CSV exporter. Lowest user-onboarding friction.
  5. taxa_reference_images_zip — cover images for #871 (import reference images for taxa).
  6. occurrences_dwca — GBIF DwC-A inbound. Requires the DwC-A export (feat/dwca-export) mapping spec to land first so the field contract is symmetric.
  7. project_bundle — for #1208 (project portability: UUID fields, template cloning, and cross-environment export/import). Needs the UUID work to precede it.
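
To make the format-module shape concrete, here is a hypothetical taxa_csv importer built on the core-framework sketch above. The column names, the upsert key, and the Taxon import are illustrative; the real logic would be absorbed from import_taxa.py:

```python
# Hypothetical taxa_csv module; assumes ImportRegistry/BaseImporter from the
# core-framework sketch above. Columns and upsert key are illustrative only.
import csv
import io


@ImportRegistry.register("taxa_csv")
class TaxaCSVImporter(BaseImporter):
    REQUIRED = {"name", "rank"}  # assumed columns, not the real contract

    def _reader(self) -> csv.DictReader:
        # Buffer the uploaded FileField contents; a real implementation
        # would stream large files instead.
        self.data_import.file.seek(0)
        text = self.data_import.file.read().decode("utf-8")
        return csv.DictReader(io.StringIO(text))

    def validate(self) -> list[dict]:
        reader = self._reader()
        if not self.REQUIRED <= set(reader.fieldnames or []):
            return [{"row": 1, "error": f"missing columns: {sorted(self.REQUIRED)}"}]
        return [
            {"row": i, "error": "empty name"}
            for i, row in enumerate(reader, start=2)
            if not row["name"].strip()
        ]

    def import_records(self) -> dict:
        from ami.main.models import Taxon  # assumes the existing taxonomy model

        counts = {"created": 0, "updated": 0, "skipped": 0, "errored": 0}
        for row in self._reader():
            # Per-format default from question 3 below: taxa upsert by name.
            _, created = Taxon.objects.update_or_create(
                name=row["name"].strip(),
                defaults={"rank": row["rank"].strip()},
            )
            counts["created" if created else "updated"] += 1
        return counts
```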

Open design questions (decide before writing code)

These are the implicit decisions the unmerged branches each made differently. None has been discussed as a team. Flag your position on each.

  1. Is an import primarily round-trip with export, or onboarding of third-party data? Different answers lead to different canonical formats. Roundtrip favours JSON mirroring PipelineResultsResponse; onboarding favours DwC-A + flat CSV. These are not mutually exclusive, but the priority ordering in the format list above assumes onboarding first. Challenge this if you disagree.
  2. Upload mechanism: multipart body, signed-URL handoff, or S3 pointer? Large DwC-A archives and image ZIPs can be >1 GB. Multipart through the Django app is the wrong answer above some threshold. Proposal: multipart up to ~100 MB; above that, DataImport.source is an S3 URI the user has uploaded to directly (same pattern as storage sources already use).
  3. Conflict resolution strategy per-format or per-import? Proposal: per-format default (e.g. taxa upsert by name; occurrences create-only unless an external stable ID is provided), overridable per-import via options.
  4. Category maps for detection/classification imports. ami-data-companion PR #82 punted on this. Proposal: require the import to either (a) reference an existing Algorithm + CategoryMap by name/version, or (b) include inline category-map definitions in the payload. Reject the import if neither is present.
  5. Dry-run / validate-before-commit. Proposal: a validate action on the viewset runs BaseImporter.validate() against the file and returns a structured error report without mutating state. UI surfaces errors before the user confirms. (This action is sketched after this list.)
  6. Error reporting artifact. When an import partially fails, we need a downloadable per-row error report (CSV of row_number + error). Store as a second FileField on DataImport.
  7. Rollback. Proposal: atomic transaction per import is tempting but conflicts with Celery + chunked progress updates. Instead: design formats to be idempotent (re-running the same import produces the same state) and document that partial imports are recoverable by fixing the source and re-submitting.
  8. Permission model. Imports mutate project data; probably gate by change_project (or a new import_data) guardian permission.
  9. Canonical home for the app. The feat/adc-importer branch put import_pipeline_results.py under ami/exports/ — reasonable if you see export/import as one app, questionable if you don't. Proposal: new ami/imports/ app, with ami/exports/ untouched. Happy to hear the case for merging.
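
For question 5, the validate action could be as small as the following DRF sketch (it assumes the DataImport model, ImportRegistry, and a DataImportSerializer from the sketches above; none of these exist yet):

```python
# Sketch of the dry-run endpoint from question 5. DataImport, ImportRegistry,
# and DataImportSerializer are the hypothetical pieces sketched earlier.
from rest_framework import viewsets
from rest_framework.decorators import action
from rest_framework.response import Response


class ImportViewSet(viewsets.ModelViewSet):
    queryset = DataImport.objects.all()
    serializer_class = DataImportSerializer  # hypothetical serializer

    @action(detail=True, methods=["post"])
    def validate(self, request, pk=None):
        data_import = self.get_object()
        importer_cls = ImportRegistry._importers[data_import.format]
        errors = importer_cls(data_import).validate()  # read-only; nothing mutated
        return Response({"valid": not errors, "errors": errors})
```

POST /api/v2/imports/&lt;id&gt;/validate/ would then back the dry-run step in the upload wizard before the user confirms the real import.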

What this ticket is NOT

Not the format modules themselves, and not a commitment to any single canonical format: each format in the list above lands as its own follow-on PR, and no production code should be written until the open questions are settled.

Suggested next step

A short design review (async doc or 30-min call) to lock in answers to the nine open questions above, then a first PR that lands the core framework (no format modules yet) so subsequent format PRs are mechanical.

Related

#1187, #1208, #933, #746, #1015, #871; branches feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates; ami-data-companion PR #82.
