
feat: dataset import framework #1254

@mihow


Motivation

Antenna has a clean, extensible export framework (ami/exports/) with a registry, a BaseExporter abstraction, a DataExport model, a DataExportJob runner, an API endpoint, and a UI wizard. Imports are fragmented: four management commands in different apps, one direct-upload API endpoint, one S3-sync path, and at least three unmerged branches that each re-invent parts of the problem. There is no shared DataImport model, no registry, no API endpoint, no UI, and no agreed-upon file format.

The export framework is the obvious template. This ticket proposes mirroring it for imports, and asks for explicit decisions on the open design questions before anyone writes production code. It consolidates threads from #1187, #1208, #933, #746, #1015, #871 and the feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates branches.

Current state (what exists today)

Export framework (the model to mirror)

| Layer | File | Notes |
| --- | --- | --- |
| Registry | ami/exports/registry.py:4-30 | ExportRegistry.register("format_key")(ExporterClass) |
| Base class | ami/exports/base.py:10-74 | get_queryset(), export(), progress hook |
| Model | ami/exports/models.py:23-144 | DataExport — format, filters, file_url, record_count |
| Job runner | ami/jobs/models.py:710-744 | DataExportJob.run() → data_export.run_export() |
| API | ami/exports/views.py:13-87 | POST /api/v2/exports/ creates DataExport + Job, enqueues |
| UI | ui/src/pages/project/exports/exports.tsx | Polls Job progress; shows download link |
| Formats | ami/exports/format_types.py | JSONExporter, CSVExporter; DwC-A in-flight on feat/dwca-export |
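
For orientation, the registration mechanic in ami/exports/registry.py that this proposal mirrors is roughly the following shape (a paraphrase of the decorator pattern noted in the table, not the verbatim source):

```python
# Approximate shape of the existing ExportRegistry (paraphrased, not verbatim):
# a class-level mapping from format key to exporter class, filled by a decorator.


class ExportRegistry:
    _exporters: dict[str, type] = {}

    @classmethod
    def register(cls, format_key: str):
        def decorator(exporter_cls: type) -> type:
            cls._exporters[format_key] = exporter_cls
            return exporter_cls

        return decorator

    @classmethod
    def get(cls, format_key: str) -> type:
        return cls._exporters[format_key]


# Usage, per the table above:
# @ExportRegistry.register("csv")
# class CSVExporter(BaseExporter): ...
```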

Existing import code (fragmented)

| Path | What it imports | Trigger | Status |
| --- | --- | --- | --- |
| ami/main/management/commands/import_taxa.py | Taxonomy CSV/JSON/Google Sheets | CLI only | merged, active |
| ami/main/management/commands/import_source_images.py + Deployment.sync_captures() | S3 bucket image inventory | CLI + API + Celery (sync_source_images) | merged, active |
| ami/main/management/commands/import_trapdata_project.py | Legacy AMI Data Manager JSON | CLI only | merged, legacy |
| ami/main/management/commands/create_demo_project.py | Synthetic fixtures | CLI only | merged |
| SourceImageUploadViewSet (ami/main/api/views.py:848-901) + create_source_image_from_upload() (models.py:1683-1741) | Single-image direct upload | API + UI | merged, active |
| ami/exports/management/commands/import_pipeline_results.py (286 lines) | PipelineResultsResponse-shaped JSON | CLI | on feat/adc-importer, unmerged |
| ami-data-companion PR #82 | ADC → PipelineResultsResponse export | CLI (ADC side) | unmerged, unreviewed |

Gaps

  1. No DataImport model, no ImportRegistry, no BaseImporter, no DataImportJob.
  2. No REST API endpoint for uploading a dataset file and asking Antenna to ingest it.
  3. No UI entry point for "import a CSV/JSON/DwC-A I have on my laptop."
  4. Taxonomy import lives in ami.main; occurrence/detection import lives on feat/adc-importer under ami.exports. No canonical home.
  5. No shared answers for dry-run, conflict resolution, category-map handling, validation reporting, rollback.

Proposal

Mirror the export framework, one component at a time. The following is the minimum-viable shape; format-specific work is follow-on tickets.

Core framework (one PR)

  • ami/imports/ app (parallel to ami/exports/).
  • ImportRegistry — same decorator pattern.
  • BaseImporter — abstract validate(), import_records(), plus a progress hook matching BaseExporter.update_job_progress() (see the sketch after this list).
  • DataImport model — fields: user, project, format, source (file upload or URL or S3 pointer), options (JSON — dry_run, conflict_strategy, etc.), record_counts (created/updated/skipped/errored), error_report_url, file (FileField for uploaded artifacts, stored in imports/ on default_storage).
  • DataImportJob subclass of Job, matching DataExportJob.
  • POST /api/v2/imports/ — accepts multipart file upload or a URL/S3 reference; creates DataImport + Job; enqueues; returns job ID.
  • ImportViewSet with retrieve, list, and a validate action for dry-run.
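
A minimal sketch of how these pieces could fit together, mirroring the export side. All names come from the bullets above; field types, method signatures, and foreign-key targets are illustrative assumptions, not a final design:

```python
# Sketch only — mirrors ami/exports/. Names are from the proposal above;
# field types, FK targets, and signatures are assumptions.
from django.db import models


class ImportRegistry:
    """Same decorator pattern as ExportRegistry."""

    _importers: dict[str, type] = {}

    @classmethod
    def register(cls, format_key: str):
        def decorator(importer_cls: type) -> type:
            cls._importers[format_key] = importer_cls
            return importer_cls

        return decorator


class BaseImporter:
    """Abstract counterpart to BaseExporter."""

    def __init__(self, data_import: "DataImport"):
        self.data_import = data_import

    def validate(self) -> list[dict]:
        """Return structured row-level errors without mutating any records."""
        raise NotImplementedError

    def import_records(self) -> dict:
        """Ingest the file; return created/updated/skipped/errored counts."""
        raise NotImplementedError

    def update_job_progress(self, fraction: float) -> None:
        """Progress hook matching BaseExporter.update_job_progress()."""


class DataImport(models.Model):
    # Fields from the bullet above; types are plausible guesses.
    user = models.ForeignKey("users.User", on_delete=models.SET_NULL, null=True)  # assumed FK target
    project = models.ForeignKey("main.Project", on_delete=models.CASCADE)  # assumed FK target
    format = models.CharField(max_length=64)        # key into ImportRegistry
    source = models.CharField(max_length=1024, blank=True)  # URL or S3 pointer, if not a direct upload
    file = models.FileField(upload_to="imports/", null=True)  # uploaded artifact on default_storage
    options = models.JSONField(default=dict)        # dry_run, conflict_strategy, ...
    record_counts = models.JSONField(default=dict)  # created/updated/skipped/errored
    # Called error_report_url in the bullet above; open question 6 below
    # suggests a second FileField for the per-row error report instead.
    error_report = models.FileField(upload_to="imports/errors/", null=True)
```

(DataImportJob is omitted here; it would subclass Job and invoke importer.import_records() the same way DataExportJob.run() calls data_export.run_export().)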

UI (follow-on PR)

  • /projects/:id/imports page modeled on the exports page.
  • Upload wizard: pick format → upload file or paste URL → set options (dry-run, conflict strategy) → submit → poll job progress → see record-count summary and error report link.
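
Under the hood, the wizard would drive the endpoint proposed above. A hypothetical client-side flow, to make the contract concrete (the endpoint exists only in this proposal; response field names such as job.id and record_counts are assumptions):

```python
# Hypothetical flow against the proposed POST /api/v2/imports/ endpoint.
# The endpoint and the response shape are assumptions from this proposal.
import time

import requests

BASE = "https://antenna.example.org/api/v2"  # placeholder host

# 1. Create the import: multipart upload (the small-file case from question 2 below).
with open("taxa.csv", "rb") as f:
    resp = requests.post(
        f"{BASE}/imports/",
        files={"file": f},
        data={
            "project": 42,
            "format": "taxa_csv",
            "options": '{"dry_run": false, "conflict_strategy": "upsert"}',
        },
    )
resp.raise_for_status()
job_id = resp.json()["job"]["id"]  # assumed response shape

# 2. Poll the job, the same way the exports page does.
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}/").json()
    if job["status"] in ("SUCCESS", "FAILURE"):
        break
    time.sleep(2)

# 3. Show the record-count summary and error-report link in the wizard's last step.
print(job.get("record_counts"), job.get("error_report_url"))
```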

Format modules (one per PR, in priority order)

  1. taxa_csv — absorb the logic from import_taxa.py into a BaseImporter, keeping the management command as a thin wrapper that calls the same code path. Addresses #1187 (management command for taxonomy export/import, cross-environment sync), #933 (document how to import a species list), #746 (method for importing species-of-interest lists), and #1015 (make it possible for users to manage taxa and taxa lists at scale). A sketch of a format module in this shape follows the list.
  2. source_images_s3_inventory — formalize what Deployment.sync_captures() already does, so it shows up in the same UI and the same Job history as other imports. Addresses threads in #1160 (local filesystem storage backend for zero-copy image ingestion) and #467 (support for SFTP data sources).
  3. occurrences_pipeline_results_json — the feat/adc-importer format (PipelineResultsResponse). Reuses pipeline.save_results() for the actual record creation.
  4. occurrences_simple_csv — roundtrip with the existing CSV exporter. Lowest user-onboarding friction.
  5. taxa_reference_images_zip — cover images for #871 (import reference images for taxa).
  6. occurrences_dwca — GBIF DwC-A inbound. Requires the DwC-A export (feat/dwca-export) mapping spec to land first so the field contract is symmetric.
  7. project_bundle — for #1208 (project portability: UUID fields, template cloning, and cross-environment export/import). Needs the UUID work to precede it.
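
To make the format-module shape concrete, here is a hypothetical taxa_csv importer built on the core-framework sketch above. The column names, the upsert key, and the Taxon import are illustrative; the real logic would be absorbed from import_taxa.py:

```python
# Hypothetical taxa_csv module; assumes ImportRegistry/BaseImporter from the
# core-framework sketch above. Columns and upsert key are illustrative only.
import csv
import io


@ImportRegistry.register("taxa_csv")
class TaxaCSVImporter(BaseImporter):
    REQUIRED = {"name", "rank"}  # assumed columns, not the real contract

    def _reader(self) -> csv.DictReader:
        # Buffer the uploaded FileField contents; a real implementation
        # would stream large files instead.
        self.data_import.file.seek(0)
        text = self.data_import.file.read().decode("utf-8")
        return csv.DictReader(io.StringIO(text))

    def validate(self) -> list[dict]:
        reader = self._reader()
        if not self.REQUIRED <= set(reader.fieldnames or []):
            return [{"row": 1, "error": f"missing columns: {sorted(self.REQUIRED)}"}]
        return [
            {"row": i, "error": "empty name"}
            for i, row in enumerate(reader, start=2)
            if not row["name"].strip()
        ]

    def import_records(self) -> dict:
        from ami.main.models import Taxon  # assumes the existing taxonomy model

        counts = {"created": 0, "updated": 0, "skipped": 0, "errored": 0}
        for row in self._reader():
            # Per-format default from question 3 below: taxa upsert by name.
            _, created = Taxon.objects.update_or_create(
                name=row["name"].strip(),
                defaults={"rank": row["rank"].strip()},
            )
            counts["created" if created else "updated"] += 1
        return counts
```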

Open design questions (decide before writing code)

These are the implicit decisions the unmerged branches each made differently. None has been discussed as a team. Flag your position on each.

  1. Is an import primarily round-trip with export, or onboarding of third-party data? Different answers lead to different canonical formats. Roundtrip favours JSON mirroring PipelineResultsResponse; onboarding favours DwC-A + flat CSV. These are not mutually exclusive, but the priority ordering in the format list above assumes onboarding first. Challenge this if you disagree.
  2. Upload mechanism: multipart body, signed-URL handoff, or S3 pointer? Large DwC-A archives and image ZIPs can be >1 GB. Multipart through the Django app is the wrong answer above some threshold. Proposal: multipart up to ~100 MB; above that, DataImport.source is an S3 URI the user has uploaded to directly (same pattern as storage sources already use).
  3. Conflict resolution strategy per-format or per-import? Proposal: per-format default (e.g. taxa upsert by name; occurrences create-only unless an external stable ID is provided), overridable per-import via options.
  4. Category maps for detection/classification imports. ami-data-companion PR #82 punted on this. Proposal: require the import to either (a) reference an existing Algorithm + CategoryMap by name/version, or (b) include inline category-map definitions in the payload. Reject the import if neither is present.
  5. Dry-run / validate-before-commit. Proposal: a validate action on the viewset runs BaseImporter.validate() against the file and returns a structured error report without mutating state. UI surfaces errors before the user confirms. (This action is sketched after this list.)
  6. Error reporting artifact. When an import partially fails, we need a downloadable per-row error report (CSV of row_number + error). Store as a second FileField on DataImport.
  7. Rollback. Proposal: atomic transaction per import is tempting but conflicts with Celery + chunked progress updates. Instead: design formats to be idempotent (re-running the same import produces the same state) and document that partial imports are recoverable by fixing the source and re-submitting.
  8. Permission model. Imports mutate project data; probably gate by change_project (or a new import_data) guardian permission.
  9. Canonical home for the app. The feat/adc-importer branch put import_pipeline_results.py under ami/exports/ — reasonable if you see export/import as one app, questionable if you don't. Proposal: new ami/imports/ app, with ami/exports/ untouched. Happy to hear the case for merging.
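
For question 5, the validate action could be as small as the following DRF sketch (it assumes the DataImport model, ImportRegistry, and a DataImportSerializer from the sketches above; none of these exist yet):

```python
# Sketch of the dry-run endpoint from question 5. DataImport, ImportRegistry,
# and DataImportSerializer are the hypothetical pieces sketched earlier.
from rest_framework import viewsets
from rest_framework.decorators import action
from rest_framework.response import Response


class ImportViewSet(viewsets.ModelViewSet):
    queryset = DataImport.objects.all()
    serializer_class = DataImportSerializer  # hypothetical serializer

    @action(detail=True, methods=["post"])
    def validate(self, request, pk=None):
        data_import = self.get_object()
        importer_cls = ImportRegistry._importers[data_import.format]
        errors = importer_cls(data_import).validate()  # read-only; nothing mutated
        return Response({"valid": not errors, "errors": errors})
```

POST /api/v2/imports/&lt;id&gt;/validate/ would then back the dry-run step in the upload wizard before the user confirms the real import.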

What this ticket is NOT

Not the format modules themselves, and not a commitment to any single canonical format: each format in the list above lands as its own follow-on PR, and no production code should be written until the open questions are settled.

Suggested next step

A short design review (async doc or 30-min call) to lock in answers to the nine open questions above, then a first PR that lands the core framework (no format modules yet) so subsequent format PRs are mechanical.

Related

#1187, #1208, #933, #746, #1015, #871; branches feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates; ami-data-companion PR #82.
