Motivation
Antenna has a clean, extensible export framework (ami/exports/) with a registry, a BaseExporter abstraction, a DataExport model, a DataExportJob runner, an API endpoint, and a UI wizard. Imports are fragmented: four management commands in different apps, one direct-upload API endpoint, one S3-sync path, and at least three unmerged branches that each re-invent parts of the problem. There is no shared DataImport model, no registry, no API endpoint, no UI, and no agreed-upon file format.
The export framework is the obvious template. This ticket proposes mirroring it for imports, and asks for explicit decisions on the open design questions before anyone writes production code. It consolidates threads from #1187, #1208, #933, #746, #1015, #871 and the feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates branches.

Current state (what exists today)

Export framework (the model to mirror)

ami/exports/registry.py:4-30 — ExportRegistry.register("format_key")(ExporterClass)
ami/exports/base.py:10-74 — get_queryset(), export(), progress hook
ami/exports/models.py:23-144 — DataExport: format, filters, file_url, record_count
ami/jobs/models.py:710-744 — DataExportJob.run() → data_export.run_export()
ami/exports/views.py:13-87 — POST /api/v2/exports/ creates DataExport + Job, enqueues
ui/src/pages/project/exports/exports.tsx
ami/exports/format_types.py — JSONExporter, CSVExporter; DwC-A in-flight on feat/dwca-export

Existing import code (fragmented)

ami/main/management/commands/import_taxa.py
ami/main/management/commands/import_source_images.py → Deployment.sync_captures() (sync_source_images)
ami/main/management/commands/import_trapdata_project.py
ami/main/management/commands/create_demo_project.py
SourceImageUploadViewSet (ami/main/api/views.py:848-901) + create_source_image_from_upload() (models.py:1683-1741)
ami/exports/management/commands/import_pipeline_results.py (286 lines) — ingests PipelineResultsResponse-shaped JSON; lives on feat/adc-importer, unmerged; mirrors the PipelineResultsResponse export

Gaps
No DataImport model, no ImportRegistry, no BaseImporter, no DataImportJob.
No REST API endpoint for uploading a dataset file and asking Antenna to ingest it.
No UI entry point for "import a CSV/JSON/DwC-A I have on my laptop."
Taxonomy import lives in ami.main; occurrence/detection import lives on feat/adc-importer under ami.exports. No canonical home.
No shared answers for dry-run, conflict resolution, category-map handling, validation reporting, rollback.
Proposal
Mirror the export framework, one component at a time. The following is the minimum-viable shape; format-specific work is follow-on tickets.
Core framework (one PR)
ami/imports/ app (parallel to ami/exports/).
ImportRegistry — same decorator pattern.
BaseImporter — abstract validate(), import_records(), plus a progress hook matching BaseExporter.update_job_progress().
DataImport model — fields: user, project, format, source (file upload or URL or S3 pointer), options (JSON — dry_run, conflict_strategy, etc.), record_counts (created/updated/skipped/errored), error_report_url, file (FileField for uploaded artifacts, stored in imports/ on default_storage).
DataImportJob subclass of Job, matching DataExportJob.
POST /api/v2/imports/ — accepts multipart file upload or a URL/S3 reference; creates DataImport + Job; enqueues; returns job ID.
ImportViewSet with retrieve, list, and a validate action for dry-run.
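To make the shape concrete, here is a minimal sketch of how the ImportRegistry and BaseImporter pieces could fit together, mirroring the existing ExportRegistry decorator pattern. Everything beyond the names already listed above (method bodies, signatures, the example format class) is an assumption for illustration, not a final API:

```python
# Hypothetical sketch of the proposed ImportRegistry + BaseImporter pair,
# mirroring the existing ExportRegistry decorator pattern. Names beyond the
# ticket's own are illustrative assumptions.
from abc import ABC, abstractmethod


class ImportRegistry:
    _importers = {}

    @classmethod
    def register(cls, format_key):
        def decorator(importer_cls):
            cls._importers[format_key] = importer_cls
            return importer_cls
        return decorator

    @classmethod
    def get(cls, format_key):
        return cls._importers[format_key]


class BaseImporter(ABC):
    def __init__(self, source, options=None):
        self.source = source          # uploaded file, URL, or S3 pointer
        self.options = options or {}  # dry_run, conflict_strategy, etc.

    @abstractmethod
    def validate(self):
        """Return structured errors without mutating any state (dry-run)."""

    @abstractmethod
    def import_records(self):
        """Ingest records; return created/updated/skipped/errored counts."""

    def update_job_progress(self, done, total):
        """Progress hook, matching BaseExporter.update_job_progress()."""


@ImportRegistry.register("taxa_csv")
class TaxaCSVImporter(BaseImporter):
    def validate(self):
        return []

    def import_records(self):
        return {"created": 0, "updated": 0, "skipped": 0, "errored": 0}
```

The registry lookup is what the proposed POST /api/v2/imports/ handler would use to dispatch on the request's format key.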
UI (follow-on PR)
/projects/:id/imports page modeled on the exports page.
Upload wizard: pick format → upload file or paste URL → set options (dry-run, conflict strategy) → submit → poll job progress → see record-count summary and error report link.
Format modules (one per PR, in priority order)

taxa_csv — absorb the logic from import_taxa.py into a BaseImporter. Keeps the management command as a thin wrapper that calls the same code path. Addresses #1187, #933, #746, #1015.
source_images_s3_inventory — formalize what Deployment.sync_captures() already does, so it shows up in the same UI and the same Job history as other imports. Addresses threads in #1160, #467.
occurrences_pipeline_results_json — the feat/adc-importer format (PipelineResultsResponse). Reuses pipeline.save_results() for the actual record creation.
occurrences_simple_csv — roundtrip with the existing CSV exporter. Lowest user-onboarding friction.
taxa_reference_images_zip — cover images for #871.
occurrences_dwca — GBIF DwC-A inbound. Requires the DwC-A export (feat/dwca-export) mapping spec to land first so the field contract is symmetric.
project_bundle — for #1208 (cross-environment project portability). Needs the UUID work to precede it.
Open design questions (decide before writing code)
These are the implicit decisions the unmerged branches each made differently. None has been discussed as a team. Flag your position on each.
Is an import primarily round-trip with export, or onboarding of third-party data? Different answers lead to different canonical formats. Roundtrip favours JSON mirroring PipelineResultsResponse; onboarding favours DwC-A + flat CSV. These are not mutually exclusive, but the priority ordering in the format list above assumes onboarding first. Challenge this if you disagree.
Upload mechanism: multipart body, signed-URL handoff, or S3 pointer? Large DwC-A archives and image ZIPs can be >1 GB. Multipart through the Django app is the wrong answer above some threshold. Proposal: multipart up to ~100 MB; above that, DataImport.source is an S3 URI the user has uploaded to directly (same pattern as storage sources already use).
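A minimal sketch of that routing rule, assuming the ~100 MB cutoff proposed above; the function name, return values, and error messages are placeholders:

```python
# Sketch of the proposed upload routing: multipart bodies below a size
# threshold, direct-to-S3 URIs above it. Limit and names are assumptions.
MULTIPART_LIMIT_BYTES = 100 * 1024 * 1024  # ~100 MB, to be agreed


def resolve_source(upload_size_bytes=None, s3_uri=None):
    """Decide how DataImport.source should be populated for one request."""
    if s3_uri is not None:
        # User uploaded directly to the bucket; we only store the pointer.
        return ("s3", s3_uri)
    if upload_size_bytes is None:
        raise ValueError("request needs either a file upload or an S3 URI")
    if upload_size_bytes > MULTIPART_LIMIT_BYTES:
        raise ValueError("file too large for multipart; upload to S3 and pass the URI")
    return ("multipart", None)
```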
Conflict resolution strategy per-format or per-import? Proposal: per-format default (e.g. taxa upsert by name; occurrences create-only unless an external stable ID is provided), overridable per-import via options.
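The override mechanism could be as small as this sketch; the format keys and strategy names are illustrative, not decided:

```python
# Sketch of per-format conflict defaults overridable per-import via the
# proposed options JSON blob on DataImport. Names are assumptions.
FORMAT_DEFAULTS = {
    "taxa_csv": "upsert_by_name",
    "occurrences_simple_csv": "create_only",
}


def conflict_strategy(format_key, options):
    """Per-import override wins; otherwise fall back to the format default."""
    return options.get("conflict_strategy",
                       FORMAT_DEFAULTS.get(format_key, "create_only"))
```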
Category maps for detection/classification imports. PR #82 punted on this. Proposal: require the import to either (a) reference an existing Algorithm + CategoryMap by name/version, or (b) include inline category map definitions in the payload. Reject the import if neither is present.
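A sketch of that rule as a validation step; the payload shape and return values are assumptions for illustration:

```python
# Sketch of the proposed category-map rule for detection/classification
# imports: reference a known Algorithm + CategoryMap, include one inline,
# or be rejected. The payload shape is an assumption.
def check_category_map(payload, known_algorithms):
    """known_algorithms: set of (name, version) pairs already in the database."""
    if payload.get("category_map"):
        return "inline"
    ref = payload.get("algorithm") or {}
    if (ref.get("name"), ref.get("version")) in known_algorithms:
        return "referenced"
    raise ValueError("import must reference a known Algorithm/CategoryMap "
                     "or include one inline")
```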
Dry-run / validate-before-commit. Proposal: validate action on the viewset runs BaseImporter.validate() against the file and returns a structured error report without mutating state. UI surfaces errors before the user confirms.
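As a sketch of the structured report a dry-run could return — one entry per failing row, nothing written — assuming a hypothetical taxa CSV with a required name column:

```python
# Sketch of what a dry-run validate() could return: structured per-row
# errors, no writes. The CSV columns are assumptions for illustration.
import csv
import io


def validate_taxa_csv(text):
    """Return a list of {"row_number", "error"} dicts; mutate nothing."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if not (row.get("name") or "").strip():
            errors.append({"row_number": row_number, "error": "missing name"})
    return errors
```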
Error reporting artifact. When an import partially fails, we need a downloadable per-row error report (CSV of row_number + error). Store as a second FileField on DataImport.
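Serializing that report is mechanical; a sketch of producing the CSV artifact to attach to the second FileField (function name assumed):

```python
# Sketch of serializing the per-row error report as the downloadable CSV
# artifact (row_number + error) proposed above.
import csv
import io


def error_report_csv(errors):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["row_number", "error"])
    writer.writeheader()
    writer.writerows(errors)
    return buf.getvalue()
```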
Rollback. Proposal: atomic transaction per import is tempting but conflicts with Celery + chunked progress updates. Instead: design formats to be idempotent (re-running the same import produces the same state) and document that partial imports are recoverable by fixing the source and re-submitting.
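The idempotency property can be sketched in a few lines — key records on a stable external ID so re-running the same import converges rather than duplicating; the in-memory dict stands in for the database:

```python
# Sketch of the idempotency proposal: re-running the same import produces
# the same state. The dict is a stand-in for the database; names assumed.
def idempotent_import(store, records):
    counts = {"created": 0, "updated": 0}
    for record in records:
        key = record["external_id"]  # stable ID supplied by the source
        counts["updated" if key in store else "created"] += 1
        store[key] = record
    return counts
```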
Permission model. Imports mutate project data; probably gate by change_project (or a new import_data) guardian permission.
Canonical home for the app. The feat/adc-importer branch put import_pipeline_results.py under ami/exports/ — reasonable if you see export/import as one app, questionable if you don't. Proposal: new ami/imports/ app, with ami/exports/ untouched. Happy to hear the case for merging.
What this ticket is NOT

Not a database dump/restore — intentionally application-level.
Not a rewrite of Deployment.sync_captures() / SourceImageUploadViewSet — those stay; format #2 above just gives them a shared surface.
Suggested next step
A short design review (async doc or 30-min call) to lock in answers to the nine open questions above, then a first PR that lands the core framework (no format modules yet) so subsequent format PRs are mechanical.
Related
feat/adc-importer, feat/dwca-export, feat/inventory-import, feat/import-cover-images, feat/taxa-import-updates, fix/species-import