
Clarify and document tuning parameters #138

@mihow

Summary

The ami worker process OOMs when processing jobs that contain large source images. On a test worker with 70 GB system RAM and a 24 GB VRAM GPU slice, processing a job of ~914 source images at ~24 MB per JPEG, the worker consumed ~68 GB of RAM in roughly 40 seconds before being OOM-killed by the init system. Based on reading the ingest code and a single successful workaround deployment, we think this is primarily a config/defaults/documentation problem rather than a new leak or a code bug, but more investigation and controlled testing is needed before committing to any particular fix. This issue is mainly to (a) document what we saw, (b) enumerate the tuning knobs the worker already honors, (c) make the currently-hardcoded DataLoader knobs configurable, and (d) open a discussion about auto-tuning strategies.

What we originally thought

Our first theory was the same class of problem as #118 — a memory leak in the worker loop. #118 was a real leak that was fixed and deployed earlier. The behavior we saw here (OOM kills, DataLoader workers growing to 20+ GB RSS, cascading NATS max_deliver failures) looked identical to #118, which sent us down the wrong path for a while.

What seems to actually be happening is different: memory grows to its operating ceiling in seconds on startup, not over hours. It's the steady-state peak that appears to be the problem, and that peak is determined by (1) per-image tensor size, (2) antenna_api_batch_size, (3) prefetch_factor, and (4) pin_memory. When per-image size gets large, the math stops working. We haven't ruled out that there is also a leak on top of this — more careful instrumentation would be needed to say for sure.

Reproduction

  • Job characteristics: ~914 source images, ~24 MB JPEGs, roughly 4096×3072 (unusually large — most jobs in this deployment are ~1.5 MB images)
  • Worker host: 24 GB VRAM GPU slice, 70 GB system RAM
  • Branch: includes the fix for #118 (Memory leak in worker batch processing causes OOM)

Symptom timeline (single observation):

```
T+0s    supervisor spawns ami worker
T+43s   init system OOM-kills the supervisor cgroup
        Mem peak: ~68 GB on a 68 GB instance
```

We also initially saw OSError(24, 'Too many open files') from multiprocessing.Pipe's os.pipe() call during DataLoader startup. That turned out to be an unrelated infrastructure problem — the default soft FD ulimit on many distros is 1024, and the worker legitimately needs ~1000+ FDs at steady state (SSL sockets × ThreadPoolExecutor, DataLoader IPC pipes, in-flight image downloads). Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. Not an ADC bug, but worth mentioning in the deployment docs as an ops requirement.
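
Where raising the limit at the supervisor/systemd level isn't an option, the soft limit can also be raised from inside the process at startup. A minimal sketch (Unix only; the 65536 target mirrors the value used in the workaround later in this issue):

```python
import resource

def raise_fd_soft_limit(target: int = 65536) -> int:
    """Raise the soft open-file limit toward the hard limit (Unix only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Unprivileged processes may raise the soft limit, but never above the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Doing it in-process only helps the worker itself; the systemd/supervisor route is still needed if child processes are spawned before this runs.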

Known configuration parameters

All env vars use the AMI_ prefix (pydantic env_prefix = "ami_"). Current state of everything the worker honors:

Core paths / database

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_DATABASE_URL` | SQLite at `AMI_USER_DATA_PATH/trapdata.db` | SQLAlchemy connection string. Supports PostgreSQL. For API workers, only used for local state. |
| `AMI_USER_DATA_PATH` | platform-specific app dir | Where model weights, thumbnails, reports, and the default SQLite DB live. Must exist before worker start or SQLite DB creation fails. |
| `AMI_IMAGE_BASE_PATH` | unset | Root folder for local-filesystem image workflows. Not used by the Antenna API worker path. |

Model selection

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_LOCALIZATION_MODEL` | `DEFAULT_OBJECT_DETECTOR` | Which localization model (enum in `trapdata.ml.models`) |
| `AMI_BINARY_CLASSIFICATION_MODEL` | `DEFAULT_BINARY_CLASSIFIER` | Moth / non-moth filter |
| `AMI_SPECIES_CLASSIFICATION_MODEL` | `DEFAULT_SPECIES_CLASSIFIER` | Fine-grained species classifier |
| `AMI_FEATURE_EXTRACTOR` | `DEFAULT_FEATURE_EXTRACTOR` | Feature extractor for downstream models |
| `AMI_CLASSIFICATION_THRESHOLD` | 0.6 | Minimum classification confidence to emit a detection |

Inference batch sizes (model-level, not ingest-level)

| Env var | Default | What it actually controls |
| --- | --- | --- |
| `AMI_LOCALIZATION_BATCH_SIZE` | 8 | Images per localization (detector) forward pass. Affects VRAM during detection and model-level throughput. Does not limit how much data is loaded into RAM or how many tasks are fetched from the API. |
| `AMI_CLASSIFICATION_BATCH_SIZE` | 20 | Crops per classifier forward pass. Same caveat: a model-side knob, not an ingest-side knob. |
| `AMI_NUM_WORKERS` | 4 | Number of AMI worker instances (one per GPU, via `torch.multiprocessing.spawn`). Also reused as the DataLoader's `num_workers` inside each instance. This reuse is confusing and probably worth separating. |

Antenna API worker settings

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_ANTENNA_API_BASE_URL` | `http://localhost:8000/api/v2` | Base URL for the Antenna API |
| `AMI_ANTENNA_API_AUTH_TOKEN` | empty, required | Auth token for the worker's identity |
| `AMI_ANTENNA_SERVICE_NAME` | "AMI Data Companion" | Human-readable name shown in the ProcessingService admin |
| `AMI_ANTENNA_API_BATCH_SIZE` | 24 | Tasks per POST fetch from `/api/v2/jobs/{id}/tasks/`. This appears to be the single biggest lever on peak ingest memory. Currently undocumented in the example env. See analysis below. |

Hardcoded in code (not configurable — proposed to become configurable in a PR from this issue)

| Location | Value | What it does |
| --- | --- | --- |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `pin_memory=True` | Allocates prefetched tensors in page-locked (non-swappable) system RAM. See explanation below. |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `prefetch_factor=4` | Each DataLoader subprocess keeps 4 batches already decoded and queued ahead of the main process. |
| `trapdata/antenna/datasets.py::RESTDataset._ensure_sessions` | `ThreadPoolExecutor(max_workers=8)` | 8 concurrent image downloads per DataLoader subprocess. |
| `trapdata/antenna/worker.py` | `MAX_PENDING_POSTS = 5` | Result-poster concurrency ceiling. |
| `trapdata/antenna/worker.py` | `SLEEP_TIME_SECONDS = 5` | Polling interval when no jobs are available. |

What pin_memory and prefetch_factor actually do

These two DataLoader knobs are the main multipliers on peak ingest memory, they interact in non-obvious ways, and they're both hardcoded today — worth being explicit about their tradeoffs.

pin_memory=True (affects system RAM, not VRAM)

Allocates tensors in page-locked RAM. CUDA's DMA engine requires pinned pages to transfer from host to device; if a tensor is in pageable memory, PyTorch has to copy it to a scratch pinned buffer first, then DMA from there. With pin_memory=True the prefetched tensor goes straight into pinned memory, skipping the extra copy, and .to(device, non_blocking=True) can overlap transfer with compute.

  • When it helps: CPU→GPU transfer is a meaningful fraction of wall-clock time, typically with fast local data and a high-bandwidth model.
  • When it hurts:
    • Pinned pages cannot be swapped. They count as hard RSS against physical RAM.
    • With num_workers > 0, each prefetched batch is held in pinned shared memory for IPC between the loader subprocess and the main process. Cost scales per prefetched batch.
    • On workers where data loading dominates wall time (for example HTTP-sourced images where download is ~8 s per batch vs ~20 ms CPU→GPU transfer), pinning provides essentially no throughput benefit.
  • Key point: pin_memory does not touch VRAM. Disabling it saves system RAM; GPU memory usage is the same either way.

prefetch_factor=N (affects system RAM)

With num_workers > 0, each DataLoader subprocess keeps N batches already-prepared and queued ahead of the main process's consumption (default 2). When the main process pulls one, the subprocess immediately starts on another.

  • When it helps: data loading is slow relative to inference, so the GPU would otherwise sit idle waiting. Prefetching hides that latency.
  • When it hurts: data loading is faster than inference — the queue is wasted memory.
  • For the HTTP-sourced workload: prefetching is genuinely useful. HTTP image download dominates wall time (~7–11 s per batch) relative to inference (~1 s per batch). Without prefetch the GPU would idle most of the time. The problem is only that the fixed prefetch_factor=4 combined with batch_size=24 and large images can exceed worker RAM. On small-image jobs the same settings are fine. The right value is job-dependent, not worker-dependent.

The useful middle ground

Keeping prefetch_factor high while disabling pin_memory gets you network-latency hiding without the unswappable-RAM cost. Prefetched batches live in pageable memory that the kernel can reclaim if pressure rises, and the pinned-shmem IPC duplication goes away. CPU→GPU transfers get about 20 ms slower per batch, which is noise compared to inference time on this workload.

This is probably the best default for any HTTP-sourced workload, regardless of image size. Needs to be validated against a small-image job to confirm it doesn't regress throughput on the common case.

Hypothesis: peak ingest memory scales with image size × fanout

Each 24 MB JPEG decodes to ~36 MB uint8, then image_transforms = ToTensor() converts it to float32, quadrupling it to ~144 MB per image as a CHW tensor. For ~1.5 MB JPEGs the equivalent is ~9 MB per tensor.
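
The decode arithmetic can be sanity-checked directly. A tiny sketch, assuming a 4096×3072×3 RGB image (per the reproduction notes) and the standard `ToTensor()` promotion from uint8 HWC to float32 CHW; the ~36/~144 MB figures in the text are the binary (MiB) rounding of these decimal values:

```python
def decoded_tensor_mb(width: int, height: int, channels: int = 3) -> tuple[float, float]:
    """Return (uint8 MB, float32 MB) for a decoded image, in decimal megabytes."""
    uint8_bytes = width * height * channels  # 1 byte per element after JPEG decode
    float32_bytes = uint8_bytes * 4          # ToTensor() promotes to float32
    return uint8_bytes / 1e6, float32_bytes / 1e6

u8, f32 = decoded_tensor_mb(4096, 3072)
# u8 ≈ 37.7 MB (36 MiB), f32 ≈ 150.9 MB (exactly 144 MiB)
```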

Rough estimates at current defaults (1 DataLoader worker)

| Component | Large images (24 MB JPEG → 144 MB tensor) | Typical images (1.5 MB JPEG → 9 MB tensor) |
| --- | --- | --- |
| Active batch (24 images) | 3.5 GB | 216 MB |
| Prefetch queue (4 batches) | 13.8 GB | 864 MB |
| Pinned IPC shmem copy | +13.8 GB | +864 MB |
| Detector working set | ~0.5 GB | ~0.5 GB |
| 8× concurrent download buffers + PIL | ~0.6 GB | ~0.1 GB |
| Peak before classification | ~32 GB | ~2 GB |

Typical images fit comfortably on small workers. Large images don't fit at all. The current defaults appear reasonable for the common case and the failure mode appears only on the uncommon large-image case — but these numbers are estimated from code reading, not measured. A real memory profile on a reproducer would be more trustworthy.
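
The table's model reduces to a few lines of arithmetic. A sketch (the detector working set and download-buffer terms are the table's rough constants, not measurements):

```python
def estimate_peak_gb(tensor_mb: float, batch_size: int = 24,
                     prefetch_factor: int = 4, pin_memory: bool = True,
                     detector_gb: float = 0.5, download_gb: float = 0.6) -> float:
    """Rough peak ingest RAM in GB, mirroring the estimate tables."""
    active_gb = batch_size * tensor_mb / 1000            # batch being processed
    queue_gb = prefetch_factor * batch_size * tensor_mb / 1000  # prefetched batches
    pinned_copy_gb = queue_gb if pin_memory else 0.0     # pinned shmem IPC duplicate
    return active_gb + queue_gb + pinned_copy_gb + detector_gb + download_gb

# Large images at current defaults: estimate_peak_gb(144) ≈ 32.2 GB
```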

Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4

| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | +2.3 GB | +144 MB |
| Peak | ~5.2 GB | ~325 MB |

Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4, pin_memory=False, prefetch_factor=4

| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches, pageable) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | 0 (not pinned) | 0 |
| Peak | ~2.9 GB | ~180 MB |

Best tradeoff from the math: keeps the network-hiding benefit of prefetching, spends about half the RAM of the current defaults even on huge images, and the pageable pages are reclaimable under pressure.

The env var that appears to fix the immediate crash

Setting AMI_ANTENNA_API_BATCH_SIZE=4 in .env drops estimated peak ingest memory from ~32 GB to ~5 GB for 24 MB images. Observed on a test worker after the change: steady-state RSS ~13 GB (including Python baseline, model weights, buff/cache), processing at ~2 s/image with no crashes across several consecutive jobs.

This is a single successful deployment, not a controlled experiment, and we haven't yet isolated the effect of this variable on its own from other local changes applied at the same time — see "still to verify" below.

Discoverability problem: an operator seeing an OOMing worker reaches for AMI_LOCALIZATION_BATCH_SIZE and AMI_CLASSIFICATION_BATCH_SIZE because those sound like memory controls. They aren't — they control model inference batch sizes, which are a small fraction of peak memory. The knob that appears to actually limit ingest memory (AMI_ANTENNA_API_BATCH_SIZE) is currently undocumented in the example env and README.

What was applied locally to get past the immediate crash

Three changes. Current hypothesis is that only (2) was strictly needed for ADC; (1) and (3) are environmental hygiene and belt-and-suspenders.

  1. Raised the supervisor / systemd soft FD limit from 1024 to 65536. Unrelated to this issue as an ADC bug; worth a deployment docs note.
  2. AMI_ANTENNA_API_BATCH_SIZE=4 in .env — the suspected fix.
  3. Local patch to trapdata/antenna/datasets.py: pin_memory=False, prefetch_factor=1. Extra headroom. Not yet independently verified as necessary.

What we still need to verify

Before any of the proposed directions below is taken to code:

  • Config-only reproducibility: revert the local datasets.py patch, leave only AMI_ANTENNA_API_BATCH_SIZE=4, and confirm the worker still runs large-image jobs cleanly. If yes, pin_memory/prefetch_factor don't need to change — they just need to become configurable.
  • Middle-ground test: run large-image and small-image jobs with pin_memory=False, prefetch_factor=4. Does the small-image case keep its current throughput? If yes, this is probably the right new default.
  • Small-image regression: run a typical ~1.5 MB job with everything at the proposed defaults and measure throughput vs the current defaults. Is prefetching materially helping throughput on the common case?
  • Real memory profile: run a reproducer under memray or tracemalloc on a large-image job. The estimates in the tables above are code-reading, not measurement. Would also catch any residual leak on top of the steady-state peak.
  • Download vs inference throughput ratio: measure typical load_time and (classification + detection)_time per batch across a mix of 1.5 MB and 24 MB jobs. The right prefetch factor depends on this ratio and it may differ per-job, which motivates the probe-based auto-tune below.
  • FD accounting: confirm the FD crash was solely a ulimit issue and not a leak of its own. Watch /proc/<pid>/fd count over a long-running job.
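
For the FD-accounting item, a small helper for periodic sampling (the `/proc/self/fd` listing is Linux-specific; the fallback probes descriptors directly and is an approximation):

```python
import os

def open_fd_count() -> int:
    """Count this process's open file descriptors."""
    try:
        return len(os.listdir("/proc/self/fd"))  # Linux
    except FileNotFoundError:
        # Portable-ish fallback: probe the low descriptor range directly.
        count = 0
        for fd in range(4096):
            try:
                os.fstat(fd)
                count += 1
            except OSError:
                pass
        return count
```

Logging this every N batches alongside RSS would distinguish a genuine FD leak (monotonic growth) from a legitimately high steady state.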

Possible directions (to discuss, not to immediately implement)

Six directions, listed in increasing effort. Not mutually exclusive.

Direction 1 — Documentation and example env (low risk)

Add AMI_ANTENNA_API_BATCH_SIZE to .env.example with a comment explaining what it does and how to pick a value. Add a "Tuning for large-image workloads" section to the README. Add a deployment note about the FD soft limit for supervisor/systemd setups.
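
A possible `.env.example` entry (comment wording illustrative):

```shell
# Tasks fetched per POST from /api/v2/jobs/{id}/tasks/.
# This is the main lever on peak ingest RAM: peak scales roughly with
# batch size x prefetch factor x decoded image size (float32 tensors are
# ~6x the JPEG size). Lower this (e.g. to 4) for jobs with very large
# source images.
AMI_ANTENNA_API_BATCH_SIZE=24
```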

Direction 2 — Expose the hardcoded DataLoader knobs (small code change)

Make these env-configurable with current values as defaults (no behavior change on its own):

| Proposed env var | Default | Source |
| --- | --- | --- |
| `AMI_ANTENNA_API_DATALOADER_PIN_MEMORY` | true | hardcoded `pin_memory=True` |
| `AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR` | 4 | hardcoded `prefetch_factor=4` |
| `AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS` | 8 | hardcoded `ThreadPoolExecutor(max_workers=8)` |
| `AMI_ANTENNA_MAX_PENDING_POSTS` | 5 | hardcoded `MAX_PENDING_POSTS` |
| `AMI_ANTENNA_IDLE_SLEEP_SECONDS` | 5 | hardcoded `SLEEP_TIME_SECONDS` |

This enables Direction 4 and makes the verification work cheap to do.

Direction 3 — A tuning command (medium effort)

ami worker tune <project-or-pipeline> that fetches a small sample of pending tasks, downloads and measures them, reads psutil.virtual_memory().total, and prints recommended settings with the math. Optionally writes to .env. Valuable even if Direction 4 is also done, because it lets operators understand what the worker will do before running.
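
The core of such a command could be a pure recommendation function, testable without a live worker. A sketch (the 40% RAM budget, the +1 active batch, and the 2× pinned-copy factor are illustrative assumptions matching the estimates above; the caller would supply `psutil.virtual_memory().total` and a measured tensor size):

```python
def recommend_batch_size(total_ram_bytes: int, tensor_mb: float,
                         prefetch_factor: int = 4, pin_memory: bool = True,
                         ram_fraction: float = 0.4, max_batch: int = 24) -> int:
    """Largest fetch batch size whose prefetch queue fits the RAM budget."""
    budget_mb = total_ram_bytes * ram_fraction / 1e6
    # Resident copies of each batch: prefetch queue (doubled if pinned for IPC)
    # plus the one batch actively being processed.
    copies = prefetch_factor * (2 if pin_memory else 1) + 1
    batch = int(budget_mb / (tensor_mb * copies))
    return max(1, min(batch, max_batch))
```

With 70 GB RAM and 144 MB tensors this suggests a batch near the low 20s at the default prefetch settings, which is consistent with the defaults being borderline for the large-image job in the tables above.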

Direction 4 — Probe-then-commit auto-tune, per job

On entry to _process_job, probe a tiny sample (e.g. 2 tasks) to measure both per-image tensor size and download throughput, and use the ratio of download time vs inference time to pick the right batch size and prefetch factor for that job. Sketch (`fetch_tasks`, `download_and_decode`, and the `CONFIGURED_*` constants are stand-ins for worker internals; the 40% RAM budget is an assumption):

```python
import time
from statistics import mean

import psutil

def _probe_job(self, job_id):
    """Probe a 2-task sample; return (batch_size, prefetch_factor) or None."""
    start = time.monotonic()
    sample = self.fetch_tasks(batch_size=2)
    if not sample:
        return None
    tensors = self.download_and_decode(sample)
    elapsed = time.monotonic() - start

    tensor_mb = mean(t.element_size() * t.nelement() / 1e6 for t in tensors)
    download_mb_per_s = (tensor_mb * len(tensors)) / elapsed
    inference_img_per_sec = self.last_inference_rate or 2.0  # conservative fallback

    # Spend at most 40% of physical RAM on ingest.
    ram_budget_bytes = psutil.virtual_memory().total * 0.4

    download_s_per_batch = (CONFIGURED_BATCH_SIZE * tensor_mb) / download_mb_per_s
    inference_s_per_batch = CONFIGURED_BATCH_SIZE / inference_img_per_sec
    if download_s_per_batch > 2 * inference_s_per_batch:
        prefetch = 4  # network-bound: hide latency aggressively
    elif download_s_per_batch > inference_s_per_batch:
        prefetch = 2
    else:
        prefetch = 1  # inference-bound: don't waste RAM on a deep queue

    # x2 because each prefetched batch also exists as a pinned IPC copy.
    batch_size = int(ram_budget_bytes / (tensor_mb * 1e6 * prefetch * 2))
    return min(batch_size, CONFIGURED_MAX), prefetch
```
Operator never sets the ingest knobs. Jobs with tiny images and slow networks get big batches and aggressive prefetch; jobs with huge images get small batches and minimal prefetch; jobs with fast networks and fast inference also get small batches because prefetch isn't helping. This matches the intuition that jobs should be as large as they need to be and take as long as they need to, letting workers consume what they can when they can.

Caveats: requires Direction 2 (runtime-configurable pin_memory / prefetch_factor). Rebuilding the DataLoader per job is a non-trivial refactor of _process_job. The probe needs to be cheap and needs to handle atypical sample images.

Direction 5 — Pre-prefetch to local disk (best long-term architecture)

Decouple download from RAM by adding a disk-caching stage between fetch and decode. Raw JPEG bytes land on local disk as they download, and the DataLoader reads from disk to decode. RAM never holds more than one active batch of decoded tensors; "prefetch" becomes disk space, which is usually abundant.

Benefits:

  • Bounded RAM regardless of batch_size, prefetch_factor, or input image size
  • Prefetch can be arbitrarily large — whole-job staging becomes feasible
  • Network and inference are fully decoupled (separate async stage), no more "load time dominates batch time"
  • Crash recovery: if a worker dies mid-job, the cache lets a restart resume without re-downloading
  • Makes image-size-based auto-tuning much simpler because the download throughput test is free

Costs:

  • Disk space (bounded, must be cleaned up at job completion)
  • Extra IO (negligible on SSD/NVMe, probably still negligible on spinning disk since HTTP fetch time dominates)
  • Refactor touches RESTDataset.__iter__ and probably adds a new staging class

This is the direction that most cleanly separates the two concerns (hiding network latency vs. limiting memory) that currently fight each other in the DataLoader config. Worth a design discussion separately.
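
The staging idea in miniature (names hypothetical; the real change would live in or alongside `RESTDataset.__iter__`):

```python
import tempfile
from pathlib import Path

def stage_to_disk(image_bytes_iter, cache_dir=None):
    """Write raw JPEG bytes to disk as they download; yield paths, not bytes.

    RAM then holds only the batch currently being decoded; the "prefetch
    queue" becomes files on disk, which the OS page cache manages and can
    reclaim under pressure.
    """
    cache = Path(cache_dir or tempfile.mkdtemp(prefix="ami-stage-"))
    for i, raw in enumerate(image_bytes_iter):
        path = cache / f"task-{i:06d}.jpg"
        path.write_bytes(raw)
        yield path
```

In the real worker the download loop would run ahead asynchronously, and cleanup at job completion would remove the cache directory.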

Direction 6 — Resource-aware self-throttling (long-term safety net)

Independent of all of the above, workers should self-throttle based on observed peak RSS. After every N completed batches sample resource.getrusage(RUSAGE_SELF).ru_maxrss, compare to a configurable soft cap (say 70% of MemTotal), and if approaching reduce antenna_api_batch_size and/or prefetch_factor for the next fetch. A worker that runs slowly is strictly better than a worker that crashes — a crash loses in-flight NATS messages to max_deliver exhaustion and silently fails the job. This is a safety net that catches whatever the probe missed.
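
The throttling decision itself can be a pure function, testable independently of the `getrusage` sampling loop. A sketch (the 70% cap and the halving policy are illustrative; note `ru_maxrss` is kilobytes on Linux and bytes on macOS, so the caller must normalize to bytes):

```python
def throttled_batch_size(current_batch: int, peak_rss_bytes: int,
                         total_ram_bytes: int, soft_cap: float = 0.7,
                         floor: int = 1) -> int:
    """Halve the fetch batch size whenever peak RSS crosses the soft cap."""
    if peak_rss_bytes > soft_cap * total_ram_bytes:
        return max(floor, current_batch // 2)
    return current_batch
```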

Labels: bug (Something isn't working)