
Clarify and document tuning parameters #138

@mihow

Summary

The ami worker process OOMs when processing jobs that contain large source images. On a test worker with 70 GB system RAM and a 24 GB VRAM GPU slice, processing a job of ~914 source images at ~24 MB per JPEG, the worker consumed ~68 GB of RAM in roughly 40 seconds before being OOM-killed by the init system. Based on reading the ingest code and a single successful workaround deployment, we think this is primarily a config/defaults/documentation problem rather than a new leak or a code bug, but more investigation and controlled testing is needed before committing to any particular fix. This issue is mainly to (a) document what we saw, (b) enumerate the tuning knobs the worker already honors, (c) make the currently-hardcoded DataLoader knobs configurable, and (d) open a discussion about auto-tuning strategies.

What we originally thought

Our first theory was the same class of problem as #118 — a memory leak in the worker loop. #118 was a real leak that was fixed and deployed earlier. The behavior we saw here (OOM kills, DataLoader workers growing to 20+ GB RSS, cascading NATS max_deliver failures) looked identical to #118, which sent us down the wrong path for a while.

What seems to actually be happening is different: memory grows to its operating ceiling in seconds on startup, not over hours. It's the steady-state peak that appears to be the problem, and that peak is determined by (1) per-image tensor size, (2) antenna_api_batch_size, (3) prefetch_factor, and (4) pin_memory. When per-image size gets large, the math stops working. We haven't ruled out that there is also a leak on top of this — more careful instrumentation would be needed to say for sure.

Reproduction

  • Job characteristics: ~914 source images, ~24 MB JPEGs, roughly 4096×3072 (unusually large — most jobs in this deployment are ~1.5 MB images)
  • Worker host: 24 GB VRAM GPU slice, 70 GB system RAM
  • Branch: includes the fix for #118 (Memory leak in worker batch processing causes OOM)

Symptom timeline (single observation):

```
T+0s    supervisor spawns ami worker
T+43s   init system OOM-kills the supervisor cgroup
        Mem peak: ~68 GB on a 68 GB instance
```

We also initially saw OSError(24, 'Too many open files') from multiprocessing.Pipe's os.pipe() call during DataLoader startup. That turned out to be an unrelated infrastructure problem — the default soft FD ulimit on many distros is 1024, and the worker legitimately needs ~1000+ FDs at steady state (SSL sockets × ThreadPoolExecutor, DataLoader IPC pipes, in-flight image downloads). Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. Not an ADC bug, but worth mentioning in the deployment docs as an ops requirement.
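
Where raising the limit at the supervisor/systemd level isn't an option, the soft limit can also be raised from inside the process at startup. A minimal sketch (Unix only; the 65536 target mirrors the value used in the workaround later in this issue):

```python
import resource

def raise_fd_soft_limit(target: int = 65536) -> int:
    """Raise the soft open-file limit toward the hard limit (Unix only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Unprivileged processes may raise the soft limit, but never above the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Doing it in-process only helps the worker itself; the systemd/supervisor route is still needed if child processes are spawned before this runs.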

Known configuration parameters

All env vars use the AMI_ prefix (pydantic env_prefix = "ami_"). Current state of everything the worker honors:

Core paths / database

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_DATABASE_URL` | SQLite at `AMI_USER_DATA_PATH/trapdata.db` | SQLAlchemy connection string. Supports PostgreSQL. For API workers, only used for local state. |
| `AMI_USER_DATA_PATH` | platform-specific app dir | Where model weights, thumbnails, reports, and the default SQLite DB live. Must exist before worker start or SQLite DB creation fails. |
| `AMI_IMAGE_BASE_PATH` | unset | Root folder for local-filesystem image workflows. Not used by the Antenna API worker path. |

Model selection

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_LOCALIZATION_MODEL` | `DEFAULT_OBJECT_DETECTOR` | Which localization model (enum in `trapdata.ml.models`) |
| `AMI_BINARY_CLASSIFICATION_MODEL` | `DEFAULT_BINARY_CLASSIFIER` | Moth / non-moth filter |
| `AMI_SPECIES_CLASSIFICATION_MODEL` | `DEFAULT_SPECIES_CLASSIFIER` | Fine-grained species classifier |
| `AMI_FEATURE_EXTRACTOR` | `DEFAULT_FEATURE_EXTRACTOR` | Feature extractor for downstream models |
| `AMI_CLASSIFICATION_THRESHOLD` | 0.6 | Minimum classification confidence to emit a detection |

Inference batch sizes (model-level, not ingest-level)

| Env var | Default | What it actually controls |
| --- | --- | --- |
| `AMI_LOCALIZATION_BATCH_SIZE` | 8 | Images per localization (detector) forward pass. Affects VRAM during detection and model-level throughput. Does not limit how much data is loaded into RAM or how many tasks are fetched from the API. |
| `AMI_CLASSIFICATION_BATCH_SIZE` | 20 | Crops per classifier forward pass. Same caveat: a model-side knob, not an ingest-side knob. |
| `AMI_NUM_WORKERS` | 4 | Number of AMI worker instances (one per GPU, via `torch.multiprocessing.spawn`). Also reused as the DataLoader's `num_workers` inside each instance. This reuse is confusing and probably worth separating. |

Antenna API worker settings

| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_ANTENNA_API_BASE_URL` | `http://localhost:8000/api/v2` | Base URL for the Antenna API |
| `AMI_ANTENNA_API_AUTH_TOKEN` | empty, required | Auth token for the worker's identity |
| `AMI_ANTENNA_SERVICE_NAME` | "AMI Data Companion" | Human-readable name shown in the ProcessingService admin |
| `AMI_ANTENNA_API_BATCH_SIZE` | 24 | Tasks per POST fetch from `/api/v2/jobs/{id}/tasks/`. This appears to be the single biggest lever on peak ingest memory. Currently undocumented in the example env. See analysis below. |

Hardcoded in code (not configurable — proposed to become configurable in a PR from this issue)

| Location | Value | What it does |
| --- | --- | --- |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `pin_memory=True` | Allocates prefetched tensors in page-locked (non-swappable) system RAM. See explanation below. |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `prefetch_factor=4` | Each DataLoader subprocess keeps 4 batches already decoded and queued ahead of the main process. |
| `trapdata/antenna/datasets.py::RESTDataset._ensure_sessions` | `ThreadPoolExecutor(max_workers=8)` | 8 concurrent image downloads per DataLoader subprocess. |
| `trapdata/antenna/worker.py` | `MAX_PENDING_POSTS = 5` | Result-poster concurrency ceiling. |
| `trapdata/antenna/worker.py` | `SLEEP_TIME_SECONDS = 5` | Polling interval when no jobs are available. |

What pin_memory and prefetch_factor actually do

These two DataLoader knobs are the main multipliers on peak ingest memory, they interact in non-obvious ways, and they're both hardcoded today — worth being explicit about their tradeoffs.

pin_memory=True (affects system RAM, not VRAM)

Allocates tensors in page-locked RAM. CUDA's DMA engine requires pinned pages to transfer from host to device; if a tensor is in pageable memory, PyTorch has to copy it to a scratch pinned buffer first, then DMA from there. With pin_memory=True the prefetched tensor goes straight into pinned memory, skipping the extra copy, and .to(device, non_blocking=True) can overlap transfer with compute.

  • When it helps: CPU→GPU transfer is a meaningful fraction of wall-clock time, typically with fast local data and a high-bandwidth model.
  • When it hurts:
    • Pinned pages cannot be swapped. They count as hard RSS against physical RAM.
    • With num_workers > 0, each prefetched batch is held in pinned shared memory for IPC between the loader subprocess and the main process. Cost scales per prefetched batch.
    • On workers where data loading dominates wall time (for example HTTP-sourced images where download is ~8 s per batch vs ~20 ms CPU→GPU transfer), pinning provides essentially no throughput benefit.
  • Key point: pin_memory does not touch VRAM. Disabling it saves system RAM; GPU memory usage is the same either way.

prefetch_factor=N (affects system RAM)

With num_workers > 0, each DataLoader subprocess keeps N batches already-prepared and queued ahead of the main process's consumption (default 2). When the main process pulls one, the subprocess immediately starts on another.

  • When it helps: data loading is slow relative to inference, so the GPU would otherwise sit idle waiting. Prefetching hides that latency.
  • When it hurts: data loading is faster than inference — the queue is wasted memory.
  • For the HTTP-sourced workload: prefetching is genuinely useful. HTTP image download dominates wall time (~7–11 s per batch) relative to inference (~1 s per batch). Without prefetch the GPU would idle most of the time. The problem is only that the fixed prefetch_factor=4 combined with batch_size=24 and large images can exceed worker RAM. On small-image jobs the same settings are fine. The right value is job-dependent, not worker-dependent.

The useful middle ground

Keeping prefetch_factor high while disabling pin_memory gets you network-latency hiding without the unswappable-RAM cost. Prefetched batches live in pageable memory that the kernel can reclaim if pressure rises, and the pinned-shmem IPC duplication goes away. CPU→GPU transfers get about 20 ms slower per batch, which is noise compared to inference time on this workload.

This is probably the best default for any HTTP-sourced workload, regardless of image size. Needs to be validated against a small-image job to confirm it doesn't regress throughput on the common case.

Hypothesis: peak ingest memory scales with image size × fanout

Each 24 MB JPEG decodes to ~36 MB uint8, then image_transforms = ToTensor() converts it to float32, quadrupling it to ~144 MB per image as a CHW tensor. For ~1.5 MB JPEGs the equivalent is ~9 MB per tensor.
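
The decode arithmetic can be sanity-checked directly. A tiny sketch, assuming a 4096×3072×3 RGB image (per the reproduction notes) and the standard `ToTensor()` promotion from uint8 HWC to float32 CHW; the ~36/~144 MB figures in the text are the binary (MiB) rounding of these decimal values:

```python
def decoded_tensor_mb(width: int, height: int, channels: int = 3) -> tuple[float, float]:
    """Return (uint8 MB, float32 MB) for a decoded image, in decimal megabytes."""
    uint8_bytes = width * height * channels  # 1 byte per element after JPEG decode
    float32_bytes = uint8_bytes * 4          # ToTensor() promotes to float32
    return uint8_bytes / 1e6, float32_bytes / 1e6

u8, f32 = decoded_tensor_mb(4096, 3072)
# u8 ≈ 37.7 MB (36 MiB), f32 ≈ 150.9 MB (exactly 144 MiB)
```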

Rough estimates at current defaults (1 DataLoader worker)

| Component | Large images (24 MB JPEG → 144 MB tensor) | Typical images (1.5 MB JPEG → 9 MB tensor) |
| --- | --- | --- |
| Active batch (24 images) | 3.5 GB | 216 MB |
| Prefetch queue (4 batches) | 13.8 GB | 864 MB |
| Pinned IPC shmem copy | +13.8 GB | +864 MB |
| Detector working set | ~0.5 GB | ~0.5 GB |
| 8× concurrent download buffers + PIL | ~0.6 GB | ~0.1 GB |
| Peak before classification | ~32 GB | ~2 GB |

Typical images fit comfortably on small workers. Large images don't fit at all. The current defaults appear reasonable for the common case and the failure mode appears only on the uncommon large-image case — but these numbers are estimated from code reading, not measured. A real memory profile on a reproducer would be more trustworthy.
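
The table's model reduces to a few lines of arithmetic. A sketch (the detector working set and download-buffer terms are the table's rough constants, not measurements):

```python
def estimate_peak_gb(tensor_mb: float, batch_size: int = 24,
                     prefetch_factor: int = 4, pin_memory: bool = True,
                     detector_gb: float = 0.5, download_gb: float = 0.6) -> float:
    """Rough peak ingest RAM in GB, mirroring the estimate tables."""
    active_gb = batch_size * tensor_mb / 1000            # batch being processed
    queue_gb = prefetch_factor * batch_size * tensor_mb / 1000  # prefetched batches
    pinned_copy_gb = queue_gb if pin_memory else 0.0     # pinned shmem IPC duplicate
    return active_gb + queue_gb + pinned_copy_gb + detector_gb + download_gb

# Large images at current defaults: estimate_peak_gb(144) ≈ 32.2 GB
```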

Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4

| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | +2.3 GB | +144 MB |
| Peak | ~5.2 GB | ~325 MB |

Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4, pin_memory=False, prefetch_factor=4

| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches, pageable) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | 0 (not pinned) | 0 |
| Peak | ~2.9 GB | ~180 MB |

Best tradeoff from the math: keeps the network-hiding benefit of prefetching, spends about half the RAM of the current defaults even on huge images, and the pageable pages are reclaimable under pressure.

The env var that appears to fix the immediate crash

Setting AMI_ANTENNA_API_BATCH_SIZE=4 in .env drops estimated peak ingest memory from ~32 GB to ~5 GB for 24 MB images. Observed on a test worker after the change: steady-state RSS ~13 GB (including Python baseline, model weights, buff/cache), processing at ~2 s/image with no crashes across several consecutive jobs.

This is a single successful deployment, not a controlled experiment, and we haven't yet isolated the effect of this variable on its own from other local changes applied at the same time — see "still to verify" below.

Discoverability problem: an operator seeing an OOMing worker reaches for AMI_LOCALIZATION_BATCH_SIZE and AMI_CLASSIFICATION_BATCH_SIZE because those sound like memory controls. They aren't — they control model inference batch sizes, which are a small fraction of peak memory. The knob that appears to actually limit ingest memory (AMI_ANTENNA_API_BATCH_SIZE) is currently undocumented in the example env and README.

What was applied locally to get past the immediate crash

Three changes. Current hypothesis is that only (2) was strictly needed for ADC; (1) and (3) are environmental hygiene and belt-and-suspenders.

  1. Raised the supervisor / systemd soft FD limit from 1024 to 65536. Unrelated to this issue as an ADC bug; worth a deployment docs note.
  2. AMI_ANTENNA_API_BATCH_SIZE=4 in .env — the suspected fix.
  3. Local patch to trapdata/antenna/datasets.py: pin_memory=False, prefetch_factor=1. Extra headroom. Not yet independently verified as necessary.

What we still need to verify

Before any of the proposed directions below is taken to code:

  • Config-only reproducibility: revert the local datasets.py patch, leave only AMI_ANTENNA_API_BATCH_SIZE=4, and confirm the worker still runs large-image jobs cleanly. If yes, pin_memory/prefetch_factor don't need to change — they just need to become configurable.
  • Middle-ground test: run large-image and small-image jobs with pin_memory=False, prefetch_factor=4. Does the small-image case keep its current throughput? If yes, this is probably the right new default.
  • Small-image regression: run a typical ~1.5 MB job with everything at the proposed defaults and measure throughput vs the current defaults. Is prefetching materially helping throughput on the common case?
  • Real memory profile: run a reproducer under memray or tracemalloc on a large-image job. The estimates in the tables above are code-reading, not measurement. Would also catch any residual leak on top of the steady-state peak.
  • Download vs inference throughput ratio: measure typical load_time and (classification + detection)_time per batch across a mix of 1.5 MB and 24 MB jobs. The right prefetch factor depends on this ratio and it may differ per-job, which motivates the probe-based auto-tune below.
  • FD accounting: confirm the FD crash was solely a ulimit issue and not a leak of its own. Watch /proc/<pid>/fd count over a long-running job.
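
For the FD-accounting item, a small helper for periodic sampling (the `/proc/self/fd` listing is Linux-specific; the fallback probes descriptors directly and is an approximation):

```python
import os

def open_fd_count() -> int:
    """Count this process's open file descriptors."""
    try:
        return len(os.listdir("/proc/self/fd"))  # Linux
    except FileNotFoundError:
        # Portable-ish fallback: probe the low descriptor range directly.
        count = 0
        for fd in range(4096):
            try:
                os.fstat(fd)
                count += 1
            except OSError:
                pass
        return count
```

Logging this every N batches alongside RSS would distinguish a genuine FD leak (monotonic growth) from a legitimately high steady state.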

Possible directions (to discuss, not to immediately implement)

Six directions, listed in increasing effort. Not mutually exclusive.

Direction 1 — Documentation and example env (low risk)

Add AMI_ANTENNA_API_BATCH_SIZE to .env.example with a comment explaining what it does and how to pick a value. Add a "Tuning for large-image workloads" section to the README. Add a deployment note about the FD soft limit for supervisor/systemd setups.
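
A possible `.env.example` entry (comment wording illustrative):

```shell
# Tasks fetched per POST from /api/v2/jobs/{id}/tasks/.
# This is the main lever on peak ingest RAM: peak scales roughly with
# batch size x prefetch factor x decoded image size (float32 tensors are
# ~6x the JPEG size). Lower this (e.g. to 4) for jobs with very large
# source images.
AMI_ANTENNA_API_BATCH_SIZE=24
```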

Direction 2 — Expose the hardcoded DataLoader knobs (small code change)

Make these env-configurable with current values as defaults (no behavior change on its own):

| Proposed env var | Default | Source |
| --- | --- | --- |
| `AMI_ANTENNA_API_DATALOADER_PIN_MEMORY` | true | hardcoded `pin_memory=True` |
| `AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR` | 4 | hardcoded `prefetch_factor=4` |
| `AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS` | 8 | hardcoded `ThreadPoolExecutor(max_workers=8)` |
| `AMI_ANTENNA_MAX_PENDING_POSTS` | 5 | hardcoded `MAX_PENDING_POSTS` |
| `AMI_ANTENNA_IDLE_SLEEP_SECONDS` | 5 | hardcoded `SLEEP_TIME_SECONDS` |

This enables Direction 4 and makes the verification work cheap to do.

Direction 3 — A tuning command (medium effort)

ami worker tune <project-or-pipeline> that fetches a small sample of pending tasks, downloads and measures them, reads psutil.virtual_memory().total, and prints recommended settings with the math. Optionally writes to .env. Valuable even if Direction 4 is also done, because it lets operators understand what the worker will do before running.
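
The core of such a command could be a pure recommendation function, testable without a live worker. A sketch (the 40% RAM budget, the +1 active batch, and the 2× pinned-copy factor are illustrative assumptions matching the estimates above; the caller would supply `psutil.virtual_memory().total` and a measured tensor size):

```python
def recommend_batch_size(total_ram_bytes: int, tensor_mb: float,
                         prefetch_factor: int = 4, pin_memory: bool = True,
                         ram_fraction: float = 0.4, max_batch: int = 24) -> int:
    """Largest fetch batch size whose prefetch queue fits the RAM budget."""
    budget_mb = total_ram_bytes * ram_fraction / 1e6
    # Resident copies of each batch: prefetch queue (doubled if pinned for IPC)
    # plus the one batch actively being processed.
    copies = prefetch_factor * (2 if pin_memory else 1) + 1
    batch = int(budget_mb / (tensor_mb * copies))
    return max(1, min(batch, max_batch))
```

With 70 GB RAM and 144 MB tensors this suggests a batch near the low 20s at the default prefetch settings, which is consistent with the defaults being borderline for the large-image job in the tables above.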

Direction 4 — Probe-then-commit auto-tune, per job

On entry to _process_job, probe a tiny sample (e.g. 2 tasks) to measure both per-image tensor size and download throughput, and use the ratio of download time vs inference time to pick the right batch size and prefetch factor for that job. Sketch (`fetch_tasks`, `download_and_decode`, and the `CONFIGURED_*` constants are stand-ins for worker internals; the 40% RAM budget is an assumption):

```python
import time
from statistics import mean

import psutil

def _probe_job(self, job_id):
    """Probe a 2-task sample; return (batch_size, prefetch_factor) or None."""
    start = time.monotonic()
    sample = self.fetch_tasks(batch_size=2)
    if not sample:
        return None
    tensors = self.download_and_decode(sample)
    elapsed = time.monotonic() - start

    tensor_mb = mean(t.element_size() * t.nelement() / 1e6 for t in tensors)
    download_mb_per_s = (tensor_mb * len(tensors)) / elapsed
    inference_img_per_sec = self.last_inference_rate or 2.0  # conservative fallback

    # Spend at most 40% of physical RAM on ingest.
    ram_budget_bytes = psutil.virtual_memory().total * 0.4

    download_s_per_batch = (CONFIGURED_BATCH_SIZE * tensor_mb) / download_mb_per_s
    inference_s_per_batch = CONFIGURED_BATCH_SIZE / inference_img_per_sec
    if download_s_per_batch > 2 * inference_s_per_batch:
        prefetch = 4  # network-bound: hide latency aggressively
    elif download_s_per_batch > inference_s_per_batch:
        prefetch = 2
    else:
        prefetch = 1  # inference-bound: don't waste RAM on a deep queue

    # x2 because each prefetched batch also exists as a pinned IPC copy.
    batch_size = int(ram_budget_bytes / (tensor_mb * 1e6 * prefetch * 2))
    return min(batch_size, CONFIGURED_MAX), prefetch
```
Operator never sets the ingest knobs. Jobs with tiny images and slow networks get big batches and aggressive prefetch; jobs with huge images get small batches and minimal prefetch; jobs with fast networks and fast inference also get small batches because prefetch isn't helping. This matches the intuition that jobs should be as large as they need to be and take as long as they need to, letting workers consume what they can when they can.

Caveats: requires Direction 2 (runtime-configurable pin_memory / prefetch_factor). Rebuilding the DataLoader per job is a non-trivial refactor of _process_job. The probe needs to be cheap and needs to handle atypical sample images.

Direction 5 — Pre-prefetch to local disk (best long-term architecture)

Decouple download from RAM by adding a disk-caching stage between fetch and decode. Raw JPEG bytes land on local disk as they download, and the DataLoader reads from disk to decode. RAM never holds more than one active batch of decoded tensors; "prefetch" becomes disk space, which is usually abundant.

Benefits:

  • Bounded RAM regardless of batch_size, prefetch_factor, or input image size
  • Prefetch can be arbitrarily large — whole-job staging becomes feasible
  • Network and inference are fully decoupled (separate async stage), no more "load time dominates batch time"
  • Crash recovery: if a worker dies mid-job, the cache lets a restart resume without re-downloading
  • Makes image-size-based auto-tuning much simpler because the download throughput test is free

Costs:

  • Disk space (bounded, must be cleaned up at job completion)
  • Extra IO (negligible on SSD/NVMe, probably still negligible on spinning disk since HTTP fetch time dominates)
  • Refactor touches RESTDataset.__iter__ and probably adds a new staging class

This is the direction that most cleanly separates the two concerns (hiding network latency vs. limiting memory) that currently fight each other in the DataLoader config. Worth a design discussion separately.
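
The staging idea in miniature (names hypothetical; the real change would live in or alongside `RESTDataset.__iter__`):

```python
import tempfile
from pathlib import Path

def stage_to_disk(image_bytes_iter, cache_dir=None):
    """Write raw JPEG bytes to disk as they download; yield paths, not bytes.

    RAM then holds only the batch currently being decoded; the "prefetch
    queue" becomes files on disk, which the OS page cache manages and can
    reclaim under pressure.
    """
    cache = Path(cache_dir or tempfile.mkdtemp(prefix="ami-stage-"))
    for i, raw in enumerate(image_bytes_iter):
        path = cache / f"task-{i:06d}.jpg"
        path.write_bytes(raw)
        yield path
```

In the real worker the download loop would run ahead asynchronously, and cleanup at job completion would remove the cache directory.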

Direction 6 — Resource-aware self-throttling (long-term safety net)

Independent of all of the above, workers should self-throttle based on observed peak RSS. After every N completed batches sample resource.getrusage(RUSAGE_SELF).ru_maxrss, compare to a configurable soft cap (say 70% of MemTotal), and if approaching reduce antenna_api_batch_size and/or prefetch_factor for the next fetch. A worker that runs slowly is strictly better than a worker that crashes — a crash loses in-flight NATS messages to max_deliver exhaustion and silently fails the job. This is a safety net that catches whatever the probe missed.
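
The throttling decision itself can be a pure function, testable independently of the `getrusage` sampling loop. A sketch (the 70% cap and the halving policy are illustrative; note `ru_maxrss` is kilobytes on Linux and bytes on macOS, so the caller must normalize to bytes):

```python
def throttled_batch_size(current_batch: int, peak_rss_bytes: int,
                         total_ram_bytes: int, soft_cap: float = 0.7,
                         floor: int = 1) -> int:
    """Halve the fetch batch size whenever peak RSS crosses the soft cap."""
    if peak_rss_bytes > soft_cap * total_ram_bytes:
        return max(floor, current_batch // 2)
    return current_batch
```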

Labels: bug (Something isn't working)