The ami worker process OOMs when processing jobs that contain large source images. On a test worker with 70 GB system RAM and a 24 GB VRAM GPU slice, processing a job of ~914 source images at ~24 MB per JPEG, the worker consumed ~68 GB of RAM in roughly 40 seconds before being OOM-killed by the init system. Based on reading the ingest code and a single successful workaround deployment, we think this is primarily a config/defaults/documentation problem rather than a new leak or a code bug, but more investigation and controlled testing is needed before committing to any particular fix. This issue is mainly to (a) document what we saw, (b) enumerate the tuning knobs the worker already honors, (c) make the currently-hardcoded DataLoader knobs configurable, and (d) open a discussion about auto-tuning strategies.
What we originally thought
Our first theory was the same class of problem as #118 — a memory leak in the worker loop. #118 was a real leak that was fixed and deployed earlier. The behavior we saw here (OOM kills, DataLoader workers growing to 20+ GB RSS, cascading NATS max_deliver failures) looked identical to #118, which sent us down the wrong path for a while.
What seems to actually be happening is different: memory grows to its operating ceiling in seconds on startup, not over hours. It's the steady-state peak that appears to be the problem, and that peak is determined by (1) per-image tensor size, (2) antenna_api_batch_size, (3) prefetch_factor, and (4) pin_memory. When per-image size gets large, the math stops working. We haven't ruled out that there is also a leak on top of this — more careful instrumentation would be needed to say for sure.
Reproduction
Job characteristics: ~914 source images, ~24 MB JPEGs, roughly 4096×3072 (unusually large — most jobs in this deployment are ~1.5 MB images)
Symptom timeline (single observation):

- T+0s — supervisor spawns `ami worker`
- T+43s — init system OOM-kills the supervisor cgroup
- Memory peak: ~68 GB on a 68 GB instance
We also initially saw OSError(24, 'Too many open files') from multiprocessing.Pipe's os.pipe() call during DataLoader startup. That turned out to be an unrelated infrastructure problem — the default soft FD ulimit on many distros is 1024, and the worker legitimately needs ~1000+ FDs at steady state (SSL sockets × ThreadPoolExecutor, DataLoader IPC pipes, in-flight image downloads). Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. Not an ADC bug, but worth mentioning in the deployment docs as an ops requirement.
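For reference, the out-of-band FD fix lives at the supervision layer. A sketch of the three common places to set it (65536 is the value used in the workaround deployment; which layer applies depends on how the worker is launched):

```shell
# systemd unit for the worker:
#   [Service]
#   LimitNOFILE=65536
#
# supervisord (daemon-wide minimum, in the [supervisord] section):
#   minfds=65536
#
# plain shell wrapper, before exec'ing the worker:
ulimit -n 65536
```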
Known configuration parameters
All env vars use the AMI_ prefix (pydantic env_prefix = "ami_"). Current state of everything the worker honors:
Core paths / database
| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_DATABASE_URL` | sqlite at `AMI_USER_DATA_PATH/trapdata.db` | SQLAlchemy connection string. Supports PostgreSQL. For API workers, only used for local state. |
| `AMI_USER_DATA_PATH` | platform-specific app dir | Where model weights, thumbnails, reports, and the default SQLite DB live. Must exist before worker start or SQLite DB creation fails. |
| `AMI_IMAGE_BASE_PATH` | unset | Root folder for local-filesystem image workflows. Not used by the Antenna API worker path. |
Model selection
| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_LOCALIZATION_MODEL` | `DEFAULT_OBJECT_DETECTOR` | Which localization model (enum in `trapdata.ml.models`) |
| `AMI_BINARY_CLASSIFICATION_MODEL` | `DEFAULT_BINARY_CLASSIFIER` | Moth / non-moth filter |
| `AMI_SPECIES_CLASSIFICATION_MODEL` | `DEFAULT_SPECIES_CLASSIFIER` | Fine-grained species classifier |
| `AMI_FEATURE_EXTRACTOR` | `DEFAULT_FEATURE_EXTRACTOR` | Feature extractor for downstream models |
| `AMI_CLASSIFICATION_THRESHOLD` | `0.6` | Minimum classification confidence to emit a detection |
Inference batch sizes (model-level, not ingest-level)
| Env var | Default | What it actually controls |
| --- | --- | --- |
| `AMI_LOCALIZATION_BATCH_SIZE` | 8 | Images per localization (detector) forward pass. Affects VRAM during detection and model-level throughput. Does not limit how much data is loaded into RAM or how many tasks are fetched from the API. |
| `AMI_CLASSIFICATION_BATCH_SIZE` | 20 | Crops per classifier forward pass. Same caveat — model-side knob, not ingest-side knob. |
| `AMI_NUM_WORKERS` | 4 | Number of AMI worker instances (one per GPU, via `torch.multiprocessing.spawn`). Also reused as the DataLoader's `num_workers` inside each instance. This reuse is confusing and probably worth separating. |
Antenna API worker settings
| Env var | Default | Purpose |
| --- | --- | --- |
| `AMI_ANTENNA_API_BASE_URL` | `http://localhost:8000/api/v2` | Base URL for the Antenna API |
| `AMI_ANTENNA_API_AUTH_TOKEN` | empty, required | Auth token for the worker's identity |
| `AMI_ANTENNA_SERVICE_NAME` | `"AMI Data Companion"` | Human-readable name shown in the ProcessingService admin |
| `AMI_ANTENNA_API_BATCH_SIZE` | 24 | Tasks per POST fetch from `/api/v2/jobs/{id}/tasks/`. This appears to be the single biggest lever on peak ingest memory. Currently undocumented in the example env. See analysis below. |
Hardcoded in code (not configurable — proposed to become configurable in a PR from this issue)
| Location | Value | What it does |
| --- | --- | --- |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `pin_memory=True` | Allocates prefetched tensors in page-locked (non-swappable) system RAM. See explanation below. |
| `trapdata/antenna/datasets.py::get_rest_dataloader` | `prefetch_factor=4` | Each DataLoader subprocess keeps 4 batches already decoded and queued ahead of the main process. |
| `trapdata/antenna/datasets.py::RESTDataset._ensure_sessions` | `ThreadPoolExecutor(max_workers=8)` | 8 concurrent image downloads per DataLoader subprocess. |
| `trapdata/antenna/worker.py` | `MAX_PENDING_POSTS = 5` | Result-poster concurrency ceiling. |
| `trapdata/antenna/worker.py` | `SLEEP_TIME_SECONDS = 5` | Polling interval when no jobs are available. |
What pin_memory and prefetch_factor actually do
These two DataLoader knobs are the main multipliers on peak ingest memory, they interact in non-obvious ways, and they're both hardcoded today — worth being explicit about their tradeoffs.
pin_memory=True (affects system RAM, not VRAM)
Allocates tensors in page-locked RAM. CUDA's DMA engine requires pinned pages to transfer from host to device; if a tensor is in pageable memory, PyTorch has to copy it to a scratch pinned buffer first, then DMA from there. With pin_memory=True the prefetched tensor goes straight into pinned memory, skipping the extra copy, and .to(device, non_blocking=True) can overlap transfer with compute.
**When it helps:** CPU→GPU transfer is a meaningful fraction of wall-clock time, typically with fast local data and a high-bandwidth model.

**When it hurts:**

- Pinned pages cannot be swapped. They count as hard RSS against physical RAM.
- With `num_workers > 0`, each prefetched batch is held in pinned shared memory for IPC between the loader subprocess and the main process. Cost scales per prefetched batch.
- On workers where data loading dominates wall time (for example HTTP-sourced images where download is ~8 s per batch vs ~20 ms CPU→GPU transfer), pinning provides essentially no throughput benefit.
Key point: pin_memory does not touch VRAM. Disabling it saves system RAM; GPU memory usage is the same either way.
prefetch_factor=N (affects system RAM)
With num_workers > 0, each DataLoader subprocess keeps N batches already-prepared and queued ahead of the main process's consumption (default 2). When the main process pulls one, the subprocess immediately starts on another.
**When it helps:** data loading is slow relative to inference, so the GPU would otherwise sit idle waiting. Prefetching hides that latency.

**When it hurts:** data loading is faster than inference — the queue is wasted memory.
For the HTTP-sourced workload: prefetching is genuinely useful. HTTP image download dominates wall time (~7–11 s per batch) relative to inference (~1 s per batch). Without prefetch the GPU would idle most of the time. The problem is only that the fixed prefetch_factor=4 combined with batch_size=24 and large images can exceed worker RAM. On small-image jobs the same settings are fine. The right value is job-dependent, not worker-dependent.
The useful middle ground
Keeping prefetch_factor high while disabling pin_memory gets you network-latency hiding without the unswappable-RAM cost. Prefetched batches live in pageable memory that the kernel can reclaim if pressure rises, and the pinned-shmem IPC duplication goes away. CPU→GPU transfers get about 20 ms slower per batch, which is noise compared to inference time on this workload.
This is probably the best default for any HTTP-sourced workload, regardless of image size. Needs to be validated against a small-image job to confirm it doesn't regress throughput on the common case.
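As a concrete sketch, the middle ground is just two kwargs flipped relative to what `get_rest_dataloader` hardcodes today (the batch size shown is the large-image value discussed below, not a new recommendation):

```python
# Proposed middle-ground DataLoader kwargs for HTTP-sourced jobs (sketch).
middle_ground = dict(
    num_workers=1,       # one loader subprocess, as in the estimates below
    batch_size=4,        # via AMI_ANTENNA_API_BATCH_SIZE, for large-image jobs
    prefetch_factor=4,   # keep hiding network latency
    pin_memory=False,    # drop the unswappable pinned copy (~20 ms/batch slower H2D)
)
```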
Hypothesis: peak ingest memory scales with image size × fanout
Each 24 MB JPEG decodes to ~36 MB uint8, then image_transforms = ToTensor() converts it to float32, quadrupling it to ~144 MB per image as a CHW tensor. For ~1.5 MB JPEGs the equivalent is ~9 MB per tensor.
Rough estimates at current defaults (1 DataLoader worker)
| Component | Large images (24 MB JPEG → 144 MB tensor) | Typical images (1.5 MB JPEG → 9 MB tensor) |
| --- | --- | --- |
| Active batch (24 images) | 3.5 GB | 216 MB |
| Prefetch queue (4 batches) | 13.8 GB | 864 MB |
| Pinned IPC shmem copy | +13.8 GB | +864 MB |
| Detector working set | ~0.5 GB | ~0.5 GB |
| 8× concurrent download buffers + PIL | ~0.6 GB | ~0.1 GB |
| Peak before classification | ~32 GB | ~2 GB |
Typical images fit comfortably on small workers. Large images don't fit at all. The current defaults appear reasonable for the common case and the failure mode appears only on the uncommon large-image case — but these numbers are estimated from code reading, not measured. A real memory profile on a reproducer would be more trustworthy.
Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4
| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | +2.3 GB | +144 MB |
| Peak | ~5.2 GB | ~325 MB |
Rough estimates with AMI_ANTENNA_API_BATCH_SIZE=4, pin_memory=False, prefetch_factor=4
| Component | Large images | Typical images |
| --- | --- | --- |
| Active batch (4 images) | 576 MB | 36 MB |
| Prefetch queue (4 batches, pageable) | 2.3 GB | 144 MB |
| Pinned IPC shmem copy | 0 (not pinned) | 0 |
| Peak | ~2.9 GB | ~180 MB |
Best tradeoff from the math: keeps the network-hiding benefit of prefetching, spends about half the RAM of the current defaults even on huge images, and the pageable pages are reclaimable under pressure.
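All three tables follow the same cost model, which can be written down as a small estimator (a sketch of the arithmetic above using decimal GB, not code from the repo; `extra_gb` covers the detector working set and download buffers, which only the first table includes):

```python
def peak_ingest_gb(tensor_mb: float, batch_size: int, prefetch_factor: int,
                   pin_memory: bool, extra_gb: float = 0.0) -> float:
    """Rough peak ingest memory per the tables above (1 DataLoader worker)."""
    active_gb = tensor_mb * batch_size / 1000       # decoded active batch
    prefetch_gb = active_gb * prefetch_factor       # queued, already-decoded batches
    pinned_gb = prefetch_gb if pin_memory else 0.0  # pinned IPC shmem copy
    return active_gb + prefetch_gb + pinned_gb + extra_gb

# Large-image job (144 MB tensors); detector + download buffers ≈ 1.1 GB extra:
peak_ingest_gb(144, 24, 4, True, extra_gb=1.1)   # current defaults, ≈ 32 GB
peak_ingest_gb(144, 4, 4, True)                  # batch size 4, ≈ 5.2 GB
peak_ingest_gb(144, 4, 4, False)                 # batch size 4, unpinned, ≈ 2.9 GB
```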
The env var that appears to fix the immediate crash
Setting AMI_ANTENNA_API_BATCH_SIZE=4 in .env drops estimated peak ingest memory from ~32 GB to ~5 GB for 24 MB images. Observed on a test worker after the change: steady-state RSS ~13 GB (including Python baseline, model weights, buff/cache), processing at ~2 s/image with no crashes across several consecutive jobs.
This is a single successful deployment, not a controlled experiment, and we haven't yet isolated the effect of this variable on its own from other local changes applied at the same time — see "still to verify" below.
Discoverability problem: an operator seeing an OOMing worker reaches for AMI_LOCALIZATION_BATCH_SIZE and AMI_CLASSIFICATION_BATCH_SIZE because those sound like memory controls. They aren't — they control model inference batch sizes, which are a small fraction of peak memory. The knob that appears to actually limit ingest memory (AMI_ANTENNA_API_BATCH_SIZE) is currently undocumented in the example env and README.
What was applied locally to get past the immediate crash
Three changes. Current hypothesis is that only (2) was strictly needed for ADC; (1) and (3) are environmental hygiene and belt-and-suspenders.
1. Raised the supervisor / systemd soft FD limit from 1024 to 65536. Unrelated to this issue as an ADC bug; worth a deployment docs note.
2. `AMI_ANTENNA_API_BATCH_SIZE=4` in `.env` — the suspected fix.
3. Local patch to `trapdata/antenna/datasets.py`: `pin_memory=False`, `prefetch_factor=1`. Extra headroom. Not yet independently verified as necessary.
What we still need to verify
Before any of the proposed directions below is taken to code:
- **Config-only reproducibility:** revert the local `datasets.py` patch, leave only `AMI_ANTENNA_API_BATCH_SIZE=4`, and confirm the worker still runs large-image jobs cleanly. If yes, `pin_memory`/`prefetch_factor` don't need to change — they just need to become configurable.
- **Middle-ground test:** run large-image and small-image jobs with `pin_memory=False, prefetch_factor=4`. Does the small-image case keep its current throughput? If yes, this is probably the right new default.
- **Small-image regression:** run a typical ~1.5 MB job with everything at the proposed defaults and measure throughput vs the current defaults. Is prefetching materially helping throughput on the common case?
- **Real memory profile:** run a reproducer under `memray` or `tracemalloc` on a large-image job. The estimates in the tables above are code-reading, not measurement. Would also catch any residual leak on top of the steady-state peak.
- **Download vs inference throughput ratio:** measure typical `load_time` and `(classification + detection)_time` per batch across a mix of 1.5 MB and 24 MB jobs. The right prefetch factor depends on this ratio and it may differ per-job, which motivates the probe-based auto-tune below.
- **FD accounting:** confirm the FD crash was solely a ulimit issue and not a leak of its own. Watch `/proc/<pid>/fd` count over a long-running job.
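For the profiling and FD checks, a sketch of the commands (assumes `memray` is installed and that the worker's `ami` console script is a Python entry point; the exact subcommand and `<pid>` are placeholders):

```shell
# Memory profile of a large-image job
memray run --output worker.bin "$(command -v ami)" worker
memray flamegraph worker.bin

# FD accounting over a long-running job
watch -n 10 'ls /proc/<pid>/fd | wc -l'
```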
Possible directions (to discuss, not to immediately implement)
Six directions, listed in increasing effort. Not mutually exclusive.
Direction 1 — Documentation and example env (low risk)
Add AMI_ANTENNA_API_BATCH_SIZE to .env.example with a comment explaining what it does and how to pick a value. Add a "Tuning for large-image workloads" section to the README. Add a deployment note about the FD soft limit for supervisor/systemd setups.
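A sketch of the `.env.example` entry (wording is illustrative; the default shown is today's value):

```shell
# Tasks fetched per POST from /api/v2/jobs/{id}/tasks/. This is the main
# ingest-side memory lever: peak RAM scales roughly with this value ×
# prefetch_factor × decoded image size. Lower it (e.g. to 4) for jobs
# with very large source images.
AMI_ANTENNA_API_BATCH_SIZE=24
```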
Direction 2 — Expose the hardcoded DataLoader knobs (small code change)
Make these env-configurable with current values as defaults (no behavior change on its own):
| Proposed env var | Default | Source |
| --- | --- | --- |
| `AMI_ANTENNA_API_DATALOADER_PIN_MEMORY` | `true` | hardcoded `pin_memory=True` |
| `AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR` | `4` | hardcoded `prefetch_factor=4` |
| `AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS` | `8` | hardcoded `ThreadPoolExecutor(max_workers=8)` |
| `AMI_ANTENNA_MAX_PENDING_POSTS` | `5` | hardcoded `MAX_PENDING_POSTS` |
| `AMI_ANTENNA_IDLE_SLEEP_SECONDS` | `5` | hardcoded `SLEEP_TIME_SECONDS` |
This enables Direction 4 and makes the verification work cheap to do.
Direction 3 — A tuning command (medium effort)
ami worker tune <project-or-pipeline> that fetches a small sample of pending tasks, downloads and measures them, reads psutil.virtual_memory().total, and prints recommended settings with the math. Optionally writes to .env. Valuable even if Direction 4 is also done, because it lets operators understand what the worker will do before running.
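The core of such a command is one function (a sketch with hypothetical names; total RAM would come from `psutil.virtual_memory().total`, passed in here to keep the example dependency-free):

```python
import math

def recommend_batch_size(tensor_mb: float, total_ram_gb: float,
                         prefetch_factor: int = 4, pin_memory: bool = True,
                         ram_budget_fraction: float = 0.5) -> int:
    """Largest batch size whose ingest footprint fits the RAM budget, using
    the cost model from this issue: one active batch plus the prefetch queue,
    plus a pinned IPC copy of the queue when pin_memory is on."""
    batches_in_ram = 1 + prefetch_factor + (prefetch_factor if pin_memory else 0)
    budget_mb = total_ram_gb * 1024 * ram_budget_fraction
    return max(1, math.floor(budget_mb / (tensor_mb * batches_in_ram)))
```

For example, `recommend_batch_size(144, 70)` budgets half of a 70 GB worker and suggests a batch size in the mid-20s for 144 MB tensors — which is roughly why the current default of 24 only just fits on that hardware.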
Direction 4 — Probe-then-commit auto-tune, per job
On entry to `_process_job`, probe a tiny sample (e.g. 2 tasks) to measure both per-image tensor size and download throughput, and use the ratio of download time to inference time to pick the right batch size and prefetch factor for that job.
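A sketch of the probe-then-commit step (hypothetical `ProbeResult` type and thresholds; the real version would live at the top of `_process_job`):

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    tensor_mb: float    # decoded float32 tensor size per sample image
    download_s: float   # mean download time per image
    inference_s: float  # mean inference time per image

def pick_ingest_settings(probe: ProbeResult, ram_budget_mb: float):
    """Pick (batch_size, prefetch_factor) per job from a tiny probe."""
    # Prefetch only pays off when downloading dominates inference.
    ratio = probe.download_s / max(probe.inference_s, 1e-6)
    prefetch = 4 if ratio > 2 else 1
    # With pin_memory off, footprint = active batch + prefetch queue.
    per_image_mb = probe.tensor_mb * (1 + prefetch)
    batch = max(1, int(ram_budget_mb // per_image_mb))
    return min(batch, 24), prefetch  # cap at the current default
```

With an 8 GB ingest budget, a large-image probe (144 MB tensors, download-bound) lands on a small batch with aggressive prefetch, while a typical-image probe hits the cap of 24.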
Operator never sets the ingest knobs. Jobs with tiny images and slow networks get big batches and aggressive prefetch; jobs with huge images get small batches and minimal prefetch; jobs with fast networks and fast inference also get small batches because prefetch isn't helping. This matches the intuition that jobs should be as large as they need to be and take as long as they need to, letting workers consume what they can when they can.
Caveats: requires Direction 2 (runtime-configurable pin_memory / prefetch_factor). Rebuilding the DataLoader per job is a non-trivial refactor of _process_job. The probe needs to be cheap and needs to handle atypical sample images.
Direction 5 — Pre-prefetch to local disk (best long-term architecture)
Decouple download from RAM by adding a disk-caching stage between fetch and decode. Raw JPEG bytes land on local disk as they download, and the DataLoader reads from disk to decode. RAM never holds more than one active batch of decoded tensors; "prefetch" becomes disk space, which is usually abundant.
Benefits:

- Bounded RAM regardless of `batch_size`, `prefetch_factor`, or input image size
- Prefetch can be arbitrarily large — whole-job staging becomes feasible
- Network and inference are fully decoupled (separate async stage), no more "load time dominates batch time"
- Crash recovery: if a worker dies mid-job, the cache lets a restart resume without re-downloading
- Makes image-size-based auto-tuning much simpler because the download throughput test is free

Costs:

- Disk space (bounded, must be cleaned up at job completion)
- Extra IO (negligible on SSD/NVMe, probably still negligible on spinning disk since HTTP fetch time dominates)
- Refactor touches `RESTDataset.__iter__` and probably adds a new staging class
This is the direction that most cleanly separates the two concerns (hiding network latency vs. limiting memory) that currently fight each other in the DataLoader config. Worth a design discussion separately.
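A sketch of the staging stage (hypothetical `fetch(url) -> bytes` callable standing in for the existing HTTP session code):

```python
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor

def stage_to_disk(urls, fetch, cache_dir=None, threads=8):
    """Download raw JPEG bytes to local disk; return paths in input order.
    The DataLoader would then read and decode from these paths, so RAM only
    ever holds the active batch of decoded tensors."""
    cache = pathlib.Path(cache_dir or tempfile.mkdtemp(prefix="ami-stage-"))
    def _one(item):
        i, url = item
        path = cache / f"{i:06d}.jpg"
        path.write_bytes(fetch(url))
        return path
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(_one, enumerate(urls)))
```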
Direction 6 — Resource-aware self-throttling (long-term safety net)
Independent of all of the above, workers should self-throttle based on observed peak RSS. After every N completed batches sample resource.getrusage(RUSAGE_SELF).ru_maxrss, compare to a configurable soft cap (say 70% of MemTotal), and if approaching reduce antenna_api_batch_size and/or prefetch_factor for the next fetch. A worker that runs slowly is strictly better than a worker that crashes — a crash loses in-flight NATS messages to max_deliver exhaustion and silently fails the job. This is a safety net that catches whatever the probe missed.
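A sketch of the check (assumes Linux, where `ru_maxrss` is reported in KiB — it is bytes on macOS; the `state` object holding the ingest knobs is hypothetical):

```python
import resource

def over_soft_cap(total_ram_bytes: int, cap_fraction: float = 0.7) -> bool:
    """True when this process's peak RSS exceeds the soft cap."""
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    return peak_kib * 1024 > total_ram_bytes * cap_fraction

def maybe_throttle(state, total_ram_bytes: int) -> None:
    """Run after every N completed batches; back off the ingest knobs."""
    if over_soft_cap(total_ram_bytes):
        state.batch_size = max(1, state.batch_size // 2)
        state.prefetch_factor = max(1, state.prefetch_factor - 1)
```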