Skip to content

bug(jobs/adc): exactly one worker batch's worth of images stranded after long-running jobs #1247

@mihow

Description

@mihow

Symptom

Long-running ML jobs (>5 min observed) consistently end with exactly worker.batch_size images stuck in Redis pending sets — 8 if the ADC worker is configured for batch_size=8, 16 if 16. Short jobs (<5 min) don't show this.

Most recent example: demo job 88 (project 5), 544 images total, 16 stranded after the NATS stream drained. PR #1244's reconciler (now live on demo) marked them failed at the next jobs_health_check tick (~22 min idle), and the job landed in SUCCESS (16/544 = 2.9% < FAILURE_THRESHOLD):

[04:30:00] WARNING jobs_health_check: marked 16 image(s) as failed (job idle past cutoff;
  NATS consumer num_pending=0 num_ack_pending=0 num_redelivered=16). IDs: [...]

num_redelivered=16 matching batch size = strong signal that one full ADC pull was the unit lost. Without #1244, check_stale_jobs would have REVOKED the job and lost 528 successful results.

Why this matters

PR #1244 is the right safety net but the underlying bug is deterministic. A job big enough to trigger it loses one batch; failure rate scales with 1/N images per job — invisible at small scale, painful at production scale. We should fix the root cause.

Refined hypothesis (based on code reading, not yet measured)

ADC's last in-flight batch is abandoned by the result-POST thread pool when it stalls or fails, leaving the NATS messages un-ACK'd because the ACK chain (Antenna SREM → ACK) only fires on a successful POST landing.

Architectural facts established by investigation:

  1. ADC does not poll NATS state. The _worker_loop in worker.py:93-129 is an unconditional while True. The pull loop terminates via the DataLoader iterator (datasets.py:275-290), which breaks when Antenna's /jobs/{id}/tasks/ REST endpoint returns an empty batch — not from NATS num_pending inspection.
  2. ADC is not the ACKer. _process_batch calls result_poster.post_async() (worker.py:501), which queues an HTTP POST to a thread pool with backpressure (max 5 pending) and returns immediately. Antenna receives the POST, runs process_nats_pipeline_result, and only then sends the NATS ACK on the reply_subject (tasks.py:292).
  3. The ACK chain is gated on the POST landing. Order in process_nats_pipeline_result: parse → SREM pending:process → save detections → SREM pending:results → progress update → ACK. If the POST itself never lands at Antenna, none of those steps run for that batch.
  4. The last-batch handoff is the smoking gun. When /jobs/{id}/tasks/ returns empty, the DataLoader iterator breaks and _process_job calls result_poster.wait_for_all_posts() with a bounded timeout (~210s = 60 + 5*30). If any pending POST in the thread pool times out, errors out, or is force-cancelled (result_posting.py:114-122), the batch is abandoned silently — the worker logs "Done, detections:..." at INFO and moves on. NATS never sees the ACK; messages enter num_ack_pending, get redelivered up to max_deliver, then sit there until cleanup.

This explains the "long jobs only" pattern: more batches → more chances for one to hit a transient HTTP blip → exactly-one-batch-worth of images stranded. Short jobs finish before the probability accumulates.

Verification gaps

The above is from code reading, not measurement. To confirm:

  • Reproduce on demand (dispatch a 100+ image job, force a transient error in Antenna's /result/ endpoint mid-job, confirm exactly batch_size remains)
  • Search ADC worker logs for the wait_for_all_posts timeout / result_posting cancellation lines around the moment of "Done, detections:..." in a known-stranded job
  • Confirm the abandoned POST does not retry (whether result_poster has retry logic, and if so, what its budget is)
  • Determine whether Antenna logs anything when the POST fails to land vs. when it lands but ACK fails — the diagnostic loudness matters for next-time triage

Fix directions (rank-ordered, no decision yet)

  1. Make wait_for_all_posts retries explicit and bounded. If a POST fails after retries, the batch should be marked "POST failure" in ADC's own logs at WARNING+ so the failure is at least visible. (Cheapest, doesn't reach NATS at all.)
  2. Have ADC's worker shutdown explicitly NACK or terminate any in-flight messages so NATS gets a fast redeliver instead of waiting ack_wait × max_deliver. (Requires NATS-aware shutdown path in ADC.)
  3. Have Antenna track POST receipt and emit a periodic "POST never landed for image X" log distinct from the existing "Job state keys not found" path. (Helps diagnose, doesn't fix.)

PR #1244's reconciler stays as the safety net regardless of which fix lands.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions