Symptom
Long-running ML jobs (>5 min observed) consistently end with exactly `worker.batch_size` images stuck in Redis pending sets — 8 if the ADC worker is configured for `batch_size=8`, 16 if 16. Short jobs (<5 min) don't show this.
Most recent example: demo job 88 (project 5), 544 images total, 16 stranded after the NATS stream drained. PR #1244's reconciler (now live on demo) marked them failed at the next `jobs_health_check` tick (~22 min idle), and the job landed in SUCCESS (16/544 = 2.9% < `FAILURE_THRESHOLD`):

```
[04:30:00] WARNING jobs_health_check: marked 16 image(s) as failed (job idle past cutoff;
NATS consumer num_pending=0 num_ack_pending=0 num_redelivered=16). IDs: [...]
```
`num_redelivered=16` matching the batch size is a strong signal that one full ADC pull was the unit lost. Without #1244, `check_stale_jobs` would have REVOKED the job and lost the 528 successful results.
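The stranded count can be checked directly against Redis after a job drains. A minimal diagnostic sketch: the job-scoped key pattern below is an assumption for illustration — only the set names `pending:process` / `pending:results` appear in the code itself — and `r` is expected to be a redis-py client.

```python
# Diagnostic sketch: count images still parked in a job's Redis pending sets.
# The "job:{id}:..." key pattern is an assumed schema, not confirmed code.

def pending_set_keys(job_id: int) -> dict[str, str]:
    """Map a readable label to the (assumed) Redis key for each pending set."""
    return {
        "pending:process": f"job:{job_id}:pending:process",
        "pending:results": f"job:{job_id}:pending:results",
    }

def stranded_counts(r, job_id: int) -> dict[str, int]:
    """Given a redis-py-style client `r`, return SCARD of each pending set."""
    return {label: r.scard(key) for label, key in pending_set_keys(job_id).items()}
```

A strand from this bug shows up as exactly one `batch_size` worth of IDs left behind once the NATS consumer reports `num_pending=0`.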
Why this matters
PR #1244 is the right safety net, but the underlying bug is deterministic. Any job big enough to trigger it loses exactly one batch, so the failed fraction is `batch_size`/N — it scales as 1/N with job size, invisible at small scale, painful at production scale. We should fix the root cause.
Refined hypothesis (based on code reading, not yet measured)
ADC's last in-flight batch is abandoned by the result-POST thread pool when it stalls or fails, leaving the NATS messages un-ACK'd because the ACK chain (Antenna SREM → ACK) only fires on a successful POST landing.
Architectural facts established by investigation:
- ADC does not poll NATS state. The `_worker_loop` in `worker.py:93-129` is an unconditional `while True`. The pull loop terminates via the DataLoader iterator (`datasets.py:275-290`), which breaks when Antenna's `/jobs/{id}/tasks/` REST endpoint returns an empty batch — not from NATS `num_pending` inspection.
- ADC is not the ACKer. `_process_batch` calls `result_poster.post_async()` (`worker.py:501`), which queues an HTTP POST to a thread pool with backpressure (max 5 pending) and returns immediately. Antenna receives the POST, runs `process_nats_pipeline_result`, and only then sends the NATS ACK on the `reply_subject` (`tasks.py:292`).
- The ACK chain is gated on the POST landing. Order in `process_nats_pipeline_result`: parse → SREM `pending:process` → save detections → SREM `pending:results` → progress update → ACK. If the POST itself never lands at Antenna, none of those steps run for that batch.
- The last-batch handoff is the smoking gun. When `/jobs/{id}/tasks/` returns empty, the DataLoader iterator breaks and `_process_job` calls `result_poster.wait_for_all_posts()` with a bounded timeout (~210s = 60 + 5*30). If any pending POST in the thread pool times out, errors out, or is force-cancelled (`result_posting.py:114-122`), the batch is abandoned silently — the worker logs "Done, detections:..." at INFO and moves on. NATS never sees the ACK; messages enter `num_ack_pending`, get redelivered up to `max_deliver`, then sit there until cleanup.
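The failure window described in the bullets above can be sketched end-to-end. This is an illustrative reduction, not the real `worker.py`/`result_posting.py` code: the class shape, the executor, and the way abandoned futures are counted are all assumptions built from the investigation notes.

```python
from concurrent.futures import ThreadPoolExecutor, wait

class ResultPosterSketch:
    """Reduction of ADC's result-POST pool: post_async queues, wait drains."""
    MAX_PENDING = 5              # backpressure limit from the notes
    WAIT_BUDGET = 60 + 5 * 30    # ~210 s bound from the notes

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=self.MAX_PENDING)
        self._futures = []

    def post_async(self, post_fn, batch):
        # Queues an HTTP POST and returns immediately. The NATS ACK only
        # happens inside Antenna *after* this POST lands, so an abandoned
        # future here means un-ACK'd messages on the NATS side.
        self._futures.append(self._pool.submit(post_fn, batch))

    def wait_for_all_posts(self, timeout=WAIT_BUDGET):
        done, not_done = wait(self._futures, timeout=timeout)
        # The bug window: anything still pending at the deadline, or done
        # with an exception, is dropped on the floor with no WARNING log.
        abandoned = list(not_done) + [f for f in done if f.exception()]
        landed = sum(1 for f in done if f.exception() is None)
        return landed, len(abandoned)
```

In this reduction the last batch of a job is exactly the one most likely to sit in `not_done` when the bounded wait expires, which matches the observed one-batch strand.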
This explains the "long jobs only" pattern: more batches → more chances for one to hit a transient HTTP blip → exactly-one-batch-worth of images stranded. Short jobs finish before the probability accumulates.
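The accumulation argument can be made concrete: if each batch's POST independently fails to land with some small probability p, the chance that a job strands at least one batch is 1 - (1 - p)^B over B batches. A back-of-envelope sketch, where p = 1% is a made-up illustrative value, not a measured rate:

```python
def strand_probability(p_post_failure: float, n_batches: int) -> float:
    """P(at least one batch abandoned), assuming independent POST failures."""
    return 1 - (1 - p_post_failure) ** n_batches

# Illustrative only: assume a 1% chance any single batch's POST never lands.
p = 0.01
for images, batch_size in [(32, 16), (544, 16), (5000, 16)]:
    batches = images // batch_size
    print(f"{images} images ({batches} batches): "
          f"P(strand) = {strand_probability(p, batches):.3f}")
```

Even a tiny per-batch failure rate compounds quickly with batch count, which is why short jobs rarely show the symptom while long ones almost always do.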
Verification gaps
The above is from code reading, not measurement. To confirm:
- Reproduce the strand deliberately (e.g. block Antenna's `/result/` endpoint mid-job, confirm exactly `batch_size` remains)
- Look for `wait_for_all_posts` timeout / `result_posting` cancellation lines around the moment of "Done, detections:..." in a known-stranded job
- Check whether `result_poster` has retry logic, and if so, what its budget is
Fix directions (rank-ordered, no decision yet)
- Make `wait_for_all_posts` retries explicit and bounded. If a POST fails after retries, the batch should be marked "POST failure" in ADC's own logs at WARNING+ so the failure is at least visible. (Cheapest, doesn't touch NATS at all.)
- Have ADC's worker shutdown explicitly NACK or terminate any in-flight messages so NATS gets a fast redeliver instead of waiting `ack_wait` × `max_deliver`. (Requires a NATS-aware shutdown path in ADC.)
- Have Antenna track POST receipt and emit a periodic "POST never landed for image X" log distinct from the existing "Job state keys not found" path. (Helps diagnose, doesn't fix.)
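The first fix direction could look roughly like the following. The retry count, backoff policy, and logger name are all assumptions for illustration, not existing ADC code:

```python
import logging
import time

log = logging.getLogger("adc.result_posting")  # assumed logger name

def post_with_bounded_retries(post_fn, batch, max_attempts=3, backoff_s=2.0):
    """Retry a result POST a bounded number of times. If it still fails,
    log at WARNING so the abandoned batch is visible in ADC's own logs
    instead of being dropped silently."""
    for attempt in range(1, max_attempts + 1):
        try:
            return post_fn(batch)
        except Exception as exc:
            if attempt == max_attempts:
                log.warning(
                    "POST failure: abandoning batch of %d after %d attempts: %s",
                    len(batch), max_attempts, exc,
                )
                return None
            time.sleep(backoff_s * attempt)  # simple linear backoff (assumption)
```

This doesn't prevent the strand on its own, but it converts a silent loss into a grep-able WARNING, which is the minimum needed to measure how often the window is actually hit.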
PR #1244's reconciler stays as the safety net regardless of which fix lands.
Related
- `docs/claude/processing-lifecycle.md`
- `docs/claude/debugging/chaos-scenarios.md` (Scenario F)