Symptom
Long-running ML jobs (>5 min observed) consistently end with exactly `worker.batch_size` images stuck in Redis pending sets — 8 if the ADC worker is configured for `batch_size=8`, 16 if 16. Short jobs (<5 min) don't show this.
Most recent example: demo job 88 (project 5), 544 images total, 16 stranded after the NATS stream drained. PR #1244's reconciler (now live on demo) marked them failed at the next `jobs_health_check` tick (~22 min idle), and the job landed in SUCCESS (16/544 = 2.9% < `FAILURE_THRESHOLD`):

```
[04:30:00] WARNING jobs_health_check: marked 16 image(s) as failed (job idle past cutoff;
NATS consumer num_pending=0 num_ack_pending=0 num_redelivered=16). IDs: [...]
```
`num_redelivered=16` matching the batch size is a strong signal that one full ADC pull was the unit lost. Without #1244, `check_stale_jobs` would have REVOKED the job and lost the 528 successful results.
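The stranded count can be checked directly against Redis after a job drains. A minimal diagnostic sketch: the job-scoped key pattern below is an assumption for illustration — only the set names `pending:process` / `pending:results` appear in the code itself — and `r` is expected to be a redis-py client.

```python
# Diagnostic sketch: count images still parked in a job's Redis pending sets.
# The "job:{id}:..." key pattern is an assumed schema, not confirmed code.

def pending_set_keys(job_id: int) -> dict[str, str]:
    """Map a readable label to the (assumed) Redis key for each pending set."""
    return {
        "pending:process": f"job:{job_id}:pending:process",
        "pending:results": f"job:{job_id}:pending:results",
    }

def stranded_counts(r, job_id: int) -> dict[str, int]:
    """Given a redis-py-style client `r`, return SCARD of each pending set."""
    return {label: r.scard(key) for label, key in pending_set_keys(job_id).items()}
```

A strand from this bug shows up as exactly one `batch_size` worth of IDs left behind once the NATS consumer reports `num_pending=0`.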
Why this matters
PR #1244 is the right safety net, but the underlying bug is deterministic. Any job big enough to trigger it loses exactly one batch, so the failed fraction is `batch_size`/N — it scales as 1/N with job size, invisible at small scale, painful at production scale. We should fix the root cause.
Refined hypothesis (based on code reading, not yet measured)
ADC's last in-flight batch is abandoned by the result-POST thread pool when it stalls or fails, leaving the NATS messages un-ACK'd because the ACK chain (Antenna SREM → ACK) only fires on a successful POST landing.
Architectural facts established by investigation:
- ADC does not poll NATS state. The `_worker_loop` in `worker.py:93-129` is an unconditional `while True`. The pull loop terminates via the DataLoader iterator (`datasets.py:275-290`), which breaks when Antenna's `/jobs/{id}/tasks/` REST endpoint returns an empty batch — not from NATS `num_pending` inspection.
- ADC is not the ACKer. `_process_batch` calls `result_poster.post_async()` (`worker.py:501`), which queues an HTTP POST to a thread pool with backpressure (max 5 pending) and returns immediately. Antenna receives the POST, runs `process_nats_pipeline_result`, and only then sends the NATS ACK on the `reply_subject` (`tasks.py:292`).
- The ACK chain is gated on the POST landing. Order in `process_nats_pipeline_result`: parse → SREM `pending:process` → save detections → SREM `pending:results` → progress update → ACK. If the POST itself never lands at Antenna, none of those steps run for that batch.
- The last-batch handoff is the smoking gun. When `/jobs/{id}/tasks/` returns empty, the DataLoader iterator breaks and `_process_job` calls `result_poster.wait_for_all_posts()` with a bounded timeout (~210s = 60 + 5*30). If any pending POST in the thread pool times out, errors out, or is force-cancelled (`result_posting.py:114-122`), the batch is abandoned silently — the worker logs "Done, detections:..." at INFO and moves on. NATS never sees the ACK; messages enter `num_ack_pending`, get redelivered up to `max_deliver`, then sit there until cleanup.
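The failure window described in the bullets above can be sketched end-to-end. This is an illustrative reduction, not the real `worker.py`/`result_posting.py` code: the class shape, the executor, and the way abandoned futures are counted are all assumptions built from the investigation notes.

```python
from concurrent.futures import ThreadPoolExecutor, wait

class ResultPosterSketch:
    """Reduction of ADC's result-POST pool: post_async queues, wait drains."""
    MAX_PENDING = 5              # backpressure limit from the notes
    WAIT_BUDGET = 60 + 5 * 30    # ~210 s bound from the notes

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=self.MAX_PENDING)
        self._futures = []

    def post_async(self, post_fn, batch):
        # Queues an HTTP POST and returns immediately. The NATS ACK only
        # happens inside Antenna *after* this POST lands, so an abandoned
        # future here means un-ACK'd messages on the NATS side.
        self._futures.append(self._pool.submit(post_fn, batch))

    def wait_for_all_posts(self, timeout=WAIT_BUDGET):
        done, not_done = wait(self._futures, timeout=timeout)
        # The bug window: anything still pending at the deadline, or done
        # with an exception, is dropped on the floor with no WARNING log.
        abandoned = list(not_done) + [f for f in done if f.exception()]
        landed = sum(1 for f in done if f.exception() is None)
        return landed, len(abandoned)
```

In this reduction the last batch of a job is exactly the one most likely to sit in `not_done` when the bounded wait expires, which matches the observed one-batch strand.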
This explains the "long jobs only" pattern: more batches → more chances for one to hit a transient HTTP blip → exactly-one-batch-worth of images stranded. Short jobs finish before the probability accumulates.
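The accumulation argument can be made concrete: if each batch's POST independently fails to land with some small probability p, the chance that a job strands at least one batch is 1 - (1 - p)^B over B batches. A back-of-envelope sketch, where p = 1% is a made-up illustrative value, not a measured rate:

```python
def strand_probability(p_post_failure: float, n_batches: int) -> float:
    """P(at least one batch abandoned), assuming independent POST failures."""
    return 1 - (1 - p_post_failure) ** n_batches

# Illustrative only: assume a 1% chance any single batch's POST never lands.
p = 0.01
for images, batch_size in [(32, 16), (544, 16), (5000, 16)]:
    batches = images // batch_size
    print(f"{images} images ({batches} batches): "
          f"P(strand) = {strand_probability(p, batches):.3f}")
```

Even a tiny per-batch failure rate compounds quickly with batch count, which is why short jobs rarely show the symptom while long ones almost always do.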
Verification gaps
The above is from code reading, not measurement. To confirm:
- Reproduce the strand deliberately (e.g. block Antenna's `/result/` endpoint mid-job, confirm exactly `batch_size` remains)
- Look for `wait_for_all_posts` timeout / `result_posting` cancellation lines around the moment of "Done, detections:..." in a known-stranded job
- Check whether `result_poster` has retry logic, and if so, what its budget is
Fix directions (rank-ordered, no decision yet)
- Make `wait_for_all_posts` retries explicit and bounded. If a POST fails after retries, the batch should be marked "POST failure" in ADC's own logs at WARNING+ so the failure is at least visible. (Cheapest, doesn't touch NATS at all.)
- Have ADC's worker shutdown explicitly NACK or terminate any in-flight messages so NATS gets a fast redeliver instead of waiting `ack_wait` × `max_deliver`. (Requires a NATS-aware shutdown path in ADC.)
- Have Antenna track POST receipt and emit a periodic "POST never landed for image X" log distinct from the existing "Job state keys not found" path. (Helps diagnose, doesn't fix.)
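The first fix direction could look roughly like the following. The retry count, backoff policy, and logger name are all assumptions for illustration, not existing ADC code:

```python
import logging
import time

log = logging.getLogger("adc.result_posting")  # assumed logger name

def post_with_bounded_retries(post_fn, batch, max_attempts=3, backoff_s=2.0):
    """Retry a result POST a bounded number of times. If it still fails,
    log at WARNING so the abandoned batch is visible in ADC's own logs
    instead of being dropped silently."""
    for attempt in range(1, max_attempts + 1):
        try:
            return post_fn(batch)
        except Exception as exc:
            if attempt == max_attempts:
                log.warning(
                    "POST failure: abandoning batch of %d after %d attempts: %s",
                    len(batch), max_attempts, exc,
                )
                return None
            time.sleep(backoff_s * attempt)  # simple linear backoff (assumption)
```

This doesn't prevent the strand on its own, but it converts a silent loss into a grep-able WARNING, which is the minimum needed to measure how often the window is actually hit.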
PR #1244's reconciler stays as the safety net regardless of which fix lands.
Related
- `docs/claude/processing-lifecycle.md`
- `docs/claude/debugging/chaos-scenarios.md` (Scenario F)