fix(jobs): mark failed or lost images as failed so job can be marked complete #1244
Two small primitives that the upcoming mark_lost_images_failed reconciler (next commit) needs to inspect running async_api jobs:

- `TaskQueueManager.get_consumer_state(job_id) -> ConsumerState | None` — a thin, mockable projection over `nats.js.api.ConsumerInfo`. Returns only `num_pending` / `num_ack_pending` / `num_redelivered`, the counters that matter for diagnostic logging. A missing stream or consumer returns `None` rather than raising, mirroring `log_consumer_stats_snapshot`'s tolerance: a job that was already cleaned up should not blow up a caller iterating over a batch.
- `AsyncJobStateManager.get_pending_image_ids() -> set[str]` — SUNION of both stage pending sets. Used by the reconciler to find ids that Redis still tracks as pending-for-processing. Returns an empty set on `RedisError` so one pooled-connection blip does not abort a sweep.

No call sites added in this commit. Unit tests for these primitives come in the reconciler commit, where they are exercised as part of the reconciliation path.

Co-Authored-By: Claude <noreply@anthropic.com>
When NATS exhausts `max_deliver=2` (per #1234) for a message whose processing kept failing non-retriably, JetStream stops redelivering but keeps the message in the consumer's `num_ack_pending` indefinitely. Redis still tracks the image id in `pending_images:{process,results}`. The job sits at <100% progress until `check_stale_jobs` REVOKEs it at 10 min, wiping all the legitimately processed work alongside the lost images.

Adds `mark_lost_images_failed()` — a new sub-check wired into `jobs_health_check` ahead of `check_stale_jobs`. For each running async_api job whose `updated_at` is older than `STALLED_JOBS_MAX_MINUTES` and whose Redis pending sets still hold ids, it SREMs those ids from pending, SADDs them to `failed_images`, and runs `_update_job_progress` for both stages. The existing completion logic (`FAILURE_THRESHOLD = 0.5`) then decides SUCCESS vs FAILURE on accurate per-image counts instead of REVOKE-ing the whole job.

Design decisions:

* The idle-threshold signal is `Job.updated_at`, not NATS consumer counters. JetStream does not clear `num_ack_pending` when `max_deliver` is hit — guarding on `num_ack_pending > 0` would block the exact production case this is meant to reconcile. Verified empirically with `chaos_monkey exhaust_max_deliver` (next commit) against a live NATS JetStream.
* `get_consumer_state()` is still called per candidate, but only to log current counters in the diagnostic appended to `progress.errors`. The reconcile decision itself does not read them.
* Runs BEFORE `_run_stale_jobs_check` so a reconcilable job lands in its natural completion state instead of being REVOKEd.

Tests: 8 `TransactionTestCase` cases in `TestMarkLostImagesFailed` covering the main job-2421 shape, the mixed-failure path, the >50% FAILURE threshold, the idle-threshold guard, and two cases (`num_pending > 0`, `num_ack_pending > 0`) that now reconcile because the idle cutoff is the load-bearing signal — previously (in earlier revisions of this branch) those were guarded noops.
Co-Authored-By: Claude <noreply@anthropic.com>
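The reconcile-then-decide flow in the commit above can be sketched with plain sets standing in for the Redis keys (illustrative; `FAILURE_THRESHOLD = 0.5` is the only value taken from the PR, and the function names here are hypothetical simplifications of `_reconcile_lost_images` and the existing completion logic):

```python
FAILURE_THRESHOLD = 0.5  # same threshold the existing completion logic uses


def reconcile_lost_images(pending: set[str], failed: set[str]) -> int:
    """Move every still-pending id into failed (SREM + SADD in the real code).

    Returns how many ids were newly marked, mirroring the SREM count."""
    lost = set(pending)           # snapshot: ids Redis still tracks as pending
    pending.clear()               # SREM from pending_images:{process,results}
    newly_failed = lost - failed  # SADD is idempotent on overlap
    failed |= lost
    return len(newly_failed)


def final_state(total_images: int, failed_count: int) -> str:
    # Existing completion rule: FAILURE only past the 50% threshold,
    # so a 17/20 job with 3 lost images still lands SUCCESS.
    return "FAILURE" if failed_count / total_images > FAILURE_THRESHOLD else "SUCCESS"
```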
Fault-injection tool for validating mark_lost_images_failed (and any
future reconciler logic) against a live NATS JetStream without needing
to run ADC and a real ML pipeline.
Given a job_id and a set of image ids, publishes one message per id on
the job's subject, then pulls without ACK and waits ack_wait — repeating
NATS_MAX_DELIVER times so each message hits its delivery budget. After
this the consumer sits in (num_pending=0, num_ack_pending=len(ids),
num_redelivered>0), which empirically is the post-exhaustion resting
state (JetStream does not clear num_ack_pending for messages that hit
max_deliver).
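The post-exhaustion resting state described above can be modeled with a toy counter simulation (illustrative only — real counters come from `nats.js.api.ConsumerInfo`; the model assumes `num_redelivered` counts attempts after the first delivery, which matches the empirical numbers quoted in this commit):

```python
def simulate_exhaustion(num_messages: int, max_deliver: int) -> dict[str, int]:
    """Toy model of the resting state: every delivery goes un-ACKed, so each
    message burns its delivery budget and then parks in ack-pending
    (JetStream does not clear that counter when max_deliver is hit)."""
    num_redelivered = 0
    for _ in range(num_messages):
        deliveries = max_deliver           # pull without ACK, wait ack_wait, repeat
        num_redelivered += deliveries - 1  # redeliveries exclude the first attempt
    return {
        "num_pending": 0,                  # nothing left undelivered
        "num_ack_pending": num_messages,   # all parked, never cleared
        "num_redelivered": num_redelivered,
    }
```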
Complements the existing 'flush redis' / 'flush nats' subcommands.
Takes ~NATS_MAX_DELIVER × (TASK_TTR + 3s) to run (default ≈ 66s).
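The quoted default runtime reduces to a small arithmetic sketch (the `TASK_TTR` value of 30 s is an assumption chosen to be consistent with the quoted ≈ 66 s default; the constant names mirror the settings referenced above but are hypothetical here):

```python
NATS_MAX_DELIVER = 2    # default per the referenced settings (assumed)
TASK_TTR_SECONDS = 30   # assumed default ack_wait / TTR, consistent with ~66s


def estimated_runtime_seconds(max_deliver: int = NATS_MAX_DELIVER,
                              ttr: int = TASK_TTR_SECONDS,
                              settle: int = 3) -> int:
    # Each delivery round waits out ack_wait (~TTR) plus a small settle margin.
    return max_deliver * (ttr + settle)
```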
Run either against a real dispatched job, or with --ensure-stream to
create the stream+consumer itself (useful for standalone reconciler
tests against a fake job_id).
Usage:

    python manage.py chaos_monkey exhaust_max_deliver \
        --job-id 999999 \
        --image-ids img-a,img-b,img-c \
        --ensure-stream
Co-Authored-By: Claude <noreply@anthropic.com>
… + chaos runbook

* processing-lifecycle.md — add a new failure-mode row describing the symptom (job stuck at <100% indefinitely, some images still in Redis pending sets, NATS consumer shows `num_redelivered > 0` and `num_ack_pending` stays elevated) and point at `mark_lost_images_failed` as the fix. Distinct from Bug A (pre-#1234 ACK/SREM ordering) — that was a crash-window race; this one is `max_deliver` giving up.
* chaos-scenarios.md — add Scenario F: a step-by-step runbook for validating the reconciler against a live stack using the new `chaos_monkey exhaust_max_deliver` subcommand. No code patches or celeryworker restart required, unlike Scenarios B/D which rely on sentinel-file patches.

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
Adds a reconciliation path to finalize async_api jobs that become stuck when NATS stops redelivering messages but Redis still tracks a small remainder of pending image IDs, so jobs land in their natural SUCCESS/FAILURE outcome (via the existing FAILURE_THRESHOLD) instead of being blanket-REVOKEd by the stale-job reaper.
Changes:
- Add a `mark_lost_images_failed()` sub-check (run before `check_stale_jobs`) that moves “lost” pending IDs into `failed_images` and drives progress completion via existing `_update_job_progress` logic.
- Add small supporting primitives: `ConsumerState` + `TaskQueueManager.get_consumer_state()`, and `AsyncJobStateManager.get_pending_image_ids()`.
- Add tests, a chaos-monkey scenario to exhaust `max_deliver`, and docs/runbook updates describing the new failure mode and recovery workflow.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/claude/processing-lifecycle.md | Documents the new “max_deliver exhaustion / pending IDs remain in Redis” failure mode and points to the reconciler fix. |
| docs/claude/debugging/chaos-scenarios.md | Adds a runbook scenario to reproduce max-deliver exhaustion and validate reconciliation behavior. |
| ami/ml/orchestration/nats_queue.py | Introduces ConsumerState and a tolerant get_consumer_state() helper for health-check diagnostics. |
| ami/ml/orchestration/async_job_state.py | Adds get_pending_image_ids() to query union of pending IDs across stages. |
| ami/jobs/tasks.py | Implements mark_lost_images_failed(), reconciliation helper, and wires it into jobs_health_check before the stale-job reaper. |
| ami/jobs/tests/test_tasks.py | Adds transaction-level regression tests covering reconciliation outcomes and integration ordering in jobs_health_check. |
| ami/jobs/tests/test_periodic_beat_tasks.py | Updates expected jobs_health_check result shape to include the new lost_images sub-check. |
| ami/jobs/management/commands/chaos_monkey.py | Adds exhaust_max_deliver subcommand to drive a consumer past max_deliver without ADC. |
The module defined ``logger = logging.getLogger(__name__)`` but never used it (all output goes through ``self.stdout.write``). flake8 F401/F841 would flag on a future lint pass. Spotted in Copilot review of #1244. Co-Authored-By: Claude <noreply@anthropic.com>
The new failure-mode row was committed with a placeholder PR reference; replace it now that #1244 is open. Co-Authored-By: Claude <noreply@anthropic.com>
Three Copilot review findings on #1244 rolled up:

- Re-validate inside ``select_for_update`` at the top of ``_reconcile_lost_images`` before issuing any Redis SREM/SADD. Closes the window where a late ``process_nats_pipeline_result`` arrives between ``mark_lost_images_failed``'s candidate selection and reconciliation, bumps ``updated_at``, and would otherwise cause counter inflation (the same image counted as both processed and failed).
- Cap the id list embedded in ``progress.errors`` at 10 entries plus "and N more". The full list still goes to ``job.logger.warning`` — recoverable from the UI's logs panel without bloating the JSONB field surfaced in the job detail payload.
- Fix the ``_run_mark_lost_images_failed_check`` docstring that claimed the reconciler skips when ``num_redelivered=0``. It does not — that guard was removed when we discovered NATS keeps exhausted messages in ``num_ack_pending`` indefinitely; the only skip conditions now are a missing consumer or no pending ids.

``_reconcile_lost_images`` now returns one of ``"marked_failed"``, ``"raced"``, or ``"state_disappeared"`` so the caller can record the distinction in the per-tick result dict.

Two new tests:

- ``test_reconcile_skips_when_job_updated_at_bumped_after_candidate_select`` — TDD-verified: fails without the new ``cutoff`` parameter.
- ``test_progress_errors_truncates_long_id_list`` — TDD-verified: fails without the new preview-limit constant.

Co-Authored-By: Claude <noreply@anthropic.com>
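The truncation rule above can be sketched as a small helper (the constant name `ERROR_PREVIEW_LIMIT` and the function name are hypothetical; only the 10-entry cap and the "and N more" suffix come from the commit):

```python
ERROR_PREVIEW_LIMIT = 10  # assumed name for the new preview-limit constant


def format_lost_ids(ids: list[str], limit: int = ERROR_PREVIEW_LIMIT) -> str:
    """Cap the id list embedded in progress.errors; in the real code the
    full list still goes to job.logger.warning."""
    if len(ids) <= limit:
        return ", ".join(ids)
    return ", ".join(ids[:limit]) + f" and {len(ids) - limit} more"
```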
@coderabbitai review now?

✅ Actions performed: review triggered.
Claude says: End-to-end validated in production on demo job 88 — the first real stuck-job recovery via this reconciler. The job had 544 images, 16 stranded after the NATS stream drained. What the validation confirms:
Ready for review.
Summary
When NATS gives up redelivering a message (after `max_deliver=2`) because the handler kept raising, Redis still tracks that image as pending. The job sits at something like 17/20 done indefinitely. At the 10-min `STALLED_JOBS_MAX_MINUTES` cutoff, `check_stale_jobs` REVOKEs the whole job — throwing away the 17 images of successful work alongside the 3 that are actually lost.

This PR adds a reconciler inside `jobs_health_check` that runs before `check_stale_jobs`: for each running async_api job that has been idle past the cutoff while Redis still tracks images as pending, mark those images as failed and let the existing completion logic land the job in SUCCESS or FAILURE based on the same `FAILURE_THRESHOLD = 0.5` used everywhere else. A 17/20 job lands SUCCESS with `failed=3`. A 2/20 job still lands FAILURE, but with the full stage breakdown visible.

Complements the three PRs merged this week (#1231, #1234, #1235). Those fixed the transient/terminal Redis boundary, the ACK/SREM ordering, and made the stale-job reaper faster. Those plus this close the loop: the reaper is now the last-resort safety net for truly unreachable jobs, not the primary recovery path for a single stuck image.
Real production symptom this addresses
demo.antenna.insectai.org/projects/5/jobs/83 — the job completed 6 short of 100% and will hang for 10 minutes before the whole job is marked FAILURE, rather than landing SUCCESS/FAILURE with only the 6 lost images marked failed.

What this is NOT

Not a `max_deliver` change. The 5→2 reduction in #1234 (fix(jobs): prevent jobs from hanging in STARTED state with no progress) is the right knob for persistent bad data; this PR is downstream of that setting and agnostic to its value.

Architecture
Matches the sub-check pattern established by #1239 (zombie streams) and #1235 (stale jobs):
- `mark_lost_images_failed()` in `ami/jobs/tasks.py` — scans running async_api jobs with `updated_at < now - STALLED_JOBS_MAX_MINUTES`, checks each for leftover Redis pending ids, and reconciles the ones it finds.
- `_reconcile_lost_images()` — SREMs the lost ids from both `pending_images:process` and `pending_images:results`, SADDs them to `failed_images`, and runs `_update_job_progress` for both stages so the existing completion path (which calls `cleanup_async_job_if_needed`) fires naturally.
- `_run_mark_lost_images_failed_check()` — wraps the call in an `IntegrityCheckResult` and adds it to `JobsHealthCheckResult`. Runs before `_run_stale_jobs_check` inside the `jobs_health_check` umbrella, so a reconcilable job never gets REVOKEd.

Reuses existing primitives — no new FAILURE path, no new threshold, no new Celery task, no new settings. Adds three small pieces:

- `ConsumerState` dataclass + `TaskQueueManager.get_consumer_state()` (diagnostic only — reads `num_pending`/`num_ack_pending`/`num_redelivered` for the log line).
- `AsyncJobStateManager.get_pending_image_ids()` (SUNION of both stage sets).
- `chaos_monkey exhaust_max_deliver` subcommand — drives a consumer past `max_deliver` without ADC for validation. Pairs with the existing `flush redis`/`flush nats` pattern from #1234's chaos runbook.

Design choice: the idle signal is
`Job.updated_at`, not NATS counters

Initial draft gated on `num_pending == 0 AND num_ack_pending == 0 AND num_redelivered > 0`. Empirical finding during e2e (`chaos_monkey exhaust_max_deliver` against a live NATS JetStream): after `max_deliver=2` is hit, the messages stay in `num_ack_pending` indefinitely — JetStream does not clear that counter until the stream is deleted. Three more `ack_wait` cycles: still `num_ack_pending=3`. This means a
`num_ack_pending > 0` guard would skip the exact production case. Removed both counter guards. The idle signal is now the one that actually correlates with stuckness: `Job.updated_at` is bumped on every successful result save, so a job that hasn't updated in 10+ min has genuinely stopped making progress — whatever the NATS counters say.

Why this is safe
Late deliveries after reconciliation are idempotent, as documented in #1234's
processing-lifecycle.md §2:

- `save_results` dedupes on `(detection, source_image)` — detections still land if ADC recovers and pushes the result.
- `SREM` on already-removed ids is a no-op, with `newly_removed == 0` gating counter accumulation, so detections/classifications/captures don't double.
- `_update_job_progress` clamps percentage with `max()` and preserves FAILURE once set — a late result can't regress progress or un-fail a job.

Worst case: a genuinely slow-but-alive ADC gets its 10-min-old images marked failed, but its late result still saves detections. The Job shows FAILURE in the UI; the detections exist in the DB — same as any partial-failure async job.
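The `newly_removed == 0` gating can be illustrated with plain sets (a sketch; the real code reads the return value of Redis `SREM`, and the function names here are hypothetical):

```python
def srem_like(s: set[str], *ids: str) -> int:
    """Redis SREM semantics on a plain set: returns how many were actually removed."""
    removed = sum(1 for i in ids if i in s)
    s.difference_update(ids)
    return removed


def apply_result(pending: set[str], counters: dict[str, int], image_id: str) -> None:
    # Only a *newly* removed id bumps the counters, so a duplicate or
    # late delivery of the same result is a harmless no-op.
    if srem_like(pending, image_id):
        counters["processed"] += 1
```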
Files
- `ami/ml/orchestration/nats_queue.py` — `ConsumerState` dataclass + `TaskQueueManager.get_consumer_state`
- `ami/ml/orchestration/async_job_state.py` — `AsyncJobStateManager.get_pending_image_ids`
- `ami/jobs/tasks.py` — `mark_lost_images_failed` + `_reconcile_lost_images` + `_run_mark_lost_images_failed_check` sub-check + `jobs_health_check` wiring
- `ami/jobs/tests/test_tasks.py` — `TestMarkLostImagesFailed` — 8 TransactionTestCase cases
- `ami/jobs/tests/test_periodic_beat_tasks.py` — `lost_images` field on `JobsHealthCheckResult`
- `ami/jobs/management/commands/chaos_monkey.py` — `exhaust_max_deliver` subcommand
- `docs/claude/processing-lifecycle.md`
- `docs/claude/debugging/chaos-scenarios.md`

Verified locally
Unit —
`ami.jobs` suite, 84 tests pass: `docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs --keepdb` — green. `TestMarkLostImagesFailed` (8 cases, `TransactionTestCase` with real Redis and real DB, `TaskQueueManager` mocked) specifically covers:

- `process.failed = 3`, `remaining = 0`; job lands SUCCESS (3/10 < threshold). Asserts stage params, Redis drained, diagnostic in `progress.errors`.
- Failures above `FAILURE_THRESHOLD` → job lands FAILURE not SUCCESS, with accurate per-image counts. Verifies the reconciler doesn't accidentally mask genuine failure.
- `explicit_failures` plus newly-SADDed `lost_ids` land as a single set; SADD is idempotent on accidental overlap. Verifies we don't inflate `failed` by counting overlap twice.
- A fresh `updated_at` (inside the cutoff window) skips reconciliation even when all other signals say "stuck". This is the one true guard.
- `num_pending > 0` and `num_ack_pending > 0` cases now reconcile (tests flipped from "noop" to "reconciles") because the empirical NATS behavior showed those counters are unreliable indicators.
- `num_redelivered=0` with stuck Redis ids (the pre-#1234 ACK-before-SREM signature) also reconciles cleanly. If that bug ever resurfaces, this reconciler catches it.
- `jobs_health_check` runs `lost_images` before `stale_jobs` so a reconcilable job lands in its natural terminal state instead of REVOKE.
test_periodic_beat_tasks.pyconfirm the newlost_imagesIntegrityCheckResultfield populates correctly across healthy / idle-deployment paths.E2E path 1 —
chaos_monkey exhaust_max_deliver+ reconciler against live stack (fake job_id)Validates what unit tests cannot mock: real NATS JetStream consumer behavior.
What it verified:
TaskQueueManager.get_consumer_state()returnsNonefor a non-existent consumer (tolerant path).get_consumer_state()returns correctly-populatedConsumerStateagainst a real consumer:num_pending=3, num_ack_pending=0, num_redelivered=0after publishing 3 messages pre-delivery.max_deliverexhaustion resting state isnum_pending=0, num_ack_pending=3, num_redelivered=3— and it stays there through three moreack_waitcycles. This is the finding that drove the guard-removal design change; it is now covered by a test in the suite.AsyncJobStateManager.get_pending_image_ids()returns the SUNION of both stage sets against real Redis._update_job_progress/cleanup_async_job_if_neededall fire in sequence. Redis cleaned up. NATS stream + consumer deleted on completion (via the existing cleanup trigger, not new code). Diagnostic appended toprogress.errorswith the real counter values.E2E path 2 — real ADC happy path
Dispatched job 1529: project 18, pipeline
quebec_vermont_moths_2023, 20-image collection,dispatch_mode=async_api. ADC worker running against the local stack, pulling real messages.What it verified:
updated_atstays fresh as batches land).jobs_health_checkumbrella task continues to run without interference from the new sub-check.E2E path 3 — real Job row stuck via
chaos_monkey(ADC stopped)Job 1530, real DB row referencing real pipeline + collection + project,
dispatch_mode=ASYNC_API,status=STARTED. Back-datedupdated_atby 15 min to cross the cutoff.chaos_monkey exhaust_max_deliver --job-id 1530 --image-ids img-x,img-y,img-z --ensure-streamdrove the consumer pastmax_deliver.What it verified:
- The real job passed the `updated_at < cutoff + dispatch_mode=ASYNC_API + status in running_states` filter (not a synthetic fake-job edge case).
- `process` and `results` land at 100% FAILURE (3/3 = 100% > threshold).
- `cleanup_async_job_if_needed` fires on `is_complete()`, draining Redis and deleting the NATS stream — proves the reconciler integrates correctly with the existing cleanup lifecycle.
- `progress.errors` contains the actual live counter values.

To verify in staging
Before any synthetic traffic
- Watch `jobs_health_check` beat-task duration on the first few ticks post-deploy — confirm the new sub-check doesn't materially increase tick time on an idle stack (expected: tens of ms).
- Confirm the `lost_images` `IntegrityCheckResult` counters on healthy jobs: `checked=0, fixed=0, unfixable=0` — anything non-zero without a matching real stuck job warrants investigation.

Real-traffic observation (first 24-72h)
- Watch for `jobs_health_check: marked N image(s) as failed (job idle past cutoff; ...)` WARN lines in celeryworker logs. Expected: rare (only when a real handler failure + `max_deliver` exhaustion happens). Any spike should correlate with a visible cluster of stuck jobs.
- `check_stale_jobs` REVOKEs should drop relative to pre-deploy — partial-failure jobs now finalize through the reconciler instead of being reaped.
- Jobs landing FAILURE via the reconciler should have a populated `progress.errors` starting with `jobs_health_check:` — visible both via API and in the Job-detail UI.
- Cross-check any `fixed > 0` entry against the job's actual progress — if the job was still making progress within 10 min, that's a bug.

Targeted scenario — reproduce a real max_deliver exhaustion
Production can't wait 10 min reliably, so this is best done in staging with a controlled job:
- Use a small collection (so `filter_processed_images` doesn't strip everything).
- Wait past the cutoff (`STALLED_JOBS_MAX_MINUTES`).
- Confirm the `jobs_health_check` beat tick runs `mark_lost_images_failed` before `check_stale_jobs`.
- Confirm `progress.errors` has the diagnostic, the NATS stream is cleaned up, the Redis keys are cleaned up, and there is no REVOKE in the Celery task state for this job.
- Confirm no late writes via `save_results` or `process_nats_pipeline_result`.

Sanity checks to run manually on any stuck-looking job post-deploy
From
`docs/claude/processing-lifecycle.md` §5 (copy-paste block, replace `JOB_ID`):

- DB: job status + progress (including the `errors` trailer).
- Redis: `SCARD` on both pending sets + `failed_images`.
- NATS: consumer counters (`num_pending`, `num_ack_pending`, `num_redelivered`).

A job showing
`status=STARTED + process/results < 100% + updated_at < 10 min old` is not reconcilable — that's an in-flight job, possibly slow. A job showing `status=STARTED + Redis pending > 0 + updated_at > 10 min old` is reconcilable and should have flipped to SUCCESS/FAILURE on the most recent beat tick.

Relationship to concurrent / recently-merged work
- #1231 — the transient/terminal Redis boundary (`_fail_job`). This PR relies on the crisp `update_state` contract it established.
- #1234 — the `task_failure` guard, `max_deliver` 5→2. This PR's raison d'être is the downstream symptom of `max_deliver=2` exhaustion. Touches the same `processing-lifecycle.md` section (new row).
- #1235 — split `FAILED_CUTOFF_HOURS` into display-hours vs. reaper-minutes. The reaper runs at 10 min. This PR's reconciler runs before the reaper at the same 10-min cutoff so jobs land naturally rather than being REVOKEd.
- #1241 — `progress.errors`. Both PRs append to `progress.errors`; the formats differ but coexist without conflict. If #1241 (feat(jobs): make missing-Redis-state FAILUREs loud and self-diagnosing) lands first, a small `_append_progress_error(job, reason)` helper could consolidate the two sites.

Generated with Claude Code