
fix(jobs): mark failed or lost images as failed so job can be marked complete #1244

Merged
mihow merged 8 commits into main from mark-lost-images-failed
Apr 17, 2026

Conversation

@mihow
Collaborator

@mihow mihow commented Apr 17, 2026

Summary

When NATS gives up redelivering a message (after max_deliver=2) because the handler kept raising, Redis still tracks that image as pending. The job sits at something like 17/20 done indefinitely. At the 10-min STALLED_JOBS_MAX_MINUTES cutoff, check_stale_jobs REVOKEs the whole job — throwing away the 17 images of successful work alongside the 3 that are actually lost.

This PR adds a reconciler inside jobs_health_check that runs before check_stale_jobs: for each running async_api job that has been idle past the cutoff while Redis still tracks images as pending, mark those images as failed and let the existing completion logic land the job in SUCCESS or FAILURE based on the same FAILURE_THRESHOLD = 0.5 used everywhere else. A 17/20 job lands SUCCESS with failed=3. A 2/20 job still lands FAILURE, but with the full stage breakdown visible.
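The completion rule described above can be sketched as follows. This is a simplified illustration only: `FAILURE_THRESHOLD = 0.5` and the SUCCESS/FAILURE outcomes come from the PR text, but the function name and signature are hypothetical, not the real `_update_job_progress` logic.

```python
# Simplified sketch of the completion rule described above.
# FAILURE_THRESHOLD comes from the PR text; everything else is illustrative.
FAILURE_THRESHOLD = 0.5


def final_status(total_images: int, failed_images: int) -> str:
    """A job lands FAILURE only when the failure ratio exceeds the threshold."""
    if total_images == 0:
        return "SUCCESS"
    ratio = failed_images / total_images
    return "FAILURE" if ratio > FAILURE_THRESHOLD else "SUCCESS"


# The two examples from the summary:
print(final_status(20, 3))   # 17/20 done, 3 lost -> SUCCESS
print(final_status(20, 18))  # 2/20 done, 18 lost -> FAILURE
```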

Complements the three PRs merged this week (#1231, #1234, #1235). Those fixed the transient/terminal Redis boundary, the ACK/SREM ordering, and made the stale-job reaper faster. Those plus this close the loop: the reaper is now the last-resort safety net for truly unreachable jobs, not the primary recovery path for a single stuck image.

Real production symptom this addresses

demo.antenna.insectai.org/projects/5/jobs/83 — the job completed 6 images short of 100% and hangs for 10 minutes before the whole job is marked FAILURE, instead of landing SUCCESS or FAILURE based on the 6 lost images alone.

What this is NOT

Architecture

Matches the sub-check pattern established by #1239 (zombie streams) and #1235 (stale jobs):

  1. mark_lost_images_failed() in ami/jobs/tasks.py — scans running async_api jobs with updated_at < now - STALLED_JOBS_MAX_MINUTES, checks each for leftover Redis pending ids, and reconciles the ones it finds.
  2. _reconcile_lost_images() — SREMs the lost ids from both pending_images:process and pending_images:results, SADDs them to failed_images, and runs _update_job_progress for both stages so the existing completion path (which calls cleanup_async_job_if_needed) fires naturally.
  3. _run_mark_lost_images_failed_check() — wraps the call in an IntegrityCheckResult and adds it to JobsHealthCheckResult. Runs before _run_stale_jobs_check inside the jobs_health_check umbrella, so a reconcilable job never gets REVOKEd.

Reuses existing primitives — no new FAILURE path, no new threshold, no new Celery task, no new settings. Adds three small pieces:

  • ConsumerState dataclass + TaskQueueManager.get_consumer_state() (diagnostic only — reads num_pending / num_ack_pending / num_redelivered for the log line).
  • AsyncJobStateManager.get_pending_image_ids() (SUNION of both stage sets).
  • chaos_monkey exhaust_max_deliver subcommand — drives a consumer past max_deliver without ADC for validation. Pairs with the existing flush redis / flush nats pattern from #1234's chaos runbook.

Design choice: the idle signal is Job.updated_at, not NATS counters

Initial draft gated on num_pending == 0 AND num_ack_pending == 0 AND num_redelivered > 0. Empirical finding during e2e (chaos_monkey exhaust_max_deliver against a live NATS JetStream): after max_deliver=2 is hit, the messages stay in num_ack_pending indefinitely — JetStream does not clear that counter until the stream is deleted. Three more ack_wait cycles: still num_ack_pending=3.

This means a num_ack_pending > 0 guard would skip the exact production case. Removed both counter guards. The idle signal is now the one that actually correlates with stuckness: Job.updated_at is bumped on every successful result save, so a job that hasn't updated in 10+ min has genuinely stopped making progress — whatever the NATS counters say.
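The guard change can be made concrete with the counter values reported in the e2e finding. The initial draft's gate is reconstructed from the text above; the empirical post-exhaustion counters (num_pending=0, num_ack_pending=3, num_redelivered=3) are the ones the PR reports.

```python
# Illustration of why the counter-based gate was removed: it skips the
# empirical post-exhaustion resting state. Counter values come from the
# e2e finding described above; the gate function is a reconstruction.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerState:
    num_pending: int
    num_ack_pending: int
    num_redelivered: int


def draft_gate_would_reconcile(s: ConsumerState) -> bool:
    """The initial draft's (since removed) counter gate."""
    return s.num_pending == 0 and s.num_ack_pending == 0 and s.num_redelivered > 0


# Empirical resting state after max_deliver=2 exhaustion: the gate says
# "not reconcilable" for the exact production case it was meant to catch.
exhausted = ConsumerState(num_pending=0, num_ack_pending=3, num_redelivered=3)
print(draft_gate_would_reconcile(exhausted))  # prints False
```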

Why this is safe

Late deliveries after reconciliation are idempotent, as documented in #1234's processing-lifecycle.md §2:

  • save_results dedupes on (detection, source_image) — detections still land if ADC recovers and pushes the result.
  • SREM on already-removed ids is a no-op with newly_removed == 0 gating counter accumulation, so detections/classifications/captures don't double.
  • _update_job_progress clamps percentage with max() and preserves FAILURE once set — a late result can't regress progress or un-fail a job.

Worst case: a genuinely slow-but-alive ADC gets its 10-min-old images marked failed, but its late result still saves detections. The Job shows FAILURE in the UI; the detections exist in the DB — same as any partial-failure async job.
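The `newly_removed == 0` gating described in the second bullet can be shown with a tiny sketch: Redis SREM reports how many members were actually present, so a late redelivery that races the reconciler contributes nothing to the counters. The function below is an illustrative stand-in, not the real counter code.

```python
# Sketch of the "SREM on already-removed ids is a no-op" guard described
# above: counters accumulate only for ids actually removed this call.
def remove_and_count(pending: set, ids: set) -> int:
    """Mimics SREM's return value: the number of members that were present."""
    newly_removed = len(pending & ids)
    pending -= ids
    return newly_removed


counts = {"processed": 0}
pending = {"a", "b", "c"}
counts["processed"] += remove_and_count(pending, {"a"})  # first delivery: counts
counts["processed"] += remove_and_count(pending, {"a"})  # late redelivery: no-op
```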

Files

File Change
ami/ml/orchestration/nats_queue.py ConsumerState dataclass + TaskQueueManager.get_consumer_state
ami/ml/orchestration/async_job_state.py AsyncJobStateManager.get_pending_image_ids
ami/jobs/tasks.py mark_lost_images_failed + _reconcile_lost_images + _run_mark_lost_images_failed_check sub-check + jobs_health_check wiring
ami/jobs/tests/test_tasks.py TestMarkLostImagesFailed — 8 TransactionTestCase cases
ami/jobs/tests/test_periodic_beat_tasks.py Cover the new lost_images field on JobsHealthCheckResult
ami/jobs/management/commands/chaos_monkey.py exhaust_max_deliver subcommand
docs/claude/processing-lifecycle.md New failure-mode row
docs/claude/debugging/chaos-scenarios.md Scenario F runbook

Verified locally

Unit — ami.jobs suite, 84 tests pass

docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs --keepdb — green.

TestMarkLostImagesFailed (8 cases, TransactionTestCase with real Redis and real DB, TaskQueueManager mocked) specifically covers:

  • Main job-2421 shape — 7/10 processed, 1 explicit failure, 2 NATS-lost → reconciler SADDs 2 to failed, SREMs 2 from pending; final process.failed = 3, remaining = 0; job lands SUCCESS (3/10 < threshold). Asserts stage params, Redis drained, diagnostic in progress.errors.
  • >50% FAILURE path — failure ratio above FAILURE_THRESHOLD → job lands FAILURE not SUCCESS, with accurate per-image counts. Verifies the reconciler doesn't accidentally mask genuine failure.
  • Mixed failures without double-counting — already-SADDed explicit_failures plus newly-SADDed lost_ids land as a single set; SADD is idempotent on accidental overlap. Verifies we don't inflate failed by counting overlap twice.
  • Idle cutoff is load-bearing — a job with fresh updated_at (inside the cutoff window) skips reconciliation even when all other signals say "stuck". This is the one true guard.
  • Consumer-state counters don't gate — both num_pending > 0 and num_ack_pending > 0 cases now reconcile (tests flipped from "noop" to "reconciles") because the empirical NATS behavior showed those counters are unreliable indicators.
  • Bug A resurrection case — num_redelivered=0 with stuck Redis ids (the pre-#1234 ACK-before-SREM signature) also reconciles cleanly. If that bug ever resurfaces, this reconciler catches it.
  • Sub-check ordering — verifies jobs_health_check runs lost_images before stale_jobs so a reconcilable job lands in its natural terminal state instead of REVOKE.

Updates to test_periodic_beat_tasks.py confirm the new lost_images IntegrityCheckResult field populates correctly across healthy / idle-deployment paths.

E2E path 1 — chaos_monkey exhaust_max_deliver + reconciler against live stack (fake job_id)

Validates what unit tests cannot mock: real NATS JetStream consumer behavior.

What it verified:

  • TaskQueueManager.get_consumer_state() returns None for a non-existent consumer (tolerant path).
  • get_consumer_state() returns correctly-populated ConsumerState against a real consumer: num_pending=3, num_ack_pending=0, num_redelivered=0 after publishing 3 messages pre-delivery.
  • The empirical max_deliver exhaustion resting state is num_pending=0, num_ack_pending=3, num_redelivered=3 — and it stays there through three more ack_wait cycles. This is the finding that drove the guard-removal design change; it is now covered by a test in the suite.
  • AsyncJobStateManager.get_pending_image_ids() returns the SUNION of both stage sets against real Redis.
  • Full reconciler path end-to-end: SREM / SADD / _update_job_progress / cleanup_async_job_if_needed all fire in sequence. Redis cleaned up. NATS stream + consumer deleted on completion (via the existing cleanup trigger, not new code). Diagnostic appended to progress.errors with the real counter values.

E2E path 2 — real ADC happy path

Dispatched job 1529: project 18, pipeline quebec_vermont_moths_2023, 20-image collection, dispatch_mode=async_api. ADC worker running against the local stack, pulling real messages.

What it verified:

  • Job runs to SUCCESS in 18s: 20 captures, 226 detections, 360 classifications.
  • The new sub-check correctly does not fire on a healthy job (candidate list is empty because updated_at stays fresh as batches land).
  • jobs_health_check umbrella task continues to run without interference from the new sub-check.

E2E path 3 — real Job row stuck via chaos_monkey (ADC stopped)

Job 1530, real DB row referencing real pipeline + collection + project, dispatch_mode=ASYNC_API, status=STARTED. Back-dated updated_at by 15 min to cross the cutoff. chaos_monkey exhaust_max_deliver --job-id 1530 --image-ids img-x,img-y,img-z --ensure-stream drove the consumer past max_deliver.

What it verified:

  • Reconciler picks up the real Job row via the updated_at < cutoff + dispatch_mode=ASYNC_API + status in running_states filter (not a synthetic fake-job edge case).
  • All three stages transition: process and results land at 100% FAILURE (3/3 = 100% > threshold).
  • cleanup_async_job_if_needed fires on is_complete(), draining Redis and deleting the NATS stream — proves the reconciler integrates correctly with the existing cleanup lifecycle.
  • Diagnostic line in progress.errors contains the actual live counter values.

To verify in staging

Before any synthetic traffic

  • jobs_health_check beat-task duration on the first few ticks post-deploy — confirm the new sub-check doesn't materially increase tick time on an idle stack (expected: tens of ms).
  • lost_images IntegrityCheckResult counters on healthy jobs: checked=0, fixed=0, unfixable=0 — anything non-zero without a matching real stuck job warrants investigation.

Real-traffic observation (first 24-72h)

  • Rate of jobs_health_check: marked N image(s) as failed (job idle past cutoff; ...) WARN lines in celeryworker logs. Expected: rare (only when a real handler failure + max_deliver exhaustion happens). Any spike should correlate with a visible cluster of stuck jobs.
  • Rate of check_stale_jobs REVOKEs should drop relative to pre-deploy — partial-failure jobs now finalize through the reconciler instead of being reaped.
  • Any job that lands in FAILURE via the reconciler should have a populated progress.errors starting with jobs_health_check: — visible both via API and in the Job-detail UI.
  • No false positives: a mid-flight job that's just slow should not be touched. Double-check any fixed > 0 entry against the job's actual progress — if the job was still making progress within 10 min, that's a bug.

Targeted scenario — reproduce a real max_deliver exhaustion

Production can't wait 10 min reliably, so this is best done in staging with a controlled job:

  1. Dispatch a real async_api job with a fresh collection (ensure filter_processed_images doesn't strip everything).
  2. While the job is mid-flight (process stage 30-60%), stop the ADC worker. NATS begins to backfill redelivery attempts on in-flight messages.
  3. Wait 11 min (past STALLED_JOBS_MAX_MINUTES). jobs_health_check beat tick runs mark_lost_images_failed before check_stale_jobs.
  4. Verify: job lands in SUCCESS or FAILURE based on how many images were successfully processed before ADC stopped. progress.errors has the diagnostic. NATS stream cleaned up. Redis keys cleaned up. No REVOKE in the Celery task state for this job.
  5. Repeat with ADC returning after the reconciler fired — late results should save detections idempotently (see "Why this is safe" above) even though the job is already terminal. No errors logged from save_results or process_nats_pipeline_result.

Sanity checks to run manually on any stuck-looking job post-deploy

From docs/claude/processing-lifecycle.md §5 (copy-paste block, replace JOB_ID):

  • Job row state (status, progress, errors trailer).
  • Redis state (SCARD on both pending sets + failed_images).
  • NATS consumer state (num_pending, num_ack_pending, num_redelivered).
  • Last 50 log lines referencing the job.

A job showing status=STARTED + process/results < 100% + updated_at < 10 min old is not reconcilable — that's an in-flight job, possibly slow. A job showing status=STARTED + Redis pending > 0 + updated_at > 10 min old is reconcilable and should have flipped to SUCCESS/FAILURE on the most recent beat tick.
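The triage rule in the paragraph above reduces to a simple predicate. This is a hedged sketch: the 10-minute cutoff and the three signals come from the text, but the function and its parameter names are illustrative.

```python
# The manual triage rule above as a predicate. Signals and the 10-minute
# cutoff come from the runbook text; the function itself is illustrative.
from datetime import datetime, timedelta, timezone


def is_reconcilable(
    status: str,
    redis_pending_count: int,
    updated_at: datetime,
    cutoff_minutes: int = 10,
) -> bool:
    """Reconcilable = STARTED + leftover Redis pending ids + idle past cutoff."""
    idle = datetime.now(timezone.utc) - updated_at > timedelta(minutes=cutoff_minutes)
    return status == "STARTED" and redis_pending_count > 0 and idle
```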

Relationship to concurrent / recently-merged work


Generated with Claude Code

mihow and others added 4 commits April 16, 2026 23:24
Two small primitives the upcoming mark_lost_images_failed reconciler
(next commit) needs to inspect running async_api jobs.

TaskQueueManager.get_consumer_state(job_id) → ConsumerState | None

  Thin, mockable projection over nats.js.api.ConsumerInfo. Returns only
  num_pending / num_ack_pending / num_redelivered — the counters that
  matter for diagnostic logging. A missing stream/consumer returns None
  rather than raising, mirroring log_consumer_stats_snapshot's
  tolerance: a job that was already cleaned up should not blow up a
  caller iterating over a batch.

AsyncJobStateManager.get_pending_image_ids() → set[str]

  SUNION of both stage pending sets. Used by the reconciler to find ids
  that Redis still tracks as pending-for-processing. Returns empty set
  on RedisError so one pooled-connection blip does not abort a sweep.

No call sites added in this commit. Unit tests for these primitives
come in the reconciler commit, where they are exercised as part of the
reconciliation path.

Co-Authored-By: Claude <noreply@anthropic.com>
When NATS exhausts max_deliver=2 (per #1234) for a message whose
processing kept failing non-retriably, JetStream stops redelivering but
keeps the message in the consumer's num_ack_pending indefinitely. Redis
still tracks the image id in pending_images:{process,results}. The job
sits at <100% progress until check_stale_jobs REVOKEs it at 10 min,
wiping all the legitimately processed work alongside the lost images.

Adds mark_lost_images_failed() — a new sub-check wired into
jobs_health_check ahead of check_stale_jobs. For each running async_api
job whose updated_at is older than STALLED_JOBS_MAX_MINUTES and whose
Redis pending sets still hold ids, SREMs those ids from pending, SADDs
them to failed_images, and runs _update_job_progress for both stages.
The existing completion logic (FAILURE_THRESHOLD = 0.5) then decides
SUCCESS vs FAILURE on accurate per-image counts instead of REVOKE-ing
the whole job.

Design decisions:

* Idle-threshold signal is Job.updated_at, not NATS consumer counters.
  JetStream does not clear num_ack_pending when max_deliver is hit —
  guarding on num_ack_pending > 0 would block the exact production case
  this is meant to reconcile. Verified empirically with chaos_monkey
  exhaust_max_deliver (next commit) against a live NATS JetStream.
* get_consumer_state() is still called per candidate, but only to log
  current counters in the diagnostic appended to progress.errors. The
  reconcile decision itself does not read them.
* Runs BEFORE _run_stale_jobs_check so a reconcilable job lands in its
  natural completion state instead of being REVOKEd.

Tests:

* 8 TransactionTestCase cases in TestMarkLostImagesFailed covering the
  main job-2421 shape, the mixed-failure path, the >50% FAILURE
  threshold, the idle-threshold guard, and two cases (num_pending > 0,
  num_ack_pending > 0) that now reconcile because the idle cutoff is
  the load-bearing signal — previously (in earlier revisions of this
  branch) those were guarded noops.

Co-Authored-By: Claude <noreply@anthropic.com>
Fault-injection tool for validating mark_lost_images_failed (and any
future reconciler logic) against a live NATS JetStream without needing
to run ADC and a real ML pipeline.

Given a job_id and a set of image ids, publishes one message per id on
the job's subject, then pulls without ACK and waits ack_wait — repeating
NATS_MAX_DELIVER times so each message hits its delivery budget. After
this the consumer sits in (num_pending=0, num_ack_pending=len(ids),
num_redelivered>0), which empirically is the post-exhaustion resting
state (JetStream does not clear num_ack_pending for messages that hit
max_deliver).

Complements the existing 'flush redis' / 'flush nats' subcommands.
Takes ~NATS_MAX_DELIVER × (TASK_TTR + 3s) to run (default ≈ 66s).

Run either against a real dispatched job, or with --ensure-stream to
create the stream+consumer itself (useful for standalone reconciler
tests against a fake job_id).

Usage:

    python manage.py chaos_monkey exhaust_max_deliver \
        --job-id 999999 \
        --image-ids img-a,img-b,img-c \
        --ensure-stream

Co-Authored-By: Claude <noreply@anthropic.com>
… + chaos runbook

* processing-lifecycle.md — add a new failure-mode row describing the
  symptom (job stuck at <100% indefinitely, some images still in Redis
  pending sets, NATS consumer shows num_redelivered > 0 and
  num_ack_pending stays elevated) and point at mark_lost_images_failed
  as the fix. Distinct from Bug A (pre-#1234 ACK/SREM ordering) — that
  was a crash-window race; this one is max_deliver giving up.
* chaos-scenarios.md — add Scenario F: a step-by-step runbook for
  validating the reconciler against a live stack using the new
  chaos_monkey exhaust_max_deliver subcommand. No code patches or
  celeryworker restart required, unlike Scenarios B/D which rely on
  sentinel-file sentinel patches.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 06:26
@netlify

netlify bot commented Apr 17, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 9c38aa5
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69e1e25c9cc1590008672ed2

@netlify

netlify bot commented Apr 17, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 9c38aa5
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69e1e25cbab1b40008278e0c

@coderabbitai
Contributor

coderabbitai bot commented Apr 17, 2026

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 44 minutes and 30 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 44 minutes and 30 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34263b09-3024-4a86-b1ba-6b5156e0d36c

📥 Commits

Reviewing files that changed from the base of the PR and between 585cd55 and 9c38aa5.

📒 Files selected for processing (8)
  • ami/jobs/management/commands/chaos_monkey.py
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_periodic_beat_tasks.py
  • ami/jobs/tests/test_tasks.py
  • ami/ml/orchestration/async_job_state.py
  • ami/ml/orchestration/nats_queue.py
  • docs/claude/debugging/chaos-scenarios.md
  • docs/claude/processing-lifecycle.md


Contributor

Copilot AI left a comment


Pull request overview

Adds a reconciliation path to finalize async_api jobs that become stuck when NATS stops redelivering messages but Redis still tracks a small remainder of pending image IDs, so jobs land in their natural SUCCESS/FAILURE outcome (via the existing FAILURE_THRESHOLD) instead of being blanket-REVOKEd by the stale-job reaper.

Changes:

  • Add a mark_lost_images_failed() sub-check (run before check_stale_jobs) that moves “lost” pending IDs into failed_images and drives progress completion via existing _update_job_progress logic.
  • Add small supporting primitives: ConsumerState + TaskQueueManager.get_consumer_state(), and AsyncJobStateManager.get_pending_image_ids().
  • Add tests, a chaos-monkey scenario to exhaust max_deliver, and docs/runbook updates describing the new failure mode and recovery workflow.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
docs/claude/processing-lifecycle.md Documents the new “max_deliver exhaustion / pending IDs remain in Redis” failure mode and points to the reconciler fix.
docs/claude/debugging/chaos-scenarios.md Adds a runbook scenario to reproduce max-deliver exhaustion and validate reconciliation behavior.
ami/ml/orchestration/nats_queue.py Introduces ConsumerState and a tolerant get_consumer_state() helper for health-check diagnostics.
ami/ml/orchestration/async_job_state.py Adds get_pending_image_ids() to query union of pending IDs across stages.
ami/jobs/tasks.py Implements mark_lost_images_failed(), reconciliation helper, and wires it into jobs_health_check before the stale-job reaper.
ami/jobs/tests/test_tasks.py Adds transaction-level regression tests covering reconciliation outcomes and integration ordering in jobs_health_check.
ami/jobs/tests/test_periodic_beat_tasks.py Updates expected jobs_health_check result shape to include the new lost_images sub-check.
ami/jobs/management/commands/chaos_monkey.py Adds exhaust_max_deliver subcommand to drive a consumer past max_deliver without ADC.


Comment thread ami/jobs/tasks.py Outdated
Comment thread ami/jobs/management/commands/chaos_monkey.py Outdated
Comment thread docs/claude/processing-lifecycle.md Outdated
Comment thread ami/ml/orchestration/async_job_state.py
Comment thread ami/jobs/tasks.py
Comment thread ami/jobs/tasks.py Outdated
@mihow mihow changed the title fix(jobs): reconcile async jobs stuck on NATS-lost images fix(jobs): mark failed or lost images as failed so job can finish Apr 17, 2026
@mihow mihow changed the title fix(jobs): mark failed or lost images as failed so job can finish fix(jobs): mark failed or lost images as failed so job can be marked complete Apr 17, 2026
mihow and others added 3 commits April 17, 2026 00:20
The module defined ``logger = logging.getLogger(__name__)`` but never used it
(all output goes through ``self.stdout.write``). flake8 F401/F841 would flag
on a future lint pass. Spotted in Copilot review of #1244.

Co-Authored-By: Claude <noreply@anthropic.com>
The new failure-mode row was committed with a placeholder PR reference;
replace it now that #1244 is open.

Co-Authored-By: Claude <noreply@anthropic.com>
Three Copilot review findings on #1244 rolled up:

- Re-validate inside ``select_for_update`` at the top of
  ``_reconcile_lost_images`` before issuing any Redis SREM/SADD. Closes the
  window where a late ``process_nats_pipeline_result`` arrives between
  ``mark_lost_images_failed``'s candidate selection and reconciliation,
  bumps ``updated_at``, and would otherwise cause counter inflation
  (same image counted as both processed and failed).

- Cap the id list embedded in ``progress.errors`` at 10 entries plus
  "and N more". The full list still goes to ``job.logger.warning`` —
  recoverable from the UI's logs panel without bloating the JSONB field
  surfaced in the job detail payload.

- Fix the ``_run_mark_lost_images_failed_check`` docstring that claimed
  the reconciler skips when ``num_redelivered=0``. It does not — that
  guard was removed when we discovered NATS keeps exhausted messages
  in ``num_ack_pending`` indefinitely; the only skip conditions now are
  consumer missing or no pending ids.

``_reconcile_lost_images`` now returns one of ``"marked_failed"``,
``"raced"``, or ``"state_disappeared"`` so the caller can record the
distinction in the per-tick result dict.

Two new tests:

- ``test_reconcile_skips_when_job_updated_at_bumped_after_candidate_select``
  — TDD-verified: fails without the new ``cutoff`` parameter.
- ``test_progress_errors_truncates_long_id_list``
  — TDD-verified: fails without the new preview-limit constant.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator Author

mihow commented Apr 17, 2026

@coderabbitai review now?

@coderabbitai
Contributor

coderabbitai bot commented Apr 17, 2026

@mihow Sure! I'll kick off a review of this PR right now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@mihow
Collaborator Author

mihow commented Apr 17, 2026

Claude says: End-to-end validated in production on demo job 88 — first real stuck-job recovery via this reconciler.

Job had 544 images, 16 stranded after the NATS stream drained (last Successfully saved results at 04:07:49, last result counter processed=536/544). Reconciler fired on the next */15 beat tick at 04:30:00 (~22 min idle, past the 10-min STALLED_JOBS_MAX_MINUTES cutoff):

[04:30:00] WARNING jobs_health_check: marked 16 image(s) as failed (job idle past cutoff;
  NATS consumer num_pending=0 num_ack_pending=0 num_redelivered=16).
  IDs: ['2868', '2879', '2951', '2959', '2982', '3079', '3142', '3265', '3328', '3396']
       ... and 6 more
  (full IDs: ['2868', '2879', '2951', '2959', '2982', '3079', '3142', '3265', '3328',
              '3396', '3484', '3534', '3643', '3864', '3923', '4255'])
[04:30:00] INFO is_complete()=True after stage='results' update; firing cleanup.
           Stages: collect=100.0% SUCCESS, process=100.0% SUCCESS, results=100.0% SUCCESS
[04:30:00] INFO Cleaned up NATS resources for job 88
[04:30:00] INFO Deleted NATS stream job_88 for job '88'
[04:30:00] INFO Deleted NATS consumer job-88-consumer for job '88'
[04:30:00] INFO Cleaned up Redis state for job 88

Final: Job.status=SUCCESS (16/544 = 2.9% < FAILURE_THRESHOLD=0.5). Without this PR, check_stale_jobs would have REVOKED the job and lost 528 successful results.

What the validation confirms:

  • Reconciliation path — candidate selection, select_for_update re-validation, Redis SREM/SADD, both-stages progress update, completion-state computation, and cleanup chaining all behave as designed.
  • Diagnostic format — the truncation cap from the Copilot triage is working: 10-id preview + "and 6 more" lands in progress.errors (UI-bounded), full 16-id list goes to job.logger.warning (recoverable from logs panel).
  • num_redelivered=16 matches the batch size exactly. Strong signal that the underlying bug is deterministic — not random NATS flakiness — and that one full ADC-worker pull is the unit being lost. Filing as a separate issue: the reconciler is the right safety net, but root cause likely lives in ADC's pull-loop termination condition. Investigation in progress.
  • Empirical wrinkle vs local chaos: this deployment showed num_ack_pending=0 at reconciler time, while Scenario F in chaos-scenarios.md saw it stay at N. Doesn't affect the diagnosis (Redis pending sets are the source of truth), but the per-deployment difference is worth a note in the chaos doc once we understand why.

Ready for review.

@mihow mihow merged commit 2714195 into main Apr 17, 2026
7 checks passed
@mihow mihow deleted the mark-lost-images-failed branch April 17, 2026 08:40