
feat(jobs): make missing-Redis-state FAILUREs loud and self-diagnosing#1241

Open
mihow wants to merge 2 commits into main from fix/1241-redis-state-loud-diagnostics

Conversation


@mihow mihow commented Apr 16, 2026

Summary

Complements #1244 — that PR handles pending_ids left in Redis after NATS gives up redelivering; this PR handles the case where the state keys are missing entirely. The two paths are independent and coexist without conflict.

When process_nats_pipeline_result finds the job's total-images key missing from Redis, the FAILURE it produces is close to silent. The reason string is hardcoded — "Job state keys not found in Redis (likely cleaned up concurrently)" — and ends up only in job.logger, not in progress.errors. The Redis target is never logged. Three distinct causes collapse into one misleading line:

  • DB-index drift across hosts (one worker talks to Redis DB 0, another to DB 1 — run_job initializes state in one DB, process_nats_pipeline_result looks for it in the other)
  • Key eviction under memory pressure
  • Never-initialized state (legitimate concurrent-cleanup race)

This PR keeps the failure path but makes it name the actual cause.

What changes

  1. AsyncJobStateManager.diagnose_missing_state() — returns a one-line snapshot: redis=host:port/dbN keys_for_job=<SCAN listing with SCARDs>. keys_for_job=<none> ⇒ wrong DB or never-initialized; SCARDs present ⇒ partial cleanup / eviction. The SCAN runs only on the failure path and the per-job key fanout is at most four keys (pending:process, pending:results, failed, total), so the cost is negligible.

  2. update_state() — emits a WARN with the snapshot immediately before returning None. The trigger is now visible in the worker log even if the caller's FAILURE log is filtered.

  3. process_nats_pipeline_result — passes the live snapshot to _fail_job instead of the hardcoded string. Also annotates the stage (process / results).

  4. _fail_job — appends reason to job.progress.errors before save. The UI now shows FAILUREs with a cause instead of errors=[].

  5. run_job — logs Running job X on redis=HOST:PORT/dbN at start. Cross-host DB drift becomes visible in every job's log without needing to already suspect it.
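To make item 1 concrete, here is a hedged sketch of what `diagnose_missing_state()` could look like, based only on the behavior described above. The `client`, `host`, `port`, and `db` parameters are passed explicitly for illustration; the real method presumably reads them from the manager's own Redis connection, and `_FakeRedis` is a minimal stand-in for the two redis-py calls used.

```python
def diagnose_missing_state(client, job_id: int, host: str, port: int, db: int) -> str:
    """One-line snapshot: redis=host:port/dbN keys_for_job=<SCAN listing with SCARDs>."""
    keys = sorted(
        k.decode() if isinstance(k, bytes) else k
        for k in client.scan_iter(match=f"job:{job_id}:*")  # server-side MATCH filter
    )
    if not keys:
        listing = "<none>"  # wrong DB index, or state never initialized
    else:
        parts = []
        for key in keys:
            try:
                parts.append(f"{key}(scard={client.scard(key)})")
            except Exception:
                parts.append(key)  # non-set key, e.g. the total counter
        listing = " ".join(parts)
    return f"redis={host}:{port}/db{db} keys_for_job={listing}"


class _FakeRedis:
    """Tiny in-memory stand-in for the redis-py calls used above."""

    def __init__(self, sets):
        self._sets = sets  # key -> set of members

    def scan_iter(self, match):
        prefix = match.rstrip("*")
        return (k for k in self._sets if k.startswith(prefix))

    def scard(self, key):
        return len(self._sets[key])


snapshot = diagnose_missing_state(
    _FakeRedis({"job:42:pending:process": {"a", "b"}}),
    job_id=42, host="redis", port=6379, db=1,
)
empty = diagnose_missing_state(_FakeRedis({}), job_id=42, host="redis", port=6379, db=0)
```

With a surviving pending set, `snapshot` reads `redis=redis:6379/db1 keys_for_job=job:42:pending:process(scard=2)`; with no keys at all, `empty` ends in `keys_for_job=<none>`, matching the two interpretations in item 1.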

Why

This is motivated by a production incident in which three consecutive async_api jobs flipped straight to FAILURE at ~46% with zero errors recorded and no log line explaining why. The per-job log showed STARTED → FAILURE with nothing in between. The only clue was the hardcoded string, which pattern-matched an unrelated recent change and sent triage down the wrong path. The real cause turned out to be a worker host whose REDIS_URL pointed at a different DB index than the rest of the stack. Any of the five items above would have caught it on the first failed job.

The fix for the underlying misconfig is an infra change, not in this PR. This PR is about making the same class of failure instantly debuggable next time.

Tests

Added:

  • TestTaskStateManager.test_diagnose_missing_state_when_never_initialized — asserts redis=, /db, and keys_for_job=<none> in the diagnostic string.
  • TestTaskStateManager.test_diagnose_missing_state_lists_present_keys — after wiping only the total key, asserts the surviving pending-set SCARDs appear.
  • TestFailJob.test_fail_job_appends_reason_to_progress_errors — verifies the reason string lands in job.progress.errors both in memory and after refresh_from_db(). The existing _fail_job call-site tests in TestProcessNatsPipelineResultError mock _fail_job entirely, so a silent regression on the append would not have been caught. Validated via TDD: temporarily skipped the append in production code, ran the test and watched it fail with the empty-list assertion, then restored.
  • TestFailJob.test_fail_job_is_noop_on_already_final_job — regression guard for the early-return branch.

Updated:

  • TestProcessNatsPipelineResultError.test_genuinely_missing_state_acks_and_fails_job — reason-string assertion updated to the new "Job state missing from Redis (stage=process): …" form.
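The `_fail_job` semantics these tests pin down can be illustrated with a framework-free stand-in. `FakeJob`, `fail_job`, and `FINAL_STATES` below are simplified illustrations, not the project's actual Django models or API; the real code writes to `job.progress.errors` and persists via the ORM.

```python
# Simplified stand-ins for Job / _fail_job, per the PR description.
FINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


class FakeJob:
    def __init__(self):
        self.status = "STARTED"
        self.errors: list[str] = []
        self.saved = False

    def save(self):
        self.saved = True


def fail_job(job: FakeJob, reason: str) -> None:
    if job.status in FINAL_STATES:
        return  # early-return branch the no-op regression test guards
    try:
        job.errors.append(reason)  # surface the cause in the UI instead of errors=[]
    except Exception:
        pass  # never let a diagnostic write mask the FAILURE itself
    job.status = "FAILURE"
    job.save()


job = FakeJob()
fail_job(job, "Job state missing from Redis (stage=process): ...")

done = FakeJob()
done.status = "SUCCESS"
fail_job(done, "ignored")  # already final: nothing is appended or saved
```

After the first call, `job.errors` contains the reason and the job is saved as FAILURE; the second call leaves the already-final job untouched.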

Verified locally

  • docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs --keepdb: 78 tests pass, including the new TestFailJob cases.
  • docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs.tests.test_tasks.TestFailJob ami.jobs.tests.test_tasks.TestProcessNatsPipelineResultError ami.ml.tests.TestTaskStateManager --keepdb: 24 tests pass (the targeted slices touched by this PR).
  • TDD confirmation of the progress.errors append: with job.progress.errors.append(reason) commented out, test_fail_job_appends_reason_to_progress_errors fails with AssertionError: '…' not found in []. With the line restored, it passes. The test exercises the real code path.
  • Lint/format: pre-commit hooks (black, isort, flake8, pyupgrade, django-upgrade) pass on all touched files.

Manual verification to run in staging

  • In a dev env, start an async_api job, then redis-cli -n <db> DEL job:{id}:pending_images_total while it's running. Confirm the resulting FAILURE shows the diagnostic in both the UI's progress.errors and the worker log, and that run_job's opening log line identifies the Redis target.

Relationship to concurrent / recently-merged work


netlify bot commented Apr 16, 2026

Deploy Preview for antenna-ssec canceled.

🔨 Latest commit: 111d5d6
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-ssec/deploys/69e1db2bbef2ed000968e963


coderabbitai bot commented Apr 16, 2026

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 6 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 678b5979-4f88-40b3-94c8-d082dbbf8e7c

📥 Commits

Reviewing files that changed from the base of the PR and between 201cfa7 and 111d5d6.

📒 Files selected for processing (4)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_tasks.py
  • ami/ml/orchestration/async_job_state.py
  • ami/ml/tests.py


netlify bot commented Apr 16, 2026

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: 111d5d6
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/69e1db2b2ab6a500098a8ddb

mihow and others added 2 commits April 16, 2026 23:55
Before: when process_nats_pipeline_result found the job's total-images
key missing, it failed the job with the hardcoded reason
"Job state keys not found in Redis (likely cleaned up concurrently)".
The reason string went to job.logger only — not progress.errors — and
collapsed three distinct causes (DB-index drift across hosts, key
eviction, never-initialized state) into one misleading line. The Redis
target (host:port/db) was never logged, so operators couldn't tell a
cache-DB split from a cleanup race without shelling into workers.

This commit makes that path name the actual cause:

1. AsyncJobStateManager.diagnose_missing_state() returns a one-line
   snapshot: masked host:port, DB index, and SCAN output for job:{id}:*
   with SCARDs. "keys_for_job=<none>" ⇒ never initialized or wrong DB;
   SCARDs present ⇒ partial cleanup / eviction.

2. update_state() emits a WARN with that snapshot immediately before
   returning None, so the trigger shows up in the worker log even if
   the caller's FAILURE log is filtered out.

3. process_nats_pipeline_result passes the live snapshot to _fail_job
   instead of the hardcoded string.

4. _fail_job appends the reason to job.progress.errors before save, so
   the UI surfaces FAILUREs with a cause instead of errors=[].

5. run_job logs "Running job X on redis=HOST:PORT/dbN" at start. Cross-
   host DB drift becomes visible in every job's log without needing to
   already suspect it.

Tests added:
- diagnose_missing_state with never-initialized state (keys_for_job=<none>)
- diagnose_missing_state after partial cleanup (SCARDs of surviving sets
  listed)
- Existing "genuinely missing state" test assertion updated to match the
  richer reason string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add TestFailJob with two TDD-confirmed cases:

- ``test_fail_job_appends_reason_to_progress_errors`` — verifies the
  reason string lands in ``job.progress.errors`` (both in-memory and
  after refresh_from_db) so UI surfaces the cause of the FAILURE. The
  existing ``_fail_job`` call-site tests in
  ``TestProcessNatsPipelineResultError`` mock ``_fail_job`` entirely, so
  a regression that stops appending to ``progress.errors`` would slip
  through undetected.
- ``test_fail_job_is_noop_on_already_final_job`` — regression guard for
  the early-return branch when the job is already in a final state.

Also:
- Comment the bare ``except Exception: pass`` around the
  ``progress.errors.append`` in ``_fail_job`` to explain why we swallow
  diagnostic-write failures.
- Extend the ``AsyncJobStateManager.diagnose_missing_state`` docstring
  with a one-paragraph note about the SCAN cost (failure-path only,
  per-job fanout of at most four keys) to head off the obvious review
  question.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the fix/1241-redis-state-loud-diagnostics branch from 6730cfb to 111d5d6 on April 17, 2026 at 07:03
@mihow mihow marked this pull request as ready for review April 17, 2026 07:03
Copilot AI review requested due to automatic review settings April 17, 2026 07:03
Contributor

Copilot AI left a comment


Pull request overview

Improves debuggability of async ML job failures when Redis-backed job state is missing by adding Redis-target diagnostics, propagating the failure reason into Job.progress.errors, and updating tests to lock in the new behavior.

Changes:

  • Add AsyncJobStateManager.diagnose_missing_state() and log a diagnostic snapshot on the missing-state path.
  • Include stage-specific missing-state diagnostics in _fail_job reasons and persist those reasons into progress.errors for UI visibility.
  • Add/adjust unit tests covering diagnostics output and _fail_job persistence semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • ami/ml/orchestration/async_job_state.py: adds the missing-state diagnostic helper and a warning log in update_state.
  • ami/jobs/tasks.py: logs the Redis target at job start; passes diagnostics into _fail_job; appends the reason into progress.errors.
  • ami/ml/tests.py: adds tests for diagnose_missing_state() output in "never initialized" and "partial keys present" cases.
  • ami/jobs/tests/test_tasks.py: updates missing-state reason assertions; adds regression tests for _fail_job writing progress.errors.


Comment on lines +202 to +205
``_fail_job`` so the FAILURE log and the UI ``progress.errors`` entry
distinguish the three common causes — DB mismatch across hosts, key
eviction, and never-initialized state — instead of a single hardcoded
"likely cleaned up concurrently" guess that all three collapse to.

Copilot AI Apr 17, 2026


The diagnose_missing_state() docstring says it's called from _fail_job, but _fail_job doesn't invoke it (only update_state and process_nats_pipeline_result do). Please update the docstring to match actual call sites so future readers don’t look for diagnostics propagation in _fail_job.

Suggested change
```diff
-``_fail_job`` so the FAILURE log and the UI ``progress.errors`` entry
-distinguish the three common causes — DB mismatch across hosts, key
-eviction, and never-initialized state — instead of a single hardcoded
-"likely cleaned up concurrently" guess that all three collapse to.
+``process_nats_pipeline_result`` so the FAILURE log and the UI
+``progress.errors`` entry distinguish the three common causes — DB
+mismatch across hosts, key eviction, and never-initialized state —
+instead of a single hardcoded "likely cleaned up concurrently" guess
+that all three collapse to.
```

Comment thread ami/jobs/tasks.py
Comment on lines 186 to +190
```diff
 _ack_task_via_nats(reply_subject, logger)
-_fail_job(job_id, "Job state keys not found in Redis (likely cleaned up concurrently)")
+_fail_job(
+    job_id,
+    f"Job state missing from Redis (stage=process): {state_manager.diagnose_missing_state()}",
+)
```

Copilot AI Apr 17, 2026


In the missing-state path, update_state() already logs a warning that calls diagnose_missing_state(), and this call site invokes diagnose_missing_state() again to build the failure reason. That duplicates the Redis diagnostic work on the same event. Consider capturing the diagnosis once (e.g., have update_state return it alongside None/raise a dedicated exception carrying it) and reuse it for both logging and _fail_job/UI errors.

Comment thread ami/jobs/tasks.py
Comment on lines +324 to +337
```python
def _describe_redis_target() -> str:
    """Return a ``host:port/dbN`` string for the "default" Redis connection.

    Logged at the start of every ``run_job`` so DB-index drift across hosts
    (the class of misconfig that manifests as silent ``process_nats_pipeline_result``
    FAILUREs on whichever worker happens to read state from the wrong DB) is
    visible in each job's log without requiring a separate diagnostic.
    """
    try:
        from django_redis import get_redis_connection

        redis = get_redis_connection("default")
        kwargs = getattr(redis.connection_pool, "connection_kwargs", {}) or {}
        return f"redis={kwargs.get('host', '?')}:{kwargs.get('port', '?')}/db{kwargs.get('db', '?')}"
```

Copilot AI Apr 17, 2026


_describe_redis_target() claims it returns a host:port/dbN string, but it currently returns redis=host:port/dbN (and the exception case returns redis=(unavailable: ...)). Align the docstring with the actual format, or adjust the return value to match the documented output.

Comment on lines +222 to +227
```python
        # Cursor-safe SCAN over the job's keyspace — cheap even on a busy
        # Redis because it's filtered server-side and the per-job fanout is
        # at most a handful of keys (pending:process, pending:results,
        # failed, total).
        keys = sorted(k.decode() if isinstance(k, bytes) else k for k in redis.scan_iter(match=self._pattern()))
        sizes: list[str] = []
```

Copilot AI Apr 17, 2026


diagnose_missing_state() uses scan_iter(match=job:{id}:*), which iterates the full Redis keyspace (SCAN walks the entire DB even when MATCH filters server-side). On a large/loaded Redis (or if this failure repeats due to misconfig), this can add significant latency and load. Consider checking the small, fixed set of expected keys directly (pending:process, pending:results, failed, total) via exists/scard/get, instead of scanning.
