Commit 1c6be7a
feat(jobs): make NATS queue activity visible in async ML job logs (#1222)
* feat(nats): surface queue lifecycle events on the per-job logger
TaskQueueManager now accepts an optional job_logger. When set, lifecycle
events (stream/consumer create+reuse with state/config snapshot, publish
failures, cleanup deletions, and a forensic consumer-stats line before
deletion) are mirrored to the per-job logger in addition to the module
logger. The UI job log now reflects what the NATS layer is actually
doing for that specific job instead of a silent gap.
Per-message and per-poll paths (publish_task success, reserve_tasks,
acknowledge_task) intentionally stay on the module logger only — a
10k-image job would otherwise drown its own log. Lifecycle log lines are
deduped per manager session so a loop over N images still only emits a
single "Created NATS stream" line per job.
cleanup_async_job_resources and queue_images_to_nats pass job.logger
through to TaskQueueManager so real async_api jobs pick up the new
logging without further caller changes.
Closes #1220
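A minimal sketch of the shape this introduces (names beyond TaskQueueManager
and job_logger are illustrative; the next commit converts the mirror call to
async):

    import logging

    logger = logging.getLogger(__name__)  # module logger (ops/stdout visibility)

    class TaskQueueManager:
        def __init__(self, job_logger: logging.Logger | None = None):
            self.job_logger = job_logger
            # job_ids whose lifecycle lines already fired this manager session
            self._logged_lifecycle: set[str] = set()

        def _log_lifecycle(self, job_id: str, msg: str) -> None:
            if job_id in self._logged_lifecycle:
                return  # dedup: one "Created NATS stream" line per job
            self._logged_lifecycle.add(job_id)
            logger.info(msg)  # module logger always sees lifecycle events
            if self.job_logger is not None:
                self.job_logger.info(msg)  # mirror into the per-job UI log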
* fix(nats): bridge job-logger mirroring through sync_to_async
TaskQueueManager._log() was calling job_logger.log() synchronously from
inside async methods (_ensure_stream, _ensure_consumer, cleanup). That
triggers JobLogHandler.emit() which does a Django ORM refresh_from_db +
save — forbidden from an event loop. Every lifecycle line was silently
dropped with "Failed to save logs for job #N: You cannot call this from
an async context", defeating the point of the original change.
Convert _log to async and await sync_to_async(job_logger.log)(...) so
the ORM work runs in a thread. Update all call sites to await. Apply
the same fix to the publish-failure path in queue_images_to_nats.
Verified on ami-demo with job #74: lifecycle lines fired on the module
logger, but "Failed to save logs" errors swallowed the job-logger mirror.
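The fix, sketched as the converted method on the class above (signature is
illustrative):

    from asgiref.sync import sync_to_async

    async def _log(self, level: int, msg: str) -> None:
        logger.log(level, msg)  # plain stdlib logging is event-loop safe
        if self.job_logger is not None:
            # JobLogHandler.emit() does refresh_from_db + save; running the
            # call in a worker thread keeps the ORM off the event loop.
            await sync_to_async(self.job_logger.log)(level, msg)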
* fix(nats): narrow consumer_info exception + stop forwarding module logger to TaskQueueManager in cleanup
Addresses CodeRabbit review on #1222.
1. _ensure_consumer caught broad Exception when consumer_info() failed,
masking auth/API/transient JetStream errors as "consumer missing" and
emitting misleading creation logs. Narrowed to NotFoundError to match
the pattern already used in _ensure_stream (line 208).
2. cleanup_async_job_resources forwarded its `_logger` argument into
TaskQueueManager as `job_logger`. One caller (_fail_job on the
Job.DoesNotExist path in ami/jobs/tasks.py:198) passes a plain module
logger, which would then have cleanup lifecycle lines mirrored into an
unrelated logger via sync_to_async. Added a separate `job_logger`
parameter, defaulted to None, and updated the two callers that have
real job context (_fail_job happy path, cleanup_async_job_if_needed)
to pass `job.logger` explicitly. The DoesNotExist path leaves
job_logger=None, so TaskQueueManager falls through to the module
logger only.
Tests: 18/18 in test_nats_queue.py pass, pre-commit clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
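Roughly, as a free-function sketch of the narrowed handler (names and
signature are illustrative; js is a JetStream context):

    from nats.js.errors import NotFoundError

    async def ensure_consumer(js, stream_name: str, durable: str) -> None:
        try:
            await js.consumer_info(stream_name, durable)
            return  # consumer exists; reuse it
        except NotFoundError:
            # Only "consumer missing" falls through to creation; auth, API,
            # and transient JetStream errors now propagate instead of being
            # masked and logged as a fresh creation.
            pass
        await js.add_consumer(stream=stream_name, durable_name=durable)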
* refactor(jobs): simplify cleanup_async_job_resources to match codebase job_logger pattern
Drop the separate `_logger` parameter that was redundant with `job_logger`.
Follow the existing convention (e.g. save_results in ami/jobs/tasks.py):
`_log = job_logger or logger` — use per-job logger when available, module
logger otherwise.
Callers now read cleanly:
cleanup_async_job_resources(job.pk, job_logger=job.logger) # has job
cleanup_async_job_resources(job_id) # no job
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(nats): read consumer config from ConsumerInfo instead of hardcoding in creation log
The "Created NATS consumer" log line was hardcoding max_deliver=5 and
interpolating TASK_TTR for ack_wait. Now reads from the ConsumerInfo
returned by add_consumer(), so the log always reflects what the server
accepted. Added _format_consumer_config() alongside the existing
_format_consumer_stats() for the two different log contexts (creation
vs runtime stats).
Updated test mocks to return ConsumerInfo-like objects with a config
sub-object from add_consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
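Sketch of the helper (the exact fields formatted are illustrative):

    from nats.js.api import ConsumerInfo

    def _format_consumer_config(info: ConsumerInfo) -> str:
        cfg = info.config  # config echoed back by the server, not our request
        return f"ack_wait={cfg.ack_wait}s, max_deliver={cfg.max_deliver}"

    # usage in the creation path:
    #     info = await js.add_consumer(stream=stream_name, config=config)
    #     ... f"Created NATS consumer ({_format_consumer_config(info)})" ...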
* refactor(nats): skip redundant NATS calls after first ensure, fix enum rendering in config log
Three cleanup items found on self-review:
1. _ensure_stream and _ensure_consumer were calling stream_info/consumer_info
on every publish_task (once per image). Since the stream and consumer are
never deleted mid-flight (cleanup uses a separate manager session), added
an early return when job_id is already in the logged set. Saves 2 NATS
round-trips per image after the first.
2. _format_consumer_config was rendering enum fields as "DeliverPolicy.ALL"
instead of "all" — added a _val() helper that unwraps .value when present.
3. Removed now-redundant inner dedup checks (the early return makes them
always-true).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
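The helper from item 2 is small enough to quote in full (placement is
illustrative):

    def _val(field):
        # Enum members render as "DeliverPolicy.ALL" in f-strings; unwrap
        # .value ("all") when present, pass everything else through as-is.
        return getattr(field, "value", field)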
* fix(nats): preserve traceback on publish failure + defensive stream state access
Addresses two CodeRabbit findings:
1. queue_images_to_nats publish exception path only wrote a stringified
error to the job log. Added logger.exception() on the module logger so
ops dashboards keep the full traceback. The job.logger.error bridge
still writes the user-facing message.
2. _ensure_stream accessed info.state.messages / info.state.last_seq
without defending against None, unlike _format_consumer_stats which
already does. Match the defensive pattern so the reuse line doesn't
blow up if the server ever returns a StreamInfo with no state.
Also pushing back on the "ack_wait nanoseconds" finding in the review
thread — nats-py does convert it to seconds in from_response via
_convert_nanoseconds (source: val / _NANOSECOND where _NANOSECOND = 1e9).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
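A sketch of the defensive pattern for item 2 (helper name is hypothetical):

    from nats.js.api import StreamInfo

    def _describe_stream_state(info: StreamInfo) -> str:
        state = getattr(info, "state", None)
        if state is None:
            return "state unavailable"  # don't blow up the reuse log line
        return f"messages={state.messages}, last_seq={state.last_seq}"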
* refactor(jobs): resolve job logger internally in cleanup_async_job_resources
Match the pattern used by save_results in ami/jobs/tasks.py: take only
job_id, resolve job.logger internally, fall back to the module logger
when the Job row is missing. Keeps call sites consistent across the
codebase — cleanup_async_job_resources(job.pk) everywhere — and makes
the job object available inside the function for future use.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(jobs): rename _log to job_logger to avoid name collision with TaskQueueManager._log
Two _log identifiers in the same module had incompatible semantics:
- nats_queue.TaskQueueManager._log(level, msg) — async coroutine that
fans out to module + job loggers with sync_to_async bridging.
- jobs.cleanup_async_job_resources local _log = job_logger or logger —
a plain Logger instance called as _log.info(...) / .error(...).
Rename the local to job_logger and assign directly from
`job.logger if job else logger`, matching the save_results pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
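The resulting shape, sketched (the Job import path is an assumption):

    import logging

    logger = logging.getLogger(__name__)

    def cleanup_async_job_resources(job_id: int) -> None:
        from ami.jobs.models import Job  # illustrative import path

        try:
            job = Job.objects.get(pk=job_id)
        except Job.DoesNotExist:
            job = None
        # save_results pattern: per-job logger when the row exists,
        # module logger otherwise.
        job_logger = job.logger if job else logger
        job_logger.info(f"Cleaning up NATS resources for job #{job_id}")
        # ... tear down via TaskQueueManager(job_logger=job.logger if job else None) ...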
* chore(nats): tidy up minor code smells from self-review
- Drop redundant ternary: cfg = info.config, not `info.config if info.config is not None else None`
- Merge two-line f-string in _log_final_consumer_stats
- Drop redundant `except asyncio.TimeoutError: raise` in _ensure_consumer
(now that the catch is narrowed to NotFoundError, TimeoutError propagates
naturally — matches _ensure_stream style)
- Explicit comment on the intentionally-broad `except Exception` in
_log_final_consumer_stats clarifying it's different from _ensure_consumer
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(nats): unify log fan-out behind manager.log_async
Rename TaskQueueManager._log → log_async and route every log line in the
queue_images_to_nats async block through it (debug, error + traceback).
Drops the ad-hoc sync_to_async(job.logger.error) bridge and the separate
logger.exception call — one consistent API, one place that knows how to
bridge JobLogHandler's ORM save through sync_to_async.
log_async also now accepts exc_info=True so callers don't need to pair a
module-only logger.exception with a job-logger error call.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
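Sketched, continuing the earlier _log sketch under the rename:

    async def log_async(self, level: int, msg: str, exc_info: bool = False) -> None:
        logger.log(level, msg, exc_info=exc_info)  # module logger keeps tracebacks
        if self.job_logger is not None:
            await sync_to_async(self.job_logger.log)(level, msg, exc_info=exc_info)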
* docs(nats): note future intent to route, not mirror, lifecycle logs
log_async currently mirrors granular per-job lifecycle to both the module
and job loggers. Document why this is intentional for now (async ML
processing still stabilizing, stdout visibility helping us debug) and
the eventual target shape (route to job logger only at INFO/DEBUG, mirror
at WARNING+). Matches the pattern in ami.jobs.tasks.save_results, where
job.logger.propagate=False keeps granular per-job state out of ops logs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(nats): correct safety claim for _ensure_stream/_ensure_consumer dedup
The original docstring claimed the stream/consumer won't be deleted
mid-flight because cleanup uses a separate manager session. That's
incomplete — Job.cancel() runs cleanup_async_job_resources in the
request thread while queue_images_to_nats is still in its publish loop
in the Celery worker. So a concurrent delete across manager sessions
is possible.
The early-return is still safe in that scenario, but for a different
reason than the original claim: downstream publish_task fails loudly
(returns False, logs ERROR) when the stream is gone, rather than
silently recreating an orphan stream without a consumer (which is what
the non-deduped baseline would do).
Updates _ensure_stream and _ensure_consumer docstrings to describe the
actual safety argument accurately.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf(nats): gate log_async on isEnabledFor before sync_to_async mirror
Without this gate, every log_async call fires the job-logger mirror
through sync_to_async — a ThreadPoolExecutor submit per call —
regardless of whether the effective level would drop the record. For a
10k-image queue this amounts to 10k unnecessary thread-pool submissions
when DEBUG is off.
stdlib Logger.log does the same isEnabledFor check internally before
formatting. We need to do it explicitly here because the mirror goes
through sync_to_async, bypassing the in-logger short-circuit.
No behavior change when at least one logger is enabled for the level;
pure short-circuit when both are gated out.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
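With the gate, the sketch above becomes:

    async def log_async(self, level: int, msg: str, exc_info: bool = False) -> None:
        if logger.isEnabledFor(level):
            logger.log(level, msg, exc_info=exc_info)
        if self.job_logger is not None and self.job_logger.isEnabledFor(level):
            # Check before the hop: Logger.log short-circuits internally,
            # but only after sync_to_async has already submitted the call
            # to a worker thread.
            await sync_to_async(self.job_logger.log)(level, msg, exc_info=exc_info)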
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 files changed: 568 additions & 58 deletions