feat: status-aware task Submit with Run counter and livez endpoint (Alpha P3) #61

Merged
bdchatham merged 2 commits into main from feat/status-aware-submit-and-livez
Apr 3, 2026

@bdchatham
Contributor

Summary

Cloud-API model for the sidecar task engine: the controller submits stable keys, and the sidecar owns the execution lifecycle.

  • Status-aware Submit: failed tasks are transparently re-executed on re-submit; running/completed are idempotent no-ops
  • Run counter on TaskResult: tracks execution count under the same stable ID. Increments on failed→re-execute, NOT on crash-recovery rehydration
  • Concurrency safety: sync.Mutex + inFlight map prevents double-execution of the same failed task ID
  • /v0/livez endpoint: SQLite liveness check via Ping() — distinct from /v0/healthz (readiness). Use as Kubernetes liveness probe.
  • SQLite migration v3: adds run column

Why

Deterministic task IDs combined with PVC-persisted SQLite left failed tasks permanently stuck after a pod restart: the engine's dedup check returned the cached failure without re-executing. This is the sidecar half of the Alpha P3 task reliability initiative. The controller half (plan IDs, simplified retry, failure diagnostics) follows in a separate PR on sei-node-controller-networking.

Test plan

  • TestSubmitReExecutesFailedTask — failed task re-executes, Run increments to 2
  • TestSubmitReExecutesFailedTaskThatFailsAgain — persistent failure increments Run
  • TestSubmitDoesNotIncrementRunOnRehydration — crash recovery preserves Run=1
  • TestSubmitConcurrentSameFailedID — mutex prevents double-execution
  • TestSubmitRunFieldOnFirstSubmit — new tasks start at Run=1
  • TestLivezReturns200WhenStoreHealthy / TestLivezReturns200BeforeReady
  • All 40+ existing engine, server, and store tests pass
  • go vet clean

🤖 Generated with Claude Code

Cloud-API model for task lifecycle: the controller submits stable keys
and the sidecar owns execution lifecycle. Failed tasks are transparently
re-executed on re-submit; running and completed tasks are idempotent
no-ops.

Engine changes:
- Submit branches on existing task status: failed → increment Run,
  reset to running, re-execute. Running/completed → return existing ID.
- sync.Mutex + inFlight map prevents concurrent double-execution of
  the same failed task ID.
- Run counter on TaskResult tracks how many times a task has been
  executed under the same ID. Starts at 1, increments only on
  failed→re-execute (NOT on stale-task rehydration).
- SQLite migration v3 adds the run column.

Observability:
- /v0/livez endpoint checks SQLite responsiveness via Ping(). Use as
  a Kubernetes liveness probe (distinct from /v0/healthz readiness).
- Run counter included in submit/complete/fail log lines.

Tests: 6 new tests covering re-execution, rehydration stability,
concurrent dedup, and Run field correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses Tide review and PR feedback:
- Remove inFlight map — the mutex alone serializes Submit; the
  microsecond race between execute-return and store.Save is handled
  by the next controller poll seeing "running" then "failed".
- Migration default 0 → 1 (pre-existing tasks completed their first run)
- Fix concurrent test: use blocking handler to actually test mutex
- Add livez failure path test (503 when store is closed)
- Add store Ping and Run round-trip tests
- Log stale task Save errors instead of silently discarding
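
The mutex-only approach and the blocking-handler test idea can be illustrated with the sketch below. It assumes a simplified, hypothetical `engine` that only handles the failed-task path and runs the handler in a goroutine; the names `task`, `engine`, and `runConcurrentSubmits` do not come from the repository. The handler blocks on a channel so the task stays in "running" while the other submits arrive.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// task and engine are hypothetical, stripped-down stand-ins.
type task struct {
	status string
	run    int
}

type engine struct {
	mu      sync.Mutex
	tasks   map[string]*task
	handler func()
	done    sync.WaitGroup // tracks in-progress executions
}

// Submit re-executes a failed task. The mutex serializes the status check
// and the transition to "running", so only one caller starts the handler.
// Unknown, running, and completed tasks are ignored in this sketch.
func (e *engine) Submit(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()

	t := e.tasks[id]
	if t == nil || t.status != "failed" {
		return
	}
	t.run++
	t.status = "running"
	e.done.Add(1)
	go func() {
		defer e.done.Done()
		e.handler()
		e.mu.Lock()
		t.status = "completed"
		e.mu.Unlock()
	}()
}

// runConcurrentSubmits submits the same failed ID from n goroutines and
// reports how often the handler ran and the final Run value.
func runConcurrentSubmits(n int) (int64, int) {
	var executions int64
	release := make(chan struct{})

	e := &engine{tasks: map[string]*task{"t1": {status: "failed", run: 1}}}
	e.handler = func() {
		atomic.AddInt64(&executions, 1)
		<-release // block so the task stays "running" during the test
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.Submit("t1")
		}()
	}
	wg.Wait()      // every Submit call has returned
	close(release) // let the single execution finish
	e.done.Wait()

	e.mu.Lock()
	defer e.mu.Unlock()
	return atomic.LoadInt64(&executions), e.tasks["t1"].run
}

func main() {
	executions, run := runConcurrentSubmits(16)
	fmt.Printf("executions=%d run=%d\n", executions, run)
}
```

Without the blocking handler, a fast execution could finish before the remaining goroutines call Submit, so the test would pass or fail for reasons unrelated to the mutex.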

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit a595641 into main Apr 3, 2026
2 checks passed
@bdchatham bdchatham deleted the feat/status-aware-submit-and-livez branch April 3, 2026 19:24