Cloud-API model for task lifecycle: the controller submits stable keys and the sidecar owns execution lifecycle. Failed tasks are transparently re-executed on re-submit; running and completed tasks are idempotent no-ops.

Engine changes:
- Submit branches on existing task status: failed → increment Run, reset to running, re-execute. Running/completed → return existing ID.
- sync.Mutex + inFlight map prevents concurrent double-execution of the same failed task ID.
- Run counter on TaskResult tracks how many times a task has been executed under the same ID. Starts at 1, increments only on failed→re-execute (NOT on stale-task rehydration).
- SQLite migration v3 adds the run column.

Observability:
- /v0/livez endpoint checks SQLite responsiveness via Ping(). Use as a Kubernetes liveness probe (distinct from /v0/healthz readiness).
- Run counter included in submit/complete/fail log lines.

Tests: 6 new tests covering re-execution, rehydration stability, concurrent dedup, and Run field correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
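To illustrate the /v0/livez check described above, here is a minimal sketch of a liveness handler that pings a store and answers 200 or 503. The `Pinger` interface, `livezHandler`, `fakeStore`, `probe`, and the one-second timeout are hypothetical names and values chosen for this example, not the PR's actual code; a real store could implement `Ping` by delegating to `(*sql.DB).PingContext`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Pinger is a hypothetical interface for the store's liveness check.
type Pinger interface {
	Ping(ctx context.Context) error
}

// errStoreClosed stands in for whatever error a closed store returns.
var errStoreClosed = errors.New("store closed")

// livezHandler returns 200 when the store answers Ping within the timeout
// and 503 otherwise.
func livezHandler(p Pinger, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := p.Ping(ctx); err != nil {
			http.Error(w, "store unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// fakeStore is a test double whose Ping returns a fixed error (or nil).
type fakeStore struct{ err error }

func (f fakeStore) Ping(context.Context) error { return f.err }

// probe sends one GET to the handler and returns the status code.
func probe(p Pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/v0/livez", nil)
	livezHandler(p, time.Second).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(fakeStore{}))                    // healthy store
	fmt.Println(probe(fakeStore{err: errStoreClosed})) // closed store
}
```

Keeping the probe behind a small interface lets the failure path be tested without a real database, which is one plausible way to cover the 503 case.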
bdchatham commented on Apr 3, 2026
Addresses Tide review and PR feedback:
- Remove inFlight map: the mutex alone serializes Submit; the microsecond race between execute-return and store.Save is handled by the next controller poll seeing "running" then "failed".
- Migration default 0 → 1 (pre-existing tasks completed their first run)
- Fix concurrent test: use blocking handler to actually test mutex
- Add livez failure path test (503 when store is closed)
- Add store Ping and Run round-trip tests
- Log stale task Save errors instead of silently discarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
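The mutex-only approach and the blocking-handler test idea can be sketched as below. This is an illustration under assumptions, not the engine's code: the `engine` and `task` types, `submit`, and `runDemo` are hypothetical, an in-memory map replaces the SQLite store, and the sketch assumes the mutex covers only the status check and the transition to running while execution happens in a goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// task is a hypothetical in-memory record; the real engine persists to SQLite.
type task struct {
	status string // "running", "completed", or "failed"
	run    int
}

type engine struct {
	mu    sync.Mutex
	tasks map[string]*task
	wg    sync.WaitGroup
	exec  func(id string) error
}

// submit holds the mutex for the status check and the move to "running",
// so only one caller can claim a failed task; later callers see "running".
func (e *engine) submit(id string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	t, ok := e.tasks[id]
	if ok && t.status != "failed" {
		return // running or completed: idempotent no-op
	}
	if !ok {
		t = &task{}
		e.tasks[id] = t
	}
	t.run++
	t.status = "running"
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		err := e.exec(id)
		e.mu.Lock()
		defer e.mu.Unlock()
		if err != nil {
			t.status = "failed"
		} else {
			t.status = "completed"
		}
	}()
}

// runDemo submits the same failed ID from 8 goroutines while the handler
// blocks, then reports how many executions happened and the final run count.
func runDemo() (int64, int) {
	release := make(chan struct{})
	var executions int64
	e := &engine{tasks: map[string]*task{"task-a": {status: "failed", run: 1}}}
	e.exec = func(string) error {
		atomic.AddInt64(&executions, 1)
		<-release // block so every concurrent submit overlaps this execution
		return nil
	}
	var callers sync.WaitGroup
	for i := 0; i < 8; i++ {
		callers.Add(1)
		go func() {
			defer callers.Done()
			e.submit("task-a")
		}()
	}
	callers.Wait()
	close(release)
	e.wg.Wait()
	e.mu.Lock()
	defer e.mu.Unlock()
	return atomic.LoadInt64(&executions), e.tasks["task-a"].run
}

func main() {
	executions, run := runDemo()
	fmt.Printf("executions=%d run=%d\n", executions, run)
}
```

The handler blocks until every submitter has returned, so a non-blocking handler cannot mask a missing lock by finishing before the next submit arrives.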
Summary
Cloud-API model for the sidecar task engine: the controller submits stable keys, and the sidecar owns the execution lifecycle.
- `Submit`: failed tasks are transparently re-executed on re-submit; running/completed are idempotent no-ops
- `Run` counter on `TaskResult`: tracks execution count under the same stable ID. Increments on failed→re-execute, NOT on crash-recovery rehydration
- `sync.Mutex` + `inFlight` map prevents double-execution of the same failed task ID
- `/v0/livez` endpoint: SQLite liveness check via `Ping()`, distinct from `/v0/healthz` (readiness). Use as Kubernetes liveness probe.
- SQLite migration v3 adds the `run` column

Why
Deterministic task IDs + PVC-persisted SQLite = permanently stuck failed tasks after pod restart. The engine's dedup check returned the cached failure without re-executing. This is the sidecar half of the Alpha P3 task reliability initiative. The controller half (plan IDs, simplified retry, failure diagnostics) follows in a separate PR on sei-node-controller-networking.
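The stuck-task problem and the fix can be sketched as a small state machine. In this sketch a dedup check that returned any existing record would leave a failed task failed forever; branching on status lets a re-submit re-execute it. Everything here is assumed for illustration: the `Engine`, `Handler`, and `Status` names, the in-memory map standing in for SQLite, and synchronous execution under the lock (chosen for deterministic output, not taken from the PR).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Status values are illustrative; the description names failed, running, completed.
type Status string

const (
	StatusRunning   Status = "running"
	StatusCompleted Status = "completed"
	StatusFailed    Status = "failed"
)

// TaskResult carries the Run counter; the field layout is an assumption.
type TaskResult struct {
	ID     string
	Status Status
	Run    int // executions under this ID; starts at 1
	Err    string
}

// Handler is a hypothetical execution callback.
type Handler func(id string) error

// errTransient is a sample failure used by the demo.
var errTransient = errors.New("transient failure")

// Engine keeps results in a map as a stand-in for the persisted store.
type Engine struct {
	mu      sync.Mutex
	tasks   map[string]*TaskResult
	handler Handler
}

func NewEngine(h Handler) *Engine {
	return &Engine{tasks: make(map[string]*TaskResult), handler: h}
}

// Submit branches on the existing status for id.
func (e *Engine) Submit(id string) TaskResult {
	e.mu.Lock()
	defer e.mu.Unlock()

	t, ok := e.tasks[id]
	switch {
	case !ok: // first submit: Run starts at 1
		t = &TaskResult{ID: id, Run: 1}
		e.tasks[id] = t
	case t.Status == StatusFailed: // re-submit of a failure: re-execute
		t.Run++
		t.Err = ""
	default: // running or completed: return the existing record
		return *t
	}

	t.Status = StatusRunning
	if err := e.handler(id); err != nil {
		t.Status = StatusFailed
		t.Err = err.Error()
	} else {
		t.Status = StatusCompleted
	}
	return *t
}

func main() {
	attempts := 0
	e := NewEngine(func(string) error {
		attempts++
		if attempts == 1 {
			return errTransient
		}
		return nil
	})
	for i := 1; i <= 3; i++ {
		r := e.Submit("task-a")
		fmt.Printf("submit %d: status=%s run=%d\n", i, r.Status, r.Run)
	}
}
```

With the handler failing once, the three submits report failed with run=1, completed with run=2, and completed with run=2 again, showing that only the failed→re-execute path bumps the counter.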
Test plan
- `TestSubmitReExecutesFailedTask`: failed task re-executes, Run increments to 2
- `TestSubmitReExecutesFailedTaskThatFailsAgain`: persistent failure increments Run
- `TestSubmitDoesNotIncrementRunOnRehydration`: crash recovery preserves Run=1
- `TestSubmitConcurrentSameFailedID`: mutex prevents double-execution
- `TestSubmitRunFieldOnFirstSubmit`: new tasks start at Run=1
- `TestLivezReturns200WhenStoreHealthy` / `TestLivezReturns200BeforeReady`
- `go vet` clean

🤖 Generated with Claude Code