feat: plan-scoped task IDs, simplified retry, failure diagnostics (Alpha P3) by bdchatham · Pull Request #50 · sei-protocol/sei-k8s-controller

bdchatham · 2026-04-03T18:51:38Z

Summary

Cloud-API model for task lifecycle: the controller submits stable keys and the sidecar owns execution lifecycle. Companion to sei-protocol/seictl#61.

Changes

Plan ID on TaskPlan — uuid.New() generated at plan creation. Task IDs derived from DeterministicTaskID(planID, taskType, planIndex). Unique across rebuilds, stable within a plan.
retryTask deleted — on failure with retries remaining, the executor resets task status to Pending with the same ID and requeues. The sidecar transparently re-executes failed tasks.
Failure diagnostics — FailedTaskIndex and FailedTaskDetail on TaskPlan record which task failed, its error, retry count, and max retries. Works for both SeiNode and SeiNodeGroup.
DeterministicTaskID refactored — signature (planID, taskType, planIndex) replaces (nodeName, taskType, attempt). All 8 call sites updated.
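
To illustrate the "unique across rebuilds, stable within a plan" property described above, here is a minimal sketch of a plan-scoped ID derivation. It is not the controller's actual implementation. The SHA-256 hashing scheme, the hex encoding, the `newPlanID` helper (a stdlib stand-in for `uuid.New()`), and the task type names are all assumptions made for illustration. Only the `(planID, taskType, planIndex)` signature comes from this PR.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newPlanID is a hypothetical stand-in for uuid.New(): it returns a random
// 128-bit identifier, hex-encoded. The real controller uses a UUID.
func newPlanID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// DeterministicTaskID derives a task ID from the plan ID, the task type, and
// the task's index within the plan. The hashing scheme here is an assumption;
// the point is only that equal inputs always yield the same ID.
func DeterministicTaskID(planID, taskType string, planIndex int) string {
	key := fmt.Sprintf("%s/%s/%d", planID, taskType, planIndex)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:16])
}

func main() {
	planA, planB := newPlanID(), newPlanID()

	// Stable within a plan: a retry of the same task reuses the same ID.
	first := DeterministicTaskID(planA, "configure-genesis", 0)
	retry := DeterministicTaskID(planA, "configure-genesis", 0)
	fmt.Println("stable within plan:", first == retry)

	// Unique across rebuilds: a rebuilt plan gets a new plan ID, so the
	// same task type at the same index gets a different task ID.
	rebuilt := DeterministicTaskID(planB, "configure-genesis", 0)
	fmt.Println("unique across rebuilds:", first != rebuilt)

	// Distinct within a plan: the index separates repeated task types.
	second := DeterministicTaskID(planA, "configure-genesis", 1)
	fmt.Println("distinct by index:", first != second)
}
```

Because the attempt number is no longer an input, a retried task keeps its ID. That is what lets the sidecar recognize a re-submit of a failed task instead of treating it as new work.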

Design doc

Full implementation brief included at docs/design-alpha-phase3-implementation.md.

What this does NOT include (deferred)

sei.io/retry-plan annotation — dropped from alpha scope
Auto-replacement of Failed SeiNodes by group controller
task_events history table in sidecar

Test plan

TestExecutePlan_RetryOnFailure — task ID stable on retry, RetryCount incremented
TestExecutePlan_ExhaustedRetries_FailsPlan — FailedTaskIndex and FailedTaskDetail recorded
TestExecuteGroupPlan_CompletesSuccessfully — plan-based IDs work for group plans
TestBuildPlan_UniqueIDsAcrossRebuilds — different plan IDs produce different task IDs
TestBuildGroupAssemblyPlan_UniqueIDsAcrossRebuilds — same for group plans
All existing tests pass, make test green

🤖 Generated with Claude Code

…pha P3)

Cloud-API task model: the controller submits stable keys derived from plan ID + task type + plan index. The sidecar owns execution lifecycle and transparently re-executes failed tasks on re-submit.

Plan ID:
- Each TaskPlan gets a uuid.New() at creation, stored in TaskPlan.ID.
- DeterministicTaskID signature changes from (nodeName, taskType, attempt) to (planID, taskType, planIndex). Same plan = same IDs. New plan = new IDs.

Simplified retry:
- retryTask function deleted. On ExecutionFailed with retries remaining, the executor resets task status to Pending (same ID) and requeues. The sidecar's status-aware Submit handles re-execution transparently.

Failure diagnostics:
- TaskPlan gains FailedTaskIndex (*int) and FailedTaskDetail (*FailedTaskInfo) for operator triage without inspecting the full task list.
- failTask records both fields before marking the plan as Failed.
- Works for both SeiNode and SeiNodeGroup (both carry *TaskPlan).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bdchatham force-pushed the feat/alpha-p3-plan-id-and-failure-diagnostics branch from ddac8f8 to 0ece2b3 on April 3, 2026 19:19

bdchatham merged commit f8f4300 into main Apr 3, 2026
2 checks passed

bdchatham mentioned this pull request Apr 3, 2026

chore: consolidate duplicate buildPlannedTask functions #52

Merged

2 tasks

bdchatham merged 1 commit into main from feat/alpha-p3-plan-id-and-failure-diagnostics
