feat: plan-scoped task IDs, simplified retry, failure diagnostics (Alpha P3) #50

Merged
bdchatham merged 1 commit into main from feat/alpha-p3-plan-id-and-failure-diagnostics on Apr 3, 2026

@bdchatham (Collaborator)

Summary

Cloud-API model for task lifecycle: the controller submits stable keys and the sidecar owns execution lifecycle. Companion to sei-protocol/seictl#61.

Changes

  • Plan ID on TaskPlan: uuid.New() generated at plan creation. Task IDs are derived from DeterministicTaskID(planID, taskType, planIndex), so they are unique across rebuilds and stable within a plan.
  • retryTask deleted: on failure with retries remaining, the executor resets the task status to Pending with the same ID and requeues it. The sidecar transparently re-executes failed tasks.
  • Failure diagnostics: FailedTaskIndex and FailedTaskDetail on TaskPlan record which task failed, its error, retry count, and max retries. Works for both SeiNode and SeiNodeGroup.
  • DeterministicTaskID refactored: the signature (planID, taskType, planIndex) replaces (nodeName, taskType, attempt). All 8 call sites updated.
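
For readers unfamiliar with the pattern, here is a minimal sketch of how a plan-scoped deterministic ID could behave. The PR text does not show how DeterministicTaskID derives its result, so the SHA-256 hashing, the string return type, and the sample plan and task names below are all assumptions made for illustration, not the actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeterministicTaskID is a hypothetical stand-in for the function named in
// the PR. Only the parameter list (planID, taskType, planIndex) comes from
// the PR; hashing with SHA-256 is an assumption for this sketch.
func DeterministicTaskID(planID, taskType string, planIndex int) string {
	key := fmt.Sprintf("%s/%s/%d", planID, taskType, planIndex)
	sum := sha256.Sum256([]byte(key))
	// Keep the first 16 bytes so the ID is short but still collision-resistant.
	return hex.EncodeToString(sum[:16])
}

func main() {
	// "plan-a", "plan-b", and "example-task" are made-up values.
	first := DeterministicTaskID("plan-a", "example-task", 0)
	again := DeterministicTaskID("plan-a", "example-task", 0)
	rebuilt := DeterministicTaskID("plan-b", "example-task", 0)

	fmt.Println("stable within a plan:", first == again)
	fmt.Println("unique across rebuilds:", first != rebuilt)
}
```

Because the plan ID is part of the key, a rebuilt plan (which gets a fresh plan ID) yields fresh task IDs, while a retry inside the same plan reuses the original ID.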

Design doc

Full implementation brief included at docs/design-alpha-phase3-implementation.md.

What this does NOT include (deferred)

  • sei.io/retry-plan annotation — dropped from alpha scope
  • Auto-replacement of Failed SeiNodes by group controller
  • task_events history table in sidecar

Test plan

  • TestExecutePlan_RetryOnFailure — task ID stable on retry, RetryCount incremented
  • TestExecutePlan_ExhaustedRetries_FailsPlan — FailedTaskIndex and FailedTaskDetail recorded
  • TestExecuteGroupPlan_CompletesSuccessfully — plan-based IDs work for group plans
  • TestBuildPlan_UniqueIDsAcrossRebuilds — different plan IDs produce different task IDs
  • TestBuildGroupAssemblyPlan_UniqueIDsAcrossRebuilds — same for group plans
  • All existing tests pass, make test green
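
The behaviour the first two tests exercise can be sketched roughly as follows. The names TaskPlan, FailedTaskIndex, FailedTaskDetail, FailedTaskInfo, RetryCount, and the Pending and Failed states come from the PR. Everything else (the field layout, the Phase field, and the onTaskFailure helper) is hypothetical and exists only to make the example self-contained.

```go
package main

import "fmt"

type Status string

const (
	Pending Status = "Pending"
	Failed  Status = "Failed"
)

// Task is an assumed shape; the PR does not list the real fields.
type Task struct {
	ID         string
	Status     Status
	RetryCount int
	MaxRetries int
}

// FailedTaskInfo holds what the PR says is recorded: the error, the retry
// count, and the max retries. Field names are assumptions.
type FailedTaskInfo struct {
	TaskID     string
	Error      string
	RetryCount int
	MaxRetries int
}

type TaskPlan struct {
	ID               string
	Phase            Status // hypothetical plan-level status field
	Tasks            []Task
	FailedTaskIndex  *int
	FailedTaskDetail *FailedTaskInfo
}

// onTaskFailure is a hypothetical helper. With retries remaining it resets
// the task to Pending under the same ID and reports true (requeue).
// Otherwise it records the diagnostics, marks the plan Failed, and reports false.
func onTaskFailure(plan *TaskPlan, index int, errMsg string) bool {
	task := &plan.Tasks[index]
	if task.RetryCount < task.MaxRetries {
		task.RetryCount++
		task.Status = Pending
		return true
	}
	task.Status = Failed
	failedIndex := index
	plan.FailedTaskIndex = &failedIndex
	plan.FailedTaskDetail = &FailedTaskInfo{
		TaskID:     task.ID,
		Error:      errMsg,
		RetryCount: task.RetryCount,
		MaxRetries: task.MaxRetries,
	}
	plan.Phase = Failed
	return false
}

func main() {
	plan := &TaskPlan{
		ID:    "plan-a",
		Phase: Pending,
		Tasks: []Task{{ID: "task-0", Status: Pending, MaxRetries: 1}},
	}
	fmt.Println("requeued:", onTaskFailure(plan, 0, "first failure"))
	fmt.Println("requeued:", onTaskFailure(plan, 0, "second failure"))
	fmt.Println("plan phase:", plan.Phase, "failed index:", *plan.FailedTaskIndex)
}
```

Storing the index and detail on the plan itself means an operator can read the failure cause from one place instead of scanning every task.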

🤖 Generated with Claude Code

…pha P3)

Cloud-API task model: the controller submits stable keys derived from
plan ID + task type + plan index. The sidecar owns execution lifecycle
and transparently re-executes failed tasks on re-submit.

Plan ID:
- Each TaskPlan gets a uuid.New() at creation, stored in TaskPlan.ID.
- DeterministicTaskID signature changes from (nodeName, taskType, attempt)
  to (planID, taskType, planIndex). Same plan = same IDs. New plan = new IDs.

Simplified retry:
- retryTask function deleted. On ExecutionFailed with retries remaining,
  the executor resets task status to Pending (same ID) and requeues.
  The sidecar's status-aware Submit handles re-execution transparently.

Failure diagnostics:
- TaskPlan gains FailedTaskIndex (*int) and FailedTaskDetail (*FailedTaskInfo)
  for operator triage without inspecting the full task list.
- failTask records both fields before marking the plan as Failed.
- Works for both SeiNode and SeiNodeGroup (both carry *TaskPlan).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham force-pushed the feat/alpha-p3-plan-id-and-failure-diagnostics branch from ddac8f8 to 0ece2b3 on April 3, 2026 19:19
@bdchatham merged commit f8f4300 into main on Apr 3, 2026
2 checks passed