Skip to content

Replace GetNode heartbeat hack with proper Heartbeat RPC#109

Merged
powderluv merged 1 commit intoROCm:mainfrom
shiv-tyagi:fix-heartbeats
Apr 19, 2026
Merged

Replace GetNode heartbeat hack with proper Heartbeat RPC#109
powderluv merged 1 commit intoROCm:mainfrom
shiv-tyagi:fix-heartbeats

Conversation

@shiv-tyagi
Copy link
Copy Markdown
Member

The agent heartbeat was piggybacking on GetNode RPCs, which gave GetNode a write side-effect and discarded the telemetry the agent collected (always sending 0,0 for cpu_load/free_memory_mb).

  • Add unary Heartbeat RPC (replaces both the GetNode hack and the unused streaming Heartbeat stub) that carries real telemetry
  • Remove write side-effect from GetNode (now a pure read)
  • Return NotFound for heartbeats from unregistered nodes so agents get visibility into controller state mismatches
  • Remove inline string-matching recovery from update_heartbeat; node recovery is handled solely by check_node_health via the state machine (admin_locked flag, not reason strings)
  • Delete dead WalOperation::NodeHeartbeat (never proposed, heartbeat telemetry is ephemeral and does not belong in the WAL)
  • Remove redundant tests covered by existing state machine and health check unit tests
  • Remove unused tokio-stream dependency and streaming stub from spurctld
  • Update spur-k8s heartbeat to use the new RPC; remove the re-register-on-NotFound fallback (Raft WAL preserves node state across restarts, registration is the node watcher's job)

The agent heartbeat was piggybacking on GetNode RPCs, which gave
GetNode a write side-effect and discarded the telemetry the agent
collected (always sending 0,0 for cpu_load/free_memory_mb).

- Add unary Heartbeat RPC (replaces both the GetNode hack and the
  unused streaming Heartbeat stub) that carries real telemetry
- Remove write side-effect from GetNode (now a pure read)
- Return NotFound for heartbeats from unregistered nodes so agents
  get visibility into controller state mismatches
- Remove inline string-matching recovery from update_heartbeat;
  node recovery is handled solely by check_node_health via the
  state machine (admin_locked flag, not reason strings)
- Delete dead WalOperation::NodeHeartbeat (never proposed, heartbeat
  telemetry is ephemeral and does not belong in the WAL)
- Remove redundant tests covered by existing state machine and
  health check unit tests
- Remove unused tokio-stream dependency and streaming stub from
  spurctld
- Update spur-k8s heartbeat to use the new RPC; remove the
  re-register-on-NotFound fallback (Raft WAL preserves node state
  across restarts, registration is the node watcher's job)

Made-with: Cursor
@shiv-tyagi shiv-tyagi requested a review from powderluv April 19, 2026 09:24
@powderluv powderluv merged commit 6ff27cb into ROCm:main Apr 19, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants