[improvement] : remove driver upgrade label from nodes #2123
rahulait wants to merge 1 commit into NVIDIA:main
Pull request overview
This PR adds functionality to clean up stale upgrade labels from nodes that no longer have driver pods scheduled on them. This addresses the scenario where nodeSelector changes cause driver pods to be removed from certain nodes, but the upgrade state labels remain on those nodes.
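To make the scenario concrete, here is a minimal sketch of how a nodeSelector change can drop nodes from the driver's target set. It assumes standard Kubernetes nodeSelector semantics (every key/value pair must match exactly). The helper names `matchesSelector` and `droppedBySelectorChange` are hypothetical and are not taken from this PR.

```go
package main

import "sort"

// matchesSelector reports whether labels satisfy every key/value pair in
// selector. An empty selector matches every node.
func matchesSelector(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// droppedBySelectorChange (hypothetical helper) returns the names of nodes
// that matched oldSel but no longer match newSel. These are the nodes whose
// driver pods would go away while any upgrade state label stays behind.
func droppedBySelectorChange(nodeLabels map[string]map[string]string, oldSel, newSel map[string]string) []string {
	var dropped []string
	for name, labels := range nodeLabels {
		if matchesSelector(labels, oldSel) && !matchesSelector(labels, newSel) {
			dropped = append(dropped, name)
		}
	}
	sort.Strings(dropped) // map iteration order is not stable
	return dropped
}
```

Passing an empty old selector and a new selector that requires a specific GPU type label returns every node lacking that label value.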
Changes:
- Added `clearUpgradeLabelsWhereDriverNotRunning` function to remove upgrade labels from nodes without driver pods
- Integrated the cleanup function into the main Reconcile loop as a best-effort operation
- The cleanup skips nodes actively being managed by the upgrade process to avoid interference
530ced1 to f5b193c
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
f5b193c to 604eaa6
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
0d17a35 to b2f136d
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
this commit removes driver upgrade label from nodes which don't have any driver pod running on them
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
b2f136d to f25f905
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Description
This PR adds automatic cleanup of stale upgrade labels from nodes where GPU driver pods are no longer scheduled, addressing the edge case where `nodeSelector` changes cause pods to terminate but leave `nvidia.com/gpu-driver-upgrade-state` labels behind.

Problem
When an `NVIDIADriver` CR's `nodeSelector` is updated, driver pods terminate on nodes that no longer match, but the upgrade state labels (`nvidia.com/gpu-driver-upgrade-state`) remain on those nodes indefinitely.

Example scenario:
Nodes without `nvidia.com/gpu-type: "A100"` lose their driver pods but keep upgrade labels.

Solution
Added `clearUpgradeLabelsWhereDriverNotRunning()` function to the upgrade controller that removes these labels from nodes with no driver pod scheduled.

Checklist
- `make lint`
- `make validate-generated-assets`
- `make validate-modules`

Testing