[improvement] : remove driver upgrade label from nodes#2123

Open
rahulait wants to merge 1 commit into NVIDIA:main from rahulait:cleanup-upgrade-labels

Conversation

rahulait (Contributor) commented Feb 13, 2026

This commit removes the driver upgrade label from nodes that have no driver pod running on them.

Description

This PR adds automatic cleanup of stale upgrade labels from nodes where GPU driver pods are no longer scheduled, addressing the edge case where nodeSelector changes cause pods to terminate but leave nvidia.com/gpu-driver-upgrade-state labels behind.

Problem

When an NVIDIADriver CR's nodeSelector is updated:

  1. The DaemonSet updates its pod template with the new selector
  2. Pods on nodes no longer matching the selector are terminated
  3. The upgrade labels (nvidia.com/gpu-driver-upgrade-state) remain on those nodes indefinitely
  4. This creates confusion and incorrect upgrade state tracking

Example scenario:

# Initial: nodeSelector targets GPU nodes with specific label
spec:
  nodeSelector:
    nvidia.com/gpu: "true"

# Updated: nodeSelector narrows to specific GPU type
spec:
  nodeSelector:
    nvidia.com/gpu: "true"
    nvidia.com/gpu-type: "A100"

Nodes without nvidia.com/gpu-type: "A100" lose their driver pods but keep upgrade labels.
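
The effect of narrowing the selector can be illustrated with a small, self-contained Go sketch. It is not code from this PR: matchesSelector and the node names are hypothetical, and the function only models the equality-based matching that nodeSelector performs.

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy every key/value
// pair in the selector (equality-based matching, as nodeSelector uses).
// Hypothetical helper for illustration; not part of the PR.
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical node labels, for illustration only.
	nodes := map[string]map[string]string{
		"node-a": {"nvidia.com/gpu": "true", "nvidia.com/gpu-type": "A100"},
		"node-b": {"nvidia.com/gpu": "true"},
	}
	oldSel := map[string]string{"nvidia.com/gpu": "true"}
	newSel := map[string]string{"nvidia.com/gpu": "true", "nvidia.com/gpu-type": "A100"}

	for _, name := range []string{"node-a", "node-b"} {
		labels := nodes[name]
		fmt.Printf("%s: old=%v new=%v\n", name,
			matchesSelector(labels, oldSel), matchesSelector(labels, newSel))
	}
}
```

In this sketch node-b matches the old selector but not the new one, so it is the kind of node that would lose its driver pod while keeping the upgrade label.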

Solution

Added a clearUpgradeLabelsWhereDriverNotRunning() function to the upgrade controller that:

  • Runs automatically every 2 minutes during normal upgrade controller reconciliation
  • Identifies stale labels by finding nodes with upgrade labels but no driver pods
  • Protects active upgrades by skipping nodes in ClusterUpgradeState.NodeStates (actively managed by upgrade process)
  • Removes labels safely using Patch() for efficient, conflict-resistant updates
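
The selection rule described in the bullets above can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: the node type, staleNodes, and removeLabelPatch are hypothetical names, plain maps stand in for the pod list and ClusterUpgradeState.NodeStates, and the JSON merge patch body is only one plausible way to remove a label (the description says Patch() is used but does not state the patch type).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

const upgradeStateLabel = "nvidia.com/gpu-driver-upgrade-state"

// node is a simplified stand-in for a Kubernetes Node (hypothetical type).
type node struct {
	Name   string
	Labels map[string]string
}

// staleNodes returns the names of nodes that carry the upgrade label,
// host no driver pod, and are not actively managed by the upgrade process.
func staleNodes(nodes []node, hasDriverPod, activelyManaged map[string]bool) []string {
	var stale []string
	for _, n := range nodes {
		if _, labeled := n.Labels[upgradeStateLabel]; !labeled {
			continue // nothing to clean up
		}
		if hasDriverPod[n.Name] || activelyManaged[n.Name] {
			continue // label is still meaningful; leave it alone
		}
		stale = append(stale, n.Name)
	}
	sort.Strings(stale)
	return stale
}

// removeLabelPatch builds a JSON merge patch body that deletes one label.
// In a JSON merge patch (RFC 7386), a null value removes the key.
func removeLabelPatch(label string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]interface{}{label: nil},
		},
	}
	return json.Marshal(patch)
}

func main() {
	nodes := []node{
		{Name: "node-a", Labels: map[string]string{upgradeStateLabel: "upgrade-done"}},
		{Name: "node-b", Labels: map[string]string{upgradeStateLabel: "upgrade-done"}},
		{Name: "node-c", Labels: map[string]string{}},
	}
	hasDriverPod := map[string]bool{"node-a": true}
	activelyManaged := map[string]bool{"node-a": true}

	fmt.Println("stale:", staleNodes(nodes, hasDriverPod, activelyManaged))

	body, err := removeLabelPatch(upgradeStateLabel)
	if err != nil {
		panic(err)
	}
	fmt.Println("patch:", string(body))
}
```

A patch that names only the one label leaves every other label and field on the node untouched, which is why a patch is less conflict-prone than reading and rewriting the whole Node object.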

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Copilot AI left a comment


Pull request overview

This PR adds functionality to clean up stale upgrade labels from nodes that no longer have driver pods scheduled on them. This addresses the scenario where nodeSelector changes cause driver pods to be removed from certain nodes, but the upgrade state labels remain on those nodes.

Changes:

  • Added clearUpgradeLabelsWhereDriverNotRunning function to remove upgrade labels from nodes without driver pods
  • Integrated the cleanup function into the main Reconcile loop as a best-effort operation
  • The cleanup skips nodes actively being managed by the upgrade process to avoid interference
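
What "best-effort" could mean in the Reconcile loop is sketched below, assuming the cleanup runs after the main upgrade step and that its errors are logged rather than returned. The names reconcileOnce and errDemo are hypothetical, and the PR's actual ordering and error handling may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a simulated failure used only by this sketch.
var errDemo = errors.New("simulated patch failure")

// reconcileOnce runs the required upgrade step, then the label cleanup as a
// best-effort step: a cleanup failure is logged but does not fail the
// reconcile. Hypothetical structure; not taken from the PR.
func reconcileOnce(upgrade, cleanup func() error, logf func(format string, args ...interface{})) error {
	if err := upgrade(); err != nil {
		return err // upgrade errors still fail the reconcile
	}
	if err := cleanup(); err != nil {
		logf("best-effort label cleanup failed: %v", err)
	}
	return nil
}

func main() {
	err := reconcileOnce(
		func() error { return nil },     // upgrade step succeeds
		func() error { return errDemo }, // cleanup fails
		func(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) },
	)
	fmt.Println("reconcile error:", err)
}
```

With this shape, a transient failure while removing a stale label cannot block or requeue the upgrade flow; the cleanup is simply attempted again on the next reconciliation.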


Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



@rahulait rahulait force-pushed the cleanup-upgrade-labels branch 3 times, most recently from 0d17a35 to b2f136d Compare February 13, 2026 20:56
@rahulait rahulait requested a review from Copilot February 13, 2026 20:58

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



this commit removes driver upgrade label from nodes which don't have any driver pod running on them

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

