Skip to content

AGENT-1449: Add IRI registry credential rotation support#5766

Open
rwsu wants to merge 4 commits intoopenshift:mainfrom
rwsu:AGENT-1449-auth-rotation
Open

AGENT-1449: Add IRI registry credential rotation support#5766
rwsu wants to merge 4 commits intoopenshift:mainfrom
rwsu:AGENT-1449-auth-rotation

Conversation

@rwsu
Copy link
Copy Markdown

@rwsu rwsu commented Mar 13, 2026

- What I did

Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:

  1. Deploy dual htpasswd (old + new credentials with different usernames)
  2. Update pull secret after all MCPs finish rolling out
  3. Clean up dual htpasswd to single entry after new pull secret is deployed

This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.

Key changes:

  • Add MachineConfigPool lister/informer to IRI controller
  • Add reconcileAuthCredentials with three-case rotation logic
  • Add getDeployedIRICredentials (reads from rendered MC, not API)
  • Add areAllPoolsUpdated (checks all pools including workers)
  • Add credential rotation design doc
  • Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the Distribution registry re-reads htpasswd on modification time change
  • Pull secret validation — validatePullSecretAuthsPreserved that ensures the merge never drops or modifies existing registry auth entries (not mentioned)

- How to verify it

Update the password to trigger the rotation to start:

oc -n openshift-machine-config-operator patch secret internal-release-image-registry-auth \
  --type merge -p '{"data":{"password":"'$(echo -n "new-password" | base64)'"}}'

Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.

- Description for the changelog

Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.

Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.

Depends on #5765.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 13, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d17e3a87-f022-434a-b812-b8ffff606b3d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds IRI registry authentication and a multi‑phase credential rotation flow: introduces an IRI auth secret and htpasswd handling, merges auth into the global pull secret, makes the InternalReleaseImage controller MachineConfigPool-aware, propagates auth into render/bootstrap paths, updates node templates/daemon behavior, and adds unit and e2e tests plus documentation.

Changes

Cohort / File(s) Summary
Controller core & rotation
pkg/controller/internalreleaseimage/internalreleaseimage_controller.go, pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap.go
Adds MCP-aware credential rotation state machine, ErrInconsistentIRICredentials, MCP informer wiring/handlers, reading deployed credentials from rendered MachineConfigs, reconcileAuthCredentials flow, and forwards auth secret into bootstrap rendering.
Renderer & templates
pkg/controller/internalreleaseimage/internalreleaseimage_renderer.go, pkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yaml, pkg/controller/internalreleaseimage/templates/master/files/usr-local-bin-load-registry-image-sh.yaml, pkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yaml
Renderer gains optional iriAuthSecret/IriHtpasswd; templates add htpasswd file, conditional mount/env vars, and conditional authfile usage in pull script.
Pull‑secret & htpasswd utilities + tests
pkg/controller/internalreleaseimage/pullsecret.go, pkg/controller/internalreleaseimage/pullsecret_test.go
New utilities for merging/extracting IRI creds into/from dockerconfigjson, username generation, htpasswd generation/validation, and comprehensive unit tests for these behaviors.
Bootstrap wiring & tests
pkg/controller/bootstrap/bootstrap.go, pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap_test.go
Reads internal-release-image-registry-auth Secret during bootstrap, merges IRI auth into pull secret when gated, and updates bootstrap tests to cover auth present/absent cases.
Controller tests & fixtures
pkg/controller/internalreleaseimage/internalreleaseimage_controller_test.go, pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
Extends fixtures/builders (mcp, iri auth, rendered MC), wires MCP informer/indexer into tests, adds TestCredentialRotation and pull-secret verification, and updates ignition/content assertions for auth.
Daemon / node handling
pkg/daemon/constants/constants.go, pkg/daemon/update.go, pkg/daemon/update_test.go, pkg/apihelpers/apihelpers.go
Adds IRIRegistryAuthFile constant, treats htpasswd diffs as no-op for post-config actions, adds default cluster policy entry for the auth file, and updates related tests.
Start wiring
cmd/machine-config-controller/start.go
Passes MachineConfigPool informer into InternalReleaseImage controller constructor.
Constants
pkg/controller/common/constants.go
Adds InternalReleaseImageAuthSecretName = "internal-release-image-registry-auth".
Documentation
docs/IRIAuthCredentialRotation.md
New operational doc describing three‑phase credential rotation, detection, triggers, and safety properties.
End‑to‑end tests
test/e2e-iri/iri_test.go
Adds E2E tests and helpers for unauthenticated 401, authenticated access using pull secret, and full credential rotation workflow with MCP rollout verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rwsu: An error was encountered searching for bug AGENT-1449 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. No response returned: Get "https://issues.redhat.com/rest/api/2/issue/AGENT-1449": GET https://issues.redhat.com/rest/api/2/issue/AGENT-1449 giving up after 5 attempt(s)

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

- What I did

Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:

  1. Deploy dual htpasswd (old + new credentials with different usernames)
  2. Update pull secret after all MCPs finish rolling out
  3. Clean up dual htpasswd to single entry after new pull secret is deployed

This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.

Key changes:

  • Add MachineConfigPool lister/informer to IRI controller
  • Add reconcileAuthCredentials with three-case rotation logic
  • Add getDeployedIRICredentials (reads from rendered MC, not API)
  • Add areAllPoolsUpdated (checks all pools including workers)
  • Add HtpasswdHasValidEntry, GenerateHtpasswdEntry, GenerateDualHtpasswd,
  • NextIRIUsername, ExtractIRICredentialsFromPullSecret helpers
  • Vendor golang.org/x/crypto/bcrypt for htpasswd hash generation
  • Add credential rotation design doc

- How to verify it

Update the password to trigger the rotation to start:

oc -n openshift-machine-config-operator patch secret internal-release-image-registry-auth \
 --type merge -p '{"data":{"password":"'$(echo -n "new-password" | base64)'"}}'

Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.

- Description for the changelog

Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.

Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.

Depends on #5765.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rwsu
Once this PR has been reviewed and has the lgtm label, please assign rishabhsaini for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rwsu
Copy link
Copy Markdown
Author

rwsu commented Mar 13, 2026

/cc @andfasano

@openshift-ci openshift-ci bot requested a review from andfasano March 13, 2026 22:08
@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch 2 times, most recently from 7762c05 to 2e3bd60 Compare March 16, 2026 22:15
@rwsu
Copy link
Copy Markdown
Author

rwsu commented Mar 16, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 16, 2026

@rwsu: This pull request references AGENT-1449 which is a valid jira issue.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch 2 times, most recently from 89067e0 to fa5b7cc Compare March 17, 2026 03:32
@rwsu rwsu changed the title [WIP] AGENT-1449: Add IRI registry credential rotation support AGENT-1449: Add IRI registry credential rotation support Mar 17, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 17, 2026

@rwsu: This pull request references AGENT-1449 which is a valid jira issue.

Details

In response to this:

- What I did

Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:

  1. Deploy dual htpasswd (old + new credentials with different usernames)
  2. Update pull secret after all MCPs finish rolling out
  3. Clean up dual htpasswd to single entry after new pull secret is deployed

This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.

Key changes:

  • Add MachineConfigPool lister/informer to IRI controller
  • Add reconcileAuthCredentials with three-case rotation logic
  • Add getDeployedIRICredentials (reads from rendered MC, not API)
  • Add areAllPoolsUpdated (checks all pools including workers)
  • Add HtpasswdHasValidEntry, GenerateHtpasswdEntry, GenerateDualHtpasswd,
  • NextIRIUsername, ExtractIRICredentialsFromPullSecret helpers
  • Vendor golang.org/x/crypto/bcrypt for htpasswd hash generation
  • Add credential rotation design doc
  • Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the Distribution registry re-reads htpasswd on modification time change

- How to verify it

Update the password to trigger the rotation to start:

oc -n openshift-machine-config-operator patch secret internal-release-image-registry-auth \
 --type merge -p '{"data":{"password":"'$(echo -n "new-password" | base64)'"}}'

Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.

- Description for the changelog

Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.

Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.

Depends on #5765.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch from fa5b7cc to e6a9246 Compare March 17, 2026 03:35
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go (1)

375-376: Handle json.Marshal errors in the test builder to avoid silent fixture corruption.

Ignoring this error can mask setup failures and produce misleading downstream assertions.

Suggested refactor
-		raw, _ := json.Marshal(ignCfg)
+		raw, err := json.Marshal(ignCfg)
+		if err != nil {
+			panic(fmt.Sprintf("failed to marshal ignition config in renderedMCBuilder: %v", err))
+		}
 		rb.obj.Spec.Config.Raw = raw
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go`
around lines 375 - 376, The test builder currently ignores the error from
json.Marshal(ignCfg) before setting rb.obj.Spec.Config.Raw which can hide
fixture setup failures; change the code to capture the error (err :=
json.Marshal(ignCfg)) and handle it instead of discarding it—either return the
error from the builder or call the test failure helper (e.g., t.Fatalf) so
failures surface, and then only assign rb.obj.Spec.Config.Raw = raw when err ==
nil.
docs/IRIAuthCredentialRotation.md (1)

87-116: Add language specifier to the fenced code block.

The state diagram uses a fenced code block without a language identifier. Consider adding a language hint (e.g., text or plaintext) to satisfy linting rules and improve rendering consistency.

-```
+```text
               ┌────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/IRIAuthCredentialRotation.md` around lines 87 - 116, The fenced ASCII
state-diagram block lacks a language tag; update the opening fence for that
diagram (the block starting with the box-drawing diagram
"┌────────────────────────┐" / "No IRI entry in pull secret") to include a plain
text language hint (for example change "```" to "```text" or "```plaintext") so
linting and rendering recognize it as preformatted text.
test/e2e-iri/iri_test.go (1)

205-212: Consider setting MinVersion in the IRI test client for consistency with production TLS hardening practices.

getIRIRegistryClient relies on Go's default TLS protocol negotiation. While not critical for test code, explicitly setting MinVersion: tls.VersionTLS13 would align with production Kubernetes patterns and ensure stable protocol selection across runtime versions.

Suggested change
 	return &http.Client{
 		Transport: &http.Transport{
 			TLSClientConfig: &tls.Config{
+				MinVersion: tls.VersionTLS13,
 				RootCAs: roots,
 			},
 		},
 		Timeout: 30 * time.Second,
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e-iri/iri_test.go` around lines 205 - 212, The TLS config returned by
getIRIRegistryClient in iri_test.go does not set a minimum TLS version; update
the TLSConfig used in the http.Client (the TLSClientConfig inside the Transport
returned by getIRIRegistryClient) to include MinVersion: tls.VersionTLS13 so the
test client explicitly enforces TLS 1.3 for consistency with production
hardening and stable protocol selection across Go runtimes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 383-394: The Secret returned by ctrl.secretLister (iriAuthSecret)
is being mutated later (authSecret.Data["htpasswd"] in
reconcileAuthCredentials), which corrupts the informer cache; make a deep copy
of the Secret before any mutation. Either call iriAuthSecret =
iriAuthSecret.DeepCopy() immediately after retrieving it (before passing into
reconcileAuthCredentials) or perform the DeepCopy at the start of
reconcileAuthCredentials and then mutate the copy. Ensure all mutations to
authSecret.Data (the htpasswd updates at the locations referencing lines ~590
and ~635) operate on the DeepCopy, not the lister-provided object.
- Around line 303-307: The current gating only compares
UpdatedMachineCount/ReadyMachineCount; update both updateMachineConfigPool() and
areAllPoolsUpdated() to also consider configuration-name convergence: in
updateMachineConfigPool(), besides comparing
oldPool.Status.UpdatedMachineCount/ReadyMachineCount, also enqueue when
oldPool.Spec.Configuration.Name != curPool.Spec.Configuration.Name or
oldPool.Status.Configuration.Name != curPool.Status.Configuration.Name so
config-name changes trigger a sync; in areAllPoolsUpdated(), in addition to
checking UpdatedMachineCount and ReadyMachineCount equality, require that for
each pool curPool.Spec.Configuration.Name == curPool.Status.Configuration.Name
(and matches the controller’s desired config name if applicable) before treating
the pool as “updated” to prevent advancing credential rotation while nodes are
still converging on the new configuration.

---

Nitpick comments:
In `@docs/IRIAuthCredentialRotation.md`:
- Around line 87-116: The fenced ASCII state-diagram block lacks a language tag;
update the opening fence for that diagram (the block starting with the
box-drawing diagram "┌────────────────────────┐" / "No IRI entry in pull
secret") to include a plain text language hint (for example change "```" to
"```text" or "```plaintext") so linting and rendering recognize it as
preformatted text.

In `@pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go`:
- Around line 375-376: The test builder currently ignores the error from
json.Marshal(ignCfg) before setting rb.obj.Spec.Config.Raw which can hide
fixture setup failures; change the code to capture the error (err :=
json.Marshal(ignCfg)) and handle it instead of discarding it—either return the
error from the builder or call the test failure helper (e.g., t.Fatalf) so
failures surface, and then only assign rb.obj.Spec.Config.Raw = raw when err ==
nil.

In `@test/e2e-iri/iri_test.go`:
- Around line 205-212: The TLS config returned by getIRIRegistryClient in
iri_test.go does not set a minimum TLS version; update the TLSConfig used in the
http.Client (the TLSClientConfig inside the Transport returned by
getIRIRegistryClient) to include MinVersion: tls.VersionTLS13 so the test client
explicitly enforces TLS 1.3 for consistency with production hardening and stable
protocol selection across Go runtimes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af005b0e-1fe3-4910-81e2-b8af5b4ddbf7

📥 Commits

Reviewing files that changed from the base of the PR and between b06051f and e6a9246.

⛔ Files ignored due to path filters (3)
  • vendor/golang.org/x/crypto/bcrypt/base64.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/bcrypt/bcrypt.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (20)
  • cmd/machine-config-controller/start.go
  • docs/IRIAuthCredentialRotation.md
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootstrap/bootstrap.go
  • pkg/controller/common/constants.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_renderer.go
  • pkg/controller/internalreleaseimage/pullsecret.go
  • pkg/controller/internalreleaseimage/pullsecret_test.go
  • pkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yaml
  • pkg/controller/internalreleaseimage/templates/master/files/usr-local-bin-load-registry-image-sh.yaml
  • pkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yaml
  • pkg/daemon/constants/constants.go
  • pkg/daemon/update.go
  • pkg/daemon/update_test.go
  • test/e2e-iri/iri_test.go

@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch from e6a9246 to 7a98820 Compare March 17, 2026 14:45
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/controller/internalreleaseimage/internalreleaseimage_controller.go (1)

685-688: Consider adding retry.RetryOnConflict for pull secret update.

Other secret updates in this file (e.g., status updates at line 480) use retry logic. While the controller's handleErr provides eventual retry, an inline retry would converge faster on conflicts.

♻️ Optional: Add retry wrapper
-	pullSecret.Data[corev1.DockerConfigJsonKey] = mergedBytes
-	if _, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(
-		context.TODO(), pullSecret, metav1.UpdateOptions{}); err != nil {
-		return fmt.Errorf("failed to update pull secret with rotated credentials: %w", err)
-	}
+	if err := retry.RetryOnConflict(updateBackoff, func() error {
+		pullSecret, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get(
+			context.TODO(), ctrlcommon.GlobalPullSecretName, metav1.GetOptions{})
+		if err != nil {
+			return err
+		}
+		mergedBytes, err := MergeIRIAuthIntoPullSecretWithUsername(
+			pullSecret.Data[corev1.DockerConfigJsonKey], newUsername, desiredPassword, baseDomain)
+		if err != nil {
+			return err
+		}
+		pullSecret.Data[corev1.DockerConfigJsonKey] = mergedBytes
+		_, err = ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(
+			context.TODO(), pullSecret, metav1.UpdateOptions{})
+		return err
+	}); err != nil {
+		return fmt.Errorf("failed to update pull secret with rotated credentials: %w", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`
around lines 685 - 688, Replace the direct Secrets().Update call with a
retry.RetryOnConflict loop so we fetch the latest secret, apply the rotated
credentials to the fetched object, and attempt Update until success or
non-conflict error; specifically wrap the operation that currently calls
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
pullSecret, metav1.UpdateOptions{}) in retry.RetryOnConflict(retry.DefaultRetry,
func() error { s, err :=
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get(context.TODO(),
pullSecret.Name, metav1.GetOptions{}); if err != nil { return err }; modify s
with the rotated credentials from pullSecret; _, err =
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
s, metav1.UpdateOptions{}); return err }); ensure errors are returned unchanged
from the wrapper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/internalreleaseimage/pullsecret.go`:
- Around line 173-184: GenerateDualHtpasswd must reject identical usernames: add
an explicit check in GenerateDualHtpasswd comparing currentUsername and
newUsername and return a descriptive error (e.g., via fmt.Errorf) if they are
equal, before calling GenerateHtpasswdEntry; keep existing error wrapping for
GenerateHtpasswdEntry calls and return the concatenated entries only when
usernames differ.

In `@test/e2e-iri/iri_test.go`:
- Around line 296-405: The test mutates the cluster auth secret but never
restores it; add a t.Cleanup closure immediately after reading originalPassword
that updates authSecret.Data["password"] back to originalPassword via
cs.Secrets(...).Update, then poll/wait until the pull secret returns the
original credentials (use iri.ExtractIRICredentialsFromPullSecret) and wait for
MCP rollout completion with helpers.WaitForPoolCompleteAny for "master" and
"worker" so the cluster is returned to its original state even on test failure.

---

Nitpick comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 685-688: Replace the direct Secrets().Update call with a
retry.RetryOnConflict loop so we fetch the latest secret, apply the rotated
credentials to the fetched object, and attempt Update until success or
non-conflict error; specifically wrap the operation that currently calls
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
pullSecret, metav1.UpdateOptions{}) in retry.RetryOnConflict(retry.DefaultRetry,
func() error { s, err :=
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get(context.TODO(),
pullSecret.Name, metav1.GetOptions{}); if err != nil { return err }; modify s
with the rotated credentials from pullSecret; _, err =
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
s, metav1.UpdateOptions{}); return err }); ensure errors are returned unchanged
from the wrapper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1eb2ba4c-3936-4b9c-a498-4d8c164858f1

📥 Commits

Reviewing files that changed from the base of the PR and between e6a9246 and 7a98820.

⛔ Files ignored due to path filters (3)
  • vendor/golang.org/x/crypto/bcrypt/base64.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/bcrypt/bcrypt.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (12)
  • cmd/machine-config-controller/start.go
  • docs/IRIAuthCredentialRotation.md
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
  • pkg/controller/internalreleaseimage/pullsecret.go
  • pkg/controller/internalreleaseimage/pullsecret_test.go
  • pkg/daemon/constants/constants.go
  • pkg/daemon/update.go
  • pkg/daemon/update_test.go
  • test/e2e-iri/iri_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/daemon/update_test.go
  • pkg/controller/internalreleaseimage/pullsecret_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 17, 2026

@rwsu: This pull request references AGENT-1449 which is a valid jira issue.

Details

In response to this:

- What I did

Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:

  1. Deploy dual htpasswd (old + new credentials with different usernames)
  2. Update pull secret after all MCPs finish rolling out
  3. Clean up dual htpasswd to single entry after new pull secret is deployed

This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.

Key changes:

  • Add MachineConfigPool lister/informer to IRI controller
  • Add reconcileAuthCredentials with three-case rotation logic
  • Add getDeployedIRICredentials (reads from rendered MC, not API)
  • Add areAllPoolsUpdated (checks all pools including workers)
  • Add credential rotation design doc
  • Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the Distribution registry re-reads htpasswd on modification time change

- How to verify it

Update the password to trigger the rotation to start:

oc -n openshift-machine-config-operator patch secret internal-release-image-registry-auth \
 --type merge -p '{"data":{"password":"'$(echo -n "new-password" | base64)'"}}'

Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.

- Description for the changelog

Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.

Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.

Depends on #5765.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 17, 2026

@rwsu: This pull request references AGENT-1449 which is a valid jira issue.

Details

In response to this:

- What I did

Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:

  1. Deploy dual htpasswd (old + new credentials with different usernames)
  2. Update pull secret after all MCPs finish rolling out
  3. Clean up dual htpasswd to single entry after new pull secret is deployed

This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.

Key changes:

  • Add MachineConfigPool lister/informer to IRI controller
  • Add reconcileAuthCredentials with three-case rotation logic
  • Add getDeployedIRICredentials (reads from rendered MC, not API)
  • Add areAllPoolsUpdated (checks all pools including workers)
  • Add credential rotation design doc
  • Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the Distribution registry re-reads htpasswd on modification time change
  • Pull secret validation — validatePullSecretAuthsPreserved that ensures the merge never drops or modifies existing registry auth entries (not mentioned)

- How to verify it

Update the password to trigger the rotation to start:

oc -n openshift-machine-config-operator patch secret internal-release-image-registry-auth \
 --type merge -p '{"data":{"password":"'$(echo -n "new-password" | base64)'"}}'

Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.

- Description for the changelog

Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.

Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.

Depends on #5765.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch from 7a98820 to 6df639c Compare March 17, 2026 21:35
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 387-392: The secret retrieval currently ignores lister errors by
discarding the error from ctrl.secretLister.Secrets(...).Get; change the
iriAuthSecret, _ := ...Get(...) call to capture the error, check it with
apierrors.IsNotFound(err) and only treat NotFound as acceptable, but for any
other error return/requeue the error from the reconcile so the controller
doesn't proceed silently; keep the existing iriAuthSecret.DeepCopy() logic when
iriAuthSecret != nil. Ensure you reference
ctrl.secretLister.Secrets(ctrlcommon.MCONamespace).Get, the iriAuthSecret
variable, and ctrlcommon.InternalReleaseImageAuthSecretName when locating the
change.

In `@test/e2e-iri/iri_test.go`:
- Around line 343-350: The cleanup restore is merging the IRI auth using the
stale originalPullSecretRaw and originalPassword which can overwrite concurrent
updates and restore the wrong password; instead after fetching the current
secret into ps (from cs.Secrets(...).Get(...)) use the current secret data
(ps.Data[corev1.DockerConfigJsonKey]) as the base input to
iri.MergeIRIAuthIntoPullSecretWithUsername and pass the originally-observed
password variable originalPullSecretPassword (not originalPassword) so you merge
on the latest secret state and restore the exact original observed password;
update the call site where MergeIRIAuthIntoPullSecretWithUsername is invoked to
use those variables (and keep existing error handling).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5882d527-dc41-499a-9719-d10640ec938d

📥 Commits

Reviewing files that changed from the base of the PR and between 7a98820 and 6df639c.

⛔ Files ignored due to path filters (3)
  • vendor/golang.org/x/crypto/bcrypt/base64.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/bcrypt/bcrypt.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (12)
  • cmd/machine-config-controller/start.go
  • docs/IRIAuthCredentialRotation.md
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
  • pkg/controller/internalreleaseimage/pullsecret.go
  • pkg/controller/internalreleaseimage/pullsecret_test.go
  • pkg/daemon/constants/constants.go
  • pkg/daemon/update.go
  • pkg/daemon/update_test.go
  • test/e2e-iri/iri_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/internalreleaseimage/pullsecret.go
  • pkg/daemon/update_test.go

@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch from 6df639c to 6f21ddd Compare March 18, 2026 02:49
@rwsu
Copy link
Copy Markdown
Author

rwsu commented Mar 18, 2026

/verified by TestIRIAuth_CredentialRotation e2e test and @rwsu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 18, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rwsu: This PR has been marked as verified by TestIRIAuth_CredentialRotation e2e test and @rwsu.

Details

In response to this:

/verified by TestIRIAuth_CredentialRotation e2e test and @rwsu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
rwsu added 4 commits March 23, 2026 10:31
Add htpasswd-based authentication to the IRI registry. The installer
generates credentials and provides them via a bootstrap secret. The MCO
mounts the htpasswd file into the registry container and configures
registry auth environment variables. The registry password is merged
into the node pull secret so kubelet can authenticate when pulling
the release image.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
The IRI controller merges registry auth credentials into the global
pull secret after bootstrap. This triggers the template controller to
re-render template MCs (00-master, etc.) with the updated pull secret,
producing a different rendered MC hash than what bootstrap created.

The mismatch causes the MCD DaemonSet pod to fail during bootstrap:
it reads the bootstrap-rendered MC name from the node annotation, but
that MC no longer exists in-cluster (replaced by the re-rendered one).
The MCD falls back to reading /etc/machine-config-daemon/currentconfig,
which was never written because the firstboot MCD detected "no changes"
and skipped it. Both master nodes go Degraded and never recover.

Fix by merging IRI auth into the pull secret during bootstrap before
template MC rendering, so both bootstrap and in-cluster produce
identical rendered MC hashes.

Extract the pull secret merge logic into a shared MergeIRIAuthIntoPullSecret
function used by both the bootstrap path and the in-cluster IRI controller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Implement safe credential rotation for the IRI registry using a
desired-vs-current pattern with generation-numbered usernames.

The auth secret holds the desired password; the pull secret (read from
rendered MachineConfig) holds the deployed password. When they differ, a
three-phase rotation is performed:

1. Deploy dual htpasswd (old + new credentials with different usernames)
2. Update pull secret after all MCPs finish rolling out
3. Clean up dual htpasswd to single entry after new pull secret is deployed

Key changes:

- Add MachineConfigPool event handler to IRI controller so that rotation
  phases advance automatically when pool rollouts complete
- Add reconcileAuthCredentials as the main dispatcher, with
  reconcileCredentialRotation (Phase 1/2) and
  reconcileCredentialsInSync (Phase 3) as extracted helpers
- Add getDeployedIRICredentials (reads from rendered MC, not API)
- Add areAllPoolsUpdated (checks all pools including workers)
- Add HtpasswdHasValidEntry, GenerateHtpasswdEntry, GenerateDualHtpasswd,
  NextIRIUsername, ExtractIRICredentialsFromPullSecret helpers
- Vendor golang.org/x/crypto/bcrypt for htpasswd hash generation
- Add credential rotation design doc
- Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the
  Distribution registry re-reads htpasswd on modification time change

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
…ial rotation

Add e2e tests that verify:
- Unauthenticated requests to the IRI registry return 401
- Authenticated requests with valid credentials return 200
- End-to-end credential rotation through all three phases

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
@rwsu rwsu force-pushed the AGENT-1449-auth-rotation branch from 6f21ddd to 19b6e97 Compare March 23, 2026 18:57
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Mar 23, 2026
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 323-324: The controller currently only enqueues on Secret updates
and ignores Secret create events, so when
ctrlcommon.InternalReleaseImageAuthSecretName (and similarly
ctrlcommon.InternalReleaseImageTLSSecretName) is created after the first sync
the pull-secret merge and htpasswd render are missed; update the Secret
watch/predicate used in SetupWithManager (or the secret event handlers) to also
handle Create events for secrets whose name equals
ctrlcommon.InternalReleaseImageAuthSecretName or
ctrlcommon.InternalReleaseImageTLSSecretName so the controller enqueues a
reconcile on Secret creation as well (mirror the existing update logic used for
those names and apply the same condition used at the if secret.Name != ...
checks).
- Around line 630-678: hasDualEntries only proves the Secret contains both
hashes, which can allow Phase 2 to proceed before the new rendered MachineConfig
is applied to nodes; after GenerateDualHtpasswd writes
authSecret.Data["htpasswd"] you must wait for the MachineConfig rendered for
that exact htpasswd to roll out before updating the global pull secret. Fix by
computing an identifier for the rendered config (for example a hash of
dualHtpasswd or the expected rendered MachineConfig name/annotation produced
from the htpasswd), write any needed marker on the Secret, then change the Phase
2 check (replace the plain areAllPoolsUpdated() call) to wait for MCPs to be
rolled out to that specific rendered config (implement a helper like
areAllPoolsUpdatedForRenderedConfig(expectedRenderedConfigID) that inspects
MachineConfig/MachineConfigPool status) and only update the pull secret when
that specific rollout is observed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e21c4454-88c0-4e04-9a53-36370454a57a

📥 Commits

Reviewing files that changed from the base of the PR and between 6f21ddd and 19b6e97.

📒 Files selected for processing (20)
  • cmd/machine-config-controller/start.go
  • docs/IRIAuthCredentialRotation.md
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/bootstrap/bootstrap.go
  • pkg/controller/common/constants.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_bootstrap_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_controller_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
  • pkg/controller/internalreleaseimage/internalreleaseimage_renderer.go
  • pkg/controller/internalreleaseimage/pullsecret.go
  • pkg/controller/internalreleaseimage/pullsecret_test.go
  • pkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yaml
  • pkg/controller/internalreleaseimage/templates/master/files/usr-local-bin-load-registry-image-sh.yaml
  • pkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yaml
  • pkg/daemon/constants/constants.go
  • pkg/daemon/update.go
  • pkg/daemon/update_test.go
  • test/e2e-iri/iri_test.go
✅ Files skipped from review due to trivial changes (5)
  • pkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yaml
  • pkg/apihelpers/apihelpers.go
  • pkg/controller/common/constants.go
  • pkg/daemon/constants/constants.go
  • test/e2e-iri/iri_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/machine-config-controller/start.go
  • pkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yaml
  • pkg/daemon/update.go

Comment on lines +323 to +324
if secret.Name != ctrlcommon.InternalReleaseImageTLSSecretName &&
secret.Name != ctrlcommon.InternalReleaseImageAuthSecretName {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Watch auth-secret creation on upgrade.

Line 387 explicitly allows internal-release-image-registry-auth to appear later, but this controller still only enqueues on Secret updates. If the secret is created after the first sync, the initial pull-secret merge and htpasswd render are missed until some unrelated object changes. Add a create path for this secret too.

Also applies to: 387-395

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`
around lines 323 - 324, The controller currently only enqueues on Secret updates
and ignores Secret create events, so when
ctrlcommon.InternalReleaseImageAuthSecretName (and similarly
ctrlcommon.InternalReleaseImageTLSSecretName) is created after the first sync
the pull-secret merge and htpasswd render are missed; update the Secret
watch/predicate used in SetupWithManager (or the secret event handlers) to also
handle Create events for secrets whose name equals
ctrlcommon.InternalReleaseImageAuthSecretName or
ctrlcommon.InternalReleaseImageTLSSecretName so the controller enqueues a
reconcile on Secret creation as well (mirror the existing update logic used for
those names and apply the same condition used at the if secret.Name != ...
checks).

Comment on lines +397 to +405
// Handle credential rotation using the desired-vs-current pattern.
// The auth secret holds the desired password; the pull secret holds the current
// active credentials. When they differ, a multi-phase rotation is performed
// to avoid authentication failures during rolling MachineConfig updates.
// See docs/iri-auth-credential-rotation.md for the full design.
if iriAuthSecret != nil {
if err := ctrl.reconcileAuthCredentials(iriAuthSecret, cconfig); err != nil {
return fmt.Errorf("failed to reconcile auth credentials: %w", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Stop rendering auth changes while pools disagree on deployed credentials.

Line 556 returns nil, but Line 409 still renders from iriAuthSecret. During phase 2b, a manual htpasswd edit therefore gets pushed into a new MachineConfig instead of being ignored until convergence, which breaks the safety guarantee described here. Short-circuit this auth path until the pools agree again, or render from the last deployed auth state.

Also applies to: 409-409, 551-556

Comment on lines +630 to +678
// Check whether the htpasswd already has valid dual entries for both the
// deployed credentials and the desired credentials. We use
// bcrypt.CompareHashAndPassword rather than string comparison because bcrypt
// salts differ on every generation. This also handles mid-rotation password
// changes: if the desired password changed since the dual htpasswd was
// written, the new entry's hash won't match and we regenerate.
//
// The deployed credentials (oldUser, oldPass) are always preserved as one of
// the two htpasswd entries. The pull secret (what clients use to authenticate)
// always contains these old credentials. So even if the desired password
// changes mid-rotation, clients always authenticate successfully against
// every htpasswd version.
hasDualEntries := HtpasswdHasValidEntry(currentHtpasswd, deployedUsername, deployedPassword) &&
HtpasswdHasValidEntry(currentHtpasswd, newUsername, desiredPassword)

// Phase 1: deploy dual htpasswd.
if !hasDualEntries {
dualHtpasswd, err := GenerateDualHtpasswd(deployedUsername, deployedPassword, newUsername, desiredPassword)
if err != nil {
return fmt.Errorf("failed to generate dual htpasswd for rotation: %w", err)
}
authSecret.Data["htpasswd"] = []byte(dualHtpasswd)
if _, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Update(
context.TODO(), authSecret, metav1.UpdateOptions{}); err != nil {
return fmt.Errorf("failed to update auth secret with dual htpasswd: %w", err)
}
klog.Infof("Updated auth secret htpasswd with dual credentials (%s + %s) for rotation", deployedUsername, newUsername)
// Return so the MC render on the next sync picks up the new htpasswd
return nil
}

// Phase 2: dual htpasswd is deployed — check if ALL pools have rolled out.
// Only then is it safe to update the pull secret.
allUpdated, err := ctrl.areAllPoolsUpdated()
if err != nil {
return fmt.Errorf("failed to check MCP status: %w", err)
}

if !allUpdated {
klog.V(4).Infof("Waiting for MachineConfigPool rollout to complete before advancing credential rotation")
return nil
}

// All nodes have the dual htpasswd — safe to update the pull secret.
// After the pull secret API object is updated, the template controller will
// re-render and trigger another rollout. On subsequent syncs, the deployed
// pull secret (from the rendered MC) will eventually match the desired password,
// at which point Phase 3 handles cleanup.
klog.Infof("MCP rollout complete, updating pull secret with new credentials (username=%s)", newUsername)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Phase 2 can advance before phase 1 is actually deployed.

hasDualEntries only proves the API Secret contains both hashes. After Line 651 updates data.htpasswd, the Secret/MachineConfig requeue can run again while every pool still looks converged on the previous rendered config, so Lines 663-678 may update the global pull secret before any node accepts the new password. Phase 2 needs a rollout signal tied to the rendered config for this dual htpasswd, not just the Secret contents.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`
around lines 630 - 678, hasDualEntries only proves the Secret contains both
hashes, which can allow Phase 2 to proceed before the new rendered MachineConfig
is applied to nodes; after GenerateDualHtpasswd writes
authSecret.Data["htpasswd"] you must wait for the MachineConfig rendered for
that exact htpasswd to roll out before updating the global pull secret. Fix by
computing an identifier for the rendered config (for example a hash of
dualHtpasswd or the expected rendered MachineConfig name/annotation produced
from the htpasswd), write any needed marker on the Secret, then change the Phase
2 check (replace the plain areAllPoolsUpdated() call) to wait for MCPs to be
rolled out to that specific rendered config (implement a helper like
areAllPoolsUpdatedForRenderedConfig(expectedRenderedConfigID) that inspects
MachineConfig/MachineConfigPool status) and only update the pull secret when
that specific rollout is observed.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 24, 2026

@rwsu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-hypershift 19b6e97 link true /test e2e-hypershift

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rwsu
Copy link
Copy Markdown
Author

rwsu commented Mar 26, 2026

/hold Evaluating a simpler credential rotation procedure that accepts brief registry downtown: #5810

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants