AGENT-1449: Add IRI registry credential rotation support#5766
AGENT-1449: Add IRI registry credential rotation support#5766rwsu wants to merge 4 commits intoopenshift:mainfrom
Conversation
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds IRI registry authentication and a multi‑phase credential rotation flow: introduces an IRI auth secret and htpasswd handling, merges auth into the global pull secret, makes the InternalReleaseImage controller MachineConfigPool-aware, propagates auth into render/bootstrap paths, updates node templates/daemon behavior, and adds unit and e2e tests plus documentation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@rwsu: An error was encountered searching for bug AGENT-1449 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
No response returned: Get "https://issues.redhat.com/rest/api/2/issue/AGENT-1449": GET https://issues.redhat.com/rest/api/2/issue/AGENT-1449 giving up after 5 attempt(s)
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rwsu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/cc @andfasano |
7762c05 to
2e3bd60
Compare
|
/jira refresh |
|
@rwsu: This pull request references AGENT-1449 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
89067e0 to
fa5b7cc
Compare
|
@rwsu: This pull request references AGENT-1449 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
fa5b7cc to
e6a9246
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go (1)
375-376: Handlejson.Marshalerrors in the test builder to avoid silent fixture corruption.Ignoring this error can mask setup failures and produce misleading downstream assertions.
Suggested refactor
- raw, _ := json.Marshal(ignCfg) + raw, err := json.Marshal(ignCfg) + if err != nil { + panic(fmt.Sprintf("failed to marshal ignition config in renderedMCBuilder: %v", err)) + } rb.obj.Spec.Config.Raw = raw🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go` around lines 375 - 376, The test builder currently ignores the error from json.Marshal(ignCfg) before setting rb.obj.Spec.Config.Raw which can hide fixture setup failures; change the code to capture the error (err := json.Marshal(ignCfg)) and handle it instead of discarding it—either return the error from the builder or call the test failure helper (e.g., t.Fatalf) so failures surface, and then only assign rb.obj.Spec.Config.Raw = raw when err == nil.docs/IRIAuthCredentialRotation.md (1)
87-116: Add language specifier to the fenced code block.The state diagram uses a fenced code block without a language identifier. Consider adding a language hint (e.g.,
textorplaintext) to satisfy linting rules and improve rendering consistency.-``` +```text ┌────────────────────────┐🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/IRIAuthCredentialRotation.md` around lines 87 - 116, The fenced ASCII state-diagram block lacks a language tag; update the opening fence for that diagram (the block starting with the box-drawing diagram "┌────────────────────────┐" / "No IRI entry in pull secret") to include a plain text language hint (for example change "```" to "```text" or "```plaintext") so linting and rendering recognize it as preformatted text.test/e2e-iri/iri_test.go (1)
205-212: Consider settingMinVersionin the IRI test client for consistency with production TLS hardening practices.
getIRIRegistryClientrelies on Go's default TLS protocol negotiation. While not critical for test code, explicitly settingMinVersion: tls.VersionTLS13would align with production Kubernetes patterns and ensure stable protocol selection across runtime versions.Suggested change
return &http.Client{ Transport: &http.Transport{ TLSClientConfig: &tls.Config{ + MinVersion: tls.VersionTLS13, RootCAs: roots, }, }, Timeout: 30 * time.Second, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-iri/iri_test.go` around lines 205 - 212, The TLS config returned by getIRIRegistryClient in iri_test.go does not set a minimum TLS version; update the TLSConfig used in the http.Client (the TLSClientConfig inside the Transport returned by getIRIRegistryClient) to include MinVersion: tls.VersionTLS13 so the test client explicitly enforces TLS 1.3 for consistency with production hardening and stable protocol selection across Go runtimes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 383-394: The Secret returned by ctrl.secretLister (iriAuthSecret)
is being mutated later (authSecret.Data["htpasswd"] in
reconcileAuthCredentials), which corrupts the informer cache; make a deep copy
of the Secret before any mutation. Either call iriAuthSecret =
iriAuthSecret.DeepCopy() immediately after retrieving it (before passing into
reconcileAuthCredentials) or perform the DeepCopy at the start of
reconcileAuthCredentials and then mutate the copy. Ensure all mutations to
authSecret.Data (the htpasswd updates at the locations referencing lines ~590
and ~635) operate on the DeepCopy, not the lister-provided object.
- Around line 303-307: The current gating only compares
UpdatedMachineCount/ReadyMachineCount; update both updateMachineConfigPool() and
areAllPoolsUpdated() to also consider configuration-name convergence: in
updateMachineConfigPool(), besides comparing
oldPool.Status.UpdatedMachineCount/ReadyMachineCount, also enqueue when
oldPool.Spec.Configuration.Name != curPool.Spec.Configuration.Name or
oldPool.Status.Configuration.Name != curPool.Status.Configuration.Name so
config-name changes trigger a sync; in areAllPoolsUpdated(), in addition to
checking UpdatedMachineCount and ReadyMachineCount equality, require that for
each pool curPool.Spec.Configuration.Name == curPool.Status.Configuration.Name
(and matches the controller’s desired config name if applicable) before treating
the pool as “updated” to prevent advancing credential rotation while nodes are
still converging on the new configuration.
---
Nitpick comments:
In `@docs/IRIAuthCredentialRotation.md`:
- Around line 87-116: The fenced ASCII state-diagram block lacks a language tag;
update the opening fence for that diagram (the block starting with the
box-drawing diagram "┌────────────────────────┐" / "No IRI entry in pull
secret") to include a plain text language hint (for example change "```" to
"```text" or "```plaintext") so linting and rendering recognize it as
preformatted text.
In `@pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go`:
- Around line 375-376: The test builder currently ignores the error from
json.Marshal(ignCfg) before setting rb.obj.Spec.Config.Raw which can hide
fixture setup failures; change the code to capture the error (err :=
json.Marshal(ignCfg)) and handle it instead of discarding it—either return the
error from the builder or call the test failure helper (e.g., t.Fatalf) so
failures surface, and then only assign rb.obj.Spec.Config.Raw = raw when err ==
nil.
In `@test/e2e-iri/iri_test.go`:
- Around line 205-212: The TLS config returned by getIRIRegistryClient in
iri_test.go does not set a minimum TLS version; update the TLSConfig used in the
http.Client (the TLSClientConfig inside the Transport returned by
getIRIRegistryClient) to include MinVersion: tls.VersionTLS13 so the test client
explicitly enforces TLS 1.3 for consistency with production hardening and stable
protocol selection across Go runtimes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: af005b0e-1fe3-4910-81e2-b8af5b4ddbf7
⛔ Files ignored due to path filters (3)
vendor/golang.org/x/crypto/bcrypt/base64.gois excluded by!vendor/**,!**/vendor/**vendor/golang.org/x/crypto/bcrypt/bcrypt.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (20)
cmd/machine-config-controller/start.godocs/IRIAuthCredentialRotation.mdpkg/apihelpers/apihelpers.gopkg/controller/bootstrap/bootstrap.gopkg/controller/common/constants.gopkg/controller/internalreleaseimage/internalreleaseimage_bootstrap.gopkg/controller/internalreleaseimage/internalreleaseimage_bootstrap_test.gopkg/controller/internalreleaseimage/internalreleaseimage_controller.gopkg/controller/internalreleaseimage/internalreleaseimage_controller_test.gopkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.gopkg/controller/internalreleaseimage/internalreleaseimage_renderer.gopkg/controller/internalreleaseimage/pullsecret.gopkg/controller/internalreleaseimage/pullsecret_test.gopkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yamlpkg/controller/internalreleaseimage/templates/master/files/usr-local-bin-load-registry-image-sh.yamlpkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yamlpkg/daemon/constants/constants.gopkg/daemon/update.gopkg/daemon/update_test.gotest/e2e-iri/iri_test.go
pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
Outdated
Show resolved
Hide resolved
pkg/controller/internalreleaseimage/internalreleaseimage_controller.go
Outdated
Show resolved
Hide resolved
e6a9246 to
7a98820
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/controller/internalreleaseimage/internalreleaseimage_controller.go (1)
685-688: Consider addingretry.RetryOnConflictfor pull secret update.Other secret updates in this file (e.g., status updates at line 480) use retry logic. While the controller's
handleErrprovides eventual retry, an inline retry would converge faster on conflicts.♻️ Optional: Add retry wrapper
- pullSecret.Data[corev1.DockerConfigJsonKey] = mergedBytes - if _, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update( - context.TODO(), pullSecret, metav1.UpdateOptions{}); err != nil { - return fmt.Errorf("failed to update pull secret with rotated credentials: %w", err) - } + if err := retry.RetryOnConflict(updateBackoff, func() error { + pullSecret, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get( + context.TODO(), ctrlcommon.GlobalPullSecretName, metav1.GetOptions{}) + if err != nil { + return err + } + mergedBytes, err := MergeIRIAuthIntoPullSecretWithUsername( + pullSecret.Data[corev1.DockerConfigJsonKey], newUsername, desiredPassword, baseDomain) + if err != nil { + return err + } + pullSecret.Data[corev1.DockerConfigJsonKey] = mergedBytes + _, err = ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update( + context.TODO(), pullSecret, metav1.UpdateOptions{}) + return err + }); err != nil { + return fmt.Errorf("failed to update pull secret with rotated credentials: %w", err) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go` around lines 685 - 688, Replace the direct Secrets().Update call with a retry.RetryOnConflict loop so we fetch the latest secret, apply the rotated credentials to the fetched object, and attempt Update until success or non-conflict error; specifically wrap the operation that currently calls ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(), pullSecret, metav1.UpdateOptions{}) in retry.RetryOnConflict(retry.DefaultRetry, func() error { s, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get(context.TODO(), pullSecret.Name, metav1.GetOptions{}); if err != nil { return err }; modify s with the rotated credentials from pullSecret; _, err = ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(), s, metav1.UpdateOptions{}); return err }); ensure errors are returned unchanged from the wrapper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/internalreleaseimage/pullsecret.go`:
- Around line 173-184: GenerateDualHtpasswd must reject identical usernames: add
an explicit check in GenerateDualHtpasswd comparing currentUsername and
newUsername and return a descriptive error (e.g., via fmt.Errorf) if they are
equal, before calling GenerateHtpasswdEntry; keep existing error wrapping for
GenerateHtpasswdEntry calls and return the concatenated entries only when
usernames differ.
In `@test/e2e-iri/iri_test.go`:
- Around line 296-405: The test mutates the cluster auth secret but never
restores it; add a t.Cleanup closure immediately after reading originalPassword
that updates authSecret.Data["password"] back to originalPassword via
cs.Secrets(...).Update, then poll/wait until the pull secret returns the
original credentials (use iri.ExtractIRICredentialsFromPullSecret) and wait for
MCP rollout completion with helpers.WaitForPoolCompleteAny for "master" and
"worker" so the cluster is returned to its original state even on test failure.
---
Nitpick comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 685-688: Replace the direct Secrets().Update call with a
retry.RetryOnConflict loop so we fetch the latest secret, apply the rotated
credentials to the fetched object, and attempt Update until success or
non-conflict error; specifically wrap the operation that currently calls
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
pullSecret, metav1.UpdateOptions{}) in retry.RetryOnConflict(retry.DefaultRetry,
func() error { s, err :=
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Get(context.TODO(),
pullSecret.Name, metav1.GetOptions{}); if err != nil { return err }; modify s
with the rotated credentials from pullSecret; _, err =
ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.OpenshiftConfigNamespace).Update(context.TODO(),
s, metav1.UpdateOptions{}); return err }); ensure errors are returned unchanged
from the wrapper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1eb2ba4c-3936-4b9c-a498-4d8c164858f1
⛔ Files ignored due to path filters (3)
vendor/golang.org/x/crypto/bcrypt/base64.gois excluded by!vendor/**,!**/vendor/**vendor/golang.org/x/crypto/bcrypt/bcrypt.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (12)
cmd/machine-config-controller/start.godocs/IRIAuthCredentialRotation.mdpkg/apihelpers/apihelpers.gopkg/controller/internalreleaseimage/internalreleaseimage_controller.gopkg/controller/internalreleaseimage/internalreleaseimage_controller_test.gopkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.gopkg/controller/internalreleaseimage/pullsecret.gopkg/controller/internalreleaseimage/pullsecret_test.gopkg/daemon/constants/constants.gopkg/daemon/update.gopkg/daemon/update_test.gotest/e2e-iri/iri_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/daemon/update_test.go
- pkg/controller/internalreleaseimage/pullsecret_test.go
- pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
|
@rwsu: This pull request references AGENT-1449 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@rwsu: This pull request references AGENT-1449 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
7a98820 to
6df639c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 387-392: The secret retrieval currently ignores lister errors by
discarding the error from ctrl.secretLister.Secrets(...).Get; change the
iriAuthSecret, _ := ...Get(...) call to capture the error, check it with
apierrors.IsNotFound(err) and only treat NotFound as acceptable, but for any
other error return/requeue the error from the reconcile so the controller
doesn't proceed silently; keep the existing iriAuthSecret.DeepCopy() logic when
iriAuthSecret != nil. Ensure you reference
ctrl.secretLister.Secrets(ctrlcommon.MCONamespace).Get, the iriAuthSecret
variable, and ctrlcommon.InternalReleaseImageAuthSecretName when locating the
change.
In `@test/e2e-iri/iri_test.go`:
- Around line 343-350: The cleanup restore is merging the IRI auth using the
stale originalPullSecretRaw and originalPassword which can overwrite concurrent
updates and restore the wrong password; instead after fetching the current
secret into ps (from cs.Secrets(...).Get(...)) use the current secret data
(ps.Data[corev1.DockerConfigJsonKey]) as the base input to
iri.MergeIRIAuthIntoPullSecretWithUsername and pass the originally-observed
password variable originalPullSecretPassword (not originalPassword) so you merge
on the latest secret state and restore the exact original observed password;
update the call site where MergeIRIAuthIntoPullSecretWithUsername is invoked to
use those variables (and keep existing error handling).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5882d527-dc41-499a-9719-d10640ec938d
⛔ Files ignored due to path filters (3)
vendor/golang.org/x/crypto/bcrypt/base64.gois excluded by!vendor/**,!**/vendor/**vendor/golang.org/x/crypto/bcrypt/bcrypt.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (12)
cmd/machine-config-controller/start.godocs/IRIAuthCredentialRotation.mdpkg/apihelpers/apihelpers.gopkg/controller/internalreleaseimage/internalreleaseimage_controller.gopkg/controller/internalreleaseimage/internalreleaseimage_controller_test.gopkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.gopkg/controller/internalreleaseimage/pullsecret.gopkg/controller/internalreleaseimage/pullsecret_test.gopkg/daemon/constants/constants.gopkg/daemon/update.gopkg/daemon/update_test.gotest/e2e-iri/iri_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/controller/internalreleaseimage/pullsecret.go
- pkg/daemon/update_test.go
6df639c to
6f21ddd
Compare
|
/verified by TestIRIAuth_CredentialRotation e2e test and @rwsu |
|
@rwsu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Add htpasswd-based authentication to the IRI registry. The installer generates credentials and provides them via a bootstrap secret. The MCO mounts the htpasswd file into the registry container and configures registry auth environment variables. The registry password is merged into the node pull secret so kubelet can authenticate when pulling the release image. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
The IRI controller merges registry auth credentials into the global pull secret after bootstrap. This triggers the template controller to re-render template MCs (00-master, etc.) with the updated pull secret, producing a different rendered MC hash than what bootstrap created. The mismatch causes the MCD DaemonSet pod to fail during bootstrap: it reads the bootstrap-rendered MC name from the node annotation, but that MC no longer exists in-cluster (replaced by the re-rendered one). The MCD falls back to reading /etc/machine-config-daemon/currentconfig, which was never written because the firstboot MCD detected "no changes" and skipped it. Both master nodes go Degraded and never recover. Fix by merging IRI auth into the pull secret during bootstrap before template MC rendering, so both bootstrap and in-cluster produce identical rendered MC hashes. Extract the pull secret merge logic into a shared MergeIRIAuthIntoPullSecret function used by both the bootstrap path and the in-cluster IRI controller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed: 1. Deploy dual htpasswd (old + new credentials with different usernames) 2. Update pull secret after all MCPs finish rolling out 3. Clean up dual htpasswd to single entry after new pull secret is deployed Key changes: - Add MachineConfigPool event handler to IRI controller so that rotation phases advance automatically when pool rollouts complete - Add reconcileAuthCredentials as the main dispatcher, with reconcileCredentialRotation (Phase 1/2) and reconcileCredentialsInSync (Phase 3) as extracted helpers - Add getDeployedIRICredentials (reads from rendered MC, not API) - Add areAllPoolsUpdated (checks all pools including workers) - Add HtpasswdHasValidEntry, GenerateHtpasswdEntry, GenerateDualHtpasswd, NextIRIUsername, ExtractIRICredentialsFromPullSecret helpers - Vendor golang.org/x/crypto/bcrypt for htpasswd hash generation - Add credential rotation design doc - Add IRI htpasswd file to no-reboot list (NoneStatusAction) since the Distribution registry re-reads htpasswd on modification time change Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
…ial rotation Add e2e tests that verify: - Unauthenticated requests to the IRI registry return 401 - Authenticated requests with valid credentials return 200 - End-to-end credential rotation through all three phases Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
6f21ddd to
19b6e97
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`:
- Around line 323-324: The controller currently only enqueues on Secret updates
and ignores Secret create events, so when
ctrlcommon.InternalReleaseImageAuthSecretName (and similarly
ctrlcommon.InternalReleaseImageTLSSecretName) is created after the first sync
the pull-secret merge and htpasswd render are missed; update the Secret
watch/predicate used in SetupWithManager (or the secret event handlers) to also
handle Create events for secrets whose name equals
ctrlcommon.InternalReleaseImageAuthSecretName or
ctrlcommon.InternalReleaseImageTLSSecretName so the controller enqueues a
reconcile on Secret creation as well (mirror the existing update logic used for
those names and apply the same condition used at the if secret.Name != ...
checks).
- Around line 630-678: hasDualEntries only proves the Secret contains both
hashes, which can allow Phase 2 to proceed before the new rendered MachineConfig
is applied to nodes; after GenerateDualHtpasswd writes
authSecret.Data["htpasswd"] you must wait for the MachineConfig rendered for
that exact htpasswd to roll out before updating the global pull secret. Fix by
computing an identifier for the rendered config (for example a hash of
dualHtpasswd or the expected rendered MachineConfig name/annotation produced
from the htpasswd), write any needed marker on the Secret, then change the Phase
2 check (replace the plain areAllPoolsUpdated() call) to wait for MCPs to be
rolled out to that specific rendered config (implement a helper like
areAllPoolsUpdatedForRenderedConfig(expectedRenderedConfigID) that inspects
MachineConfig/MachineConfigPool status) and only update the pull secret when
that specific rollout is observed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e21c4454-88c0-4e04-9a53-36370454a57a
📒 Files selected for processing (20)
cmd/machine-config-controller/start.godocs/IRIAuthCredentialRotation.mdpkg/apihelpers/apihelpers.gopkg/controller/bootstrap/bootstrap.gopkg/controller/common/constants.gopkg/controller/internalreleaseimage/internalreleaseimage_bootstrap.gopkg/controller/internalreleaseimage/internalreleaseimage_bootstrap_test.gopkg/controller/internalreleaseimage/internalreleaseimage_controller.gopkg/controller/internalreleaseimage/internalreleaseimage_controller_test.gopkg/controller/internalreleaseimage/internalreleaseimage_helpers_test.gopkg/controller/internalreleaseimage/internalreleaseimage_renderer.gopkg/controller/internalreleaseimage/pullsecret.gopkg/controller/internalreleaseimage/pullsecret_test.gopkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yamlpkg/controller/internalreleaseimage/templates/master/files/usr-local-bin-load-registry-image-sh.yamlpkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yamlpkg/daemon/constants/constants.gopkg/daemon/update.gopkg/daemon/update_test.gotest/e2e-iri/iri_test.go
✅ Files skipped from review due to trivial changes (5)
- pkg/controller/internalreleaseimage/templates/master/files/iri-registry-auth-htpasswd.yaml
- pkg/apihelpers/apihelpers.go
- pkg/controller/common/constants.go
- pkg/daemon/constants/constants.go
- test/e2e-iri/iri_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/machine-config-controller/start.go
- pkg/controller/internalreleaseimage/templates/master/units/iri-registry.service.yaml
- pkg/daemon/update.go
| if secret.Name != ctrlcommon.InternalReleaseImageTLSSecretName && | ||
| secret.Name != ctrlcommon.InternalReleaseImageAuthSecretName { |
There was a problem hiding this comment.
Watch auth-secret creation on upgrade.
Line 387 explicitly allows internal-release-image-registry-auth to appear later, but this controller still only enqueues on Secret updates. If the secret is created after the first sync, the initial pull-secret merge and htpasswd render are missed until some unrelated object changes. Add a create path for this secret too.
Also applies to: 387-395
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`
around lines 323 - 324, The controller currently only enqueues on Secret updates
and ignores Secret create events, so when
ctrlcommon.InternalReleaseImageAuthSecretName (and similarly
ctrlcommon.InternalReleaseImageTLSSecretName) is created after the first sync
the pull-secret merge and htpasswd render are missed; update the Secret
watch/predicate used in SetupWithManager (or the secret event handlers) to also
handle Create events for secrets whose name equals
ctrlcommon.InternalReleaseImageAuthSecretName or
ctrlcommon.InternalReleaseImageTLSSecretName so the controller enqueues a
reconcile on Secret creation as well (mirror the existing update logic used for
those names and apply the same condition used at the if secret.Name != ...
checks).
| // Handle credential rotation using the desired-vs-current pattern. | ||
| // The auth secret holds the desired password; the pull secret holds the current | ||
| // active credentials. When they differ, a multi-phase rotation is performed | ||
| // to avoid authentication failures during rolling MachineConfig updates. | ||
| // See docs/iri-auth-credential-rotation.md for the full design. | ||
| if iriAuthSecret != nil { | ||
| if err := ctrl.reconcileAuthCredentials(iriAuthSecret, cconfig); err != nil { | ||
| return fmt.Errorf("failed to reconcile auth credentials: %w", err) | ||
| } |
There was a problem hiding this comment.
Stop rendering auth changes while pools disagree on deployed credentials.
Line 556 returns nil, but Line 409 still renders from iriAuthSecret. During phase 2b, a manual htpasswd edit therefore gets pushed into a new MachineConfig instead of being ignored until convergence, which breaks the safety guarantee described here. Short-circuit this auth path until the pools agree again, or render from the last deployed auth state.
Also applies to: 409-409, 551-556
| // Check whether the htpasswd already has valid dual entries for both the | ||
| // deployed credentials and the desired credentials. We use | ||
| // bcrypt.CompareHashAndPassword rather than string comparison because bcrypt | ||
| // salts differ on every generation. This also handles mid-rotation password | ||
| // changes: if the desired password changed since the dual htpasswd was | ||
| // written, the new entry's hash won't match and we regenerate. | ||
| // | ||
| // The deployed credentials (oldUser, oldPass) are always preserved as one of | ||
| // the two htpasswd entries. The pull secret (what clients use to authenticate) | ||
| // always contains these old credentials. So even if the desired password | ||
| // changes mid-rotation, clients always authenticate successfully against | ||
| // every htpasswd version. | ||
| hasDualEntries := HtpasswdHasValidEntry(currentHtpasswd, deployedUsername, deployedPassword) && | ||
| HtpasswdHasValidEntry(currentHtpasswd, newUsername, desiredPassword) | ||
|
|
||
| // Phase 1: deploy dual htpasswd. | ||
| if !hasDualEntries { | ||
| dualHtpasswd, err := GenerateDualHtpasswd(deployedUsername, deployedPassword, newUsername, desiredPassword) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to generate dual htpasswd for rotation: %w", err) | ||
| } | ||
| authSecret.Data["htpasswd"] = []byte(dualHtpasswd) | ||
| if _, err := ctrl.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Update( | ||
| context.TODO(), authSecret, metav1.UpdateOptions{}); err != nil { | ||
| return fmt.Errorf("failed to update auth secret with dual htpasswd: %w", err) | ||
| } | ||
| klog.Infof("Updated auth secret htpasswd with dual credentials (%s + %s) for rotation", deployedUsername, newUsername) | ||
| // Return so the MC render on the next sync picks up the new htpasswd | ||
| return nil | ||
| } | ||
|
|
||
| // Phase 2: dual htpasswd is deployed — check if ALL pools have rolled out. | ||
| // Only then is it safe to update the pull secret. | ||
| allUpdated, err := ctrl.areAllPoolsUpdated() | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check MCP status: %w", err) | ||
| } | ||
|
|
||
| if !allUpdated { | ||
| klog.V(4).Infof("Waiting for MachineConfigPool rollout to complete before advancing credential rotation") | ||
| return nil | ||
| } | ||
|
|
||
| // All nodes have the dual htpasswd — safe to update the pull secret. | ||
| // After the pull secret API object is updated, the template controller will | ||
| // re-render and trigger another rollout. On subsequent syncs, the deployed | ||
| // pull secret (from the rendered MC) will eventually match the desired password, | ||
| // at which point Phase 3 handles cleanup. | ||
| klog.Infof("MCP rollout complete, updating pull secret with new credentials (username=%s)", newUsername) |
There was a problem hiding this comment.
Phase 2 can advance before phase 1 is actually deployed.
hasDualEntries only proves the API Secret contains both hashes. After Line 651 updates data.htpasswd, the Secret/MachineConfig requeue can run again while every pool still looks converged on the previous rendered config, so Lines 663-678 may update the global pull secret before any node accepts the new password. Phase 2 needs a rollout signal tied to the rendered config for this dual htpasswd, not just the Secret contents.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/internalreleaseimage/internalreleaseimage_controller.go`
around lines 630 - 678, hasDualEntries only proves the Secret contains both
hashes, which can allow Phase 2 to proceed before the new rendered MachineConfig
is applied to nodes; after GenerateDualHtpasswd writes
authSecret.Data["htpasswd"] you must wait for the MachineConfig rendered for
that exact htpasswd to roll out before updating the global pull secret. Fix by
computing an identifier for the rendered config (for example a hash of
dualHtpasswd or the expected rendered MachineConfig name/annotation produced
from the htpasswd), write any needed marker on the Secret, then change the Phase
2 check (replace the plain areAllPoolsUpdated() call) to wait for MCPs to be
rolled out to that specific rendered config (implement a helper like
areAllPoolsUpdatedForRenderedConfig(expectedRenderedConfigID) that inspects
MachineConfig/MachineConfigPool status) and only update the pull secret when
that specific rollout is observed.
|
@rwsu: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold Evaluating a simpler credential rotation procedure that accepts brief registry downtown: #5810 |
- What I did
Implement safe credential rotation for the IRI registry using a desired-vs-current pattern with generation-numbered usernames. The auth secret holds the desired password; the pull secret (read from rendered MachineConfig) holds the deployed password. When they differ, a three-phase rotation is performed:
This avoids authentication deadlocks during rolling MachineConfig updates because the pull secret always contains the old credentials, which are present in every version of the htpasswd. Mid-rotation password changes are handled by verifying htpasswd hashes with bcrypt.CompareHashAndPassword and regenerating if they don't match.
Key changes:
- How to verify it
Update the password to trigger the rotation to start:
Verify the /etc/iri-registry/auth/htpasswd has been updated.
Verify iri-registry works for both new and old credentials during rollout.
Verify global pull-secret contains the new credentials after rollout is complete.
- Description for the changelog
Add credential rotation support for the IRI registry. When the auth secret's password field is updated, the controller performs a three-phase rotation: (1) deploys a dual htpasswd with both old and new credentials so all nodes accept both passwords during rollout, (2) updates the global pull secret with the new credentials after all MachineConfigPools are fully updated, and (3) cleans up the dual htpasswd to a single entry once the new credentials are deployed everywhere. This avoids authentication deadlocks caused by api-int load-balancing requests across master nodes that may be at different stages of the rollout.
Also adds e2e tests for registry authentication (401 on unauthenticated requests, 200 with valid credentials) and an end-to-end credential rotation test that exercises all three phases.
Depends on #5765.