
WIP: OCPBUGS-62517 ha replicas#709

Open
tmshort wants to merge 3 commits into openshift:main from tmshort:ocpbugs-62517-ha-replicas

Conversation

@tmshort
Contributor

@tmshort tmshort commented Apr 24, 2026

Summary by CodeRabbit

  • New Features

    • Added Pod Disruption Budgets for catalogd and operator-controller to keep at least one pod available.
  • Chores

    • Increased replica counts from 1 to 2 for catalogd-controller-manager and operator-controller-controller-manager.
    • Updated Helm chart values to reflect new replica settings.
    • Extended E2E timeouts (e2e/test-experimental-e2e).
  • Tests

    • Added e2e HA test and test steps to validate leader failover for catalogd; feature gate auto-enabled when >1 node.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Walkthrough

Replaces controller-runtime's leader-gated catalog HTTP server with a custom runnable that serves on every replica and adds a readiness probe; adds graceful shutdown handling. Separately, scales catalogd and operator-controller Deployments to 2 replicas, adds PodDisruptionBudgets, and extends e2e HA tests and related test hooks.

Changes

Cohort / File(s) Summary
Catalog Server Runnable
internal/catalogd/serverutil/serverutil.go
Replaces manager.Server usage with an unexported catalogServerRunnable (no leader election), registers a readiness check tied to a ready channel, closes ready before Serve, spawns a shutdown goroutine that calls server.Shutdown with a 30s timeout, and ignores http.ErrServerClosed. Parameter renamed from tlsFileWatcher to cw.
Deployment Manifests (catalogd)
openshift/catalogd/manifests.yaml, openshift/catalogd/manifests-experimental.yaml, manifests/experimental.yaml, manifests/experimental-e2e.yaml
Increases catalogd-controller-manager replicas from 1 → 2 across manifests and adds a PodDisruptionBudget requiring minAvailable: 1.
Deployment Manifests (operator-controller)
openshift/operator-controller/manifests.yaml, openshift/operator-controller/manifests-experimental.yaml
Increases operator-controller-controller-manager replicas from 1 → 2 and adds a PodDisruptionBudget with minAvailable: 1.
Helm values updated
openshift/helm/catalogd.yaml, openshift/helm/operator-controller.yaml, helm/experimental.yaml
Removes explicit PDB-disable flags and sets replicas: 2 for catalogd and operator-controller in Helm values/experimental chart.
Cache client behavior
internal/operator-controller/catalogmetadata/client/client.go, internal/operator-controller/catalogmetadata/client/client_test.go
PopulateCache no longer caches non-200 HTTP responses (returns error without updating cache). Tests adjusted to stop asserting cached Put for unexpected status path.
E2E tests — HA feature
test/e2e/features/ha.feature, test/e2e/steps/ha_steps.go, test/e2e/steps/hooks.go, test/e2e/steps/steps.go
Adds HA e2e feature and step helpers to force-delete catalogd leader and wait for new leader election; adds catalogdHAFeature feature gate and enables it automatically when cluster has ≥2 nodes; registers new Gherkin steps.
Makefile timeouts
Makefile
Increases E2E timeouts: default E2E_TIMEOUT 15m → 20m, test-experimental-e2e override 20m → 25m.

Sequence Diagram(s)

sequenceDiagram
    participant Mgr as Manager
    participant Runnable as catalogServerRunnable
    participant CW as CertWatcher
    participant HTTP as http.Server
    participant Ready as ReadinessProbe
    participant Shutdown as Shutdown Goroutine
    participant Ctx as Manager Context

    Mgr->>Runnable: AddRunnable(mgr)
    Mgr->>Ready: AddReadyzCheck("catalog-server")
    Mgr->>Runnable: Start(ctx)
    activate Runnable

    Runnable->>Ready: close(readyChan)
    Note over Ready: readiness check now passes

    Runnable->>CW: load TLS certs
    CW-->>HTTP: provide TLS config
    Runnable->>HTTP: Serve(listenAddr)
    activate HTTP

    Ctx->>Shutdown: cancelled (when manager stops)
    Shutdown->>HTTP: Shutdown(ctx, 30s)
    HTTP-->>Shutdown: returns (ok or ErrServerClosed ignored)
    Shutdown-->>Runnable: goroutine exits

    deactivate HTTP
    deactivate Runnable

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 42.86%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Test Structure And Quality — ⚠️ Warning. The HA test code has quality issues: missing assertion messages in NewCatalogdLeaderIsElected, a stale leader pod name in CatalogdLeaderPodIsForceDeleted, and no pre-deletion Serving=True assertion. Resolution: add assertion messages to require.Eventually calls, refresh the leader identity by querying the lease directly, and add a pre-deletion Serving assertion in the feature file.
  • Topology-Aware Scheduling Compatibility — ⚠️ Warning. The PR increases replica counts to 2, but the HA feature-gate detection in hooks.go queries total nodes rather than control-plane nodes, enabling HA tests on mixed-topology clusters where they shouldn't run; the PVC also uses ReadWriteOnce access mode with 2-replica deployments, leaving a pod Pending. Resolution: filter for control-plane nodes using the node-role.kubernetes.io/control-plane label, and change the e2e-coverage PVC to ReadWriteMany or use separate volumes per pod.
  • Title check — ❓ Inconclusive. The PR title 'WIP: OCPBUGS-62517 ha replicas' is vague about the actual changes. Resolution: consider a more descriptive title, such as 'Enable high availability with 2 replicas and implement leader election for catalogd'.
✅ Passed checks (8 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Linked Issues check — ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed. Test names in feature files and step definitions use only static descriptive text, with no dynamic variables, timestamps, or runtime-generated data.
  • Microshift Test Compatibility — ✅ Passed. The new HA e2e test is properly kept from running on MicroShift through the feature-gate mechanism and @CatalogdHA tag.
  • Single Node Openshift (Sno) Test Compatibility — ✅ Passed. The new catalogd HA test is properly kept from running on single-node OpenShift deployments via the @CatalogdHA tag and feature-gate check.
  • Ote Binary Stdout Contract — ✅ Passed. The e2e test changes use textlogger from k8s.io/klog/v2 for all logging, writing to stderr by default with no stdout corruption.
  • Ipv6 And Disconnected Network Test Compatibility — ✅ Passed. The new HA e2e tests operate entirely on cluster-internal Kubernetes resources, with no IPv4 assumptions, hardcoded IPs, IPv4-specific parsing, or external connectivity requirements.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from pedjak and perdasilva April 24, 2026 21:07
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/catalogd/serverutil/serverutil.go`:
- Around line 67-68: The comment for catalogServerRunnable is stale: it states
"leader-only Runnable" while NeedLeaderElection() returns false (so it runs on
all pods); update the comment above catalogServerRunnable to remove or correct
the "leader-only" wording and accurately describe that the runnable runs on all
pods and lazily binds the catalog HTTP port in Start(), ensuring the comment
matches the behavior of NeedLeaderElection() and Start().
- Around line 78-87: PopulateCache in
internal/operator-controller/catalogmetadata/client/client.go is caching non-200
responses (notably 404s from non-leader pods) which causes permanent failures
during rollouts; update PopulateCache to not cache error responses and to
implement retries for transient/non-leader 404s (use a small retry loop with
exponential backoff and/or attempt alternate endpoints), only writing to the
local cache when a 2xx response is received; keep existing behavior that
Start()/NeedLeaderElection() returns false but ensure PopulateCache retries
across requests before giving up and then surface a final error without
persisting the 404 into the cache.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: def80a5a-2570-4718-a4c6-821b6a09c784

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9eb03 and cf8b48a.

📒 Files selected for processing (7)
  • internal/catalogd/serverutil/serverutil.go
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml

Comment thread internal/catalogd/serverutil/serverutil.go Outdated
Comment thread internal/catalogd/serverutil/serverutil.go
@tmshort
Contributor Author

tmshort commented Apr 27, 2026

/test openshift-e2e-aws

@tmshort
Contributor Author

tmshort commented Apr 27, 2026

/test e2e-aws-upgrade-ovn-single-node

tmshort and others added 3 commits April 27, 2026 14:43
…tionBudget

With replicas=1, rolling updates cause Available=False on the olm
ClusterOperator because there is a window where no pod is ready.
Setting replicas=2 ensures at least one pod remains available throughout
the rolling update (maxUnavailable=0, maxSurge=1), keeping the
ClusterOperator Available=True.

Re-enable the PodDisruptionBudget (minAvailable=1) which was previously
disabled because it would deadlock node drains with a single replica.
With 2 replicas, the PDB correctly protects against voluntary disruptions
while still allowing node drains.

This replaces the previous <carry> that set podDisruptionBudget.enabled=false.

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
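The rollout and PDB settings the commit message describes can be sketched as manifest fragments. These are abridged and illustrative (metadata names, labels, and the selector are assumptions; the Deployment's selector/template are omitted), not copies of the actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalogd-controller-manager
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a ready pod away before its replacement is ready
      maxSurge: 1         # roll by briefly running a third pod
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalogd-controller-manager
spec:
  minAvailable: 1         # voluntary disruptions (e.g. node drains) may evict at most one pod
  selector:
    matchLabels:
      control-plane: catalogd-controller-manager
```

With replicas=1, the same minAvailable: 1 PDB would block node drains entirely, which is why it was previously disabled; at replicas=2 a drain can always evict one pod while the other stays available.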
…readiness check

The catalog HTTP server has OnlyServeWhenLeader: true, so only the leader
pod should serve catalog content. Previously, net.Listen was called eagerly
at startup for all pods: the listen socket was bound on non-leaders even
though http.Serve was never called, causing TCP connections to queue without
being served. With replicas > 1 this made ~50% of catalog content requests
fail silently.

Replace manager.Server with a custom Runnable (catalogServerRunnable) in
serverutil that:
- Binds the catalog port lazily inside Start(), which is only called on the
  leader by controller-runtime's leader election machinery.
- Closes a ready channel once the listener is established, and registers a
  channel-select readiness check via AddReadyzCheck so non-leader pods fail
  the /readyz probe and are excluded from Service endpoints.

This keeps cmd/catalogd/main.go health/readiness setup identical to
cmd/operator-controller/main.go (healthz.Ping for both liveness and
readiness); the catalog-server readiness check is an implementation detail
of serverutil.AddCatalogServerToManager.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…roller with 2 replicas

The experimental e2e suite uses a 2-node kind cluster, making it a natural
fit to validate HA behaviour.  Set replicas=2 for both components in
helm/experimental.yaml so the experimental and experimental-e2e manifests
exercise the multi-replica path end-to-end.

This is safe for operator-controller (no leader-only HTTP servers) and for
catalogd now that the catalog server starts on all pods via
NeedLeaderElection=false, preventing the rolling-update deadlock that would
arise if the server were leader-only.

Also adds a @CatalogdHA experimental e2e scenario that force-deletes the
catalogd leader pod and verifies that a new leader is elected and the catalog
resumes serving.  The scenario is gated on a 2-node cluster (detected in
BeforeSuite and reflected in the featureGates map), so it is automatically
skipped in the standard 1-node e2e suite.  The experimental e2e timeout is
bumped from 20m to 25m to accommodate leader re-election time (~163s worst
case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
@tmshort tmshort force-pushed the ocpbugs-62517-ha-replicas branch from cf8b48a to dfec6bd Compare April 27, 2026 18:47

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
manifests/experimental.yaml (1)

2544-2544: Replica count alone does not guarantee node-level HA.

With only replicas: 2, both pods can still land on the same node. Consider adding pod anti-affinity or topology spread constraints so this HA change is resilient to single-node loss.

Also applies to: 2682-2682
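The spreading the nitpick asks for could look like the following pod-template fragment. This is an illustrative sketch (the label values are assumptions), showing topologySpreadConstraints with "preferred" semantics equivalent to the suggested preferredDuringSchedulingIgnoredDuringExecution anti-affinity:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname   # spread replicas across distinct nodes
      whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but don't block scheduling
      labelSelector:
        matchLabels:
          control-plane: catalogd-controller-manager
```

Using ScheduleAnyway keeps single-node clusters (e.g. the standard e2e suite) schedulable, while multi-node clusters get the replicas on separate nodes.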

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/experimental.yaml` at line 2544, Replica count set to 2 is
insufficient for node-level HA because both pods can schedule to the same node;
update the workload spec that contains `replicas: 2` to include scheduling
constraints such as a podAntiAffinity (e.g.,
preferredDuringSchedulingIgnoredDuringExecution with topologyKey
`kubernetes.io/hostname`) or topologySpreadConstraints so the two replicas are
spread across distinct nodes (apply the same change to the other occurrence
referencing `replicas: 2` at the noted location).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/operator-controller/catalogmetadata/client/client.go`:
- Around line 108-113: The code currently returns early on any non-200
resp.StatusCode which prevents cache.Put from recording the actual error;
instead, modify the branch so that only 404 bypasses the cache (treat as HA
leader-miss) while other non-200 statuses create an error value and call
cache.Put(key, errOrErrorResponse) before returning the error; specifically, in
the function handling the HTTP response (the block that checks resp.StatusCode
and currently returns fmt.Errorf...), change the logic to if resp.StatusCode ==
http.StatusNotFound then return without caching, else construct the appropriate
error (including status code and body if available), call cache.Put with the
same cache key used elsewhere, and then return that error.

In `@manifests/experimental-e2e.yaml`:
- Line 2624: The shared e2e-coverage PVC is defined as ReadWriteOnce but both
Deployments (catalogd and operator-controller) mount it and replicas were
increased to 2, causing pods on different nodes to hang on attach; update the
storage model to support multiple nodes by either changing the PVC named
e2e-coverage to accessMode: ReadWriteMany (if your storage class supports RWX)
or convert these Deployments to StatefulSets and use volumeClaimTemplates to
give each replica its own PVC; locate the e2e-coverage PVC definition and the
catalogd/operator-controller workload manifests and apply the RWX change or the
StatefulSet + volumeClaimTemplates approach accordingly.

In `@test/e2e/features/ha.feature`:
- Around line 15-19: Add a pre-failover assertion that the catalog is already
serving: before the step "When the catalogd leader pod is force-deleted" assert
that catalog "test" reports Serving as True with Reason Available (i.e., ensure
the step/check that verifies Serving=True/Reason=Available runs prior to
deleting the leader pod). Locate the scenario steps around "And catalogd is
ready to reconcile resources", "And catalog \"test\" is reconciled" and the
existing "And catalog \"test\" reports Serving as True with Reason Available"
and move or duplicate that Serving check so it executes immediately before the
force-delete action to guarantee the catalog was serving before failover.

In `@test/e2e/steps/ha_steps.go`:
- Around line 27-31: The code uses the cached value sc.leaderPods["catalogd"]
which can be stale; before deleting the pod, refresh the current lease holder
from the cluster instead of relying on sc.leaderPods. Replace the direct use of
sc.leaderPods["catalogd"] in the step (after obtaining sc via scenarioCtx(ctx))
with a fresh lookup of the catalogd leader (e.g., call the existing helper that
queries the lease/leader from the API or add a small helper like
refresh/getLeaseHolder for "catalogd"), then use that up-to-date leaderPod for
the deletion and for assertions in NewCatalogdLeaderIsElected. Ensure the
refreshed value overwrites or is compared against sc.leaderPods["catalogd"] so
you don't accidentally delete a follower.

In `@test/e2e/steps/hooks.go`:
- Around line 136-142: The HA gate is currently based on total node count;
change it to count only control-plane nodes that are schedulable and enable
featureGates[catalogdHAFeature] only when there are at least two such nodes. Use
the existing k8sClient call to list control-plane nodes (select by label
node-role.kubernetes.io/control-plane or equivalent), then for each returned
node check its spec.taints (or call k8sClient to get the node object) and
exclude nodes with a NoSchedule control-plane taint; finally set
featureGates[catalogdHAFeature] = true when the filtered count >= 2. Ensure you
reference the same featureGates map and catalogdHAFeature symbol and keep the
logic inside the same hook.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c19769a4-b068-4be4-bc2a-b40decbe7fad

📥 Commits

Reviewing files that changed from the base of the PR and between cf8b48a and dfec6bd.

📒 Files selected for processing (17)
  • Makefile
  • helm/experimental.yaml
  • internal/catalogd/serverutil/serverutil.go
  • internal/operator-controller/catalogmetadata/client/client.go
  • internal/operator-controller/catalogmetadata/client/client_test.go
  • manifests/experimental-e2e.yaml
  • manifests/experimental.yaml
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml
  • test/e2e/features/ha.feature
  • test/e2e/steps/ha_steps.go
  • test/e2e/steps/hooks.go
  • test/e2e/steps/steps.go
💤 Files with no reviewable changes (1)
  • internal/operator-controller/catalogmetadata/client/client_test.go
✅ Files skipped from review due to trivial changes (2)
  • openshift/helm/operator-controller.yaml
  • helm/experimental.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • openshift/helm/catalogd.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml

Comment on lines 108 to 113
 	if resp.StatusCode != http.StatusOK {
-		errToCache := fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
-		return c.cache.Put(catalog.Name, catalog.Status.ResolvedSource.Image.Ref, nil, errToCache)
+		// Do not cache non-200 responses (e.g. 404 from a non-leader catalogd pod).
+		// Returning the error directly lets the next reconcile retry a fresh HTTP
+		// request and eventually hit the leader.
+		return nil, fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
 	}

⚠️ Potential issue | 🟠 Major

Non-200 responses are now hidden from cache consumers

On Line 112, returning directly skips cache.Put, so later reads can surface cache ... not found instead of the real upstream failure. That’s risky for persistent non-200s (e.g., 401/403/5xx). Consider bypassing cache only for the HA race case (404), and caching other non-200 errors.

Proposed scoped fix
 if resp.StatusCode != http.StatusOK {
-	// Do not cache non-200 responses (e.g. 404 from a non-leader catalogd pod).
-	// Returning the error directly lets the next reconcile retry a fresh HTTP
-	// request and eventually hit the leader.
-	return nil, fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
+	err := fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
+	// Avoid pinning transient HA misses, but preserve error visibility for other failures.
+	if resp.StatusCode == http.StatusNotFound {
+		return nil, err
+	}
+	return c.cache.Put(catalog.Name, catalog.Status.ResolvedSource.Image.Ref, nil, err)
 }

 spec:
   minReadySeconds: 5
-  replicas: 1
+  replicas: 2

⚠️ Potential issue | 🟠 Major

The shared e2e-coverage PVC does not scale with these replica bumps.

Both Deployments still mount the same e2e-coverage claim, but this file defines that PVC as ReadWriteOnce at Line 180-Line 185. With 2 catalogd pods and 2 operator-controller pods in the 2-node experimental suite, a replica scheduled onto a different node can stay Pending on volume attach, which defeats the HA rollout this change is trying to exercise.

Also applies to: 2775-2775
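Of the two remedies, the simpler one is switching the claim's access mode, sketched here (the storage size is assumed; RWX requires a storage class that supports it, such as NFS or CephFS):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: e2e-coverage
spec:
  accessModes:
    - ReadWriteMany   # allows pods on different nodes to mount the claim concurrently
  resources:
    requests:
      storage: 1Gi
```

The alternative, converting the workloads to StatefulSets with volumeClaimTemplates, gives each replica its own ReadWriteOnce claim at the cost of having to aggregate coverage output from multiple volumes.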


Comment on lines +15 to +19
    And catalogd is ready to reconcile resources
    And catalog "test" is reconciled
    When the catalogd leader pod is force-deleted
    Then a new catalogd leader is elected
    And catalog "test" reports Serving as True with Reason Available

⚠️ Potential issue | 🟡 Minor

Assert the catalog is serving before killing the leader.

Right now this scenario only proves eventual availability after the delete. If the catalog never became Serving=True until after failover, the test would still pass, which weakens the regression signal for “resume serving” behavior.

Suggested tweak
     And catalogd is ready to reconcile resources
     And catalog "test" is reconciled
+    And catalog "test" reports Serving as True with Reason Available
     When the catalogd leader pod is force-deleted
     Then a new catalogd leader is elected
     And catalog "test" reports Serving as True with Reason Available

Comment on lines +27 to +31
	sc := scenarioCtx(ctx)
	leaderPod := sc.leaderPods["catalogd"]
	if leaderPod == "" {
		return fmt.Errorf("catalogd leader pod not found in scenario context; run 'catalogd is ready to reconcile resources' first")
	}

⚠️ Potential issue | 🟠 Major

Refresh the lease holder immediately before deleting the pod.

sc.leaderPods["catalogd"] can be stale by the time this step runs. If leadership already changed during catalog "test" is reconciled, this step deletes a follower, and NewCatalogdLeaderIsElected can then pass without any real failover because the holder already differs from the cached value.

Suggested fix
 func CatalogdLeaderPodIsForceDeleted(ctx context.Context) error {
 	sc := scenarioCtx(ctx)
-	leaderPod := sc.leaderPods["catalogd"]
-	if leaderPod == "" {
-		return fmt.Errorf("catalogd leader pod not found in scenario context; run 'catalogd is ready to reconcile resources' first")
+	holder, err := k8sClient("get", "lease", leaseNames["catalogd"], "-n", olmNamespace,
+		"-o", "jsonpath={.spec.holderIdentity}")
+	if err != nil || strings.TrimSpace(holder) == "" {
+		return fmt.Errorf("failed to resolve current catalogd leader from lease: %w", err)
 	}
+	leaderPod := strings.Split(strings.TrimSpace(holder), "_")[0]
+	sc.leaderPods["catalogd"] = leaderPod
 
 	logger.Info("Force-deleting catalogd leader pod", "pod", leaderPod)
 	if _, err := k8sClient("delete", "pod", leaderPod, "-n", olmNamespace,
 		"--force", "--grace-period=0"); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/steps/ha_steps.go` around lines 27 - 31, The code uses the cached
value sc.leaderPods["catalogd"] which can be stale; before deleting the pod,
refresh the current lease holder from the cluster instead of relying on
sc.leaderPods. Replace the direct use of sc.leaderPods["catalogd"] in the step
(after obtaining sc via scenarioCtx(ctx)) with a fresh lookup of the catalogd
leader (e.g., call the existing helper that queries the lease/leader from the
API or add a small helper like refresh/getLeaseHolder for "catalogd"), then use
that up-to-date leaderPod for the deletion and for assertions in
NewCatalogdLeaderIsElected. Ensure the refreshed value overwrites or is compared
against sc.leaderPods["catalogd"] so you don't accidentally delete a follower.
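
The split-on-underscore parsing in the suggested fix depends on how the lease's holderIdentity is formatted. A minimal, self-contained sketch of that parsing (the identity strings below are illustrative, and the "<hostname>_<random-suffix>" format is an assumption to verify against controller-runtime's leader-election code):

```go
package main

import (
	"fmt"
	"strings"
)

// leaderPodFromHolder extracts the pod name from a Lease holderIdentity.
// controller-runtime typically formats the identity as
// "<hostname>_<random-suffix>"; inside a pod the hostname is the pod name,
// and pod names cannot contain underscores, so splitting on the first "_"
// is safe.
func leaderPodFromHolder(holder string) string {
	return strings.SplitN(strings.TrimSpace(holder), "_", 2)[0]
}

func main() {
	fmt.Println(leaderPodFromHolder("catalogd-controller-manager-7d9f8c_ab12cd34"))
	// An identity without a suffix still yields the whole string.
	fmt.Println(leaderPodFromHolder("catalogd-controller-manager-7d9f8c"))
}
```

Refreshing the holder this way, then overwriting sc.leaderPods["catalogd"] before the delete, keeps the later assertion comparing against the pod that was actually killed.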

Comment thread test/e2e/steps/hooks.go
Comment on lines +136 to +142
// Enable HA scenarios when the cluster has at least 2 nodes. This runs
// unconditionally so that upgrade scenarios (which install OLM in a Background
// step and return early below) still get the gate set correctly.
if out, err := k8sClient("get", "nodes", "--no-headers", "-o", "name"); err == nil &&
len(strings.Fields(strings.TrimSpace(out))) >= 2 {
featureGates[catalogdHAFeature] = true
}

⚠️ Potential issue | 🟡 Minor

HA detection should use eligible control-plane nodes, not total nodes.

This currently enables the HA gate based on all nodes. On clusters with only one control-plane node, HA scenarios may be enabled incorrectly and fail for the wrong reason. Please gate on schedulable control-plane capacity for these deployments.

💡 Minimal adjustment
- if out, err := k8sClient("get", "nodes", "--no-headers", "-o", "name"); err == nil &&
+ if out, err := k8sClient("get", "nodes", "-l", "node-role.kubernetes.io/control-plane", "--no-headers", "-o", "name"); err == nil &&
 		len(strings.Fields(strings.TrimSpace(out))) >= 2 {
 		featureGates[catalogdHAFeature] = true
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Enable HA scenarios when the cluster has at least 2 nodes. This runs
// unconditionally so that upgrade scenarios (which install OLM in a Background
// step and return early below) still get the gate set correctly.
if out, err := k8sClient("get", "nodes", "--no-headers", "-o", "name"); err == nil &&
len(strings.Fields(strings.TrimSpace(out))) >= 2 {
featureGates[catalogdHAFeature] = true
}
// Enable HA scenarios when the cluster has at least 2 nodes. This runs
// unconditionally so that upgrade scenarios (which install OLM in a Background
// step and return early below) still get the gate set correctly.
if out, err := k8sClient("get", "nodes", "-l", "node-role.kubernetes.io/control-plane", "--no-headers", "-o", "name"); err == nil &&
len(strings.Fields(strings.TrimSpace(out))) >= 2 {
featureGates[catalogdHAFeature] = true
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/steps/hooks.go` around lines 136 - 142, The HA gate is currently
based on total node count; change it to count only control-plane nodes that are
schedulable and enable featureGates[catalogdHAFeature] only when there are at
least two such nodes. Use the existing k8sClient call to list control-plane
nodes (select by label node-role.kubernetes.io/control-plane or equivalent),
then for each returned node check its spec.taints (or call k8sClient to get the
node object) and exclude nodes with a NoSchedule control-plane taint; finally
set featureGates[catalogdHAFeature] = true when the filtered count >= 2. Ensure
you reference the same featureGates map and catalogdHAFeature symbol and keep
the logic inside the same hook.

@tmshort
Contributor Author

tmshort commented Apr 28, 2026

/test e2e-aws-upgrade-ovn-single-node
/test openshift-e2e-aws

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

@tmshort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-upgrade-ovn-single-node | dfec6bd | link | false | /test e2e-aws-upgrade-ovn-single-node |
| ci/prow/openshift-e2e-aws | dfec6bd | link | true | /test openshift-e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant