WIP: Gracefully stop kube-apiserver before MCO reboot #5775
saschagrunert wants to merge 1 commit into openshift:main
Conversation
@saschagrunert: This pull request references Jira Issue OCPBUGS-75200, which is valid. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)

Walkthrough
Reboot logic in pkg/daemon now runs via a transient systemd unit that executes an embedded shell script to stop kubelet and kube-apiserver containers (crictl with timeouts) on control-plane nodes, then reboots. A minor comment was changed in pkg/daemon/update.go. A unit test for rebootCommand was added. No public APIs changed.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Force-pushed from a379c02 to ee9a19e
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: saschagrunert
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/34da6640-2212-11f1-82ca-7e4d0a294189-0
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/daemon/daemon.go`:
- Around lines 285-292: The current logic captures CONTAINERS before stopping kubelet, allowing a static-pod restart to create a new kube-apiserver container that won't be stopped. After calling `timeout 30 systemctl stop kubelet` (in the block around CONTAINERS and the for loop), re-query crictl (e.g. reassign `CONTAINERS="$(crictl ps -q --label io.kubernetes.container.name=kube-apiserver 2>/dev/null)"`) and use the refreshed CONTAINERS list for the subsequent loop that stops containers (the logger and `crictl stop --timeout 200 "$cid"` calls), ensuring any new kube-apiserver container started during the kubelet-stop window is also stopped.
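The suggested ordering can be sketched as a small shell function. This is illustrative, not the actual MCO script: the function name is made up, while the crictl/systemctl/logger invocations are the ones quoted in the comment.

```shell
# Hypothetical helper illustrating the suggested fix: stop kubelet first,
# THEN query crictl, so any kube-apiserver container created by a static-pod
# restart during the kubelet-stop window is included in the stop loop.
stop_apiserver_containers() {
  # Bound the kubelet stop so a hung kubelet cannot stall the reboot.
  timeout 30 systemctl stop kubelet || logger "failed to stop kubelet"

  # Re-query AFTER kubelet is down; a pre-stop snapshot could be stale.
  CONTAINERS="$(crictl ps -q --label io.kubernetes.container.name=kube-apiserver 2>/dev/null)"

  for cid in $CONTAINERS; do
    logger "stopping kube-apiserver container $cid"
    crictl stop --timeout 200 "$cid" || logger "failed to stop container $cid"
  done
}
```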
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 168e9546-c887-4a2c-988f-55fc33d0b978
📒 Files selected for processing (2)
pkg/daemon/daemon.go, pkg/daemon/update.go
Force-pushed from ee9a19e to 96339a9
@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c4f2ec4-2215-11f1-91a0-41630963d78d-0

@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/195f7560-2215-11f1-8139-783d852a0fca-0
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/daemon/daemon.go`:
- Around lines 288-296: The graceful-shutdown block currently swallows failures from `systemctl stop kubelet` and `crictl stop` (using `|| true` and redirecting stderr) but unconditionally logs success. Change the logic to detect and surface failures: run `systemctl stop kubelet`, capture its exit code and output, and log via `logger` on non-zero exit; iterate CONTAINERS and for each cid run `crictl stop --timeout 200 $cid`, capture the exit code and stderr/stdout, and log a per-container failure (including the cid and command output) instead of discarding it; only emit the final 'kube-apiserver containers stopped' log if all stops succeeded (or emit a different summary showing which cids failed).
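A minimal sketch of that suggestion (the function name and `failed` variable are illustrative, not the actual MCO script): track per-container failures and only emit the success message when every stop succeeded.

```shell
# Hypothetical rework of the stop loop: surface crictl failures instead of
# discarding them, and make the final log line reflect what actually happened.
stop_with_error_reporting() {
  failed=""
  for cid in $CONTAINERS; do
    # Capture combined output so a failure can be logged with context.
    if out="$(crictl stop --timeout 200 "$cid" 2>&1)"; then
      logger "stopped kube-apiserver container $cid"
    else
      logger "failed to stop kube-apiserver container $cid: $out"
      failed="$failed $cid"
    fi
  done
  if [ -z "$failed" ]; then
    logger "kube-apiserver containers stopped"
  else
    logger "kube-apiserver containers failed to stop:$failed"
  fi
}
```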
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a9405367-d25c-4e88-b9ac-5ffc667ad647
📒 Files selected for processing (2)
pkg/daemon/daemon.go, pkg/daemon/update.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/daemon/update.go
Force-pushed from 96339a9 to 72002f6
/payload-job e2e-aws-ovn

@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e260b4cc-2217-11f1-9ae7-47fc522fbc75-0

@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e4ad2260-2217-11f1-9836-53d14348b10e-0
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Force-pushed from 72002f6 to 27e2581
/payload-job e2e-aws-ovn

@saschagrunert: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9d670550-2218-11f1-8f4d-a00412492feb-0
/retest

@saschagrunert: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8c756bb0-2399-11f1-9aee-22be1e273c3a-0
/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips 7

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ce567fd0-23c9-11f1-81c9-4945ff419b7b-0
Pre-merge verification details for this PR.
Cluster Details:
Test Scenario: A reboot was triggered via a KubeletConfig change targeting the master MCP.
Key Results
Observation: The /var/log/kube-apiserver/.terminating file exists after reboot on all 3 master nodes but did not impact graceful shutdown. @saschagrunert Could you please help confirm this behavior? Refer to this for full test details and log snippets.
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-ovn-ocl

@BhargaviGudi: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/417aabd0-242e-11f1-9a03-7cfbdadab570-0
Yes, this is expected. The files you observed are new ones created by the freshly started kube-apiserver pods after reboot, not leftovers from the previous run. The modification timestamps confirm this (e.g. 05:40 on ip-10-0-39-103, which rebooted at 05:38). The previous instance's file was cleaned up during the graceful shutdown.
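The detection logic being discussed can be sketched as a small check. The function wrapper and messages are illustrative; the path is the one named in the PR, and whether the file survives a reboot must be read together with its timestamp, as explained above.

```shell
# Hypothetical post-reboot check: watch-termination removes its lock file
# (/var/log/kube-apiserver/.terminating) only after a graceful shutdown, so a
# file still present from BEFORE the reboot would indicate a non-graceful
# termination. (A freshly started kube-apiserver recreates the file, so the
# timestamp matters in practice.)
check_graceful_termination() {
  # Default to the real path; a different path may be passed for testing.
  file="${1:-/var/log/kube-apiserver/.terminating}"
  if [ -e "$file" ]; then
    echo "WARNING: $file present; previous kube-apiserver did not terminate gracefully"
    return 1
  fi
  echo "previous kube-apiserver terminated gracefully"
}
```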
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-rosa-sts-ovn

@BhargaviGudi: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f07939f0-2431-11f1-86fd-739002f0c9de-0
/payload-job periodic-ci-openshift-release-main-nightly-4.20-e2e-rosa-sts-ovn

@BhargaviGudi: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0b177f10-2432-11f1-9fd3-99bbd4c5563f-0
I'm almost certain that the ROSA jobs only function on nightlies because they have to watch for new nightlies and propagate them into ClusterImageSets; we've never gotten presubmits to work there. The upgrade job that @isabella-janssen ran is probably our best bet. When we ran that here and on #5782 the build failure rate was really bad, but let's try it again here too; maybe the builders are happier today.

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips 7
@sdodson: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/915051d0-2480-11f1-865f-cf7cfd2dc1dd-0
@isabella-janssen @sdodson Could you please help me run a CI job on ROSA clusters? Thanks

No, that's not possible without adding new jobs here, and it's not worth that effort.
/hold

/retitle WIP: Gracefully stop kube-apiserver before MCO reboot

@saschagrunert: No Jira issue is referenced in the title of this pull request.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
- What I did

Alternative to #5708 that avoids enabling GracefulNodeShutdown (GNS).

Without GNS, kubelet exits immediately on SIGTERM without terminating pods during MCO-triggered reboots. kube-apiserver needs up to 194s for graceful shutdown, but systemd's `DefaultTimeoutStopSec` is 90s, so it gets SIGKILLed.

The transient systemd reboot unit now gracefully stops kube-apiserver before rebooting:

1. Detect kube-apiserver containers via `crictl ps` (no-op on workers)
2. Stop kubelet (`timeout 30 systemctl stop kubelet`) to prevent static pod restarts
3. Stop the containers gracefully (`crictl stop --timeout 200`)
4. `systemctl reboot`

Failures in steps 2 and 3 are logged via `logger` and do not block the reboot. The transient unit has `TimeoutStartSec=300` to prevent indefinite hangs if CRI-O is stuck.

This works because kube-apiserver runs under `watch-termination` (PID 1 in the container). `crictl stop` sends SIGTERM to `watch-termination`, which forwards it to kube-apiserver. After graceful shutdown completes, `watch-termination` removes its lock file (`/var/log/kube-apiserver/.terminating`) via defer and exits. On next startup, the absence of this file indicates graceful termination. No dependency on kubelet for detection.

- How to verify it

- The `[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully` extended test passes
- `/var/log/kube-apiserver/.terminating` is absent after reboot

- Description for the changelog

Gracefully stop kube-apiserver containers before MCO-triggered reboots to prevent non-graceful termination.
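The flow described in this PR can be sketched end to end. This is a hedged reconstruction from the description above, not the actual MCO code: the unit name `mco-graceful-reboot` and the function wrapper are made up, while `TimeoutStartSec=300` and the crictl/systemctl commands are the ones the description names.

```shell
# Embedded script per steps 1-4 above: detect kube-apiserver containers,
# stop kubelet, stop the containers gracefully, then reboot. Failures are
# logged but never block the reboot.
reboot_script='
CONTAINERS="$(crictl ps -q --label io.kubernetes.container.name=kube-apiserver 2>/dev/null)"
if [ -n "$CONTAINERS" ]; then
  timeout 30 systemctl stop kubelet || logger "failed to stop kubelet"
  for cid in $CONTAINERS; do
    crictl stop --timeout 200 "$cid" || logger "failed to stop container $cid"
  done
fi
systemctl reboot
'

# Run the script as a transient systemd unit; TimeoutStartSec bounds the whole
# sequence so a stuck CRI-O cannot hang the node indefinitely.
schedule_graceful_reboot() {
  systemd-run --unit=mco-graceful-reboot \
    --property=TimeoutStartSec=300 \
    /bin/sh -c "$reboot_script"
}
```

On worker nodes `crictl ps` returns nothing, so the whole stop block is a no-op and the unit proceeds straight to the reboot.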