Description
The most recent Bottlerocket release included an update to runc 1.1.6. Shortly after the release, we received reports of a regression where nodes would fall over after kubelet, systemd, and dbus-broker consumed excessive CPU and memory resources.
In bottlerocket-os/bottlerocket#3057 I narrowed this down via git bisect to e4ce94e, a commit that was meant to fix this kind of failure but instead now causes it to happen consistently.
I've confirmed that reverting that specific patch fixes the regression.
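(For reference, the bisect was the standard procedure between the last-good and first-bad runc builds. A rough sketch, assuming v1.1.5 was the last unaffected release and v1.1.6 the first affected one; the exact details are in the linked Bottlerocket issue:)
$ git clone https://github.com/opencontainers/runc && cd runc
$ git bisect start
$ git bisect bad v1.1.6    # first release where the regression appeared
$ git bisect good v1.1.5   # last release known to be unaffected
# Build each candidate, install it on a test node, run the repro below,
# then mark the result until git reports the first bad commit:
$ git bisect good    # or: git bisect bad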
Steps to reproduce the issue
On an EKS 1.26 cluster with a single worker node, apply a consistent load via this spec:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; sleep $(( ( RANDOM % 10 ) + 1 ));
          restartPolicy: OnFailure
          hostNetwork: true
      parallelism: 100
(repro credit to @yeazelm)
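To start the load, save the spec to a file (the filename here is arbitrary) and apply it; deleting the CronJob later removes the load but, as noted below, does not stop the errors:
$ kubectl apply -f hello-cronjob.yaml
cronjob.batch/hello created
$ kubectl delete cronjob hello    # later, to remove the load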
After a short time, the "Path does not exist" and "Failed to delete cgroup paths" errors appear and continue even after the spec is deleted and the load is removed.
Describe the results you received and expected
systemd, kubelet, and dbus-broker all showed high CPU usage, where I'd expect cgroup cleanup for completed pods to be quiet. journalctl -f and busctl monitor showed these messages repeatedly:
Apr 27 00:20:09 ip-10-0-83-192.us-west-2.compute.internal kubelet[1206]: I0427 00:20:09.181437 1206 kubelet_getters.go:306] "Path does not exist" path="/var/lib/kubelet/pods/b08478e4-6c1b-461e-9fb5-e6c6411cf3ef/volumes"
‣ Type=signal Endian=l Flags=1 Version=1 Cookie=5417700 Timestamp="Thu 2023-04-27 00:21:08.887527 UTC"
  Sender=:1.0 Path=/org/freedesktop/systemd1 Interface=org.freedesktop.systemd1.Manager Member=UnitNew
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };

‣ Type=signal Endian=l Flags=1 Version=1 Cookie=5417701 Timestamp="Thu 2023-04-27 00:21:08.887570 UTC"
  Sender=:1.0 Path=/org/freedesktop/systemd1 Interface=org.freedesktop.systemd1.Manager Member=JobNew
  UniqueName=:1.0
  MESSAGE "uos" {
          UINT32 454294;
          OBJECT_PATH "/org/freedesktop/systemd1/job/454294";
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
  };

‣ Type=method_call Endian=l Flags=0 Version=1 Cookie=449180 Timestamp="Thu 2023-04-27 00:21:10.622122 UTC"
  Sender=:1.11 Destination=org.freedesktop.systemd1 Path=/org/freedesktop/systemd1 Interface=org.freedesktop.systemd1.Manager Member=StopUnit
  UniqueName=:1.11
  MESSAGE "ss" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          STRING "replace";
  };

‣ Type=signal Endian=l Flags=1 Version=1 Cookie=5418558 Timestamp="Thu 2023-04-27 00:21:09.106241 UTC"
  Sender=:1.0 Path=/org/freedesktop/systemd1 Interface=org.freedesktop.systemd1.Manager Member=UnitRemoved
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };
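(For anyone digging further: the StopUnit method call captured above can also be issued by hand over the same bus, e.g. from the admin container, to watch systemd's response directly. The unit name below is taken straight from the capture:)
# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager StopUnit ss \
    "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice" "replace"
On success this returns the object path of the queued job, matching the JobNew signals in the churn above.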
What version of runc are you using?
# runc -v
runc version 1.1.6+bottlerocket
commit: 0f48801a0e21e3f0bc4e74643ead2a502df4818d
spec: 1.0.2-dev
go: go1.19.6
libseccomp: 2.5.4
Host OS information
Bottlerocket
Host kernel information
Bottlerocket releases cover a variety of kernels, so to break it down a bit:
- Kubernetes 1.23, Linux 5.10, cgroup v1 - not affected
- Kubernetes 1.24, Linux 5.15, cgroup v1 - affected
- Kubernetes 1.25, Linux 5.15, cgroup v1 - affected
- Kubernetes 1.26, Linux 5.15, cgroup v2 - not affected
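(If it helps others reproduce across variants, the cgroup version on a node can be confirmed from the filesystem type mounted at /sys/fs/cgroup:)
# stat -fc %T /sys/fs/cgroup    # prints cgroup2fs on cgroup v2, tmpfs on cgroup v1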