cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323
Conversation
Hi @jepio. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Build succeeded.
force-pushed from 1c61c96 to be5fe70
Build succeeded.
Build succeeded.
Windows flaked. /retest
@jepio: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

> Windows flaked. /retest
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi, is anyone willing to review this?
kzys
left a comment
The change looks good to me, but I don't know much about cgroups. Also does it make sense to add some tests around that?
@AkihiroSuda - Can you take a look?
Shouldn't this be emitted with another event type?
I don't think so - the /tasks/oom event was supposed to mean "task was killed due to OOM". An OOM event caused by a task that does not result in that task being killed is not particularly interesting. So I would say "oom_kill" is what should have been used all along. But right now this event is not being generated at all in the common case (single container in a pod), as described in the PR. This is a consequence of limits being evaluated from the root down and the pod (slice) having the same limit as the container (scope). So the pod hits OOM, and the container gets killed as a consequence.
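For reference, both counters live in each cgroup's memory.events file. Below is a minimal sketch of how one could inspect them at the pod (slice) and container (scope) levels to see the behavior described above; the paths are illustrative placeholders, not taken from this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMemoryEvents parses a cgroup2 memory.events file into a map of
// counter name (oom, oom_kill, max, ...) to value.
func readMemoryEvents(cgroupPath string) (map[string]uint64, error) {
	f, err := os.Open(cgroupPath + "/memory.events")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	events := make(map[string]uint64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		events[fields[0]] = v
	}
	return events, s.Err()
}

func main() {
	// Hypothetical paths: pod slice (parent) and container scope (child).
	for _, path := range []string{
		"/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice",
		"/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice/cri-containerd-abcd.scope",
	} {
		ev, err := readMemoryEvents(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		// With identical memory.max at both levels, "oom" is charged to the
		// parent slice, while "oom_kill" is also incremented on the child
		// scope that owned the killed task.
		fmt.Printf("%s: oom=%d oom_kill=%d\n", path, ev["oom"], ev["oom_kill"])
	}
}
```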
…OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice) and container cgroup (scope) will both have the same memory limit applied. In that situation, the kernel considers an OOM event to be triggered by the parent cgroup (slice) and increments 'oom' there. The child cgroup (scope) only sees an oom_kill increment. Since we monitor child cgroups for oom events, check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they set the limits differently (container level only). An alternative would be to not configure limits at the pod level - that way the container limit would be hit and the OOM would be generated correctly.

An interesting consequence is that when spawning a pod with multiple containers, the oom events also work correctly, because:

a) if one of the containers has no limit, the pod has no limit, so OOM events in another container report correctly.

b) if all of the containers have limits, then the pod limit will be the sum of the container limits, so a container will be able to hit its own limit first.

Signed-off-by: Jeremi Piotrowski <[email protected]>
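As a rough illustration of the behavioral change the commit message describes (not the actual containerd code), the sketch below reports a container OOM kill when the container scope's oom_kill counter increases, instead of watching the oom counter. The cgroup path and the polling loop are assumptions made for the example; a real watcher would use inotify on memory.events.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// oomKillCount returns the oom_kill counter from a cgroup2 memory.events
// file; each line has the form "name value".
func oomKillCount(cgroupPath string) (uint64, error) {
	data, err := os.ReadFile(cgroupPath + "/memory.events")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, nil
}

func main() {
	// Hypothetical container scope path.
	const scope = "/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice/cri-containerd-abcd.scope"

	var last uint64
	for range time.Tick(time.Second) {
		n, err := oomKillCount(scope)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if n > last {
			// Reporting on oom_kill (rather than oom) catches the
			// single-container-pod case where "oom" is charged to the
			// parent slice instead of the container scope.
			fmt.Printf("container OOM-killed (oom_kill=%d)\n", n)
			last = n
		}
	}
}
```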
force-pushed from be5fe70 to 7275411
https://github.com/containerd/containerd/blob/main/pkg/cri/server/events.go#L333-L349
@AkihiroSuda, do you find my reasoning convincing?
What is the state of this PR? It is preventing the Vertical Pod Autoscaler (Kubernetes) from reacting to OOMs correctly, so it has a potentially big impact on Kubernetes users.
Will take a look tomorrow~
@jepio @AkihiroSuda This change looks good to me. This Linux patch updates the cgroupv2 oom description: https://lore.kernel.org/lkml/[email protected]/T/. I think oom_kill could be more compatible with cgroupv1.
Docker driver's TestDockerDriver_OOMKilled should run on cgroups v2 now, since we're running the docker v27 client library and our runners run docker v26, which contains the containerd fix containerd/containerd#6323.
When running on cgroup2, currently in a 1-container-pod-with-memory-limit configuration, no /tasks/oom events are generated. This is because the pod cgroup and container cgroup both get the same memory.max setting, and the oom event gets counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring oom_kill instead and reporting that. oom_kill events are counted against both the pod and container cgroups.

My test case was the following kubernetes manifest:
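The author's actual manifest is not included in this excerpt. Purely as an illustration of the scenario described (a single-container pod with a memory limit, so the kubelet applies the same memory.max to both the pod slice and the container scope), here is a hypothetical pod of that shape built with the Kubernetes Go API; the name, image, command, and limit value are made up.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A single container with a memory limit; the container is expected to
	// exceed the limit and be OOM-killed.
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "oom-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "stress",
				Image:   "busybox",
				Command: []string{"sh", "-c", "tail /dev/zero"},
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						corev1.ResourceMemory: resource.MustParse("64Mi"),
					},
				},
			}},
		},
	}

	// Print the manifest as YAML so it can be applied with kubectl.
	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```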
Related issue: k3s-io/k3s#4572