use clone3 for exec process creation to reduce cgroup lock contention#4782
use clone3 for exec process creation to reduce cgroup lock contention#4782lujinda wants to merge 1 commit intoopencontainers:mainfrom
Conversation
Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into cgroup.procs. This approach leads to high lock contention on the cgroup_threadgroup_rwsem read-write lock under conditions of high container density and numerous exec probes, potentially causing system hang. This change introduces the usage of the clone3 system call within the setnsProcess.start function to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention. Signed-off-by: jinda.ljd <[email protected]>
There was a problem hiding this comment.
I think using clone3 might be a good idea (with a fallback, of course), thanks!
Do you have perf numbers to share? To better understand when this is problematic and how much help this provides on kernels that support it.
I wonder if @kolyshkin that, IIRC, wrote the patch for golang had ideas already.
| // Get the "before" value of oom kill count. | ||
| oom, _ := p.manager.OOMKillCount() | ||
| useClone3 := false | ||
| if cgroups.IsCgroup2UnifiedMode() && p.initProcessPid != 0 { |
There was a problem hiding this comment.
start() is already complex, let's move this to a function with a clear name, so it's more readable.
| procPid := p.pid() | ||
| if useClone3 { | ||
| procPid = -1 | ||
| } | ||
| if err := cgroups.WriteCgroupProc(path, procPid); err != nil && !p.rootlessCgroups { |
There was a problem hiding this comment.
I guess if it's -1 it's not written? Let's just useClone3 for the condition, explaining that if we are using clone3, then it's already set in the cgroup.
| p.cmd.SysProcAttr.UseCgroupFD = true | ||
| p.cmd.SysProcAttr.CgroupFD = int(fd.Fd()) |
There was a problem hiding this comment.
man clone3 says this is available since linux 5.7. You are setting useClone3 to true, but I don't think that is detecting CLONE_INTO_CGROUP is supported in this kernel, right? We will need to improve the detection. Not sure about the golang wrapper, but IIRC @kolyshkin wrote that for Go. He might have tips :)
There was a problem hiding this comment.
In case clone3 is not supported by the kernel or denied by the security policy, we'll get an error from exec. In this case, I guess, we should retry without UseCgroupFD.
There was a problem hiding this comment.
Or, we can rely on a kernel version (and have it in mind that it can still fail if e.g. clone3 is denied by the security policy).
There was a problem hiding this comment.
Usually a "dummy" exec of the syscall (with some nil params or whatever) is enough to know if it's supported. I guess something similar should be possible for this clone_into_cgroup param.
Usually that is recommended because people might backport it to old kernels, and if you just check the version you might not use it when it is available.
Note: This PR is only for discussion, and the code is for demonstration. If it is confirmed that there is no problem, the code structure may need to be optimized
Currently, the runc exec process creates child processes by first cloning the child process and then writing its PID into
cgroup.procs. This approach leads to high lock contention on thecgroup_threadgroup_rwsemread-write lock under conditions of high container density and numerous exec probes, potentially causing system hang.This change introduces the usage of the
clone3system call within thesetnsProcess.startfunction to merge the application of the cgroup into the clone operation (assuming cgroup v2 is in use). By doing so, it avoids the need to write PIDs to cgroup.procs directly, thereby bypassing the requirement for taking the write lock and reducing the risk of lock contention.