use clone3 for exec process creation to reduce cgroup lock contention by lujinda · Pull Request #4782 · opencontainers/runc

lujinda · 2025-06-13T02:00:56Z

Note: This PR is for discussion only, and the code is a demonstration. If the approach is confirmed to be sound, the code structure may still need to be optimized.

Currently, runc exec creates the child process first and then writes its PID into cgroup.procs. With high container density and many exec probes, this leads to heavy contention on the cgroup_threadgroup_rwsem read-write lock and can cause the system to hang.

This change uses the clone3 system call in the setnsProcess.start function so that the child is placed into its cgroup as part of the clone operation (assuming cgroup v2 is in use). This avoids writing PIDs to cgroup.procs directly, which removes the need to take the write lock and reduces the risk of lock contention.
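As a rough illustration of the mechanism described above, the following sketch uses the `UseCgroupFD` and `CgroupFD` fields of Go's `syscall.SysProcAttr`, the same fields this PR sets. The helper name `newCmdInCgroup` and the cgroup path are hypothetical and not part of runc; the sketch assumes Linux, cgroup v2, and a Go version that has these fields.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// newCmdInCgroup (hypothetical helper) prepares, but does not start, a
// command whose child is created directly inside the cgroup at cgroupDir.
// The caller must keep the returned directory handle open until the command
// has been started, and close it afterwards.
func newCmdInCgroup(cgroupDir, name string, args ...string) (*exec.Cmd, *os.File, error) {
	dir, err := os.Open(cgroupDir)
	if err != nil {
		return nil, nil, err
	}
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,          // ask Go to use clone3 with CLONE_INTO_CGROUP
		CgroupFD:    int(dir.Fd()), // fd of the target cgroup v2 directory
	}
	return cmd, dir, nil
}

func main() {
	// Hypothetical path; substitute a real cgroup v2 directory.
	cmd, dir, err := newCmdInCgroup("/sys/fs/cgroup/example", "true")
	if err != nil {
		fmt.Println("open cgroup:", err)
		return
	}
	defer dir.Close()
	if err := cmd.Run(); err != nil {
		fmt.Println("run:", err)
	}
}
```

With this approach no separate write to cgroup.procs is needed for the new process, because the kernel attaches it to the cgroup at creation time.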

Signed-off-by: jinda.ljd <[email protected]>

rata

I think using clone3 might be a good idea (with a fallback, of course), thanks!

Do you have perf numbers to share? They would help us understand when this is problematic and how much this change helps on kernels that support it.

I wonder if @kolyshkin, who IIRC wrote the patch for golang, already has ideas.

rata · 2025-06-20T14:35:35Z

libcontainer/process_linux.go

 	// Get the "before" value of oom kill count.
 	oom, _ := p.manager.OOMKillCount()
+	useClone3 := false
+	if cgroups.IsCgroup2UnifiedMode() && p.initProcessPid != 0 {


start() is already complex, let's move this to a function with a clear name, so it's more readable.

rata · 2025-06-20T14:36:25Z

libcontainer/process_linux.go

+		procPid := p.pid()
+		if useClone3 {
+			procPid = -1
+		}
+		if err := cgroups.WriteCgroupProc(path, procPid); err != nil && !p.rootlessCgroups {


I guess if it's -1 it's not written? Let's just use useClone3 for the condition, with a comment explaining that if we are using clone3, the process is already in the cgroup.
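One way the suggestion above could look is sketched here. The function `joinCgroups` and its `writeProc` parameter are hypothetical stand-ins (the latter for a writer such as `cgroups.WriteCgroupProc`), and the rootless handling from the diff is left out for brevity.

```go
package main

import "fmt"

// joinCgroups (hypothetical) writes pid into every cgroup path unless the
// child was already placed there at clone time.
func joinCgroups(paths []string, pid int, useClone3 bool, writeProc func(path string, pid int) error) error {
	if useClone3 {
		// The kernel attached the child to its cgroup during clone3
		// (CLONE_INTO_CGROUP), so there is nothing to write.
		return nil
	}
	for _, path := range paths {
		if err := writeProc(path, pid); err != nil {
			return fmt.Errorf("joining cgroup %s: %w", path, err)
		}
	}
	return nil
}

func main() {
	record := func(path string, pid int) error {
		fmt.Printf("write %d to %s/cgroup.procs\n", pid, path)
		return nil
	}
	paths := []string{"/sys/fs/cgroup/example"} // hypothetical path
	_ = joinCgroups(paths, 1234, false, record) // performs the write
	_ = joinCgroups(paths, 1234, true, record)  // skips the write
}
```

Branching on the boolean makes the intent explicit, instead of relying on a sentinel PID of -1 being ignored by the writer.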

rata · 2025-06-20T14:39:08Z

libcontainer/process_linux.go

+				p.cmd.SysProcAttr.UseCgroupFD = true
+				p.cmd.SysProcAttr.CgroupFD = int(fd.Fd())


man clone3 says this is available since Linux 5.7. You are setting useClone3 to true, but I don't think this detects whether CLONE_INTO_CGROUP is supported by the running kernel, right? We will need to improve the detection. Not sure about the golang wrapper, but IIRC @kolyshkin wrote that for Go. He might have tips :)

In case clone3 is not supported by the kernel or denied by the security policy, we'll get an error from exec. In this case, I guess, we should retry without UseCgroupFD.

Or, we can rely on a kernel version (and have it in mind that it can still fail if e.g. clone3 is denied by the security policy).

Usually a "dummy" exec of the syscall (with some nil params or whatever) is enough to know if it's supported. I guess something similar should be possible for this clone_into_cgroup param.

Usually that is recommended because people might backport it to old kernels, and if you just check the version you might not use it when it is available.
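A "dummy" probe of this kind could look roughly like the sketch below. It only checks whether the clone3 syscall itself is reachable, not whether CLONE_INTO_CGROUP is supported, which would need a further check. The names `sysClone3`, `clone3Available`, and `probeClone3` are hypothetical, and the syscall number 435 is an assumption that holds for amd64 and arm64.

```go
package main

import (
	"fmt"
	"syscall"
)

// sysClone3 is the clone3 syscall number on amd64 and arm64. It is defined
// here as an assumption because package syscall does not export it.
const sysClone3 = 435

// clone3Available interprets the errno returned by the dummy call.
func clone3Available(errno syscall.Errno) bool {
	switch errno {
	case syscall.ENOSYS, syscall.EPERM, syscall.EACCES:
		// Not implemented by the kernel, or denied by a security policy.
		return false
	}
	// Any other error (typically EINVAL for the bogus arguments) means the
	// kernel recognised the syscall.
	return true
}

// probeClone3 calls clone3(NULL, 0). A kernel that implements clone3 rejects
// the undersized argument struct before creating any process.
func probeClone3() bool {
	_, _, errno := syscall.RawSyscall(sysClone3, 0, 0, 0)
	return clone3Available(errno)
}

func main() {
	fmt.Println("clone3 reachable:", probeClone3())
}
```

Unlike a kernel version comparison, a probe like this also gives the right answer on kernels with backported support and in environments where a seccomp profile blocks the call.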

lujinda force-pushed the clone3_exec branch from 342ed8e to 0298a45 Compare June 13, 2025 02:01

rata reviewed Jun 20, 2025

kolyshkin mentioned this pull request Jul 16, 2025

runc exec: use CLONE_INTO_CGROUP #4812

Merged

lifubang closed this in #4812 Sep 27, 2025

lujinda wants to merge 1 commit into opencontainers:main from lujinda:clone3_exec
