Skip to content

runc exec: use CLONE_INTO_CGROUP#4812

Merged
lifubang merged 1 commit intoopencontainers:mainfrom
kolyshkin:exec-clone-into-cgroup
Sep 27, 2025
Merged

runc exec: use CLONE_INTO_CGROUP#4812
lifubang merged 1 commit intoopencontainers:mainfrom
kolyshkin:exec-clone-into-cgroup

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jul 16, 2025

Requires (and currently includes) PR #4822; draft until that one is merged.

It makes sense to make runc exec benefit from clone3(CLONE_INTO_CGROUP), when
available. Since it requires a recent kernel and might not work, implement a fallback.

Based on:

Closes: #4782.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from aa873c8 to 115aa1f Compare July 16, 2025 02:27
@kolyshkin

This comment was marked as resolved.

@cyphar

This comment was marked as resolved.

@kolyshkin

This comment was marked as resolved.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch 4 times, most recently from 467d16c to dfcf22a Compare July 16, 2025 07:47
@lifubang
Copy link
Member

I need some time to digest this. Any feedback is welcome.

I notice that all the failures occurred in rootless container tests. This might be related to:

// On cgroup v2 + nesting + domain controllers, WriteCgroupProc may fail with EBUSY.

However, you mentioned we're seeing an ENOENT error here, so that may not be the cause.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from dfcf22a to 6095b61 Compare July 16, 2025 22:46
@cyphar
Copy link
Member

cyphar commented Jul 17, 2025

@kolyshkin Wait, I thought we always communicated with systemd when using cgroup2 -- systemd is very happy to mess with our cgroups (including clearing limits and various other quite dangerous behaviour) if we don't tell it that we are managing the cgroup with Delegate=yes. Maybe this has changed over the years, but I'm fairly certain the initial implementations of this stuff all communicated something with systemd regardless of the cgroup driver used.

Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units...

@kolyshkin
Copy link
Contributor Author

@kolyshkin Wait, I thought we always communicated with systemd when using cgroup2 -- systemd is very happy to mess with our cgroups (including clearing limits and various other quite dangerous behaviour) if we don't tell it that we are managing the cgroup with Delegate=yes. Maybe this has changed over the years, but I'm fairly certain the initial implementations of this stuff all communicated something with systemd regardless of the cgroup driver used.

Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units...

When you use runc directly, unless --systemd-cgroup is explicitly specified, the fs/fs2 driver is used and runc do not communicate with systemd in any way. Which might be just fine, if the systemd is configured to not touch a specific cgroup path and everything under it, and runc is creating cgroups under that path. Having said that, runc with fs/fs2 driver neither configures such thing, nor checks if it is configured.

I'm pretty sure it has been that way from the very beginning.

One other thing is, when using systemd, we configure everything via systemd and then use fs/fs2 driver to write to cgroup directly. This is also how things have always been. One reason for that is we did not care much to translate OCI spec into systemd settings, which is now mostly fixed. Another reason is, systemd doesn't support all per-cgroup settings that the kernel has (so some of those can't be expressed as systemd unit properties).

@kolyshkin
Copy link
Contributor Author

I need some time to digest this. Any feedback is welcome.

I notice that all the failures occurred in rootless container tests. This might be related to:

// On cgroup v2 + nesting + domain controllers, WriteCgroupProc may fail with EBUSY.

However, you mentioned we're seeing an ENOENT error here, so that may not be the cause.

The thing is, while the comment says "EBUSY", the actual code doesn't check for particular error, going for this fallback on any error (including ENOENT).

My guess is, with systemd driver we actually need AddPid cgroup driver method to add a pid (like an exec pid) into a pre-created cgroup (as opposed to Apply which creates the cgroup). I'm working on adding it.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch 6 times, most recently from 6e3bf36 to 9489925 Compare July 29, 2025 01:04
@kolyshkin
Copy link
Contributor Author

Apparently, we are also not placing rootless container exec's into the proper cgroup (which is still possible when using cgroup v2 systemd driver, but we'd need to use AttachProcessesToUnit). As a result, container init and exec are running in different cgroups. This could be a problem because rootless+cgroupv2+systemd-driver can still set resource limits, and exec is running without those.

Tacking it in #4822

@cyphar
Copy link
Member

cyphar commented Aug 15, 2025

I'm aware of the mixed fs/systemd setup, my confusion was more that systemd very strictly expects to be told about stuff in cgroupv2 because cgroupv2 was designed around a global management process -- it has been a long time since I reviewed the first cgroupv2 patches for runc, but I remember that being a thing I was worried about at the time.

The exec issue seems bad too...

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from 9489925 to f96e179 Compare September 8, 2025 19:45
@kolyshkin kolyshkin added this to the 1.4.0-rc.2 milestone Sep 16, 2025
@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch 4 times, most recently from b56a52b to 1652210 Compare September 17, 2025 00:01
@kolyshkin
Copy link
Contributor Author

OK I did some debugging and have very bad news to share.

Apparently GHA moves the process we create (container's init) to a different cgroup. Here's an excerpt from debug logs (using fs2 cgroup driver):

runc run -d --console-socket /tmp/bats-run-X8QSrN/runc.7IFV0b/tty/sock test_busybox (status=0)
time="2025-07-16T02:31:13Z" level=info msg="XXX container init cgroup /sys/fs/cgroup/system.slice/test_busybox"

Here ^^^ runc created a container and put its init to /system.slice/test_busybox cgroup.

runc exec test_busybox stat /tmp/mount-1/foo.txt /tmp/mount-2/foo.txt (status=255)
XXX container test_busybox init cgroup: /system.slice/hosted-compute-agent.service (present)

Here ^^^ the same container init is unexpectedly in the /system.slice/hosted-compute-agent.service cgroup.

time="2025-07-16T02:31:13Z" level=error msg="exec failed: unable to start container process: can't open cgroup: open /sys/fs/cgroup/system.slice/test_busybox: no such file or directory"

And here ^^^ runc exec failed because container's cgroup no longer exists.

Maybe this is what systemd does? But it doesn't do that on my machine.

I need some time to digest this. Any feedback is welcome.

Guess what, this is no longer happening. Based on cgroup name (hosted-compute-agent.service) I suspect it was caused by a bug in Azure (or, more specifically, GHA CI) infrastructure software.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch 2 times, most recently from ffdcf6e to 9b80e9f Compare September 24, 2025 19:30
// No clone3 syscall (kernels < v5.3).
case errors.Is(err, unix.ENOSYS):
return true
// No CLONE_INTO_CGROUP flag support (kernels v5.3 to v5.7).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify why clone3 returns E2BIG rather than EINVAL for kernels which don't support CLONE_INTO_CGROUP, I know this is because the kernel checked the oversized structures first, it might be helpful to include a link to the relevant kernel source (kernel/fork.c#L2525-L2536). This provides direct context for readers who are puzzled by this specific error code choice.

Copy link
Member

@cyphar cyphar Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

torvalds/linux@f5a1a53 or https://github.com/torvalds/linux/blob/v5.4/include/linux/uaccess.h#L237 are better links -- the error is coming from copy_struct_from_user. The bit you linked to is for something else.

(This is a generic pattern for all extensible-struct syscalls by the way. bpf(2), openat2(2), clone3(2), and a few other interfaces have the same behaviour. I gave a talk about this in 2020.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I looked it all up when checking whether E2BIG is the correct error to check for, and found out this actually depends on the kernel version.

From my memory (might be wrong here):

  • in kernel v5.3, clone3 is already available but there is no copy_struct_from_user (yet clone3 returns E2BIG if the clone_args is larger than expected and the tail has non-zero values).
  • from v5.4 to v5.7, the above pattern is generalized in copy_struct_from_user by torvalds/linux@f5a1a53

I don't think the code reader need all those details and references though, but I've included a short explanation in the commit message so it can be found via git blame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the same text to PR description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not directly related to all this, but I remember then looking into in-kernel checkpoint-restore code (which is part of OpenVZ/Virtuozzo kernel, later reimplemented mostly in userspace, and thus CRIU was born), I saw ANK (who wrote most of in-kernel C/R single-handedly) was using structures with zero padding at the end. Not sure if he checked if the padding was actually zero. Also, it might not be the C/R code, but some of the COW layered filesystem (or block device) stuff we had).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the same text to PR description.

Removed it all as per #4812 (comment) we're no longer checking for specific errors and always retrying without cgroupfd.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from 9b80e9f to 190a165 Compare September 26, 2025 16:35
return nil, fmt.Errorf("bad sub cgroup path: %s", sub)
}

fd, err := os.OpenFile(cgroup, unix.O_PATH|unix.O_DIRECTORY|unix.O_CLOEXEC, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we use openat2 on the cached cgroupfs fd rather than having this double-check logic -- do we not have access to it anymore with the cgroup package split?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

I can't explain why I haven't used cgroups.OpenFile here... maybe I was writing this in a "prototype mode" and then totally forgot about this.

@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from 190a165 to 2444516 Compare September 26, 2025 20:31
It makes sense to make runc exec benefit from clone2(CLONE_INTO_CGROUP),
if it is available. Since it requires a recent kernel and might not work,
implement a fallback to older way of joining the cgroup.

Based on:
 - https://go-review.googlesource.com/c/go/+/417695
 - coreos/go-systemd#458
 - opencontainers/cgroups#26
 - opencontainers#4822

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the exec-clone-into-cgroup branch from 2444516 to 5af4dd4 Compare September 26, 2025 21:27
@kolyshkin kolyshkin requested a review from lifubang September 27, 2025 00:47
@lifubang lifubang merged commit 050a853 into opencontainers:main Sep 27, 2025
36 checks passed
@cyphar cyphar added the backport/1.4-todo A PR in main branch which needs to backported to release-1.4 label Oct 8, 2025
@kolyshkin
Copy link
Contributor Author

1.4 backport: #4920

@kolyshkin kolyshkin added backport/1.4-done A PR in main branch which has been backported to release-1.4 and removed backport/1.4-todo A PR in main branch which needs to backported to release-1.4 labels Oct 8, 2025
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 27, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go-review.googlesource.com/c/go/+/728642
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 27, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure. From the
[CL 734200] it looks like it is an acceptable way. If the upstream
will introduce cmd.Clone we'll switch to it.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 29, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure. From the
[CL 734200] it looks like it is an acceptable way. If the upstream
will introduce cmd.Clone we'll switch to it.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 29, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 29, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jan 29, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
@AkihiroSuda
Copy link
Member

lifubang pushed a commit to lifubang/runc that referenced this pull request Feb 11, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit cb31d62)
Signed-off-by: lifubang <[email protected]>
lifubang pushed a commit to lifubang/runc that referenced this pull request Feb 11, 2026
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: opencontainers#4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit cb31d62)
Signed-off-by: lifubang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/1.4-done A PR in main branch which has been backported to release-1.4 impact/changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants