runc exec: use CLONE_INTO_CGROUP#4812
Conversation
aa873c8 to
115aa1f
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
467d16c to
dfcf22a
Compare
I notice that all the failures occurred in rootless container tests. This might be related to: runc/libcontainer/process_linux.go Line 205 in dfcf22a However, you mentioned we're seeing an ENOENT error here, so that may not be the cause. |
dfcf22a to
6095b61
Compare
|
@kolyshkin Wait, I thought we always communicated with systemd when using Is this just for our testing, or are users actually using this? Because we will need to fix that if we have users on systemd-based systems using cgroups directly without transient units... |
When you use I'm pretty sure it has been that way from the very beginning. One other thing is, when using systemd, we configure everything via systemd and then use fs/fs2 driver to write to cgroup directly. This is also how things have always been. One reason for that is we did not care much to translate OCI spec into systemd settings, which is now mostly fixed. Another reason is, systemd doesn't support all per-cgroup settings that the kernel has (so some of those can't be expressed as systemd unit properties). |
The thing is, while the comment says "EBUSY", the actual code doesn't check for particular error, going for this fallback on any error (including ENOENT). My guess is, with systemd driver we actually need |
6e3bf36 to
9489925
Compare
|
Apparently, we are also not placing rootless container exec's into the proper cgroup (which is still possible when using cgroup v2 systemd driver, but we'd need to use Tacking it in #4822 |
|
I'm aware of the mixed fs/systemd setup, my confusion was more that systemd very strictly expects to be told about stuff in cgroupv2 because cgroupv2 was designed around a global management process -- it has been a long time since I reviewed the first cgroupv2 patches for runc, but I remember that being a thing I was worried about at the time. The exec issue seems bad too... |
9489925 to
f96e179
Compare
b56a52b to
1652210
Compare
Guess what, this is no longer happening. Based on cgroup name ( |
ffdcf6e to
9b80e9f
Compare
libcontainer/process_linux.go
Outdated
| // No clone3 syscall (kernels < v5.3). | ||
| case errors.Is(err, unix.ENOSYS): | ||
| return true | ||
| // No CLONE_INTO_CGROUP flag support (kernels v5.3 to v5.7). |
There was a problem hiding this comment.
To clarify why clone3 returns E2BIG rather than EINVAL for kernels which don't support CLONE_INTO_CGROUP, I know this is because the kernel checked the oversized structures first, it might be helpful to include a link to the relevant kernel source (kernel/fork.c#L2525-L2536). This provides direct context for readers who are puzzled by this specific error code choice.
There was a problem hiding this comment.
torvalds/linux@f5a1a53 or https://github.com/torvalds/linux/blob/v5.4/include/linux/uaccess.h#L237 are better links -- the error is coming from copy_struct_from_user. The bit you linked to is for something else.
(This is a generic pattern for all extensible-struct syscalls by the way. bpf(2), openat2(2), clone3(2), and a few other interfaces have the same behaviour. I gave a talk about this in 2020.)
There was a problem hiding this comment.
Yes, I looked it all up when checking whether E2BIG is the correct error to check for, and found out this actually depends on the kernel version.
From my memory (might be wrong here):
- in kernel v5.3,
clone3is already available but there is nocopy_struct_from_user(yet clone3 returns E2BIG if the clone_args is larger than expected and the tail has non-zero values). - from v5.4 to v5.7, the above pattern is generalized in
copy_struct_from_userby torvalds/linux@f5a1a53
I don't think the code reader need all those details and references though, but I've included a short explanation in the commit message so it can be found via git blame.
There was a problem hiding this comment.
Added the same text to PR description.
There was a problem hiding this comment.
(Not directly related to all this, but I remember then looking into in-kernel checkpoint-restore code (which is part of OpenVZ/Virtuozzo kernel, later reimplemented mostly in userspace, and thus CRIU was born), I saw ANK (who wrote most of in-kernel C/R single-handedly) was using structures with zero padding at the end. Not sure if he checked if the padding was actually zero. Also, it might not be the C/R code, but some of the COW layered filesystem (or block device) stuff we had).
There was a problem hiding this comment.
Added the same text to PR description.
Removed it all as per #4812 (comment) we're no longer checking for specific errors and always retrying without cgroupfd.
9b80e9f to
190a165
Compare
libcontainer/process_linux.go
Outdated
| return nil, fmt.Errorf("bad sub cgroup path: %s", sub) | ||
| } | ||
|
|
||
| fd, err := os.OpenFile(cgroup, unix.O_PATH|unix.O_DIRECTORY|unix.O_CLOEXEC, 0) |
There was a problem hiding this comment.
I would prefer we use openat2 on the cached cgroupfs fd rather than having this double-check logic -- do we not have access to it anymore with the cgroup package split?
There was a problem hiding this comment.
Sure!
I can't explain why I haven't used cgroups.OpenFile here... maybe I was writing this in a "prototype mode" and then totally forgot about this.
190a165 to
2444516
Compare
It makes sense to make runc exec benefit from clone2(CLONE_INTO_CGROUP), if it is available. Since it requires a recent kernel and might not work, implement a fallback to older way of joining the cgroup. Based on: - https://go-review.googlesource.com/c/go/+/417695 - coreos/go-systemd#458 - opencontainers/cgroups#26 - opencontainers#4822 Signed-off-by: Kir Kolyshkin <[email protected]>
2444516 to
5af4dd4
Compare
|
1.4 backport: #4920 |
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure. [PR 4812]: opencontainers#4812 [CL 728642]: https://go-review.googlesource.com/c/go/+/728642 Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure. From the [CL 734200] it looks like it is an acceptable way. If the upstream will introduce cmd.Clone we'll switch to it. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure. From the [CL 734200] it looks like it is an acceptable way. If the upstream will introduce cmd.Clone we'll switch to it. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]> (cherry picked from commit cb31d62) Signed-off-by: lifubang <[email protected]>
Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: opencontainers#4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <[email protected]> Signed-off-by: Kir Kolyshkin <[email protected]> (cherry picked from commit cb31d62) Signed-off-by: lifubang <[email protected]>
Requires (and currently includes) PR #4822; draft until that one is merged.It makes sense to make runc exec benefit from
clone3(CLONE_INTO_CGROUP), whenavailable. Since it requires a recent kernel and might not work, implement a fallback.
Based on:
Closes: #4782.