chown cgroup to process uid in container namespace#3057
chown cgroup to process uid in container namespace#3057cyphar merged 1 commit intoopencontainers:masterfrom
Conversation
b30bb92 to
9e9c973
Compare
|
In general this kinda makes sense, although I'm afraid systemd might overwrite that at some random moment. This also needs test(s). |
You can use systemd transient unit API to set a different user. Unfortunately, the uid needs to I filed an issue against systemd to relax this requirement. It was provisionally rejected: I would like if we can push back and convince systemd project (i.e. Lennart) to just relax the checks |
AkihiroSuda
left a comment
There was a problem hiding this comment.
For security reason, the default behavior should not be changed.
We can probably have a feature gate as an annotation
User namespaces is already behind an annotation ( |
|
I'm talking about OCI Runtime Spec annotations, not about Kubernetes annotations. CRI-O may set the suggest OCI Runtime Spec annotation by default at their own risk. |
|
@AkihiroSuda thanks for the clarification. Fair enough. I'll gate it. I'll try to make all the suggested changes next week. |
|
@AkihiroSuda what would be an appropriate key/value for the annotation? |
SGTM, but in the long term we should have proper JSON entry in OCI Runtime Spec https://github.com/opencontainers/runtime-spec |
Better would be to adjust the semantics such that the container process can create scopes/slices in its own cgroup, and manage its child processes/threads. But that is a discussion to be had over there, not here :) |
9e9c973 to
3c94bb5
Compare
|
Thanks @kolyshkin and @AkihiroSuda for your reviews. I implemented all of your requested changes, and added a test. By the way, for testing with OpenShift, do not use 4.8.0, there is a regression in CRI-O 1.21 that prevents k8s annotations from being propagated to the OCI configuration. CRI-O 1.20 (OpenShift 4.7.x) is fine. |
43e5c14 to
0f4ffc4
Compare
|
Ping @kolyshkin and @AkihiroSuda - are you able to re-review this PR soon? |
4efd94e to
99cf0a8
Compare
99cf0a8 to
e155172
Compare
kolyshkin
left a comment
There was a problem hiding this comment.
Overall looks good and ready; only left a single suggestion.
e155172 to
cce31b0
Compare
cce31b0 to
8212dfa
Compare
Delegating cgroups to the container enables more complex workloads, including systemd-based workloads. The OCI runtime-spec was recently updated to explicitly admit such delegation, through specification of cgroup ownership semantics: opencontainers/runtime-spec#1123 Pursuant to the updated OCI runtime-spec, change the ownership of the container's cgroup directory and particular files therein, when using cgroups v2 and when the cgroupfs is to be mounted read/write. As a result of this change, systemd workloads can run in isolated user namespaces on OpenShift when the sandbox's cgroupfs is mounted read/write. It might be possible to implement this feature in other cgroup managers, but that work is deferred. Signed-off-by: Fraser Tweedale <[email protected]>
8212dfa to
35d20c4
Compare
|
Thanks to all who have reviewed this PR and given feedback. We need more approvals and then someone can press the merge button. Ping @cyphar, @AkihiroSuda, @giuseppe, @jfroy. |
|
Any takers? @mrunalp? |
|
Many thanks, @cyphar and all reviewers. |
|
I created #3311 to backport this feature to the |
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
opencontainers/runtime-spec#1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.