What is the problem you're trying to solve
There is ongoing work to enable pod-level user namespaces:
kubernetes/enhancements#127
kubernetes/enhancements#3275
User namespaces (UID/GID remapping) are already supported by ctr, containerd core, and runC, but the CRI subsystem still has some issues. While building a PoC for pod-level user namespaces, the first issue we hit was:
RunPodSandbox for &PodSandboxMetadata{Name:nginx-sandbox,Uid:018b4704-222a-4657-990e-bb568b870f4b,Namespace:default,Attempt:1,} failed, error error="failed to create containerd task: failed to create shim task: OCI runtime create failed: container_linux.go:346: starting container process caused \"process_linux.go:449: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/home/wanglei01/opt/open/go_project/containerd_upstream/bin/run/containerd/io.containerd.runtime.v2.task/k8s.io/7388141ccd209d9df243a1b7df52c9510436e5d89b072904bd6778400d82301c/rootfs\\\\\\\" at \\\\\\\"/sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\": unknown"
After some investigation, we found that the failure is caused by the order in which namespaces are created.
When mounting sysfs, the Linux kernel checks that the current user has CAP_SYS_ADMIN in the user namespace associated with the net namespace. The current flow for creating namespaces for the sandbox/infra container is:
- call netns to create the net ns for the pod if the net namespace mode is not NODE
- call CNI to initialize the pod network
- configure the other namespaces for the infrastructure/app containers
- call the container runtime to create the other namespaces for the container, including the user ns
With the above flow, the net namespace is associated with the init user ns because it is created before the pod user ns exists. As a result, mounting sysfs from inside the pod user ns fails.
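The ownership rule behind this failure can be illustrated with a small toy model. This is only a sketch: `UserNS`, `NetNS`, and `MountSysfs` are hypothetical names invented for illustration, not kernel or containerd APIs, and the check is simplified to an exact match between the caller's user ns and the owner of the net ns (the real kernel check is a capability check, which also passes for a sufficiently privileged process in an ancestor user ns).

```go
package main

import (
	"errors"
	"fmt"
)

// UserNS is a toy stand-in for a Linux user namespace (hypothetical type).
type UserNS struct{ Name string }

// NetNS records the user namespace that was current when the net ns was
// created. That user ns becomes the owner of the net ns.
type NetNS struct{ Owner *UserNS }

var errPerm = errors.New("operation not permitted")

// MountSysfs models the permission check described above: the caller must
// hold CAP_SYS_ADMIN in the user ns that owns the net ns.
// Simplifying assumption: only a caller in exactly the owning user ns passes.
func MountSysfs(caller *UserNS, net *NetNS) error {
	if net.Owner != caller {
		return errPerm
	}
	return nil
}

func main() {
	initNS := &UserNS{Name: "init"}
	podNS := &UserNS{Name: "pod"}

	// Current flow: the net ns is created from the init user ns,
	// before the pod user ns exists.
	early := &NetNS{Owner: initNS}
	fmt.Println("net ns created first:", MountSysfs(podNS, early))

	// Alternative order: the net ns is created from inside the pod user ns.
	late := &NetNS{Owner: podNS}
	fmt.Println("net ns created in pod user ns:", MountSysfs(podNS, late))
}
```

In this model the only thing that differs between the two cases is which user ns was current when the net ns was created, which is the point of the analysis above.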
Describe the solution you'd like
We could change the flow for creating namespaces and initializing the pod network for the sandbox as follows:
- configure all needed namespaces for the sandbox/infra container, including the user namespace
- call the runtime to start the infra container, which creates all needed namespaces
- call CNI to initialize the pod network with the net ns created by the runtime for the infra container
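The proposed ordering could be sketched roughly as below. This is not containerd's CRI code: the `Runtime` and `CNI` interfaces, `RunSandbox`, the `recorder` fake, and the net ns path are all hypothetical and exist only to show that the runtime creates the namespaces first and CNI is handed the resulting net ns afterwards.

```go
package main

import "fmt"

// Runtime is a hypothetical interface, not containerd's API.
type Runtime interface {
	// StartInfra starts the infra container with the requested namespaces
	// and returns the path of the net ns the runtime created for it.
	StartInfra(id string, withUserNS bool) (netnsPath string, err error)
}

// CNI is a hypothetical interface for pod network setup.
type CNI interface {
	Setup(id, netnsPath string) error
}

// RunSandbox applies the proposed order: the runtime creates every namespace
// (including the user ns) first, then CNI configures the net ns it produced.
func RunSandbox(rt Runtime, cni CNI, id string) error {
	netnsPath, err := rt.StartInfra(id, true)
	if err != nil {
		return fmt.Errorf("start infra container: %w", err)
	}
	if err := cni.Setup(id, netnsPath); err != nil {
		return fmt.Errorf("set up pod network: %w", err)
	}
	return nil
}

// recorder is a fake Runtime and CNI that only logs the call order.
type recorder struct{ calls []string }

func (r *recorder) StartInfra(id string, withUserNS bool) (string, error) {
	r.calls = append(r.calls, "runtime.StartInfra")
	return "/proc/1234/ns/net", nil // made-up path for illustration
}

func (r *recorder) Setup(id, netnsPath string) error {
	r.calls = append(r.calls, "cni.Setup "+netnsPath)
	return nil
}

func main() {
	r := &recorder{}
	if err := RunSandbox(r, r, "nginx-sandbox"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.calls)
}
```

One consequence of this ordering is that a CNI failure now happens after the infra container has started, so a real implementation would presumably need to stop and clean up that container on error.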
Additional context
No response