What is the problem you're trying to solve
We have reworked the user namespaces KEP to rely on idmap mounts even for stateless pods. This means the kubelet will no longer chown the volume files to the hostUID/GID of the pod user namespace (this affects volumes like configmap/secret; no real persistence is supported yet). Instead, it will just add a mapping to the CRI bind mounts, so the kernel does the ID translation. This mapping is passed to the container runtime and should be passed on to the OCI runtime too, which will do the idmap bind mount.
Several problems will appear if: a) we keep the containerd CRI userns implementation as-is AND b) k8s later adds support for userns with stateful pods and idmap mounts (it is on our roadmap, hopefully this year, but quite uncertain at this point) AND c) it is used with containerd 1.7.
The problem is basically that containerd will create a pod with a userns and not do anything with the volumes (no chown, nothing), as that is something the kubelet used to do. But the kubelet will not do it anymore, and therefore the pod will be created with a userns and just given access to the volumes. That means that, whenever stateful pods are supported, files will be created with the hostUID/hostGID of the pod, which changes on pod reschedules and such. So the pod won't be able to read files that the pod itself created before being rescheduled.
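To make the failure mode concrete, here is a minimal sketch of the ID arithmetic, assuming a pod gets one host ID range on its first run and a different one after a reschedule. The type, the method names and the ranges are made up for illustration; this is not containerd or kubelet code, it only mimics how a single uid_map line translates IDs.

```go
package main

import "fmt"

// IDMapping is a hypothetical stand-in for one line of a user namespace
// ID map: Length IDs starting at ContainerID map to IDs starting at HostID.
type IDMapping struct {
	ContainerID uint32
	HostID      uint32
	Length      uint32
}

// overflowID is the kernel's default overflow UID/GID (nobody/nogroup).
const overflowID = 65534

// ToHost returns the host ID that a container ID maps to, and false if
// the container ID is outside the mapped range.
func (m IDMapping) ToHost(id uint32) (uint32, bool) {
	if id < m.ContainerID || id-m.ContainerID >= m.Length {
		return 0, false
	}
	return m.HostID + (id - m.ContainerID), true
}

// ToContainer returns the ID a host ID is seen as inside the user
// namespace. Unmapped host IDs show up as the overflow ID.
func (m IDMapping) ToContainer(hostID uint32) uint32 {
	if hostID < m.HostID || hostID-m.HostID >= m.Length {
		return overflowID
	}
	return m.ContainerID + (hostID - m.HostID)
}

func main() {
	// Made-up ranges: the pod is assigned a different host range
	// after being rescheduled.
	firstRun := IDMapping{ContainerID: 0, HostID: 65536, Length: 65536}
	secondRun := IDMapping{ContainerID: 0, HostID: 131072, Length: 65536}

	// Without chown or an idmap mount, a file written by container
	// UID 1000 lands on disk with the translated host UID.
	onDisk, _ := firstRun.ToHost(1000)
	fmt.Println("owner on disk:", onDisk) // 66536

	// After the reschedule that host UID is not in the pod's range
	// anymore, so the pod sees its own file as owned by nobody.
	fmt.Println("owner seen after reschedule:", secondRun.ToContainer(onDisk)) // 65534
}
```

With an idmap mount the translation is applied at the mount, so the owner stored on disk would not depend on which host range the pod happens to get.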
As long as userns for stateful pods is not implemented in k8s, this is not a big issue: there is no persistence in the supported volumes (configmap, secret, emptyDir, projected and downwardAPI), so the worst case is that the pod will see the configmap's files owned by nobody/nogroup. So, for k8s 1.27 and releases without stateful pod support with userns, this is not a big deal.
Describe the solution you'd like
I think there are a few possible solutions to this problem:
- Update the cri-api vendoring so we see the new fields added to the mount, and if those are set then we throw an error. This way, whenever a kubelet asks us to use idmap mounts, as we currently don't support them, we just throw an error saying so. The upstream PR for cri-api adding the uid/gid mappings to CRI mounts has not been merged yet, but it already has an LGTM. It will probably be merged soon.
- Just do the simple changes in containerd to pass the mappings to the OCI runtime whenever the kubelet sends them. This PR is also simple (I have it ready, will open a draft PR so you can see the changes) and means that whenever the OCI runtime is upgraded with idmap mounts support, containerd 1.7 will just work with it. I don't know if runc 1.2 will work out of the box with containerd 1.7, though. UPDATE: this option doesn't seem nice, as runc ignores unknown fields (the runtime spec mandates that here). Therefore, this option would cause exactly what we want to avoid. It seems better to throw an error in containerd (option 1).
- An option that doesn't involve containerd would be to, whenever we add support for idmap mounts in k8s, also make sure that when the kubelet sends the mapping over the CRI, it has more than one line. This way, the current containerd implementation will throw an error, and we can make that work in future versions of containerd that support idmap mounts.
I think the best option is 1. What do you think?
If the 1.7 release ships without option 1, then we can do the last option in the kubelet to make sure we don't create any issues whenever we support stateful pods in k8s.
Additional context
No response