Problem
In #1688 we broke "Docker-in-LXD":
$ lxc launch ubuntu:18.04 foo -c security.nesting=true
$ lxc shell foo
foo# curl -fsSL get.docker.com -o get-docker.sh && sh get-docker.sh
foo# exit
$ lxc file push /usr/local/sbin/runc foo/usr/bin/docker-runc
$ lxc shell foo
foo# cat /proc/self/uid_map
(we are in userns here)
foo# docker run -it --rm busybox
docker: Error response from daemon: OCI runtime create failed: cannot specify gid= mount options for unmapped gid in rootless containers: unknown.
This happens because runc enables "rootless mode" whenever it runs in a user namespace, but "Docker-in-LXD" does not expect runc to do so.
What "rootless mode" actually does
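For illustration, here is a minimal sketch of how a process could tell from /proc/self/uid_map whether it is in a user namespace. It assumes the common heuristic that the initial namespace shows the single identity mapping "0 0 4294967295"; the function name inUserNS is hypothetical and this is not necessarily how runc implements the check.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// inUserNS reports whether the given contents of /proc/self/uid_map look
// like a user namespace other than the initial one.
// Assumption: the initial namespace has exactly one line, "0 0 4294967295".
func inUserNS(uidMap string) bool {
	trimmed := strings.TrimSpace(uidMap)
	if trimmed == "" {
		// An empty map is the state of a freshly created userns.
		return true
	}
	lines := strings.Split(trimmed, "\n")
	if len(lines) > 1 {
		// Multiple mapping entries never occur in the initial namespace.
		return true
	}
	f := strings.Fields(lines[0])
	if len(f) != 3 {
		// Unexpected format; this sketch conservatively answers "no".
		return false
	}
	return !(f[0] == "0" && f[1] == "0" && f[2] == "4294967295")
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("cannot read uid_map:", err)
		return
	}
	fmt.Println("in userns:", inUserNS(string(data)))
}
```

Inside the LXD container from the reproduction above, the map is not the identity mapping, so a check like this would answer true even though the user is "root".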
1. Honor $XDG_RUNTIME_DIR
2. Switch the cgroup manager to libcontainer.RootlessCgroupfs
3. Disable cgroup-specific features such as runc ps and OOM notification
4. Disable runc checkpoint and runc restore
5. Make sure config.json contains userns and id mappings if euid != 0
6. Make sure config.json does not contain uid= and gid= for mounts
7. Write "deny" to /proc/$PID/setgroups if single-entry mapping is specified
8. Disable additional groups, but actually we don't need to do this. ([TODO] rootless: support spec.Process.User.AdditionalGids #1835)
For "Docker-in-LXD", we need none of them, because runc is already executed in userns and cgroups are also available.
In #1688, we enabled the "rootless mode" in userns so as to support rootless img/buildkit/buildah/containerd/docker/podman, but actually we only need 1, 2, and 3 for these use cases.
Proposal
Step 1: fix Docker-in-LXD regression (PR: #1833 / Closed)
Change isRootless as follows:
func isRootless(context *cli.Context) (bool, error) {
	if context != nil {
		...
	}
	// Decide from $USER rather than from the UID, which is 0 inside a userns.
	u := os.Getenv("USER")
	return u != "" && u != "root", nil
}

- When runc is executed in userns via Docker-in-LXD, isRootless() returns false, because LXD sets $USER to "root".
- When runc is executed in userns via rootless img/buildkit/buildah/containerd/docker/podman, isRootless() returns true, because we don't change environment variables after unsharing the userns and mapping UID=0 to the current user.
- When runc is executed as a regular user in the initial namespace, isRootless() returns true.
- When runc is executed as the root in the initial namespace, isRootless() returns false.
Corner cases:
- When runc is executed in userns via Docker-in-rootless-Docker ("rootless dind"), as the root in the container, isRootless() returns false, and it is unlikely to work. The dockerd in the container would need to specify runc --rootless explicitly in this case. (And the user would probably need to launch dockerd with --rootless explicitly.)
- When runc is executed in "rootless dind", as a non-root user in the container, isRootless() returns true and it is likely to work.
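The scenarios listed for Step 1 can be checked against a standalone version of the $USER rule. This is only a sketch: rootlessFromUser is a hypothetical helper that takes the value of $USER as a parameter, and the user name "alice" is made up for illustration.

```go
package main

import "fmt"

// rootlessFromUser applies the proposed rule: rootless mode is enabled
// only when $USER is set and is not "root".
func rootlessFromUser(user string) bool {
	return user != "" && user != "root"
}

func main() {
	// Each scenario maps to the $USER value runc would see (assumed values).
	scenarios := []struct {
		name string
		user string
	}{
		{"Docker-in-LXD (userns, $USER=root)", "root"},
		{"rootless tool in userns ($USER unchanged)", "alice"},
		{"regular user, initial namespace", "alice"},
		{"root, initial namespace", "root"},
		{"$USER unset", ""},
	}
	for _, s := range scenarios {
		fmt.Printf("%-45s rootless=%v\n", s.name, rootlessFromUser(s.user))
	}
}
```

Note that an unset $USER falls on the non-rootless side under this rule.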
Step 2: refactor the rootless mode: (PR: #1862)
Probably, $XDG_RUNTIME_DIR needs to be honored when u := os.Getenv("USER"); u != "" && u != "root".
(Note that we shouldn't check the UID in the current namespace, because we still want to honor $XDG_RUNTIME_DIR after unsharing the userns and mapping UID=0 to the current user.)
Or maybe we can always honor this variable, but that potentially breaks compatibility when runc is executed with UID=0, $USER=root, $XDG_RUNTIME_DIR=/run/user/0.
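A sketch of the first option, where $XDG_RUNTIME_DIR is honored only for a non-root $USER. The helper name stateRoot, the default "/run/runc", and the "runc" subdirectory are assumptions made for this example, not a statement of what the refactoring PR does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultRoot is the state directory assumed for the non-rootless case.
const defaultRoot = "/run/runc"

// stateRoot picks the state directory from $USER and $XDG_RUNTIME_DIR.
// The UID is deliberately not consulted, so the result stays the same
// after unsharing the userns and mapping UID=0 to the current user.
func stateRoot(user, xdgRuntimeDir string) string {
	rootless := user != "" && user != "root"
	if rootless && xdgRuntimeDir != "" {
		return filepath.Join(xdgRuntimeDir, "runc")
	}
	return defaultRoot
}

func main() {
	fmt.Println(stateRoot(os.Getenv("USER"), os.Getenv("XDG_RUNTIME_DIR")))
}
```

With UID=0, $USER=root and $XDG_RUNTIME_DIR=/run/user/0, this variant keeps the default directory, which avoids the compatibility concern raised for the "always honor" alternative.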
We should detect cgroup availability explicitly, probably by trying mkdir /sys/fs/cgroup/foo/bar or something similar. I guess the overhead is negligible.
Or just remove the libcontainer.RootlessCgroupfs manager and ignore all errors.
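The mkdir-based probe could look roughly like the following. This is a sketch under assumptions: canCreateCgroup and the probe directory name are hypothetical, and the path passed in main is only an example of a cgroup v1 controller directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// canCreateCgroup tries to create and then remove a throwaway directory
// directly under parent. Success suggests that cgroups can be created there.
func canCreateCgroup(parent string) bool {
	probe := filepath.Join(parent, fmt.Sprintf("runc-probe-%d", os.Getpid()))
	// os.Mkdir (not MkdirAll) fails if parent is missing or not writable.
	if err := os.Mkdir(probe, 0755); err != nil {
		return false
	}
	// Best-effort cleanup; an empty cgroup directory can be removed with rmdir.
	_ = os.Remove(probe)
	return true
}

func main() {
	// Example path only; the right directory depends on the cgroup layout.
	fmt.Println("cgroup writable:", canCreateCgroup("/sys/fs/cgroup/memory"))
}
```

The cost is one mkdir and one rmdir per probe, which is in line with the guess that the overhead is negligible.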
We can just safely use libcontainer.RootlessCgroupfs when we are not the root in the initial namespace.
We could implement runc ps without using cgroups (in a future PR).
I'm not familiar with CRIU, but I guess we only need to disable runc checkpoint and runc restore when runc is executed with a non-zero UID, regardless of whether we are in the initial namespace or in a userns.
We need to disable them only when runc is executed as non-zero UID (TODO: check capabilities instead?), regardless of whether we are in the initial namespace or in a userns.