
Don't always enable rootless mode in userns #1837

@AkihiroSuda


Problem

In #1688 we broke "Docker-in-LXD":

$ lxc launch ubuntu:18.04 foo -c security.nesting=true
$ lxc shell foo
foo# curl -fsSL get.docker.com -o get-docker.sh && sh get-docker.sh
foo# exit
$ lxc file push /usr/local/sbin/runc foo/usr/bin/docker-runc
$ lxc shell foo
foo# cat /proc/self/uid_map
(we are in userns here)
foo# docker run -it --rm busybox
docker: Error response from daemon: OCI runtime create failed: cannot specify gid= mount options for unmapped gid in rootless containers: unknown.

This happens because runc enables "rootless mode" whenever it runs in a user namespace, but "Docker-in-LXD" does not expect runc to enable it.
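For illustration, here is a minimal sketch of how a process could tell from /proc/self/uid_map that it is inside a user namespace, as in the LXD shell above. The function name inUserNS is hypothetical and this is not runc's actual detection code; it relies on the behaviour documented in user_namespaces(7) that the initial namespace shows a single "0 0 4294967295" entry.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// inUserNS is a hypothetical helper (not runc's API). It reports whether the
// given uid_map content looks like a non-initial user namespace.
func inUserNS(uidMap string) bool {
	trimmed := strings.TrimSpace(uidMap)
	if trimmed == "" {
		// Assumption: an empty map means a freshly created userns
		// whose IDs have not been mapped yet.
		return true
	}
	lines := strings.Split(trimmed, "\n")
	if len(lines) != 1 {
		// The initial namespace has exactly one entry.
		return true
	}
	fields := strings.Fields(lines[0])
	if len(fields) != 3 {
		// Unparsable content: assume the initial namespace.
		return false
	}
	var v [3]int64
	for i, f := range fields {
		n, err := strconv.ParseInt(f, 10, 64)
		if err != nil {
			return false
		}
		v[i] = n
	}
	// The initial namespace maps the full 32-bit range onto itself.
	return !(v[0] == 0 && v[1] == 0 && v[2] == 4294967295)
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("cannot read uid_map:", err)
		return
	}
	fmt.Println("in user namespace:", inUserNS(string(data)))
}
```

The point of the sketch is that "running in a userns" is detectable on its own, and is a separate question from whether the rootless restrictions listed below are needed.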

What "rootless mode" actually does

  1. Honor $XDG_RUNTIME_DIR
  2. Switch the cgroup manager to libcontainer.RootlessCgroupfs
  3. Disable cgroup-specific features such as runc ps and OOM notification
  4. Disable runc checkpoint and runc restore
  5. Make sure config.json contains userns and id mappings if euid != 0
  6. Make sure config.json does not contain uid= and gid= for mounts
  7. Write "deny" to /proc/$PID/setgroups if single-entry mapping is specified
  8. Disable additional groups, but actually we don't need to do this. ([TODO] rootless: support spec.Process.User.AdditionalGids #1835)

For "Docker-in-LXD", we need none of these, because runc is already executed in a userns and cgroups are also available.

In #1688, we enabled "rootless mode" in userns to support rootless img/buildkit/buildah/containerd/docker/podman, but we actually only need 1, 2, and 3 for these use cases.

Proposal

Step 1: fix Docker-in-LXD regression (PR: #1833 / Closed)

Change isRootless as follows:

func isRootless(context *cli.Context) (bool, error) {
  if context != nil {
  ...
  }
  // Fall back to $USER: any non-empty, non-"root" user counts as rootless.
  u := os.Getenv("USER")
  return u != "" && u != "root", nil
}
  • When runc is executed in userns via Docker-in-LXD, isRootless() returns false, because LXD sets $USER to "root".
  • When runc is executed in userns via rootless img/buildkit/buildah/containerd/docker/podman, isRootless() returns true, because we don't change environment variables after unsharing the userns and mapping UID=0 to the current user.
  • When runc is executed as a regular user in the initial namespace, isRootless() returns true.
  • When runc is executed as root in the initial namespace, isRootless() returns false.

Corner cases:

  • When runc is executed in userns via Docker-in-rootless-Docker ("rootless dind"), as root in the container, isRootless() returns false, and runc is unlikely to work. The dockerd in the container would need to specify runc --rootless explicitly in this case. (And the user would probably need to launch dockerd with --rootless explicitly.)
  • When runc is executed in "rootless dind", as a non-root user in the container, isRootless() returns true, and runc is likely to work.
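The cases above, including the corner cases, depend only on the value of $USER and not on the namespace. A small sketch makes that visible. The helper rootlessFromUser and the user name "alice" are hypothetical and only restate the $USER fallback of the proposed isRootless.

```go
package main

import "fmt"

// rootlessFromUser mirrors the proposed $USER fallback: any non-empty,
// non-"root" user is treated as rootless. Hypothetical helper name.
func rootlessFromUser(user string) bool {
	return user != "" && user != "root"
}

func main() {
	// "alice" stands in for any unprivileged user name.
	scenarios := []struct{ name, user string }{
		{"Docker-in-LXD (userns, $USER=root)", "root"},
		{"rootless tool after unshare (userns, $USER kept)", "alice"},
		{"regular user, initial namespace", "alice"},
		{"root, initial namespace", "root"},
		{"rootless dind, root in the container", "root"},
		{"rootless dind, non-root in the container", "alice"},
	}
	for _, s := range scenarios {
		fmt.Printf("%-50s USER=%-6s rootless=%v\n",
			s.name, s.user, rootlessFromUser(s.user))
	}
}
```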

Step 2: refactor the rootless mode (PR: #1862)

  1. Honor $XDG_RUNTIME_DIR

Probably, this needs to be honored when u := os.Getenv("USER"); u != "" && u != "root".
(Note that we shouldn't check the UID in the current namespace, because we still want to honor $XDG_RUNTIME_DIR after unsharing the userns and mapping UID=0 to the current user.)

Or maybe we can always honor this variable, but that potentially breaks compatibility when runc is executed as UID=0, $USER=root, $XDG_RUNTIME_DIR=/run/user/0.
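As a rough sketch of the first option, the state directory could be chosen from $USER and $XDG_RUNTIME_DIR alone, without looking at the UID. The helper name stateRoot is hypothetical, and the default "/run/runc" and the "runc" subdirectory under $XDG_RUNTIME_DIR are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path"
)

// stateRoot picks the runtime state directory. Hypothetical helper:
// $XDG_RUNTIME_DIR is honored only when $USER is set and is not "root".
func stateRoot(user, xdgRuntimeDir string) string {
	const defaultRoot = "/run/runc" // assumed default location
	if user == "" || user == "root" || xdgRuntimeDir == "" {
		return defaultRoot
	}
	// Assumed layout: a "runc" subdirectory under $XDG_RUNTIME_DIR.
	return path.Join(xdgRuntimeDir, "runc")
}

func main() {
	fmt.Println(stateRoot(os.Getenv("USER"), os.Getenv("XDG_RUNTIME_DIR")))
}
```

With this rule, UID=0 with $USER=root and $XDG_RUNTIME_DIR=/run/user/0 keeps the default directory, which avoids the compatibility concern mentioned above.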

  2. Switch the cgroup manager to libcontainer.RootlessCgroupfs

We should probably detect cgroup availability explicitly, e.g. by trying mkdir /sys/fs/cgroup/foo/bar or something similar. I guess the overhead is negligible.
Or we could just remove the libcontainer.RootlessCgroupfs manager and ignore all errors.

We can safely use libcontainer.RootlessCgroupfs whenever we are not root in the initial namespace.
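The mkdir probe mentioned above could look roughly like the following. This is only a sketch: cgroupWritable and the probe directory name are hypothetical, and the base path passed in main is an assumption about where the cgroup hierarchy is mounted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cgroupWritable reports whether a directory can be created under base.
// Hypothetical helper: it creates one probe directory and removes it again.
func cgroupWritable(base string) bool {
	probe := filepath.Join(base, fmt.Sprintf("runc-probe-%d", os.Getpid()))
	if err := os.Mkdir(probe, 0o755); err != nil {
		// EACCES, EROFS, ENOENT, ...: treat cgroups as unavailable.
		return false
	}
	// Best-effort cleanup of the probe directory.
	_ = os.Remove(probe)
	return true
}

func main() {
	// Assumed mount point; a real check would use the detected hierarchy.
	fmt.Println("cgroup writable:", cgroupWritable("/sys/fs/cgroup"))
}
```

The cost is one mkdir and one rmdir per invocation, which fits the guess that the overhead is negligible.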

  3. Disable cgroup-specific features such as runc ps and OOM notification

We could implement runc ps without using cgroups (in a future PR).

  4. Disable runc checkpoint and runc restore

I'm not familiar with CRIU, but I guess we only need to disable them when runc is executed as a non-zero UID, regardless of whether we are in the initial namespace or in a userns.

  5. Make sure config.json contains userns and id mappings if euid != 0
  6. Make sure config.json does not contain uid= and gid= for mounts
  7. Write "deny" to /proc/$PID/setgroups if single-entry mapping is specified

We need to disable them only when runc is executed as a non-zero UID (TODO: check capabilities instead?), regardless of whether we are in the initial namespace or in a userns.
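To make the gating concrete, here is a sketch for item 6 only, reading the paragraph above as "apply the restriction only when euid != 0". The helper checkMounts is hypothetical and is not runc's validator. It takes each mount's option list as plain strings rather than the real config.json types.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkMounts rejects uid=/gid= mount options, but only when euid != 0.
// Hypothetical helper; mounts holds one option list per mount.
func checkMounts(euid int, mounts [][]string) error {
	if euid == 0 {
		// Root, in the initial namespace or in a userns: no restriction.
		return nil
	}
	for i, opts := range mounts {
		for _, o := range opts {
			if strings.HasPrefix(o, "uid=") || strings.HasPrefix(o, "gid=") {
				return fmt.Errorf("mount %d: option %q is not allowed when euid != 0", i, o)
			}
		}
	}
	return nil
}

func main() {
	// Example option lists, loosely modelled on a devpts mount.
	mounts := [][]string{{"nosuid", "noexec", "gid=5"}}
	fmt.Println(checkMounts(os.Geteuid(), mounts))
}
```

Under this reading, Docker-in-LXD (euid 0 inside the userns) would pass the check, which matches the goal of step 1.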
