Don't always enable rootless mode in userns

## Problem

In #1688 we broke "Docker-in-LXD":

```
$ lxc launch ubuntu:18.04 foo -c security.nesting=true
$ lxc shell foo
foo# curl -fsSL get.docker.com -o get-docker.sh && sh get-docker.sh
foo# exit
$ lxc file push /usr/local/sbin/runc foo/usr/bin/docker-runc
$ lxc shell foo
foo# cat /proc/self/uid_map
(we are in userns here)
foo# docker run -it --rm busybox
docker: Error response from daemon: OCI runtime create failed: cannot specify gid= mount options for unmapped gid in rootless containers: unknown.
```

This happens because runc enables "rootless mode" whenever it runs in a user namespace, but "Docker-in-LXD" does not expect runc to enable it.


## What "rootless mode" actually does

1. [Honor `$XDG_RUNTIME_DIR`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/main.go#L70)
2. [Switch the cgroup manager to `libcontainer.RootlessCgroupfs`](https://github.com/opencontainers/runc/blob/dd56ece8236d6d9e5bed4ea0c31fe53c7b873ff4/utils_linux.go#L46)
3. [Disable cgroup-specific features such as `runc ps`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/ps.go#L32) and [OOM notification](https://github.com/opencontainers/runc/blob/dd56ece8236d6d9e5bed4ea0c31fe53c7b873ff4/libcontainer/container_linux.go#L613)
4. [Disable `runc checkpoint`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/checkpoint.go#L51) and [`runc restore`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/restore.go#L103)
5. [Make sure `config.json` contains userns and id mappings if euid != 0](https://github.com/opencontainers/runc/blob/b222ea4469dd5e304f3f6cf7a2482860285639a2/libcontainer/configs/validate/rootless.go#L41)
6. [Make sure `config.json` does not contain `uid=` and `gid=` for mounts](https://github.com/opencontainers/runc/blob/b222ea4469dd5e304f3f6cf7a2482860285639a2/libcontainer/configs/validate/rootless.go#L80)
7. [Write `"deny"` to `/proc/$PID/setgroups` if single-entry mapping is specified](https://github.com/opencontainers/runc/blob/0cbfd8392fff2462701507296081e835b3b0b99a/libcontainer/nsenter/nsexec.c#L690)
8. [Disable additional groups](https://github.com/opencontainers/runc/blob/3a079311a7e9afeab4b722616cefd2f9b4129104/libcontainer/init_linux.go#L286), but actually we don't need to do this. (https://github.com/opencontainers/runc/issues/1835)

For "Docker-in-LXD", we need none of these, because runc is already executed in a userns and cgroups are also available.

In #1688, we enabled "rootless mode" in userns in order to support rootless img/buildkit/buildah/containerd/docker/podman, but we actually only need items 1, 2, and 3 for these use cases.
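To make item 6 concrete (it is the check behind the Docker-in-LXD error shown in the Problem section), here is a simplified sketch of scanning a mount's option string for `uid=`/`gid=`. The helper name `idMountOptions` is hypothetical and this is not the code in `rootless.go`; the sample option string is only an assumption about what a devpts mount in a Docker-generated `config.json` may look like.

```go
package main

import (
	"fmt"
	"strings"
)

// idMountOptions returns every "uid=..." or "gid=..." entry found in a
// comma-separated mount option string. Hypothetical helper for illustration.
func idMountOptions(data string) []string {
	var found []string
	for _, opt := range strings.Split(data, ",") {
		if strings.HasPrefix(opt, "uid=") || strings.HasPrefix(opt, "gid=") {
			found = append(found, opt)
		}
	}
	return found
}

func main() {
	// Assumed devpts-style options; a rootless-mode validator would
	// reject this mount because of the gid= entry.
	data := "newinstance,ptmxmode=0666,mode=0620,gid=5"
	if opts := idMountOptions(data); len(opts) > 0 {
		fmt.Println("rootless validation would reject:", opts)
	}
}
```

In Docker-in-LXD the gid is in fact usable inside the container's userns, which is why this check is unnecessary there.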

## Proposal

### Step 1: fix Docker-in-LXD regression (PR: https://github.com/opencontainers/runc/pull/1833 / Closed)

Change `isRootless` as follows:

```go
func isRootless(context *cli.Context) (bool, error) {
  if context != nil {
  ...
  }
  // Fall back to $USER: rootless only if it is set and not "root".
  u := os.Getenv("USER")
  return u != "" && u != "root", nil
}
```

* When runc is executed in userns via Docker-in-LXD, `isRootless()` returns `false`, because LXD sets `$USER` to "root".
* When runc is executed in userns via rootless img/buildkit/buildah/containerd/docker/podman, `isRootless()` returns `true`, because we don't change environment variables after unsharing the userns and mapping UID=0 to the current user.
* When runc is executed as a regular user in the initial namespace, `isRootless()` returns `true`.
* When runc is executed as root in the initial namespace, `isRootless()` returns `false`.
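The cases above can be checked with a small table-driven sketch. The function name `isRootlessUser` and the user name `alice` are assumptions made for illustration; only the `$USER` comparison itself comes from the proposal.

```go
package main

import (
	"fmt"
	"os"
)

// isRootlessUser applies the Step 1 heuristic to a $USER value:
// rootless mode is enabled only if the value is set and is not "root".
func isRootlessUser(user string) bool {
	return user != "" && user != "root"
}

func main() {
	// Scenarios from the list above; the $USER values are illustrative.
	cases := []struct{ scenario, user string }{
		{"Docker-in-LXD (USER=root)", "root"},
		{"rootless tool inside its own userns", "alice"},
		{"regular user, initial namespace", "alice"},
		{"root, initial namespace", "root"},
		{"USER unset", ""},
	}
	for _, c := range cases {
		fmt.Printf("%-38s rootless=%v\n", c.scenario, isRootlessUser(c.user))
	}
	// The same check applied to the current process environment.
	fmt.Println("current process:", isRootlessUser(os.Getenv("USER")))
}
```

Note that an unset `$USER` is treated as non-rootless, which matters for daemons started with a minimal environment.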

Corner cases:
* When runc is executed in userns via Docker-in-rootless-Docker ("rootless dind") as root in the container, `isRootless()` returns `false`, and runc is unlikely to work. The `dockerd` in the container would need to specify `runc --rootless` explicitly in this case. (The user would probably also need to launch `dockerd` with `--rootless` explicitly.)
* When runc is executed in "rootless dind" as a non-root user in the container, `isRootless()` returns `true`, and runc is likely to work.

### Step 2: refactor the rootless mode (PR: https://github.com/opencontainers/runc/pull/1862)

> 1. [Honor `$XDG_RUNTIME_DIR`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/main.go#L70)

This probably needs to be honored when `u := os.Getenv("USER"); u != "" && u != "root"`.
(Note that we shouldn't check the UID in the current namespace, because we still want to honor `$XDG_RUNTIME_DIR` after unsharing the userns and mapping UID=0 to the current user.)

Or maybe we can always honor this variable, but that could break compatibility when runc is executed with UID=0, `$USER=root`, and `$XDG_RUNTIME_DIR=/run/user/0`.
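A minimal sketch of the first option (honor `$XDG_RUNTIME_DIR` only for a non-root `$USER`) might look like the following. The function name `stateRoot`, the default `/run/runc`, and the `runc` subdirectory are assumptions of this sketch rather than a statement of what runc must do.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultRoot is the assumed state directory when $XDG_RUNTIME_DIR is not honored.
const defaultRoot = "/run/runc"

// stateRoot picks the state directory from $USER and $XDG_RUNTIME_DIR.
// It deliberately ignores the UID in the current namespace, so the result
// stays the same after UID 0 has been mapped to the current user.
func stateRoot(user, xdgRuntimeDir string) string {
	nonRootUser := user != "" && user != "root"
	if nonRootUser && xdgRuntimeDir != "" {
		return filepath.Join(xdgRuntimeDir, "runc")
	}
	return defaultRoot
}

func main() {
	fmt.Println(stateRoot(os.Getenv("USER"), os.Getenv("XDG_RUNTIME_DIR")))
	// The compatibility case: root with /run/user/0 keeps the default.
	fmt.Println(stateRoot("root", "/run/user/0"))
}
```

With this choice, the root case with `$XDG_RUNTIME_DIR=/run/user/0` keeps using the default directory, so existing setups are not affected.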


> 2. [Switch the cgroup manager to `libcontainer.RootlessCgroupfs`](https://github.com/opencontainers/runc/blob/dd56ece8236d6d9e5bed4ea0c31fe53c7b873ff4/utils_linux.go#L46)

~~We should detect cgroup availability explicitly by probably trying `mkdir /sys/fs/cgroup/foo/bar` or something similar. I guess the overhead is negligible.
Or just remove `libcontainer.RootlessCgroupfs` manager and ignore all errors.~~
We can just safely use `libcontainer.RootlessCgroupfs` when we are not the root in the initial namespace.
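One way to decide "not root in the initial namespace" is to look at `/proc/self/uid_map`: in the initial user namespace it is a single identity mapping covering the whole 32-bit range. The sketch below assumes that heuristic; the names `isInitialUserNSMap` and `useRootlessCgroups` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isInitialUserNSMap reports whether the uid_map content is the single
// full-range identity mapping "0 0 4294967295" of the initial user namespace.
func isInitialUserNSMap(uidMap string) bool {
	lines := strings.Split(strings.TrimSpace(uidMap), "\n")
	if len(lines) != 1 {
		return false
	}
	f := strings.Fields(lines[0])
	return len(f) == 3 && f[0] == "0" && f[1] == "0" && f[2] == "4294967295"
}

// useRootlessCgroups returns true unless we are root in the initial namespace.
func useRootlessCgroups(euid int, uidMap string) bool {
	return !(euid == 0 && isInitialUserNSMap(uidMap))
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Println("cannot read uid_map:", err)
		return
	}
	fmt.Println("use rootless cgroup manager:", useRootlessCgroups(os.Geteuid(), string(data)))
}
```

An unreadable or empty map is treated as "not the initial namespace" here, which errs on the side of the more tolerant cgroup manager.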

> 3. [Disable cgroup-specific features such as `runc ps`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/ps.go#L32) and [OOM notification](https://github.com/opencontainers/runc/blob/dd56ece8236d6d9e5bed4ea0c31fe53c7b873ff4/libcontainer/container_linux.go#L613)

We could implement `runc ps` without using cgroups (in a separate, future PR).

> 4. [Disable `runc checkpoint`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/checkpoint.go#L51) and [`runc restore`](https://github.com/opencontainers/runc/blob/0e561642f81e84ebd0b3afd6ec510c75a2ccb71b/restore.go#L103)

I'm not familiar with CRIU, but I guess we only need to disable them when runc is executed with a non-zero UID, regardless of whether we are in the initial namespace or in a userns.

> 5. [Make sure `config.json` contains userns and id mappings if euid != 0](https://github.com/opencontainers/runc/blob/b222ea4469dd5e304f3f6cf7a2482860285639a2/libcontainer/configs/validate/rootless.go#L41)
> 6. [Make sure `config.json` does not contain `uid=` and `gid=` for mounts](https://github.com/opencontainers/runc/blob/b222ea4469dd5e304f3f6cf7a2482860285639a2/libcontainer/configs/validate/rootless.go#L80)
> 7. [Write `"deny"` to `/proc/$PID/setgroups` if single-entry mapping is specified](https://github.com/opencontainers/runc/blob/0cbfd8392fff2462701507296081e835b3b0b99a/libcontainer/nsenter/nsexec.c#L690)

We need to apply these only when runc is executed with a non-zero UID (TODO: check capabilities instead?), regardless of whether we are in the initial namespace or in a userns.

