Tune mount mode of /dev/shm for pause container to support user namespace #6911

@jiangliu

Description

What is the problem you're trying to solve

While doing a PoC to enable pod-level user namespaces, the containerd/CRI/runc stack fails to mount /dev/shm for the pause container, with this error message:

{"file":"github.com/opencontainers/runc/utils.go:62","func":"main.fatalWithCode","level":"error","msg":"runc create failed: unable to start container process: error during container init: error mounting \"/home/wanglei01/opt/open/go_project/containerd_upstream/bin/run/containerd/io.containerd.grpc.v1.cri/sandboxes/ab8749035cbfcad346798a9d7ed7782e5d7f84719adf4ab8964e3ac176aaf185/shm\" to rootfs at \"/dev/shm\": mount /proc/self/fd/6:/dev/shm (via /proc/self/fd/8), flags: 0x5021: operation not permitted","time":"2022-05-09T23:10:26+08:00"}
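
The flags value in the error message can be decoded with the MS_* constants from the Linux header linux/mount.h. The following is a minimal sketch; decodeMountFlags is a hypothetical helper written for illustration and is not part of runc or containerd, and the constants are copied locally so the snippet has no external dependencies.

```go
package main

import (
	"fmt"
	"strings"
)

// Mount flag values as defined in the Linux UAPI header linux/mount.h.
var flagNames = []struct {
	bit  uintptr
	name string
}{
	{0x1, "MS_RDONLY"},
	{0x2, "MS_NOSUID"},
	{0x4, "MS_NODEV"},
	{0x8, "MS_NOEXEC"},
	{0x20, "MS_REMOUNT"},
	{0x1000, "MS_BIND"},
	{0x4000, "MS_REC"},
}

// decodeMountFlags (hypothetical helper) renders a mount(2) flags value
// as a "|"-separated list of flag names. Bits that are not in the table
// above are appended as a single hex value.
func decodeMountFlags(flags uintptr) string {
	var names []string
	for _, f := range flagNames {
		if flags&f.bit != 0 {
			names = append(names, f.name)
			flags &^= f.bit
		}
	}
	if flags != 0 {
		names = append(names, fmt.Sprintf("0x%x", flags))
	}
	return strings.Join(names, "|")
}

func main() {
	// Prints: MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC
	fmt.Println(decodeMountFlags(0x5021))
}
```

Under this decoding, 0x5021 is a recursive read-only bind remount. It carries none of the nosuid, nodev, or noexec flags that the sandbox tmpfs was mounted with. This suggests, but does not prove, that the failing step is the read-only remount of the bind mount rather than the initial bind.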

Describe the solution you'd like

Another fact: two instances of /dev/shm are mounted for the pause container. One is created by the OCI default configuration, the other by sandboxContainerSpec.

        {
            "destination": "/dev/shm",
            "type": "tmpfs",
            "source": "shm",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "mode=1777",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "bind",
            "source": "/home/wanglei01/opt/open/go_project/containerd_upstream/bin/run/containerd/io.containerd.grpc.v1.cri/sandboxes/831fa6775cbc84f1dcb58821343853dca46b576b8bc0efd33d88f0552d8d6c3c/shm",
            "options": [
                "rbind",
                "ro"
            ]
        },

We have tried several ways to fix the issue:

  1. Change ro in the second instance to rw.
  2. Remove the second instance.
  3. Remove the first instance and change ro in the second instance to rw.

I think option 3 is the best solution. Any suggestions?
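
To make option 3 concrete, here is a minimal sketch of the mount-list transformation it implies. The Mount struct only mirrors the fields shown in the JSON above, and replaceDevShm is a hypothetical helper. Neither is the actual containerd implementation, and the paths in main are placeholders.

```go
package main

import "fmt"

// Mount mirrors the mount fields shown in the spec excerpt above.
// It is a local stand-in for illustration, not the runtime-spec type.
type Mount struct {
	Destination string
	Type        string
	Source      string
	Options     []string
}

// replaceDevShm (hypothetical helper) drops every existing /dev/shm
// entry, including the default tmpfs, and appends a single writable
// bind mount of the sandbox shm directory.
func replaceDevShm(mounts []Mount, sandboxShm string) []Mount {
	out := make([]Mount, 0, len(mounts)+1)
	for _, m := range mounts {
		if m.Destination == "/dev/shm" {
			continue
		}
		out = append(out, m)
	}
	return append(out, Mount{
		Destination: "/dev/shm",
		Type:        "bind",
		Source:      sandboxShm,
		Options:     []string{"rbind", "rw"},
	})
}

func main() {
	mounts := []Mount{
		{Destination: "/dev/shm", Type: "tmpfs", Source: "shm",
			Options: []string{"nosuid", "noexec", "nodev", "mode=1777", "size=65536k"}},
		{Destination: "/proc", Type: "proc", Source: "proc"},
	}
	// "/run/sandbox/shm" is a placeholder path for this example.
	for _, m := range replaceDevShm(mounts, "/run/sandbox/shm") {
		fmt.Printf("%s %s %v\n", m.Destination, m.Type, m.Options)
	}
}
```

With this shape, only one /dev/shm entry reaches the runtime, and no read-only remount is requested for it.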

Additional context

When the IPC mode is Pod, containerd/CRI creates a writable tmpfs for the pod and mounts it on a directory in the function setupSandboxFiles:

	// Setup sandbox /dev/shm.
	if config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetIpc() == runtime.NamespaceMode_NODE {
		if _, err := c.os.Stat(devShm); err != nil {
			return fmt.Errorf("host %q is not available for host ipc: %w", devShm, err)
		}
	} else {
		sandboxDevShm := c.getSandboxDevShm(id)
		if err := c.os.MkdirAll(sandboxDevShm, 0700); err != nil {
			return fmt.Errorf("failed to create sandbox shm: %w", err)
		}
		shmproperty := fmt.Sprintf("mode=1777,size=%d", defaultShmSize)
		if err := c.os.(osinterface.UNIX).Mount("shm", sandboxDevShm, "tmpfs", uintptr(unix.MS_NOEXEC|unix.MS_NOSUID|unix.MS_NODEV), shmproperty); err != nil {
			return fmt.Errorf("failed to mount sandbox shm: %w", err)
		}
	}

Later, the writable tmpfs is bind-mounted read-only to /dev/shm for the pause container in the function sandboxContainerSpec:

	// It's fine to generate the spec before the sandbox /dev/shm
	// is actually created.
	sandboxDevShm := c.getSandboxDevShm(id)
	if nsOptions.GetIpc() == runtime.NamespaceMode_NODE {
		sandboxDevShm = devShm
	}
	specOpts = append(specOpts, oci.WithMounts([]runtimespec.Mount{
		{
			Source:      sandboxDevShm,
			Destination: devShm,
			Type:        "bind",
			Options:     []string{"rbind", "ro"},
		},
		// Add resolv.conf for katacontainers to setup the DNS of pod VM properly.
		{
			Source:      c.getResolvPath(id),
			Destination: resolvConfPath,
			Type:        "bind",
			Options:     []string{"rbind", "ro"},
		},
	}))

The above flow works for pods without a user namespace but fails when the pod-level user namespace is enabled. It looks like a limitation or restriction of Linux, though I have not yet found the exact Linux kernel code that causes the behavior.

The following experiment gives a hint:

root@f8ff64bfc7d8:/kata# unshare -Umr /bin/bash
root@f8ff64bfc7d8:/kata# mount -o bind,rw /dev/shm /dev/shm
root@f8ff64bfc7d8:/kata# mount -o bind,ro /dev/shm /dev/shm
mount: /dev/shm: filesystem was mounted, but any subsequent operation failed: Unknown error 5005.
