What is the problem you're trying to solve
While doing a PoC to enable pod-level user namespaces, containerd/CRI/runc fails to mount /dev/shm for the pause container, with the following error message:
{"file":"github.com/opencontainers/runc/utils.go:62","func":"main.fatalWithCode","level":"error","msg":"runc create failed: unable to start container process: error during container init: error mounting \"/home/wanglei01/opt/open/go_project/containerd_upstream/bin/run/containerd/io.containerd.grpc.v1.cri/sandboxes/ab8749035cbfcad346798a9d7ed7782e5d7f84719adf4ab8964e3ac176aaf185/shm\" to rootfs at \"/dev/shm\": mount /proc/self/fd/6:/dev/shm (via /proc/self/fd/8), flags: 0x5021: operation not permitted","time":"2022-05-09T23:10:26+08:00"}
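The flags value 0x5021 in the error can be decoded by hand. The following is a minimal sketch, not runc code: the constant values are taken from the Linux UAPI header linux/mount.h, and the helper name decodeMountFlags is hypothetical. It shows that the failing call is a read-only remount of a recursive bind mount, and that nosuid, nodev and noexec are not part of the request.

```go
package main

import (
	"fmt"
	"strings"
)

// Mount flag values as defined in the Linux UAPI header <linux/mount.h>.
const (
	msRdonly  = 0x1
	msNosuid  = 0x2
	msNodev   = 0x4
	msNoexec  = 0x8
	msRemount = 0x20
	msBind    = 0x1000
	msRec     = 0x4000
)

// flagNames lists the flags this sketch knows about, in ascending bit order.
var flagNames = []struct {
	bit  uint64
	name string
}{
	{msRdonly, "MS_RDONLY"},
	{msNosuid, "MS_NOSUID"},
	{msNodev, "MS_NODEV"},
	{msNoexec, "MS_NOEXEC"},
	{msRemount, "MS_REMOUNT"},
	{msBind, "MS_BIND"},
	{msRec, "MS_REC"},
}

// decodeMountFlags (hypothetical helper) returns the names of the known bits
// set in flags, joined with "|". Bits it does not know are appended in hex.
func decodeMountFlags(flags uint64) string {
	var parts []string
	rest := flags
	for _, f := range flagNames {
		if flags&f.bit != 0 {
			parts = append(parts, f.name)
			rest &^= f.bit
		}
	}
	if rest != 0 {
		parts = append(parts, fmt.Sprintf("0x%x", rest))
	}
	return strings.Join(parts, "|")
}

func main() {
	// Prints: MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC
	fmt.Println(decodeMountFlags(0x5021))
}
```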
Describe the solution you'd like
Another fact: two instances of /dev/shm are mounted for the pause container. One is created by the OCI default configuration, the other by sandboxContainerSpec:
{
  "destination": "/dev/shm",
  "type": "tmpfs",
  "source": "shm",
  "options": [
    "nosuid",
    "noexec",
    "nodev",
    "mode=1777",
    "size=65536k"
  ]
},
{
  "destination": "/dev/shm",
  "type": "bind",
  "source": "/home/wanglei01/opt/open/go_project/containerd_upstream/bin/run/containerd/io.containerd.grpc.v1.cri/sandboxes/831fa6775cbc84f1dcb58821343853dca46b576b8bc0efd33d88f0552d8d6c3c/shm",
  "options": [
    "rbind",
    "ro"
  ]
},
We have tried several ways to fix the issue:
1. Change ro in the second instance to rw.
2. Remove the second instance.
3. Remove the first instance and change ro in the second instance to rw.
I feel option 3 is the best solution. Any suggestions?
Additional context
When the IPC mode is Pod, containerd/CRI creates a writable tmpfs for the pod and mounts it on a directory in the function setupSandboxFiles:
// Setup sandbox /dev/shm.
if config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetIpc() == runtime.NamespaceMode_NODE {
	if _, err := c.os.Stat(devShm); err != nil {
		return fmt.Errorf("host %q is not available for host ipc: %w", devShm, err)
	}
} else {
	sandboxDevShm := c.getSandboxDevShm(id)
	if err := c.os.MkdirAll(sandboxDevShm, 0700); err != nil {
		return fmt.Errorf("failed to create sandbox shm: %w", err)
	}
	shmproperty := fmt.Sprintf("mode=1777,size=%d", defaultShmSize)
	if err := c.os.(osinterface.UNIX).Mount("shm", sandboxDevShm, "tmpfs", uintptr(unix.MS_NOEXEC|unix.MS_NOSUID|unix.MS_NODEV), shmproperty); err != nil {
		return fmt.Errorf("failed to mount sandbox shm: %w", err)
	}
}
Later, this writable tmpfs is bind-mounted read-only to /dev/shm for the pause container in the function sandboxContainerSpec:
// It's fine to generate the spec before the sandbox /dev/shm
// is actually created.
sandboxDevShm := c.getSandboxDevShm(id)
if nsOptions.GetIpc() == runtime.NamespaceMode_NODE {
	sandboxDevShm = devShm
}
specOpts = append(specOpts, oci.WithMounts([]runtimespec.Mount{
	{
		Source:      sandboxDevShm,
		Destination: devShm,
		Type:        "bind",
		Options:     []string{"rbind", "ro"},
	},
	// Add resolv.conf for katacontainers to setup the DNS of pod VM properly.
	{
		Source:      c.getResolvPath(id),
		Destination: resolvConfPath,
		Type:        "bind",
		Options:     []string{"rbind", "ro"},
	},
}))
The above flow works for pods without a user namespace but fails when a pod-level user namespace is enabled. It looks like a limitation or restriction of Linux, though I have not found the exact Linux kernel code that causes this behavior.
The following experiment gives a hint:
root@f8ff64bfc7d8:/kata# unshare -Umr /bin/bash
root@f8ff64bfc7d8:/kata# mount -o bind,rw /dev/shm /dev/shm
root@f8ff64bfc7d8:/kata# mount -o bind,ro /dev/shm /dev/shm
mount: /dev/shm: filesystem was mounted, but any subsequent operation failed: Unknown error 5005.