I am currently working on implementing checkpoint/restore at the Kubernetes level (kubernetes/kubernetes#97194), and during restore I have a problem with missing directories for bind mounts. The following situation currently gives me problems:
- Kubernetes creates something like `/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7` and, in my case, tells CRI-O to mount it at `/var/run/secrets/kubernetes.io/serviceaccount`.
- CRI-O mounts an empty `tmpfs` for `/run/secrets` and tells runc to bind mount it at `/run/secrets`.
- runc does exactly what it is expected to do: it bind mounts the external `tmpfs` at `/run/secrets` and bind mounts `/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7` at `/var/run/secrets/kubernetes.io/serviceaccount`.
- runc creates the directory for that mount in the bind-mounted `tmpfs` at `/run/secrets`.
At this point everything is done and working. What I do not totally understand is that runc seems to move the mounts of the container rootfs after it is done. I am not sure what is happening exactly, but if I look, in one of my tests, at `/var/lib/containers/storage/overlay/89579e5eb2910e11b8e8d852b1ee022272c789aa97670e92be388613e6984fe6/merged/run/secrets/`, I do not see the necessary directories. After a lot of printf debugging, however, I see that runc actually creates the directory at exactly this location. So the part I do not totally understand is that the directory is created at one location, but once the container is running I can see it, from outside the container, at some other location. It seems the mount is moved after/during container creation.
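The mount setup described above corresponds roughly to OCI `config.json` entries like the following (just a sketch; the `source` of the first mount is a placeholder for the tmpfs that CRI-O creates, not a real path):

```json
{
    "mounts": [
        {
            "destination": "/run/secrets",
            "type": "bind",
            "source": "/path/to/crio-created-tmpfs",
            "options": ["bind", "rw"]
        },
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7",
            "options": ["bind", "rw"]
        }
    ]
}
```

Because `/var/run` is typically a symlink to `/run`, the second destination ends up inside the first mount's `tmpfs`, which is why the mountpoint directory has to be created there.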
During restore I now have the following situation:
- Kubernetes re-creates the same directory to be mounted in the container at `/var/run/secrets/kubernetes.io/serviceaccount`.
- CRI-O still mounts an empty `tmpfs` for `/run/secrets`. This `tmpfs` is basically where the problem appears.
- runc prepares the restore of the container and recreates the directory structure just as during initial container creation, so that CRIU can mount the previous directories at the same places they used to be. runc, during restore, does not bind mount directories to create mount points (maybe this is the main problem and the reason for the errors I am seeing).
- CRIU restores the container and errors out because `/run/secrets` is empty and does not contain the necessary directories (`kubernetes.io/serviceaccount`).
Looking at the output of my printf debugging, I see that runc still creates the necessary directories for the bind mounting, but it does not help, because in the restore code path the bind mounts are not mounted by runc but later by CRIU. If it were a `tmpfs` inside the container, there would be no problem, because CRIU would include the content of the `tmpfs` in the checkpoint and mount it during restore. But it is an external `tmpfs` bind mounted into the container, and CRIU expects bind mounts to contain the same content during restore as during checkpoint.
The directory inside `/run/secrets` used to mount `/var/run/secrets/kubernetes.io/serviceaccount` could now be recreated by CRI-O, runc, or CRIU.
It would make sense to have it in CRI-O, because CRI-O wants to mount a directory, as told by Kubernetes, into a `tmpfs` created by CRI-O. But because runc does create the directory during normal container start-up, it would feel strange to have CRI-O create it only during restore.
It would also make sense to have runc create it, because runc creates the directory during initial container start-up; but because CRIU does all the mounting of bind mounts, maybe it should happen in CRIU.
It feels wrong to do it in CRIU, because up until now CRIU has expected all directories to be correctly populated before restoring a container and all its mounts.
So the directory creation could happen at any of those layers, but right now I do not see a perfect solution or location for it.
I think it will end up in runc, because runc also creates the directory during initial container start-up; but because I do not totally understand what happens with the mountpoints and how they are moved during container creation, I am not sure where in the code the right place would be.
Would this mean runc has to mount all bind mounts during restore, just to unmount them again before handing control to CRIU? Or should it create mountpoints in bind mounts during restore as well, maybe without mounting anything?
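The "create mountpoints without mounting" variant would essentially come down to a plain `mkdir -p` inside the rootfs for every mount destination before handing control to CRIU. A minimal sketch of the idea (the paths here are hypothetical stand-ins, not actual runc code):

```shell
# Hypothetical sketch: recreate the target directory of every mount
# inside the rootfs to be restored, without actually mounting anything.
ROOTFS=$(mktemp -d)   # stands in for the container rootfs

# The destinations would come from the container's OCI config;
# here just the one from the Kubernetes example above.
for dest in /run/secrets/kubernetes.io/serviceaccount; do
    mkdir -p "$ROOTFS$dest"
done

test -d "$ROOTFS/run/secrets/kubernetes.io/serviceaccount" && echo "mountpoint recreated"
```

CRIU would then find the mountpoint directories in place, the same way it does after a normal container start.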
Right now it feels complicated, and it is unclear how to solve this correctly with all the layers involved.
To reproduce this outside of Kubernetes, the following steps are necessary:

- Mount a `tmpfs`:

```shell
mount -t tmpfs tmpfs /tmp/test
```
- Start a container with the following mounts:

```json
{
    "destination": "/test",
    "type": "bind",
    "source": "/tmp/test",
    "options": [
        "bind",
        "rw"
    ]
},
{
    "destination": "/test/lower/directory",
    "type": "tmpfs",
    "source": "tmpfs",
    "options": [
        "rw"
    ]
}
```
- The directory `/tmp/test/lower/directory` will be created by runc before mounting the second mount.
- Checkpoint the container:

```shell
runc checkpoint
```
- Unmount `/tmp/test` and mount a new `tmpfs` at `/tmp/test`.
- `runc restore` → the same error from CRIU as described in the Kubernetes case:

```
(00.064613) 1: Error (criu/mount.c:2058): mnt: Unable to mount tmpfs /tmp/.criu.mntns.hFhF0a/12-0000000000/test/lower/directory (id=668): No such file or directory
(00.064627) 1: Error (criu/mount.c:2123): mnt: Can't mount at /tmp/.criu.mntns.hFhF0a/12-0000000000/test/lower/directory: No such file or directory
(00.064630) 1: mnt: Start with 0:/tmp/.criu.mntns.hFhF0a
(00.065896) Error (criu/mount.c:3421): mnt: Can't remove the directory /tmp/.criu.mntns.hFhF0a: Device or resource busy
(00.065911) Error (criu/cr-restore.c:2510): Restoring FAILED.
```
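The underlying effect can also be demonstrated without runc or CRIU: directories created inside a `tmpfs` do not survive replacing it with a fresh one. A minimal sketch, assuming util-linux `unshare` with user and mount namespace support, so no root is required:

```shell
# Show that a directory created inside a tmpfs is gone after the tmpfs
# is unmounted and a fresh one is mounted at the same place.
unshare -rm sh -euc '
    mkdir -p /tmp/demo
    mount -t tmpfs tmpfs /tmp/demo
    mkdir -p /tmp/demo/lower/directory   # what runc does at container start
    umount /tmp/demo
    mount -t tmpfs tmpfs /tmp/demo       # fresh tmpfs, as before restore
    test -d /tmp/demo/lower/directory && echo present || echo missing
'
```

This prints `missing`, which is exactly the state CRIU finds at restore time.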
Any ideas or suggestions on how and where this could be solved?
Currently I have a CRIU hack that creates the directories if they are missing, but this feels like the wrong solution because, as mentioned, CRIU expects the file system to be ready for restore.
CC: @avagin, @rst0git in case you have an idea at which layer the directory should be created.
I would say that the layer above runc has to create the directory, but as runc creates it during initial container start-up, maybe runc should also create it during restore. Not sure. Happy about any suggestions.