I am currently working on implementing checkpoint/restore at the Kubernetes level (kubernetes/kubernetes#97194), and during restore I have a problem with missing directories for bind mounts. The following situation currently gives me problems:
- Kubernetes creates something like `/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7` and, in my case, tells CRI-O to mount it at `/var/run/secrets/kubernetes.io/serviceaccount`.
- CRI-O mounts an empty `tmpfs` for `/run/secrets` and tells runc to bind mount it at `/run/secrets`.
- runc does exactly what it is expected to do: it bind mounts the external `tmpfs` at `/run/secrets` and bind mounts `/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7` at `/var/run/secrets/kubernetes.io/serviceaccount`.
- runc creates the directory for that mount in the bind-mounted `tmpfs` at `/run/secrets`.
At this point everything is done and working. What I do not totally understand is that runc seems to move the mounts of the container rootfs after it is done. I am not sure what is happening exactly, but if I look, in one of my tests, at `/var/lib/containers/storage/overlay/89579e5eb2910e11b8e8d852b1ee022272c789aa97670e92be388613e6984fe6/merged/run/secrets/`, I do not see the necessary directories. After a lot of printf debugging, however, I see that runc actually creates the directory at exactly this location. So the part I do not totally understand is that the directory is created at one location, but once the container is running I can see it, from outside the container, at some other location. It seems the mount is moved after/during container creation.
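The mount setup described above corresponds roughly to OCI `config.json` entries like the following (just a sketch; the `source` of the first mount is a placeholder for the tmpfs that CRI-O creates, not a real path):

```json
{
    "mounts": [
        {
            "destination": "/run/secrets",
            "type": "bind",
            "source": "/path/to/crio-created-tmpfs",
            "options": ["bind", "rw"]
        },
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/var/lib/kubelet/pods/36dfe704-5fd4-4ec5-aaf1-9d7375364de0/volumes/kubernetes.io~secret/default-token-d2cs7",
            "options": ["bind", "rw"]
        }
    ]
}
```

Because `/var/run` is typically a symlink to `/run`, the second destination ends up inside the first mount's `tmpfs`, which is why the mountpoint directory has to be created there.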
During restore I now have the following situation:
- Kubernetes re-creates the same directory to be mounted in the container at `/var/run/secrets/kubernetes.io/serviceaccount`.
- CRI-O still mounts an empty `tmpfs` for `/run/secrets`. This `tmpfs` is basically where the problem appears.
- runc prepares the restore of the container and recreates the directory structure just as during initial container creation, so that CRIU can mount the previous directories at the same places they used to be. runc, during restore, does not bind mount directories to create mount points (maybe this is the main problem and the reason for the errors I am seeing).
- CRIU restores the container and errors out because `/run/secrets` is empty and does not contain the necessary directories (`kubernetes.io/serviceaccount`).
Looking at the output of my printf debugging, I see that runc still creates the necessary directories for the bind mounting, but it does not help, because in the restore code path the bind mounts are not mounted by runc but later by CRIU. If it were a `tmpfs` inside the container, there would be no problem, because CRIU would include the content of the `tmpfs` in the checkpoint and mount it during restore. But it is an external `tmpfs` bind mounted into the container, and CRIU expects bind mounts to contain the same content during restore as during checkpoint.
The directory inside `/run/secrets` used to mount `/var/run/secrets/kubernetes.io/serviceaccount` could now be recreated by CRI-O, runc, or CRIU.
It would make sense to have it in CRI-O, because CRI-O wants to mount a directory, as told by Kubernetes, into a `tmpfs` created by CRI-O. But because runc does create the directory during normal container start-up, it would feel strange to have CRI-O create it only during restore.
It would also make sense to have runc create it, because runc creates the directory during initial container start-up; but because CRIU does all the mounting of bind mounts, maybe it should happen in CRIU.
It feels wrong to do it in CRIU, because up until now CRIU has expected all directories to be correctly populated before restoring a container and all its mounts.
So the directory creation could happen at any of those layers, but right now I do not see a perfect solution or location for it.
I think it will end up in runc, because runc also creates the directory during initial container start-up; but because I do not totally understand what happens with the mountpoints and how they are moved during container creation, I am not sure where in the code the right place would be.
Would this mean runc has to mount all bind mounts during restore, just to unmount them again before handing control to CRIU? Or should it create mountpoints in bind mounts during restore as well, maybe without mounting anything?
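The "create mountpoints without mounting" variant would essentially come down to a plain `mkdir -p` inside the rootfs for every mount destination before handing control to CRIU. A minimal sketch of the idea (the paths here are hypothetical stand-ins, not actual runc code):

```shell
# Hypothetical sketch: recreate the target directory of every mount
# inside the rootfs to be restored, without actually mounting anything.
ROOTFS=$(mktemp -d)   # stands in for the container rootfs

# The destinations would come from the container's OCI config;
# here just the one from the Kubernetes example above.
for dest in /run/secrets/kubernetes.io/serviceaccount; do
    mkdir -p "$ROOTFS$dest"
done

test -d "$ROOTFS/run/secrets/kubernetes.io/serviceaccount" && echo "mountpoint recreated"
```

CRIU would then find the mountpoint directories in place, the same way it does after a normal container start.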
Right now it feels complicated, and it is unclear how to solve this correctly with all the layers involved.
To reproduce this outside of Kubernetes, the following steps are necessary:

- Mount a `tmpfs`:

```shell
mount -t tmpfs tmpfs /tmp/test
```
- Start a container with the following mounts:

```json
{
    "destination": "/test",
    "type": "bind",
    "source": "/tmp/test",
    "options": [
        "bind",
        "rw"
    ]
},
{
    "destination": "/test/lower/directory",
    "type": "tmpfs",
    "source": "tmpfs",
    "options": [
        "rw"
    ]
}
```
- The directory `/tmp/test/lower/directory` will be created by runc before mounting the second mount.
- Checkpoint the container:

```shell
runc checkpoint
```
- Unmount `/tmp/test` and mount a new `tmpfs` at `/tmp/test`.
- `runc restore` → the same error from CRIU as described in the Kubernetes case:

```
(00.064613) 1: Error (criu/mount.c:2058): mnt: Unable to mount tmpfs /tmp/.criu.mntns.hFhF0a/12-0000000000/test/lower/directory (id=668): No such file or directory
(00.064627) 1: Error (criu/mount.c:2123): mnt: Can't mount at /tmp/.criu.mntns.hFhF0a/12-0000000000/test/lower/directory: No such file or directory
(00.064630) 1: mnt: Start with 0:/tmp/.criu.mntns.hFhF0a
(00.065896) Error (criu/mount.c:3421): mnt: Can't remove the directory /tmp/.criu.mntns.hFhF0a: Device or resource busy
(00.065911) Error (criu/cr-restore.c:2510): Restoring FAILED.
```
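The underlying effect can also be demonstrated without runc or CRIU: directories created inside a `tmpfs` do not survive replacing it with a fresh one. A minimal sketch, assuming util-linux `unshare` with user and mount namespace support, so no root is required:

```shell
# Show that a directory created inside a tmpfs is gone after the tmpfs
# is unmounted and a fresh one is mounted at the same place.
unshare -rm sh -euc '
    mkdir -p /tmp/demo
    mount -t tmpfs tmpfs /tmp/demo
    mkdir -p /tmp/demo/lower/directory   # what runc does at container start
    umount /tmp/demo
    mount -t tmpfs tmpfs /tmp/demo       # fresh tmpfs, as before restore
    test -d /tmp/demo/lower/directory && echo present || echo missing
'
```

This prints `missing`, which is exactly the state CRIU finds at restore time.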
Any ideas or suggestions on how and where this could be solved?
Currently I have a CRIU hack that creates the directories if they are missing, but this feels like the wrong solution because, as mentioned, CRIU expects the file system to be ready for restore.
CC: @avagin, @rst0git in case you have an idea at which layer the directory should be created.
I would say that the layer above runc has to create the directory, but as runc creates it during initial container start-up, maybe runc should also create it during restore. Not sure. Happy about any suggestions.