-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
devicemapper: protect mounts against volume leaks #34542
Conversation
3abe6f7
to
bc0058f
Compare
I'm working on fixing this issue, which doesn't happen on my local machine.
|
4d30501
to
065a18a
Compare
This is necessary to make sure that setups which include things like `-v /:/rootfs` won't accidentally include the libdm mounts inside /var/lib/docker. libdm does not react well to operations on a device which has live mounts, and leaking mounts to other containers easily will cause many issues. Note that this still does not solve the entire issue, because our OCI configuration forces a MS_SHARED relabel of all mounts (which breaks our option). This is fixed in the next patch in this series. In addition, make the daemonRepo (/var/lib/docker/containers) MS_UNBINDABLE for similar reasons, to avoid /dev/shm mounts from being leaked into containers. Signed-off-by: Aleksa Sarai <asarai@suse.de>
By default, runc uses MS_SHARED as the rootfsPropagation, which can exacerbate problems with mountpoint leaks (the recursive relabeling will cause all MS_UNBINDABLE mounts to become MS_SHARED). This is necessary to ensure that libdm mounts don't leak into containers that have bindmounts containing /var/lib/docker. Signed-off-by: Aleksa Sarai <asarai@suse.de>
This test is enabled on all storage drivers to make sure that similar bugs in other storage drivers are detected earlier. Signed-off-by: Aleksa Sarai <asarai@suse.de>
065a18a
to
717ca78
Compare
What is the kernel version on Jenkins? Linux upstream has a patch in newer kernels (>3.18) that makes I believe this is the reason the test fails on older kernels. I'm investigating a workaround for older kernels but it's actually quite difficult to make |
@@ -1574,3 +1574,16 @@ func (s *DockerSuite) TestRunWithNanoCPUs(c *check.C) { | |||
c.Assert(err, check.NotNil) | |||
c.Assert(out, checker.Contains, "Conflicting options: Nano CPUs and CPU Period cannot both be set") | |||
} | |||
|
|||
// #34542 | |||
func (s *DockerSuite) TestRunVolumeLeakage(c *check.C) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this an API test. CLI tests are deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do API tests live?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending documentation for this is in #34407
The new suite is integration/<component>/
, so for this it would be integration/container/run_test.go
. Since that doesn't exist yet, you could add it to integration-cli/docker_api_container)create_test.go
.
Although, looking at this change again. Would it be possible to test this with a unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you could test this with a unit test, since it requires starting containers in order to make sure that -v /:/rootfs
won't cause unintended mount leakage.
@cyphar Can you explain a little bit how did it prevent mount point leaks into container? I could not grok it by reading commit message yet. So this is primarily a problem on older kernels, right? As you mentioned newer kernels will remove mounts from other mount namespaces when directory in daemon mount namespace is removed. We are taking care of it using a prestart plugin called |
Aha, you are making use of unbindable mounts. That sounds very interesting. Let me play with it. So first side affect of making /var/lib/docker/containers unbindable is that no container can bind mount it in. I mean |
There are two issues at play here:
So (1) is entirely solved by this current patchset, and I'm trying to figure out how to solve (2). My first idea for (2) was to make |
/cc @cpuguy83 |
oh, and @kolyshkin probably 😄 |
@cyphar for the first problem, I think deferred deletion already takes care of it? In Remove() flow, deferred deletion will delete device and then we will go on to remove container rootfs directory. That will yank the leaked mount point. And that means soon deferred deletion thread will claim the deferred deleted device. Even if one is not using deferred deletion, I think that might be fixable on new kernels. Say a device mount point has leaked into a container and then we try to remove that device. Device deletion fails as device is busy. Can't we first remove the directory containing container rootfs mount and then call into libdm to delete device. Removing directory will remove leaked mount point and after that device deletion will succeed. So flow will be something like. Remove() (In driver.go) But I also like the idea of making /var/log/docker/devicemapper unbindable. Not sure what else it will break. I remember that in the past containerd was creating some bind mounts of container rootfs. Not sure if that functionality is still around or not. Those operations will fail. |
@rhvgoyal Deferred deletion doesn't appear solve the problem. I believe the reason is that when requesting a deletion to be deferred, it still needs to be unmounted (but I'm by no means an LVM expert). Here's me trying it out:
Note though that I recently discovered that openSUSE has some
That sounds like it might work, I'll play with it and see whether it also solves the issue. The downside is that it won't solve the problem on older kernels (the current solution partially solves the problem on older kernels). |
Actually if we make /var/lib/docker/devicemapper unbindable, only bind mounts underneath are denied. I can still mount thin devices on /var/lib/docker/devicemapper/container1 and then /var/lib/docker/devicemapper/container1 can be bind mounted somewhere else. So looks like making /var/lib/docker/devicemapper/ unbindable might be fine. |
@rhvgoyal Yeah, making |
@cyphar deferred deletion works for me. I am using Fedora26 docker. Deferred Removal Enabled: true % docker run -d --name testA busybox top So it worked for me. In fact if deferred deletion is on, then you should not have got an error saying failed to delete device. Because that's the whole point of deferred deletion. Despite the fact that device could not be deleted return success to caller and try deferred deletion later with the help of a worker thread. So something is missing from your setup. |
// Always use MS_PRIVATE for rootfs propagation to make sure that we | ||
// don't leak any MS_UNBINDABLE mounts into the container's namespace | ||
// while its resolving volume bindmounts. | ||
RootfsPropagation: "private", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have to look at the logic more closely. I am forgetting the details. So this will apply "rprivate" on "/" in container mount namespace? If yes, this will break "shared/slave" volume mounts? "-v /mnt/vol1:/foo:slave".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my reading of libcontainer/rootfs_linux.go
this only changes how the host's /
mount is set in the containers mountns before setting up the container's mounts. I'm actually very confused why MS_SLAVE|MS_REC
is the default, as that seems to me like it would cause a lot of issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if I do a slave shared volume, right now docker seems to be setting rootfsPropagation=rslave and I think that will reset the unbindable property of /var/lib/docker/devicemapper.
So first question is that does it have to be rslave. I suspect we can get away with just slave
as well. That way it will not reset unbindable property of mount points inherited from parent.
Or we need to enahnace specs so that caller can pass in a list of unbindable mounts and runc sets these mounts unbindable before trying to do volume mounts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MS_SLAVE|MS_REC is libcontainer default. I thought somebody sets rootfsPropagation MS_PRIVATE | MS_REC in normal cases so libcontainer default does not kick in. Can't find though who does it.
Thought MS_SLAVE | MS_REC makes sense in a way that if anything leaks into container from host, it has slave relationship and if it is unmounted on host, it will get unmounted from container too, leading to less issues of busy mounts.
But, if we apply any property recursively (rshared, rprivate, rslave) on /
, it will reset unbindable
property of mount. So I doubt we can apply anything recursively on /
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhvgoyal I had a tenative runC patch that would specifically ignore rootfsPropagation
would explicitly restore MS_UNBINDABLE
mountpoints after doing MS_PRIVATE|MS_REC
(or whatever the option is set to). It didn't appear to work and I discovered that rootfsPropagation=private
fixed the issue as well, so stopped debugging. That method might work.
But as I mentioned we should still use the rmdir
method, and potentially do something more clever with EBUSY
in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought somebody sets rootfsPropagation MS_PRIVATE | MS_REC in normal cases so libcontainer default does not kick in. Can't find though who does it.
It's the libcontainer default, but when generating the spec the actual default is MS_PRIVATE|MS_REC
if you look at the spec generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not have private on the graphdriver root as well as unbindable?
Yeah, you're right. I tried it on Debian 9 and it also worked with deferred removal+deletion, which tells me that there's some weird bug going on on openSUSE. I double checked and it wasn't PEBKAC, so maybe there's some patch we've applied to However, I still think that handling this problem for non-deferred users is still a worthwhile exercise. I'm going to play around with the removal idea we've discussed and see if it can solve the problem without then need for |
I completely forgot about opencontainers/runc#1500 which attempts to fix a similar version of this issue. It's possible that might help in the instances where the kernel is too old to be able to handle the |
cc @rhatdan |
@rhvgoyal Looks like this does not solve the issue in CRI-O either. runc must be makeing the mount points recursively RPRIVATE or RSHARE or something. |
Closing in favour of #34573. |
This is necessary to make sure that setups which include things like
-v /:/rootfs
won't accidentally include the libdm mounts inside/var/lib/docker. libdm does not react well to operations on a device
which has live mounts, and leaking mounts to other containers easily
will cause many issues.
By default, runc uses MS_SHARED as the rootfsPropagation, which can
exacerbate problems with mountpoint leaks (the recursive relabeling will
cause all MS_UNBINDABLE mounts to become MS_SHARED). This is necessary
to ensure that libdm mounts don't leak into containers that have
bindmounts containing /var/lib/docker.
In addition, make the daemonRepo (/var/lib/docker/containers)
MS_UNBINDABLE for similar reasons, to avoid /dev/shm mounts from being
leaked into containers.
You can reproduce this issue by doing something like this on a
-s devicemapper
box:
I"ve added a test for this too.
Cute by Behzad No.
Signed-off-by: Aleksa Sarai asarai@suse.de