
devicemapper: protect mounts against volume leaks #34542

Closed

Conversation

cyphar
Contributor

@cyphar cyphar commented Aug 17, 2017

This is necessary to make sure that setups which include things like
`-v /:/rootfs` won't accidentally include the libdm mounts inside
/var/lib/docker. libdm does not react well to operations on a device
which has live mounts, and leaking mounts into other containers will
easily cause many issues.

By default, runc uses MS_SHARED as the rootfsPropagation, which can
exacerbate problems with mountpoint leaks (the recursive relabeling will
cause all MS_UNBINDABLE mounts to become MS_SHARED). This is necessary
to ensure that libdm mounts don't leak into containers that have
bindmounts containing /var/lib/docker.

In addition, make the daemonRepo (/var/lib/docker/containers)
MS_UNBINDABLE for similar reasons, to prevent /dev/shm mounts from being
leaked into containers.

You can reproduce this issue by doing something like this on a -s devicemapper
box:

% docker run -d --name testA busybox top
% docker run -d --name testB -v /var/lib/docker:/docker busybox top
% docker rm -f testA
Error response from daemon: Driver devicemapper failed to remove root filesystem 8f3f1a40cda5f464839f356efa8c48447afc3408688921475c353e4ea08d0d58: failed to remove device cf0aadfb11388690d91af3bdb16634c38e24951266b1939aebe7d7cdab08b4c4: devicemapper: Error running RemoveDevice dm_task_run failed

I've added a test for this too.


Signed-off-by: Aleksa Sarai asarai@suse.de

@cyphar cyphar force-pushed the dm-protect-mounts-against-volume-leaks branch 2 times, most recently from 3abe6f7 to bc0058f Compare August 17, 2017 11:11
@cyphar
Contributor Author

cyphar commented Aug 17, 2017

I'm working on fixing this test failure, which doesn't happen on my local machine.

11:47:18 ----------------------------------------------------------------------
11:47:18 FAIL: docker_cli_run_unix_test.go:1579: DockerSuite.TestRunDevicemapperVolumeLeakage
11:47:18 
11:47:18 docker_cli_run_unix_test.go:1588:
11:47:18     c.Assert(err, check.IsNil)
11:47:18 ... value *errors.errorString = &errors.errorString{s:"\nCommand:  /usr/local/bin/docker rm -f testA\nExitCode: 1\nError:    exit status 1\nStdout:   \nStderr:   Error response from daemon: unable to remove filesystem for c1490c494a3cc207ff208f4e1b588335bc9ecc454139102f491f7e675c5f2cab: remove /var/lib/docker/containers/c1490c494a3cc207ff208f4e1b588335bc9ecc454139102f491f7e675c5f2cab/shm: device or resource busy\n\n\nFailures:\nExitCode was 1 expected 0\nExpected no error\n"} ("\nCommand:  /usr/local/bin/docker rm -f testA\nExitCode: 1\nError:    exit status 1\nStdout:   \nStderr:   Error response from daemon: unable to remove filesystem for c1490c494a3cc207ff208f4e1b588335bc9ecc454139102f491f7e675c5f2cab: remove /var/lib/docker/containers/c1490c494a3cc207ff208f4e1b588335bc9ecc454139102f491f7e675c5f2cab/shm: device or resource busy\n\n\nFailures:\nExitCode was 1 expected 0\nExpected no error\n")

@cyphar cyphar force-pushed the dm-protect-mounts-against-volume-leaks branch 4 times, most recently from 4d30501 to 065a18a Compare August 17, 2017 13:55
cyphar added 3 commits August 18, 2017 00:27
This is necessary to make sure that setups which include things like `-v
/:/rootfs` won't accidentally include the libdm mounts inside
/var/lib/docker. libdm does not react well to operations on a device
which has live mounts, and leaking mounts into other containers will
easily cause many issues.

Note that this still does not solve the entire issue, because our OCI
configuration forces a MS_SHARED relabel of all mounts (which breaks our
option). This is fixed in the next patch in this series.

In addition, make the daemonRepo (/var/lib/docker/containers)
MS_UNBINDABLE for similar reasons, to prevent /dev/shm mounts from being
leaked into containers.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
By default, runc uses MS_SHARED as the rootfsPropagation, which can
exacerbate problems with mountpoint leaks (the recursive relabeling will
cause all MS_UNBINDABLE mounts to become MS_SHARED). This is necessary
to ensure that libdm mounts don't leak into containers that have
bindmounts containing /var/lib/docker.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
This test is enabled on all storage drivers to make sure that similar
bugs in other storage drivers are detected earlier.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar cyphar force-pushed the dm-protect-mounts-against-volume-leaks branch from 065a18a to 717ca78 Compare August 17, 2017 14:38
@cyphar
Contributor Author

cyphar commented Aug 17, 2017

What is the kernel version on Jenkins? Linux upstream has a patch in newer kernels (>3.18) that makes rmdir on a mountpoint that is referenced in another namespace no longer give EBUSY (because it made machines susceptible to DoS) -- see torvalds/linux@8ed936b5671b ("vfs: Lazily remove mounts on unlinked files and directories.").

I believe this is the reason the test fails on older kernels. I'm investigating a workaround for older kernels, but it's actually quite difficult to make /dev/shm unbindable -- because its main use is being bind-mounted into containers.

@@ -1574,3 +1574,16 @@ func (s *DockerSuite) TestRunWithNanoCPUs(c *check.C) {
c.Assert(err, check.NotNil)
c.Assert(out, checker.Contains, "Conflicting options: Nano CPUs and CPU Period cannot both be set")
}

// #34542
func (s *DockerSuite) TestRunVolumeLeakage(c *check.C) {
Member

Please make this an API test. CLI tests are deprecated.

Contributor Author

Where do API tests live?

Member

Pending documentation for this is in #34407

The new suite is integration/<component>/, so for this it would be integration/container/run_test.go. Since that doesn't exist yet, you could add it to integration-cli/docker_api_container_create_test.go.

Although, looking at this change again. Would it be possible to test this with a unit test?

Contributor Author

I don't think you could test this with a unit test, since it requires starting containers in order to make sure that -v /:/rootfs won't cause unintended mount leakage.

@tophj-ibm
Contributor

@cyphar I can't see any of the x86_64 kernel versions (ping @psftw), but the ppc64le ones are 4.4.0-x, where it did pass.

@cyphar
Contributor Author

cyphar commented Aug 17, 2017

/cc @runcom @mrunalp You probably care about this for containers/storage. There are also a whole host of libdm issues where you can DoS a system using it, but those require kernel intervention.

@mrunalp
Contributor

mrunalp commented Aug 17, 2017

@rhvgoyal @nalind PTAL

@rhvgoyal
Contributor

@cyphar Can you explain a little bit how this prevents mount point leaks into containers? I could not grok it from the commit message yet.

So this is primarily a problem on older kernels, right? As you mentioned, newer kernels will remove mounts from other mount namespaces when a directory in the daemon's mount namespace is removed.

We are taking care of it using a prestart plugin called oci-umount. That plugin does the unmounting. But if this PR takes care of it, then I will gladly get rid of the oci-umount plugin.

https://github.com/projectatomic/oci-umount

@rhvgoyal
Contributor

Aha, you are making use of unbindable mounts. That sounds very interesting. Let me play with it.

So the first side effect of making /var/lib/docker/containers unbindable is that no container can bind mount it in. I mean -v /var/lib/docker/containers:/foo/ will fail. I know that the fluentd container does this operation in an attempt to look at container logs. But they will probably have to move away from that: either make use of the API, or bind mount /var/lib/docker/ in.

@cyphar
Contributor Author

cyphar commented Aug 17, 2017

@rhvgoyal

So this is primarily a problem on older kernels, right? As you mentioned, newer kernels will remove mounts from other mount namespaces when a directory in the daemon's mount namespace is removed.

There are two issues at play here:

  1. Our libdm usage needs to use MS_UNBINDABLE because it doesn't correctly handle leaked mounts (as we've all collectively discovered in other recent libdm issues) so anyone who wanted to bind-mount /var/lib/docker/devicemapper is out-of-luck. Basically if you keep a copy of the libdm mount alive, then Docker won't be able to clean up the libdm device (and deferred deletion and removal don't appear to help either). This happens on all kernels, and I'm just making /var/lib/docker/devicemapper use MS_UNBINDABLE.

  2. The reason why the tests are currently failing is because the test machines have an older kernel that doesn't have torvalds/linux@8ed936b5671b applied. As a result, docker rm -f is failing because the /dev/shm mount has been leaked to the second container in my test. On newer kernels that isn't a problem (the mount is killed when Docker attempts an rmdir on the shm mountpoint), but on older kernels you get EBUSY.

So (1) is entirely solved by this current patchset, and I'm trying to figure out how to solve (2). My first idea for (2) was to make /var/lib/docker/containers also unbindable, but I've come to the conclusion that making /var/lib/docker/containers unbindable is a bad idea for multiple reasons (the most obvious of which is that it completely breaks resolv.conf, hostname and hosts, because you can't bind-mount them anymore -- you can't bind-mount anything that's inside an MS_UNBINDABLE mount). Really, the tests fail on 14.04 because of a kernel bug that hasn't been fixed in Ubuntu's kernels, and I'm not sure that significant workarounds are going to be worth it (especially since currently -v /:/rootfs completely breaks on all kernels and setups). Maybe we should just make the test ignore shm failures and warn on them instead?
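One way to see whether a libdm mount has leaked into a container's namespace is to scan its /proc/<pid>/mountinfo for mountpoints under /var/lib/docker/devicemapper. A rough illustration (a hypothetical helper, not part of this PR), run here against a canned sample:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// leakedMounts scans mountinfo-formatted content and returns the
// mountpoints that live under prefix -- e.g. devicemapper mounts that
// leaked into a container's mount namespace.
func leakedMounts(mountinfo, prefix string) []string {
	var leaks []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		// In /proc/<pid>/mountinfo, field 5 (index 4) is the mountpoint.
		fields := strings.Fields(sc.Text())
		if len(fields) > 4 && strings.HasPrefix(fields[4], prefix) {
			leaks = append(leaks, fields[4])
		}
	}
	return leaks
}

func main() {
	// Canned two-line sample standing in for a container's mountinfo.
	sample := "36 35 0:30 / /var/lib/docker/devicemapper/mnt/abc rw - ext4 /dev/dm-3 rw\n" +
		"37 35 0:31 / /tmp rw - tmpfs tmpfs rw\n"
	fmt.Println(leakedMounts(sample, "/var/lib/docker/devicemapper"))
	// → [/var/lib/docker/devicemapper/mnt/abc]
}
```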

@thaJeztah
Member

/cc @cpuguy83

@thaJeztah
Member

oh, and @kolyshkin probably 😄

@rhvgoyal
Contributor

@cyphar for the first problem, I think deferred deletion already takes care of it?

In the Remove() flow, deferred deletion will delete the device and then we will go on to remove the container rootfs directory. That will yank the leaked mount point, and soon after, the deferred-deletion worker thread will clean up the deferred-deleted device.

Even if one is not using deferred deletion, I think that might be fixable on new kernels. Say a device mount point has leaked into a container and then we try to remove that device. Device deletion fails as the device is busy. Can't we first remove the directory containing the container rootfs mount and then call into libdm to delete the device? Removing the directory will remove the leaked mount point, and after that device deletion will succeed.

So the flow would be something like:

    Remove() {                      // in driver.go
        deactivateDevice()
        system.EnsureRemoveAll(mp)  // removes the directory, yanking the leaked mount
        DeleteDevice()
    }

But I also like the idea of making /var/lib/docker/devicemapper unbindable. Not sure what else it will break. I remember that in the past containerd created some bind mounts of the container rootfs. Not sure if that functionality is still around or not; those operations would fail.

@cyphar
Contributor Author

cyphar commented Aug 17, 2017

@rhvgoyal Deferred deletion doesn't appear to solve the problem. I believe the reason is that when requesting a deletion to be deferred, the device still needs to be unmounted (but I'm by no means an LVM expert). Here's me trying it out:

% docker info | grep Deferred
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
% docker run -d --name testA busybox top
[snip]
% docker run -d --name testB -v /:/rootfs busybox top
[snip]
% docker rm -f testA
Error response from daemon: Driver devicemapper failed to remove root filesystem 5851da81b1e30f11ad8aebcb3ae0d05adeb4fbbb221490488df96b77f7d36719: failed to remove device 35546a51ac8dee03c934fa294fcfa1937689cfc4de06c9e554166b583a955425: devicemapper: Error running DeleteDevice dm_task_run failed

Note though that I recently discovered that openSUSE has some lvm patches that make error logging not as verbose as it is in upstream lvm (which may throw off some of the retry code in Docker), but that shouldn't affect this particular case.

Can't we first remove the directory containing container rootfs mount and then call into libdm to delete device. Removing directory will remove leaked mount point and after that device deletion will succeed.

That sounds like it might work, I'll play with it and see whether it also solves the issue. The downside is that it won't solve the problem on older kernels (the current solution partially solves the problem on older kernels).

@rhvgoyal
Contributor

rhvgoyal commented Aug 17, 2017

Actually, if we make /var/lib/docker/devicemapper unbindable, only bind mounts of that mount point itself are denied. I can still mount thin devices on /var/lib/docker/devicemapper/container1, and then /var/lib/docker/devicemapper/container1 can be bind mounted somewhere else.

So looks like making /var/lib/docker/devicemapper/ unbindable might be fine.

@cyphar
Contributor Author

cyphar commented Aug 17, 2017

@rhvgoyal Yeah, making /var/lib/docker/devicemapper unbindable isn't bullet-proof, and just provides safety against accidental leakage. You can still forcefully leak the mount through a variety of ways (or up the safety by using --storage-opt dm.mountopt=unbindable). But if we just do the EnsureRemoveAll route (maybe combining it with the current solution) then we should be able to resolve even cases where someone has gone out of their way to try to DoS Docker.

@rhvgoyal
Contributor

@cyphar deferred deletion works for me. I am using Fedora26 docker.

Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0

% docker run -d --name testA busybox top
[snip]
% docker run -d --name testB -v /:/rootfs busybox top
[snip]
% docker rm -f testA
testA

So it worked for me. In fact, if deferred deletion is on, then you should not have got an error saying failed to delete device, because that's the whole point of deferred deletion: despite the fact that the device could not be deleted, return success to the caller and retry the deletion later with the help of a worker thread.

So something is missing from your setup.

// Always use MS_PRIVATE for rootfs propagation to make sure that we
// don't leak any MS_UNBINDABLE mounts into the container's namespace
// while it's resolving volume bindmounts.
RootfsPropagation: "private",
Contributor

I will have to look at the logic more closely; I am forgetting the details. So this will apply "rprivate" on "/" in the container's mount namespace? If yes, this will break "shared/slave" volume mounts ("-v /mnt/vol1:/foo:slave").

Contributor Author

@cyphar cyphar Aug 17, 2017

From my reading of libcontainer/rootfs_linux.go, this only changes how the host's / mount is set in the container's mount namespace before setting up the container's mounts. I'm actually very confused why MS_SLAVE|MS_REC is the default, as that seems to me like it would cause a lot of issues.

Contributor

For example, if I do a slave shared volume, right now docker seems to be setting rootfsPropagation=rslave, and I think that will reset the unbindable property of /var/lib/docker/devicemapper.

So the first question is: does it have to be rslave? I suspect we can get away with just slave as well. That way it will not reset the unbindable property of mount points inherited from the parent.

Or we need to enhance the spec so that the caller can pass in a list of unbindable mounts, and runc sets these mounts unbindable before trying to do volume mounts.

Contributor

MS_SLAVE|MS_REC is the libcontainer default. I thought somebody sets rootfsPropagation to MS_PRIVATE | MS_REC in normal cases so the libcontainer default does not kick in; can't find who does it, though.

MS_SLAVE | MS_REC makes sense in a way: if anything leaks into the container from the host, it has a slave relationship, and if it is unmounted on the host, it will get unmounted from the container too, leading to fewer busy-mount issues.

But if we apply any property recursively (rshared, rprivate, rslave) on /, it will reset the unbindable property of mounts. So I doubt we can apply anything recursively on /.

Contributor Author

@rhvgoyal I had a tentative runC patch that would explicitly restore MS_UNBINDABLE mountpoints after applying rootfsPropagation (MS_PRIVATE|MS_REC, or whatever the option is set to). It didn't appear to work, and I discovered that rootfsPropagation=private fixed the issue as well, so I stopped debugging. That method might still work.

But as I mentioned we should still use the rmdir method, and potentially do something more clever with EBUSY in that case.

Contributor Author

I thought somebody sets rootfsPropagation MS_PRIVATE | MS_REC in normal cases so libcontainer default does not kick in. Can't find though who does it.

It's the libcontainer default, but when generating the spec the actual default is MS_PRIVATE|MS_REC if you look at the spec generation.

Member

Can we not have private on the graphdriver root as well as unbindable?

@cyphar
Contributor Author

cyphar commented Aug 19, 2017

@rhvgoyal

deferred deletion works for me. I am using Fedora26 docker.

Yeah, you're right. I tried it on Debian 9 and it also worked with deferred removal+deletion, which tells me that there's some weird bug going on on openSUSE. I double checked and it wasn't PEBKAC, so maybe there's some patch we've applied to libdm that is breaking deferral. I'll look into it.

However, I still think that handling this problem for non-deferred users is a worthwhile exercise. I'm going to play around with the removal idea we've discussed and see if it can solve the problem without the need for MS_UNBINDABLE.

@cyphar
Contributor Author

cyphar commented Aug 19, 2017

I completely forgot about opencontainers/runc#1500, which attempts to fix a similar version of this issue. It's possible that might help in the instances where the kernel is too old to handle the RemoveAll semantics, if we just make the default propagation MS_SLAVE. I'm not sure.

@rhvgoyal
Contributor

cc @rhatdan

@rhatdan
Contributor

rhatdan commented Aug 21, 2017

@rhvgoyal Looks like this does not solve the issue in CRI-O either. runc must be making the mount points recursively RPRIVATE or RSHARED or something.

@cyphar
Contributor Author

cyphar commented Aug 21, 2017

@rhatdan By default runc makes mounts rprivate, you can change it with rootfsPropagation in config.json. But take a look at my alternative PR to this one #34573, which works even if you do RPRIVATE (I still need to update it to address @rhvgoyal's comments).

@cyphar
Contributor Author

cyphar commented Sep 2, 2017

Closing in favour of #34573.

@cyphar cyphar closed this Sep 2, 2017
@cyphar cyphar deleted the dm-protect-mounts-against-volume-leaks branch September 2, 2017 08:41