Mount /proc and /sys read-only, except in privileged containers. #5445

jpetazzo · 2014-04-28T20:07:14Z

It has been pointed out that some files in /proc and /sys can be used
to break out of containers. However, if those filesystems are mounted
read-only, most of the known exploits are mitigated, since they rely
on writing some file in those filesystems.

This does not replace security modules (like SELinux or AppArmor); it
is just another layer of security. Likewise, it doesn't mean that the
other mitigations (shadowing parts of /proc or /sys with bind mounts)
are useless. Those measures are still useful.
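
For anyone who wants to check what a container actually ends up with, here is a minimal sketch (not part of this PR) that parses text in /proc/mounts format and reports whether a mount point carries the `ro` option. The function name `mountIsReadOnly` and the sample data are made up for illustration; on a real system you would feed it the contents of /proc/mounts.

```go
package main

import (
	"fmt"
	"strings"
)

// mountIsReadOnly scans text in /proc/mounts format and reports whether the
// last mount listed for mountPoint carries the "ro" option. found is false
// when mountPoint does not appear at all.
// (Hypothetical helper, named here only for illustration.)
func mountIsReadOnly(mounts, mountPoint string) (readOnly, found bool) {
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		// Expected fields: device, mount point, fs type, options, dump, pass.
		if len(fields) < 4 || fields[1] != mountPoint {
			continue
		}
		found = true
		readOnly = false // a later entry for the same mount point wins
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				readOnly = true
			}
		}
	}
	return readOnly, found
}

func main() {
	// Sample data for illustration only, not output captured from a container.
	sample := "proc /proc proc ro,nosuid,nodev,noexec,relatime 0 0\n" +
		"sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0\n"
	for _, mp := range []string{"/proc", "/sys"} {
		ro, found := mountIsReadOnly(sample, mp)
		fmt.Printf("%s: found=%v read-only=%v\n", mp, found, ro)
	}
}
```

With this change applied, both /proc and /sys would be expected to report read-only in an unprivileged container.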

Fixes #5444

Docker-DCO-1.1-Signed-off-by: Jérôme Petazzoni [email protected] (github: jpetazzo)

unclejack · 2014-04-28T20:48:32Z

I'm getting a lot of failures and a panic in the integration tests. This isn't happening on master.
https://gist.github.com/unclejack/b6bf7c82bf69154a60d3

ibuildthecloud · 2014-04-28T20:52:37Z

Won't this disable a lot of very useful functionality? For example, /proc/sys/net has useful settings in it, and it's valid to change them in a container because of the network namespace.

jpetazzo · 2014-04-29T00:00:49Z

@unclejack: those tests run fine here. The only tests failing are PingGoogle (because 8.8.8.8 is not reachable on my machine) and the test that checks sysfs access (obviously).

How did you run the tests? (I did make test, FWIW)

jpetazzo · 2014-04-29T01:08:05Z

Tests updated. All pass here.

vieux · 2014-04-29T01:18:22Z

I still have tons of failures

crosbymichael · 2014-04-29T01:23:36Z

@vieux i don't trust you

jpetazzo · 2014-04-29T01:50:27Z

I fixed the gofmt thing.

(Also, the person responsible for messing with my gofmt git hook has been terminated!)

@vieux what kind of failure do you see? FWIW, I'm running make test and my local Docker is 0.9 + native + btrfs (not that it should matter, but who knows).

crosbymichael · 2014-04-29T17:13:06Z

ping @creack

creack · 2014-04-29T19:29:02Z

Same as @vieux: plenty of failures in test-integration (too much output to paste), and it ends up in a panic. (master works fine)

jpetazzo · 2014-04-29T19:30:43Z

@vieux told me he was running with AUFS and kernel 3.11; what's your setup @creack?

crosbymichael · 2014-04-29T20:36:59Z

@creack @vieux can we see what at least one of the errors looks like?

creack · 2014-04-29T21:02:10Z

@jpetazzo btrfs 3.13

unclejack · 2014-04-29T21:04:39Z

@jpetazzo I was running the tests using make test on Ubuntu 13.10, kernel 3.10.37 with btrfs as storage backend.

creack · 2014-04-29T21:08:38Z

@crosbymichael http://pastebin.com/73pwzCQw

jpetazzo · 2014-04-29T21:14:07Z

Just reproduced the tests within a boot2docker VM (kernel 3.13, docker 0.10, native, aufs).

First time, I got a completely bogus error in archive.

I switched back to master, test went fine.

Then I switched back to my branch, test went fine.

I don't even.

Meanwhile, I'm running the whole suite 5 times in a row here to see what happens.

I'll also compare the fails reported by @creack and @unclejack. @vieux, can you add yours please?

Thanks.

crosbymichael · 2014-04-29T21:18:06Z

@unclejack @creack @vieux Instead of running the tests can you try running a container to get a better error message about what is preventing it from running?

unclejack · 2014-04-29T21:32:42Z

@jpetazzo @crosbymichael Here's what I get when I run the tests: http://showterm.io/72a7e02e87a8ec374ffe6#fast

@crosbymichael I'm building a binary to test now.

vieux · 2014-04-29T21:33:03Z

2014/04/29 21:45:32 read-only file system

vieux · 2014-04-29T21:34:28Z

Nothing in the daemon logs:

[66bc9e6b] +job attach(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[debug] server.go:940 Calling POST /containers/{name:.*}/start
2014/04/29 21:46:37 POST /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/start
[66bc9e6b] +job start(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] +job allocate_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] -job allocate_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] container.go:188 attach: stdin: begin
[debug] container.go:225 attach: stdout: begin
[debug] container.go:263 attach: stderr: begin
[debug] container.go:309 attach: waiting for job 1/3
[libcontainer] 2014/04/29 21:46:37 created sync pipe parent fd 15 child fd 14
[libcontainer] 2014/04/29 21:46:37 creating master and console
[libcontainer] 2014/04/29 21:46:37 attach terminal to command
[libcontainer] 2014/04/29 21:46:37 starting command
[libcontainer] 2014/04/29 21:46:37 writing pid 5693 to file
[libcontainer] 2014/04/29 21:46:37 setting cgroups
[66bc9e6b] -job start(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] server.go:940 Calling POST /containers/{name:.*}/resize
[libcontainer] 2014/04/29 21:46:37 setting up network
2014/04/29 21:46:37 POST /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/resize?h=94&w=180
[66bc9e6b] +job resize(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, 94, 180)
[66bc9e6b] -job resize(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, 94, 180) = OK (0)
[libcontainer] 2014/04/29 21:46:37 closing sync pipe with child
[libcontainer] 2014/04/29 21:46:37 process exited with status 1
[libcontainer] 2014/04/29 21:46:37 removing pid file
[66bc9e6b] +job release_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] -job release_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[error] container.go:626 873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e: Error closing terminal: invalid argument
[debug] container.go:242 attach: stdout: end
[debug] container.go:280 attach: stderr: end
[debug] container.go:314 attach: job 1 completed successfully
[debug] container.go:309 attach: waiting for job 2/3
[debug] container.go:314 attach: job 2 completed successfully
[debug] container.go:309 attach: waiting for job 3/3
[debug] server.go:2333 Closing buffered stdin pipe
[debug] container.go:215 attach: stdin: end
[debug] container.go:314 attach: job 3 completed successfully
[debug] container.go:316 attach: all jobs completed successfully
[66bc9e6b] -job attach(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] server.go:940 Calling GET /containers/{name:.*}/json
2014/04/29 21:46:37 GET /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/json
[66bc9e6b] +job inspect(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, container)
[66bc9e6b] -job inspect(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, container) = OK (0)

unclejack · 2014-04-29T21:34:40Z

I'm also getting the read-only file system error.
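
The "read-only file system" text appears to be the message Go prints for the EROFS errno. As a minimal sketch (not code from this PR), a caller could tell that case apart from other write failures with `errors.Is`. The helpers `isReadOnlyFS`, `simulatedEROFS`, and `describeWriteError` are hypothetical names, and the error in `main` is constructed by hand rather than produced by a real write.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isReadOnlyFS reports whether err wraps EROFS, the errno behind the
// "read-only file system" message.
func isReadOnlyFS(err error) bool {
	return errors.Is(err, syscall.EROFS)
}

// simulatedEROFS builds the kind of error an open-for-write on a read-only
// mount would typically return. It exists only so the example can run
// without touching any real filesystem.
func simulatedEROFS(path string) error {
	return &os.PathError{Op: "open", Path: path, Err: syscall.EROFS}
}

// describeWriteError turns a write error into a short diagnostic string.
func describeWriteError(err error) string {
	switch {
	case err == nil:
		return "write succeeded"
	case isReadOnlyFS(err):
		return "target is on a read-only filesystem"
	default:
		return "write failed for another reason: " + err.Error()
	}
}

func main() {
	// Hypothetical path, used only as a label inside the simulated error.
	fmt.Println(describeWriteError(simulatedEROFS("/proc/1/attr/current")))
	fmt.Println(describeWriteError(os.ErrPermission))
	fmt.Println(describeWriteError(nil))
}
```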

crosbymichael · 2014-04-29T21:37:55Z

Maybe something with apparmor on their systems?

crosbymichael · 2014-04-29T21:40:10Z

Yeah, I think it may be AppArmor, when we go to set the profile for the process.

creack · 2014-04-29T21:43:00Z

Nothing regarding AppArmor in dmesg, but it might indeed be AppArmor.

tianon · 2014-04-29T21:47:28Z

Just spitballing, but couldn't we remount,ro on these in dockerinit instead?

crosbymichael · 2014-04-29T21:48:33Z

Yes, if it really is setting the AppArmor profile, we will have to mount it rw initially and then, after we have everything set up, remount it ro.

jpetazzo · 2014-04-30T00:15:31Z

Right. Eventually reproduced the issue with an Ubuntu-based VM. And confirming that it comes from the AppArmor thing, thanks a lot to those who helped to figure this out!

I think I will just improve Restrict() a little bit, and then patch it up into the LXC driver as well.

It has been pointed out that some files in /proc and /sys can be used to break out of containers. However, if those filesystems are mounted read-only, most of the known exploits are mitigated, since they rely on writing some file in those filesystems.

This does not replace security modules (like SELinux or AppArmor), it is just another layer of security. Likewise, it doesn't mean that the other mitigations (shadowing parts of /proc or /sys with bind mounts) are useless. Those measures are still useful. As such, the shadowing of /proc/kcore is still enabled with both LXC and native drivers.

Special care has to be taken with /proc/1/attr, which still needs to be mounted read-write in order to enable the AppArmor profile. It is bind-mounted from a private read-write mount of procfs.

All that enforcement is done in dockerinit. The code doing the real work is in libcontainer. The init function for the LXC driver calls the function from libcontainer to avoid code duplication.

Docker-DCO-1.1-Signed-off-by: Jérôme Petazzoni <[email protected]> (github: jpetazzo)

jpetazzo · 2014-05-01T01:45:06Z

To summarize a quick exchange with Mike on #docker-dev: there is a catch-22 here, because to enable AppArmor, you need write access to /proc; but once you have done that, you don't have the necessary permissions to remount /proc read-only anymore.

I found a workaround: do a temporary mount of /proc read-write, and bind-mount /proc/1/attr (the directory needed by AppArmor) over the "real" read-only /proc. It's not the cleanest thing, but it works pretty well, and I like the fact that instead of masking some things, we mask everything and explicitly re-enable access wherever needed.
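
The workaround can be sketched as an ordered mount plan. This is a dry run that only prints the equivalent mount(8) commands; it is not the code in this branch. The scratch directory, the `step` type, the exact ordering, and the use of a bind remount to make /proc read-only are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one mount operation, written as the equivalent mount(8) arguments.
type step struct {
	why  string
	args []string
}

// attrWorkaroundPlan returns the steps of the workaround in order.
// scratchDir is a hypothetical location for the temporary procfs mount.
func attrWorkaroundPlan(scratchDir string) []step {
	return []step{
		// 1. A second, private procfs mount that stays read-write.
		{"temporary read-write procfs", []string{"-t", "proc", "proc", scratchDir}},
		// 2. The /proc that the container sees becomes read-only
		//    (bind remount is an assumption; it changes only this mount).
		{"real /proc read-only", []string{"-o", "remount,ro,bind", "/proc"}},
		// 3. The one directory AppArmor needs is re-exposed read-write
		//    on top of the read-only /proc.
		{"re-enable /proc/1/attr", []string{"--bind", scratchDir + "/1/attr", "/proc/1/attr"}},
	}
}

func main() {
	// "/.proc-rw" is a made-up path; nothing is mounted by this program.
	for i, s := range attrWorkaroundPlan("/.proc-rw") {
		fmt.Printf("%d. %-28s mount %s\n", i+1, s.why, strings.Join(s.args, " "))
	}
}
```

The design point is the one made above: everything is locked down first, and access is then re-enabled explicitly for the single path that needs it.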

Regarding the overall structure, I ended up making the following changes:

Restrict() can now mask sensitive files and also remount sensitive mountpoints read-only (which gathers all the mountpoint-related logic in one function, instead of scattering it across multiple places);
Restrict() is now used in both the native and lxc drivers;
Restrict() is now called in the context of dockerinit, because it needs to bind-mount /proc/1/attr for AppArmor compatibility.
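
To make the shape of that concrete, here is a rough dry-run sketch of a Restrict-like function. It is not the libcontainer code: `restrictPlan`, its parameters, and the choice of bind-mounting /dev/null over masked files are assumptions. It only builds the list of equivalent mount(8) commands and skips everything for privileged containers, as the PR title describes.

```go
package main

import "fmt"

// restrictPlan returns the mount commands a Restrict-like function might
// issue: first mask individual files, then make whole mountpoints read-only.
// Privileged containers get no restrictions at all.
// (Hypothetical function; masking via /dev/null is an assumption.)
func restrictPlan(privileged bool, masked, readOnly []string) []string {
	if privileged {
		return nil
	}
	var cmds []string
	for _, f := range masked {
		cmds = append(cmds, "mount --bind /dev/null "+f)
	}
	for _, mp := range readOnly {
		cmds = append(cmds, "mount -o remount,ro,bind "+mp)
	}
	return cmds
}

func main() {
	masked := []string{"/proc/kcore"}
	readOnly := []string{"/proc", "/sys"}
	for _, privileged := range []bool{false, true} {
		plan := restrictPlan(privileged, masked, readOnly)
		fmt.Printf("privileged=%v: %d step(s)\n", privileged, len(plan))
		for _, c := range plan {
			fmt.Println("  " + c)
		}
	}
}
```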

Unfortunately, I haven't been able to run all tests on this branch. I have some trouble with my test setup (all tests pass within my boot2docker VM; I have random failures in my local environment; and consistent failures in my Ubuntu 14.04 VM in graphdriver/devmapper unit tests, which doesn't make any sense; those failures also happen on master, which hints at some problem within my VM rather than the code itself).

If people could give this a roll while I try to fix my test setup, I would be ever grateful.

crosbymichael · 2014-05-01T02:02:34Z

#assignee=crosbymichael

crosbymichael · 2014-05-01T16:32:16Z

@jpetazzo I'll take this PR from here. We are trying a new workflow, so instead of asking you to make a lot of small changes, I'll just do it and get it merged in, unless there are some larger changes that you still want to make.

cgwalters · 2014-05-09T18:58:55Z

Downstream regression: https://bugzilla.redhat.com/show_bug.cgi?id=1096375

jpetazzo mentioned this pull request Apr 28, 2014

/proc/sys/kernel/hostname doesn't exist inside container #5444

Closed

smarterclayton mentioned this pull request Apr 29, 2014

Can't read /proc/xxx/root in Docker 0.10 #5471

Closed

jpetazzo assigned crosbymichael May 1, 2014

crosbymichael mentioned this pull request May 1, 2014

Mount /proc and /sys read-only, except in privileged containers #5529

Merged

creack closed this in #5529 May 2, 2014

rjnagal mentioned this pull request May 8, 2014

running systemd inside docker arch container hangs or segfaults #3629

Closed

jpetazzo unassigned crosbymichael Jul 24, 2014

yosifkit mentioned this pull request Feb 25, 2015

Disable Transparent Huge Pages (THP) in docker machine redis/docker-library-redis#20

Closed


8 participants