@jpetazzo
Contributor

It has been pointed out that some files in /proc and /sys can be used
to break out of containers. However, if those filesystems are mounted
read-only, most of the known exploits are mitigated, since they rely
on writing some file in those filesystems.
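
A minimal sketch (not part of this patch) of how one could check, from inside a container, whether /proc and /sys ended up read-only. It only parses text in the /proc/mounts format; the helper names `mount_options` and `is_read_only` are hypothetical.

```python
def mount_options(mounts_text, mountpoint):
    """Return the option set of the last mount listed at `mountpoint`, or None.

    `mounts_text` is expected to be in the /proc/mounts format:
    device, mountpoint, fstype, options, dump, pass.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            # Later entries shadow earlier ones, so keep the last match.
            found = set(fields[3].split(","))
    return found


def is_read_only(mounts_text, mountpoint):
    """True if `mountpoint` is listed and its options include "ro"."""
    opts = mount_options(mounts_text, mountpoint)
    return opts is not None and "ro" in opts
```

Feeding it the contents of /proc/mounts and asking about "/proc" and "/sys" tells whether the read-only mounts are in effect. It says nothing about bind mounts layered on subdirectories such as /proc/1/attr, which show up as separate entries.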

This does not replace security modules (like SELinux or AppArmor); it
is just another layer of security. Likewise, it doesn't mean that the
other mitigations (shadowing parts of /proc or /sys with bind mounts)
are useless. Those measures are still useful.

Fixes #5444

Docker-DCO-1.1-Signed-off-by: Jérôme Petazzoni [email protected] (github: jpetazzo)

@unclejack
Contributor

I'm getting a lot of failures and a panic in the integration tests. This isn't happening on master.
https://gist.github.com/unclejack/b6bf7c82bf69154a60d3

@ibuildthecloud
Contributor

Won't this disable a lot of very useful functionality? For example, /proc/sys/net has useful settings in it, and changing them inside a container is valid because of the network namespace.

@jpetazzo
Contributor Author

@unclejack: those tests run fine here. The only tests failing are PingGoogle (because 8.8.8.8 is not reachable on my machine) and the test that checks sysfs access (obviously).

How did you run the tests? (I did make test, FWIW)

@jpetazzo
Contributor Author

Tests updated. All pass here.

@vieux
Contributor

vieux commented Apr 29, 2014

I still have tons of failures

@crosbymichael
Contributor

@vieux i don't trust you

@jpetazzo
Contributor Author

I fixed the gofmt thing.

(Also, the person responsible for messing with my gofmt git hook has been terminated!)

@vieux what kind of failure do you see? FWIW, I'm running make test and my local Docker is 0.9 + native + btrfs (not that it should matter, but who knows).

@crosbymichael
Contributor

ping @creack

@creack
Contributor

creack commented Apr 29, 2014

Same as @vieux: plenty of failures in test-integration (too much output to paste), and it ends up in a panic. (master works fine)

@jpetazzo
Contributor Author

@vieux told me he was running with AUFS and kernel 3.11; what's your setup @creack?

@crosbymichael
Contributor

@creack @vieux can we see what at least one of the errors looks like?

@creack
Contributor

creack commented Apr 29, 2014

@jpetazzo btrfs, kernel 3.13

@unclejack
Contributor

@jpetazzo I was running the tests using make test on Ubuntu 13.10, kernel 3.10.37 with btrfs as storage backend.

@jpetazzo
Contributor Author

Just reproduced the tests within a boot2docker VM (kernel 3.13, docker 0.10, native, aufs).

First time, I got a completely bogus error in archive.

I switched back to master, test went fine.

Then I switched back to my branch, test went fine.

I don't even.

Meanwhile, I'm running the whole suite 5 times in a row here to see what happens.

I'll also compare the failures reported by @creack and @unclejack. @vieux, can you add yours please?

Thanks.

@crosbymichael
Contributor

@unclejack @creack @vieux Instead of running the tests, can you try running a container to get a better error message about what is preventing it from running?

@unclejack
Contributor

@vieux
Contributor

vieux commented Apr 29, 2014

2014/04/29 21:45:32 read-only file system

@vieux
Contributor

vieux commented Apr 29, 2014

Nothing in the daemon logs:

[66bc9e6b] +job attach(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[debug] server.go:940 Calling POST /containers/{name:.*}/start
2014/04/29 21:46:37 POST /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/start
[66bc9e6b] +job start(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] +job allocate_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] -job allocate_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] container.go:188 attach: stdin: begin
[debug] container.go:225 attach: stdout: begin
[debug] container.go:263 attach: stderr: begin
[debug] container.go:309 attach: waiting for job 1/3
[libcontainer] 2014/04/29 21:46:37 created sync pipe parent fd 15 child fd 14
[libcontainer] 2014/04/29 21:46:37 creating master and console
[libcontainer] 2014/04/29 21:46:37 attach terminal to command
[libcontainer] 2014/04/29 21:46:37 starting command
[libcontainer] 2014/04/29 21:46:37 writing pid 5693 to file
[libcontainer] 2014/04/29 21:46:37 setting cgroups
[66bc9e6b] -job start(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] server.go:940 Calling POST /containers/{name:.*}/resize
[libcontainer] 2014/04/29 21:46:37 setting up network
2014/04/29 21:46:37 POST /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/resize?h=94&w=180
[66bc9e6b] +job resize(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, 94, 180)
[66bc9e6b] -job resize(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, 94, 180) = OK (0)
[libcontainer] 2014/04/29 21:46:37 closing sync pipe with child
[libcontainer] 2014/04/29 21:46:37 process exited with status 1
[libcontainer] 2014/04/29 21:46:37 removing pid file
[66bc9e6b] +job release_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e)
[66bc9e6b] -job release_interface(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[error] container.go:626 873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e: Error closing terminal: invalid argument
[debug] container.go:242 attach: stdout: end
[debug] container.go:280 attach: stderr: end
[debug] container.go:314 attach: job 1 completed successfully
[debug] container.go:309 attach: waiting for job 2/3
[debug] container.go:314 attach: job 2 completed successfully
[debug] container.go:309 attach: waiting for job 3/3
[debug] server.go:2333 Closing buffered stdin pipe
[debug] container.go:215 attach: stdin: end
[debug] container.go:314 attach: job 3 completed successfully
[debug] container.go:316 attach: all jobs completed successfully
[66bc9e6b] -job attach(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e) = OK (0)
[debug] server.go:940 Calling GET /containers/{name:.*}/json
2014/04/29 21:46:37 GET /v1.11/containers/873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e/json
[66bc9e6b] +job inspect(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, container)
[66bc9e6b] -job inspect(873262823efcb4bad767022cfc2ea3ab9aea51d9892788212eda3b708417d93e, container) = OK (0)

@unclejack
Contributor

I'm also getting the read-only file system error.

@crosbymichael
Contributor

Maybe something with apparmor on their systems?

@crosbymichael
Contributor

Yeah, I think it may be AppArmor, when we go to set the profile for the process.

@creack
Contributor

creack commented Apr 29, 2014

Nothing regarding AppArmor in dmesg, but it might indeed be AppArmor.

@tianon
Member

tianon commented Apr 29, 2014

Just spitballing, but couldn't we remount,ro on these in dockerinit instead?

@crosbymichael
Contributor

Yes, if it really is setting the AppArmor profile that fails, we will have to mount it as rw initially and then, once we have everything set up, remount it as ro.

@jpetazzo
Contributor Author

Right. I eventually reproduced the issue with an Ubuntu-based VM, and I can confirm that it comes from the AppArmor thing. Thanks a lot to those who helped figure this out!

I think I will just improve Restrict() a little bit, and then patch it up into the LXC driver as well.

It has been pointed out that some files in /proc and /sys can be used
to break out of containers. However, if those filesystems are mounted
read-only, most of the known exploits are mitigated, since they rely
on writing some file in those filesystems.

This does not replace security modules (like SELinux or AppArmor); it
is just another layer of security. Likewise, it doesn't mean that the
other mitigations (shadowing parts of /proc or /sys with bind mounts)
are useless. Those measures are still useful. As such, the shadowing
of /proc/kcore is still enabled with both LXC and native drivers.

Special care has to be taken with /proc/1/attr, which still needs to
be mounted read-write in order to enable the AppArmor profile. It is
bind-mounted from a private read-write mount of procfs.

All that enforcement is done in dockerinit. The code doing the real
work is in libcontainer. The init function for the LXC driver calls
the function from libcontainer to avoid code duplication.

Docker-DCO-1.1-Signed-off-by: Jérôme Petazzoni <[email protected]> (github: jpetazzo)

@jpetazzo
Contributor Author

jpetazzo commented May 1, 2014

To summarize a quick exchange with Mike on #docker-dev: there is a catch-22 here. To enable AppArmor, you need write access to /proc, but once AppArmor is enabled, you no longer have the permissions needed to remount /proc read-only.

I found a workaround: do a temporary mount of /proc read-write, and bind-mount /proc/1/attr (the directory needed by AppArmor) over the "real" read-only /proc. It's not the cleanest thing, but it works pretty well, and I like the fact that instead of masking some things, we mask everything and explicitly re-enable access wherever needed.
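
To make the sequence easier to follow, here is a rough sketch of the workaround as an ordered list of mount steps. It is not the code in this branch: it only builds a plan and never calls mount(2), the scratch mountpoint `/.proc-rw` and the names `Step` and `proc_attr_workaround_plan` are hypothetical, and the final detach step is my assumption about how the temporary mount gets cleaned up. Whether this exact order behaves as intended on a given kernel is not something the sketch verifies.

```python
from collections import namedtuple

# Mount flag values as defined in <sys/mount.h> on Linux.
MS_RDONLY = 1
MS_REMOUNT = 32
MS_BIND = 4096

# One step of the plan: the arguments one would hand to mount(2),
# or action="umount" for the cleanup step.
Step = namedtuple("Step", "action source target fstype flags")


def proc_attr_workaround_plan(scratch="/.proc-rw"):
    """Ordered steps for a read-only /proc with a writable /proc/1/attr.

    `scratch` is a hypothetical temporary mountpoint, not a path from the PR.
    """
    return [
        # 1. Make the "real" /proc read-only.
        Step("mount", "", "/proc", "", MS_REMOUNT | MS_RDONLY),
        # 2. Mount a second, temporary procfs read-write.
        Step("mount", "proc", scratch, "proc", 0),
        # 3. Bind-mount the directory AppArmor needs over the read-only /proc.
        Step("mount", scratch + "/1/attr", "/proc/1/attr", "", MS_BIND),
        # 4. Assumption: detach the temporary mount once the bind is in place.
        Step("umount", "", scratch, "", 0),
    ]
```

Keeping the plan as data makes it easy to print or review the order before anything privileged runs.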

Regarding the overall structure, I ended up doing the following changes:

  • Restrict() is now able to mask sensitive files, but also to remount sensitive mountpoints read-only (which makes it possible to gather all those mountpoint-related things in one function, instead of scattering them across multiple places);
  • Restrict() is now used in both the native and lxc drivers;
  • Restrict() is now called in the context of dockerinit, because it needs to bind-mount /proc/1/attr for AppArmor compatibility.
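
As an illustration of the "one function for all mountpoint-related things" idea, here is a small sketch under my own assumptions. It is not the Restrict() in libcontainer: `restrict_plan` is a hypothetical name, it only returns a list of intended operations, and the order (mask first, then remount read-only) is a guess.

```python
import posixpath


def restrict_plan(rootfs, ro_mountpoints=("proc", "sys"),
                  masked_files=("proc/kcore",)):
    """Return (operation, path) pairs describing the restrictions to apply.

    Paths are given relative to `rootfs`. Nothing is mounted here; a caller
    running in the container's init context would carry out each operation.
    """
    plan = []
    for name in masked_files:
        # "mask": shadow the file with a bind mount so it cannot be read.
        plan.append(("mask", posixpath.join(rootfs, name)))
    for name in ro_mountpoints:
        # "remount-ro": remount the whole filesystem read-only.
        plan.append(("remount-ro", posixpath.join(rootfs, name)))
    return plan
```

Both drivers could then share the same plan and differ only in where they call it from.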

Unfortunately, I haven't been able to run all tests on this branch because of trouble with my test setup: all tests pass within my boot2docker VM; I get random failures in my local environment; and I get consistent failures in the graphdriver/devmapper unit tests in my Ubuntu 14.04 VM, which doesn't make any sense. Those failures also happen on master, which hints at a problem with my VM rather than with the code itself.

If people could give this a try while I fix my test setup, I would be forever grateful.

@crosbymichael
Contributor

#assignee=crosbymichael

@crosbymichael
Contributor

@jpetazzo I'll take this PR from here. We are trying a new workflow, so instead of asking you to make a lot of small changes, I'll just make them and get it merged, unless there are some larger changes that you still want to make.

@cgwalters
Contributor


Development

Successfully merging this pull request may close these issues.

/proc/sys/kernel/hostname doesn't exist inside container
