Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more by poettering · Pull Request #4018 · systemd/systemd

poettering · 2016-08-22T17:30:06Z

Let's harden our service sandbox a bit.

evverx · 2016-08-23T09:31:29Z

Shouldn't we use the skip_seccomp_unavailable?

poettering · 2016-08-23T18:07:27Z

> Shouldn't we use the skip_seccomp_unavailable?

Yupp. But this PR predates the merge of that patch, IIRC ;-)

poettering · 2016-08-24T19:06:17Z

OK, I force-pushed a new version now that adds the seccomp test and is rebased on current master. Please have a look!

evverx · 2016-08-24T20:01:23Z

I'm not sure I understand these settings.
Actually, I can easily bypass ProtectKernelTunables:

-bash-4.3# systemctl cat hola --no-pager
# /etc/systemd/system/hola.service
[Service]
Type=oneshot
ProtectKernelTunables=yes
ExecStart=/bin/sh -e -x -c 'mkdir -p /new-proc; mount -t proc proc /new-proc; echo hey >/new-proc/sys/kernel/core_pattern'

-bash-4.3# cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e

-bash-4.3# systemctl start hola

-bash-4.3# cat /proc/sys/kernel/core_pattern
hey

poettering · 2016-08-24T20:16:59Z

Yes, you can: unless combined with other options, these settings can be circumvented. We actually document that fact for some of the namespace-related options, but I didn't mention that here again.

These options are very effective as soon as you either use CapabilityBoundingSet=~CAP_SYS_ADMIN or use SystemCallFilter=~@mount. I figure we should document that explicitly for these two options.

My intention is to introduce RestrictMounts= as another simple boolean soon, which acts as a combination of three things:

SystemCallFilter=~@mount
MountFlags=slave
an effective block on accessing /dev/fuse (using the devices cgroup controller)

As soon as we have that we should probably recommend to always set RestrictMounts= if people use any of the various namespacing options, including PrivateTmp=, PrivateDevices=, ReadOnlyDirectories=, and so on.

I'd claim these options are still useful even without this, as they protect from accidental modifications.
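
A minimal sketch of that combination, written as a drop-in for the hola.service unit from the example above. The helper name `hardening_dropin` and the drop-in file name are assumptions made for illustration; the settings are the ones named in this thread, with the syscall filter in its blacklisting form (`~@mount`).

```shell
# Sketch: print a drop-in that closes the "mount a fresh procfs" bypass.
# The function name and the target file name are illustrative only.
hardening_dropin() {
    cat <<'EOF'
[Service]
ProtectKernelTunables=yes
# According to the discussion, either of the next two lines is enough
# to make the setting effective against a deliberate bypass:
SystemCallFilter=~@mount
CapabilityBoundingSet=~CAP_SYS_ADMIN
EOF
}

# Usage (as root):
#   mkdir -p /etc/systemd/system/hola.service.d
#   hardening_dropin > /etc/systemd/system/hola.service.d/hardening.conf
#   systemctl daemon-reload
```

With such a drop-in in place, the `mount -t proc` step in the ExecStart= line above would be expected to fail instead of exposing a writable copy of /proc/sys.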

evverx · 2016-08-24T20:49:05Z

> I'd claim these options are still useful even without this, as they protect from accidental modifications.

Indeed. But why not use ReadOnlyPaths

ReadOnlyPaths=/sys /proc/sys
ReadOnlyPaths=/sys/fs/cgroup

?

> RestrictMounts= as another simple boolean soon, which acts as a combination of three things:
>
> SystemCallFilter=~@mount
> ...

Sounds good. But we ignore SystemCallFilter sometimes. So, this is not a "portable" solution.

poettering · 2016-08-25T07:37:29Z

> Indeed. But why not use ReadOnlyPaths

Note ReadOnlyPaths=/proc/sys /sys is not entirely equivalent to ProtectKernelTunables=1, as the latter also installs a syscall filter. But the major difference is really more on the psychological level. I think we should add high-level easy-to-use "meta" options for the common, suggested sandboxing options, and ReadOnlyPaths= is more a low-level option for those who know stuff. I think the low-level stuff should always take precedence and trump the high-level options.

It's mostly a matter of being able to tell in friendly words: "you want your service to be unable to change kernel tunables? OK, then set the aptly named ProtectKernelTunables=1" and there you go.

And then there's another reason: my intention is to eventually add a new concept to systemd (via a generator) that introduces services that are locked down by default, and where features need to be opened up explicitly if that's desired. But in that case it's a lot nicer conceptually if we have friendly bools instead of lists of dirs, since bools can easily be unset, but we have no nice way to remove items from a list (we only have resetting, via an empty assignment).

I hope that makes sense.
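
To illustrate the bool-versus-list point, here is a small sketch of a drop-in that re-opens write access to /proc/sys. It assumes a hypothetical base unit that sets both ProtectKernelTunables=yes and ReadOnlyPaths=/sys /proc/sys; the helper name `override_dropin` is likewise made up for this example.

```shell
# Sketch: print a drop-in that opens up a locked-down service again.
# "override_dropin" is a hypothetical helper name.
override_dropin() {
    cat <<'EOF'
[Service]
# A boolean can be switched off on its own:
ProtectKernelTunables=no
# A list has no "remove one item" syntax. It can only be reset with an
# empty assignment and then rebuilt without the unwanted entry:
ReadOnlyPaths=
ReadOnlyPaths=/sys
EOF
}

# Usage (as root, for a hypothetical unit foo.service):
#   mkdir -p /etc/systemd/system/foo.service.d
#   override_dropin > /etc/systemd/system/foo.service.d/open-up.conf
```

The boolean needs one line, while the list needs the reset plus a restatement of every entry that should stay.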

poettering · 2016-08-25T07:40:50Z

> Sounds good. But we ignore SystemCallFilter sometimes. So, this is not a "portable" solution.

Sure, people get what they pay for. If they turn off seccomp during build they will get much less secure sandboxes...

tixxdz · 2016-08-25T09:38:17Z

@poettering few questions:

Shouldn't the documentation say "assuming procfs and sysfs are mounted in /proc and /sys..."?
Since you are already using seccomp, why not filter the arguments of mount? I mean, don't expose it, just hide it since these are details: inspect the arguments of mount() and deny the call if they are procfs or sysfs. I am not sure how this will compose with your plan here (#4018 (comment)), but at least ProtectKernelTunables= should be safe by default, and should not require SystemCallFilter=~@mount or RestrictMounts=, which are for all mounts?!
Some other containers, such as lxc, remount some files like /proc/sysrq-trigger read-only, but there are more files which are not protected... This can get complicated, not sure what would be the best solution here, hmmm....

Thank you!

poettering · 2016-08-25T10:30:29Z

> Shouldn't the documentation say "assuming procfs and sysfs are mounted in /proc and /sys..."?

We do not support such setups. systemd only supports systems where procfs is /proc, and sysfs is /sys. We hardcode that in the early boot process, hence everything else is explicitly out of focus for us.

> Since you are already using seccomp, why not filter the arguments of mount?

seccomp can only test the actual parameters passed, it cannot follow pointers. This means you can never do string checks with seccomp, and hence not check if mount() is called for procfs or any other file system.

> Some other containers, such as lxc, remount some files like /proc/sysrq-trigger read-only, but there are more files which are not protected... This can get complicated, not sure what would be the best solution here, hmmm....

It's one of the reasons why ProtectKernelTunables= is a good idea, I think: it allows us to block more stuff later on easily. That said, I figure we should add /proc/sysrq-trigger to the stuff to block right away.

poettering · 2016-08-26T11:43:13Z

I force-pushed a substantially improved version now. It addresses the issues @evverx raised and says explicitly that the new options need to be combined with SystemCallFilter=~@mount.

More importantly though, I added a substantial amount of patches on top that improve per-service namespacing. Most interestingly, I added a new mode to ProtectSystem= called "strict", which turns the blacklisting of dirs into whitelisting by mounting everything read-only, except for the stuff that isn't. This is then also implied by DynamicUser=1.

Moreover, I added code for chasing symlinks à la canonicalize_file_name() in userspace that can take root directories into account when encountering an absolute path.

Finally it turns on a lot more sandboxing features by default for our long running services.

Also contains fixes for #3996 and #3867.
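
As a rough sketch of how the whitelisting mode might be used: the unit below mounts the hierarchy read-only and then opens a single directory for writing. The binary path, the state directory and the helper name `strict_unit` are hypothetical, and pairing the mode with ReadWritePaths= is an assumption about typical use rather than something stated in this PR.

```shell
# Sketch: print a unit fragment that uses the "strict" mode described above.
# The daemon path and state directory are hypothetical.
strict_unit() {
    cat <<'EOF'
[Service]
ExecStart=/usr/bin/mydaemon
# Mount the file system hierarchy read-only for this service...
ProtectSystem=strict
# ...and whitelist only what the service has to write to.
ReadWritePaths=/var/lib/mydaemon
EOF
}

# Usage (as root):
#   strict_unit > /etc/systemd/system/mydaemon.service
#   systemctl daemon-reload
```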

evverx · 2016-08-26T12:32:29Z

blame: https://git.launchpad.net/~pitti/+git/systemd-debian
badpkg: Test dependencies are unsatisfiable. A common reason is that your testbed is out of date with respect to the archive, and you need to use a current testbed or run apt-get update or use -U.
autopkgtest [12:20:33]: ERROR: erroneous package: Test dependencies are unsatisfiable. A common reason is that your testbed is out of date with respect to the archive, and you need to use a current testbed or run apt-get update or use -U.

@martinpitt , please, have a look.

evverx · 2016-08-26T13:27:57Z

@poettering , I've just merged #3984. Can you rebase this branch?

poettering · 2016-08-26T15:15:04Z

@evverx excellent, thanks for merging #3984!

Will rebase shortly.

poettering · 2016-08-26T15:46:50Z

OK, force pushed a rebased version now! Please have a look!

alban · 2016-08-26T16:03:59Z

@alepuccetti and @lucab do you have time to check the impact on rkt? We have stage1/prepare-app mounting /proc and /sys with different options depending on userns:

See
https://github.com/coreos/rkt/blob/master/stage1/prepare-app/prepare-app.c#L196
rkt/rkt#2386
rkt/rkt#2490

ronnychevalier · 2016-08-26T16:09:16Z

man/systemd.exec.xml

+
+      <varlistentry>
+        <term><varname>ProtectControlGroups=</varname></term>
+


maybe a reference to man 7 cgroups?

makes sense, will add.

ronnychevalier · 2016-08-26T18:23:43Z

3 minor comments.

In addition, I think it would be great to have tests added for the different settings :) Currently, not all unit settings are tested, but it would be nice to add tests (when possible) whenever we add new settings, to keep the list of untested settings from growing.

Otherwise, it looks good to me (except the commit related to #3867 which I did not have time to review).

evverx · 2016-08-29T01:52:52Z

src/test/test-fs-util.c

+        assert_se(r == -ENOTDIR);
+
+        assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
+}


This hogs the CPU

diff --git a/src/test/test-fs-util.c b/src/test/test-fs-util.c
index dc6521f..96a47ba 100644
--- a/src/test/test-fs-util.c
+++ b/src/test/test-fs-util.c
@@ -114,6 +114,11 @@ static void test_chase_symlinks(void) {
         r = chase_symlinks("/etc/machine-id/foo", NULL, &result);
         assert_se(r == -ENOTDIR);
+        result = mfree(result);
+        p = strjoina(temp, "/recursive-symlink");
+        assert_se(symlink("recursive-symlink", p) >= 0);
+        r = chase_symlinks(p, NULL, &result);
+
         assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
 }

ouch! i totally forgot to bound this...
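
For illustration, here is what a bound on symlink chasing can look like. This is a shell sketch, not systemd's chase_symlinks(): it only follows the final path component, and the function name `chase_bounded` and the limit of 32 hops are arbitrary choices for this example.

```shell
# Sketch: resolve a path whose last component may be a symlink, giving up
# after a fixed number of hops so that a self-referencing link cannot
# spin forever. Not systemd's implementation.
chase_bounded() {
    path=$1
    max_hops=${2:-32}   # arbitrary limit chosen for this example
    hops=0
    while [ -L "$path" ]; do
        hops=$((hops + 1))
        if [ "$hops" -gt "$max_hops" ]; then
            echo "chase_bounded: too many levels of symbolic links" >&2
            return 1
        fi
        target=$(readlink "$path") || return 1
        case $target in
            /*) path=$target ;;                      # absolute target
            *)  path=$(dirname "$path")/$target ;;   # relative to the link's directory
        esac
    done
    printf '%s\n' "$path"
}

# Usage, mirroring the test case in the diff above:
#   ln -s recursive-symlink /tmp/recursive-symlink
#   chase_bounded /tmp/recursive-symlink || echo "gave up"
```

Without the hop counter, the loop would never terminate on the recursive-symlink case, which is the CPU hog reported here.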

poettering · 2016-08-31T10:46:03Z

This also fixes #567

poettering · 2016-09-01T19:12:55Z

Also fixes #4082

evverx · 2016-09-02T03:17:27Z

man/systemd.exec.xml

+
+        <listitem><para>Takes a boolean argument. If true, kernel variables accessible through
+        <filename>/proc/sys</filename> and <filename>/sys</filename> will be made read-only to all processes of the
+        unit. Usually, tunable kernel variables should only be written at boot-time, with the


ProtectKernelTunables protects the /proc/sysrq-trigger too.

Meaning? I think sysrq-trigger falls under the "Almost no services need to write to these" umbrella, and it doesn't need to be mentioned explicitly.

evverx · 2016-09-02T05:17:57Z

There is a regression:

-bash-4.3# rm -rf /i-dont-exist

-bash-4.3# systemd-run --wait --property ReadOnlyPaths=-/i-dont-exist sh -x -c 'mkdir -p /HEY; mount -t tmpfs tmpfs /HEY; grep HEY /proc/self/mountinfo'
Running as unit: run-u4.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 612ms

-bash-4.3# grep HEY /proc/self/mountinfo
139 18 0:38 / /HEY rw,relatime shared:64 - tmpfs tmpfs rw

i.e. ReadOnlyPaths= doesn't disconnect propagation of mounts from the service to the host
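
The `shared:64` tag in the mountinfo line above is the evidence of the leak. As a small aid for reading such output, the sketch below classifies mountinfo lines by their optional fields (the ones between the mount options and the `-` separator). The function name `mount_propagation` is an assumption; when a mount is both shared and a slave, the sketch simply reports it as shared.

```shell
# Sketch: read mountinfo lines on stdin and print "<mount point> <state>".
# State is derived from the optional fields before the "-" separator:
#   shared:N -> shared, master:N -> slave, unbindable, none -> private.
mount_propagation() {
    while read -r _id _parent _dev _root mnt _opts rest; do
        state=private
        for field in $rest; do
            case $field in
                -) break ;;                  # end of optional fields
                shared:*) state=shared ;;
                master:*) [ "$state" = shared ] || state=slave ;;
                unbindable) state=unbindable ;;
            esac
        done
        printf '%s %s\n' "$mnt" "$state"
    done
}

# Usage:
#   grep HEY /proc/self/mountinfo | mount_propagation
```

Run against the line shown above, this reports /HEY as shared, which matches the observation that the mount escaped from the service to the host.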

tixxdz · 2016-09-13T09:28:03Z

src/core/namespace.c

+                        m++;
+                }
+
+                if (protect_cgroups != protect_sysctl) {


If both of them are true then /sys/fs/cgroup will still be READWRITE, or perhaps you are validating the parameters in the callers?

Oh never mind the mounts are recursive, thanks!

keszybz · 2016-09-15T12:26:25Z

Yep, I'm also seeing the regression.

tixxdz · 2016-09-20T07:46:49Z

@keszybz @evverx I rebased this branch in #4185 and probably fixed the propagation regression, along with other changes. It would be nice to test that branch and confirm the fixes before going forward, thanks!

evverx · 2016-09-21T02:55:42Z

@tixxdz , many thanks! I'll take a look.

keszybz · 2016-09-24T17:28:43Z

Closing this one, let's pursue #4185 instead.

This commit adds the possibility to leave /sysfs, and /proc read-write. It introduces a new (undocumented) boolean env var SYSTEMD_NSPAWN_MOUNT_RW to enable this feature. If unset or set to false, the current behavior is preserved. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directory was already discussed. Related: rkt/rkt#3245

This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.

This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
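
A hedged sketch of how the environment variable from the commit message above might be used. The wrapper name `nspawn_api_vfs`, the `SYSTEMD_NSPAWN` override (there only so the wrapper can be exercised without a container) and the container path in the usage comment are assumptions; the accepted values "yes", "no" and "network" come from the commit message.

```shell
# Sketch: run systemd-nspawn with SYSTEMD_NSPAWN_API_VFS_WRITABLE set to one
# of the values listed in the commit message, rejecting anything else.
nspawn_api_vfs() {
    [ $# -ge 1 ] || { echo "usage: nspawn_api_vfs yes|no|network [args...]" >&2; return 2; }
    mode=$1
    shift
    case $mode in
        yes|no|network) ;;
        *) echo "nspawn_api_vfs: unsupported mode: $mode" >&2; return 2 ;;
    esac
    # SYSTEMD_NSPAWN is a hypothetical override used for dry runs.
    SYSTEMD_NSPAWN_API_VFS_WRITABLE=$mode "${SYSTEMD_NSPAWN:-systemd-nspawn}" "$@"
}

# Usage (as root; the container directory is hypothetical):
#   nspawn_api_vfs network -D /var/lib/machines/mycontainer
```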

poettering added the pid1 label Aug 22, 2016

poettering force-pushed the protect-kernel branch from 2118902 to 5d9c323 Compare August 24, 2016 19:05

poettering force-pushed the protect-kernel branch from 5d9c323 to 5f824b3 Compare August 26, 2016 11:36

poettering changed the title ~~Add new ProtectKernelTunables= and ProtectControlGroups= unit file settings~~ Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more Aug 26, 2016

poettering force-pushed the protect-kernel branch from 5f824b3 to 0200895 Compare August 26, 2016 15:45

ronnychevalier reviewed Aug 26, 2016

evverx reviewed Aug 29, 2016

poettering mentioned this pull request Aug 31, 2016

systemctl start not working when ReadWriteDirectories is a symlink #567

Closed

poettering mentioned this pull request Sep 1, 2016

PrivateTmp= doesn't account for symlinked /tmp #4082

Closed

2 tasks

evverx reviewed Sep 2, 2016

tixxdz reviewed Sep 13, 2016

keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Sep 15, 2016

tixxdz mentioned this pull request Sep 20, 2016

core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes #4185

Merged

keszybz closed this Sep 24, 2016

s-urbaniak mentioned this pull request Sep 28, 2016

--insecure-options=all-run should mount /proc/sys read-write rkt/rkt#3245

Open

s-urbaniak mentioned this pull request Oct 17, 2016

nspawn: R/W support for /sysfs, /proc, and /proc/sys/net #4395

Merged


