Skip to content

Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more#4018

Closed
poettering wants to merge 28 commits intosystemd:masterfrom
poettering:protect-kernel
Closed

Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more#4018
poettering wants to merge 28 commits intosystemd:masterfrom
poettering:protect-kernel

Conversation

@poettering
Copy link
Member

Let's harden our service sandbox a bit.

@evverx
Copy link
Contributor

evverx commented Aug 23, 2016

Shouldn't we use the skip_seccomp_unavailable?

@poettering
Copy link
Member Author

Shouldn't we use the skip_seccomp_unavailable?

Yupp. But this PR predates the merge of that patch, IIRC ;-)

@poettering
Copy link
Member Author

OK, I force-pushed a new version now that adds the seccomp test and is rebased on current master. Please have a look!

@evverx
Copy link
Contributor

evverx commented Aug 24, 2016

I'm not sure I understand these settings.
Actually, I can easy bypass ProtectKernelTunables:

-bash-4.3# systemctl cat hola --no-pager
# /etc/systemd/system/hola.service
[Service]
Type=oneshot
ProtectKernelTunables=yes
ExecStart=/bin/sh -e -x -c 'mkdir -p /new-proc; mount -t proc proc /new-proc; echo hey >/new-proc/sys/kernel/core_pattern'

-bash-4.3# cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e

-bash-4.3# systemctl start hola

-bash-4.3# cat /proc/sys/kernel/core_pattern
hey

@poettering
Copy link
Member Author

poettering commented Aug 24, 2016

Yes, you may, unless combined with other options they can be circumvented. We actually document that fact for some of the namespace-related options, but I didn't mention that here again.

These options are very effective as soon as you either use CapabilityBoundingSet=~CAP_SYS_ADMIN or use SystemCallFilter=@mount. I figure we should document that, for these two options explicitly.

My intention is to introduce RestrictMounts= as another simple boolean soon, which acts as a combination of three things:

  • SystemCallFilter=@mount
  • MountFlags=slave
  • an effective block on accessing /dev/fuse (using the devices cgroup controller)

As soon as we have that we should probably recommend to always set RestrictMounts= if people use any of the various namespacing options, include PrivateTmp=, PrivateDevices=, ReadOnlyDirectories=, and so on.

I'd claim these options are still useful even without this, as they protect from accidental modifications.

@evverx
Copy link
Contributor

evverx commented Aug 24, 2016

I'd claim these options are still useful even without this, as they protect from accidental modifications.

Indeed. But why not to use ReadOnlyPaths

ReadOnlyPaths=/sys /proc/sys
ReadOnlyPaths=/sys/fs/cgroup

?

RestrictMounts= as another simple boolean soon, which acts as a combination of three things:

Sounds good. But we ignore SystemCallFilter sometimes. So, this is not a "portable" solution.

@poettering
Copy link
Member Author

Indeed. But why not to use ReadOnlyPaths

Note ReadOnlyPaths=/proc/sys /sys is not entirely equivalent to ProtectKernelTunables=1, as the latter also installs a syscall filter. But the major difference is really more on the psychological level. I think we should add high-level easy-to-use "meta" options for the common, suggested sandboxing options, and ReadOnlyPaths= is more a low-level option for those who know stuff. I think the low-level stuff should always take precedence and trump the high-level options.

It's mostly a matter of being able to tell in friendly words: "you want that your service can't change kernel tunables? OK, then set the aptly named ProtectKernelTunables=1" and there you go.

And then there's another reason: my intention is to eventually add a new concept to systemd (via a generator) that introduces services that are by default locked down, and where features need to be opened up explicitly if that's desired. But in that case it's a lot nicer conceptually if we have friendly bools instead of lists of dirs, since bools can easily be unset, but we have no nice way to remove items from list (we only have resetting, via an empty assignment).

I hope that makes sense.

@poettering
Copy link
Member Author

Sounds good. But we ignore SystemCallFilter sometimes. So, this is not a "portable" solution.

Sure, people get what they pay for. If they turn off seccomp during build they will get much less secure sandboxes...

@tixxdz
Copy link
Member

tixxdz commented Aug 25, 2016

@poettering few questions:

  1. Shouldn't the documentation says that "assuming procfs and sysfs are mounted in /proc and /sys..." ?

  2. Since you are already using seccomp why not filter the arguments of mount ? I mean don't expose it, just hide since these are details, inspect the arguments of mount() if they are procfs sysfs then deny. I am not sure how this will compose with your plan here Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more #4018 (comment) but at least ProtectKernelTunables= should be safe by default, and should not require SystemCallFilter=@mount or RestrictMounts= which are for all mounts ?!

  3. Some other containers lxc remount some files like /proc/sysrq-trigger read-only, but there are more files which are not protected... This can get complicated, not sure what would be the best solution here, hmmm....

Thank you!

@poettering
Copy link
Member Author

  1. Shouldn't the documentation says that "assuming procfs and sysfs are mounted in /proc and /sys..." ?

We do not support such setups. systemd only supports systems where procfs is /proc, and sysfs is /sys. We hardcode that in the early boot process, hence everything else is explicitly out of focus for us.

  1. Since you are already using seccomp why not filter the arguments of mount ?

seccomp can only test the actual parameters passed, it cannot follow pointers. This means you can never do string checks with seccomp, and hence not check if mount() is called for procfs or any other file system.

  1. Some other containers lxc remount some files like /proc/sysrq-trigger read-only, but there are more files which are not protected... This can get complicated, not sure what would be the best solution here, hmmm....

It's one of the reasons why ProtectKernelTunables= is a good idea I think: it allows us to block more stuff later on easily. That said, I figure we should add /proc/sysrq-trigger of the stuff to block right-away.

@poettering poettering changed the title Add new ProtectKernelTunables= and ProtectControlGroups= unit file settings Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more Aug 26, 2016
@poettering
Copy link
Member Author

I force pushed a substantially improved version now. It addresses @evverx's issues raised and says explicitly that thew new options need to be combined with SystemCallFilter=~@mount.

More importantly though I added a substantial amount of patches on top that improve per-service namespacing. Most interestingly I added a new mode to ProtectSystem= called "strict" which turns the blacklisting of dirs into a whitelisting, by mount everything read-only, except for the stuff that isn't. This is then also implied by DynamicUser=1.

Moreover I added code for chasing symlinks á la canononicalize_file_name() in userspace, that can take root directories into account when encountering an absolute path.

Finally it turns on a lot more sandboxing features by default for our long running services.

Also contains fixes for #3996 and #3867.

@evverx
Copy link
Contributor

evverx commented Aug 26, 2016

blame: https://git.launchpad.net/~pitti/+git/systemd-debian
badpkg: Test dependencies are unsatisfiable. A common reason is that your testbed is out of date with respect to the archive, and you need to use a current testbed or run apt-get update or use -U.
autopkgtest [12:20:33]: ERROR: erroneous package: Test dependencies are unsatisfiable. A common reason is that your testbed is out of date with respect to the archive, and you need to use a current testbed or run apt-get update or use -U.

@martinpitt , please, have a look.

@evverx
Copy link
Contributor

evverx commented Aug 26, 2016

@poettering , I've just merged #3984. Can you rebase this branch?

@poettering
Copy link
Member Author

@evverx excellent, thanks for merging #3984!

Will rebase shortly.

@poettering
Copy link
Member Author

OK, force pushed a rebased version now! Please have a look!

@alban
Copy link
Member

alban commented Aug 26, 2016

@alepuccetti and @lucab do you have time to check the impact on rkt? We have stage1/prepare-app mounting/proc and /sys with different options depending on userns:

See
https://github.com/coreos/rkt/blob/master/stage1/prepare-app/prepare-app.c#L196
rkt/rkt#2386
rkt/rkt#2490


<varlistentry>
<term><varname>ProtectControlGroups=</varname></term>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a reference to man 7 cgroups?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, will add.

@ronnychevalier
Copy link
Member

3 minor comments.

In addition, I think it would be great to have tests added for the different settings :) Currently, not all unit settings are tested, but it would be nice to add tests (when possible) when we add new settings to avoid the list to grow.

Otherwise, it looks good to me (except the commit related to #3867 which I did not have time to review).

assert_se(r == -ENOTDIR);

assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hogs the CPU

diff --git a/src/test/test-fs-util.c b/src/test/test-fs-util.c
index dc6521f..96a47ba 100644
--- a/src/test/test-fs-util.c
+++ b/src/test/test-fs-util.c
@@ -114,6 +114,11 @@ static void test_chase_symlinks(void) {
         r = chase_symlinks("/etc/machine-id/foo", NULL, &result);
         assert_se(r == -ENOTDIR);

+        result = mfree(result);
+        p = strjoina(temp, "/recursive-symlink");
+        assert_se(symlink("recursive-symlink", p) >= 0);
+        r = chase_symlinks(p, NULL, &result);
+
         assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
 }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch! i totall forgot to bound this...

@poettering
Copy link
Member Author

This also fixes #567

@poettering
Copy link
Member Author

Also fixes #4082


<listitem><para>Takes a boolean argument. If true, kernel variables accessible through
<filename>/proc/sys</filename> and <filename>/sys</filename> will be made read-only to all processes of the
unit. Usually, tunable kernel variables should only be written at boot-time, with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProtectKernelTunables protects the /proc/sysrq-trigger too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning? I think sysrq-trigger falls under the "Almost no services need to write to these" umbrella, and it doesn't need to be mentioned explicitly.

@evverx
Copy link
Contributor

evverx commented Sep 2, 2016

There is the regression:

-bash-4.3# rm -rf /i-dont-exist

-bash-4.3# systemd-run --wait --property ReadOnlyPaths=-/i-dont-exist sh -x -c 'mkdir -p /HEY; mount -t tmpfs tmpfs /HEY; grep HEY /proc/self/mountinfo'
Running as unit: run-u4.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 612ms

-bash-4.3# grep HEY /proc/self/mountinfo
139 18 0:38 / /HEY rw,relatime shared:64 - tmpfs tmpfs rw

i.e. ReadOnlyPath doesn't disconnect propagation of mounts from the service to the host

m++;
}

if (protect_cgroups != protect_sysctl) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both of them are true then /sys/fs/cgroup will still be READWRITE or perhaps you are validating the parameters in callers ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh never mind the mounts are recursive, thanks!

@keszybz
Copy link
Member

keszybz commented Sep 15, 2016

Yep, I'm also seeing the regression.

@tixxdz
Copy link
Member

tixxdz commented Sep 20, 2016

@keszybz @evverx I rebased this branch here #4185 and probably fixed the propagation regression and other changes, it would be nice to test that branch and confirm the fixes before going forward, thanks!

@evverx
Copy link
Contributor

evverx commented Sep 21, 2016

@tixxdz , many thanks! I'll take a look.

@keszybz
Copy link
Member

keszybz commented Sep 24, 2016

Closing this one, let's pursue #4185 instead.

@keszybz keszybz closed this Sep 24, 2016
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 17, 2016
This commit adds the possibility to leave /sysfs, and /proc read-write.
It introduces a new (undocumented) boolean env var SYSTEMD_NSPAWN_MOUNT_RW
to enable this feature. If unset or set to false, the current behavior
is preserved.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directory was already discussed.

Related: rkt/rkt#3245
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 17, 2016
This commit adds the possibility to leave /sysfs, and /proc read-write.
It introduces a new (undocumented) boolean env var SYSTEMD_NSPAWN_MOUNT_RW
to enable this feature. If unset or set to false, the current behavior
is preserved.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directory was already discussed.

Related: rkt/rkt#3245
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 18, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 18, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 20, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Oct 21, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 4, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 7, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 11, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 14, 2016
This commit adds the possibility to leave /sysfs, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sysfs, and /proc/sys will be read-write.
If set to "no", /sysfs, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 18, 2016
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 22, 2016
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
s-urbaniak added a commit to s-urbaniak/systemd that referenced this pull request Nov 25, 2016
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pid1 reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks

Development

Successfully merging this pull request may close these issues.

7 participants