Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more#4018
Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict unit file settings and more#4018poettering wants to merge 28 commits intosystemd:masterfrom
Conversation
|
Shouldn't we use the |
Yupp. But this PR predates the merge of that patch, IIRC ;-) |
2118902 to
5d9c323
Compare
|
OK, I force-pushed a new version now that adds the seccomp test and is rebased on current master. Please have a look! |
|
I'm not sure I understand these settings. -bash-4.3# systemctl cat hola --no-pager
# /etc/systemd/system/hola.service
[Service]
Type=oneshot
ProtectKernelTunables=yes
ExecStart=/bin/sh -e -x -c 'mkdir -p /new-proc; mount -t proc proc /new-proc; echo hey >/new-proc/sys/kernel/core_pattern'
-bash-4.3# cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e
-bash-4.3# systemctl start hola
-bash-4.3# cat /proc/sys/kernel/core_pattern
hey |
|
Yes, you may, unless combined with other options they can be circumvented. We actually document that fact for some of the namespace-related options, but I didn't mention that here again. These options are very effective as soon as you either use My intention is to introduce
As soon as we have that we should probably recommend to always set I'd claim these options are still useful even without this, as they protect from accidental modifications. |
Indeed. But why not to use ReadOnlyPaths=/sys /proc/sys
ReadOnlyPaths=/sys/fs/cgroup?
Sounds good. But we ignore |
Note It's mostly a matter of being able to tell in friendly words: "you want that your service can't change kernel tunables? OK, then set the aptly named ProtectKernelTunables=1" and there you go. And then there's another reason: my intention is to eventually add a new concept to systemd (via a generator) that introduces services that are by default locked down, and where features need to be opened up explicitly if that's desired. But in that case it's a lot nicer conceptually if we have friendly bools instead of lists of dirs, since bools can easily be unset, but we have no nice way to remove items from list (we only have resetting, via an empty assignment). I hope that makes sense. |
Sure, people get what they pay for. If they turn off seccomp during build they will get much less secure sandboxes... |
|
@poettering few questions:
Thank you! |
We do not support such setups. systemd only supports systems where procfs is /proc, and sysfs is /sys. We hardcode that in the early boot process, hence everything else is explicitly out of focus for us.
seccomp can only test the actual parameters passed, it cannot follow pointers. This means you can never do string checks with seccomp, and hence not check if mount() is called for procfs or any other file system.
It's one of the reasons why ProtectKernelTunables= is a good idea I think: it allows us to block more stuff later on easily. That said, I figure we should add /proc/sysrq-trigger of the stuff to block right-away. |
5d9c323 to
5f824b3
Compare
|
I force pushed a substantially improved version now. It addresses @evverx's issues raised and says explicitly that thew new options need to be combined with More importantly though I added a substantial amount of patches on top that improve per-service namespacing. Most interestingly I added a new mode to Moreover I added code for chasing symlinks á la Finally it turns on a lot more sandboxing features by default for our long running services. |
@martinpitt , please, have a look. |
|
@poettering , I've just merged #3984. Can you rebase this branch? |
5f824b3 to
0200895
Compare
|
OK, force pushed a rebased version now! Please have a look! |
|
@alepuccetti and @lucab do you have time to check the impact on rkt? We have stage1/prepare-app mounting/proc and /sys with different options depending on userns: See |
|
|
||
| <varlistentry> | ||
| <term><varname>ProtectControlGroups=</varname></term> | ||
|
|
There was a problem hiding this comment.
maybe a reference to man 7 cgroups?
There was a problem hiding this comment.
makes sense, will add.
|
3 minor comments. In addition, I think it would be great to have tests added for the different settings :) Currently, not all unit settings are tested, but it would be nice to add tests (when possible) when we add new settings to avoid the list to grow. Otherwise, it looks good to me (except the commit related to #3867 which I did not have time to review). |
| assert_se(r == -ENOTDIR); | ||
|
|
||
| assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0); | ||
| } |
There was a problem hiding this comment.
This hogs the CPU
diff --git a/src/test/test-fs-util.c b/src/test/test-fs-util.c
index dc6521f..96a47ba 100644
--- a/src/test/test-fs-util.c
+++ b/src/test/test-fs-util.c
@@ -114,6 +114,11 @@ static void test_chase_symlinks(void) {
r = chase_symlinks("/etc/machine-id/foo", NULL, &result);
assert_se(r == -ENOTDIR);
+ result = mfree(result);
+ p = strjoina(temp, "/recursive-symlink");
+ assert_se(symlink("recursive-symlink", p) >= 0);
+ r = chase_symlinks(p, NULL, &result);
+
assert_se(rm_rf(temp, REMOVE_ROOT|REMOVE_PHYSICAL) >= 0);
}There was a problem hiding this comment.
ouch! i totall forgot to bound this...
|
This also fixes #567 |
|
Also fixes #4082 |
|
|
||
| <listitem><para>Takes a boolean argument. If true, kernel variables accessible through | ||
| <filename>/proc/sys</filename> and <filename>/sys</filename> will be made read-only to all processes of the | ||
| unit. Usually, tunable kernel variables should only be written at boot-time, with the |
There was a problem hiding this comment.
ProtectKernelTunables protects the /proc/sysrq-trigger too.
There was a problem hiding this comment.
Meaning? I think sysrq-trigger falls under the "Almost no services need to write to these" umbrella, and it doesn't need to be mentioned explicitly.
|
There is the regression: -bash-4.3# rm -rf /i-dont-exist
-bash-4.3# systemd-run --wait --property ReadOnlyPaths=-/i-dont-exist sh -x -c 'mkdir -p /HEY; mount -t tmpfs tmpfs /HEY; grep HEY /proc/self/mountinfo'
Running as unit: run-u4.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 612ms
-bash-4.3# grep HEY /proc/self/mountinfo
139 18 0:38 / /HEY rw,relatime shared:64 - tmpfs tmpfs rwi.e. |
| m++; | ||
| } | ||
|
|
||
| if (protect_cgroups != protect_sysctl) { |
There was a problem hiding this comment.
If both of them are true then /sys/fs/cgroup will still be READWRITE or perhaps you are validating the parameters in callers ?
There was a problem hiding this comment.
Oh never mind the mounts are recursive, thanks!
|
Yep, I'm also seeing the regression. |
|
@tixxdz , many thanks! I'll take a look. |
|
Closing this one, let's pursue #4185 instead. |
This commit adds the possibility to leave /sysfs, and /proc read-write. It introduces a new (undocumented) boolean env var SYSTEMD_NSPAWN_MOUNT_RW to enable this feature. If unset or set to false, the current behavior is preserved. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directory was already discussed. Related: rkt/rkt#3245
This commit adds the possibility to leave /sysfs, and /proc read-write. It introduces a new (undocumented) boolean env var SYSTEMD_NSPAWN_MOUNT_RW to enable this feature. If unset or set to false, the current behavior is preserved. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directory was already discussed. Related: rkt/rkt#3245
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sysfs, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sysfs, and /proc/sys will be read-write. If set to "no", /sysfs, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
Let's harden our service sandbox a bit.