core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes#4185
Conversation
| break; | ||
| case PROTECT_HOME_YES: | ||
| r = append_target_mounts(p, root_directory, protect_home_yes_table, | ||
| ELEMENTSOF(protect_home_yes_table)); |
There was a problem hiding this comment.
Yes! fixed and force pushed. Thanks!
There was a problem hiding this comment.
It'd be a bit shorter (and safer) to simply return immediately. Then r does not need to be defined, etc.
f13c70f to
4f7279f
Compare
evverx
left a comment
There was a problem hiding this comment.
I'll test this branch tomorrow. There are two nitpicks
src/core/namespace.c
Outdated
| { "/proc/fs", READONLY, true, }, | ||
| { "/proc/irq", READONLY, true, }, | ||
| { "/sys", READONLY, false, }, | ||
| { "/sys/kernel/debug", READONLY, true, }, |
There was a problem hiding this comment.
I think we should protect the /sys/kernel/tracing too.
https://lwn.net/Articles/630526/
Finally, a new directory is created when tracefs is enabled called
/sys/kernel/tracing. This will be the new location that system admins
may mount tracefs if they are not using debugfs.
https://lkml.org/lkml/2016/5/13/327
On my system, simply enabling and disabling function graph tracer can
crash the kernel
:-)
There was a problem hiding this comment.
hehe, yep! I added the entry now, but actually Lennart already added that item about tracing ProtectTracing= in TODO 842bb2f , and I have that in a separate patch that's why... thanks
There was a problem hiding this comment.
Hm, this kind of whack'a'mole will not scale. But I don't have a better idea.
src/core/namespace.c
Outdated
|
|
||
| BindMount *m, *mounts = NULL; | ||
| ProtectSystem protect_system) | ||
| { |
src/core/namespace.c
Outdated
|
|
||
| BindMount *m, *mounts = NULL; | ||
| ProtectSystem protect_system) | ||
| { |
There was a problem hiding this comment.
I can confirm: this commit fixes #4018 (comment)
But I think we should squash 50d1975 and 1be5c8d
| for services which shall be able to install mount points in the main mount namespace. Note that the effect of | ||
| these settings may be undone by privileged processes. In order to set up an effective sandboxed environment for | ||
| a unit it is thus recommended to combine these settings with either | ||
| <varname>CapabilityBoundingSet=~CAP_SYS_ADMIN</varname> or <varname>SystemCallFilter=~@mount</varname>.</para></listitem> |
There was a problem hiding this comment.
Anyway, I can easy bypass ProtectKernelTunables and ProtectControlGroups
-bash-4.3# systemctl cat hola --no-pager
# /etc/systemd/system/hola.service
[Service]
ProtectKernelTunables=yes
CapabilityBoundingSet=~CAP_SYS_ADMIN
SystemCallFilter=~@mount
ProtectControlGroups=yes
ExecStart=/usr/bin/systemd-run sh -x -c 'echo HEY >/proc/sys/kernel/core_pattern'
-bash-4.3# echo OLD_VAL >/proc/sys/kernel/core_pattern
-bash-4.3# cat /proc/sys/kernel/core_pattern
OLD_VAL
-bash-4.3# systemctl start hola
-bash-4.3# cat /proc/sys/kernel/core_pattern
HEY
Do we really need these settings?
There was a problem hiding this comment.
InaccessiblePaths=/run/systemd/private /run/dbus/system_bus_socket
-bash-4.3# systemctl start hola
-bash-4.3# systemctl status hola
● hola.service
Loaded: loaded (/etc/systemd/system/hola.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2016-09-24 13:39:43 CEST; 3s ago
Process: 102 ExecStart=/usr/bin/systemd-run sh -x -c echo 0 > /proc/latency_stats; sleep 10; echo hhhhh; ls -lha /run/systemd/; (code=exited, status=1/FAILURE)
Main PID: 102 (code=exited, status=1/FAILURE)
Sep 24 13:39:43 fedora-tree-24 systemd[1]: Started hola.service.
Sep 24 13:39:43 fedora-tree-24 systemd-run[102]: Failed to create bus connection: Connection refused
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Main process exited, code=exited, status=1/FAILURE
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Unit entered failed state.
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Failed with result 'exit-code'.
So you have to protect the bus APIs also the user bus... in my setup I already do this, however lot of other tools don't do it... and I already discussed this @poettering we may need later a simple option ProtectBus... to abstract this...
with previous kdbus, the setup was bind mount custom endpoints on top of default bus to only show some predefined bus names... all the rest will be hidden. However these are unrelated details... which should not conflict with these settings. Of course we may need to document this for now.
Do we really need these settings?
Sure, we need them for our users and a better integration.
There was a problem hiding this comment.
@evverx Force pushed, thank you for the review!
4f7279f to
dff9dce
Compare
src/core/namespace.c
Outdated
| { "/proc/sys", READONLY, false, }, | ||
| { "/proc/sysrq-trigger", READONLY, true, /* Not always compiled into kernel */ }, | ||
| { "/sys", READONLY, false, }, | ||
| { "/sys/fs/cgroup", READWRITE, false, }, /* READONLY is set by ProtectControlGroups= option */ |
There was a problem hiding this comment.
Maybe it's just OCD, but the way that one comment is outside, while the other is it's own separate argument after a comma pains me :) Maybe you could make the whitespace a bit more frugal and remove the trailing commas?
There was a problem hiding this comment.
Ouu probably I have the opposite of OCD which is probably the same thing ? updated with some medication but not sure if they are the right one! hehe
keszybz
left a comment
There was a problem hiding this comment.
Oh, this damn thing wants me to have a separate "review" for every patch, and then another one for the whole of the patch set. It's going to be a lot of small "reviews".
src/core/namespace.c
Outdated
|
|
||
| int setup_namespace( | ||
| const char* root_directory, | ||
| static unsigned int namespace_calculate_mounts( |
There was a problem hiding this comment.
Plain "unsigned" (no "int") is customary.
src/core/namespace.c
Outdated
| ((protect_system != PROTECT_SYSTEM_NO ? 3 : 0) + | ||
| (protect_system == PROTECT_SYSTEM_FULL ? 1 : 0))); | ||
|
|
||
| return n; |
There was a problem hiding this comment.
Nice! I guess you could drop the n variable too, and just return.
Makefile.am
Outdated
| test/test-execute/exec-personality-aarch64.service \ | ||
| test/test-execute/exec-privatedevices-no.service \ | ||
| test/test-execute/exec-privatedevices-yes.service \ | ||
| test/test-execute/exec-privatedevices-no-capibility-mknod.service \ |
|
|
||
| [Service] | ||
| PrivateDevices=no | ||
| ExecStart=/bin/sh -x -c 'c=$$(capsh --print | grep "cap_mknod"); test -n "$$c"' |
There was a problem hiding this comment.
I don't know it it's a useful cleanup, but you could just do: capsh --print | grep cap_mknod, and ! capsh --print | grep cap_mknod below.
There was a problem hiding this comment.
Just used what other caps tests are using, ok updated now.
| @@ -106,26 +154,68 @@ static int append_mounts(BindMount **p, char **strv, MountMode mode) { | |||
| if (!path_is_absolute(*i)) | |||
There was a problem hiding this comment.
Would seem cleaner to do
bool ignore;
...
if (IN_SET(mode, INACCESSIBLE, READONLY, READWRITE) && startswith("-", *i)) {
(*i)++;
ignore = true;
} else
ignore = false;
and below replace (*p)->ignore with ignore .
There was a problem hiding this comment.
Ok this one is minor and not related, I added a patch to fix it.
src/core/namespace.c
Outdated
| { "/root", READWRITE, true, }, /* ProtectHome= */ | ||
| }; | ||
|
|
||
| static void append_mount(BindMount **p, const char *path, MountMode mode, bool ignore) { |
There was a problem hiding this comment.
I don't think this function should modify *p. Maybe call it configure_bind_mount and let the caller update *p.
src/core/namespace.c
Outdated
| ProtectHome protect_home, | ||
| ProtectSystem protect_system) { | ||
| unsigned n; | ||
| unsigned protect_system_cnt = 0; |
There was a problem hiding this comment.
Maybe it'd be nicer to do
protect_system_count =
protect_system == PROTECT_SYSTEM_COUNT ? ELEMENTSOF(protect_system_strict_table) :
protect_system == PROTECT_SYSTEM_YES ? ELEMENTSOF(protect_system_yes_table) :
protect_system == PROTECT_SYSTEM_FULL ? ELEMENTSOF(protect_system_yes_table) : 0;
I know it repeats protect_system a bit, but swith is annoyingly verbose.
src/core/namespace.c
Outdated
| { "/root", READONLY, true, }, | ||
| }; | ||
|
|
||
| /* ProtectHome=true table */ |
There was a problem hiding this comment.
"ProtectHome=yes", otherwise one has to think for a second if this is a copy*paste error or not.
| r = 0; | ||
| break; | ||
| default: | ||
| r = -EINVAL; |
There was a problem hiding this comment.
Shouldn't this be assert_not_reached instead? We should cover all the protect_home cases in this function.
There was a problem hiding this comment.
IMO at this stage assert_not_reached is not good since we break hard, better use it when we load the properties... and here just fail with -EINVAL, but if you think that I should change it, I will add a separate patch on top.
| break; | ||
| case PROTECT_HOME_YES: | ||
| r = append_target_mounts(p, root_directory, protect_home_yes_table, | ||
| ELEMENTSOF(protect_home_yes_table)); |
There was a problem hiding this comment.
It'd be a bit shorter (and safer) to simply return immediately. Then r does not need to be defined, etc.
src/core/execute.c
Outdated
| syscalls_found = true; | ||
| break; | ||
| } | ||
| } |
keszybz
left a comment
There was a problem hiding this comment.
Please squash this with the previous commit which implements the change.
| r = bind_remount_recursive(m->path, true, blacklist); | ||
| else if (m->mode == PRIVATE_DEV) { /* Can be readonly but the submounts can't*/ | ||
| if (mount(NULL, m->path, NULL, MS_REMOUNT|DEV_MOUNT_OPTIONS|MS_RDONLY, NULL) < 0) | ||
| r = -errno; |
There was a problem hiding this comment.
Fixed in the corresponding patch.
Let's make sure that all our rules apply to all archs the local kernel supports.
…ControlGroups= If enabled, these will block write access to /sys, /proc/sys and /proc/sys/fs/cgroup.
…r down If a dir is marked to be inaccessible then everything below it should be masked by it.
…util.c This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.
There's no point in mounting these, if they are outside of the root directory we'll move to.
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.
Instead of having all these paths everywhere, put the ones that are protected by ProtectKernelTunables= into their own table. This way it is easy to add paths and track which ones are protected.
Move out mount calculation on its own function. Actually the logic is smart enough to later drop nop and duplicates mounts, this change improves code readability. --- src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++----------- 1 file changed, 36 insertions(+), 11 deletions(-)
Documentation fixes for ReadWritePaths= and ProtectKernelTunables= as reported by Evgeny Vereshchagin.
…rivateDevices=true
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface, filesystems configuration and IRQ tuning readonly. Most of these interfaces now days should be in /sys but they are still available through /proc, so just protect them. This patch does not touch /proc/net/...
ProtectSystem= with all its different modes and other options like PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal, however currently it's a bit hard to parse that from the implementation view. Simplify it by giving each mode its own table with all paths and references to other Protect options. With this change some entries are duplicated, but we do not care since duplicate mounts are first sorted by the most restrictive mode then cleaned.
As with previous patch simplify ProtectHome and don't care about duplicates, they will be sorted by most restrictive mode and cleaned.
…vices= is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.
b24424b to
615a1f4
Compare
|
@keszybz updated, thank you! |
I think this is somehow related to the #4194 @martinpitt , please have a look |
|
Updated the packaging to current master and retried tests. |
Looks like #3008 (comment) |
|
I'm aware of the missing |
…t mount propagation Better safe.
9446fb8 to
cdfbd1f
Compare
|
@martinpitt thanks, @evverx @keszybz copied that mount propagation as a test and pushed on top, still not all options are covered. |
|
@tixxdz , thanks! |
…first-protection-v1"
…first-protection-v1" This reverts commit cc23859, reversing changes made to b8fafaf. Fixes systemd#4238.
This is a rebase of #4018 plus bug fixes that were reported there, more protection for ProtectKernelTunables and others. I am opening this to try to get a review before pushing more stuff otherwise outree patches will just grow...
Thanks!