Skip to content

core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes#4185

Merged
evverx merged 41 commits intosystemd:masterfrom
endocode:djalal-sandbox-first-protection-v1
Sep 28, 2016
Merged

core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes#4185
evverx merged 41 commits intosystemd:masterfrom
endocode:djalal-sandbox-first-protection-v1

Conversation

@tixxdz
Copy link
Member

@tixxdz tixxdz commented Sep 20, 2016

This is a rebase of #4018 plus bug fixes that were reported there, more protection for ProtectKernelTunables and others. I am opening this to try to get a review before pushing more stuff otherwise outree patches will just grow...

Thanks!

break;
case PROTECT_HOME_YES:
r = append_target_mounts(p, root_directory, protect_home_yes_table,
ELEMENTSOF(protect_home_yes_table));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing break?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! fixed and force pushed. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be a bit shorter (and safer) to simply return immediately. Then r does not need to be defined, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@tixxdz tixxdz force-pushed the djalal-sandbox-first-protection-v1 branch 2 times, most recently from f13c70f to 4f7279f Compare September 20, 2016 14:14
Copy link
Contributor

@evverx evverx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll test this branch tomorrow. There are two nitpicks

{ "/proc/fs", READONLY, true, },
{ "/proc/irq", READONLY, true, },
{ "/sys", READONLY, false, },
{ "/sys/kernel/debug", READONLY, true, },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should protect the /sys/kernel/tracing too.

https://lwn.net/Articles/630526/

Finally, a new directory is created when tracefs is enabled called
/sys/kernel/tracing. This will be the new location that system admins
may mount tracefs if they are not using debugfs.

https://lkml.org/lkml/2016/5/13/327

On my system, simply enabling and disabling function graph tracer can
crash the kernel

:-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe, yep! I added the entry now, but actually Lennart already added that item about tracing ProtectTracing= in TODO 842bb2f , and I have that in a separate patch that's why... thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this kind of whack'a'mole will not scale. But I don't have a better idea.


BindMount *m, *mounts = NULL;
ProtectSystem protect_system)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: ) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


BindMount *m, *mounts = NULL;
ProtectSystem protect_system)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm: this commit fixes #4018 (comment)
But I think we should squash 50d1975 and 1be5c8d

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@evverx evverx added the smack label Sep 21, 2016
for services which shall be able to install mount points in the main mount namespace. Note that the effect of
these settings may be undone by privileged processes. In order to set up an effective sandboxed environment for
a unit it is thus recommended to combine these settings with either
<varname>CapabilityBoundingSet=~CAP_SYS_ADMIN</varname> or <varname>SystemCallFilter=~@mount</varname>.</para></listitem>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I can easy bypass ProtectKernelTunables and ProtectControlGroups

-bash-4.3# systemctl cat hola --no-pager
# /etc/systemd/system/hola.service
[Service]
ProtectKernelTunables=yes
CapabilityBoundingSet=~CAP_SYS_ADMIN
SystemCallFilter=~@mount
ProtectControlGroups=yes
ExecStart=/usr/bin/systemd-run sh -x -c 'echo HEY >/proc/sys/kernel/core_pattern'

-bash-4.3# echo OLD_VAL >/proc/sys/kernel/core_pattern
-bash-4.3# cat /proc/sys/kernel/core_pattern
OLD_VAL
-bash-4.3# systemctl start hola
-bash-4.3# cat /proc/sys/kernel/core_pattern
HEY

Do we really need these settings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InaccessiblePaths=/run/systemd/private /run/dbus/system_bus_socket

-bash-4.3# systemctl start hola
-bash-4.3# systemctl status hola
● hola.service
   Loaded: loaded (/etc/systemd/system/hola.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sat 2016-09-24 13:39:43 CEST; 3s ago
  Process: 102 ExecStart=/usr/bin/systemd-run sh -x -c echo 0 > /proc/latency_stats; sleep 10; echo hhhhh; ls -lha /run/systemd/; (code=exited, status=1/FAILURE)
 Main PID: 102 (code=exited, status=1/FAILURE)

Sep 24 13:39:43 fedora-tree-24 systemd[1]: Started hola.service.
Sep 24 13:39:43 fedora-tree-24 systemd-run[102]: Failed to create bus connection: Connection refused
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Main process exited, code=exited, status=1/FAILURE
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Unit entered failed state.
Sep 24 13:39:43 fedora-tree-24 systemd[1]: hola.service: Failed with result 'exit-code'.

So you have to protect the bus APIs also the user bus... in my setup I already do this, however lot of other tools don't do it... and I already discussed this @poettering we may need later a simple option ProtectBus... to abstract this...
with previous kdbus, the setup was bind mount custom endpoints on top of default bus to only show some predefined bus names... all the rest will be hidden. However these are unrelated details... which should not conflict with these settings. Of course we may need to document this for now.

Do we really need these settings?

Sure, we need them for our users and a better integration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evverx Force pushed, thank you for the review!

@tixxdz tixxdz force-pushed the djalal-sandbox-first-protection-v1 branch from 4f7279f to dff9dce Compare September 24, 2016 11:33
{ "/proc/sys", READONLY, false, },
{ "/proc/sysrq-trigger", READONLY, true, /* Not always compiled into kernel */ },
{ "/sys", READONLY, false, },
{ "/sys/fs/cgroup", READWRITE, false, }, /* READONLY is set by ProtectControlGroups= option */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just OCD, but the way that one comment is outside, while the other is it's own separate argument after a comma pains me :) Maybe you could make the whitespace a bit more frugal and remove the trailing commas?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouu probably I have the opposite of OCD which is probably the same thing ? updated with some medication but not sure if they are the right one! hehe

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this damn thing wants me to have a separate "review" for every patch, and then another one for the whole of the patch set. It's going to be a lot of small "reviews".


int setup_namespace(
const char* root_directory,
static unsigned int namespace_calculate_mounts(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plain "unsigned" (no "int") is customary.

((protect_system != PROTECT_SYSTEM_NO ? 3 : 0) +
(protect_system == PROTECT_SYSTEM_FULL ? 1 : 0)));

return n;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I guess you could drop the n variable too, and just return.

Makefile.am Outdated
test/test-execute/exec-personality-aarch64.service \
test/test-execute/exec-privatedevices-no.service \
test/test-execute/exec-privatedevices-yes.service \
test/test-execute/exec-privatedevices-no-capibility-mknod.service \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not capibility !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouu done


[Service]
PrivateDevices=no
ExecStart=/bin/sh -x -c 'c=$$(capsh --print | grep "cap_mknod"); test -n "$$c"'
Copy link
Member

@keszybz keszybz Sep 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know it it's a useful cleanup, but you could just do: capsh --print | grep cap_mknod, and ! capsh --print | grep cap_mknod below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just used what other caps tests are using, ok updated now.

@@ -106,26 +154,68 @@ static int append_mounts(BindMount **p, char **strv, MountMode mode) {
if (!path_is_absolute(*i))
Copy link
Member

@keszybz keszybz Sep 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would seem cleaner to do

     bool ignore;
     ...
     if (IN_SET(mode, INACCESSIBLE, READONLY, READWRITE) && startswith("-", *i)) {
                         (*i)++;
                         ignore = true;
     } else
                         ignore = false;

and below replace (*p)->ignore with ignore .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this one is minor and not related, I added a patch to fix it.

{ "/root", READWRITE, true, }, /* ProtectHome= */
};

static void append_mount(BindMount **p, const char *path, MountMode mode, bool ignore) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this function should modify *p. Maybe call it configure_bind_mount and let the caller update *p.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok set_bind_mount()

ProtectHome protect_home,
ProtectSystem protect_system) {
unsigned n;
unsigned protect_system_cnt = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it'd be nicer to do

    protect_system_count =
                protect_system == PROTECT_SYSTEM_COUNT ? ELEMENTSOF(protect_system_strict_table) :
                protect_system == PROTECT_SYSTEM_YES ? ELEMENTSOF(protect_system_yes_table) :
                protect_system == PROTECT_SYSTEM_FULL ? ELEMENTSOF(protect_system_yes_table) : 0;

I know it repeats protect_system a bit, but swith is annoyingly verbose.

{ "/root", READONLY, true, },
};

/* ProtectHome=true table */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ProtectHome=yes", otherwise one has to think for a second if this is a copy*paste error or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

r = 0;
break;
default:
r = -EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be assert_not_reached instead? We should cover all the protect_home cases in this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO at this stage assert_not_reached is not good since we break hard, better use it when we load the properties... and here just fail with -EINVAL, but if you think that I should change it, I will add a separate patch on top.

break;
case PROTECT_HOME_YES:
r = append_target_mounts(p, root_directory, protect_home_yes_table,
ELEMENTSOF(protect_home_yes_table));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be a bit shorter (and safer) to simply return immediately. Then r does not need to be defined, etc.

syscalls_found = true;
break;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{} unneeded.

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash this with the previous commit which implements the change.

r = bind_remount_recursive(m->path, true, blacklist);
else if (m->mode == PRIVATE_DEV) { /* Can be readonly but the submounts can't*/
if (mount(NULL, m->path, NULL, MS_REMOUNT|DEV_MOUNT_OPTIONS|MS_RDONLY, NULL) < 0)
r = -errno;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may leave r unitialized!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the corresponding patch.

Let's make sure that all our rules apply to all archs the local kernel
supports.
…ControlGroups=

If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
…r down

If a dir is marked to be inaccessible then everything below it should be masked
by it.
…util.c

This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
poettering and others added 14 commits September 25, 2016 10:52
There's no point in mounting these, if they are outside of the root directory
we'll move to.
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
 src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)
Documentation fixes for ReadWritePaths= and ProtectKernelTunables=
as reported by Evgeny Vereshchagin.
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.

Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.

With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
…vices= is set

Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
@tixxdz tixxdz force-pushed the djalal-sandbox-first-protection-v1 branch from b24424b to 615a1f4 Compare September 25, 2016 11:06
@tixxdz
Copy link
Member Author

tixxdz commented Sep 25, 2016

@keszybz updated, thank you!

@evverx
Copy link
Contributor

evverx commented Sep 25, 2016

# fix location of NSS modules
cd debian/install/deb/; if $(ls usr/lib/*/libnss_* >/dev/null 2>&1); then for f in usr/lib/*/libnss_*; do mv $f ${f#usr/}; done; fi
dh_install --remaining-packages --sourcedir=debian/install/deb --list-missing
dh_install: libsystemd-dev missing files: usr/lib/*/libsystemd.so
dh_install: libudev-dev missing files: usr/lib/*/libudev.so
dh_install: usr/bin/systemd-mount exists in debian/install/deb but is not installed to anywhere
dh_install: lib/x86_64-linux-gnu/libnss_systemd.so.2 exists in debian/install/deb but is not installed to anywhere
dh_install: lib/x86_64-linux-gnu/libudev.so exists in debian/install/deb but is not installed to anywhere
dh_install: lib/x86_64-linux-gnu/libsystemd.so exists in debian/install/deb but is not installed to anywhere
dh_install: missing files, aborting

I think this is somehow related to the #4194

@martinpitt , please have a look

@martinpitt
Copy link
Contributor

martinpitt commented Sep 25, 2016

Updated the packaging to current master and retried tests.

@evverx
Copy link
Contributor

evverx commented Sep 25, 2016

modprobe: FATAL: Module scsi_debug not found in directory /lib/modules/4.4.0-38-generic
ERROR

======================================================================
ERROR: setUpClass (__main__.CryptsetupTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.s8yIUq/build.QIb/systemd-debian/debian/tests/storage", line 21, in setUpClass
    subprocess.check_call(['modprobe', 'scsi_debug'])
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['modprobe', 'scsi_debug']' returned non-zero exit status 1

Looks like #3008 (comment)

@martinpitt
Copy link
Contributor

I'm aware of the missing scsi_debug thing; long story, ignore for now, it'll come back sometime today. The other tests are fine, so tests are passing for the sake of this PR.

@tixxdz tixxdz force-pushed the djalal-sandbox-first-protection-v1 branch from 9446fb8 to cdfbd1f Compare September 27, 2016 07:25
@tixxdz
Copy link
Member Author

tixxdz commented Sep 27, 2016

@martinpitt thanks, @evverx @keszybz copied that mount propagation as a test and pushed on top, still not all options are covered.

@evverx evverx merged commit cc23859 into systemd:master Sep 28, 2016
@evverx
Copy link
Contributor

evverx commented Sep 28, 2016

@tixxdz , thanks!

yuwata added a commit to yuwata/systemd that referenced this pull request Oct 3, 2016
yuwata added a commit to yuwata/systemd that referenced this pull request Oct 5, 2016
…first-protection-v1"

This reverts commit cc23859, reversing
changes made to b8fafaf.

Fixes systemd#4238.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants