[RFC] run systemd in an unprivileged container by giuseppe · Pull Request #4280 · systemd/systemd

giuseppe · 2016-10-04T11:10:22Z

a container might not have enough capabilities to run systemd. This series fixes these cases:

missing CAP_AUDIT_[READ|WRITE].
setgroups blocked via /proc/[pid]/setgroups
Fail to set capabilities from the container.
Fail to unmount.

The goal is to be able to run systemd as a non-privileged user via bubblewrap, which provides only a safe set of caps to be used in a container.

bubblewrap PR here: containers/bubblewrap#101

poettering

We don't use "Signed-off-by" in systemd, that's a kernel thing.

(This means: we prefer patches without this, but it is not a blocker to have them)

giuseppe · 2016-10-04T12:55:57Z

thanks for the quick review. I have pushed a new version without the "Signed-off-by" part ⬆️

poettering · 2016-10-04T12:58:36Z

sorry, this wasn't supposed to be considered "reviewed", I am not used to the new github facilities for this. Sorry for the confusion. I am still reviewing...

poettering · 2016-10-04T12:59:20Z

What is an "unprivileged" container here? One with userns but 1:1 uid mapping? Or what precisely do you mean here?

poettering · 2016-10-04T13:01:16Z

src/basic/capability-util.c


 finish:
+        if (r == -EPERM && (detect_container() > 0))
+                return 0;


I really don't like this bit. The thing is that this code is used in various places in systemd, not just in PID 1 (and it should be used everywhere, I think). The logic you added only really makes sense in PID 1 however, afaics...

Moreover I don't really grok what precisely actually fails here... I mean, dropping caps should always be safe: if we have caps we should be able to safely drop them, and if we don't have them then we shouldn't need to drop them and hence not need to generate an error. I don't really grok here what syscall precisely fails in your case. Can you elaborate?

I'd much prefer if we'd do as little container specific checks as necessary here. And in this case it appears that dropping the caps should be fully safe if we have no caps in the first place... hence, please elaborate.

(also, having such "blanket tape over errors a-posteriori" concepts makes all my alarms sound loudly... I would really prefer that if we ignore an error in some condition, it should be as precise as possible, and hence be right to the one syscall where it matters, but not in a global cleanup path for a function, if you follow what I mean)

poettering · 2016-10-04T13:06:12Z

src/basic/capability-util.c

+        if (r < 0 && errno == EPERM && (detect_container() > 0))
+                return 0;
        return r;
 }


I think it would be better to guard this one function on the caller side, and bind it to the availability of the right kind of cap?

poettering · 2016-10-04T13:06:35Z

src/basic/capability-util.c

                return log_error_errno(errno, "Failed to change group ID: %m");

-        if (setgroups(0, NULL) < 0)
+        if (setgroups(0, NULL) < 0 && (errno != EPERM || (detect_container() <= 0)))


what is this about? why wouldn't we be able to drop groups ever?

poettering · 2016-10-04T13:07:16Z

src/basic/capability-util.c

                        return log_error_errno(errno, "Failed to enable capabilities bits: %m");

-                if (cap_set_proc(d) < 0)
+                if (cap_set_proc(d) < 0 && (errno != EPERM || (detect_container() <= 0)))


what is this about? why wouldn't we be able to drop caps we have? (this is the same case as the bounding set thing further up)

AFAICS, we don't really check that we have the capabilities listed in keep_capabilities, so cap_set_proc can end up adding more capabilities than we originally had. Anyway, I am going to drop this change and require the container to have CAP_NET_RAW and CAP_NET_BROADCAST, as required by systemd-networkd.

hmm I'd argue that we probably should fail if we try to reduce the set of caps to caps we don't have...

poettering · 2016-10-04T13:08:03Z

src/basic/capability-util.c

-            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
+            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0)) {
+                if (errno == EPERM && (detect_container() > 0))
+                        return 0;


hmm, cap_set_flag() only alters the internal state of the "tmp_cap" userspace object, no? This should never fail with EPERM, should it?

poettering · 2016-10-04T13:08:19Z

src/basic/capability-util.c


-        if (cap_set_proc(tmp_cap) < 0)
+        if (cap_set_proc(tmp_cap) < 0) {
+                if (errno == EPERM && (detect_container() > 0))


Same case as above...

poettering · 2016-10-04T13:10:09Z

src/basic/user-util.c


-        if (setresuid(0, 0, 0) < 0)
+        if (setresuid(0, 0, 0) < 0 && (errno != EPERM || (detect_container() <= 0)))
                return -errno;


not sure i grok this, but what precisely is failing here in your case? why shouldn't we be able to reset our creds to root?

poettering · 2016-10-04T13:11:56Z

src/core/mount.c

        m->control_pid = 0;

-        if (is_clean_exit(code, status, NULL))
+        if (is_clean_exit(code, status, NULL) || is_container)


umpf, this one I really don't like. If umounting fails in your container, we should fail to the unit. To make the error go away we should rather not start or stop the unit, rather than just ignoring the result if we do...

poettering

Not sure I am using the new github review tools properly, but I left some comments now, this needs more discussion I figure.

giuseppe · 2016-10-04T14:24:23Z

thanks for your comments. I marked this as RFC to start a discussion and see which changes make sense in systemd and which definitely should not be included. The goal is to be able to run systemd from bubblewrap while requiring as few caps as possible in the container. I've probably guarded more places with detect_container than needed, just to be safe; I am going to drop these and send a stripped-down version.

johannbg · 2016-10-04T15:14:17Z

This rewrite work could cause significant regressions downstream, so I have to ask: why would systemd make a drastic rewrite of itself, with potential bugs and breakage that can affect several distributions, just for bubblewrap?

What's the use case here outside bubblewrap that everyone, from embedded to server to desktop, will benefit from?

martinpitt · 2016-10-04T15:36:31Z

I looked at that in https://launchpad.net/bugs/1576341 a while ago (comment 5 ff.), as it's similar with LXD (which runs containers as unprivileged user by default). Things like dev-hugepages.mount and systemd-journald-audit.socket fail there because the simple capability checks are insufficient: capabilities are not sufficiently namespace aware. This is one of these ugly "emergent" bugs where lxd, systemd, and the units are all doing the right thing from their perspective, but in combination it doesn't work.

giuseppe · 2016-10-04T15:52:26Z

I have polished the patches and handle only these two cases:

disable audit if we have not enough permissions to create the NETLINK_AUDIT socket
do not fail if it is not possible to use setgroups

poettering · 2016-10-04T21:35:13Z

bubblewrap appears to be a sandbox tool for desktop apps, no? why would you want to make systemd run in that? systemd really only makes sense in environments that actually try to emulate more complete systems that actually need a PID1 (think: nspawn).

poettering · 2016-10-04T21:36:32Z

can you elaborate on the setgroups() thing, why is that necessary?

giuseppe · 2016-10-05T07:55:25Z

we are trying to run containers as non root users through bubblewrap. In order to do that, some changes are required in bubblewrap as well.

The setgroups change is needed when the use of setgroups is blocked through /proc/$PID/setgroups. Blocking setgroups in the new user namespace prevents a container from gaining privileges by dropping its supplementary groups (CVE-2014-8989); in other words, it prevents a process from circumventing restrictions placed on a group by dropping its additional groups.
Once setgroups is blocked in a user namespace, it is not possible to enable it again.
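
To make the mechanism above concrete, here is a minimal sketch of how a process could check whether setgroups is blocked for it. The helper names `setgroups_text_allows` and `probe_setgroups_allowed` are hypothetical and are not taken from this PR; the sketch only assumes that the file contains either "allow" or "deny" followed by a newline, and that older kernels may not provide the file at all.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: interpret the text read from /proc/<pid>/setgroups.
 * Anything other than exactly "deny" is treated as "allowed". */
bool setgroups_text_allows(const char *text) {
        size_t len = strcspn(text, "\n");   /* length without the trailing newline */
        return !(len == 4 && strncmp(text, "deny", 4) == 0);
}

/* Hypothetical probe: returns 1 if setgroups() may be used, 0 if it is
 * denied, and a negative errno-style value on unexpected errors. */
int probe_setgroups_allowed(void) {
        char buf[16] = "";
        FILE *f = fopen("/proc/self/setgroups", "re");

        if (!f)
                /* Older kernels lack the file; assume setgroups() is usable. */
                return errno == ENOENT ? 1 : -errno;

        if (!fgets(buf, sizeof buf, f)) {
                int r = ferror(f) ? -EIO : 1;   /* empty file: assume allowed */
                fclose(f);
                return r;
        }

        fclose(f);
        return setgroups_text_allows(buf) ? 1 : 0;
}
```

Inside a user namespace created with setgroups set to "deny", `probe_setgroups_allowed()` would be expected to return 0, and any later `setgroups()` call fails with EPERM.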

poettering · 2016-10-05T13:37:41Z

@poettering yes, that is exactly the case. I will work on the maybe_setgroups wrapper function. Is user-util.c the right place for it?

Yes, that looks like a good place for it.

poettering · 2016-10-05T13:38:20Z

@poettering , @giuseppe , thanks for the explanation! I'm reading https://github.com/projectatomic/bubblewrap. I wonder why do people want to run "system"-containers inside the bubblewrap. But I guess I'm missing something

Yeah, I don't really grok the usecase either. But well, I think it's fine to support unprivileged userns this way...

evverx · 2016-10-05T14:27:19Z

I wonder why do people want to run "system"-containers inside the bubblewrap.

Yeah, I don't really grok the usecase either.

@giuseppe , why do people need this?

giuseppe · 2016-10-05T16:49:57Z

@evverx our goal is to leverage user namespaces to run containers as a non privileged user. The main reason is to increase security, as root won't run these containers (partially true as bubblewrap is a setuid program and we will still leave some capabilities in the container).

evverx · 2016-10-05T17:43:58Z

@giuseppe , thanks!

partially true as bubblewrap is a setuid program and we will still leave some capabilities in the container

Why not implement something like --private-users=pick

The value "pick" turns on user namespacing. In this case the UID/GID range is automatically chosen.

?

core: do not fail in a container if we can't use setgroups …
It might be blocked through /proc/PID/setgroups

How does this commit affect Group=, SupplementaryGroups= inside the container?

evverx · 2016-10-05T19:01:49Z

@giuseppe , another question: do /proc/self/uid_map, /proc/self/gid_map look like

0 some-id 1

?
So, how does the User-setting work? You should get setresuid failed: Invalid argument
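
The concern above can be illustrated with a small sketch that parses one line of /proc/self/uid_map and checks whether a given ID falls inside the mapped range. The struct and function names are hypothetical; the only assumption is the three-column format `<inside> <outside> <count>` quoted in the comment. With a single mapping such as `0 100000 1`, only UID 0 is mapped, which is why setresuid() to any other UID fails with EINVAL.

```c
#include <stdbool.h>
#include <stdio.h>

/* One line of /proc/<pid>/uid_map or gid_map: "<inside> <outside> <count>". */
struct id_map_entry {
        unsigned long inside;   /* first ID as seen inside the namespace */
        unsigned long outside;  /* first ID as seen in the parent namespace */
        unsigned long count;    /* length of the mapped range */
};

/* Hypothetical parser: returns true if the line has all three fields. */
bool parse_id_map_line(const char *line, struct id_map_entry *e) {
        return sscanf(line, "%lu %lu %lu", &e->inside, &e->outside, &e->count) == 3;
}

/* True if `id` (as seen inside the namespace) is covered by the entry. */
bool id_is_mapped(const struct id_map_entry *e, unsigned long id) {
        return id >= e->inside && id - e->inside < e->count;
}
```

A unit with User= set to an ID for which `id_is_mapped()` is false on every map line cannot be started in that namespace, regardless of how setgroups is handled.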

evverx · 2016-10-05T20:42:04Z

src/basic/capability-util.c

                return log_error_errno(errno, "Failed to change group ID: %m");

-        if (setgroups(0, NULL) < 0)
+        if (maybe_setgroups(0, NULL) < 0)


$ git grep drop_priv src/ | grep -v test
src/basic/capability-util.c:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
src/basic/capability-util.h:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);
src/coredump/coredump.c: return drop_privileges(uid, gid, 0);
src/network/networkd.c: r = drop_privileges(uid, gid,
src/resolve/resolved.c: r = drop_privileges(uid, gid,
src/timesync/timesyncd.c: r = drop_privileges(uid, gid, (1ULL << CAP_SYS_TIME));

Seems like you can't run all these binaries inside the unprivileged user namespace: setresuid(uid, uid, uid) fails. Why do we need this?

giuseppe · 2016-10-05T20:44:02Z

thanks for the additional input. I am currently mapping the uids/gids 1-999 to some high range of host uids/gids, similarly to what --private-users=pick does, so that setting uid/gid works. Differently, the SupplementaryGroups= directive won't work as setgroups is blocked in the container. Even using a range of users won't solve the security problem of being able to drop groups.

evverx · 2016-10-05T20:45:47Z

src/core/execute.c

                }

-                if (setgroups(k, gids) < 0) {
+                if (maybe_setgroups(k, gids) < 0) {


SupplementaryGroups= is no-op. Does it make sense? I guess we should fail if we can't set supplementary groups.

I have pushed another version that reverts this change, so that maybe_setgroups is used only for dropping auxiliary groups

evverx · 2016-10-05T21:16:29Z

@giuseppe , sorry, I missed your previous comment

I am currently mapping the uids/gids 1-999

I see, containers/bubblewrap@9182998
But containers/bubblewrap@9182998#r81550526

This breaks any use of non-privileged user namespaces use, because that only allows a single mapping.

,

Differently, the SupplementaryGroups= directive won't work as setgroups is blocked in the container.

#4280 (comment)

I guess we should fail if we can't set supplementary groups.

poettering

We are getting there! But a few more notes.

poettering · 2016-10-06T08:39:58Z

src/basic/user-util.c

 }
+
+int maybe_setgroups(size_t size, const gid_t *list) {
+        static int cached_can_setgroups = -1;


hmm, i wonder if it's a good idea to cache this, after all it can change during runtime? Also, afaics we are unlikely to call this frequently anyway, hence there's no point in caching this, is there?

I think it should be fine to cache it, /proc/PID/setgroups can be set only before the gid_map file is written.

But what about:

maybe_setgroups(); p = clone(NEWUSER); ...child... maybe_setgroups(); ...child...

?

that won't change the /proc/PID/setgroups value for the new process. It could be a problem if we cached "allow", then cloned a new user namespace and changed setgroups to "deny" for the new user namespace before setting its gid_map file.

poettering · 2016-10-06T08:40:51Z

src/basic/user-util.c

+                        /* old kernels don't have /proc/self/setgroups, so assume we can use it */
+                        cached_can_setgroups = true;
+                } else {
+                        cached_can_setgroups = !!strcmp(setgroups_content, "deny");


we try to avoid calling strcmp() if we are not interested in ordering between the strings. If you just want to compare two strings, please use streq(), which is a macro wrapper around it and helps readability.

poettering · 2016-10-06T08:43:09Z

src/basic/user-util.c

+                return 0;
+
+        return setgroups(size, list);
+}


I think one further refinement for this call would make sense: only skip the setgroups() invocation if the groups list is reset (i.e. size == 0), and continue to generate an error if the list is anything but empty. This way if people actually set SupplementaryGroups= to something non-empty, then we'll continue to fail (and I think we should), but for the typical case where the list is just supposed to be reset we eat up the error. Does that make sense?

this should be documented. No?

oh, this is not a user-visible change. Right?

right, it is not a user-visible change: we just don't attempt to drop all the auxiliary groups when setgroups is blocked
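
The refinement discussed in this comment can be sketched as follows. This is an illustration under stated assumptions and not the code that was merged: `should_skip_setgroups` and `maybe_setgroups_sketch` are hypothetical names, and the `can_setgroups` parameter is added here only so the decision is easy to test. The function in the diff takes just `size` and `list` and reads /proc/self/setgroups itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for setgroups() in <grp.h> */
#endif
#include <grp.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Skip only the "reset to an empty list" case, and only when setgroups()
 * is known to be denied. Every other combination goes to the kernel. */
bool should_skip_setgroups(size_t size, bool can_setgroups) {
        return size == 0 && !can_setgroups;
}

/* Hypothetical wrapper. `can_setgroups` is the result of probing
 * /proc/self/setgroups, passed in by the caller in this sketch. */
int maybe_setgroups_sketch(size_t size, const gid_t *list, bool can_setgroups) {
        if (should_skip_setgroups(size, can_setgroups))
                return 0;       /* nothing to drop that we are allowed to drop */

        /* A non-empty list still fails with EPERM when setgroups is denied,
         * so a non-empty SupplementaryGroups= keeps producing an error. */
        return setgroups(size, list);
}
```

With this shape, a request for a non-empty group list is never silently ignored; only the common "clear all supplementary groups" call becomes a no-op in a namespace where setgroups is denied.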

poettering · 2016-10-06T08:43:35Z

src/basic/user-util.c

 #include "strv.h"
 #include "user-util.h"
 #include "utf8.h"
+#include "virt.h"


we don't need this #include anymore, do we?

poettering · 2016-10-06T08:44:42Z

src/basic/audit-util.c

-                if (fd < 0)
-                        cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT;
+                if (fd < 0) {
+                        cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM;


not that it matters, but we have this pretty macro IN_SET() for cases like this:

cached_use = !IN_SET(errno, EAFNOSUPPORT, EPROTONOSUPPORT, EPERM);

poettering · 2016-10-06T08:45:58Z

src/basic/audit-util.c

+                if (fd < 0) {
+                        cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM;
+                        if (errno == EPERM)
+                                log_debug_errno(errno, "Audit disabled");


The debug message is misleading here. EAFNOSUPPORT/EPROTONOSUPPORT are the errors suggesting audit was actually disabled. EPERM otoh is a permission error, hence this should really say "Audit access prohibited, won't talk to audit" or something like that.
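
Putting the two audit comments together, the check could look roughly like the sketch below. It is written in plain C with a switch statement instead of systemd's IN_SET() macro so that it compiles on its own, it omits the caching from the diff, and the function names are hypothetical. It assumes a Linux system with the `<linux/netlink.h>` header available.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* The three errno values that mean "do not try to talk to audit". */
bool audit_errno_means_unusable(int err) {
        switch (err) {
        case EAFNOSUPPORT:      /* no netlink support */
        case EPROTONOSUPPORT:   /* no audit support in the kernel */
        case EPERM:             /* audit exists, but we may not use it */
                return true;
        default:
                return false;
        }
}

/* Hypothetical, uncached variant of the check in the diff. */
bool use_audit_sketch(void) {
        int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC | SOCK_NONBLOCK, NETLINK_AUDIT);

        if (fd < 0) {
                int err = errno;        /* save before any call can clobber it */

                if (err == EPERM)
                        fprintf(stderr, "Audit access prohibited, won't talk to audit\n");

                return !audit_errno_means_unusable(err);
        }

        close(fd);
        return true;
}
```

Keeping the EPERM message separate from the two "not supported" cases matches the distinction drawn in the review: only EPERM indicates that audit is present but inaccessible.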

giuseppe · 2016-10-06T10:13:53Z

@poettering I've addressed all your comments except the one about caching (/proc/PID/setgroups cannot be modified once the gid_map file is written), and pushed a new version here ⬆️

evverx · 2016-10-06T11:32:38Z

@poettering , @giuseppe , why not add some test cases to src/test/test-capability.c?

poettering · 2016-10-06T13:43:58Z

@evverx more docs and testing would be very welcome of course. But I'll leave that to @giuseppe, as my own interest in systemd-in-unpriv-userns is limited.

evverx · 2016-10-06T15:06:47Z

more docs and testing would be very welcome of course.

👍

Anyway, I'm not sure about caching: #4280 (comment)

poettering · 2016-10-06T15:40:08Z

I still don't like the caching either, I must say...

poettering · 2016-10-06T16:36:26Z

I added a patch to #4299 that drops the caching, please have a look!

poettering approved these changes Oct 4, 2016

poettering reviewed Oct 4, 2016

giuseppe force-pushed the unprivileged-user branch from b068b32 to 91dbdac Compare October 4, 2016 12:55

poettering reviewed Oct 4, 2016

poettering added the pid1 label Oct 4, 2016

poettering reviewed Oct 4, 2016

poettering requested changes Oct 4, 2016

poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 4, 2016

giuseppe force-pushed the unprivileged-user branch 4 times, most recently from 74607e8 to 540f296 Compare October 4, 2016 15:31

giuseppe force-pushed the unprivileged-user branch from 540f296 to 704f2d9 Compare October 4, 2016 15:51

giuseppe force-pushed the unprivileged-user branch from 704f2d9 to 67278f3 Compare October 4, 2016 19:39

Fix typo

7753186

giuseppe force-pushed the unprivileged-user branch from 67278f3 to 944a921 Compare October 5, 2016 16:50

evverx reviewed Oct 5, 2016

giuseppe force-pushed the unprivileged-user branch from 944a921 to 5999375 Compare October 6, 2016 07:40

poettering requested changes Oct 6, 2016

giuseppe added 2 commits October 6, 2016 11:49

audit: disable if cannot create NETLINK_AUDIT socket

f006b30

core: do not fail in a container if we can't use setgroups

36d8547

It might be blocked through /proc/PID/setgroups

giuseppe force-pushed the unprivileged-user branch from 5999375 to 36d8547 Compare October 6, 2016 10:12

poettering approved these changes Oct 6, 2016

poettering added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Oct 6, 2016

poettering merged commit e057995 into systemd:master Oct 6, 2016

evverx mentioned this pull request Oct 9, 2016

nspawn: change the synced cgroup hierarchy permissions too #4223

Merged

keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 9, 2016

poettering mentioned this pull request Dec 18, 2018

Restore call to pam_setcred #11199

Merged
