Skip to content

[RFC] run systemd in an unprivileged container#4280

Merged
poettering merged 3 commits intosystemd:masterfrom
giuseppe:unprivileged-user
Oct 6, 2016
Merged

[RFC] run systemd in an unprivileged container#4280
poettering merged 3 commits intosystemd:masterfrom
giuseppe:unprivileged-user

Conversation

@giuseppe
Copy link
Contributor

@giuseppe giuseppe commented Oct 4, 2016

a container might not have enough capabilities to run systemd. This series fixes these cases:

  1. missing CAP_AUDIT_[READ|WRITE].
  2. setgroups blocked via /proc/[pid]/setgroups
  3. Fail to set capabilities from the container.
  4. Fail to unmount.

The goal is to be able to run systemd from a non privileged user via bubblewrap, that provides only a safe set of caps to be used in a container.

bubblewrap PR here: containers/bubblewrap#101

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use "Signed-off-by" in systemd, that's a kernel thing.

(This means: we prefer patches without this, but it is not a blocker to have them)

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 4, 2016

thanks for the quick review. I have pushed a new version without the "Signed-off-by" part ⬆️

@poettering
Copy link
Member

sorry, this wasn't supposed to be considered "reviewed", I am not used to the new github facilities for this. Sorry for the confusion. I am still reviewing...

@poettering
Copy link
Member

What is an "unprivileged" container here? One with userns but 1:1 uid mapping? Or what precisely do you mean here?


finish:
if (r == -EPERM && (detect_container() > 0))
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this bit. The thing is that this code is used in various places in systemd, not just in PID 1 (and it should be used everywhere, I think). The logic you added only really makes sense in PID 1 however, afaics...

Moreover I don't really grok what precisely actually fails here... I mean, dropping caps should always be safe: if we have caps we should be able to safely drop them, and if we don't have them then we shouldn't need to drop them and hence not need to generate an error. I don't really grok here what syscall precisely fails in your case. Can you elaborate?

I'd much prefer if we'd do as little container specific checks as necessary here. And in this case it appears that dropping the caps should be fully safe if we have no caps in the first place... hence, please elaborate.

(also, having such "blanket tape over errors a-posteriori" concepts makes all my alarms sound loudly... I would really prefer that if we ignore an error in some condition, it should be as precise as possible, and hence be right to the one syscall where it matters, but not in a global cleanup path for a function, if you follow what I mean)

if (r < 0 && errno == EPERM && (detect_container() > 0))
return 0;
return r;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to guard this one function on the caller side, and bind it to the availability of the right kind of cap?

return log_error_errno(errno, "Failed to change group ID: %m");

if (setgroups(0, NULL) < 0)
if (setgroups(0, NULL) < 0 && (errno != EPERM || (detect_container() <= 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this about? why wouldn't we be able to drop groups ever?

return log_error_errno(errno, "Failed to enable capabilities bits: %m");

if (cap_set_proc(d) < 0)
if (cap_set_proc(d) < 0 && (errno != EPERM || (detect_container() <= 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this about? why wouldn't we able to drop caps we have? (this is the same case as the same as the bounding set thing further up)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, we don't really check that we have the capabilities listed in keep_capabilities, so that cap_set_proc adds more capabilties than we originally had. Anyway, I am going to drop this change and require the container to have CAP_NEW_RAW and CAP_NET_BROADCAST, as required by systemd-networkd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I'd argue that we probably should fail if we try to reduce the set of caps to caps we don't have...

(cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
(cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0)) {
if (errno == EPERM && (detect_container() > 0))
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, cap_set_flag() only alters the internal state of the "tmp_cap" userspace object, no? This should never fail with EPERM, should it?


if (cap_set_proc(tmp_cap) < 0)
if (cap_set_proc(tmp_cap) < 0) {
if (errno == EPERM && (detect_container() > 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same case as above...

@poettering poettering added the pid1 label Oct 4, 2016

if (setresuid(0, 0, 0) < 0)
if (setresuid(0, 0, 0) < 0 && (errno != EPERM || (detect_container() <= 0)))
return -errno;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i grok this, but what precisely is failing here in your case? why shouldn't we be able to reset our creds to root?

src/core/mount.c Outdated
m->control_pid = 0;

if (is_clean_exit(code, status, NULL))
if (is_clean_exit(code, status, NULL) || is_container)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umpf, this one I really don't like. If umounting fails in your container, we should fail to the unit. To make the error go away we should rather not start or stop the unit, rather than just ignoring the result if we do...

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I am using the new github review tools properly, but I left some comments now, this needs mor discussion I figure.

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 4, 2016
@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 4, 2016

thanks for your comments. I set the RFC here to start a discussion and see what changes make sense in systemd and what definitely should not be included. The goal is to be able to run systemd from bubblewrap, and requiring as less caps as possible in the container. I've probably guarded with detect_container more places than needed just to be safe, I am going to drop these and send a stripped down version.

@johannbg
Copy link
Contributor

johannbg commented Oct 4, 2016

This rewrite work could cause significant regression with downstream so I have to ask why would systemd make drastic rewrite on it self with potential bug and breakage which can affect several distributions for bubblewrap?

What's the usecase here outside bubblewrap that everyone will benefit from this rewrite work from embedded/server/desktop ?

@giuseppe giuseppe force-pushed the unprivileged-user branch 4 times, most recently from 74607e8 to 540f296 Compare October 4, 2016 15:31
@martinpitt
Copy link
Contributor

I looked at that in https://launchpad.net/bugs/1576341 a while ago (comment 5 ff.), as it's similar with LXD (which runs containers as unprivileged user by default). Things like dev-hugepages.mount and systemd-journald-audit.socket fail there as the simple capability checks are insufficient there as capabilities are not sufficiently namespace aware. This is one of these ugly "emergent" bugs where lxd, systemd, and the units are all doing the right thing from their perspective, but in combination it doesn't work.

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 4, 2016

I have polished the patches and handle only these two cases:

  1. disable audit if we have not enough permissions to create the NETLINK_AUDIT socket
  2. do not fail if it is not possible to use setgroups

@poettering
Copy link
Member

bubblewrap appears to be a sandbox tool for desktop apps, no? why would you want to make systemd run in that? systemd really only makes sense in environments that actually try to emulate more complete systems that actually need a PID1 (think: nspawn).

@poettering
Copy link
Member

can you elaborate on the setgroups() thing, why is that necessary?

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 5, 2016

we are trying to run containers as non root users through bubblewrap. In order to do that, some changes are required in bubblewrap as well.

The setgroups thing is needed when the usage of setgroups is blocked through /proc/$PID/setgroups. Blocking setgroups in the new user namespace is done to prevent that a container gains privileges by dropping its supplementary groups (CVE-2014-8989); in other words, to prevent that a process circumvents restrictions on a group by dropping its additional groups.
Once setgroups is blocked in an user namespace, it is not possible to enable it again.

@poettering
Copy link
Member

@poettering yes, that is exactly the case. I will work on the maybe_setgroups wrapper function. Is user-util.c the right place for it?

Yes, that looks like a good place for it.

@poettering
Copy link
Member

@poettering , @giuseppe , thanks for the explanation! I'm reading https://github.com/projectatomic/bubblewrap. I wonder why do people want to run "system"-containers inside the bubblewrap. But I guess I'm missing something

Yeah, I don't really grok the usecase either. But well, I think it's fine to support unprivileged userns this way...

@evverx
Copy link
Contributor

evverx commented Oct 5, 2016

I wonder why do people want to run "system"-containers inside the bubblewrap.

Yeah, I don't really grok the usecase either.

@giuseppe , why do people need this?

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 5, 2016

@evverx our goal is to leverage user namespaces to run containers as a non privileged user. The main reason is to increase security, as root won't run these containers (partially true as bubblewrap is a setuid program and we will still leave some capabilities in the container).

@evverx
Copy link
Contributor

evverx commented Oct 5, 2016

@giuseppe , thanks!

partially true as bubblewrap is a setuid program and we will still leave some capabilities in the container

Why not to implement something like --private-users=pick

The value "pick" turns on user namespacing. In this case the UID/GID range is automatically chosen.

?

core: do not fail in a container if we can't use setgroups …
It might be blocked through /proc/PID/setgroups

How does this commit affect Group=, SupplementaryGroups= inside the container?

@evverx
Copy link
Contributor

evverx commented Oct 5, 2016

@giuseppe , another question: do /proc/self/uid_map, /proc/self/gid_map look like

0 some-id 1

?
So, how does the User-setting work? You should get setresuid failed: Invalid argument

return log_error_errno(errno, "Failed to change group ID: %m");

if (setgroups(0, NULL) < 0)
if (maybe_setgroups(0, NULL) < 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ git grep drop_priv src/ | grep -v test
src/basic/capability-util.c:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
src/basic/capability-util.h:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);
src/coredump/coredump.c:        return drop_privileges(uid, gid, 0);
src/network/networkd.c:        r = drop_privileges(uid, gid,
src/resolve/resolved.c:        r = drop_privileges(uid, gid,
src/timesync/timesyncd.c:        r = drop_privileges(uid, gid, (1ULL << CAP_SYS_TIME));

Seems like you can't run all these binaries inside the unprivileged user namespace: setresuid(uid, uid, uid) fails. Why do we need this?

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 5, 2016

thanks for the additional inputs. I am currently mapping the uids/guids 1-999 to some high range in the host uids/guids similarly to what --private-users=pick does, so that setting uid/gid works. Differently, the SupplementaryGroups= directive won't work as setgroups is blocked in the container. Even using a range of users, won't solve the security problem of having the possibility of dropping groups.

}

if (setgroups(k, gids) < 0) {
if (maybe_setgroups(k, gids) < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SupplementaryGroups= is no-op. Does it make sense? I guess we should fail if we can't set supplementary groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed another version that reverts this change, so that maybe_setgroups is used only for dropping auxiliary groups

@evverx
Copy link
Contributor

evverx commented Oct 5, 2016

@giuseppe , sorry, I missed your previous comment

I am currently mapping the uids/guids 1-999

I see, containers/bubblewrap@9182998
But containers/bubblewrap@9182998#r81550526

This breaks any use of non-privileged user namespaces use, because that only allows a single mapping.

,

Differently, the SupplementaryGroups= directive won't work as setgroups is blocked in the container.

#4280 (comment)

I guess we should fail if we can't set supplementary groups.

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting there! But a few more notes.

}

int maybe_setgroups(size_t size, const gid_t *list) {
static int cached_can_setgroups = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i wonder if it's a good idea to cache this, after all it can change during runtime? Also, afaics we are unlikely to call this frequently anyway, hence there's no point in caching this, is there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine to cache it, /proc/PID/setgroups can be set only before the gid_map file is written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what about:

maybe_setgroups();
p = clone(NEWUSER);
...child...
maybe_setgroups();
...child...

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that won't change the /proc/ID/setgroups value for the new process. It could be a problem if we were caching "allow" then cloning a new user namespace and change for the new userspace setgroups to "deny" before set its gid_map file.

/* old kernels don't have /proc/self/setgroups, so assume we can use it */
cached_can_setgroups = true;
} else {
cached_can_setgroups = !!strcmp(setgroups_content, "deny");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to avoid calling strcmp() if we are not interested in ordering between the strings. If you just want to compare two strings, please use streq(), which is a macro wrapper around it and helps readability.

return 0;

return setgroups(size, list);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one further refinement for this call would make sense: only skip the setgroups() invocation if the groups list is reset (i.e. size == 0), and continue generate an error if the list is anything but empty. This way if people actually set SupplementaryGroups= to something non-empty, then we'll continue to fail (and I think we should), but for the typical case where the list is just supposed to be reset we eat up the error. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be documented. No?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, this is not a user-visible change. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, it is not an user visible change, when don't attempt to drop all the auxiliary groups when setprocs is blocked

#include "strv.h"
#include "user-util.h"
#include "utf8.h"
#include "virt.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this #include anymore, do we?

if (fd < 0)
cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT;
if (fd < 0) {
cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that it matters, but we have this pretty macro IN_SET() for cases like this:

cached_use = !IN_SET(errno, EAFNOSUPPORT, EAPROTONOSUPPORT, EPERM)

if (fd < 0) {
cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM;
if (errno == EPERM)
log_debug_errno(errno, "Audit disabled");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debug message is misleading here. EAFNOSUPPORT/EAPROTONOSUPPORT are the errors suggesting audit was actually disabled. EPERM otoh is a permission error, hence this should really say "Audit access prohibited, won't talk to audit" or something like that.

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 6, 2016

@poettering I've addressed all your comments, except caching /proc/PID/setgroups that cannot be modified once the gid_map file is written and pushed a new version here ⬆️

@poettering poettering added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Oct 6, 2016
@evverx
Copy link
Contributor

evverx commented Oct 6, 2016

@poettering , @giuseppe , why not to add some test cases to src/test/test-capability.c?

@poettering
Copy link
Member

@evverx more docs and testing would be very welcome of course. But I'll leave that to @giuseppe, as my own interest in systemd-in-unpriv-userns is limited.

@poettering poettering merged commit e057995 into systemd:master Oct 6, 2016
@evverx
Copy link
Contributor

evverx commented Oct 6, 2016

more docs and testing would be very welcome of course.

👍

Anyway, I'm not sure about caching: #4280 (comment)

@poettering
Copy link
Member

I don't like the caching still either I must say...

@poettering
Copy link
Member

I added patch to #4299 that drops the caching, please have a look!

@keszybz keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

6 participants