[RFC] run systemd in an unprivileged container#4280
[RFC] run systemd in an unprivileged container#4280poettering merged 3 commits intosystemd:masterfrom
Conversation
poettering
left a comment
There was a problem hiding this comment.
We don't use "Signed-off-by" in systemd, that's a kernel thing.
(This means: we prefer patches without this, but it is not a blocker to have them)
b068b32 to
91dbdac
Compare
|
thanks for the quick review. I have pushed a new version without the "Signed-off-by" part ⬆️ |
|
sorry, this wasn't supposed to be considered "reviewed", I am not used to the new github facilities for this. Sorry for the confusion. I am still reviewing... |
|
What is an "unprivileged" container here? One with userns but 1:1 uid mapping? Or what precisely do you mean here? |
src/basic/capability-util.c
Outdated
|
|
||
| finish: | ||
| if (r == -EPERM && (detect_container() > 0)) | ||
| return 0; |
There was a problem hiding this comment.
I really don't like this bit. The thing is that this code is used in various places in systemd, not just in PID 1 (and it should be used everywhere, I think). The logic you added only really makes sense in PID 1 however, afaics...
Moreover I don't really grok what precisely actually fails here... I mean, dropping caps should always be safe: if we have caps we should be able to safely drop them, and if we don't have them then we shouldn't need to drop them and hence not need to generate an error. I don't really grok here what syscall precisely fails in your case. Can you elaborate?
I'd much prefer if we'd do as little container specific checks as necessary here. And in this case it appears that dropping the caps should be fully safe if we have no caps in the first place... hence, please elaborate.
(also, having such "blanket tape over errors a-posteriori" concepts makes all my alarms sound loudly... I would really prefer that if we ignore an error in some condition, it should be as precise as possible, and hence be right to the one syscall where it matters, but not in a global cleanup path for a function, if you follow what I mean)
| if (r < 0 && errno == EPERM && (detect_container() > 0)) | ||
| return 0; | ||
| return r; | ||
| } |
There was a problem hiding this comment.
I think it would be better to guard this one function on the caller side, and bind it to the availability of the right kind of cap?
src/basic/capability-util.c
Outdated
| return log_error_errno(errno, "Failed to change group ID: %m"); | ||
|
|
||
| if (setgroups(0, NULL) < 0) | ||
| if (setgroups(0, NULL) < 0 && (errno != EPERM || (detect_container() <= 0))) |
There was a problem hiding this comment.
what is this about? why wouldn't we be able to drop groups ever?
src/basic/capability-util.c
Outdated
| return log_error_errno(errno, "Failed to enable capabilities bits: %m"); | ||
|
|
||
| if (cap_set_proc(d) < 0) | ||
| if (cap_set_proc(d) < 0 && (errno != EPERM || (detect_container() <= 0))) |
There was a problem hiding this comment.
what is this about? why wouldn't we able to drop caps we have? (this is the same case as the same as the bounding set thing further up)
There was a problem hiding this comment.
AFAICS, we don't really check that we have the capabilities listed in keep_capabilities, so that cap_set_proc adds more capabilties than we originally had. Anyway, I am going to drop this change and require the container to have CAP_NEW_RAW and CAP_NET_BROADCAST, as required by systemd-networkd.
There was a problem hiding this comment.
hmm I'd argue that we probably should fail if we try to reduce the set of caps to caps we don't have...
src/basic/capability-util.c
Outdated
| (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0)) | ||
| (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0)) { | ||
| if (errno == EPERM && (detect_container() > 0)) | ||
| return 0; |
There was a problem hiding this comment.
hmm, cap_set_flag() only alters the internal state of the "tmp_cap" userspace object, no? This should never fail with EPERM, should it?
src/basic/capability-util.c
Outdated
|
|
||
| if (cap_set_proc(tmp_cap) < 0) | ||
| if (cap_set_proc(tmp_cap) < 0) { | ||
| if (errno == EPERM && (detect_container() > 0)) |
|
|
||
| if (setresuid(0, 0, 0) < 0) | ||
| if (setresuid(0, 0, 0) < 0 && (errno != EPERM || (detect_container() <= 0))) | ||
| return -errno; |
There was a problem hiding this comment.
not sure i grok this, but what precisely is failing here in your case? why shouldn't we be able to reset our creds to root?
src/core/mount.c
Outdated
| m->control_pid = 0; | ||
|
|
||
| if (is_clean_exit(code, status, NULL)) | ||
| if (is_clean_exit(code, status, NULL) || is_container) |
There was a problem hiding this comment.
umpf, this one I really don't like. If umounting fails in your container, we should fail to the unit. To make the error go away we should rather not start or stop the unit, rather than just ignoring the result if we do...
poettering
left a comment
There was a problem hiding this comment.
Not sure I am using the new github review tools properly, but I left some comments now, this needs mor discussion I figure.
|
thanks for your comments. I set the RFC here to start a discussion and see what changes make sense in systemd and what definitely should not be included. The goal is to be able to run systemd from bubblewrap, and requiring as less caps as possible in the container. I've probably guarded with |
|
This rewrite work could cause significant regression with downstream so I have to ask why would systemd make drastic rewrite on it self with potential bug and breakage which can affect several distributions for bubblewrap? What's the usecase here outside bubblewrap that everyone will benefit from this rewrite work from embedded/server/desktop ? |
74607e8 to
540f296
Compare
|
I looked at that in https://launchpad.net/bugs/1576341 a while ago (comment 5 ff.), as it's similar with LXD (which runs containers as unprivileged user by default). Things like |
540f296 to
704f2d9
Compare
|
I have polished the patches and handle only these two cases:
|
704f2d9 to
67278f3
Compare
|
bubblewrap appears to be a sandbox tool for desktop apps, no? why would you want to make systemd run in that? systemd really only makes sense in environments that actually try to emulate more complete systems that actually need a PID1 (think: nspawn). |
|
can you elaborate on the setgroups() thing, why is that necessary? |
|
we are trying to run containers as non root users through bubblewrap. In order to do that, some changes are required in bubblewrap as well. The setgroups thing is needed when the usage of setgroups is blocked through /proc/$PID/setgroups. Blocking setgroups in the new user namespace is done to prevent that a container gains privileges by dropping its supplementary groups (CVE-2014-8989); in other words, to prevent that a process circumvents restrictions on a group by dropping its additional groups. |
Yes, that looks like a good place for it. |
Yeah, I don't really grok the usecase either. But well, I think it's fine to support unprivileged userns this way... |
@giuseppe , why do people need this? |
|
@evverx our goal is to leverage user namespaces to run containers as a non privileged user. The main reason is to increase security, as root won't run these containers (partially true as bubblewrap is a setuid program and we will still leave some capabilities in the container). |
67278f3 to
944a921
Compare
|
@giuseppe , thanks!
Why not to implement something like
?
How does this commit affect |
| return log_error_errno(errno, "Failed to change group ID: %m"); | ||
|
|
||
| if (setgroups(0, NULL) < 0) | ||
| if (maybe_setgroups(0, NULL) < 0) |
There was a problem hiding this comment.
$ git grep drop_priv src/ | grep -v test
src/basic/capability-util.c:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
src/basic/capability-util.h:int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);
src/coredump/coredump.c: return drop_privileges(uid, gid, 0);
src/network/networkd.c: r = drop_privileges(uid, gid,
src/resolve/resolved.c: r = drop_privileges(uid, gid,
src/timesync/timesyncd.c: r = drop_privileges(uid, gid, (1ULL << CAP_SYS_TIME));Seems like you can't run all these binaries inside the unprivileged user namespace: setresuid(uid, uid, uid) fails. Why do we need this?
|
thanks for the additional inputs. I am currently mapping the uids/guids 1-999 to some high range in the host uids/guids similarly to what |
| } | ||
|
|
||
| if (setgroups(k, gids) < 0) { | ||
| if (maybe_setgroups(k, gids) < 0) { |
There was a problem hiding this comment.
SupplementaryGroups= is no-op. Does it make sense? I guess we should fail if we can't set supplementary groups.
There was a problem hiding this comment.
I have pushed another version that reverts this change, so that maybe_setgroups is used only for dropping auxiliary groups
|
@giuseppe , sorry, I missed your previous comment
I see, containers/bubblewrap@9182998
,
|
944a921 to
5999375
Compare
poettering
left a comment
There was a problem hiding this comment.
We are getting there! But a few more notes.
| } | ||
|
|
||
| int maybe_setgroups(size_t size, const gid_t *list) { | ||
| static int cached_can_setgroups = -1; |
There was a problem hiding this comment.
hmm, i wonder if it's a good idea to cache this, after all it can change during runtime? Also, afaics we are unlikely to call this frequently anyway, hence there's no point in caching this, is there?
There was a problem hiding this comment.
I think it should be fine to cache it, /proc/PID/setgroups can be set only before the gid_map file is written.
There was a problem hiding this comment.
But what about:
maybe_setgroups();
p = clone(NEWUSER);
...child...
maybe_setgroups();
...child...?
There was a problem hiding this comment.
that won't change the /proc/ID/setgroups value for the new process. It could be a problem if we were caching "allow" then cloning a new user namespace and change for the new userspace setgroups to "deny" before set its gid_map file.
src/basic/user-util.c
Outdated
| /* old kernels don't have /proc/self/setgroups, so assume we can use it */ | ||
| cached_can_setgroups = true; | ||
| } else { | ||
| cached_can_setgroups = !!strcmp(setgroups_content, "deny"); |
There was a problem hiding this comment.
we try to avoid calling strcmp() if we are not interested in ordering between the strings. If you just want to compare two strings, please use streq(), which is a macro wrapper around it and helps readability.
| return 0; | ||
|
|
||
| return setgroups(size, list); | ||
| } |
There was a problem hiding this comment.
I think one further refinement for this call would make sense: only skip the setgroups() invocation if the groups list is reset (i.e. size == 0), and continue generate an error if the list is anything but empty. This way if people actually set SupplementaryGroups= to something non-empty, then we'll continue to fail (and I think we should), but for the typical case where the list is just supposed to be reset we eat up the error. Does that make sense?
There was a problem hiding this comment.
this should be documented. No?
There was a problem hiding this comment.
oh, this is not a user-visible change. Right?
There was a problem hiding this comment.
right, it is not an user visible change, when don't attempt to drop all the auxiliary groups when setprocs is blocked
src/basic/user-util.c
Outdated
| #include "strv.h" | ||
| #include "user-util.h" | ||
| #include "utf8.h" | ||
| #include "virt.h" |
There was a problem hiding this comment.
we don't need this #include anymore, do we?
src/basic/audit-util.c
Outdated
| if (fd < 0) | ||
| cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT; | ||
| if (fd < 0) { | ||
| cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM; |
There was a problem hiding this comment.
not that it matters, but we have this pretty macro IN_SET() for cases like this:
cached_use = !IN_SET(errno, EAFNOSUPPORT, EAPROTONOSUPPORT, EPERM)
src/basic/audit-util.c
Outdated
| if (fd < 0) { | ||
| cached_use = errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT && errno != EPERM; | ||
| if (errno == EPERM) | ||
| log_debug_errno(errno, "Audit disabled"); |
There was a problem hiding this comment.
The debug message is misleading here. EAFNOSUPPORT/EAPROTONOSUPPORT are the errors suggesting audit was actually disabled. EPERM otoh is a permission error, hence this should really say "Audit access prohibited, won't talk to audit" or something like that.
It might be blocked through /proc/PID/setgroups
5999375 to
36d8547
Compare
|
@poettering I've addressed all your comments, except caching /proc/PID/setgroups that cannot be modified once the gid_map file is written and pushed a new version here ⬆️ |
|
@poettering , @giuseppe , why not to add some test cases to |
👍 Anyway, I'm not sure about caching: #4280 (comment) |
|
I don't like the caching still either I must say... |
|
I added patch to #4299 that drops the caching, please have a look! |
a container might not have enough capabilities to run systemd. This series fixes these cases:
The goal is to be able to run systemd from a non privileged user via bubblewrap, that provides only a safe set of caps to be used in a container.
bubblewrap PR here: containers/bubblewrap#101