-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
add ! and !! ExecStart= flags to make ambient caps useful #6577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
And before anyone comments on this: yes "!!" is a bit cryptic, but it's supposed to be, because it's ultimately just a stopgap until everybody runs kernels that support ambient caps, and at that time the cruft we should have in our configuration language should remain minimal... That's why I much prefer this new prefix over a whole new setting. Also, "!" extends the concept of the pre-existing "+" a bit, and "!!" then extends on that again. I hope that makes sense? |
6d0059e to
cf670b7
Compare
src/core/execute.c
Outdated
| /* Don't change the owner of the configuration directory, as in the common case it is not written to by a | ||
| * service, and shall not be writable. */ | ||
| if (type == EXEC_DIRECTORY_CONFIGURATION) | ||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block should be moved after mkdir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed
| bool needs_sandboxing, needs_mount_namespace; | ||
| #ifdef HAVE_SELINUX | ||
| bool needs_selinux = false; | ||
| bool use_selinux = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the commit title, has -> use
src/core/execute.c
Outdated
| if (needs_ambient_hack) | ||
| bset |= (UINT64_C(1) << CAP_SETPCAP) | | ||
| (UINT64_C(1) << CAP_SETUID) | | ||
| (UINT64_C(1) << CAP_SETGID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this block should be moved before cap_test_all() check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed!
|
How about networkd and timesyncd? |
| * available. */ | ||
|
|
||
| cache = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, CAP_KILL, 0, 0) >= 0 || | ||
| !IN_SET(errno, EINVAL, EOPNOTSUPP, ENOSYS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the second condition is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because system calls can fail for many different reasons (for example, some seccomp or selinux thing), and I wanted to hence make sure we only consider ambient caps to be unavailable if the error code returned is actually one of the "not supported" ones, which the three above are. (prctl usually reports unavailable options with EINVAL, and EOPNOTSUPP and ENOSYS are usually how this is reported).
This way, if ambient caps fail with EPERM we will still consider them to be available, which then means we'll try to use them later on, where we then will see EPERM again and report it as usual...
Or to say this differently: this function is not supposed to process errors or anything, it shall simple report back if the ambient caps API is implemented or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess so. But tje mam page does not explain well the situation such that it returns e.g. EOPNOTSUPP. The reason I made the question is on the sysytem whose kernel does not support kernel prctl() may returns EINVAL. I so not know it ia true or not. But at least I rwad the man page, I underatand that it returns EINVAL in such a situation. If it ia true, then the second condition causes missdetection of if the kernel support ambient capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Gah... Sorry for many typos... I wrote them on my smartphone...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the kernel doesn't do ambient caps, prctl() will return -1, and errno is set to EINVAL, and hence the expression should resolve to false, and the case is properly detected. Or did I misunderstand what you are saying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I've misunderstood, sorry. The code is now lgtm.
| goto finish; | ||
| /* Drop privileges, but only if we have been started as root. If we are not running as root we assume all | ||
| * privileges are already dropped. */ | ||
| if (getuid() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getuid() or geteuid()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, and i don't have a clear answer for this. Ultimately we know exactly that we will be called with uid==euid, as it is systemd that spawns resolved, hence the distinction doesn't matter too much. I decided to use getuid() here simply because it is the more obvious one, and if euid/uid really don't match, then maybe it could even be assumed that something is set up in some way and we shouldn't alter it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I do not have a clear opinion. I just concerned this since my past commit #5964.
This way, we can extend it later on in an easier way, and can pass it along nicely.
Let's try to decouple the execution engine a bit from the Unit/Manager concept, and hence pass one more flag as part of the ExecParameters flags field.
…irectory chowning Let's decouple the Manager object from the execution logic a bit more here too, and simply pass along the fact whether we should unconditionally chown the runtime/... directories via the ExecFlags field too.
src/core/execute.c
Outdated
| if (c->syscall_whitelist) { | ||
| default_action = negative_action; | ||
| action = SCMP_ACT_ALLOW; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yupp
I left this for a later PR. I just ported resolved over as an excercise and test to make sure this all works... Most likely we can just take your commits for that and rebase them on top of this to make this work. |
|
OK, I've prepared the commits on the private repository. I will post them after this PR is merged. |
The configuration directory is commonly not owned by a service, but remains root-owned, hence don't change the owner automatically for it.
Not that it really matters, but it matches how we set the flags in manager_set_exec_params() too.
… field Also, correct the logic while we are at it: the variable is only required for system services, not user services.
Let's simplify things a bit.
The new unit_set_exec_params() call is to units what manager_set_exec_params() is to the manager object: it initializes the various fields from the relevant generic properties set.
"Permissions" was a bit of a misnomer, as it suggests that UNIX file permission bits are adjusted, which aren't really changed here. Instead, this is about UNIX credentials such as users or groups, as well as namespacing, hence let's use a more generic term here, without any misleading reference to UNIX file permissions: "sandboxing", which shall refer to all kinds of sandboxing technologies, including UID/GID dropping, selinux relabelling, namespacing, seccomp, and so on.
Let's merge three if blocks that shall only run when sandboxing is applied into one. Note that this changes behaviour in one corner case: PrivateUsers=1 is now honours both PermissionsStartOnly= and the "+" modifier in ExecStart=, and not just the former, as before. This was an oversight, so let's fix this now, at a point in time the option isn't used much yet.
These booleans simply store whether selinux/apparmor/smack are supposed ot be used, and chache the various mac_xyz_use() calls before we transition into the namespace, hence let's use the same verb for the variables and the functions: "use"
This new group lists all UID/GID credential changing syscalls (which are quite a number these days). This will become particularly useful in a later commit, which uses this group to optionally permit user credential changing to daemons in case ambient capabilities are not available.
…privileges when executing a NOP This way daemons which already dropped all caps may use the call to drop priviliges again, which becomes a non-failing NOP.
This new function reports whether ambient caps are available, and should be quick because the result is cached.
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":
"!" → much like "+", except with a much reduced effect as it only
disables the actual setresuid()/setresgid()/setgroups() calls, but
leaves all other security features on, including namespace
options. This is very useful in combination with
RuntimeDirectory= or DynamicUser= and similar option, as a user
is still allocated and used for the runtime directory, but the
actual UID/GID dropping is left to the daemon process itself.
This should make RuntimeDirectory= a lot more useful for daemons
which insist on doing their own privilege dropping.
"!!" → Similar to "!", but on systems supporting ambient caps this
becomes a NOP. This makes it relatively straightforward to write
unit files that make use of ambient capabilities to let systemd
drop all privs while retaining compatibility with systems that
lack ambient caps, where priv dropping is the left to the daemon
codes themselves.
This is an alternative approach to systemd#6564 and related PRs.
…ervice Let's make use of !! to run resolved with ambient capabilities on systems supporting them.
cf670b7 to
635f3df
Compare
|
I force pushed a new version now, addressing @yuwata's and @Siosm's points. Please have a look! (@yuwata, btw I just invited you to the github commiters group. Hope that interests you? Thank you very much for all your contributions so far! Please use your new powers wisely! Whenever you see a PR, and like it and are confident enough it's the right thing, please feel free to go ahead and merge it. If in doubt, ask, but usually we prefer if commiters just go ahead with what they think is right. All we require for commits is a review from one person who didn't write the PR. Feel free to commit PRs on every area you like, we try to avoid specific kingdoms in the source tree! And thanks again for all your contributions) |
|
( @poettering I've accepted your invitation. Thank you so much. That is my honor. I am so glad that my recent activities are appreciated. ) |
|
I like this PR and it LGTM. |
|
LGTM. |
I started working on this PR a while back, but never finished it off, but @yuwata's #6564 made me look into this again.
I think with these two new ExecStart= prefixes we can neatly address two different issues:
First of all, DynamicUser= and RuntimeDirectory= now nicely combine with daemons where sandboxing should be applied but the actual priv dropping needs to be done in the daemons themselves.
ExecStart=!/usr/bin/mydaemonmeans that DynamicUser= and RuntimeDirectory= will be applied as usual (and the latter with a dir owned by the use selected by User=), but the daemon is still invoked as root. Thus RuntimeDirectoryUser=/RuntimeDirectoryGroup= and suchlike are not necessary.Secondly,
ExecStart=!!/usr/bin/mydaemonis a way to have services that can make use of systemd's priv dropping and ambient caps where possible but can fallback to daemon-intrenal priv-dropping in cases where ambient caps are not availeble. This PR then makes use of that to run resolved like this, as an example, and the other relevant daemons should really do the same eventually...This PR also carries a large number of other, smaller fixes in preparation for the main work.