Skip to content

Conversation

@poettering
Copy link
Member

I started working on this PR a while back, but never finished it off, but @yuwata's #6564 made me look into this again.

I think with these two new ExecStart= prefixes we can neatly address two different issues:

  1. First of all, DynamicUser= and RuntimeDirectory= now nicely combine with daemons where sandboxing should be applied but the actual priv dropping needs to be done in the daemons themselves. ExecStart=!/usr/bin/mydaemon means that DynamicUser= and RuntimeDirectory= will be applied as usual (and the latter with a dir owned by the use selected by User=), but the daemon is still invoked as root. Thus RuntimeDirectoryUser=/RuntimeDirectoryGroup= and suchlike are not necessary.

  2. Secondly, ExecStart=!!/usr/bin/mydaemon is a way to have services that can make use of systemd's priv dropping and ambient caps where possible but can fallback to daemon-intrenal priv-dropping in cases where ambient caps are not availeble. This PR then makes use of that to run resolved like this, as an example, and the other relevant daemons should really do the same eventually...

This PR also carries a large number of other, smaller fixes in preparation for the main work.

@poettering poettering added the pid1 label Aug 9, 2017
@poettering
Copy link
Member Author

And before anyone comments on this: yes "!!" is a bit cryptic, but it's supposed to be, because it's ultimately just a stopgap until everybody runs kernels that support ambient caps, and at that time the cruft we should have in our configuration language should remain minimal... That's why I much prefer this new prefix over a whole new setting. Also, "!" extends the concept of the pre-existing "+" a bit, and "!!" then extends on that again. I hope that makes sense?

/* Don't change the owner of the configuration directory, as in the common case it is not written to by a
* service, and shall not be writable. */
if (type == EXEC_DIRECTORY_CONFIGURATION)
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block should be moved after mkdir.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed

bool needs_sandboxing, needs_mount_namespace;
#ifdef HAVE_SELINUX
bool needs_selinux = false;
bool use_selinux = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit title, has -> use

if (needs_ambient_hack)
bset |= (UINT64_C(1) << CAP_SETPCAP) |
(UINT64_C(1) << CAP_SETUID) |
(UINT64_C(1) << CAP_SETGID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this block should be moved before cap_test_all() check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed!

@yuwata
Copy link
Member

yuwata commented Aug 9, 2017

How about networkd and timesyncd?

* available. */

cache = prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, CAP_KILL, 0, 0) >= 0 ||
!IN_SET(errno, EINVAL, EOPNOTSUPP, ENOSYS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the second condition is necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because system calls can fail for many different reasons (for example, some seccomp or selinux thing), and I wanted to hence make sure we only consider ambient caps to be unavailable if the error code returned is actually one of the "not supported" ones, which the three above are. (prctl usually reports unavailable options with EINVAL, and EOPNOTSUPP and ENOSYS are usually how this is reported).

This way, if ambient caps fail with EPERM we will still consider them to be available, which then means we'll try to use them later on, where we then will see EPERM again and report it as usual...

Or to say this differently: this function is not supposed to process errors or anything, it shall simple report back if the ambient caps API is implemented or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess so. But tje mam page does not explain well the situation such that it returns e.g. EOPNOTSUPP. The reason I made the question is on the sysytem whose kernel does not support kernel prctl() may returns EINVAL. I so not know it ia true or not. But at least I rwad the man page, I underatand that it returns EINVAL in such a situation. If it ia true, then the second condition causes missdetection of if the kernel support ambient capabilities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Gah... Sorry for many typos... I wrote them on my smartphone...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the kernel doesn't do ambient caps, prctl() will return -1, and errno is set to EINVAL, and hence the expression should resolve to false, and the case is properly detected. Or did I misunderstand what you are saying?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I've misunderstood, sorry. The code is now lgtm.

goto finish;
/* Drop privileges, but only if we have been started as root. If we are not running as root we assume all
* privileges are already dropped. */
if (getuid() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getuid() or geteuid()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, and i don't have a clear answer for this. Ultimately we know exactly that we will be called with uid==euid, as it is systemd that spawns resolved, hence the distinction doesn't matter too much. I decided to use getuid() here simply because it is the more obvious one, and if euid/uid really don't match, then maybe it could even be assumed that something is set up in some way and we shouldn't alter it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I do not have a clear opinion. I just concerned this since my past commit #5964.

This way, we can extend it later on in an easier way, and can pass it
along nicely.
Let's try to decouple the execution engine a bit from the Unit/Manager
concept, and hence pass one more flag as part of the ExecParameters flags
field.
…irectory chowning

Let's decouple the Manager object from the execution logic a bit more
here too, and simply pass along the fact whether we should
unconditionally chown the runtime/... directories via the ExecFlags
field too.
if (c->syscall_whitelist) {
default_action = negative_action;
action = SCMP_ACT_ALLOW;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yupp

@poettering
Copy link
Member Author

How about networkd and timesyncd?

I left this for a later PR. I just ported resolved over as an excercise and test to make sure this all works... Most likely we can just take your commits for that and rebase them on top of this to make this work.

@yuwata
Copy link
Member

yuwata commented Aug 10, 2017

OK, I've prepared the commits on the private repository. I will post them after this PR is merged.

The configuration directory is commonly not owned by a service, but
remains root-owned, hence don't change the owner automatically for it.
Not that it really matters, but it matches how we set the flags in
manager_set_exec_params() too.
… field

Also, correct the logic while we are at it: the variable is only
required for system services, not user services.
The new unit_set_exec_params() call is to units what
manager_set_exec_params() is to the manager object: it initializes the
various fields from the relevant generic properties set.
"Permissions" was a bit of a misnomer, as it suggests that UNIX file
permission bits are adjusted, which aren't really changed here. Instead,
this is about UNIX credentials such as users or groups, as well as
namespacing, hence let's use a more generic term here, without any
misleading reference to UNIX file permissions: "sandboxing", which shall
refer to all kinds of sandboxing technologies, including UID/GID
dropping, selinux relabelling, namespacing, seccomp, and so on.
Let's merge three if blocks that shall only run when sandboxing is applied
into one.

Note that this changes behaviour in one corner case: PrivateUsers=1 is
now honours both PermissionsStartOnly= and the "+" modifier in
ExecStart=, and not just the former, as before. This was an oversight,
so let's fix this now, at a point in time the option isn't used much
yet.
These booleans simply store whether selinux/apparmor/smack are supposed
ot be used, and chache the various mac_xyz_use() calls before we
transition into the namespace, hence let's use the same verb for the
variables and the functions: "use"
This new group lists all UID/GID credential changing syscalls (which are
quite a number these days). This will become particularly useful in a
later commit, which uses this group to optionally permit user credential
changing to daemons in case ambient capabilities are not available.
…privileges when executing a NOP

This way daemons which already dropped all caps may use the call to
drop priviliges again, which becomes a non-failing NOP.
This new function reports whether ambient caps are available, and should
be quick because the result is cached.
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":

"!"  → much like "+", except with a much reduced effect as it only
       disables the actual setresuid()/setresgid()/setgroups() calls, but
       leaves all other security features on, including namespace
       options. This is very useful in combination with
       RuntimeDirectory= or DynamicUser= and similar option, as a user
       is still allocated and used for the runtime directory, but the
       actual UID/GID dropping is left to the daemon process itself.
       This should make RuntimeDirectory= a lot more useful for daemons
       which insist on doing their own privilege dropping.

"!!" → Similar to "!", but on systems supporting ambient caps this
       becomes a NOP. This makes it relatively straightforward to write
       unit files that make use of ambient capabilities to let systemd
       drop all privs while retaining compatibility with systems that
       lack ambient caps, where priv dropping is the left to the daemon
       codes themselves.

This is an alternative approach to systemd#6564 and related PRs.
…ervice

Let's make use of !! to run resolved with ambient capabilities on
systems supporting them.
@poettering
Copy link
Member Author

I force pushed a new version now, addressing @yuwata's and @Siosm's points. Please have a look!

(@yuwata, btw I just invited you to the github commiters group. Hope that interests you? Thank you very much for all your contributions so far! Please use your new powers wisely! Whenever you see a PR, and like it and are confident enough it's the right thing, please feel free to go ahead and merge it. If in doubt, ask, but usually we prefer if commiters just go ahead with what they think is right. All we require for commits is a review from one person who didn't write the PR. Feel free to commit PRs on every area you like, we try to avoid specific kingdoms in the source tree! And thanks again for all your contributions)

@yuwata
Copy link
Member

yuwata commented Aug 10, 2017

( @poettering I've accepted your invitation. Thank you so much. That is my honor. I am so glad that my recent activities are appreciated. )

@Siosm
Copy link

Siosm commented Aug 11, 2017

I like this PR and it LGTM.

@yuwata
Copy link
Member

yuwata commented Aug 12, 2017

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

3 participants