let's make "systemd-run --scope --user" work again on cgroupsv2 by poettering · Pull Request #8125 · systemd/systemd

poettering · 2018-02-07T22:13:37Z

lots of fixes in order to finally close #3388.

evverx · 2018-02-08T00:01:19Z

systemd seems to be crashing on xenial, which could be because cgroup2 is not supported there. Here are the logs that I was able to get from a xenial host:

Feb 07 23:13:46 systemd[1]: Found cgroup on /sys/fs/cgroup/systemd, legacy hierarchy
...
Feb 07 23:14:05 systemd[1]: session-1.scope: Failed to attach PID 224 to realized cgroup /user.slice/user-0.slice/session-1.scope in controller cpu, ignoring: No such file or directory
Feb 07 23:14:05 systemd[1]: Assertion 'path' failed at ../src/basic/cgroup-util.c:823, function cg_attach(). Aborting.
Feb 07 23:14:05 systemd-coredump[235]: Due to PID 1 having crashed coredump collection will now be turned off.
Feb 07 23:14:05 systemd[1]: Caught <ABRT>, dumped core as pid 234.
Feb 07 23:14:05 systemd[1]: Freezing execution.
Feb 07 23:14:11 systemd-coredump[235]: Failed to generate stack trace: Unwinding not supported for this architecture
Feb 07 23:14:11 systemd-coredump[235]: Process 234 (systemd) of user 0 dumped core.
Feb 07 23:14:30 login[224]: pam_systemd(login:session): Failed to create session: Connection timed out

and the backtrace:

#0  0x00007f15cf2fb767     kill - libc.so.6
#1  0x00005606c3a0a05c - 1 crash - /lib/systemd/systemd
    ../src/core/main.c:196
#2  0x00007f15cf6a1390     __restore_rt - libpthread.so.0
#3  0x00007f15cf2fb428     raise - libc.so.6
#4  0x00007f15cf2fd02a - 1 abort - libc.so.6
#5  0x00007f15d1122e48 - 1 log_assert_failed_realm - libsystemd-shared-237.so
    ../src/basic/log.c:820
#6  0x00007f15d10fa789 - 1 cg_attach - libsystemd-shared-237.so
    ../src/basic/cgroup-util.c:823
#7  0x00005606c3aca9d5 - 1 unit_attach_pids_to_cgroup - /lib/systemd/systemd
    ../src/core/cgroup.c:1636
#8  0x00005606c3af647b - 1 scope_start - /lib/systemd/systemd
    ../src/core/scope.c:346
#9  0x00005606c3aab540 - 1 unit_start - /lib/systemd/systemd
    ../src/core/unit.c:1861
#10 0x00005606c3a429c4 - 1 job_perform_on_unit - /lib/systemd/systemd
    ../src/core/job.c:529
#11 0x00005606c3a42d39 - 1 job_run_and_invalidate - /lib/systemd/systemd
    ../src/core/job.c:594
#12 0x00005606c3a655d5 - 1 manager_dispatch_run_queue - /lib/systemd/systemd
    ../src/core/manager.c:1861
#13 0x00007f15d11bab86 - 1 source_dispatch - libsystemd-shared-237.so
    ../src/libsystemd/sd-event/sd-event.c:2333
#14 0x00007f15d11bbeb1 - 1 sd_event_dispatch - libsystemd-shared-237.so
    ../src/libsystemd/sd-event/sd-event.c:2663
#15 0x00007f15d11bc349 - 1 sd_event_run - libsystemd-shared-237.so
    ../src/libsystemd/sd-event/sd-event.c:2723
#16 0x00005606c3a685cb - 1 manager_loop - /lib/systemd/systemd
    ../src/core/manager.c:2573
#17 0x00005606c3a10b7b - 1 invoke_main_loop - /lib/systemd/systemd
    ../src/core/main.c:1778
#18 0x00005606c3a138e6 - 1 main - /lib/systemd/systemd
    ../src/core/main.c:2563
#19 0x00007f15cf2e6830 - 1 __libc_start_main - libc.so.6
#20 0x00005606c3a093a9 - 1 _start - /lib/systemd/systemd

evverx · 2018-02-08T12:36:44Z

I seem to have been wrong to say that the crash has anything to do with whether cgroup2 is supported or not. I can reproduce it on a machine where the hybrid hierarchy is being used. Passing systemd.unified_cgroup_hierarchy=yes appears to be enough to make everything work.

poettering · 2018-02-09T11:53:49Z

@evverx thanks a lot for finding this! just force pushed a new version that should fix the issue!

keszybz

Looks nice. I see you pushed an updated version while I was reading this, so I'll push out the comments I have. I'll test this later.

keszybz · 2018-02-09T11:16:16Z

src/core/manager.c

-                u = manager_get_unit(m, SPECIAL_DBUS_SERVICE);
-
-                if (!u || UNIT_IS_ACTIVE_OR_RELOADING(unit_active_state(u))) {
+                b = manager_dbus_is_running(m, false);


Would be nicer to get rid of b and put the condition directly in the conditional.

keszybz · 2018-02-09T11:17:45Z

src/core/manager.c

+         * connection of the API bus). That's because the system bus after all runs as service of the system instance,
+         * while in the user instance we can assume it's already there. */
+
+        b = manager_dbus_is_running(m, false);


The same here.

keszybz · 2018-02-09T11:23:01Z

src/core/manager.c

        /* Sync current state of bus names with our set of listening units */
-        if (m->api_bus)
-                manager_sync_bus_names(m, m->api_bus);
+        manager_enqueue_sync_bus_names(m);


q = ... if (q < 0 && r >= 0) r = q;

?

keszybz · 2018-02-09T11:29:43Z

src/core/cgroup.c

+        if (!isempty(suffix_path))
+                pp = strjoina("/", pp, suffix_path);
+        else
+                pp = strjoina("/", pp);


That if is not necessary, the "true" branch works for both cases.

keszybz · 2018-02-09T11:34:13Z

src/core/cgroup.c

+
+                        q = unit_attach_pid_to_cgroup_via_bus(u, pid, suffix_path);
+                        if (q < 0)
+                                log_unit_debug_errno(u, q, "Going via the bus didn't work either, ignoring: %m");


This log line will is hard to understand without context. I think it'd be nicer to exchange the order here, so that both errors (with errnos) are logged. This will make things clearer when debugging. Also, I don't like the artificial setting of the errno (EACCESS → EPERM).

q = cg_attach(SYSTEMD_CGROUP_CONTROLLER, p, pid); if (q < 0) { log_unit_debug_errno(...); if (MANAGER_IS_SYSTEM(u->manager) || !IN_SET(q, -EPERM, -EACCES)) continue; q = unit_attach_pid_to_cgroup_via_bus(...); if (q < 0) log_unit_debug_errno(u, q, "Couldn't move process " PID_FMT " to requested cgroup %s via bus: %m", pid, p);

keszybz · 2018-02-09T11:56:18Z

src/core/cgroup.c

+                /* First, attach the PID to the main cgroup hierarchy */
+                q = cg_attach(SYSTEMD_CGROUP_CONTROLLER, p, pid);
+                if (IN_SET(q, -EPERM, -EACCES) && MANAGER_IS_USER(u->manager)) {
+                        /* If we are in a user instance, and we can't moe the process ourselves, let's ask the system


keszybz · 2018-02-09T11:58:51Z

src/core/unit.c

+        if (r == -ESRCH)
+                return sd_bus_error_setf(error, SD_BUS_ERROR_UNIX_PROCESS_ID_UNKNOWN, "Process with ID " PID_FMT " does not exist.", pid);
+        if (r < 0)
+                return sd_bus_error_set_errnof(error, r, "Failed to determine whether process " PID_FMT " is kernel thread: %m", pid);


keszybz · 2018-02-09T11:59:04Z

src/core/unit.c

+        if (r < 0)
+                return sd_bus_error_set_errnof(error, r, "Failed to determine whether process " PID_FMT " is kernel thread: %m", pid);
+        if (r > 0)
+                return sd_bus_error_setf(error, SD_BUS_ERROR_INVALID_ARGS, "Process " PID_FMT " is kernel thread, refusing.", pid);


poettering · 2018-02-09T20:38:12Z

I force pushed a new version, adressing the issues @keszybz found, too! PTAL!

…for scope and service units only Currently we allowed delegation for alluntis with cgroup backing except for slices. Let's make this a bit more strict for now, and only allow this in service and scope units. Let's also add a generic accessor unit_cgroup_delegate() for checking whether a unit has delegation turned on that checks the new bool first. Also, when doing transient units, let's explcitly refuse turning on delegation for unit types that don#t support it. This is mostly cosmetical as we wouldn't act on the delegation request anyway, but certainly helpful for debugging.

Let's simplify things a bit: we so far called both functions every single time, let's just merge one into the other, so that we have fewer functions to call.

This reworks is_kernel_thread() a bit. Instead of checking whether /proc/$pid/cmdline is entirely empty we now parse the 'flags' field from /proc/$pid/stat and check the PF_KTHREAD flag, which directly encodes whether something is a kernel thread. Why all this? With current kernels userspace processes can set their command line to empty too (through PR_SET_MM_ARG_START and friends), and could potentially confuse us. Hence, let's use a more reliable way to detect kernels like this.

I figure saying "systemd" here was a typo, and it should have been "system". (Yes, it becomes very hard after a while typing "system" correctly if you type "systemd" so often.) That said, "systemd" in some ways is actually more correct, since BPF might be available for the system instance but not in the user instance. Either way, talking of "this systemd" is weird, let's reword this to be "this manager", to emphasize that it's the local instance of systemd where BPF is not available, but that it might be available otherwise.

No functional changes, but let's make this a bit more finegrained. (The individual functions are exported, which is used in a later commit)

This removes the current bus_init() call, as it had multiple problems: it munged handling of the three bus connections we care about (private, "api" and system) into one, even though the conditions when which was ready are very different. It also added redundant logging, as the individual calls it called all logged on their own anyway. The three calls bus_init_api(), bus_init_private() and bus_init_system() are now made public. A new call manager_dbus_is_running() is added that works much like manager_journal_is_running() and is a lot more careful when checking whether dbus is around. Optionally it checks the unit's deserialized_state rather than state, in order to accomodate for cases where we cant to connect to the bus before deserializing the "subscribed" list, before coldplugging the units. manager_recheck_dbus() is added, that works a lot like manager_recheck_journal() and is invoked in unit_notify(), i.e. when units change state. All in all this should make handling a bit more alike to journal handling, and it also fixes one major bug: when running in user mode we'll now connect to the system bus early on, without conditionalizing this in anyway.

In test mode, let's not consider the journal to be up ever: we want all output to go to stderr.

Let's also use the journal if it is currently reloading. In that state it should also be able to process our requests. Moreover, we might otherwise end up disconnecting/reconnecting from the journal without really any need to hence, relax the check accordingly.

No need for an if check if we just pass along a bool anyway.

This patch does four things: 1. Adds more comments that clarify the order in which things appear in the file 2. All entries are placed in the order in which their SD_BUS_METHOD() macros appear in the C vtables. 3. A couple of missing entries are added that should be open to all or do polkit 4. Corrects the interface name for the GetProcesses() calls. They belong to the per-unit interface, not to Unit

…s calls This splits out the code that translates a unit name into a Unit* object from method_get_unit(), and reuses it all other functions that operate similar to it. This effectively means all those calls now optionally take an empty unit string which now means the same as the client's unit. This useful behaviour of the GetUnit() bus call is thus extended to all other matching bus calls. Similar, the same logic from method_load_unit() is also generalized and reused wherever appropriate.

Let's make debugging easier, by synthesizing a name when we have some indication what kind of bus this is.

Let's make things easier to debug

… event loop iteration Previously, we'd synchronize bus names immediately when we succeeded connecting to the bus, potentially even before coldplugging the units. This was problematic, as synchronizing bus names meant invoking the per-unit name change handler function which might change the unit's state — which will result in consistency when done before we coldplug things. With this change we instead enqueue a job for the event loop to resync the names in a later loop iteration, i.e. at a point where we know coldplugging has finished.

…ssociated with it Let's not keep the old bus object pinned this way, let's destroy all relevant trackers for units, the way we already destroy them for jobs.

Let's better be safe than sorry, and validate that api_bus is not NULL before we send messages to it. Of course, strictly speaking this shouldn't actually be necessary, as the tracker object should not exist without the bus, but let's be extra sure.

…we inherit the API bus This corrects the control flow: when we reuse the API bus as system bus, let's definitely invoke bus_init_system() too, so that it is called regardless how we acquired the bus object. (Note that this doesn't actually change anything, as we only inherit the bus like this in system mode, and bus_init_system() doesn't do anything in system bus, besides writing a log message)

…ervice units This adds a new bus call to service and scope units called AttachProcesses() that moves arbitrary processes into the cgroup of the unit. The primary user for this new API is systemd itself: the systemd --user instance uses this call of the systemd --system instance to migrate processes if itself gets the request to migrate processes and the kernel refuses this due to access restrictions. The primary use-case of this is to make "systemd-run --scope --user …" invoked from user session scopes work correctly on pure cgroupsv2 environments. There, the kernel refuses to migrate processes between two unprivileged-owned cgroups unless the requestor as well as the ownership of the closest parent cgroup all match. This however is not the case between the session-XYZ.scope unit of a login session and the [email protected] of the systemd --user instance. The new logic always tries to move the processes on its own, but if that doesn't work when being the user manager, then the system manager is asked to do it instead. The new operation is relatively restrictive: it will only allow to move the processes like this if the caller is root, or the UID of the target unit, caller and process all match. Note that this means that unprivileged users cannot attach processes to scope units, as those do not have "owning" users (i.e. they have now User= field). Fixes: systemd#3388

evverx

systemd has stopped crashing. LGTM. Thank you.

keszybz

Looks great! Everything seems to work too.

keszybz · 2018-02-15T15:00:42Z

src/core/dbus-unit.c

+         * representation. If a process is already in the cgroup no operation is executed – in this case the specified
+         * subcgroup path has no effect! */
+
+        r = mac_selinux_unit_access_check(u, message, "start", error);


IIUC, if we wanted something different than "start", if would be necessary to amend selinux policy. Might be worth it. Maybe we should reach out to selinux maintainers for comments?

Trivial merge conflict resolved locally.

sourcejedi · 2018-02-15T15:42:58Z

src/core/unit.c

@@ -2463,12 +2461,6 @@ void unit_notify(Unit *u, UnitActiveState os, UnitActiveState ns, bool reload_su
        /* Some names are special */


This comment can be removed. The if/else block doesn't have any special names in it anymore.

systemd#8125 (review)

poettering added cgroups pid1 labels Feb 7, 2018

poettering force-pushed the cgroups-migrate branch from 19e32a9 to 764e053 Compare February 9, 2018 11:53

keszybz reviewed Feb 9, 2018

View reviewed changes

poettering force-pushed the cgroups-migrate branch from 764e053 to 166c2fc Compare February 9, 2018 20:26

poettering added 21 commits February 12, 2018 11:34

core: fold manager_set_exec_params() into unit_set_exec_params()

004c7f1

Let's simplify things a bit: we so far called both functions every single time, let's just merge one into the other, so that we have fewer functions to call.

mkosi: update to fedora 27, it's released since a while

d66a984

dbus: split up bus_done() into seperate functions

4502c40

No functional changes, but let's make this a bit more finegrained. (The individual functions are exported, which is used in a later commit)

manager: tweak manager_journal_is_running() a bit regarding test mode

7d814a1

In test mode, let's not consider the journal to be up ever: we want all output to go to stderr.

core: simplify manager_recheck_journal() a bit

cedf508

No need for an if check if we just pass along a bool anyway.

user-util: also consider /bin/false and /bin/true as non-shell

7cb6091

sd-bus: synthesize a description for user/system bus if otherwise unset

201e419

Let's make debugging easier, by synthesizing a name when we have some indication what kind of bus this is.

core: set a description on private bus connections

030fa56

Let's make things easier to debug

bus: when destroying a bus, also destroy per-unit bus track objects a…

dfeff66

…ssociated with it Let's not keep the old bus object pinned this way, let's destroy all relevant trackers for units, the way we already destroy them for jobs.

update TODO

1e78432

poettering force-pushed the cgroups-migrate branch from 166c2fc to 1e78432 Compare February 12, 2018 10:36

ghost mentioned this pull request Feb 12, 2018

systemd error messages flatpak/flatpak#1216

Closed

evverx approved these changes Feb 13, 2018

View reviewed changes

keszybz approved these changes Feb 15, 2018

View reviewed changes

keszybz merged commit 1e78432 into systemd:master Feb 15, 2018

keszybz added a commit that referenced this pull request Feb 15, 2018

Merge pull request #8125 from poettering/cgroups-migrate

648461c

Trivial merge conflict resolved locally.

sourcejedi reviewed Feb 15, 2018

View reviewed changes

keszybz added a commit to keszybz/systemd that referenced this pull request Feb 16, 2018

core: drop obsolete comment

2c2fbd9

systemd#8125 (review)

keszybz added a commit to keszybz/systemd that referenced this pull request Feb 19, 2018

core: drop obsolete comment

f4aa0bd

systemd#8125 (review)

marcelpaulo mentioned this pull request Jun 6, 2018

Whenever I run GIMP, systemd logs: Failed to add PIDs to scope's control group: Permission denied flathub/org.gimp.GIMP#23

Closed

friday mentioned this pull request Nov 9, 2021

ULauncher not working after latest upgrade Ulauncher/Ulauncher#842

Closed

		@@ -2463,12 +2461,6 @@ void unit_notify(Unit *u, UnitActiveState os, UnitActiveState ns, bool reload_su
		/* Some names are special */

Uh oh!

Conversation

poettering commented Feb 7, 2018

Uh oh!

evverx commented Feb 8, 2018

Uh oh!

evverx commented Feb 8, 2018

Uh oh!

poettering commented Feb 9, 2018

Uh oh!

keszybz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

poettering commented Feb 9, 2018

Uh oh!

evverx left a comment

Choose a reason for hiding this comment

Uh oh!

keszybz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

4 participants