core: coldplug possible nop_job #13124

ypf791 · 2019-07-21T03:57:59Z

Recently, I ran into a problem similar to issue #2419. I noticed that running try-restart on an inactive service may hang if a daemon-reload is invoked before the try-restart returns.

To reproduce the problem, run the following loop in one terminal:

while true; do date +%s.%N; systemctl try-restart foo.service; done

and manually invoke daemon-reload in another terminal. On freshly installed Ubuntu 16.04 (v229) and Ubuntu 18.04 (v237), the first terminal stops printing new timestamps (which implies that the process hangs in try-restart) within fewer than 10 daemon-reload attempts.

When the problem occurs, a job with TYPE == nop and STATE == waiting is stuck in the job list and is never resolved.

The cause appears to be that a nop job is not coldplugged after daemon-reload, and thus never enters the run_queue afterwards, unless another nop job is installed into this job.

This patch coldplugs nop_job for any unit u with u->job == NULL but u->nop_job != NULL.
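The difference between the old and the new behaviour can be illustrated with a minimal, self-contained sketch. It is not the systemd code: the `Toy*` types and `toy_*` functions are hypothetical stand-ins, the run queue is reduced to a boolean flag, and coldplugging is assumed to do nothing more than re-queue a waiting job.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real types; only the fields needed here. */
typedef enum { TOY_JOB_WAITING, TOY_JOB_RUNNING } ToyJobState;

typedef struct ToyJob {
        ToyJobState state;
        bool in_run_queue;      /* models membership in m->run_queue */
} ToyJob;

typedef struct ToyUnit {
        ToyJob *job;            /* regular job slot */
        ToyJob *nop_job;        /* separate slot used by nop jobs */
} ToyUnit;

/* Simplified coldplug (assumption): a waiting job is put back on the run queue. */
int toy_job_coldplug(ToyJob *j) {
        assert(j);
        if (j->state == TOY_JOB_WAITING)
                j->in_run_queue = true;
        return 0;
}

/* Before the patch: only u->job is considered, a lone nop job is skipped. */
int toy_unit_coldplug_before(ToyUnit *u) {
        assert(u);
        if (u->job)
                return toy_job_coldplug(u->job);
        return 0;
}

/* After the patch: fall back to u->nop_job when u->job is NULL. */
int toy_unit_coldplug_after(ToyUnit *u) {
        ToyJob *uj;

        assert(u);
        uj = u->job ? u->job : u->nop_job;
        if (uj)
                return toy_job_coldplug(uj);
        return 0;
}
```

With a unit that carries only a waiting nop job, the "before" variant leaves the job off the queue, while the "after" variant re-queues it, which is the case this pull request targets.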

keszybz

I can reproduce the issue:

$ systemctl list-jobs
 JOB UNIT        TYPE STATE  
1810 foo.service nop  waiting

1 jobs listed.

keszybz · 2019-09-17T09:00:11Z

src/core/unit.c

@@ -3857,8 +3858,9 @@ int unit_coldplug(Unit *u) {
                        r = q;
        }

-        if (u->job) {
-                q = job_coldplug(u->job);
+        uj = (u->job) ? u->job : u->nop_job;


uj = u->job ?: u->nop_job;

keszybz · 2019-09-17T09:32:14Z

Yep, this commit seems to fix the issue. I'll force-push with the trivial style fix and merge.

mrc0mmand · 2019-09-17T10:43:32Z

Almost there:

Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]: ../src/network/networkd-link.c:2398:28: runtime error: member access within null pointer of type 'Network' (aka 'struct Network')
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #0 0x55bdeddaa0a1 in link_drop_foreign_config /build/build/../src/network/networkd-link.c:2398:28
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #1 0x55bdedd8f63b in link_carrier_lost /build/build/../src/network/networkd-link.c:3234:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #2 0x55bdedd9182b in link_update /build/build/../src/network/networkd-link.c:3422:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #3 0x55bdedc98717 in manager_rtnl_process_link /build/build/../src/network/networkd-manager.c:894:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #4 0x7f34927011c6 in process_match /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:377:29
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #5 0x7f34926fc4a6 in process_running /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:410:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #6 0x7f34926fc174 in sd_netlink_process /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:443:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #7 0x7f34926ff022 in io_callback /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:712:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #8 0x7f3492743659 in source_dispatch /build/build/../src/libsystemd/sd-event/sd-event.c:2828:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #9 0x7f3492742580 in sd_event_dispatch /build/build/../src/libsystemd/sd-event/sd-event.c:3241:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #10 0x7f3492744e5c in sd_event_run /build/build/../src/libsystemd/sd-event/sd-event.c:3299:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #11 0x7f349274575d in sd_event_loop /build/build/../src/libsystemd/sd-event/sd-event.c:3321:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #12 0x55bdedc855e0 in run /build/build/../src/network/networkd.c:118:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #13 0x55bdedc84f13 in main /build/build/../src/network/networkd.c:125:1
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #14 0x7f3491d86ee2 in __libc_start_main (/lib64/libc.so.6+0x26ee2)
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #15 0x55bdedc84e2d in _start (/build/build/systemd-networkd+0x1a5e2d)
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/network/networkd-link.c:2398:28 in


 ____________________________________________
/ Found   1 sanitizer errors (  0 ASan,   1  \
| UBSan,   0 MSan). Looks like you need to   |
\ look at the log                            /
 --------------------------------------------
 \
  \
     __
    /  \
    |  |
    @  @
    |  |
    || |/
    || ||
    |\_/|
    \___/

mrc0mmand · 2019-09-17T10:44:25Z

Uhh, okay, the issue above is not related to this PR, cc'ing @yuwata and re-setting the green CI label.

mrc0mmand · 2019-09-17T13:46:02Z

As mentioned above, the UBSan issue is not relevant to this PR and is being fixed in #13577.

When a unit is in the INACTIVE or DEACTIVATING state, a job of type JOB_TRY_RESTART or JOB_TRY_RELOAD is collapsed to JOB_NOP and uses u->nop_job instead of u->job. If a JOB_NOP job is in the waiting state, a parallel daemon-reload merely installs it during deserialization. Without a coldplug, the job will not be in m->run_queue, which results in a hung try-restart or try-reload process.

Reproduce:
1. Run systemctl try-restart test.service (inactive) repeatedly in a terminal.
2. Run systemctl daemon-reload repeatedly in other terminals.

After a successful reproduction, systemctl list-jobs will list the hung job.

Upstream: systemd/systemd#13124
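The collapse step described in this commit message can be sketched as follows. This is only an illustration under stated assumptions: the `Toy*` enums and `toy_job_type_collapse()` are hypothetical names, the unit states are reduced to three values, and the job types returned for an active unit (plain restart and reload) are an assumption rather than something the message states.

```c
#include <assert.h>

/* Hypothetical, reduced set of unit states. */
typedef enum {
        TOY_UNIT_ACTIVE,
        TOY_UNIT_INACTIVE,
        TOY_UNIT_DEACTIVATING
} ToyUnitState;

/* Hypothetical, reduced set of job types. */
typedef enum {
        TOY_JOB_RESTART,
        TOY_JOB_RELOAD,
        TOY_JOB_TRY_RESTART,
        TOY_JOB_TRY_RELOAD,
        TOY_JOB_NOP
} ToyJobType;

/* Collapse the "try" variants: they become a nop when the unit is
 * inactive or deactivating, and a real job otherwise (assumption). */
ToyJobType toy_job_type_collapse(ToyJobType t, ToyUnitState s) {
        int idle;

        assert(t >= TOY_JOB_RESTART && t <= TOY_JOB_NOP);
        idle = (s == TOY_UNIT_INACTIVE || s == TOY_UNIT_DEACTIVATING);

        switch (t) {
        case TOY_JOB_TRY_RESTART:
                return idle ? TOY_JOB_NOP : TOY_JOB_RESTART;
        case TOY_JOB_TRY_RELOAD:
                return idle ? TOY_JOB_NOP : TOY_JOB_RELOAD;
        default:
                return t;       /* other job types pass through unchanged */
        }
}
```

A nop job produced this way is the one stored in the unit's nop_job slot, which is why it is missed when only u->job is coldplugged.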

The downstream backports carry the same commit message and differ only in their trailers:

Resolves: #1829754
(cherry picked from commit b49e14d5f3081dfcd363d8199a14c0924ae9152f) Resolves: #1829798
(cherry picked from commit 8ee6529) Resolves: #1847336
(cherry picked from commit 8ee6529) Resolves: #1847335
(cherry picked from commit 8ee6529) Resolves: #1847334

Fix #7180

Update systemd to v247 in order to pick up the fix for "core: coldplug possible nop_job" (systemd/systemd#13124). Install systemd and systemd-sysv from buster-backports. Pass "systemd.unified_cgroup_hierarchy=0" as a kernel argument to force systemd not to use the unified cgroup hierarchy; otherwise dockerd won't start (moby/moby#16238). Also, chown $FILSYSTEM_ROOT to root; otherwise the apt systemd installation complains. See the similar report at https://unix.stackexchange.com/questions/593529/can-not-configure-systemd-inside-a-chrooted-environment

Signed-off-by: Stepan Blyschak <[email protected]>

keszybz added the pid1 label Sep 17, 2019

keszybz requested changes Sep 17, 2019

core: coldplug possible nop_job

321255e

keszybz force-pushed the try-restart-hang branch from ceaca6b to 321255e Compare September 17, 2019 09:35

keszybz added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Sep 17, 2019

mrc0mmand removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Sep 17, 2019

mrc0mmand added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Sep 17, 2019

mrc0mmand added the ci-failure-appears-unrelated label Sep 17, 2019

mrc0mmand merged commit b49e14d into systemd:master Sep 17, 2019

chenglin130 mentioned this pull request Nov 9, 2019

(#1829754) core: coldplug possible nop_job redhat-plumbers/systemd-rhel7#57

Merged

chenglin130 mentioned this pull request Nov 9, 2019

(#1829798) core: coldplug possible nop_job redhat-plumbers/systemd-rhel8#39

Closed

dtardon mentioned this pull request Jun 16, 2020

(#1847336) core: coldplug possible nop_job redhat-plumbers/systemd-rhel7#114

Merged

dtardon mentioned this pull request Jun 16, 2020

(#1847335) core: coldplug possible nop_job redhat-plumbers/systemd-rhel7#115

Merged

dtardon mentioned this pull request Jun 16, 2020

(#1847334) core: coldplug possible nop_job redhat-plumbers/systemd-rhel7#116

Merged

alexrallen mentioned this pull request Mar 29, 2021

SONiC docker containers show as "Exited" after config-reload sonic-net/sonic-buildimage#7180

Closed

This was referenced Apr 2, 2021

[debian] install systemd version 247 from buster-backports stepanblyschak/sonic-buildimage#30

Closed

[debian] install systemd version 247 from buster-backports sonic-net/sonic-buildimage#7228

Merged

lclgo mentioned this pull request Apr 7, 2021

unit: coldplug both job and nop_job if possible #19180

Closed
