core: coldplug possible nop_job #13124

Merged (1 commit) on Sep 17, 2019


@ypf791 (Contributor) commented Jul 21, 2019

Recently, I ran into a problem similar to issue #2419: invoking try-restart on an inactive service may hang if a daemon-reload is issued before the try-restart returns.

To reproduce the problem, run the following in one terminal:

while true; do date +%s.%N; systemctl try-restart foo.service; done

and manually invoke daemon-reload in another terminal. On freshly installed Ubuntu 16.04 (v229) and Ubuntu 18.04 (v237), the first terminal stops printing new timestamps (meaning the process is stuck in try-restart) within fewer than 10 daemon-reload attempts.

When the problem occurs, a job with TYPE == nop and STATE == waiting remains stuck in the job list and is never resolved.

The problem likely stems from the fact that a nop job is not coldplugged after daemon-reload, so it never enters the run_queue afterwards unless another nop job is installed into it.

This PR coldplugs u->nop_job for any unit u with u->job == NULL but u->nop_job != NULL.
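To make the failure mode concrete, here is a minimal C sketch that models only the fields named above (u->job, u->nop_job, and membership in the run queue). The types and the helpers job_coldplug_sketch(), unit_coldplug_before() and unit_coldplug_after() are hypothetical simplifications, not systemd's real definitions; the sketch assumes, as the PR describes, that coldplugging a waiting job is what puts it back on the run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model. These types do NOT match
 * systemd's real definitions; they only mirror the fields this PR names. */
typedef enum { JOB_WAITING, JOB_RUNNING } JobState;

typedef struct Job {
        JobState state;
        bool in_run_queue;   /* stands in for membership in m->run_queue */
} Job;

typedef struct Unit {
        Job *job;            /* regular job slot */
        Job *nop_job;        /* separate slot used for JOB_NOP jobs */
} Unit;

/* Sketch of coldplugging a job after deserialization: a waiting job is
 * put back on the run queue, otherwise nothing ever dispatches it. */
static int job_coldplug_sketch(Job *j) {
        assert(j);
        if (j->state == JOB_WAITING)
                j->in_run_queue = true;
        return 0;
}

/* Behaviour before the fix: only u->job is considered. */
static int unit_coldplug_before(Unit *u) {
        assert(u);
        if (u->job)
                return job_coldplug_sketch(u->job);
        return 0;
}

/* Behaviour after the fix: fall back to u->nop_job when u->job is NULL. */
static int unit_coldplug_after(Unit *u) {
        Job *uj;

        assert(u);
        uj = u->job ? u->job : u->nop_job;
        if (uj)
                return job_coldplug_sketch(uj);
        return 0;
}
```

With a unit whose u->job is NULL and whose u->nop_job is waiting, unit_coldplug_before() leaves the job off the queue (the hang), while unit_coldplug_after() queues it.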

@keszybz added the pid1 label Sep 17, 2019

@keszybz (Member) left a comment


I can reproduce the issue:

$ systemctl list-jobs
 JOB UNIT        TYPE STATE  
1810 foo.service nop  waiting

1 jobs listed.

src/core/unit.c (outdated)
@@ -3857,8 +3858,9 @@ int unit_coldplug(Unit *u) {
                         r = q;
         }
 
-        if (u->job) {
-                q = job_coldplug(u->job);
+        uj = (u->job) ? u->job : u->nop_job;

uj = u->job ?: u->nop_job;
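The suggested `?:` form is the GNU C "conditional with omitted operand" extension: `a ?: b` behaves like `a ? a : b` but evaluates `a` only once. GCC and Clang accept it; a strictly ISO C compiler may reject it. The comparison below uses hypothetical, stripped-down Job and Unit types (not systemd's) to show that both spellings pick the same job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal types, for illustration only. */
typedef struct Job { int id; } Job;
typedef struct Unit { Job *job; Job *nop_job; } Unit;

/* Portable ISO C spelling, as in the original patch. */
static Job *pick_job_ternary(const Unit *u) {
        assert(u);
        return u->job ? u->job : u->nop_job;
}

/* GNU C extension: `a ?: b` means `a ? a : b`, evaluating `a` once. */
static Job *pick_job_elvis(const Unit *u) {
        assert(u);
        return u->job ?: u->nop_job;
}
```

Both functions return u->job when it is set, otherwise u->nop_job (which may itself be NULL); the GNU form is simply shorter and is the style the reviewer asked for.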

@keszybz (Member) commented Sep 17, 2019

Yep, this commit seems to fix the issue. I'll force-push with the trivial style fix and merge.

@keszybz added the good-to-merge/waiting-for-ci 👍 label Sep 17, 2019

@mrc0mmand (Member) commented:

Almost there:

Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]: ../src/network/networkd-link.c:2398:28: runtime error: member access within null pointer of type 'Network' (aka 'struct Network')
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #0 0x55bdeddaa0a1 in link_drop_foreign_config /build/build/../src/network/networkd-link.c:2398:28
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #1 0x55bdedd8f63b in link_carrier_lost /build/build/../src/network/networkd-link.c:3234:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #2 0x55bdedd9182b in link_update /build/build/../src/network/networkd-link.c:3422:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #3 0x55bdedc98717 in manager_rtnl_process_link /build/build/../src/network/networkd-manager.c:894:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #4 0x7f34927011c6 in process_match /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:377:29
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #5 0x7f34926fc4a6 in process_running /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:410:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #6 0x7f34926fc174 in sd_netlink_process /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:443:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #7 0x7f34926ff022 in io_callback /build/build/../src/libsystemd/sd-netlink/sd-netlink.c:712:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #8 0x7f3492743659 in source_dispatch /build/build/../src/libsystemd/sd-event/sd-event.c:2828:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #9 0x7f3492742580 in sd_event_dispatch /build/build/../src/libsystemd/sd-event/sd-event.c:3241:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #10 0x7f3492744e5c in sd_event_run /build/build/../src/libsystemd/sd-event/sd-event.c:3299:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #11 0x7f349274575d in sd_event_loop /build/build/../src/libsystemd/sd-event/sd-event.c:3321:21
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #12 0x55bdedc855e0 in run /build/build/../src/network/networkd.c:118:13
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #13 0x55bdedc84f13 in main /build/build/../src/network/networkd.c:125:1
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #14 0x7f3491d86ee2 in __libc_start_main (/lib64/libc.so.6+0x26ee2)
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]:     #15 0x55bdedc84e2d in _start (/build/build/systemd-networkd+0x1a5e2d)
Sep 17 10:00:03 arch.localdomain systemd-networkd[98934]: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/network/networkd-link.c:2398:28 in


 ____________________________________________
/ Found   1 sanitizer errors (  0 ASan,   1  \
| UBSan,   0 MSan). Looks like you need to   |
\ look at the log                            /
 --------------------------------------------
 \
  \
     __
    /  \
    |  |
    @  @
    |  |
    || |/
    || ||
    |\_/|
    \___/

@mrc0mmand removed the good-to-merge/waiting-for-ci 👍 label Sep 17, 2019

@mrc0mmand (Member) commented:

Uhh, okay, the issue above is not related to this PR; cc'ing @yuwata and re-setting the green CI label.

@mrc0mmand added the good-to-merge/waiting-for-ci 👍 label Sep 17, 2019

@mrc0mmand (Member) commented:

As mentioned above, the UBSan issue is not relevant to this PR and is being fixed in #13577.

@mrc0mmand merged commit b49e14d into systemd:master Sep 17, 2019
chenglin130 added a commit to chenglin130/rhel-7 that referenced this pull request Nov 9, 2019
When a unit is in state INACTIVE or DEACTIVATING, a job of type JOB_TRY_RESTART or
JOB_TRY_RELOAD is collapsed to JOB_NOP, which is stored in u->nop_job instead
of u->job.

If a JOB_NOP job is in the waiting state, a parallel daemon-reload
only reinstalls it during deserialization. Without a coldplug, the job is
never added to m->run_queue, which results in a hung try-restart or
try-reload process.

Reproduce:
1. Run systemctl try-restart test.service (inactive) repeatedly in one terminal.
2. Run systemctl daemon-reload repeatedly in other terminals.

After a successful reproduction, systemctl list-jobs lists the hung job.

Upstream:
systemd/systemd#13124
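The collapse step described in this commit message can be sketched as follows. The enums and the function collapse_sketch() are hypothetical reductions; systemd's real JobType and UnitActiveState enums have more members. As an assumption, the sketch treats any state other than active or reloading as "not running", which covers the INACTIVE and DEACTIVATING cases named above (and, in this sketch, a few others).

```c
#include <assert.h>

/* Hypothetical, reduced enums; the real systemd enums are larger. */
typedef enum {
        UNIT_ACTIVE, UNIT_RELOADING, UNIT_INACTIVE,
        UNIT_FAILED, UNIT_ACTIVATING, UNIT_DEACTIVATING,
} UnitActiveState;

typedef enum {
        JOB_RESTART, JOB_RELOAD, JOB_TRY_RESTART, JOB_TRY_RELOAD, JOB_NOP,
} JobType;

/* A "try" job only acts on a unit that is already running (assumed here
 * to mean active or reloading); otherwise it collapses to JOB_NOP. */
static JobType collapse_sketch(JobType t, UnitActiveState s) {
        int running = (s == UNIT_ACTIVE || s == UNIT_RELOADING);

        assert(t <= JOB_NOP); /* reject values outside the reduced enum */

        switch (t) {
        case JOB_TRY_RESTART:
                return running ? JOB_RESTART : JOB_NOP;
        case JOB_TRY_RELOAD:
                return running ? JOB_RELOAD : JOB_NOP;
        default:
                return t; /* other job types are left unchanged here */
        }
}
```

Because a collapsed JOB_NOP job lives in u->nop_job rather than u->job, a coldplug pass that only inspects u->job skips it, which is the gap this PR closes.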
chenglin130 added a commit to chenglin130/rhel-8 that referenced this pull request Nov 9, 2019
systemd-rhel-bot pushed a commit to redhat-plumbers/systemd-rhel7 that referenced this pull request Apr 30, 2020
jsynacek pushed a commit to redhat-plumbers/systemd-rhel7 that referenced this pull request Apr 30, 2020
Resolves: #1829754
jsynacek pushed a commit to redhat-plumbers/systemd-rhel8 that referenced this pull request Jun 8, 2020
(cherry picked from commit b49e14d)

Resolves: #1829798
dtardon pushed a commit to dtardon/systemd-rhel7 that referenced this pull request Jun 16, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847336
dtardon pushed a commit to dtardon/systemd-rhel7 that referenced this pull request Jun 16, 2020
Resolves: #1847335
(cherry picked from commit 8ee6529)
dtardon pushed a commit to dtardon/systemd-rhel7 that referenced this pull request Jun 16, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847334
dtardon pushed a commit to dtardon/systemd-rhel7 that referenced this pull request Jun 16, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847335
systemd-rhel-bot pushed a commit to redhat-plumbers/systemd-rhel7 that referenced this pull request Jun 17, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847334
systemd-rhel-bot pushed a commit to redhat-plumbers/systemd-rhel7 that referenced this pull request Jun 17, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847335
systemd-rhel-bot pushed a commit to redhat-plumbers/systemd-rhel7 that referenced this pull request Jun 17, 2020
(cherry picked from commit 8ee6529)

Resolves: #1847336
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 8, 2021
Fix #7180 

Update systemd to v247 in order to pick the fix for "core: coldplug possible nop_job" systemd/systemd#13124

Install systemd and systemd-sysv from buster-backports. Pass "systemd.unified_cgroup_hierarchy=0" as a kernel argument to force systemd not to use the unified cgroup hierarchy; otherwise dockerd won't start (moby/moby#16238).
Also, chown $FILSYSTEM_ROOT to root; otherwise the apt installation of systemd complains (see a similar case: https://unix.stackexchange.com/questions/593529/can-not-configure-systemd-inside-a-chrooted-environment).

Signed-off-by: Stepan Blyschak
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 8, 2021
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-buildimage that referenced this pull request May 23, 2021
…#7228)

carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
…#7228)

zlind0 pushed a commit to zlind0/systemd-239 that referenced this pull request Sep 14, 2024

(cherry picked from commit b49e14d5f3081dfcd363d8199a14c0924ae9152f)

Resolves: #1829798
Labels: ci-failure-appears-unrelated, good-to-merge/waiting-for-ci 👍, pid1

3 participants