Fix issue: pmon services's restart count is not cleared during config reload#4314
… reload
What I did
Currently, when "config reload" is executed, the services' restart counts are cleared to avoid reaching the restart limit. This is done by listing all services with the command systemctl list-dependencies --plain sonic.target. However, this list does not include the pmon service, or any other service that lacks WantedBy=sonic.target, which means pmon's start count is not cleared.
After the multi-ASIC PRs were merged, there is a high probability that pmon fails to restart because it reaches the start limit (3 times in 1200 seconds). During config reload, the pmon service can be started by featured or by syncd. Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, which means pmon can be restarted immediately by sonic.target, which is itself restarting again. As a result, the pmon service is more likely to reach the restart limit.
How I did it
Clear the restart count also for services that have a reverse dependency on sonic.target.
How to verify it
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)
Signed-off-by: Stephen Sun <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Cross-reference: Complementary fix in sonic-buildimage
Related PR: sonic-net/sonic-buildimage#25932, adds the missing
Root cause analysis
After deeper investigation, we found the
Why 202505 doesn't have this issue
On 202505,
How the two PRs complement each other
Both PRs are valid fixes. sonic-buildimage#25932 addresses all 3 factors, while this PR provides defense-in-depth for
Tracked by: sonic-net/sonic-buildimage#25931
Hi @qiluo-msft, could you help review and merge?
Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads, which causes pmon to hit start-limit-hit due to missing sonic.target.wants/ symlinks after the systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340). The test leaves pmon in a bad state (start-limit-hit), which can affect subsequent tests in the nightly run.
Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)
The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.
Signed-off-by: Storm Liang <[email protected]>
def _get_sonic_services(reverse=False):
    cmd = ['systemctl', 'list-dependencies', '--plain', 'sonic.target']
    if reverse:
        cmd.append('--reverse')
I would suggest also adding --type=service when adding --reverse. Otherwise, this will also include multi-user.target and graphical.target. While there probably wouldn't be any negative effects from resetting the failed state of those units, it might be better to restrict it to SONiC services.
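The suggestion above can be sketched as a small command builder. This is only an illustration under assumptions: build_list_dependencies_cmd, parse_plain_output and the only_services parameter are hypothetical names that do not come from the PR, and the sample output in the usage lines is made up to show the effect of filtering.

```python
def build_list_dependencies_cmd(target="sonic.target", reverse=False, only_services=False):
    """Return the argv list for `systemctl list-dependencies` (hypothetical helper)."""
    cmd = ["systemctl", "list-dependencies", "--plain", target]
    if reverse:
        # Ask for units that depend on the target instead of its dependencies.
        cmd.append("--reverse")
    if only_services:
        # Reviewer's suggestion: restrict the listing to service units.
        cmd.append("--type=service")
    return cmd


def parse_plain_output(output, target="sonic.target"):
    """Collect unit names from --plain output, skipping the queried target itself."""
    units = []
    for line in output.splitlines():
        name = line.strip()
        if name and name != target:
            units.append(name)
    return units


# Illustrative output only; a real system will list different units.
sample = "sonic.target\n  pmon.service\n  multi-user.target\n"
reverse_cmd = build_list_dependencies_cmd(reverse=True, only_services=True)
parsed = parse_plain_output(sample)
```

Without the type filter, the parsed list in this sample still contains multi-user.target, which is the case the reviewer wants to avoid.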
Hi @stephenxs, could you check the comments from Saikrishna, and file a separate PR to the 202511 branch? Thanks for the fix.
Fixes: sonic-net/sonic-buildimage#25931
Signed-off-by: Stephen Sun <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
As the feedback has been handled and the PR now has the additional request, I assume we can merge it.
… reload (sonic-net#4314)
- What I did
Currently, when "config reload" is executed, the services' restart counts are cleared to avoid reaching the restart limit. This is done by listing all services with the command systemctl list-dependencies --plain sonic.target. However, this list does not include the pmon service, or any other service that lacks WantedBy=sonic.target, which means pmon's start count is not cleared.
- How I did it
Sometimes pmon fails to restart because it reaches the start limit (3 times in 1200 seconds). During config reload, the pmon service can be started by featured or by syncd. Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, which means pmon can be restarted immediately by sonic.target, which is itself restarting again. As a result, the pmon service is more likely to reach the restart limit.
- How to verify it
Clear the restart count also for services that have a reverse dependency on sonic.target.
Signed-off-by: Stephen Sun <[email protected]>
…ng config reload (sonic-net#4314)
This is the same fix as PR sonic-net#4336 by @stephenxs, with corrected test assertion counts.
During config reload, _reset_failed_services() now also resets services that have a reverse dependency (BindsTo) on sonic.target, such as pmon. This ensures pmon's restart count is properly cleared and prevents start-limit-hit failures.
Changes:
- _get_sonic_services() accepts a reverse=False parameter to query reverse dependencies
- _reset_failed_services() unions forward and reverse dependencies
- Test mock updates with correct call_count assertions (19/15/15 vs. original 16/12/12)
The original PR sonic-net#4336 had an off-by-one error in test counts because the first mock was updated to return featured.timer as an additional forward dep, adding one more reset-failed call that wasn't accounted for.
Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
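To make the union and the call-count bookkeeping concrete, here is a minimal sketch. It is not the code from the PR: get_units, reset_failed_units and fake_run are hypothetical stand-ins, the command runner is injected so that no real systemctl is invoked, and the unit names returned by the fake runner are illustrative.

```python
from unittest import mock


def get_units(run, reverse=False):
    """List units related to sonic.target via an injected command runner."""
    cmd = ["systemctl", "list-dependencies", "--plain", "sonic.target"]
    if reverse:
        cmd.append("--reverse")
    out = run(cmd)
    # The first line of --plain output is the target itself, so skip it.
    return [line.strip() for line in out.splitlines()[1:] if line.strip()]


def reset_failed_units(run):
    """Union forward and reverse dependencies, then reset each unit once."""
    units = set(get_units(run)) | set(get_units(run, reverse=True))
    for unit in sorted(units):
        run(["systemctl", "reset-failed", unit])
    return sorted(units)


def fake_run(cmd):
    """Stand-in for the real runner; returns canned, illustrative output."""
    if "list-dependencies" in cmd:
        if "--reverse" in cmd:
            return "sonic.target\n  pmon.service\n"
        return "sonic.target\n  swss.service\n  featured.timer\n"
    return ""


runner = mock.Mock(side_effect=fake_run)
units = reset_failed_units(runner)
# Two list-dependencies calls plus one reset-failed call per unit.
expected_calls = 2 + len(units)
```

In this sketch every extra unit returned by a mocked listing adds one more reset-failed call, so any call_count assertion has to grow with the mock data, which is the kind of mismatch the commit message describes.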
Cherry-pick of PR sonic-net#22775 to 202511, rebased on latest 202511 to include the duplicate key fix from PR sonic-net#22796.
The test performs multiple config reloads that cause pmon start-limit-hit due to missing sonic.target symlinks after the systemd-sonic-generator rework.
Fix PRs: sonic-net/sonic-buildimage#25932, sonic-net/sonic-utilities#4314
Tracking issue: sonic-net/sonic-buildimage#25931
Signed-off-by: Storm Liang <[email protected]>
Co-authored-by: Copilot <[email protected]>
…ng config reload (#4314) (#4336)
* Fix issue: pmon services's restart count is not cleared during config reload (#4314)
- What I did
Currently, when "config reload" is executed, the services' restart counts are cleared to avoid reaching the restart limit. This is done by listing all services with the command systemctl list-dependencies --plain sonic.target. However, this list does not include the pmon service, or any other service that lacks WantedBy=sonic.target, which means pmon's start count is not cleared.
- How I did it
Sometimes pmon fails to restart because it reaches the start limit (3 times in 1200 seconds). During config reload, the pmon service can be started by featured or by syncd. Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, which means pmon can be restarted immediately by sonic.target, which is itself restarting again. As a result, the pmon service is more likely to reach the restart limit.
- How to verify it
Clear the restart count also for services that have a reverse dependency on sonic.target.
Signed-off-by: Stephen Sun <[email protected]>
* Fix UT failure
Signed-off-by: Stephen Sun <[email protected]>
---------
Signed-off-by: Stephen Sun <[email protected]>
…fig (#22830)
Cherry-pick of PR #22775 to 202511, rebased on latest 202511 to include the duplicate key fix from PR #22796.
The test performs multiple config reloads that cause pmon start-limit-hit due to missing sonic.target symlinks after the systemd-sonic-generator rework.
Fix PRs: sonic-net/sonic-buildimage#25932, sonic-net/sonic-utilities#4314
Tracking issue: sonic-net/sonic-buildimage#25931
Co-authored-by: Copilot <[email protected]>
…g config reload (sonic-net#4314)" This reverts commit 4d0cc93.
@stephenxs Can you cherry-pick this PR to the 202511 branch? It seems there are cherry-pick conflicts.
What I did
Currently, when "config reload" is executed, the services' restart counts are cleared to avoid reaching the restart limit. This is done by listing all services with the command systemctl list-dependencies --plain sonic.target. However, this list does not include the pmon service, or any other service that lacks WantedBy=sonic.target, which means pmon's start count is not cleared.
How I did it
Sometimes pmon fails to restart because it reaches the start limit (3 times in 1200 seconds). During config reload, the pmon service can be started by featured or by syncd. Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, which means pmon can be restarted immediately by sonic.target, which is itself restarting again. As a result, the pmon service is more likely to reach the restart limit.
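The effect of the start limit can be illustrated with a toy model. This is a deliberately simplified sliding-window counter, not systemd's actual rate-limiting algorithm; the StartLimiter class and the timestamps are hypothetical, and only the numbers 3 and 1200 come from the description above.

```python
from collections import deque


class StartLimiter:
    """Toy model: allow at most `burst` starts within `interval` seconds."""

    def __init__(self, burst=3, interval=1200):
        self.burst = burst
        self.interval = interval
        self.starts = deque()

    def try_start(self, now):
        # Forget starts that are older than the interval.
        while self.starts and now - self.starts[0] >= self.interval:
            self.starts.popleft()
        if len(self.starts) >= self.burst:
            return False  # corresponds to the start-limit-hit state
        self.starts.append(now)
        return True

    def reset_failed(self):
        # Models the counter being cleared before a reload restarts the unit.
        self.starts.clear()


# Four reloads in quick succession, each starting the service once.
reload_times = (0, 100, 200, 300)

limiter = StartLimiter()
without_reset = [limiter.try_start(t) for t in reload_times]

limiter = StartLimiter()
with_reset = []
for t in reload_times:
    limiter.reset_failed()
    with_reset.append(limiter.try_start(t))
```

In this model the fourth start is refused when the counter is never cleared, while clearing it before each reload lets every start through.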
How to verify it
Clear restart count also for services that have reverse dependency on sonic.target.
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)