[202511] Fix issue: pmon services's restart count is not cleared during config reload (#4314)#4336
Conversation
… reload (sonic-net#4314) - What I did Currently, when "config reload" is executed, services' restart count are cleared to avoid reaching restart limit. This is done by listing all services using command systemctl list-dependencies --plain .target. However, this doesn't include pmon service, neither all other services that don't have WantedBy=sonic.target, which means pmon's start count is not cleared. - How I did it Sometimes pmon fails to restart due to reaching start limit (3 times in 1200 seconds). The pmon service can be started by featured, syncd during config reload. Before multi-ASIC, pmon depends on syncd. The dependency is removed after multi-ASIC, which means pmon can restart immediately triggered by sonic.target which is once more restarting. As a result the pmon service is more likely to reach the restart limit. - How to verify it Clear restart count also for services that have reverse dependency on sonic.target. Signed-off-by: Stephen Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
hi @stephenxs could you check the PR failures? |
|
Hi @stephenxs, I analyzed the CI failure and found an off-by-one error in the \mock_run_command.call_count\ assertions in 3 tests: Root Cause: Failing assertions:
Fix: Increment each \call_count\ assertion by 1 in \ ests/config_test.py. I've also raised a parallel PR with the fix applied (#4340 forthcoming) in case this can't be addressed today, so we can unblock CI. |
Signed-off-by: Stephen Sun <[email protected]>
|
/azp run |
Thanks @StormLiangMS |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…ng config reload (sonic-net#4314) This is the same fix as PR sonic-net#4336 by @stephenxs, with corrected test assertion counts. During config reload, _reset_failed_services() now also resets services that have a reverse dependency (BindsTo) on sonic.target, such as pmon. This ensures pmon's restart count is properly cleared and prevents start-limit-hit failures. Changes: - _get_sonic_services() accepts reverse=False parameter to query reverse deps - _reset_failed_services() unions forward and reverse dependencies - Test mock updates with correct call_count assertions (19/15/15 vs original 16/12/12) The original PR sonic-net#4336 had an off-by-one error in test counts because the first mock was updated to return featured.timer as an additional forward dep, adding one more reset-failed call that wasn't accounted for. Signed-off-by: Storm Liang <[email protected]> Co-authored-by: Copilot <[email protected]>
Port #4314 to 202511
What I did Currently, when "config reload" is executed, services' restart count are cleared to avoid reaching restart limit. This is done by listing all services using command systemctl list-dependencies --plain .target. However, this doesn't include pmon service, neither all other services that don't have WantedBy=sonic.target, which means pmon's start count is not cleared.
How I did it Sometimes pmon fails to restart due to reaching start limit (3 times in 1200 seconds). The pmon service can be started by featured, syncd during config reload. Before multi-ASIC, pmon depends on syncd. The dependency is removed after multi-ASIC, which means pmon can restart immediately triggered by sonic.target which is once more restarting. As a result the pmon service is more likely to reach the restart limit.
How to verify it Clear restart count also for services that have reverse dependency on sonic.target.
What I did
How I did it
How to verify it
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)