Skip to content

[202511] Fix issue: pmon services's restart count is not cleared during config reload (#4314)#4336

Merged
vmittal-msft merged 2 commits intosonic-net:202511from
stephenxs:202511-fix-pmon-rest-count
Mar 10, 2026
Merged

[202511] Fix issue: pmon services's restart count is not cleared during config reload (#4314)#4336
vmittal-msft merged 2 commits intosonic-net:202511from
stephenxs:202511-fix-pmon-rest-count

Conversation

@stephenxs
Copy link
Copy Markdown
Collaborator

Port #4314 to 202511

  • What I did Currently, when "config reload" is executed, services' restart count are cleared to avoid reaching restart limit. This is done by listing all services using command systemctl list-dependencies --plain .target. However, this doesn't include pmon service, neither all other services that don't have WantedBy=sonic.target, which means pmon's start count is not cleared.

  • How I did it Sometimes pmon fails to restart due to reaching start limit (3 times in 1200 seconds). The pmon service can be started by featured, syncd during config reload. Before multi-ASIC, pmon depends on syncd. The dependency is removed after multi-ASIC, which means pmon can restart immediately triggered by sonic.target which is once more restarting. As a result the pmon service is more likely to reach the restart limit.

  • How to verify it Clear restart count also for services that have reverse dependency on sonic.target.

What I did

How I did it

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

… reload (sonic-net#4314)

- What I did
Currently, when "config reload" is executed, services' restart count are cleared to avoid reaching restart limit. This is done by listing all services using command systemctl list-dependencies --plain .target. However, this doesn't include pmon service, neither all other services that don't have WantedBy=sonic.target, which means pmon's start count is not cleared.

- How I did it
Sometimes pmon fails to restart due to reaching start limit (3 times in 1200 seconds). The pmon service can be started by featured, syncd during config reload. Before multi-ASIC, pmon depends on syncd. The dependency is removed after multi-ASIC, which means pmon can restart immediately triggered by sonic.target which is once more restarting. As a result the pmon service is more likely to reach the restart limit.

- How to verify it
Clear restart count also for services that have reverse dependency on sonic.target.

Signed-off-by: Stephen Sun <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Copy link
Copy Markdown
Contributor

hi @stephenxs could you check the PR failures?

@StormLiangMS
Copy link
Copy Markdown
Contributor

Hi @stephenxs, I analyzed the CI failure and found an off-by-one error in the \mock_run_command.call_count\ assertions in 3 tests:

Root Cause:
The first mock (\mock_run_command_side_effect) now returns 'sonic.target\nswss\nfeatured.timer'\ for forward deps (was 'sonic.target\nswss'), which adds an extra \systemctl reset-failed featured.timer\ call. Combined with the new reverse deps call (\pmon), the actual delta from original is +3 calls (1 list-deps reverse + 1 reset-failed \ eatured.timer\ + 1 reset-failed \pmon), not +2.

Failing assertions:

  • \ est_load_minigraph: actual 19, expected 18 → should be 19
  • \ est_load_minigraph_bypass_lock: actual 15, expected 14 → should be 15
  • \ est_load_minigraph_platform_plugin: actual 15, expected 14 → should be 15

Fix: Increment each \call_count\ assertion by 1 in \ ests/config_test.py.

I've also raised a parallel PR with the fix applied (#4340 forthcoming) in case this can't be addressed today, so we can unblock CI.

Signed-off-by: Stephen Sun <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@stephenxs
Copy link
Copy Markdown
Collaborator Author

Hi @stephenxs, I analyzed the CI failure and found an off-by-one error in the \mock_run_command.call_count\ assertions in 3 tests:

Root Cause: The first mock (\mock_run_command_side_effect) now returns 'sonic.target\nswss\nfeatured.timer'\ for forward deps (was 'sonic.target\nswss'), which adds an extra \systemctl reset-failed featured.timer\ call. Combined with the new reverse deps call (\pmon), the actual delta from original is +3 calls (1 list-deps reverse + 1 reset-failed \ eatured.timer\ + 1 reset-failed \pmon), not +2.

Failing assertions:

  • \ est_load_minigraph: actual 19, expected 18 → should be 19
  • \ est_load_minigraph_bypass_lock: actual 15, expected 14 → should be 15
  • \ est_load_minigraph_platform_plugin: actual 15, expected 14 → should be 15

Fix: Increment each \call_count\ assertion by 1 in \ ests/config_test.py.

I've also raised a parallel PR with the fix applied (#4340 forthcoming) in case this can't be addressed today, so we can unblock CI.

Thanks @StormLiangMS
Fixed

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

StormLiangMS added a commit to StormLiangMS/sonic-utilities that referenced this pull request Mar 10, 2026
…ng config reload (sonic-net#4314)

This is the same fix as PR sonic-net#4336 by @stephenxs, with corrected test assertion counts.

During config reload, _reset_failed_services() now also resets services that
have a reverse dependency (BindsTo) on sonic.target, such as pmon. This ensures
pmon's restart count is properly cleared and prevents start-limit-hit failures.

Changes:
- _get_sonic_services() accepts reverse=False parameter to query reverse deps
- _reset_failed_services() unions forward and reverse dependencies
- Test mock updates with correct call_count assertions (19/15/15 vs original 16/12/12)

The original PR sonic-net#4336 had an off-by-one error in test counts because the first
mock was updated to return featured.timer as an additional forward dep, adding
one more reset-failed call that wasn't accounted for.

Signed-off-by: Storm Liang <[email protected]>

Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants