Fix issue: pmon services's restart count is not cleared during config reload #4314

Merged
liat-grozovik merged 2 commits into sonic-net:master from stephenxs:fix-pmon-restart-count-not-clear on Mar 9, 2026

@stephenxs
Collaborator

What I did

Currently, when "config reload" is executed, services' restart counts are cleared so that they do not reach the restart limit. The services are found by listing the dependencies of sonic.target with the command systemctl list-dependencies --plain sonic.target. However, this list does not include the pmon service, nor any other service that lacks WantedBy=sonic.target, so pmon's start count is not cleared.

Sometimes pmon fails to restart because it reaches the start limit (3 times in 1200 seconds). The pmon service can be started by featured or by syncd during config reload. Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, so pmon can restart immediately when triggered by sonic.target, which is itself restarting once more. As a result, the pmon service is more likely to reach the restart limit.

How I did it

Clear the restart count also for services that have a reverse dependency on sonic.target.

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

… reload

What I did
Currently, when "config reload" is executed, services' restart counts are cleared so that they do not reach the restart limit. The services are found by listing the dependencies of sonic.target with the command systemctl list-dependencies --plain sonic.target.
However, this list does not include the pmon service, nor any other service that lacks WantedBy=sonic.target, so pmon's start count is not cleared.

After the multi-ASIC PRs were merged, there is a high probability that pmon fails to restart because it reaches the start limit (3 times in 1200 seconds).
The pmon service can be started by featured or by syncd during config reload.
Before multi-ASIC, pmon depended on syncd. That dependency was removed after multi-ASIC, so pmon can restart immediately when triggered by sonic.target, which is itself restarting once more. As a result, the pmon service is more likely to reach the restart limit.
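
To see why an uncleared counter matters, the start limit mentioned above (3 starts within 1200 seconds) can be modelled with a small counter. This is a simplified illustration, not systemd's actual implementation: `StartLimiter`, `try_start` and `reset` are hypothetical names, and `reset()` only stands in for the effect that `systemctl reset-failed` has on a unit's start counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimiter:
    """Simplified model of a start rate limit (hypothetical, for illustration)."""
    burst: int = 3            # allowed starts per window
    interval: float = 1200.0  # window length in seconds
    window_start: Optional[float] = None
    count: int = 0

    def try_start(self, now: float) -> bool:
        """Return True if a start at time `now` is allowed."""
        # Open a new window if none is active or the old one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # start limit hit

    def reset(self) -> None:
        """Forget previous starts, as a counter reset would."""
        self.window_start = None
        self.count = 0


limiter = StartLimiter()
# Four start attempts in quick succession: the fourth one is refused.
attempts = [limiter.try_start(t) for t in (0, 10, 20, 30)]
# After a reset, the next start is allowed again.
limiter.reset()
after_reset = limiter.try_start(40)
```

In this model, a service that is started by more than one trigger during each reload uses up its budget quickly unless the counter is reset between reloads.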

How I did it
Clear restart count also for services that have reverse dependency on sonic.target.
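
The approach can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code merged in this PR: `parse_units`, `list_dependencies_cmd`, `reset_failed` and `real_run` are hypothetical helpers, and the parser assumes the plain output lists one unit per line with the target itself on the first line.

```python
import subprocess
from typing import Callable, List, Set

TARGET = "sonic.target"


def parse_units(plain_output: str, target: str = TARGET) -> Set[str]:
    """Extract unit names from `systemctl list-dependencies --plain` output."""
    units = set()
    for line in plain_output.splitlines():
        # Drop indentation and any leading status bullet that may be printed.
        name = line.strip().lstrip("●○×* ").strip()
        if name and name != target:
            units.add(name)
    return units


def list_dependencies_cmd(reverse: bool = False, target: str = TARGET) -> List[str]:
    """Build the listing command; `--reverse` asks for units that depend on the target."""
    cmd = ["systemctl", "list-dependencies", "--plain", target]
    if reverse:
        cmd.append("--reverse")
    return cmd


def real_run(cmd: List[str]) -> str:
    """Default runner: execute the command and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def reset_failed(run: Callable[[List[str]], str] = real_run) -> List[str]:
    """Reset the failed state of forward and reverse dependencies of the target."""
    forward = parse_units(run(list_dependencies_cmd()))
    reverse = parse_units(run(list_dependencies_cmd(reverse=True)))
    units = sorted(forward | reverse)
    for unit in units:
        run(["systemctl", "reset-failed", unit])
    return units
```

Passing the runner as a parameter keeps the sketch testable without a running systemd: a fake runner can return canned listings and record the `reset-failed` calls.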

How to verify it
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)

Signed-off-by: Stephen Sun <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs stephenxs marked this pull request as draft February 27, 2026 06:44
@stephenxs stephenxs marked this pull request as ready for review March 2, 2026 07:16
@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Contributor

Cross-reference: Complementary fix in sonic-buildimage

Related PR: sonic-net/sonic-buildimage#25932 — adds the missing [Install] WantedBy=sonic.target section to 9 container service templates (pmon, lldp, gnmi, snmp, telemetry, otel, sflow, bmp, mgmt-framework).

Root cause analysis

After deeper investigation, we found the start-limit-hit issue has three contributing factors:

  1. Missing [Install] section (fixed by sonic-buildimage#25932): After the systemd-sonic-generator rework (the "Trixie base image upgrade" PR, sonic-buildimage#23340), the generator only creates sonic.target.wants/ symlinks based on [Install] targets. These 9 services have BindsTo=sonic.target but no [Install], so they are no longer listed as sonic.target dependencies.

  2. _reset_failed_services() misses these services (fixed by this PR): It iterates systemctl list-dependencies --plain sonic.target which no longer includes these services. Rate limit counters are never reset between config reloads.

  3. featured daemon issues redundant systemctl start calls: On 202511, enable_feature() checks if unit_file_state == 'enabled': continue — but UnitFileState is static (no [Install]), so the check fails. Featured runs systemctl enable (fails silently due to the raise_exception=False Trixie change) then proceeds to systemctl start — adding an extra start attempt on every config reload. This is why pmon exceeds StartLimitBurst=3 after multiple reloads.

Why 202505 doesn't have this issue

On 202505, featured's enable_feature() uses raise_exception=True for the enable command → systemctl enable fails for static services → exception caught → systemctl start is never reached. On 202511, raise_exception=False causes the failure to be silently ignored → systemctl start proceeds → extra start attempts accumulate.
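
The control-flow difference described above can be illustrated with a small simulation. This is not the featured source code: `run_cmd`, `CommandError` and this `enable_feature` are hypothetical stand-ins, no real command is executed, and the assumption that enabling a static unit fails is taken from the analysis in this comment.

```python
class CommandError(Exception):
    """Raised by the simulated runner when a command fails."""


def run_cmd(cmd, issued, fails=False, raise_exception=True):
    """Record `cmd`; raise only if it fails and exceptions are requested."""
    issued.append(cmd)
    if fails and raise_exception:
        raise CommandError(cmd)


def enable_feature(unit_file_state, raise_exception):
    """Return the commands issued for one unit in this simplified model."""
    issued = []
    if unit_file_state == "enabled":
        return issued  # already enabled: nothing to do
    try:
        # Assumption from the analysis above: enabling a static unit fails.
        run_cmd("systemctl enable pmon.service", issued,
                fails=(unit_file_state == "static"),
                raise_exception=raise_exception)
        run_cmd("systemctl start pmon.service", issued)
    except CommandError:
        pass  # the failed enable aborts the sequence before the start
    return issued


with_exception = enable_feature("static", raise_exception=True)
without_exception = enable_feature("static", raise_exception=False)
```

With `raise_exception=True` the sequence stops after the failed enable, while with `raise_exception=False` the extra start is issued, which matches the behaviour difference described for 202505 and 202511.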

How the two PRs complement each other

| Fix | What it addresses |
| --- | --- |
| sonic-buildimage#25932 | Adds [Install] WantedBy=sonic.target → services become proper sonic.target dependencies, UnitFileState becomes enabled, featured skips already-enabled services |
| This PR (sonic-utilities#4314) | Adds --reverse to _get_sonic_services() → _reset_failed_services covers BindsTo services even without [Install] |

Both PRs are valid fixes. sonic-buildimage#25932 addresses all 3 factors, while this PR provides defense-in-depth for _reset_failed_services. Ideally both should merge.

Tracked by: sonic-net/sonic-buildimage#25931

cc @stephenxs @saiarcot895

StormLiangMS
StormLiangMS previously approved these changes Mar 6, 2026
Contributor

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS
Contributor

hi @qiluo-msft could you help to review and merge?

StormLiangMS added a commit to StormLiangMS/sonic-mgmt that referenced this pull request Mar 6, 2026
Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
def _get_sonic_services(reverse=False):
    cmd = ['systemctl', 'list-dependencies', '--plain', 'sonic.target']
    if reverse:
        cmd.append('--reverse')
Contributor

I would suggest also adding --type=service when adding --reverse. Otherwise, the listing will also include multi-user.target and graphical.target. Resetting the failed state of those units probably has no negative effects, but it might be better to restrict it to SONiC services.
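
A rough sketch of this suggestion is shown below. It is an illustration only, not the change that was pushed in response: `build_cmd` and `only_services` are hypothetical helpers, and the suffix filter is an optional extra safeguard that the review does not ask for.

```python
from typing import Iterable, List


def build_cmd(reverse: bool = False) -> List[str]:
    """Build the listing command for sonic.target."""
    cmd = ["systemctl", "list-dependencies", "--plain", "sonic.target"]
    if reverse:
        # Restrict the reverse listing to service units so that targets such
        # as multi-user.target and graphical.target are not returned.
        cmd += ["--reverse", "--type=service"]
    return cmd


def only_services(units: Iterable[str]) -> List[str]:
    """Keep only names that end in .service (defensive filter on parsed names)."""
    return [unit for unit in units if unit.endswith(".service")]


reverse_cmd = build_cmd(reverse=True)
filtered = only_services(["pmon.service", "multi-user.target", "graphical.target"])
```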

Collaborator Author

Fixed

wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 7, 2026
@StormLiangMS
Contributor

StormLiangMS commented Mar 7, 2026

hi @stephenxs could you check comments from Saikrishna? And file a separate PR to 202511 branch? Thanks for the fix. Fixes: sonic-net/sonic-buildimage#25931

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 7, 2026
…c-net#22775)

Signed-off-by: mssonicbld <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Collaborator

As the feedback has been handled and the PR now has the additional request, I assume we can merge it.
@stephenxs can you please provide the PR for 202511 and relate to this one?

@liat-grozovik liat-grozovik merged commit 4d0cc93 into sonic-net:master Mar 9, 2026
9 checks passed
@stephenxs stephenxs deleted the fix-pmon-restart-count-not-clear branch March 9, 2026 07:49
stephenxs added a commit to stephenxs/sonic-utilities that referenced this pull request Mar 9, 2026
… reload (sonic-net#4314)

Signed-off-by: Stephen Sun <[email protected]>
StormLiangMS added a commit to StormLiangMS/sonic-utilities that referenced this pull request Mar 10, 2026
…ng config reload (sonic-net#4314)

This is the same fix as PR sonic-net#4336 by @stephenxs, with corrected test assertion counts.

During config reload, _reset_failed_services() now also resets services that
have a reverse dependency (BindsTo) on sonic.target, such as pmon. This ensures
pmon's restart count is properly cleared and prevents start-limit-hit failures.

Changes:
- _get_sonic_services() accepts reverse=False parameter to query reverse deps
- _reset_failed_services() unions forward and reverse dependencies
- Test mock updates with correct call_count assertions (19/15/15 vs original 16/12/12)

The original PR sonic-net#4336 had an off-by-one error in test counts because the first
mock was updated to return featured.timer as an additional forward dep, adding
one more reset-failed call that wasn't accounted for.

Signed-off-by: Storm Liang <[email protected]>

Co-authored-by: Copilot <[email protected]>
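
The sensitivity of such call-count assertions to the mocked dependency lists can be shown with a tiny example. This is a hypothetical sketch using `unittest.mock`, not the test code from either PR: `reset_failed_units` and the unit lists are made up for illustration, and the counts here are unrelated to the 19/15/15 values quoted above.

```python
from unittest import mock


def reset_failed_units(run, forward, reverse):
    """Call `run` once per unique unit in the union of both lists."""
    for unit in sorted(set(forward) | set(reverse)):
        run(["systemctl", "reset-failed", unit])


run = mock.Mock()
forward = ["swss.service", "syncd.service", "featured.timer"]
reverse = ["pmon.service", "swss.service"]
reset_failed_units(run, forward, reverse)

# One call per unique unit; swss.service appears in both lists but counts once.
expected_calls = len(set(forward) | set(reverse))
assert run.call_count == expected_calls
```

Adding one unit to a mocked forward list, as happened with featured.timer, raises the expected count by one for every assertion that depends on that list.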
StormLiangMS added a commit to StormLiangMS/sonic-mgmt that referenced this pull request Mar 10, 2026
Cherry-pick of PR sonic-net#22775 to 202511, rebased on latest 202511 to include
the duplicate key fix from PR sonic-net#22796.

The test performs multiple config reloads that cause pmon start-limit-hit
due to missing sonic.target symlinks after systemd-sonic-generator rework.

Fix PRs: sonic-net/sonic-buildimage#25932, sonic-net/sonic-utilities#4314
Tracking issue: sonic-net/sonic-buildimage#25931

Signed-off-by: Storm Liang <[email protected]>

Co-authored-by: Copilot <[email protected]>
vmittal-msft pushed a commit that referenced this pull request Mar 10, 2026
…ng config reload (#4314) (#4336)

* Fix issue: pmon services's restart count is not cleared during config reload (#4314)

Signed-off-by: Stephen Sun <[email protected]>

* Fix UT failure

Signed-off-by: Stephen Sun <[email protected]>

---------

Signed-off-by: Stephen Sun <[email protected]>
StormLiangMS added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 10, 2026
…fig (#22830)

Co-authored-by: Copilot <[email protected]>
ksravani-hcl pushed a commit to ksravani-hcl/sonic-mgmt that referenced this pull request Mar 10, 2026
…c-net#22775)

aronovic pushed a commit to aronovic/sonic-mgmt that referenced this pull request Mar 10, 2026
…c-net#22775)

Signed-off-by: Mihut Aronovici <[email protected]>
zili11720 added a commit to zili11720/sonic-utilities that referenced this pull request Mar 11, 2026
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Mar 16, 2026
…c-net#22775)

Signed-off-by: selldinesh <[email protected]>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
…c-net#22775)

Signed-off-by: Abhishek <[email protected]>
@saiarcot895
Contributor

@stephenxs Can you cherry-pick this PR for 202511 branch? It seems there are cherry-pick conflicts.

vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026
…c-net#22775)

Signed-off-by: Venkata Gouri Rajesh Etla <[email protected]>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…c-net#22775)

selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Apr 1, 2026
…c-net#22775)

Signed-off-by: selldinesh <[email protected]>
albertovillarreal-keys pushed a commit to albertovillarreal-keys/sonic-mgmt that referenced this pull request Apr 3, 2026
…c-net#22775)
