Bug: systemd-sonic-generator rework causes container services to hit start-limit-hit after multiple config reloads (202511) #25931

@StormLiangMS

Description

The systemd-sonic-generator rework in commit d9b0434c9 ("Update systemd-sonic-generator to make it work on Trixie") introduced two behavioral changes on the 202511 branch that cause container services (pmon, lldp, gnmi, snmp, telemetry, etc.) to hit start-limit-hit failures after multiple rapid config reloads:

  1. Container services now have UnitFileState=static instead of enabled. The featured daemon's enable_feature() uses the check if unit_file_state == "enabled": continue to skip starting services that are already enabled. Because the state is now static, the check no longer matches, and featured issues an extra systemctl start on every config reload cycle.

  2. Container services are no longer listed as dependencies of sonic.target. The output of systemctl list-dependencies --plain sonic.target no longer includes pmon, lldp, gnmi, snmp, telemetry, or any other container service. As a result, _reset_failed_services() (in sonic-utilities config load_minigraph) never resets their systemd start rate limit counters between reloads.

Combined effect: starts counted against the rate limit (StartLimitBurst=3 within StartLimitIntervalSec=1200) accumulate across config reload cycles, so after 3-4 reloads within 20 minutes container services hit start-limit-hit.
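
To make the accumulation concrete, here is a minimal sketch of the rate limit in Python. It is a simplified model, not systemd's code: it assumes a fixed window that opens at the first counted start, one start per reload, and that a reset between reloads (what systemctl reset-failed does for the start counter) clears the state. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StartLimiter:
    """Simplified model of StartLimitBurst / StartLimitIntervalSec."""
    burst: int = 3                 # StartLimitBurst from the issue
    interval_sec: float = 1200.0   # StartLimitIntervalSec from the issue
    window_start: Optional[float] = None
    count: int = 0

    def try_start(self, now: float) -> bool:
        # Open a new window if none is open or the old one has expired.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # corresponds to start-limit-hit

    def reset_failed(self) -> None:
        # Models the counter reset performed between reloads.
        self.window_start = None
        self.count = 0


def simulate(reload_times: List[float], reset_between_reloads: bool) -> List[bool]:
    """Return, per reload, whether the single modeled start succeeded."""
    limiter = StartLimiter()
    results = []
    for now in reload_times:
        if reset_between_reloads:
            limiter.reset_failed()
        results.append(limiter.try_start(now))
    return results
```

With four reloads five minutes apart, simulate([0, 300, 600, 900], reset_between_reloads=False) fails on the fourth start, while the same schedule with reset_between_reloads=True succeeds every time. Any extra start per reload, such as the one featured issues, would only use up the budget faster.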

Steps to Reproduce

  1. Run the nightly test override_config_table/test_override_config_table.py::test_load_minigraph_with_golden_config on a 202511 image
  2. This test performs 4 consecutive config load_minigraph operations (setup + empty_input + partial_config + full_config)
  3. On the 4th reload, pmon fails with start-limit-hit

Evidence from DUT (202511 image)

admin@dut:~$ systemctl show pmon.service --property=UnitFileState
UnitFileState=static

admin@dut:~$ systemctl show lldp.service --property=UnitFileState
UnitFileState=static

admin@dut:~$ systemctl list-dependencies --plain sonic.target | grep pmon
(empty - pmon is NOT listed)

admin@dut:~$ systemctl list-dependencies --plain sonic.target | grep lldp
(empty - lldp is NOT listed either)
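
The same check can be run for a whole list of services at once. The sketch below parses the text printed by systemctl list-dependencies --plain sonic.target and reports which services are absent. It assumes the first output line is the target itself and that each following line holds one unit name, possibly indented or prefixed with a status glyph; the function names are hypothetical.

```python
import re
from typing import Iterable, List, Set


def parse_list_dependencies(output: str) -> Set[str]:
    """Collect unit names from 'systemctl list-dependencies --plain' text."""
    units = set()
    # Skip the first line, which names the target being inspected.
    for line in output.splitlines()[1:]:
        # Drop indentation and any leading status glyph before the unit name.
        name = re.sub(r"^[^A-Za-z0-9_@-]+", "", line.strip())
        if name:
            units.add(name.split()[0])
    return units


def missing_from_target(output: str, services: Iterable[str]) -> List[str]:
    """Return the services whose <name>.service unit is not a dependency."""
    deps = parse_list_dependencies(output)
    return [svc for svc in services if f"{svc}.service" not in deps]
```

On a DUT the output argument could come from subprocess.run(["systemctl", "list-dependencies", "--plain", "sonic.target"], capture_output=True, text=True).stdout. On an affected image, pmon and lldp would be expected in the returned list.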

Syslog showing start-limit-hit on 4th reload:

WARNING systemd[1]: pmon.service: Start request repeated too quickly.
WARNING systemd[1]: pmon.service: Failed with result 'start-limit-hit'.
ERR systemd[1]: Failed to start pmon.service - Platform monitor container.

Root Cause

Commit d9b0434c9 reworked systemd-sonic-generator for Trixie compatibility. This changed how container service unit files are generated/installed, resulting in:

  • UnitFileState changing from enabled to static
  • Container services no longer being linked as Wants= dependencies of sonic.target

Impact

  • Affected branch: 202511 (confirmed), potentially any branch with the generator rework
  • Affected tests: Any test performing 3+ config reloads within 20 minutes
  • Affected services: All container services managed by featured (pmon, lldp, gnmi, snmp, telemetry, dhcp_relay, etc.)
  • Not seen on 202505: 202505 does not have the generator rework

Suggested Fix

Three options (can be combined):

  1. Fix the generator (preferred): Restore container services as Wants= of sonic.target so _reset_failed_services() properly resets their rate limits
  2. Fix featured: Change the enable_feature() check from if unit_file_state == "enabled" to if unit_file_state in ("enabled", "static") to skip redundant starts
  3. Fix _reset_failed_services(): Include all SONiC container services, not just sonic.target dependencies
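
Options 2 and 3 could look roughly like the sketch below. It is illustrative only: the helper names are hypothetical and it is not the actual featured or sonic-utilities code. It also assumes that treating static the same as enabled is safe for every feature, which would need to be confirmed against the generator's behavior.

```python
from typing import Iterable, List

# Option 2: unit file states for which an extra 'systemctl start' is skipped.
SKIP_START_STATES = frozenset({"enabled", "static"})


def parse_unit_file_state(show_output: str) -> str:
    """Extract the value from 'systemctl show --property=UnitFileState' text."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "UnitFileState":
            return value.strip()
    return ""  # property not present in the output


def should_skip_start(unit_file_state: str) -> bool:
    """True when the service is already installed and needs no extra start."""
    return unit_file_state in SKIP_START_STATES


def services_to_reset(target_deps: Iterable[str],
                      container_services: Iterable[str]) -> List[str]:
    """Option 3: sonic.target dependencies plus all container services."""
    ordered: List[str] = []
    seen = set()
    candidates = list(target_deps) + [f"{svc}.service" for svc in container_services]
    for unit in candidates:
        if unit not in seen:  # keep first occurrence, preserve order
            seen.add(unit)
            ordered.append(unit)
    return ordered
```

Each unit returned by services_to_reset() would then be passed to systemctl reset-failed before the reload restarts services, so the rate limit counter no longer depends on the unit being listed under sonic.target.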

Workaround

On a running DUT, create symlinks to restore container services as dependencies of sonic.target:

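# Recreate the sonic.target.wants/ symlinks so the container services show up as dependencies again.
sudo mkdir -p /etc/systemd/system/sonic.target.wants/
for svc in pmon lldp snmp gnmi telemetry dhcp_relay dhcp_server radv eventd; do
    # Resolve the unit file path; services not present on this DUT are skipped.
    svc_path=$(systemctl show "${svc}.service" -p FragmentPath --value 2>/dev/null)
    if [ -n "$svc_path" ] && [ -f "$svc_path" ]; then
        sudo ln -sf "$svc_path" "/etc/systemd/system/sonic.target.wants/${svc}.service"
    fi
done
# Reload unit files so systemd picks up the new Wants= links.
sudo systemctl daemon-reload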
