Bug: systemd-sonic-generator rework causes container services to hit start-limit-hit after multiple config reloads (202511) #25931
Description
The systemd-sonic-generator rework in commit d9b0434c9 ("Update systemd-sonic-generator to make it work on Trixie") introduced two behavioral changes on the 202511 branch that cause container services (pmon, lldp, gnmi, snmp, telemetry, etc.) to hit start-limit-hit failures after multiple rapid config reloads:
- Container services now have UnitFileState=static instead of enabled. The featured daemon's enable_feature() checks if unit_file_state == "enabled": continue to skip starting already-enabled services. Since the state is now static, this check fails, and featured issues an extra systemctl start on every config reload cycle.
- Container services are no longer listed as dependencies of sonic.target. systemctl list-dependencies --plain sonic.target no longer includes pmon, lldp, gnmi, snmp, telemetry, or any container service. This means _reset_failed_services() (in sonic-utilities config load_minigraph) never resets their systemd start rate limit counters between reloads.
Combined effect: the start rate limit (StartLimitBurst=3 within StartLimitIntervalSec=1200) accumulates across config reload cycles, and after 3-4 reloads within 20 minutes, container services hit start-limit-hit.
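The accumulation can be illustrated with a small model. This is a simplified sketch, not systemd's implementation: it assumes a fixed window that opens at the first counted start, assumes one counted start per reload, and uses a hypothetical reset() method to stand in for systemctl reset-failed. The class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StartLimit:
    """Simplified fixed-window model of a unit's start rate limit."""
    burst: int = 3        # StartLimitBurst from the report
    interval: int = 1200  # StartLimitIntervalSec from the report, in seconds
    window_start: Optional[float] = None
    count: int = 0

    def try_start(self, now: float) -> bool:
        """Return True if a start request at time `now` is allowed."""
        # Open a new window on the first start or once the old one expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # corresponds to start-limit-hit in this model

    def reset(self) -> None:
        """Stand-in for `systemctl reset-failed` clearing the counter."""
        self.window_start = None
        self.count = 0


def simulate(reload_times: List[float], reset_between_reloads: bool) -> List[bool]:
    """Replay reloads (assumption: one counted start each) against the limit."""
    limit = StartLimit()
    results = []
    for t in reload_times:
        if reset_between_reloads:
            limit.reset()
        results.append(limit.try_start(t))
    return results
```

Under these assumptions, four reloads spaced two minutes apart exhaust the limit on the fourth start when the counter is never reset, while the same sequence succeeds when the counter is reset before each reload.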
Steps to Reproduce
- Run the nightly test override_config_table/test_override_config_table.py::test_load_minigraph_with_golden_config on a 202511 image
- This test performs 4 consecutive config load_minigraph operations (setup + empty_input + partial_config + full_config)
- On the 4th reload, pmon fails with start-limit-hit
Evidence from DUT (202511 image)
admin@dut:~$ systemctl show pmon.service --property=UnitFileState
UnitFileState=static
admin@dut:~$ systemctl show lldp.service --property=UnitFileState
UnitFileState=static
admin@dut:~$ systemctl list-dependencies --plain sonic.target | grep pmon
(empty - pmon is NOT listed)
admin@dut:~$ systemctl list-dependencies --plain sonic.target | grep lldp
(empty - lldp is NOT listed either)
Syslog showing start-limit-hit on 4th reload:
WARNING systemd[1]: pmon.service: Start request repeated too quickly.
WARNING systemd[1]: pmon.service: Failed with result 'start-limit-hit'.
ERR systemd[1]: Failed to start pmon.service - Platform monitor container.
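The two checks from the evidence above could be scripted along these lines. This is a minimal sketch: the helper names are hypothetical, and it assumes the output shapes shown in this report (a UnitFileState=... line from systemctl show, and one unit name per line from systemctl list-dependencies --plain).

```python
import subprocess


def parse_unit_file_state(show_output: str) -> str:
    """Extract the value from a `UnitFileState=...` line, or '' if absent."""
    for line in show_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "UnitFileState":
            return value
    return ""


def parse_dependencies(list_output: str) -> set:
    """Collect unit names from `systemctl list-dependencies --plain` output."""
    # Take the last token on each non-empty line so indentation is ignored.
    return {line.split()[-1] for line in list_output.splitlines() if line.strip()}


def check_service(name: str, target: str = "sonic.target") -> dict:
    """Run both checks on the DUT for one container service."""
    unit = f"{name}.service"
    show = subprocess.run(
        ["systemctl", "show", unit, "--property=UnitFileState"],
        capture_output=True, text=True, check=False,
    ).stdout
    deps = subprocess.run(
        ["systemctl", "list-dependencies", "--plain", target],
        capture_output=True, text=True, check=False,
    ).stdout
    return {
        "unit": unit,
        "unit_file_state": parse_unit_file_state(show),
        "in_target": unit in parse_dependencies(deps),
    }
```

On an affected image, check_service("pmon") would be expected to report a unit_file_state of static and in_target as False, matching the manual output above.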
Root Cause
Commit d9b0434c9 reworked systemd-sonic-generator for Trixie compatibility. This changed how container service unit files are generated/installed, resulting in:
- UnitFileState changing from enabled to static
- Container services no longer being linked as Wants= dependencies of sonic.target
Impact
- Affected branch: 202511 (confirmed), potentially any branch with the generator rework
- Affected tests: Any test performing 3+ config reloads within 20 minutes
- Affected services: All container services managed by featured (pmon, lldp, gnmi, snmp, telemetry, dhcp_relay, etc.)
- Not seen on 202505: 202505 does not have the generator rework
Suggested Fix
Three options (can be combined):
- Fix the generator (preferred): Restore container services as Wants= of sonic.target so _reset_failed_services() properly resets their rate limits
- Fix featured: Change the enable_feature() check from if unit_file_state == "enabled" to if unit_file_state in ("enabled", "static") to skip redundant starts
- Fix _reset_failed_services(): Include all SONiC container services, not just sonic.target dependencies
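The second and third options might look roughly like the following. This is a standalone sketch rather than the actual featured or sonic-utilities code: should_skip_start, reset_failed_commands, and SKIP_START_STATES are hypothetical names, and the list of container services is assumed to be supplied by the caller.

```python
# Unit file states for which an extra `systemctl start` would be skipped
# (option 2: treat "static" the same as "enabled").
SKIP_START_STATES = ("enabled", "static")


def should_skip_start(unit_file_state: str) -> bool:
    """Return True if the service should not be started again."""
    return unit_file_state in SKIP_START_STATES


def reset_failed_commands(target_deps, container_services):
    """Build reset-failed commands for target deps plus container services.

    Option 3: the caller passes the units listed under sonic.target and the
    container service names, so both sets get their counters reset.
    """
    units = set(target_deps)
    units.update(f"{name}.service" for name in container_services)
    return [["systemctl", "reset-failed", unit] for unit in sorted(units)]
```

For example, reset_failed_commands(["swss.service"], ["pmon", "lldp"]) yields one systemctl reset-failed command each for lldp.service, pmon.service, and swss.service, which a caller could then pass to subprocess.run.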
Workaround
On a running DUT, create symlinks to restore container services as dependencies of sonic.target:
sudo mkdir -p /etc/systemd/system/sonic.target.wants/
for svc in pmon lldp snmp gnmi telemetry dhcp_relay dhcp_server radv eventd; do
    # Resolve the unit file path; skip services that are not present
    svc_path=$(systemctl show "${svc}.service" -p FragmentPath --value 2>/dev/null)
    if [ -n "$svc_path" ] && [ -f "$svc_path" ]; then
        sudo ln -sf "$svc_path" "/etc/systemd/system/sonic.target.wants/${svc}.service"
    fi
done
sudo systemctl daemon-reload