Description
SNMP service start is delayed for 3 min and depends on SWSS service.
When issuing sudo warm-reboot after daemons reconciled state was reached (~2 min after system start) SNMP service may wake up in the middle of warm-reboot script, after SWSS was killed by warm-reboot script, and start SWSS container again. This causes warm reboot failure and system becomes in inconsistent state.
See syslog messages from "Apr 5 13:24:46.321628" to "Apr 5 13:25:50.512102"
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: drained queue
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- handleRestartQuery: received WARM switch shutdown event
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- profile_get_value: SAI_WARM_BOOT_WRITE_FILE: /var/warmboot/
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: using warmBootWriteFile: '/var/warmboot/'
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Warm Reboot requested, keeping data plane running
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Removing the switch gSwitchId=0x100000021
Apr 5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Fast/warm reboot requested, keeping data plane running
Apr 5 13:25:03.784348 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set RESTART_WARM, key:Switch ID 1, val:true
Apr 5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set UNINIT_DATA_PLANE_ON_REMOVAL, key:Switch ID 1, val:false
Apr 5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE SAI_SWITCH: mlnx_sai_switch.c[5041]- mlnx_disconnect_switch: Disconnect switch
Apr 5 13:25:03.784896 arc-switch1025 INFO syncd.sh[12173]: requested WARM shutdown
Apr 5 13:25:03.785978 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: remove switch took 0.001976 sec
Apr 5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: calling api uninitialize
Apr 5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: uninitialize finished
Apr 5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,752 INFO reaped unknown pid 39
Apr 5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,753 INFO reaped unknown pid 31
Apr 5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,754 INFO reaped unknown pid 23
Apr 5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,756 INFO reaped unknown pid 47
Apr 5 13:25:04.325406 arc-switch1025 NOTICE root: Finished warm shutdown syncd process ...
Apr 5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:03,801 INFO exited: syncd (exit status 0; expected)
Apr 5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 WARN received SIGTERM indicating exit request
Apr 5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr 5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:07,484 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr 5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:10,488 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr 5 13:25:14.482720 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.482018945Z" level=info msg="Container f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf failed to exit within 10 seconds of signal 15 - us
ing the force"
Apr 5 13:25:14.746779 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:14.746534087Z" level=info msg="shim reaped" id=f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf
Apr 5 13:25:14.756912 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.756756411Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 5 13:25:14.812920 arc-switch1025 INFO syncd.sh[12173]: syncd
Apr 5 13:25:14.813636 arc-switch1025 INFO syncd.sh[6050]: 137
Apr 5 13:25:14.822741 arc-switch1025 NOTICE root: Stopped syncd service...
Apr 5 13:25:14.829094 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Apr 5 13:25:14.842953 arc-switch1025 INFO systemd[1]: Stopped syncd service.
Apr 5 13:25:43.154294 arc-switch1025 INFO systemd[1]: Starting switch state service...
Apr 5 13:25:43.164059 arc-switch1025 NOTICE root: Starting swss service...
Apr 5 13:25:43.171856 arc-switch1025 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Apr 5 13:25:43.181013 arc-switch1025 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Apr 5 13:25:44.918430 arc-switch1025 NOTICE root: Warm boot flag: swss true.
Apr 5 13:25:46.137482 arc-switch1025 INFO swss.sh[12753]: Starting existing swss container with HWSKU ACS-MSN2700
Apr 5 13:25:46.244446 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:46.242046842Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/d8fdfb3a13a185551f59650deea154bd6144adeb2c030377374133b58b7b2
54c/shim.sock" debug=false pid=13130
Apr 5 13:25:46.502350 arc-switch1025 INFO swss.sh[12753]: swss
Apr 5 13:25:47.386488 arc-switch1025 NOTICE root: Started swss service...
Apr 5 13:25:47.399593 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Apr 5 13:25:47.412764 arc-switch1025 INFO systemd[1]: Started switch state service.
Apr 5 13:25:47.413771 arc-switch1025 INFO systemd[1]: Starting SNMP container...
Apr 5 13:25:49.406666 arc-switch1025 INFO snmp.sh[13325]: Starting existing snmp container with HWSKU ACS-MSN2700
Apr 5 13:25:49.575907 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:49.575188155Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/13a2ea7a84dfe96b70d702893c561f4cd8732f50be9552937f1f169060951c25/shim.sock" debug=false pid=13547
Apr 5 13:25:49.949186 arc-switch1025 INFO snmp.sh[13325]: snmp
Steps to reproduce the issue:
- execute
sudo warm-reboot few times
Describe the results you received:
See description
Describe the results you expected:
SNMP start does not mess with warm-reboot
Additional information you deem important (e.g. issue happens only occasionally):
**Output of `show version`:**
SONiC Software Version: SONiC.HEAD.42-5c663ca
Distribution: Debian 9.8
Kernel: 4.9.0-8-amd64
Build commit: 5c663ca
Build date: Fri Apr 5 04:51:41 UTC 2019
Built by: johnar@jenkins-worker-3
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent-mlnx HEAD.42-5c663ca 92227cd6d9e5 286MB
docker-orchagent-mlnx latest 92227cd6d9e5 286MB
docker-syncd-mlnx HEAD.42-5c663ca 213b01ae948a 331MB
docker-syncd-mlnx latest 213b01ae948a 331MB
docker-lldp-sv2 HEAD.42-5c663ca 27eb551cdb9e 274MB
docker-lldp-sv2 latest 27eb551cdb9e 274MB
docker-dhcp-relay HEAD.42-5c663ca c918d4fb6c35 256MB
docker-dhcp-relay latest c918d4fb6c35 256MB
docker-database HEAD.42-5c663ca 20c13f3c77c4 255MB
docker-database latest 20c13f3c77c4 255MB
docker-teamd HEAD.42-5c663ca a29189c45fe7 274MB
docker-teamd latest a29189c45fe7 274MB
docker-snmp-sv2 HEAD.42-5c663ca c102aa38ddf5 294MB
docker-snmp-sv2 latest c102aa38ddf5 294MB
docker-router-advertiser HEAD.42-5c663ca c46d7f87c715 254MB
docker-router-advertiser latest c46d7f87c715 254MB
docker-platform-monitor HEAD.42-5c663ca 6784680cc095 286MB
docker-platform-monitor latest 6784680cc095 286MB
docker-fpm-quagga HEAD.42-5c663ca d26abe5b1234 281MB
docker-fpm-quagga latest d26abe5b1234 281MB
**Attach debug file `sudo generate_dump`:**
```
syslog.txt
(paste your output here)
```
Description
SNMP service start is delayed for 3 min and depends on SWSS service.
When issuing
sudo warm-rebootafter daemons reconciled state was reached (~2 min after system start) SNMP service may wake up in the middle of warm-reboot script, after SWSS was killed by warm-reboot script, and start SWSS container again. This causes warm reboot failure and system becomes in inconsistent state.See syslog messages from "Apr 5 13:24:46.321628" to "Apr 5 13:25:50.512102"
Steps to reproduce the issue:
sudo warm-rebootfew timesDescribe the results you received:
See description
Describe the results you expected:
SNMP start does not mess with warm-reboot
Additional information you deem important (e.g. issue happens only occasionally):
syslog.txt