Skip to content

SNMP service started and triggered SWSS start in the middle of warm-reboot #2750

@stepanblyschak

Description

@stepanblyschak

Description
SNMP service start is delayed for 3 min and depends on SWSS service.
When issuing sudo warm-reboot after daemons reconciled state was reached (~2 min after system start) SNMP service may wake up in the middle of warm-reboot script, after SWSS was killed by warm-reboot script, and start SWSS container again. This causes warm reboot failure and system becomes in inconsistent state.

See syslog messages from "Apr 5 13:24:46.321628" to "Apr 5 13:25:50.512102"

Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: drained queue
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- handleRestartQuery: received WARM switch shutdown event
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- profile_get_value: SAI_WARM_BOOT_WRITE_FILE: /var/warmboot/
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: using warmBootWriteFile: '/var/warmboot/'
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Warm Reboot requested, keeping data plane running
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Removing the switch gSwitchId=0x100000021
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Fast/warm reboot requested, keeping data plane running
Apr  5 13:25:03.784348 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set RESTART_WARM, key:Switch ID 1, val:true
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set UNINIT_DATA_PLANE_ON_REMOVAL, key:Switch ID 1, val:false
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_SWITCH: mlnx_sai_switch.c[5041]- mlnx_disconnect_switch: Disconnect switch
Apr  5 13:25:03.784896 arc-switch1025 INFO syncd.sh[12173]: requested WARM shutdown
Apr  5 13:25:03.785978 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: remove switch took 0.001976 sec
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: calling api uninitialize
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: uninitialize finished
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,752 INFO reaped unknown pid 39
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,753 INFO reaped unknown pid 31
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,754 INFO reaped unknown pid 23
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,756 INFO reaped unknown pid 47
Apr  5 13:25:04.325406 arc-switch1025 NOTICE root: Finished warm shutdown syncd process ...
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:03,801 INFO exited: syncd (exit status 0; expected)
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 WARN received SIGTERM indicating exit request
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:07,484 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:10,488 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:14.482720 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.482018945Z" level=info msg="Container f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf failed to exit within 10 seconds of signal 15 - us
ing the force"
Apr  5 13:25:14.746779 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:14.746534087Z" level=info msg="shim reaped" id=f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf
Apr  5 13:25:14.756912 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.756756411Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr  5 13:25:14.812920 arc-switch1025 INFO syncd.sh[12173]: syncd
Apr  5 13:25:14.813636 arc-switch1025 INFO syncd.sh[6050]: 137
Apr  5 13:25:14.822741 arc-switch1025 NOTICE root: Stopped syncd service...
Apr  5 13:25:14.829094 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Apr  5 13:25:14.842953 arc-switch1025 INFO systemd[1]: Stopped syncd service.
Apr  5 13:25:43.154294 arc-switch1025 INFO systemd[1]: Starting switch state service...
Apr  5 13:25:43.164059 arc-switch1025 NOTICE root: Starting swss service...
Apr  5 13:25:43.171856 arc-switch1025 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Apr  5 13:25:43.181013 arc-switch1025 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:44.918430 arc-switch1025 NOTICE root: Warm boot flag: swss true.
Apr  5 13:25:46.137482 arc-switch1025 INFO swss.sh[12753]: Starting existing swss container with HWSKU ACS-MSN2700
Apr  5 13:25:46.244446 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:46.242046842Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/d8fdfb3a13a185551f59650deea154bd6144adeb2c030377374133b58b7b2
54c/shim.sock" debug=false pid=13130
Apr  5 13:25:46.502350 arc-switch1025 INFO swss.sh[12753]: swss
Apr  5 13:25:47.386488 arc-switch1025 NOTICE root: Started swss service...
Apr  5 13:25:47.399593 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:47.412764 arc-switch1025 INFO systemd[1]: Started switch state service.
Apr  5 13:25:47.413771 arc-switch1025 INFO systemd[1]: Starting SNMP container...
Apr  5 13:25:49.406666 arc-switch1025 INFO snmp.sh[13325]: Starting existing snmp container with HWSKU ACS-MSN2700
Apr  5 13:25:49.575907 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:49.575188155Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/13a2ea7a84dfe96b70d702893c561f4cd8732f50be9552937f1f169060951c25/shim.sock" debug=false pid=13547
Apr  5 13:25:49.949186 arc-switch1025 INFO snmp.sh[13325]: snmp

Steps to reproduce the issue:

  1. execute sudo warm-reboot few times

Describe the results you received:
See description

Describe the results you expected:
SNMP start does not mess with warm-reboot

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
SONiC Software Version: SONiC.HEAD.42-5c663ca
Distribution: Debian 9.8
Kernel: 4.9.0-8-amd64
Build commit: 5c663ca
Build date: Fri Apr  5 04:51:41 UTC 2019
Built by: johnar@jenkins-worker-3

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-orchagent-mlnx      HEAD.42-5c663ca     92227cd6d9e5        286MB
docker-orchagent-mlnx      latest              92227cd6d9e5        286MB
docker-syncd-mlnx          HEAD.42-5c663ca     213b01ae948a        331MB
docker-syncd-mlnx          latest              213b01ae948a        331MB
docker-lldp-sv2            HEAD.42-5c663ca     27eb551cdb9e        274MB
docker-lldp-sv2            latest              27eb551cdb9e        274MB
docker-dhcp-relay          HEAD.42-5c663ca     c918d4fb6c35        256MB
docker-dhcp-relay          latest              c918d4fb6c35        256MB
docker-database            HEAD.42-5c663ca     20c13f3c77c4        255MB
docker-database            latest              20c13f3c77c4        255MB
docker-teamd               HEAD.42-5c663ca     a29189c45fe7        274MB
docker-teamd               latest              a29189c45fe7        274MB
docker-snmp-sv2            HEAD.42-5c663ca     c102aa38ddf5        294MB
docker-snmp-sv2            latest              c102aa38ddf5        294MB
docker-router-advertiser   HEAD.42-5c663ca     c46d7f87c715        254MB
docker-router-advertiser   latest              c46d7f87c715        254MB
docker-platform-monitor    HEAD.42-5c663ca     6784680cc095        286MB
docker-platform-monitor    latest              6784680cc095        286MB
docker-fpm-quagga          HEAD.42-5c663ca     d26abe5b1234        281MB
docker-fpm-quagga          latest              d26abe5b1234        281MB
**Attach debug file `sudo generate_dump`:**

```

syslog.txt

(paste your output here)
```

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions