Add watchdog mechanism to swss service and generate alert when swss have issue.#14686

Merged
qiluo-msft merged 3 commits into sonic-net:master from liuh-80:dev/liuh/add-heart-beat on Jun 6, 2023

@liuh-80 liuh-80 commented Apr 17, 2023

This PR depends on sonic-net/sonic-swss#2737, which must be merged first.

What I did
Add an orchagent watchdog that monitors for a stuck orchagent process and raises an alert.

Why I did it
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

How I verified it
All unit tests pass.
Added a new test, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following error message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

Details if related
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
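
The detection idea described above (a process that still exists but has stopped emitting heartbeats) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class name, the 60-second threshold, and the method names are all assumptions chosen only so that the output resembles the log line quoted above.

```python
import time
from typing import Dict, List, Optional

# Hypothetical threshold: the actual timeout used by the listener is not stated in this PR.
STUCK_THRESHOLD_SECS = 60.0


class HeartbeatWatchdog:
    """Tracks the last heartbeat per process and reports processes that went silent."""

    def __init__(self, namespace: str = "host",
                 threshold_secs: float = STUCK_THRESHOLD_SECS) -> None:
        self.namespace = namespace
        self.threshold_secs = threshold_secs
        self.last_heartbeat: Dict[str, float] = {}

    def record_heartbeat(self, process_name: str, now: Optional[float] = None) -> None:
        # Remember when the process last proved it was making progress.
        self.last_heartbeat[process_name] = time.monotonic() if now is None else now

    def check(self, now: Optional[float] = None) -> List[str]:
        # Return one alert string per process whose heartbeat is overdue.
        now = time.monotonic() if now is None else now
        alerts = []
        for name, last in self.last_heartbeat.items():
            elapsed = now - last
            if elapsed >= self.threshold_secs:
                alerts.append(
                    "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
                        name, self.namespace, elapsed / 60.0))
        return alerts
```

A process paused with 'kill -STOP <pid>' still exists, so an existence check passes, but it stops calling the equivalent of `record_heartbeat`, and `check` starts returning an alert once the threshold elapses.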

@liuh-80 liuh-80 force-pushed the dev/liuh/add-heart-beat branch from d662879 to 2b05c34 Compare April 28, 2023 08:51
@liuh-80 liuh-80 changed the title [POC] Add heartbeat monitor for orchagent [POC] Add proc stuck watchdog for orchagent Apr 28, 2023
@liuh-80 liuh-80 changed the title [POC] Add proc stuck watchdog for orchagent Add watchdog mechanism to swss service and generate alert when swss have issue. May 15, 2023
@liuh-80 liuh-80 marked this pull request as ready for review May 15, 2023 02:40
Collaborator

listener

Could you explore how much code could be reused if this listener were combined with the "supervisor-proc-exit-listener" above?

Contributor Author

Fixed: merged the code into supervisor-proc-exit-listener.

@liuh-80 liuh-80 force-pushed the dev/liuh/add-heart-beat branch from 93e8aa2 to 46cb307 Compare May 22, 2023 03:15
priority=4
autostart=false
autorestart=false
stdout_capture_maxbytes=1MB
Collaborator

stdout_capture_maxbytes

What is the reason for this change?

Contributor Author

This setting enables stdout capture for orchagent, so that supervisord converts orchagent heartbeat messages into PROCESS_COMMUNICATION_STDOUT events.
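
To illustrate the mechanism in this reply: in Supervisor (the process manager whose program settings are under review here), a non-zero `stdout_capture_maxbytes` turns on capture mode, and text that a child process writes between `<!--XSUPERVISOR:BEGIN-->` and `<!--XSUPERVISOR:END-->` tags is delivered to event listeners as a PROCESS_COMMUNICATION_STDOUT event. The sketch below parses one such event as a listener might receive it. The function names and the `heartbeat` payload text are hypothetical, and this is not the listener code from this PR.

```python
from typing import Dict, Tuple


def parse_tokens(line: str) -> Dict[str, str]:
    """Parse a space-separated line of key:value tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def parse_communication_event(header_line: str, payload: str) -> Tuple[str, str, str]:
    """Return (event name, process name, captured data) for one event.

    header_line is the line Supervisor writes to the listener's stdin;
    payload is the body of `len` bytes that follows it. For communication
    events the first payload line carries process metadata and the rest is
    the text the process emitted between the capture tags.
    """
    headers = parse_tokens(header_line)
    meta_line, _, data = payload.partition("\n")
    meta = parse_tokens(meta_line)
    return headers["eventname"], meta["processname"], data
```

A listener built on this would treat each such event from orchagent as a heartbeat and record its arrival time. The exact heartbeat text emitted by orchagent is defined in sonic-net/sonic-swss#2737 and is not reproduced here.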

Collaborator

@qiluo-msft qiluo-msft left a comment

Approved, with an open question in the comment.

liuh-80 added a commit to sonic-net/sonic-swss that referenced this pull request Jun 6, 2023
**What I did**
Improve orch agent: output heartbeat message to systemd.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
All unit tests pass.
Manually validated that the heartbeat message works correctly.

**Details if related**
Another in-progress PR will add a watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
@qiluo-msft qiluo-msft merged commit 44427a2 into sonic-net:master Jun 6, 2023
yejianquan added a commit to yejianquan/sonic-buildimage that referenced this pull request Jun 8, 2023
wangxin pushed a commit that referenced this pull request Jun 9, 2023
…n swss have issue. (#14686)" (#15390)

This reverts commit 44427a2.
The Docker image was not updated during PR validation, which caused PR check failures.
Force-merge this revert. Once the cache is updated after this PR is merged, the issue should be fixed.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 20, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#14686)
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…n swss have issue. (sonic-net#14686)" (sonic-net#15390)
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025