[loganalyzer] Fix end marker lost during rsyslog reload (#23562) by yxieca · Pull Request #23591 · sonic-net/sonic-mgmt

yxieca · 2026-04-03T00:06:07Z

Description of PR

Summary

The place_marker_to_syslog() method called flush_rsyslogd() (systemctl reload rsyslog) immediately after writing the marker to syslog. The reload briefly closes and reopens log files, which can race with the marker message still sitting in rsyslog's internal buffer, causing the marker to be silently dropped.

Root cause: PR #22447 replaced kill -HUP with systemctl reload rsyslog to avoid syslog message loss during the previous flush. However, calling the reload right after writing a new marker creates a new race — the just-written marker can be dropped during the reload's file close/reopen cycle.

Fix: Flush rsyslog before writing the marker (to drain old buffers), then after writing the marker, use a simple time.sleep(2) instead of a reload to let rsyslog write the marker to disk naturally.

Type of change

Bug fix

Back port request

N/A

Approach

What is the motivation for this PR?

LogAnalyzer's add_end_marker action fails intermittently with RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog, causing test failures across all platforms.
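For context, this error comes from a lookup that scans the syslog file for the marker string and gives up when the marker is absent. The following is a minimal sketch of that kind of check. It assumes the marker is a plain substring of one log line, and `find_marker` is a hypothetical name, not LogAnalyzer's actual code:

```python
from pathlib import Path


def find_marker(log_path: str, marker: str) -> int:
    """Return the 0-based index of the last line containing `marker`.

    Hypothetical helper for illustration only. It raises RuntimeError with a
    message shaped like the one LogAnalyzer reports when the marker is missing.
    """
    lines = Path(log_path).read_text(errors="replace").splitlines()
    # Scan from the end, because the end marker should be among the newest lines.
    for index in range(len(lines) - 1, -1, -1):
        if marker in lines[index]:
            return index
    raise RuntimeError(f"cannot find marker {marker} in {log_path}")
```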

How did you do it?

Moved the flush_rsyslogd() call to before writing the syslog marker and replaced the post-write flush with a 2-second sleep. This ensures:

1. Pre-existing buffered messages are flushed before the marker is written
2. The marker itself is not disrupted by a concurrent rsyslog reload
3. rsyslog has time to naturally write the marker to disk
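The reordered sequence can be sketched as follows. This is not the merged diff. The names `place_marker`, `write_to_syslog`, and `settle_seconds` are hypothetical. The flush and write steps are passed in as callables, standing in for `flush_rsyslogd()` and the real marker writer, so that the ordering can be checked without a live rsyslog:

```python
import time
from typing import Callable


def place_marker(
    marker: str,
    flush: Callable[[], None],
    write_to_syslog: Callable[[str], None],
    settle_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Sketch of the reordered marker placement described in this PR.

    `flush` stands in for flush_rsyslogd() (systemctl reload rsyslog), and
    `write_to_syslog` stands in for whatever emits the marker line (hypothetical).
    """
    # 1. Flush first so that older buffered messages reach disk before the marker.
    flush()
    # 2. Emit the marker. No reload follows, so the marker cannot be caught
    #    in a reload's file close/reopen cycle.
    write_to_syslog(marker)
    # 3. Give rsyslog time to write the marker out on its own.
    sleep(settle_seconds)
```

On a device, `flush` might be something like `lambda: subprocess.run(["systemctl", "reload", "rsyslog"], check=True)`. How sonic-mgmt actually invokes the reload is not shown on this PR page.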

How did you verify/test it?

- Code review of the race condition timeline
- The fix eliminates the problematic pattern (write then immediate reload) identified in the issue

Note

This PR was generated with the assistance of an AI agent.

yxieca · 2026-04-03T00:06:13Z

Note

This PR was generated with the assistance of an AI agent.

mssonicbld · 2026-04-03T00:06:16Z

/azp run

azure-pipelines · 2026-04-03T00:06:29Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-04-03T15:27:51Z

/azp run

azure-pipelines · 2026-04-03T15:28:04Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca · 2026-04-03T15:28:56Z

@wangxin I asked the agent to replace the sleep with a wait-until loop.
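The comment above asks for a wait-until loop in place of the fixed sleep, but the resulting code is not shown on this page. The following is a minimal sketch of the idea. It assumes completion is detected by finding the marker text in the log file. The names `wait_for_marker`, `timeout`, and `interval` are hypothetical. sonic-mgmt's test utilities include their own `wait_until` helper, and this sketch does not reproduce its signature:

```python
import time
from pathlib import Path


def wait_for_marker(log_path: str, marker: str,
                    timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll `log_path` until a line containing `marker` appears.

    Hypothetical replacement for a fixed sleep. Returns True as soon as the
    marker is found, or False if `timeout` seconds pass without it.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if marker in Path(log_path).read_text(errors="replace"):
                return True
        except FileNotFoundError:
            pass  # treat a missing file as "marker not written yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Unlike a fixed `sleep(2)`, this returns as soon as the marker is visible. It waits the full timeout only when the marker never arrives.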

yxieca · 2026-04-03T23:14:31Z

/azp run

azure-pipelines · 2026-04-03T23:14:37Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

yxieca · 2026-04-04T03:12:53Z

/azp run

azure-pipelines · 2026-04-04T03:13:00Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

The place_marker_to_syslog method called flush_rsyslogd() (systemctl reload rsyslog) immediately after writing the marker to syslog. The reload briefly closes and reopens log files, which can race with the marker message still sitting in rsyslog's internal buffer, causing the marker to be silently dropped. Fix: flush rsyslog *before* writing the marker (to drain old buffers), then after writing the marker, use a simple sleep(2) instead of a reload to let rsyslog write the marker to disk naturally.

Fixes sonic-net#23562

Signed-off-by: Ying Xie <[email protected]>

mssonicbld · 2026-04-04T04:01:54Z

/azp run

azure-pipelines · 2026-04-04T04:02:07Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca · 2026-04-04T15:06:29Z

/azp run

azure-pipelines · 2026-04-04T15:06:42Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca · 2026-04-07T06:05:58Z

AI agent on behalf of Ying.

- Unable to verify unresolved human review comments.


) (sonic-net#23591) ### Description of PR Fixes sonic-net#23562 #### Summary The `place_marker_to_syslog()` method called `flush_rsyslogd()` (`systemctl reload rsyslog`) immediately after writing the marker to syslog. The reload briefly closes and reopens log files, which can race with the marker message still sitting in rsyslog's internal buffer, causing the marker to be silently dropped. **Root cause:** PR sonic-net#22447 replaced `kill -HUP` with `systemctl reload rsyslog` to avoid syslog message loss during the *previous* flush. However, calling the reload *right after* writing a new marker creates a new race — the just-written marker can be dropped during the reload's file close/reopen cycle. **Fix:** Flush rsyslog *before* writing the marker (to drain old buffers), then after writing the marker, use a simple `time.sleep(2)` instead of a reload to let rsyslog write the marker to disk naturally. ### Type of change - [x] Bug fix ### Back port request - [ ] N/A ### Approach #### What is the motivation for this PR? LogAnalyzer's `add_end_marker` action fails intermittently with `RuntimeError: cannot find marker end-LogAnalyzer-... in /var/log/syslog`, causing test failures across all platforms. #### How did you do it? Moved the `flush_rsyslogd()` call to *before* writing the syslog marker and replaced the post-write flush with a 2-second sleep. This ensures: 1. Pre-existing buffered messages are flushed before the marker is written 2. The marker itself is not disrupted by a concurrent rsyslog reload 3. rsyslog has time to naturally write the marker to disk #### How did you verify/test it? - Code review of the race condition timeline - The fix eliminates the problematic pattern (write then immediate reload) identified in the issue --- > [!NOTE] > This PR was generated with the assistance of an AI agent. Signed-off-by: Ying Xie <[email protected]> Signed-off-by: opcoder0 <[email protected]>

github-actions bot requested review from ZhaohuiS, wangxin and xwjiang-ms April 3, 2026 00:06

wangxin previously approved these changes Apr 3, 2026


yxieca dismissed wangxin’s stale review via e7bbd48 April 3, 2026 15:27

github-actions bot requested a review from wangxin April 3, 2026 15:28

yxieca force-pushed the fix/loganalyzer-end-marker-flush branch from e7bbd48 to 67a9484 April 4, 2026 04:01

wangxin approved these changes Apr 8, 2026


wangxin merged commit 521f48f into sonic-net:master Apr 8, 2026
19 checks passed
