Skip to content

Wait for ASIC_DB confirmation after disabling LAG members#22849

Merged
yxieca merged 6 commits intosonic-net:masterfrom
yxieca:fix/lag-member-await-config
Apr 3, 2026
Merged

Wait for ASIC_DB confirmation after disabling LAG members#22849
yxieca merged 6 commits intosonic-net:masterfrom
yxieca:fix/lag-member-await-config

Conversation

@yxieca
Copy link
Copy Markdown
Collaborator

@yxieca yxieca commented Mar 10, 2026

Description of PR

Summary:
After swssconfig disables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race causes intermittent test failures in test_lag_member_forwarding_packets because traffic verification runs before hardware actually disables the LAG members (~280ms gap observed).

Added a wait_until poll on ASIC_DB to confirm all LAG members have SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=true before proceeding with traffic tests.

Fixes #17095

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

swssconfig returns immediately after writing to APP_DB, but orchagent has not yet processed the request and applied it via syncd to ASIC_DB/hardware. The test sends traffic immediately after swssconfig returns, hitting a race window where LAG members are still active. This causes test_lag_member_forwarding_packets to fail intermittently on hardware platforms.

How did you do it?

Added a wait_until(10, 0.5, 0, ...) poll after swssconfig that checks ASIC_DB for SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=true on all LAG member entries. This is the definitive signal that orchagent→syncd has fully applied the disable to hardware. The 0.5s poll interval with 10s timeout is generous — real-world logs show the gap is ~280ms.

How did you verify/test it?

  • Code review of the race condition timing from issue logs
  • flake8 lint pass (max-line-length=120)
  • The fix uses the same wait_until pattern used extensively throughout sonic-mgmt

Any platform specific information?

The VS (virtual switch) platform stores EGRESS_DISABLE in ASIC_DB but does not enforce it in the kernel dataplane — the existing VS skip block runs after this new wait, so VS tests still skip traffic verification as before.

Supported testbed topology if it's a new test case?

N/A (bug fix)

Documentation

N/A

After swssconfig disables LAG members, the command returns before
orchagent/syncd finishes applying the configuration to ASIC_DB.
This race condition causes subsequent traffic verification to fail
intermittently because packets are still forwarded through the LAG
members that haven't been disabled yet in hardware.

Add a wait_until poll on ASIC_DB to confirm all LAG members have
SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE set to true before proceeding
with traffic tests.

Fixes sonic-net#17095

Signed-off-by: Ying Xie <[email protected]>
@yxieca
Copy link
Copy Markdown
Collaborator Author

yxieca commented Mar 10, 2026

This PR was raised by an AI agent on behalf of Ying Xie.

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in
ASIC_DB, causing the wait_until check to timeout. Move the VS
early-return before the ASIC_DB poll since VS doesn't enforce
LAG member disable in the dataplane anyway.

Signed-off-by: Ying Xie <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca
Copy link
Copy Markdown
Collaborator Author

yxieca commented Mar 10, 2026

Local KVM Test Results (commit 0a01830)

Test: pc/test_lag_member_forwarding.py on KVM T0 (vms-kvm-t0, vlab-01)

1 passed in 239.26s (0:03:59)

Fix in this commit: Moved VS early-return before the ASIC_DB wait — VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in ASIC_DB, so the wait_until check was timing out. On real hardware, the ASIC_DB poll confirms config is applied before traffic verification.

The previous check collected all LAG members across all LAGs and
verified EGRESS_DISABLE=true on all of them. This would always fail
when other LAGs exist that were not disabled.

Fix: look up the SAI OID for the specific PortChannel under test via
COUNTERS_LAG_NAME_MAP, then filter LAG_MEMBER entries to only those
belonging to that LAG.

Also move the ASIC_DB check before the VS skip so it runs on VS too
(VS SAI does populate ASIC_DB, it just doesn't enforce disable in
the dataplane).

Signed-off-by: Ying Xie <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

The fallback checked all LAG members across all LAGs, which is the
bug we are fixing. Fail explicitly instead.

Signed-off-by: Ying Xie <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Instead of blindly picking the first two PortChannels, iterate and
find a pair where all BGP neighbors (v4 and v6) are established.
Skip the test if two suitable PortChannels cannot be found.

This avoids failures on testbeds where some PortChannels have
permanently idle IPv6 sessions.

Signed-off-by: Ying Xie <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

COUNTERS_LAG_NAME_MAP may not have entries for all PortChannels on
converged-peer testbeds. Instead, look up the SAI OIDs of the member
ports via COUNTERS_PORT_NAME_MAP and match them against
SAI_LAG_MEMBER_ATTR_PORT_ID in ASIC_DB. This works regardless of
whether the LAG itself has a counter entry.

Signed-off-by: Ying Xie <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca-admin
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca
Copy link
Copy Markdown
Collaborator Author

yxieca commented Apr 1, 2026

High-Level Summary of Changes

Problem (issue #17095):
test_lag_member_forwarding_packets fails intermittently on hardware. Root cause: swssconfig returns immediately after writing to APP_DB, but orchagent/syncd has not yet disabled the LAG members in ASIC_DB (~280ms gap). The test sends traffic during that race window, so disabled LAG members are still forwarding.

Evolution across 6 commits:

  1. Initial fix — Added a wait_until poll on ASIC_DB checking SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=true before sending traffic

  2. Move VS skip earlier — VS SAI does not populate the EGRESS_DISABLE attribute, so moved the VS platform skip before the ASIC_DB wait (otherwise it would timeout on VS)

  3. Scope the check to the tested LAG only — Original check was looking at ALL LAG members in ASIC_DB. Fixed to only verify members belonging to the specific PortChannel under test

  4. Remove broken fallback — If the LAG OID was not found, there was a silent fallback that would skip the check. Changed to fail explicitly instead of hiding the problem

  5. Select PortChannels with established BGP — Pick test candidates that actually have all BGP neighbors up, avoiding flaky selection of half-up LAGs

  6. Use port OIDs for member lookup — Instead of querying by LAG OID (which required an extra lookup step), directly match ASIC_DB LAG member entries using the port OIDs. Simpler and more reliable.

TL;DR: Started with a straightforward ASIC_DB wait, then iteratively tightened the scope (right LAG, right members), hardened error handling (no silent fallback), improved test robustness (only pick healthy LAGs), and simplified the ASIC_DB query path.

@lolyu lolyu self-requested a review April 2, 2026 01:16
@nhe-NV
Copy link
Copy Markdown
Contributor

nhe-NV commented Apr 2, 2026

@yxieca the issue also happen on the 202511 branch, should it also include in the 202511 branch, right?

Copy link
Copy Markdown
Collaborator

@lolyu lolyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved

ASIC_DB wait after swssconfig is the right fix for the race condition. PortChannel selection with BGP state check is a good improvement. LGTM 🚀

@yxieca yxieca merged commit c216233 into sonic-net:master Apr 3, 2026
16 checks passed
@yxieca yxieca deleted the fix/lag-member-await-config branch April 3, 2026 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Test Gap][LAG][t0-64] Applying lag members configuration is not awaited

5 participants