Wait for ASIC_DB confirmation after disabling LAG members#22849
Wait for ASIC_DB confirmation after disabling LAG members#22849yxieca merged 6 commits intosonic-net:masterfrom
Conversation
After swssconfig disables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race condition causes subsequent traffic verification to fail intermittently because packets are still forwarded through the LAG members that haven't been disabled yet in hardware. Add a wait_until poll on ASIC_DB to confirm all LAG members have SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE set to true before proceeding with traffic tests. Fixes sonic-net#17095 Signed-off-by: Ying Xie <[email protected]>
|
This PR was raised by an AI agent on behalf of Ying Xie. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in ASIC_DB, causing the wait_until check to timeout. Move the VS early-return before the ASIC_DB poll since VS doesn't enforce LAG member disable in the dataplane anyway. Signed-off-by: Ying Xie <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Local KVM Test Results (commit 0a01830)Test: Fix in this commit: Moved VS early-return before the ASIC_DB wait — VS SAI does not populate |
The previous check collected all LAG members across all LAGs and verified EGRESS_DISABLE=true on all of them. This would always fail when other LAGs exist that were not disabled. Fix: look up the SAI OID for the specific PortChannel under test via COUNTERS_LAG_NAME_MAP, then filter LAG_MEMBER entries to only those belonging to that LAG. Also move the ASIC_DB check before the VS skip so it runs on VS too (VS SAI does populate ASIC_DB, it just doesn't enforce disable in the dataplane). Signed-off-by: Ying Xie <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The fallback checked all LAG members across all LAGs, which is the bug we are fixing. Fail explicitly instead. Signed-off-by: Ying Xie <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Instead of blindly picking the first two PortChannels, iterate and find a pair where all BGP neighbors (v4 and v6) are established. Skip the test if two suitable PortChannels cannot be found. This avoids failures on testbeds where some PortChannels have permanently idle IPv6 sessions. Signed-off-by: Ying Xie <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
COUNTERS_LAG_NAME_MAP may not have entries for all PortChannels on converged-peer testbeds. Instead, look up the SAI OIDs of the member ports via COUNTERS_PORT_NAME_MAP and match them against SAI_LAG_MEMBER_ATTR_PORT_ID in ASIC_DB. This works regardless of whether the LAG itself has a counter entry. Signed-off-by: Ying Xie <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
High-Level Summary of ChangesProblem (issue #17095): Evolution across 6 commits:
TL;DR: Started with a straightforward ASIC_DB wait, then iteratively tightened the scope (right LAG, right members), hardened error handling (no silent fallback), improved test robustness (only pick healthy LAGs), and simplified the ASIC_DB query path. |
|
@yxieca the issue also happen on the 202511 branch, should it also include in the 202511 branch, right? |
lolyu
left a comment
There was a problem hiding this comment.
✅ Approved
ASIC_DB wait after swssconfig is the right fix for the race condition. PortChannel selection with BGP state check is a good improvement. LGTM 🚀
Description of PR
Summary:
After
swssconfigdisables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race causes intermittent test failures intest_lag_member_forwarding_packetsbecause traffic verification runs before hardware actually disables the LAG members (~280ms gap observed).Added a
wait_untilpoll on ASIC_DB to confirm all LAG members haveSAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=truebefore proceeding with traffic tests.Fixes #17095
Type of change
Back port request
Approach
What is the motivation for this PR?
swssconfigreturns immediately after writing to APP_DB, but orchagent has not yet processed the request and applied it via syncd to ASIC_DB/hardware. The test sends traffic immediately afterswssconfigreturns, hitting a race window where LAG members are still active. This causestest_lag_member_forwarding_packetsto fail intermittently on hardware platforms.How did you do it?
Added a
wait_until(10, 0.5, 0, ...)poll afterswssconfigthat checks ASIC_DB forSAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=trueon all LAG member entries. This is the definitive signal that orchagent→syncd has fully applied the disable to hardware. The 0.5s poll interval with 10s timeout is generous — real-world logs show the gap is ~280ms.How did you verify/test it?
wait_untilpattern used extensively throughout sonic-mgmtAny platform specific information?
The VS (virtual switch) platform stores EGRESS_DISABLE in ASIC_DB but does not enforce it in the kernel dataplane — the existing VS skip block runs after this new wait, so VS tests still skip traffic verification as before.
Supported testbed topology if it's a new test case?
N/A (bug fix)
Documentation
N/A