Skip to content

[pfcwd] Fix pfcwd async fake storm#2337

Merged
yxieca merged 2 commits intosonic-net:masterfrom
lolyu:fix_pfcwd_wb_async
Oct 19, 2020
Merged

[pfcwd] Fix pfcwd async fake storm#2337
yxieca merged 2 commits intosonic-net:masterfrom
lolyu:fix_pfcwd_wb_async

Conversation

@lolyu
Copy link
Copy Markdown
Collaborator

@lolyu lolyu commented Oct 13, 2020

Description of PR

Summary:
Fixes # (issue)
Fix pfcwd/test_pfcwd_warm_reboot.py.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

What is the motivation for this PR?

In some platforms and topologies(more ports, more possibility to reproduce), pfcwd/test_pfcwd_warm_reboot.py might fail if the test selects many storming ports in the platform like Arista7260. The reason is that if the test starts multiple fake storming threads, there might be a race condition when one thread tries to fetch storm_stop_defer from the test object's pfc_wd dictionary, which could be modified by those latter ports in their calls of setup_port_params, the thread will raise a KeyError, exit and leave the port in STORM DETECTED status. The test will fail because it cannot find any syslog entry generated because the port is already stormed when it tries to storm the port after warm reboot.

How did you do it?

  1. Passing storm_start_defer and storm_stop_defer directly to the
    thread target function instead of retrieving from the test object's
    pfc_wd dictionary, which might cause race condition if
    setup_port_params of those latter ports clean up pfc_wd and the
    thread tries to fetch storm_stop_defer from pfc_wd.

  2. Introduce thread class InterruptableThread to grant the exception
    awareness to the main thread caller when joins the thread. Use it in the
    thread creation in test_pfcwd_warm_reboot.py to make the test aware of
    any anomalies in storm threads.

  3. Introduce join_all to timeout join a list of threads. Use it to
    wait for all storm threads to finish and alert if any threads don't exit
    as expected.

  4. There is a chance the async storm thread make Ansible calls when the
    DUT SSH service is dead or Redis isn't started. Introducing a
    threading.Event object DUTACTIVE to make the storm thread wait till
    the DUT is up and active after warm reboot.

  5. Make indentation all aligned with four-spaces.

How did you verify/test it?

Test over Arista7260 with t0-116.

pfcwd/test_pfcwd_warm_reboot.py::TestPfcwdWb::test_pfcwd_wb[no_storm] PASSED                                                                                                                                                                                           [ 33%]
pfcwd/test_pfcwd_warm_reboot.py::TestPfcwdWb::test_pfcwd_wb[storm] PASSED                                                                                                                                                                                              [ 66%]
pfcwd/test_pfcwd_warm_reboot.py::TestPfcwdWb::test_pfcwd_wb[async_storm] PASSED                                                                                                                                                                                        [100%]

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@lolyu lolyu changed the title Fix pfcwd wb async [pfcwd] Fix pfcwd async storm Oct 13, 2020
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch from 044b314 to f7e0c03 Compare October 13, 2020 09:55
@sonic-net sonic-net deleted a comment from lgtm-com bot Oct 13, 2020
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch from f7e0c03 to 88d3df6 Compare October 13, 2020 10:31
Make indentation all aligned with four-spaces.

Signed-off-by: Longxiang Lyu <[email protected]>
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch 4 times, most recently from d0c0146 to c82d6ab Compare October 14, 2020 05:16
@lolyu lolyu changed the title [pfcwd] Fix pfcwd async storm [pfcwd] Fix pfcwd async fake storm Oct 14, 2020
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch from c82d6ab to dfac6a7 Compare October 14, 2020 06:03
@lolyu lolyu marked this pull request as ready for review October 14, 2020 06:41
@lolyu lolyu requested a review from a team October 14, 2020 06:42
@sonic-net sonic-net deleted a comment from lgtm-com bot Oct 14, 2020
@sonic-net sonic-net deleted a comment from lgtm-com bot Oct 14, 2020
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch from b32465f to dfac6a7 Compare October 16, 2020 08:32
Comment thread tests/common/utilities.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppress_exception

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks.

Comment thread tests/common/reboot.py Outdated
1. Passing `storm_start_defer` and `storm_stop_defer` directly to the
thread `target` function instead of retrieving from the test object's
`pfc_wd` dictionary, which might cause race condition if
`setup_port_params` of those latter ports clean up `pfc_wd` and the
thread tries to fetch `storm_stop_defer` from `pfc_wd`.

2. Introduce thread class `InterruptableThread` to grant the exception
awareness to the main thread caller when joins the thread. Use it in the
thread creation in `test_pfcwd_warm_reboot.py` to make the test aware of
any anomalies in storm threads.

3. Introduce `join_all` to timeout join a list of threads. Use it to
wait for all storm threads to finish and alert if any threads don't exit
as expected.

4. There is a chance the async storm thread make Ansible calls when the
DUT SSH service is dead or Redis isn't started. Introducing a
threading.Event object `DUTACTIVE` to make the storm thread wait till
the DUT is up and active after warm reboot.

Signed-off-by: Longxiang Lyu <[email protected]>
@lolyu lolyu force-pushed the fix_pfcwd_wb_async branch from dfac6a7 to 8903c5a Compare October 19, 2020 02:06
@yxieca yxieca merged commit a7291d5 into sonic-net:master Oct 19, 2020
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
swss:
* 7841930 2022-07-15 | [vxlan]Fixing L2MC vlan member caching issue (sonic-net#2378) (HEAD -> 202205) [Sudharsan Dhamal Gopalarathnam]
* b8cd435 2022-07-14 | [muxorch] Always use direct link for SoC IPs (sonic-net#2369) [Longxiang Lyu]
* 6158d5c 2022-07-08 | Add BGP profile to Vnet routes (sonic-net#2337) [Prince Sunny]
* bdb7ffd 2022-07-06 | [teammgr]: Waiting MACsec ready before doLagMemberTask (sonic-net#2286) [Ze Gan]

sairedis:
* 58359d4 2022-06-30 | [sairedis] Perform log rotate on request (sonic-net#1058) (HEAD -> 202205, github/202205) [Kamil Cudnik]
* cad0268 2022-07-13 | Enable cisco debug shell by default (sonic-net#1078) [VenkatCisco]

Signed-off-by: Ying Xie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants