fix(agw): Fixed SCTP abort issue by setting finite timeout in sctp_sendmsg#13146

Merged
ssanadhya merged 2 commits into magma:master from bhuvaneshne:perf-sctp-abort-fix-bhuvaneshne
Aug 17, 2022

Conversation

@bhuvaneshne
Contributor

Signed-off-by: bhuvaneshne [email protected]

Summary

See the bug description (#13115) for the problem details, root-cause analysis, and the proposed fix.

Test Plan

Reproduce the issue using the bundled simulator (see #13115)
Replace the sctpd executable
Restart the SCTP daemon (systemctl restart sctpd)
Wait for sctpd to become ready (check syslog)
Rerun the simulator and verify that the issue no longer reproduces

Additional Information

  • This change is backwards-breaking

@bhuvaneshne bhuvaneshne requested review from a team and Kaleem-Wavelabs June 30, 2022 10:48
@pull-request-size pull-request-size bot added the size/XS Denotes a PR that changes 0-9 lines. label Jun 30, 2022
@github-actions
Contributor

Thanks for opening a PR! 💯

A couple of initial guidelines

Howto

  • Reviews. The "Reviewers" listed for this PR are the Magma maintainers who will shepherd it.
  • Checks. All required CI checks must pass before merge.
  • Merge. Once approved and passing CI checks, use the ready2merge label to indicate the maintainers can merge your PR.

More info

Please take a moment to read through the Magma project's

If this is your first Magma PR, also consider reading

@github-actions github-actions bot added the component: agw Access gateway-related issue label Jun 30, 2022
@bhuvaneshne bhuvaneshne changed the title fix(agw): Fixed SCTP abort issue by setting finite timeout in sctp_sendmsg. See: https://github.com/magma/magma/issues/13115 fix(agw): Fixed SCTP abort issue by setting finite timeout in sctp_sendmsg Jun 30, 2022
@github-actions
Contributor

github-actions bot commented Jun 30, 2022

feg-workflow

2 files, 203 suites — 40s ⏱️
374 tests: 374 ✔️, 0 💤, 0 ❌
388 runs: 388 ✔️, 0 💤, 0 ❌

Results for commit bdc312c.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Jun 30, 2022

dp-workflow

13 tests: 13 ✔️, 0 💤, 0 ❌ — 1 suite, 1 file, 2m 14s ⏱️

Results for commit bdc312c.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Jun 30, 2022

agw-workflow

615 tests: 611 ✔️, 4 💤, 0 ❌ — 2 suites, 2 files, 3m 49s ⏱️

Results for commit bdc312c.

♻️ This comment has been updated with latest results.

Contributor

@ssanadhya ssanadhya left a comment


LGTM, with a minor comment. Please address before merging.

  auto n = msg.size();
  auto rc = sctp_sendmsg(assoc.sd, buf, n, NULL, 0, htonl(assoc.ppid), 0,
-                        stream, 0, 0);
+                        stream, 100, 0);
Contributor


Please add a comment stating that the 100 indicates a timetolive value of 100 ms.

@ssanadhya
Contributor

@bhuvaneshne , please run LTE integ tests on this change and update the test plan.

@bhuvaneshne bhuvaneshne requested review from a team July 1, 2022 07:15
@bhuvaneshne bhuvaneshne requested review from a team as code owners July 1, 2022 07:15
@pull-request-size pull-request-size bot added size/L Denotes a Pull Request that changes 100-499 lines. and removed size/XS Denotes a PR that changes 0-9 lines. labels Jul 1, 2022
@github-actions github-actions bot added component: ci All updates on CI (Jenkins/CircleCi/Github Action) component: docs Documentation-related issue component: orc8r Orchestrator-related issue labels Jul 1, 2022
@Neudrino
Contributor

Neudrino commented Jul 1, 2022

This needs to be rebased on master, as there are foreign commits here that should not be part of this PR.

Contributor

@Neudrino Neudrino left a comment


I am removing all the reviewers, as they are only on here due to bad commits in the branch!

@Neudrino Neudrino removed the request for review from ulaskozat July 1, 2022 09:43
@VinashakAnkitAman
Member

VinashakAnkitAman commented Jul 12, 2022

@bhuvaneshne , this is the first test with traffic testing, so a few things to check:

  1. Is the traffic server running okay in your local setup?
  2. Can you run this test individually, instead of as part of the sanity suite? You can do it with:

make integ_test TESTS=s1aptests/test_attach_ul_udp_data.py

@VinashakAnkitAman do we need to reload the VMs if one of the traffic tests fail?

  1. Can you run this test on the master commit, which is the parent of your feature branch?

I am currently experiencing #13245 , hence cannot validate in my local setup.

Hi @ssanadhya All cleanups happen when a test case fails, including cleaning the interfaces and stopping any unclosed iperf instances. If any test case fails, including a data test case, we need not reload the VMs.

Hi @bhuvaneshne Although the route should not be an issue here for uplink data test cases, can you please add the route in the TRF Server VM with the command below before running the test case?
sudo ip route flush via 192.168.129.1 && sudo ip route add 192.168.128.0/24 via 192.168.129.1 dev eth2

Now, if you still have issues with the local setup: since the branch has already been pushed and the PR raised, there is one more quick way to run the integ test directly via GitHub Actions:

  • Go to your forked repo https://github.com/<github_username>/magma/actions/workflows/lte-integ-test.yml
  • Click Run workflow button on the right side and select the correct branch, then click the Run workflow button.
  • Then follow the latest execution report below

@bhuvaneshne bhuvaneshne force-pushed the perf-sctp-abort-fix-bhuvaneshne branch from 374d118 to 2109ac4 Compare July 18, 2022 13:28
@bhuvaneshne
Contributor Author

Thanks @VinashakAnkitAman , I rebased and reran the test in a new environment. It failed again, but in a different spot. Let me try the "Run workflow" approach.

Deleting all the uplink/downlink flows
Keys left in Redis (list should be empty)[
 IMSI001010000000001:SPGW
IMSI001010000000001.magma.ipv4,ipv4:mobilityd_ipdesc_record
IMSI001010000000001:MME

nb_ue_attached: 1 
]
Entries in s1ap_imsi_map (should be zero): 1
Entries left in hashtables (should be zero): 3
Entries in mme_ueip_imsi_map (should be zero): 1
The test has failed. Restarting Sctpd for cleanup
Waiting for 30 seconds for restart to complete
Waiting for 29 seconds for restart to complete
Waiting for 28 seconds for restart to complete
Waiting for 27 seconds for restart to complete
Waiting for 26 seconds for restart to complete
Waiting for 25 seconds for restart to complete
Waiting for 24 seconds for restart to complete
Waiting for 23 seconds for restart to complete
Waiting for 22 seconds for restart to complete
Waiting for 21 seconds for restart to complete
Waiting for 20 seconds for restart to complete
Waiting for 19 seconds for restart to complete
Waiting for 18 seconds for restart to complete
Waiting for 17 seconds for restart to complete
Waiting for 16 seconds for restart to complete
Waiting for 15 seconds for restart to complete
Waiting for 14 seconds for restart to complete
Waiting for 13 seconds for restart to complete
Waiting for 12 seconds for restart to complete
Waiting for 11 seconds for restart to complete
Waiting for 10 seconds for restart to complete
Waiting for 9 seconds for restart to complete
Waiting for 8 seconds for restart to complete
Waiting for 7 seconds for restart to complete
Waiting for 6 seconds for restart to complete
Waiting for 5 seconds for restart to complete
Waiting for 4 seconds for restart to complete
Waiting for 3 seconds for restart to complete
Waiting for 2 seconds for restart to complete
Waiting for 1 seconds for restart to complete
Keys left in Redis (list should be empty)[
  
]
Entries in s1ap_imsi_map (should be zero): 0
Entries left in hashtables (should be zero): 0
Entries in mme_ueip_imsi_map (should be zero): 0
-------------- generated xml file: /var/tmp/test_results/test_tau_periodic_active.xml ---------------
====================================== short test summary info ======================================
FAILED s1aptests/test_tau_periodic_active.py::TestTauPeriodicActive::test_tau_periodic_active - As...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=================================== 1 failed in 117.67s (0:01:57) ===================================
make: *** [Makefile:51: integ_test] Error 1

@bhuvaneshne bhuvaneshne force-pushed the perf-sctp-abort-fix-bhuvaneshne branch from 2109ac4 to 8f1ea21 Compare July 19, 2022 11:27
@VinashakAnkitAman
Member

VinashakAnkitAman commented Aug 3, 2022

@bhuvaneshne By any chance you got the integ test working over this PR?


@bhuvaneshne
Contributor Author

@bhuvaneshne By any chance you got the integ test working over this PR?

Hi @VinashakAnkitAman : the test reported a failure. I have not retried yet; I will give it another shot and update.

@ssanadhya
Contributor

@bhuvaneshne , it will be great to have this fix in the release 1.8 (which was branched out last week). If you could please confirm that LTE integration tests are passing on this PR, we can merge and backport this change to the official release.

@VinashakAnkitAman , can you also test the changes of this PR in your local setup?

@VinashakAnkitAman
Member


Sure @ssanadhya , I am on leave on Tuesday for some sudden personal work; I will re-verify in my local setup on Wednesday morning.

@bhuvaneshne
Contributor Author


Hi @ssanadhya , below are the test cases that fail even after retries:
(screenshot of failing test cases)

I can rebase and give it one more try

@bhuvaneshne bhuvaneshne force-pushed the perf-sctp-abort-fix-bhuvaneshne branch from 8f1ea21 to e6c6103 Compare August 16, 2022 04:24
@ssanadhya
Contributor

@Neudrino , could you please review this revised PR? It is awaiting your approval since you requested changes.

@bhuvaneshne
Contributor Author

@ssanadhya , finally got a clean run (had to retry a couple of test cases). Attaching the test results: test_results.zip

@bhuvaneshne bhuvaneshne force-pushed the perf-sctp-abort-fix-bhuvaneshne branch from e6c6103 to bdc312c Compare August 17, 2022 02:02
@ssanadhya ssanadhya enabled auto-merge (squash) August 17, 2022 04:56
@ssanadhya ssanadhya disabled auto-merge August 17, 2022 04:57
@ssanadhya
Contributor

Thanks for confirming @bhuvaneshne ! I will merge this once CI checks are complete.


@Neudrino Neudrino dismissed their stale review August 17, 2022 07:14

The change request is no longer required, as the appropriate parties reviewed.

@Neudrino Neudrino removed their request for review August 17, 2022 07:14
@ssanadhya ssanadhya merged commit 6506c85 into magma:master Aug 17, 2022
MagmaCIBot pushed a commit that referenced this pull request Aug 17, 2022
…ndmsg (#13146)

* fix(agw): SCTP Abort on new connections from ENB. MME stops processing resolved.  #13115

Signed-off-by: bhuvaneshne <[email protected]>
(cherry picked from commit 6506c85)
@MagmaCIBot

💚 All backports created successfully

Branch v1.8 — backport created successfully

Questions?

Please refer to the Backport tool documentation and see the GitHub Action logs for details.

maxhbr pushed a commit that referenced this pull request Aug 17, 2022
…ndmsg (#13146) (#13644)

* fix(agw): SCTP Abort on new connections from ENB. MME stops processing resolved.  #13115

Signed-off-by: bhuvaneshne <[email protected]>
(cherry picked from commit 6506c85)

Co-authored-by: Bhuvanesh <[email protected]>
@nstng
Contributor

nstng commented Aug 24, 2022

Already in 1.8; setting the respective label.


Labels

apply-v1.8 backported-v1.8 candidate-for-backporting component: agw Access gateway-related issue size/XS Denotes a PR that changes 0-9 lines.
