Skip to content

fix(agw): Fixing monitord "Too many opened files" OSError#8810

Merged
ardzoht merged 1 commit intomagma:masterfrom
ardzoht:fix_monitord_os_error
Aug 27, 2021
Merged

fix(agw): Fixing monitord "Too many opened files" OSError#8810
ardzoht merged 1 commit intomagma:masterfrom
ardzoht:fix_monitord_os_error

Conversation

@ardzoht
Copy link
Copy Markdown
Contributor

@ardzoht ardzoht commented Aug 26, 2021

Signed-off-by: Alex Rodriguez [email protected]

Summary

  • Current implementation of asynchronous ICMP pinging job of the service is too aggressive, it opens a bunch of subprocess calls (bounded by the total number of subscribers attached) and sends them all at once, defeating the purpose of the asynchronous job.
  • It also contains a bug introduced on [CWF][AGW] Adding manual ping targets to monitord  #3649 refactor which holds inactive subscribers IPs and sends ICMP pings without ever releasing them.
  • This PR introduces chunking of ping targets with sleep interval to allow the service to not overload opened subprocess tasks, and also releases inactive subscribers as targets to ping.
  • Running attach/detach test running with 600 UEs quickly reproduces the issue:
Aug 26 17:38:28 phy-u6 monitord[1303578]: Traceback (most recent call last):
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/local/lib/python3.8/dist-packages/magma/common/job.py", line 121, in _periodic
Aug 26 17:38:28 phy-u6 monitord[1303578]:     await self._run()
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/local/lib/python3.8/dist-packages/magma/monitord/icmp_monitoring.py", line 96, in _run
Aug 26 17:38:28 phy-u6 monitord[1303578]:     await self._ping_targets(addresses, targets)
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/local/lib/python3.8/dist-packages/magma/monitord/icmp_monitoring.py", line 69, in _ping_targets
Aug 26 17:38:28 phy-u6 monitord[1303578]:     ping_results = await ping_interface_async(ping_params, self._loop)
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/coroutines.py", line 127, in coro
Aug 26 17:38:28 phy-u6 monitord[1303578]:     res = yield from res
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/local/lib/python3.8/dist-packages/magma/magmad/check/subprocess_workflow.py", line 81, in exec_and_parse_subprocesses_async
Aug 26 17:38:28 phy-u6 monitord[1303578]:     subprocs = yield from asyncio.gather(*futures)
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/subprocess.py", line 236, in create_subprocess_exec
Aug 26 17:38:28 phy-u6 monitord[1303578]:     transport, protocol = await loop.subprocess_exec(
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/base_events.py", line 1630, in subprocess_exec
Aug 26 17:38:28 phy-u6 monitord[1303578]:     transport = await self._make_subprocess_transport(
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/unix_events.py", line 197, in _make_subprocess_transport
Aug 26 17:38:28 phy-u6 monitord[1303578]:     transp = _UnixSubprocessTransport(self, protocol, args, shell,
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/base_subprocess.py", line 36, in __init__
Aug 26 17:38:28 phy-u6 monitord[1303578]:     self._start(args=args, shell=shell, stdin=stdin, stdout=stdout,
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/asyncio/unix_events.py", line 789, in _start
Aug 26 17:38:28 phy-u6 monitord[1303578]:     self._proc = subprocess.Popen(
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
Aug 26 17:38:28 phy-u6 monitord[1303578]:     self._execute_child(args, executable, preexec_fn, close_fds,
Aug 26 17:38:28 phy-u6 monitord[1303578]:   File "/usr/lib/python3.8/subprocess.py", line 1605, in _execute_child
Aug 26 17:38:28 phy-u6 monitord[1303578]:     errpipe_read, errpipe_write = os.pipe()
Aug 26 17:38:28 phy-u6 monitord[1303578]: OSError: [Errno 24] Too many open files

Test Plan

  • Running same active/detach test on spirent issue does not appear:
magma@phy-u6:~$ sudo systemctl status magma@monitord
● [email protected] - Magma monitord service
     Loaded: loaded (/etc/systemd/system/[email protected]; disabled; vendor preset: enabled)
     Active: active (running) since Thu 2021-08-26 21:39:32 UTC; 56min ago
   Main PID: 1557830 (python3)
      Tasks: 6 (limit: 9339)
     Memory: 47.7M (limit: 300.0M)
     CGroup: /system.slice/system-magma.slice/[email protected]
             └─1557830 python3 -m magma.monitord.main

Aug 26 22:36:13 phy-u6 monitord[1557830]: INFO:root:Pinging 400:500 hosts
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560000:192.168.135.166 => 1.237ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560001:192.168.134.25 => 2.581ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560002:192.168.129.118 => 1.157ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560003:192.168.133.31 => 0.938ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560004:192.168.134.67 => 0.926ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560005:192.168.134.175 => 1.089ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560006:192.168.133.59 => 0.864ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560007:192.168.128.104 => 1.312ms
Aug 26 22:36:14 phy-u6 monitord[1557830]: INFO:root:001011234560008:192.168.134.8 => 1.545ms

Additional Information

  • This change is backwards-breaking

@ardzoht ardzoht requested review from a team and pshelar August 26, 2021 21:55
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines. label Aug 26, 2021
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for opening a PR! 💯

A couple initial guidelines

Howto

  • Reviews. The "Reviewers" listed for this PR are the Magma maintainers who will shepherd it.
  • Checks. All required CI checks must pass before merge.
  • Merge. Once approved and passing CI checks, use the ready2merge label to indicate the maintainers can merge your PR.

More info

Please take a moment to read through the Magma project's

If this is your first Magma PR, also consider reading

@github-actions github-actions bot added the component: agw Access gateway-related issue label Aug 26, 2021
@github-actions
Copy link
Copy Markdown
Contributor

@ardzoht ardzoht linked an issue Aug 26, 2021 that may be closed by this pull request
@ardzoht ardzoht added apply-v1.5 Apply this commit to the v1.5 release branch as well. apply-v1.6 apply-v1.7 labels Aug 26, 2021
@ardzoht ardzoht force-pushed the fix_monitord_os_error branch from 2bf9cc8 to 02a92b2 Compare August 26, 2021 22:00
@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

feg-workflow

    2 files  145 suites   29s ⏱️
338 tests 338 ✔️ 0 💤 0 ❌
352 runs  352 ✔️ 0 💤 0 ❌

Results for commit 02a92b2.

@github-actions
Copy link
Copy Markdown
Contributor

agw-workflow

    2 files      2 suites   2m 42s ⏱️
464 tests 455 ✔️ 9 💤 0 ❌
465 runs  456 ✔️ 9 💤 0 ❌

Results for commit 02a92b2.

@pshelar
Copy link
Copy Markdown
Contributor

pshelar commented Aug 26, 2021

have you considered using python native ping lib to avoid sub-process completely ?
something like: https://github.com/alessandromaggio/pythonping

@ardzoht
Copy link
Copy Markdown
Contributor Author

ardzoht commented Aug 26, 2021

https://github.com/alessandromaggio/pythonping

I looked at alternatives, the constraint here is the fact that we're using mtr0 to send the ICMP pings, and not sure if pythonping allows to specify the interface in this case.

@pshelar
Copy link
Copy Markdown
Contributor

pshelar commented Aug 27, 2021

https://github.com/alessandromaggio/pythonping

I looked at alternatives, the constraint here is the fact that we're using mtr0 to send the ICMP pings, and not sure if pythonping allows to specify the interface in this case.

yes, it does not looks like it has the capability.

Ping3 has the interface parameter: https://github.com/kyan001/ping3

Copy link
Copy Markdown
Contributor

@pshelar pshelar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ardzoht ardzoht merged commit adec9ff into magma:master Aug 27, 2021
ardzoht added a commit to ardzoht/magma that referenced this pull request Aug 27, 2021
@markjen markjen removed the apply-v1.5 Apply this commit to the v1.5 release branch as well. label Aug 27, 2021
@markjen
Copy link
Copy Markdown
Contributor

markjen commented Aug 27, 2021

this PR does not merge cleanly into v1.5, please open a manual merge PR

markjen pushed a commit that referenced this pull request Aug 27, 2021
markjen pushed a commit that referenced this pull request Aug 27, 2021
ulaskozat pushed a commit that referenced this pull request Aug 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backported-v1.5 backported-v1.6 backported-v1.7 component: agw Access gateway-related issue size/M Denotes a PR that changes 30-99 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

monitord crashed on Too many open files descriptor

3 participants