tests: Ensure excessive FD limits are avoided #2730
Conversation
Typically on modern distros with systemd, this should equate to 1024 (soft) and 512K (hard) limits. A distro may override the built-in global defaults systemd sets by setting `DefaultLimitNOFILE=` in `/etc/systemd/user.conf` and `/etc/systemd/system.conf`.
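For anyone wanting to verify what their own environment actually applies, a quick check might look like this (a sketch; assumes a systemd host with Docker installed):

```bash
# Soft and hard NOFILE limits of the current shell:
ulimit -Sn
ulimit -Hn

# The limit systemd applies to docker.service (inherited by containers by default):
systemctl show docker.service --property=LimitNOFILE

# Effective limits inside a container:
docker run --rm alpine sh -c 'ulimit -Sn; ulimit -Hn'
```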
- `no_containers.bats` tests the external script `setup.sh` without `-c`; it expects that no DMS container is already running, otherwise it may attempt to use that container and fail. Detect this and fail early via a `setup_file()` step (see the sketch below this list).
- `mail_undef_spam_subject.bats` was missing a container name assignment to go with my change to be more explicit about assigning deterministic names.
- `mail_hostname.bats` had an odd timing failure during teardown due to the last tests bringing the containers down earlier (`docker stop` paired with `docker run --rm`). Adding a moment of delay via `sleep` helps avoid that false-positive scenario.
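A minimal sketch of that `setup_file()` guard (the image name and exact check here are illustrative, not necessarily what the PR implements):

```bash
function setup_file() {
  # `no_containers.bats` runs `setup.sh` without `-c`, which falls back to
  # using a running DMS container if one exists - fail early in that case:
  if [[ -n $(docker ps --quiet --filter 'ancestor=mailserver-testing:ci') ]]; then
    echo 'Detected a running DMS container - stop it before running this test file' >&2
    exit 1
  fi
}
```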
georglauterbach left a comment
LGTM
casperklein left a comment
Two things:
- It might be worth adding a little comment (#2730) or so to the `--ulimit` lines, to save anyone wondering about them from having to research it — e.g. the sketch below this list.
- Reminder: Every time a `sleep` is used to fix something, god kills a kitten 😆
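For example, something along these lines (the variable names are only illustrative):

```bash
# Avoid excessive FD limits inherited from the host (see GH PR #2730):
docker run --detach --name "${CONTAINER_NAME}" \
  --ulimit "nofile=1024:4096" \
  "${IMAGE_NAME}"
```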
With them being only in the tests, I'm not too concerned; by mid 2023 (?) the next Debian release will be ready, and that hopefully has the updated packages. I can add the reference comments if you still think they'd be useful. I would like to look at moving the container setup for tests into …
Apologies, I pushed a fix to the wrong PR 😬 Dropped the commit and force-pushed. You can see that the commit is the same as the last "merge master branch" commit.
Even with …
Absolutely, that's why I try to provide some rather verbose history in some refactoring PRs, to save other maintainers from trawling through it 😆 The most annoying ones are around 2020-2021, due to how large several were at moving everything around. My special trick for that is to take something like the Dockerfile, jump back in time prior to that, then browse the repo from that point in time and find the file to git blame 😅 I do agree that it's not pleasant, and definitely prefer adding contextual doc comments for stuff that is harder to look up / grok 👍
Description
Local tests can fail depending on environment if Docker is configured with an excessive FD limit.
This was affecting tests using `ENABLE_SRS=1` and `ENABLE_FAIL2BAN=1`, due to a common step for daemonization taking considerably longer to complete (8 minutes on Fedora 36), causing the tests to fail due to timeout or an unreachable service.

I have chosen the kernel defaults (1024 soft, 4096 hard), as opposed to systemd's (1024, 512K), which should be fine for tests. The limit is for how many files/streams a process can have open at one time. A process can request to raise the soft limit so long as it stays below the hard limit.
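As a rough illustration of those semantics with the values chosen here (assuming a Docker host with the `alpine` image available):

```bash
docker run --rm --ulimit "nofile=1024:4096" alpine sh -c '
  ulimit -Sn       # 1024 - the soft limit applied to the process
  ulimit -Hn       # 4096 - the hard limit (ceiling for the soft limit)
  ulimit -Sn 4096  # raising the soft limit up to the hard limit succeeds
  ulimit -Sn       # now 4096
  ulimit -Sn 8192  # exceeding the hard limit is denied
'
```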
I avoided documentation as I don't expect this to be a common issue users would face in real deployments. If an issue is raised about high CPU usage or similar odd activity, it may be due to this, and if so would warrant documenting for users. Otherwise I only expect it to affect contributors running tests locally.
Additional Details
Processes that run as daemons (`postsrsd` and `fail2ban-server`) initialize by closing all FDs (File Descriptors). This behaviour queries the maximum limit and iterates through the entire range, even if only a few FDs are open. In some environments (Docker, with the configured limit varying by distro) this can be a range exceeding 1 billion (vs the kernel default of 1024 soft, 4096 hard), causing an 8 minute delay with heavy CPU activity.

`postsrsd` has since been updated to use the `close_range()` syscall, and `fail2ban` will now iterate through `/proc/self/fd` (open FDs only), which should resolve the performance hit. Until those updates reach our Docker image, we need to work around it with the `--ulimit` option.

NOTE: The CI does not seem affected, but it can affect local development, causing test failures. If `docker.service` on a distro sets `LimitNOFILE=` to approx 1 million or lower, it should not be an issue. On distros such as Fedora 36, it is `LimitNOFILE=infinity` (approx 1 billion) that causes the excessive delays.
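To illustrate the difference between the two cleanup strategies in shell terms (a sketch only; the daemons implement this in C/Python, not shell, and a numeric hard limit is assumed):

```bash
# Naive daemonization cleanup: visit every possible FD up to the hard limit,
# even though almost none are open. With the limit near 1 billion, this loop
# is the multi-minute delay (do not actually run it on such a host!):
hard_limit=$(ulimit -Hn)
for ((fd = 3; fd < hard_limit; fd++)); do
  eval "exec ${fd}>&-"
done

# The /proc/self/fd approach (as fail2ban now does): only visit FDs that are
# actually open, so the configured limit no longer matters:
for fd_path in /proc/self/fd/*; do
  fd="${fd_path##*/}"
  (( fd > 2 )) && eval "exec ${fd}>&-"
done
```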
When the `close_range()` syscall is available (at least in Python, this requires kernel 5.9+ and `glibc` >= 2.34), the performance hit is avoided. On Debian 11 and Alpine Linux 3.16, neither container meets the `glibc` requirement (the Debian 11 package is too old; Alpine uses `musl` and has no equivalent support), whilst the Fedora 36 container has `glibc` 2.35 (performance improvement demonstrated). Thus this PR can likely be reverted once the next Debian release arrives in 2023.

Fixes: #2722