tests: Ensure excessive FD limits are avoided by polarathene · Pull Request #2730 · docker-mailserver/docker-mailserver

polarathene · 2022-08-19T03:31:24Z

Description

Local tests can fail depending on environment if Docker is configured with an excessive FD limit.

This was affecting tests using ENABLE_SRS=1 and ENABLE_FAIL2BAN=1 due to a common step for daemonization taking considerably longer to complete (8 minutes on Fedora 36), causing the tests to fail due to timeout or unreachable service.

I have chosen the kernel defaults (1024 soft, 4096 hard) rather than the systemd defaults (1024 soft, 512K hard); the kernel defaults should be fine for tests. The limit governs how many files/streams a process can have open at one time. A process can request to raise its soft limit, so long as it does not exceed the hard limit.
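As a rough illustration of the soft/hard relationship described above, the following Python sketch reads the current RLIMIT_NOFILE pair and raises the soft limit no higher than the hard limit. The helper name `raise_soft_limit` is hypothetical and not part of this PR; it relies only on the standard `resource` module (Unix only).

```python
import resource


def raise_soft_limit(requested: int) -> tuple[int, int]:
    """Set the soft FD limit to `requested`, clamped to the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may not exceed the hard limit, so clamp to it.
    if hard != resource.RLIM_INFINITY and (
        requested == resource.RLIM_INFINITY or requested > hard
    ):
        requested = hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (requested, hard))
    # Report the limits actually in effect after the change.
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

With limits of 1024 soft and 4096 hard, a call such as `raise_soft_limit(8192)` would be clamped and leave the process at 4096 soft, 4096 hard.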

I avoided adding documentation, as I don't expect this to be a common issue users would face in real deployments. If an issue is raised about high CPU usage or similar odd activity to troubleshoot, it may be due to this, and if so it would warrant documenting for users. Otherwise I only expect it to affect contributors running tests locally.

Additional Details

Processes that run as daemons (postsrsd and fail2ban-server) initialize by closing all FDs (File Descriptors).

This behaviour queries the maximum limit and iterates through the entire range, even if only a few FDs are open. In some environments (Docker, where the configured limit varies by distro) this range can exceed 1 billion (compared with the kernel default of 1024 soft, 4096 hard), causing an 8 minute delay with heavy CPU activity.
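The scale of the problem can be estimated with simple arithmetic. The sketch below derives an average per-FD cost from the reported 8 minute delay and projects it onto smaller limits. It assumes "approx 1 billion" means 2**30 and that the cost per FD number is constant; both are simplifications, and the function names are hypothetical.

```python
def per_fd_cost(total_seconds: float, fd_limit: int) -> float:
    """Average seconds spent per FD number when looping over the whole range."""
    return total_seconds / fd_limit


def estimated_delay(fd_limit: int, cost_per_fd: float) -> float:
    """Projected startup delay for a given limit at the same per-FD cost."""
    return fd_limit * cost_per_fd


# Assumption: "approx 1 billion" is taken as 2**30; 8 minutes is the reported delay.
cost = per_fd_cost(8 * 60, 2**30)

limits = [("hard 4096", 4096), ("512K", 512 * 1024), ("~1 million", 2**20), ("~1 billion", 2**30)]
for label, limit in limits:
    print(f"{label:>11}: {estimated_delay(limit, cost):10.3f} s")
```

Under these assumptions a limit of about 1 million costs well under a second, which is consistent with such limits not being a practical problem.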

postsrsd has since been updated to use the close_range() syscall, and fail2ban will now iterate through /proc/self/fd (open FDs only), which should resolve the performance hit. Until those updates reach our Docker image, we need to work around it with the --ulimit option.
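To illustrate why iterating /proc/self/fd is cheap, here is a minimal Python sketch that lists only the FDs that are actually open. It is not fail2ban's implementation; `open_fds` is a hypothetical helper and it assumes a Linux /proc filesystem.

```python
import os


def open_fds() -> list[int]:
    """Return the FD numbers currently open in this process (Linux only)."""
    fds = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            # listdir() briefly opens a directory FD that shows up in its own
            # listing; it is closed again by now, so fstat() filters it out.
            os.fstat(fd)
        except OSError:
            continue
        fds.append(fd)
    return sorted(fds)
```

A daemonizing process that closes each entry of such a list does work proportional to the number of open FDs rather than to the configured limit.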

NOTE: The CI does not seem to be affected, but this can affect local development by causing test failures. If docker.service on a distro sets LimitNOFILE= to approx 1 million or lower, it should not be an issue. On distros such as Fedora 36, it is LimitNOFILE=infinity (approx 1 billion) that causes excessive delays.
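The threshold in the note and the --ulimit workaround can be sketched together in Python. This is an illustration only: the 2**20 cut-off is an assumed reading of "approx 1 million", the function names are hypothetical, and the test suite itself passes the option directly to docker.

```python
import resource

# Assumption: "approx 1 million" from the note above is taken as 2**20.
EXCESSIVE_THRESHOLD = 2**20


def is_excessive(limit: int) -> bool:
    """True when a limit is unlimited or above the ~1 million threshold."""
    return limit == resource.RLIM_INFINITY or limit > EXCESSIVE_THRESHOLD


def ulimit_args(soft: int = 1024, hard: int = 4096) -> list[str]:
    """Build the `docker run` arguments that pin the container's nofile limit."""
    return ["--ulimit", f"nofile={soft}:{hard}"]
```

A command line could then be assembled as `["docker", "run", "--rm", *ulimit_args(), "example/image:tag"]`, where the image name is a placeholder.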

When the close_range() syscall is available (at least in Python, this requires kernel 5.9+ and glibc >= 2.34), the performance hit is avoided. On Debian 11 and Alpine Linux 3.16, neither container meets the glibc requirement (the Debian 11 package is too old, while Alpine uses musl and has no equivalent support), whilst the Fedora 36 container has glibc 2.35 (performance improvement demonstrated). Thus this PR can likely be reverted once the next Debian release occurs in 2023.
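The version thresholds quoted above can be expressed as a small check. This sketch only compares version strings against those thresholds; it does not probe the syscall, and it is not how Python itself decides whether to use close_range(). The function names are hypothetical.

```python
import re


def version_tuple(text: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like '5.17.5-300.fc36' or '2.35'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_close_range_requirements(
    kernel_release: str, libc_name: str, libc_version: str
) -> bool:
    """Apply the thresholds quoted above: kernel 5.9+ and glibc >= 2.34."""
    if libc_name != "glibc":
        return False  # e.g. musl on Alpine
    return (
        version_tuple(kernel_release) >= (5, 9)
        and version_tuple(libc_version) >= (2, 34)
    )
```

On a running system, the inputs could come from `platform.release()` and `platform.libc_ver()` in the standard library; the latter is expected to report an empty library name when glibc is not detected.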

Fixes: #2722

Type of change

Bug fix (non-breaking change which fixes an issue)
Improvement (non-breaking change that does improve existing functionality)

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
New and existing unit tests pass locally with my changes


Typically on modern distros with systemd, the default FD limit should equate to 1024 (soft) and 512K (hard). A distro may override the built-in global defaults that systemd sets by configuring `DefaultLimitNOFILE=` in `/etc/systemd/user.conf` and `/etc/systemd/system.conf`.
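For reference, systemd limit settings such as `DefaultLimitNOFILE=` accept a single value, a `soft:hard` pair, or `infinity`. A minimal Python sketch of parsing such a value follows; `parse_nofile` is a hypothetical helper and it assumes plain integers without unit suffixes.

```python
INFINITY = float("inf")


def parse_nofile(value: str) -> tuple[float, float]:
    """Parse 'N', 'SOFT:HARD' or 'infinity' into a (soft, hard) pair."""

    def one(part: str) -> float:
        part = part.strip()
        return INFINITY if part == "infinity" else int(part)

    soft, sep, hard = value.partition(":")
    # A single value applies to both the soft and the hard limit.
    return (one(soft), one(hard)) if sep else (one(soft), one(soft))
```

The systemd defaults mentioned above would be written as `1024:524288`, since 512K is 524288.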

- `no_containers.bats` tests the external script `setup.sh` without `-c`. It expects that no existing DMS container is running, otherwise it may attempt to use that container and fail. Detect this and fail early via a `setup_file()` step.
- `mail_undef_spam_subject.bats` was missing a container name assignment to go with the change I made to be more explicit about assigning a deterministic name.
- `mail_hostname.bats` had an odd timing failure with teardown due to the last tests bringing the containers down earlier (`docker stop` paired with `docker run --rm`). Adding a moment of delay via `sleep` helps avoid that false positive scenario.

georglauterbach

LGTM

casperklein

Two things:

1. It might be worth adding a little comment (#2730) or so to the --ulimit lines, to prevent research for someone wondering about these lines.
2. Reminder: Every time a "sleep" is used to fix something, god kills a kitten 😆

polarathene · 2022-08-22T00:02:09Z

> 1. It might be worth adding a little comment (#2730) or so to the --ulimit lines, to prevent research for someone wondering about these lines.

With them being only in the tests, hopefully git blame makes it easy enough to look up this PR. All core maintainers are now aware of the purpose. The tests are rarely modified, I'm pretty much the only one who does extensive refactoring there 😅

I'm not too concerned, as by mid 2023 (?) the next Debian release will be ready, and that hopefully has the new postsrsd release packaged. As for fail2ban, I know you've opted to install it from an external source, so that'll probably get the update sooner?

I can add the reference comments if you still think they'd be useful. I would like to look at moving the container setup for tests into docker-compose.yml files at some point, git blame would be a bit more work by then, so more context inline would be helpful at that point.

polarathene · 2022-08-22T02:14:34Z

Apologies, I pushed a fix to the wrong PR 😬

Dropped the commit and force-pushed. You can see that the head commit is the same as the last merge of the master branch, commit 2c27237, so nothing has changed since approval 😅

casperklein · 2022-08-22T09:40:41Z

Even with git blame I find it sometimes very hard to track code changes. There are so many refactorings, or code is split/moved to other files, that it can be quite challenging to find the origin commit that introduced something. Maybe not in this specific case, but in general.

polarathene · 2022-08-22T09:49:24Z

> Even with git blame I find it sometimes very hard to track code changes. There are so many refactorings, or code is split/moved to other files, that it can be quite challenging to find the origin commit that introduced something.

Absolutely, that's why I try to provide some rather verbose history in some refactoring PRs to save other maintainers from trawling through it 😆

The most annoying ones are around 2020-2021, because several of them were very large and moved everything around. My special trick for those is to take something like the Dockerfile and jump back in time prior to that, then browse the repo from that point in time and find the file to git blame 😅

I do agree that it's not pleasant, and definitely prefer adding contextual doc comments for stuff that is harder to lookup / grok 👍

polarathene added area/tests service/security/fail2ban kind/bugfix labels Aug 19, 2022

polarathene added this to the v11.2.0 milestone Aug 19, 2022

polarathene self-assigned this Aug 19, 2022

polarathene marked this pull request as draft August 19, 2022 20:27

polarathene force-pushed the tests/enforce-ulimit branch from d6e13b8 to 83af3e7 August 20, 2022 03:01

polarathene force-pushed the tests/enforce-ulimit branch from 83af3e7 to 74cf883 August 20, 2022 08:09

polarathene marked this pull request as ready for review August 20, 2022 13:29

polarathene requested review from casperklein and georglauterbach August 20, 2022 13:29

georglauterbach previously approved these changes Aug 20, 2022

casperklein previously approved these changes Aug 21, 2022

Merge branch 'master' into tests/enforce-ulimit

2c27237

polarathene dismissed stale reviews from casperklein and georglauterbach via 58bd96b August 22, 2022 02:08

polarathene force-pushed the tests/enforce-ulimit branch from 58bd96b to 2c27237 August 22, 2022 02:11

georglauterbach approved these changes Aug 22, 2022

casperklein approved these changes Aug 22, 2022

Merge branch 'master' into tests/enforce-ulimit

8e0dbea

polarathene merged commit 672e9cf into docker-mailserver:master Aug 22, 2022

polarathene mentioned this pull request Sep 25, 2022: scripts: set nofile for fail2ban process #2792 (Closed, 3 tasks)

polarathene mentioned this pull request Mar 6, 2023: Review / revisit systemd unit files docker/for-linux#73 (Open)

polarathene mentioned this pull request Oct 9, 2025: CI experiment: Build on latest Debian astral-sh/python-build-standalone#772 (Draft)

polarathene merged 5 commits into docker-mailserver:master from polarathene:tests/enforce-ulimit
