
Conversation

@rayburgemeestre
Contributor

Hi there,

This PR is reverting the change introduced by the following PR: #4475

In our case we ran into an issue with it on RHEL 9 and Rocky 9: the system would OOM-kill processes running in containers (spawned via Kubernetes, using containerd).

With the limit set to infinity, the system would either become unresponsive (if unlucky) or the process would be killed, similar to the log below.

[  550.295416] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=kubelet,mems_allowed=0,global_oom,task_memcg=/kubepods/besteffort/podfa0597df-8678-412b-b116-d2a8b906ddfd/fa785a2bb7a371c01590584ca74615e800c37b6e6984351f00a6bee11b918f2c,task=mysqld,pid=7654,uid=0
[  550.295443] Out of memory: Killed process 7654 (mysqld) total-vm:16827684kB, anon-rss:15376132kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:30220kB oom_score_adj:1000
[  550.413393] oom_reaper: reaped process 7654 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  560.824936] coredns invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=996

This is a very lightweight MySQL server on a very beefy node, getting killed right away. The same also happened with Java workloads, uWSGI, and other processes we have automated tests for.

We tried Kubernetes 1.24 and 1.21, with cgroups v2 and v1, and later tried Kubernetes on Docker, where it worked. That led us to the discovery that Docker is using LimitNOFILE=1048576 by default, while containerd is using LimitNOFILE=infinity.

Others have been running into this as well.

We can work around it, but I believe simply reverting the change would make life easier for others, and is therefore maybe worth doing? It would have saved me personally a couple of days of troubleshooting.
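
For illustration, a minimal sketch of the kind of workaround meant here (assuming containerd is managed by systemd; the drop-in path, file name, and value are examples, not part of this PR):

# Pin the limit back to the pre-#4475 value via a drop-in, rather than editing the packaged unit
sudo mkdir -p /etc/systemd/system/containerd.service.d
printf '[Service]\nLimitNOFILE=1048576\n' | sudo tee /etc/systemd/system/containerd.service.d/10-limitnofile.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd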

@k8s-ci-robot

Hi @rayburgemeestre. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…md_updates"

This reverts commit 6c74c39, reversing
changes made to 09814d4.

Signed-off-by: Ray Burgemeestre <[email protected]>
@rayburgemeestre rayburgemeestre force-pushed the fix-regression-service-file branch from efb4b88 to d33b2a6 on October 20, 2022 10:23
@thaJeztah
Member

thaJeztah commented Oct 20, 2022

I wonder if this needs some discussion (I'm a bit behind on the linked discussions, and probably should post there as well). So the TL;DR is:

The intent of LimitNOFILE in the systemd unit was "don't apply limits", both for "accounting" purposes (applying limits can cause overhead) and to make sure that the (combined) load of all containers running as child processes is able to use the resources provided by the host. (Similar to CPU not being constrained by default, which should be restricted for containers if they are not allowed to use "all".) See moby/moby@8db6109.

The culprit in the linked issues with MySQL (#3201, #6707, docker-library/mysql#579) is that these containers are started without limits set, so they run with "unconstrained" limits. It looks like MySQL by default optimizes for performance (?) and consumes "what it can use"; due to the increased limits, it consumes way more than before, which is causing the issue.

So the "correct" fix for these is to make sure containers are started with the intended limits on resources (cpu, memory, pid limit, ulimits). Those limits vary depending the use-case; some instances may need (expect to need) more resources than others, but in all cases it's good practice to set limits to the expected requirements for it to consume.

So, the question is: what should be done?

  • If the intent continues to be "no limits" as a default (containers should be set with a limit instead), then the PR should not be reverted.
  • Reverting the PR may mean that the shared workload (? - needs verification) of a deployed instance of containerd would be limited in the amount of resources it can consume.
  • containerd is designed to be a component in the stack and, from that perspective, tries to be "un-opinionated", leaving it to components in the stack "on top" of containerd to add their "opinionated" configuration (which may include setting defaults).

@kzys
Member

kzys commented Oct 20, 2022

Oh, this reminds me of #6541. Writing one systemd unit file that works across multiple systemd versions is not really feasible. We do need to pick the systemd version we target first...

@samuelkarp
Member

I'm curious to know how many people are consuming the unit file from the containerd repo. Typically, distributions that package containerd would be responsible for shipping their own unit file and can make adjustments to it based on the kernel and systemd version available in that distribution. From containerd's perspective, we can't predict what every installed system will have with respect to kernel and systemd versions, so we'll need to pick some behavior in our unit file, but that won't necessarily be appropriate for every system.

@mkgvb

mkgvb commented Nov 10, 2022

Hi all,
I was doing some testing running a centos:7 container under Fedora 35 and Fedora 36 hosts and noticed:

  1. GNU screen would never respond to commands (e.g. screen -ls, screen --help)
  2. yum transactions would never complete

It took a bit of googling, but I found this SO thread, which fixed both issues: https://stackoverflow.com/questions/73185002/yum-update-stucks-inside-docker

This is what's currently distributed on the Fedora 35 machine in /usr/lib/systemd/system/containerd.service:

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

and my current version

yum list installed containerd.io
Installed Packages
containerd.io.x86_64                                  1.6.8-3.1.fc35                                   @docker-ce-stable

Not sure if this is the right place to put this, as it might be more of a Fedora packaging issue as you're discussing, but I wanted to make it known.

@C-Higgins

I believe the commit should be reverted, as it is a regression in some circumstances. The decision to make the limit infinite (unopinionated) is fine, but it should be communicated as a breaking change in a later release. As far as I know, this change was not made with the intention of being a breaking change.

@travisby

Another data point: rook-ceph refuses to sync if you use limit=infinity (coreos/fedora-coreos-tracker#329; the poster closed the issue not because CoreOS changed the limit, but because their upstream vendor did). I don't quite know the nature of the perf issue, but changing it to a static number got the monitors syncing again.

@emorozov

It looks like MySQL by default optimizes for performance (?) and consumes "what it can use"; due to the increased limits, it consumes way more than before, which is causing the issue.

That doesn't look like a plausible explanation to me. When you run MySQL without Docker, it doesn't have any limits applied and it doesn't try to consume more RAM than is available. When I run MySQL in Docker now, it tries to allocate 16.5 GB of RAM immediately on start, although my laptop has only 16 GB.

There must be something else that causes MySQL to consume RAM so greedily in the latest Docker.

@thaJeztah
Member

That doesn't look like a plausible explanation to me. When you run MySQL without Docker, it doesn't have any limits applied and it doesn't try to consume more RAM than is available.

It will have limits applied:

Limits on the host:

ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 3709
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 3709
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

Limits in container:

docker run --rm alpine sh -c 'ulimit -a'
core file size (blocks)         (-c) unlimited
data seg size (kb)              (-d) unlimited
scheduling priority             (-e) 0
file size (blocks)              (-f) unlimited
pending signals                 (-i) 3709
max locked memory (kb)          (-l) 8192
max memory size (kb)            (-m) unlimited
open files                      (-n) 1073741816
POSIX message queues (bytes)    (-q) 819200
real-time priority              (-r) 0
stack size (kb)                 (-s) 8192
cpu time (seconds)              (-t) unlimited
max user processes              (-u) unlimited
virtual memory (kb)             (-v) unlimited
file locks                      (-x) unlimited

potiuk added a commit to apache/airflow that referenced this pull request Feb 25, 2023
Apparently with some recent releases of some OSes where new
containerd has been released, default docker community edition
causes Airlfow to immediately consume all memory.

This happens in Breeze and Docker Compose at least.

There is a workaround described in:
ttps://github.com/moby/moby/issues/43361#issuecomment-1227617516
to use Docker Desktop instead.

The issue is tracked in containerd in this issue - proposing to
revert the change (as it impacts other applications run in docker,
not only Airlfow):
containerd/containerd#7566
@Blizzke

Blizzke commented Feb 28, 2023

Just an FYI: this has bitten me twice in the last two weeks: once for slapd (OpenLDAP) and once for Airflow (Python).
Containers go from 0 to 60+ GB in 10 seconds (my system has 256 GB of RAM and the resulting OOM actually crashes my window manager). After setting ulimit -n 1024, things calm down considerably.
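
For Docker specifically, instead of passing --ulimit to every container, the daemon can apply a default ulimit to every container it starts. A rough sketch (the values are illustrative; merge this with any existing /etc/docker/daemon.json rather than overwriting it):

# "default-ulimits" in /etc/docker/daemon.json applies to all containers this daemon starts
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 1024, "Hard": 524288 }
  }
}
EOF
sudo systemctl restart docker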

@splinter89

If I may, I would recommend adding a note somewhere to stop this cycle:

  1. Added 1048576 (Add rlimits to service file, #1846)
  2. Changed to infinity (Update unit file for resources and task max, #2601)
  3. Changed to 1048576 (Set nofile to 1048576, #3202)
  4. Changed to infinity (systemd: use LimitNOFILE=infinity instead of hard-coded max value, #4475)
  5. Changing back to 1048576 in this PR

@polarathene
Contributor

This took a bit longer than expected to put together 😅

  • I don't expect anyone to read through it all (especially linked content), but it should hopefully be a helpful reference to support the change more confidently.
  • I've accumulated notes together here and attempted to organize them into a reasonable document, but there's still a fair bit of repeated content.
  • Links are provided for additional reference / verification of statements, but otherwise I've inlined any content from the links that seemed relevant.

Personal experience with infinity causing problems

This issue was difficult to troubleshoot back in Aug 2022 for me.

Ironically, despite that experience, I recently ran into an issue where there were reports of supervisord repeatedly restarting that same postsrsd process after recent changes in the project I maintain - but I could not reproduce it locally.

I had not realized that the running daemon process had actually stalled at startup (I had forgotten to test with the ulimits adjusted, so it inherited LimitNOFILE=infinity (which resolves to 2^30), as I had changed to a new system since Aug 2022). It would have eventually reproduced the failure had I waited 10 minutes (or set a lower soft limit with --ulimit for an instant failure).

On other systems, infinity resolves to 2^20 (1048576), which reproduces the failure in a timely manner (1 sec). My previous system (Manjaro) was like that.
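
A quick way to check what infinity actually resolves to on a given host, and what containers end up inheriting (illustrative commands; they assume a systemd-managed containerd, and a Docker daemon for the last one):

# Per-process FD ceiling that "infinity" effectively maps to on this host
sysctl -n fs.nr_open
# What the unit requests vs. what the running daemon actually got
systemctl show containerd -p LimitNOFILE
grep 'open files' /proc/"$(pidof containerd)"/limits
# What a container inherits (soft limit, then hard limit)
docker run --rm alpine sh -c 'ulimit -Sn; ulimit -Hn'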

@polarathene
Contributor

To infinity and beyond regressions!

After a tonne of investigation into the cause, I shared some helpful insights and have revised them below:

  • The default limits are defined in the kernel (1024 for quite a while). The kernel 3.0 release in 2011 raised the hard limit from 1024 to 4096 (still valid today, though the source file has moved). From the mentioned commit message:

    You don't want to raise the default soft limit,
    since that might break apps that use select(), but it's safe to raise the default hard limit;
    that way, apps that know they need lots of file descriptors can raise their soft limit without needing root, and without user intervention.

  • You may encounter mentions of /etc/security/limits.conf when looking into this topic; it is specifically config for the pam_limits module used by PAM and affects logged-in users (but not systemd-managed services like docker.service):

    There are two types of resource limits that pam_limits provides: hard limits and soft limits.
    Hard limits are set by root and enforced by the kernel, while soft limits may be configured by the user within the range allowed by the hard limits.

  • Next up, systemd changed how it handles limits in the v240 release (Dec 2018):

    Previously, systemd passed this on unmodified to all processes it forked off.
    With this systemd release the hard limit systemd passes on is increased to 512K, overriding the kernel's defaults and substantially increasing the number of simultaneous file descriptors unprivileged userspace processes can allocate.

    Note that the soft limit remains at 1024 for compatibility reasons:
    The traditional UNIX select() call cannot deal with file descriptors >=1024 and increasing the soft limit globally might thus result in programs unexpectedly allocating a high file descriptor and thus failing abnormally when attempting to use it with select()
    (of course, programs shouldn't use select() anymore, and prefer poll()/epoll, but the call unfortunately remains undeservedly popular at this time).

    This change reflects the fact that file descriptor handling in the Linux kernel has been optimized in more recent kernels and allocating large numbers of them should be much cheaper both in memory and in performance than it used to be.

    Which default hard limit is most appropriate is of course hard to decide.
    However, given reports that ~300K file descriptors are used in real-life applications we believe 512K is sufficiently high as new default for now.
    Note that there are also reports that using very high hard limits (e.g. 1G) is problematic: some software allocates large arrays with one element for each potential file descriptor (Java, …) — a high hard limit thus triggers excessively large memory allocations in these applications.
    Hopefully, the new default of 512K is a good middle ground: higher than what real-life applications currently need, and low enough for avoid triggering excessively large allocations in problematic software.

  • Docker containers inherit the LimitNOFILE value set in docker.service, typically a single value that raises both the soft and hard limit to infinity, unless that is changed with --ulimit, which can reduce each limit individually (--ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" to match the shell session's limits, typically 1024:524288).

  • For reference, GitHub Actions with the Ubuntu 20.04 runner OS has both soft (-Sn) and hard (-Hn) limits set to:

    • Host: 65536
    • Docker containers: 1048576
  • You can't run a container with limits higher than those set for the Docker daemon (something else in Debian imposes a limit of 2^20, as their {containerd,docker}.service has LimitNOFILE=infinity).

  • 2^30 (over 1 billion) is 1024 times as much as 2^20 (1048576, over a million); the latter was the originally intended ceiling for the switch to LimitNOFILE=infinity in docker.service / containerd.service.

    This is problematic for some software. A common daemon practice is to close all inherited file descriptors (typically 1024 from the standard soft limit) - which in Docker looks like a stalled / hanging process (but is actually performing over a billion close() syscalls IIRC):

    • postsrsd (upstream bug report): Less than 1 sec vs 8 minutes to init.

    • fail2ban (upstream bug report): Less than a second vs 70 minutes to init.

    • Both projects improved their handling of this, but the fixes will take time to become more broadly available (Debian Bookworm later this year AFAIK is still not updating its postsrsd package to pick it up, for example), and I wouldn't be surprised if various other software isn't prepared for dealing with a billion FDs.

    • For the closing inherited FDs practice that daemons perform, there are more modern approaches (fail2ban chose to iterate through /proc/self/fd, and postsrsd chose to use the close_range() syscall).

      I shared some notes regarding close_range() being considerably faster (unfazed by excessively high ulimits), but it has some requirement gotchas (glibc 2.34 + kernel 5.9 minimum) that Debian 11 (Bullseye) does not meet, though the next release (Bookworm) will this year. Alpine with musl, last I knew, also lacks support (as the link details); thus a popular base image is more prone to the issue.
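
As a rough shell illustration of why that close-everything-up-to-the-limit pattern hurts: a freshly started container has only a handful of FDs actually open, while a naive daemon loops all the way up to the inherited soft limit printed next to it:

docker run --rm alpine sh -c 'echo "soft limit (naive loop upper bound): $(ulimit -Sn)"; echo "FDs actually open: $(ls /proc/self/fd | wc -l)"'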

@polarathene
Contributor

History of decisions for LimitNOFILE in docker.service:

  • 2014 LimitNOFILE=1048576:

    So, why not just use "infinity", if we're going insanely high anyways? :)
    /devilsadvocate

    umm, container bombs? ;)

    there's nothing special about 1048576.
    In fact it's higher than is actually necessary (at the moment).
    It is however what we've tested with

    The justification is vaguely expressed with a set of commands to compare output before and after the change. Presumably back in 2014, the defaults (whatever they were in that user's environment at the time) for the Docker daemon were too low, so they were raised explicitly to 2^20 (1048576) on the basis that it provided a high enough margin of "works for me" to avoid the issues from limits being too low.

    There is an issue that linked back to that PR, about failure to delete hundreds of containers because too many files are opened... Quite possibly related to the purpose of the PR.

  • 2016 changes to LimitNOFILE=infinity:

    If you look at the original commit that added these limits it looks like it was just to make sure that we don't hit the limits as Docker (it wasn't a limiting feature)
    This commit fixes the performance problem without causing issues with the daemon opening too many files or processes.

    Their reasoning was that the change to infinity resolved a performance problem from LimitNOFILE=1048576, and that it avoided issues with the daemon opening too many files or processes??? The original commit's purpose was also misinterpreted.

    An issue about slow I/O perf is referenced by the PR author as related to what they were trying to resolve. A file I/O benchmark testing storage performance with far more files than real-world activity motivated the change, since it helped improve that benchmark? (I think this is the bug referenced in the next PR, but the 130-to-80-minute timings are a bit difficult to locate.)

  • 2016 revert back to LimitNOFILE=1048576:

    According to the reporter of the bug, their benchmark time almost halved (from ~130 minutes to ~80) after setting the ulimits to be unlimited. I saw similar speedups while testing.
    Note that this only really affects cases where you open and close an unspeakably large number of files for short periods of time.

    It's possible that it's a systemd-specific issue (I wouldn't be surprised) but it makes sense to me that having a resource limit for the number of open files would cause quite a lot of overhead.

    So I'm not really sure why there is a performance impact -- it might be some weird quirk of rpm that causes this to happen.

    The initial quotes seem to be about the previously set LimitNOFILE=infinity, with a belief that infinity was not setting a resource limit, hence their confusion about why 1048576 (2^20) seemed to have more overhead than infinity (which varied much more in the past depending on distro settings AFAIK, but could be much larger, which is where the real perf overhead becomes a problem).

    The last quoted comment actually reminded me of when I was troubleshooting back in Aug 2022, looking for better ways to solve the postsrsd and fail2ban projects' perf issues with closing 1 billion FDs 👍

    I specifically recall citing RedHat using /proc/self/fd to resolve their related performance issue (close_range() was not available at the time).

    That was resolved in 2018, while the issue was originally reported in 2016, and it seems to be what was being discussed at roughly the same time as those two 2016 docker.service PRs played with the LimitNOFILE settings.

    In 2019 another project references the Fedora RPM issue and works around it in their Dockerfile for a dramatic speedup (they had 2^30 for infinity):

    This simple fix reduced the build time of our application in a minikube based Docker system from 10h to 15 minutes.

  • 2021 changes to LimitNOFILE=infinity again.
    This time it was a PR to bring over docker.service from Docker CE to sync changes. The actual commit is from 2018 (from a PR that synced the opposite direction), which seemed to mangle LimitNOFILE to infinity again 🤷‍♂️ (and containerd regularly follows what moby does with docker.service for LimitNOFILE; there isn't much discussion beyond that).

@polarathene
Contributor

Almost there, I think I've connected all the dots in this post 👍

March 2019 - Proposal that LimitNOFILE=1048576 is too high and should be lowered in docker.service

Focuses on negative impact from an excessive soft limit

The docker.service in 2019 would have had LimitNOFILE=1048576 AFAIK.

This is required for containerd itself, but is way too generous for containers it runs.

This can create a number of problems, such as container abusing system resources (e.g. DoS attacks).
In general, cgroup limits should be used to prevent those, yet I think ulimits should be set to a saner values.

In particular, RLIMIT_NOFILE, a number of open files limit, which is set to 2^20 (aka 1048576), causes a slowdown in a number of programs, as they use the upper limit value to iterate over all potentially opened file descriptors, closing those (or setting CLOEXEC bit) before every fork/exec.

They are citing a limit of 2^20 as capable of putting pressure on system resources, which from my experience (on a midrange 2016 intel i5-6500 CPU with mitigations active) is not too concerning compared to the impact severity of 2^30 (1k times more impact).

The issue references a few others as examples of the impact of limits being too high. One of those compares the effect of increasing limits on the speed of a Python script, showing a limit of approx 2^20 taking 1 second to complete.

Reducing LimitNOFILE value in docker.service

OK, it looks like higher RLIMIT_NOFILE (aka LimitNOFILE) is required for BOTH containerd and dockerd, otherwise the default limit of 1024 is hit pretty fast, leading to inability to start more than ~200 "busybox top" containers

The issue initially proposes setting LimitNOFILE to 1024 (a ridiculously low number for the hard limit), but later discovers that doesn't work:

  • The hard limit can be raised (to set a maximum limit of files a process can open), and the software (processes) under that hard limit may raise their individual soft limit so long as it does not exceed that hard limit AFAIK? However, presently each process is inheriting a soft limit that's already set to the hard limit.
  • 1024 as the soft limit should be left alone if select() may be called. As reported earlier, both the kernel and systemd (v240 release notes) have made that clear.

I'm lacking expertise here, but from what I understand the containers should probably be inheriting soft and hard limits that reflect the host OS? I'm not sure how dockerd / containerd fit into the mix in addition to containers with these limits:

  • Does the soft limit need to be raised at this point? That may complicate things if it is required for correct functionality before containerized processes come into play.
  • What is the scope of the hard limit in this context? Is it still per process, or is it the cumulative number of files open across all running containers that cannot exceed the LimitNOFILE hard limit?

This shouldn't be too difficult to verify (and the linked issue kind of does, with the 200 busybox containers). For larger (commercial) deployments, docker.service / containerd.service could set a higher LimitNOFILE hard limit (while the soft limit should probably be kept at 1024?), and then individual containers could have smaller hard limits set if needed. That would allow running containers that need 100k+ FDs (like the Kafka docs advise) without hitting problems caused by LimitNOFILE=infinity being inherited into the limits of individual containers. That should not be a concern anyone needs to manage outside of production deployments (where docs for dockerd / containerd can communicate these tips).

May 2021 - Go >= 1.19 implicitly raises soft limit to hard limit

Here is a very interesting discussion about implicitly raising the soft limit on behalf of developers instead of requiring them to know to opt-in explicitly? (as most software would)

dockerd / containerd, being built in Go (which I have no experience with), may have relied on setting LimitNOFILE in the past to avoid the default 1024 soft limit, assuming nothing internally was raising that soft limit up to the hard limit? (Alternatively, the hard limit was likely too low and too variable in the past, until systemd v240 in 2018Q4?)

Article - Systemd changes to file descriptors (v240)

Where it really gets interesting is the reference, right away, to a May 2021 blog article from the author of systemd that talks about file descriptors and the changes made back in the v240 release:

Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max.

On today's kernels they kinda lost their relevance.
They had some originally, because fds weren't accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.

Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective.
This one was easy. We got rid of two pretty much redundant knobs. Nice!

The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds!

But … we left the soft RLIMIT_NOFILE limit at 1024.
We weren't quite ready to break all programs still using select() in 2019 yet.
Given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

and

Anyway, here's the take away of this blog story:

  • Don't use select() anymore in 2021. Use poll(), epoll, iouring, …, but for heaven's sake don't use select().
    It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs.
  • If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit.
    But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select().
    (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)
  • If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them.
    Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.

I hadn't noticed that v240 change (despite it being the next item in the release notes...), which sets two sysctl tunables to their maximum (the global limit fs.file-max and the per-process limit fs.nr_open)... I suspect that's why a colleague's Debian system was limited to 2^20 for their infinity value (EDIT: this appears to be the case).

Here are those values on my host (reference: sysctl fs docs):

# systemd >= v240 sets these two settings to their max:

# This value is approx (2^32 * 2^31), 64-bit signed int?
# I have seen a 2019 user report: (2^32 * 2^32) aka 2^64
$ sysctl fs.file-max
fs.file-max = 9223372036854775807


# `LimitNOFILE=infinity` seems to resolve to this 2^30 value? (well, close enough, off by 8):
$ sysctl fs.nr_open
fs.nr_open = 1073741816


# The three values in file-nr denote the:
# - number of allocated file handles
# - number of allocated but unused file handles
# - maximum number of file handles.
$ sysctl fs.file-nr  
fs.file-nr = 26048      0       9223372036854775807

infinity differs between systemd v240 changes (2^30) vs Debian (2^20)

Initially the hard limit for systemd v240 was planned as 2^18 (262144), but at the request of WINE (CodeWeavers) devs it was raised to 2^19 to support at least 300k file handles. The request preferred raising it to 2^20 to unify with the existing common practice seen in Debian and in the wild. 2^20 was decided against as it was deemed more excessive for a process than necessary, referencing a bug with Java on Debian (although that was caused by a 2^30 hard limit due to Debian patches on pam_limits.so).

Before systemd v240 set fs.file-max to the maximum value, fs.file-max would be dynamically sized to 10% of system memory. This was discussed briefly in the PR where the hard limit was bumped from 2^18 to 2^19 before finalizing v240, but it was decided to keep the limit simple / predictable and not continue a dynamic approach based on available memory.

fs.file-max was presumably boosted to the maximum range as it's a system limit and this ensures it is never lower than the hard limit systemd sets? Otherwise on some systems fs.file-max could be lower than fs.nr_open (or a lower hard limit set for a process), which would make fs.file-max the effective ceiling.

fs.nr_open, which is also maxed out by systemd v240 (and seems to determine what the infinity limit resolves to), originally defaults to 2^20 according to the kernel sysctl fs docs:

This denotes the maximum number of file-handles a process can allocate.
Default value is 1024*1024 (1048576) which should be enough for most machines.
Actual limit depends on RLIMIT_NOFILE resource limit.

NOTE: On Debian, infinity / fs.nr_open retains that default 2^20 (1048576), because Debian builds systemd >= 240 with the fs.nr_open change disabled as a workaround for a bug involving the infinity limit with their patched pam_limits.so + systemd v240. That workaround is still applied in 2023 (a decision motivated by a related issue report / discussion on systemd).

Various limits by distro release + risk of raising soft limit implicitly

In that same linked issue for Go, we have a user reporting configured limits across several systems:

  • Debian soft and hard both at 131072 (aka 2^17; prior to systemd v240 this, and some lower hard limits depending on distro, was commonplace).
  • CentOS 6 with soft 1024 and hard 4096 (kernel defaults since 2011).
  • A recent Fedora install with soft 1024 and hard 524288 (defaults from systemd since v240 in 2018Q4).
  • The comment directly after notes Debian Buster (2019 release) and Ubuntu Focal (20.04) have soft 1024 and hard 1048576.

In Nov 2021, a comment then lists various use cases where raising the soft limit implicitly should be avoided. Some later comments also mention macOS having a default soft limit of 256, and the initial implementation of the feature change failing some tests due to OPEN_MAX (getconf -a | grep OPEN_MAX, matches soft limit set: ulimit -Sn <num-not-above-hardlimit>) and infinity (sysctl fs.nr_open) not being handled well.
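
For reference, the commands behind those numbers, to see where a given host falls (the same checks mentioned above):

ulimit -Sn                  # soft limit (commonly 1024)
ulimit -Hn                  # hard limit: 4096, 2^17, 2^19, 2^20, ... depending on distro / systemd version
getconf -a | grep OPEN_MAX  # on the systems above this matched the configured soft limit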

March 2022 - Go 1.19 implements the decision to implicitly bump soft limit and closes the issue (equivalent ticket on Go's tracker):

Go does not use select, so it should not be subject to these limits.
On some systems the limit is 256, which is very easy to run into, even in simple programs like gofmt when they parallelize walking a file tree.

Relevance to Docker

A presumably relevant gotcha of the impact of this change regarding Docker / containerd:

If Go did it at startup, it would be inherited by non-Go programs that we fork+exec.
That is a potential incompatibility, but probably not a large one.

My understanding is that going forward, LimitNOFILE could potentially be dropped from .service files and Go would raise the soft limit accordingly (likely to the 512k hard limit default from systemd), but I don't know the technical details of how that affects containers and the processes run within them. Outside of production deployments, it is probably OK?

Perhaps LimitNOFILE should enforce the 1024 soft limit that the host is likely to have, which would be necessary now with that Go change? (granted the current .service soft limit has already exceeded 1024 for some time now, and is typically the culprit for misbehaving processes)

Perhaps the main issue in the past was that, prior to systemd v240, the inherited hard limit could have been as low as 4096 (assuming a host using a kernel release from at least 2011), which was problematic for a container runtime due to all child processes counting towards that same limit? (It also seems that adjusting the limit beyond what the system allowed (fs.file-max) could sometimes silently fail to apply, falling back to a 4096 hard limit.)

@polarathene
Contributor

polarathene commented Mar 9, 2023

Summary

docker.service + containerd.service probably should adopt LimitNOFILE=1024:524288.

Reasons to change away from infinity

  • Most hosts are likely to have limits set as 1024 soft + 524288 (2^19) hard.

  • The impact of 2^20 hard limit is minimal compared to the impact of 2^30 (implicitly resolved on some systems as the infinity value) in a variety of software.

  • Software should rarely need limits so excessively high; if it does, it is more likely to clearly document this (see MongoDB and Kafka) or to be deployed in a production environment where the expertise to know about this requirement can be afforded.

  • When the limit needs to be raised, that is communicated more clearly through error output - rather than having to troubleshoot software placing unusual pressure on system resources due to excessive limits, or a process being mistaken for hanging or running normally (with the failure delayed until after fully iterating through the FD range).

  • The default of LimitNOFILE=infinity is not sane for the majority of users, especially developer machines where surprises can be difficult to trace back to this cause (I had to look through third-party source code to track it down, report it to get a fix upstreamed and broaden knowledge to understand why colleagues could not reproduce the failure, and still got tripped up by the same bug 6 months later).

  • systemd docs for LimitNOFILE specifically discourage increasing the soft limit when setting LimitNOFILE unless you have a good reason to.

    That section also describes infinity as configuring "no limit" on the resource, which maps to unlimited. But for ulimit -n at least, this seems to equate to a rather large number (fs.nr_open), and has been described as RLIMIT_NOFILE on Linux not using RLIM_INFINITY as a special value for a fast path.

    One of the 2016 moby discussions about using infinity references a StackOverflow page that cites two kernel commits:

    • Oct 2008 kernel commit (2.6.28):

      permit setting RLIMIT_NOFILE to RLIM_INFINITY
      When a process wants to set the limit of open files to RLIM_INFINITY it gets EPERM even if it has CAP_SYS_RESOURCE capability.
      The spec says "Specifying RLIM_INFINITY as any resource limit value on a successful call to setrlimit() shall inhibit enforcement of that resource limit." and we're presently not doing that.

      which was reverted in Feb 2009:

      Because it causes (arguably poorly designed) existing userspace to spend interminable periods closing billions of not-open file descriptors.
      Apparently the pam library in Debian etch (4.0) initializes the limits to some default values when it doesn't have any settings in limit.conf to override them. Turns out that for nofiles this is RLIM_INFINITY.

  • docker.service provides inline comments as to why it was decided to set LimitNOFILE, but current maintainers are not able to cite a specific source / test that reproduces the concern, nor how much of a performance impact is implied? (while as shown above, infinity risks considerably worse performance among other concerns).

    Furthermore, as detailed here, changes in systemd from v240 in 2018Q4 may have changed this (both the hard limit handling, and CPU accounting no longer needing to enable the CPU controller), among other improvements in kernel releases since the decision to settle on a limit of infinity in docker.service. The release notes for v240 also state that the related kernel accounting is no longer expensive.


Soft limit should be kept at 1024 if viable

I don't know if it was accidental to raise the soft limit for the daemon so high (and thus inherited by everything else run within the containers), just that it appears that:

  • It was originally necessary to avoid hitting the EMFILE (too many open files) soft limit when removing hundreds of containers?

  • Actual software running in the containers may sometimes need to open many files, but are likely capable of raising their soft limit, provided the hard limit is sufficient.

    However, that hard limit is per process, which might be more restrictive if all container processes are running under a parent process like dockerd?

    This is something I don't know the impact of, where a higher hard limit may be necessary to accommodate everything run in containers?

  • The daemons running into performance problems from LimitNOFILE=infinity (and likely other affected software AFAIK) do so because the soft limit is well above 1024, regardless of the hard limit (some software may still be prone to this if it bumps the soft limit to the hard limit internally).

    These processes, unless explicitly raising the soft limit themselves, aren't designed with unusually high soft limits in mind; they typically expect 1024.

Deciding on value to set for LimitNOFILE

docker.service + containerd.service probably should adopt LimitNOFILE=1024:524288.

  • Aligns with what systemd sets as DefaultLimitNOFILE=1024:524288 in /etc/systemd/system.conf since v240.

    Additionally, that release notes that accounting for large hard limits has not been a concern for many kernel releases by that time (2018Q4), which is definitely applicable by now, removing the inlined concern about LimitNOFILE in the .service files.

  • systemd author reflects on the v240 change in May 2021, noting negative feedback was minimal.

    Also see the last "take away" section if relevant to the daemons of the .service files. If nothing requires a higher soft limit prior to the container processes from users being run, then keeping a 1024 soft limit is appropriate and should better reflect running the process on the host instead of a container.

  • LimitNOFILE=524288 is a bit more prone to issues, since it raises the soft limit that all containers (and thus their processes) inherit unless it is lowered via --ulimit.

    • Given the scope of containers, a soft limit above 1024 is likely to share the same concerns expressed in systemd v240 notes with their decision to keep it at 1024 for legacy compatibility reasons (as software that knows it needs a higher limit can request to raise up to the hard limit).
    • A related change for Go 1.19 implicitly raising the soft limit to the hard limit lists known concerns about when that should be avoided (NSS for user/group and DNS lookups is used from glibc, presumably relevant to dockerd). If applicable, the daemon really should not be raising its soft limit (yet it has done so for many years now? perhaps causing some difficult-to-troubleshoot bug reports?)
  • #LimitNOFILE= (not defining it in the .service) may also be appropriate and would use the host's hard limit (on Debian-based distros this may instead be 1048576).

    If I understand the Go 1.19 (Aug 2022) release, its conditions for implicitly raising the soft limit to the hard limit may not apply to dockerd or containerd. If so, this may be acceptable if the soft limit does not need to be raised, only causing compatibility concerns for the hard limit on systems with systemd prior to v240?

  • LimitNOFILE=1048576 was approved in the past before using infinity.

    It's possible that for some users that the value of 2^20 exceeded the system limit (fs.file-max prior to systemd v240 was dynamically sized based on 10% of system memory pages, as can be seen here ((398153 * 10) / 4) * 4096 == ~4GiB).

    Thus the actual number of files that could be open by a process was much lower. It's possible raising the limit failed silently, which could revert to the 2011 kernel defaults (effectively LimitNOFILE=1024:4096, although that was a failure with PAM limits.conf). There has been advice to set /etc/sysctl.conf with values like 1024576 (often seen used for bytes / words) or lower, reducing fs.nr_open (the maximum limit for ulimit -Hn).

    That would explain why infinity could have been reported to perform better on affected systems? (it was not clear how the testing of limits was done, or what they actually were before/after on those systems)

Original version
  • ✔️ LimitNOFILE=1048576 or LimitNOFILE=524288.

    • Revert to the previous 2016 value 1048576; it was less problematic, and the change to infinity in 2018 / 2021 seems to have been an accident. Possibly the fastest option for getting the change accepted?

    • Consider respecting systemd 524288 default that should be broadly available by now.

      In the context of dockerd / containerd in production, if the hard limit isn't per container / process, it may need to be raised with many demanding containers? (MongoDB docs advise needing 64k, and Kafka advises at least 100k.)

  • LimitNOFILE=1024:1048576 if not actually needing to raise the soft limit.

    • 1048576 could be used to align with distros like Debian that explicitly configure this. It's a common value you'll see cited as a "works for me" solution (and was initially chosen for docker.service), partly I imagine because it's 2^20 which has some selection bias vs "odd looking" powers like 2^19? (but 2^16 acceptable for familiarity)

      EDIT: There is some relevance to the value, such as the fs.nr_open default in the kernel, but historically it required ensuring fs.file-max was at least as high to be effective.

    • 524288 (2^19) as an alternative. This is the value systemd settled on with v240 release (2018Q4), after initially considering 2^18 (2^16 and 2^17 weren't uncommon to see elsewhere, Github Actions with Ubuntu 20.04 runner reports 2^16).

    • A 1024 soft limit, if needed internally, could AFAIK be raised and later lowered back to 1024 for the individual containers being managed?

  • ❓ Avoiding setting LimitNOFILE would need to keep in mind that Go 1.19 will raise the soft limit to the hard limit (this would technically match the likely intention that the current LimitNOFILE=infinity was believed to accomplish, rather than its actual behaviour).

    • NOTE: The change since Go 1.19 may then require lowering the soft limit where necessary (fork+exec), as they decided not to restore the original soft limit implicitly (if it was raised to the hard limit implicitly).
    • If the soft limit does not need to be raised, then LimitNOFILE=1024:<hard-limit-here> would seem more appropriate, which would be similar to the behaviour prior to Go 1.19. It's unclear if docker.service ever needed to raise the soft limit to resolve its original concerns?
  • LimitNOFILE=1024:infinity would prevent some of the issues due to leaving the soft limit alone

    • Some software will still misbehave; the JDK, for example, has been found to not only raise its soft limit to the hard limit (like Go 1.19 does), but also allocate memory for an array based on that size.

      The JDK reference linked cites a 512k hard limit (2^19) that would allocate 4MB of memory for an array sized to the hard limit (2^19 * 8 bytes). That seems like it may align with reported pathological memory usage when the hard limit is 2^30 (1073741824 * 8 == 8.6GB).

paralin added a commit to skiffos/buildroot that referenced this pull request Sep 30, 2024
containerd recently changed the default LimitNOFILE from 1048576 to infinity.
This breaks various applications in containers such as cups, which cannot start
and/or print with the infinite ulimit.

Issue: moby/moby#45204
PR: containerd/containerd#7566

Signed-off-by: Christian Stewart <[email protected]>

Labels

needs-ok-to-test status/needs-discussion Needs discussion and decision from maintainers
