
[BR]: fail2ban-client start fails with Docker environments due to excessively large FD limits to iterate and close #3334

@polarathene

Description


UPDATE:

I've since investigated and discovered the cause is here:

# urandom should not be closed in Python 3.4.0. Fixed in 3.4.1
# http://bugs.python.org/issue21207
if sys.version_info[0:3] == (3, 4, 0): # pragma: no cover
    urandom_fd = os.open("/dev/urandom", os.O_RDONLY)
    for fd in range(0, maxfd):
        try:
            if not os.path.sameopenfile(urandom_fd, fd):
                os.close(fd)
        except OSError: # ERROR (ignore)
            pass
    os.close(urandom_fd)
else:
    os.closerange(0, maxfd)
# Redirect the standard file descriptors to /dev/null.
os.open("/dev/null", os.O_RDONLY) # standard input (0)
os.open("/dev/null", os.O_RDWR) # standard output (1)
os.open("/dev/null", os.O_RDWR) # standard error (2)
return (True,)

  • In environments like Docker, the FD limit may commonly be approx 1 million or 1 billion depending on the OS configuration (at least with systemd service files setting LimitNOFILE= for docker.service).
  • Even though only a few standard FDs may be open, the excessive range will still be iterated, consuming heavy CPU. In my experience this was 8 minutes for os.closerange() and approx 70 minutes for the alternative os.close() loop.

A workaround, when possible, is to set a lower limit for the container with the docker run option --ulimit. Some users have reported being unable to use this due to lacking the required permissions in their environment.

Various solutions have been suggested in the discussion below, but the easiest approach may be to set a sane maximum number of FDs to close, and log when that maximum is exceeded so users know to explicitly request a larger/unrestricted range if needed?
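
As a rough illustration of that idea (not a patch against fail2ban's actual daemonize code; the SANE_MAXFD name, the helper, and the 65536 threshold are arbitrary choices for this sketch):

import logging
import os
import resource

SANE_MAXFD = 65536  # arbitrary "sane" cap chosen for the sketch

def close_fds_capped(requested_maxfd=None):
    # Detect the FD limit, as daemonize code typically does.
    maxfd = requested_maxfd or resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    if maxfd == resource.RLIM_INFINITY:
        maxfd = SANE_MAXFD
    if requested_maxfd is None and maxfd > SANE_MAXFD:
        # Log before closing anything so the hint is still visible, and tell
        # users how to explicitly opt in to the full/unrestricted range.
        logging.warning("FD limit %d exceeds %d; only closing FDs below %d. "
                        "Pass an explicit limit to close a larger range.",
                        maxfd, SANE_MAXFD, SANE_MAXFD)
        maxfd = SANE_MAXFD
    os.closerange(0, maxfd)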


Reproduction

This is taken from a later comment. With Docker installed, this command should fail (at least on Linux hosts with systemd):

$ docker run --rm -it alpine ash -c 'apk add fail2ban && rm /etc/fail2ban/jail.d/alpine-ssh.conf && fail2ban-client start && ps aux | grep fail2ban'

# ...
2022-08-16 05:20:19,777 fail2ban                [11]: ERROR   Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it

While this will be successful (the --ulimit option sets a reduced limit that can be iterated through before the default 30-second timeout):

$ docker run --rm -it --ulimit 'nofile=1024' alpine ash -c 'apk add fail2ban && rm /etc/fail2ban/jail.d/alpine-ssh.conf && fail2ban-client start && ps aux | grep fail2ban'

# ...
Server ready
# ...
   14 root      0:00 {fail2ban-server} /usr/bin/python3 /usr/bin/fail2ban-server --async -b -s /var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ban.pid --loglevel INFO --logtarget /var/log/fail2ban.log --syslogsocket auto

If one knows how much time is needed to be successful, raising the timeout also works: fail2ban-client --timeout 600 start (10 minutes). Otherwise, when the command times out, the fail2ban-server process continues closing file descriptors in the background and isn't killed by the timeout.


Original report

Environment:

Running within a Docker container (Debian Bullseye for production, but I have reproduced it in Alpine). The Docker daemon is running on a Fedora 36 VM guest (hypervisor VMware Workstation 16.3) on an Arch Linux host system (x86_64).

  • Fail2Ban version: fail2ban-0.11.2-r2
  • OS, including release name/version : Alpine Linux v3.16 (Docker container)
  • Fail2Ban installed via OS/distribution mechanisms
  • You have not applied any additional foreign patches to the codebase
  • Some customizations were done to the configuration (provide details below if so)

The issue:

Failure occurs when attempting to start fail2ban in a Docker container:

$ fail2ban-client start

2022-08-16 04:16:46,049 fail2ban                [28]: ERROR   Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it
Log output (with `-vvv -x`)
$ fail2ban-client -vvv -x start

 +  141 7FE0EAB16B48 fail2ban.configreader     INFO  Loading configs for fail2ban under /etc/fail2ban 
 +  143 7FE0EAB16B48 fail2ban.configreader     DEBUG Reading configs for fail2ban under /etc/fail2ban 
 +  144 7FE0EAB16B48 fail2ban.configreader     DEBUG Reading config files: /etc/fail2ban/fail2ban.conf
 +  145 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/fail2ban.conf']
 +  145 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/fail2ban.conf
 +  146 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/fail2ban.conf']
 +  148 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Shared file: /etc/fail2ban/fail2ban.conf
 +  149 7FE0EAB16B48 fail2ban                  INFO  Using socket file /var/run/fail2ban/fail2ban.sock
 +  149 7FE0EAB16B48 fail2ban                  INFO  Using pid file /var/run/fail2ban/fail2ban.pid, [INFO] logging to /var/log/fail2ban.log
 +  150 7FE0EAB16B48 fail2ban                  HEAVY  -- ['ping', 0.1] failed -- FileNotFoundError(2, 'No such file or directory')
 +  151 7FE0EAB16B48 fail2ban.configreader     INFO  Loading configs for jail under /etc/fail2ban 
 +  152 7FE0EAB16B48 fail2ban.configreader     DEBUG Reading configs for jail under /etc/fail2ban 
 +  154 7FE0EAB16B48 fail2ban.configreader     DEBUG Reading config files: /etc/fail2ban/jail.conf, /etc/fail2ban/jail.d/alpine-ssh.conf
 +  155 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/jail.conf']
 +  155 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/jail.conf
 +  169 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/paths-debian.conf']
 +  171 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/paths-debian.conf
 +  173 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/paths-common.conf']
 +  174 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/paths-common.conf
 +  188 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/paths-overrides.local']
 +  194 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/paths-overrides.local
 +  200 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/jail.d/alpine-ssh.conf']
 +  206 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Reading file: /etc/fail2ban/jail.d/alpine-ssh.conf
 +  207 7FE0EAB16B48 fail2ban.configparserinc  INFO    Loading files: ['/etc/fail2ban/paths-common.conf', '/etc/fail2ban/paths-debian.conf', '/etc/fail2ban/jail.conf', '/etc/fail2ban/jail.d/alpine-ssh.conf']
 +  207 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Shared file: /etc/fail2ban/paths-common.conf
 +  207 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Shared file: /etc/fail2ban/paths-debian.conf
 +  207 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Shared file: /etc/fail2ban/jail.conf
 +  208 7FE0EAB16B48 fail2ban.configparserinc  TRACE     Shared file: /etc/fail2ban/jail.d/alpine-ssh.conf
 +  249 7FE0EAB16B48 fail2ban                  DEBUG   async starting of server in 23, fork: True - 24
 +  252 7FE0EAB16B48 fail2ban                  HEAVY __waitOnServer: (True, 30)
 +  253 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.0009500980377197266
 +  267 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.014755487442016602
 +  287 7FE0EAB16B48 fail2ban                  DEBUG   async starting of server in 24, fork: True - 0
 +  290 7FE0EAB16B48 fail2ban                  DEBUG Starting '/usr/bin/fail2ban-server' with args ['fail2ban-server', '--async', '-b', '-s', '/var/run/fail2ban/fail2ban.sock', '-p', '/var/run/fail2ban/fail2ban.pid', '-x', '-vvv', '--loglevel', 'INFO', '--logtarget', '/var/log/fail2ban.log', '--syslogsocket', 'auto']
 +  295 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.04277944564819336
 +  348 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.09576845169067383
 +   69 7FBB6DACAB48 fail2ban                  INFO  Using socket file /var/run/fail2ban/fail2ban.sock
 +   73 7FBB6DACAB48 fail2ban                  INFO  Using pid file /var/run/fail2ban/fail2ban.pid, [INFO] logging to /var/log/fail2ban.log
 +   74 7FBB6DACAB48 fail2ban                  DEBUG   direct starting of server in 24, deamon: True
 +  457 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.2051231861114502
 +  151 7FBB6DACAB48 fail2ban.server           INFO  Starting in daemon mode
 +  153 7FBB6DACAB48 fail2ban                  DEBUG Exit with code 0
 +  658 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.4056081771850586
 + 1065 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 0.8127536773681641
 + 1567 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 1.3147380352020264

 + 2073 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 1.8207433223724365
  + 2574 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 2.321540355682373
  + 3075 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 2.8223278522491455
  + 3581 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 3.3287596702575684
  + 4084 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 3.8317525386810303
  + 4587 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 4.3347508907318115
  + 5088 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 4.835473299026489
  + 5591 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 5.338402271270752
  + 6091 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 5.838769435882568
  + 6599 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 6.346776962280273
  + 7102 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 6.849779844284058
 + 7606 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 7.353386640548706
 + 8109 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 7.856667757034302
 + 8610 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 8.35775375366211
 + 9113 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 8.860753774642944
 + 9614 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 9.361480474472046
 +10118 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 9.865747928619385
 +10620 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 10.367749452590942
 +11127 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 10.874731302261353
 +11628 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 11.375521421432495
 +12131 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 11.878766298294067
  +12634 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 12.381748676300049
  +13140 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 12.887734413146973
  +13643 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 13.390751361846924
  +14152 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 13.89975881576538
  +14653 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 14.400509119033813
  +15156 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 14.903644323348999
  +15660 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 15.407699823379517
  +16161 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 15.908511638641357
  +16666 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 16.413676023483276
  +17173 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 16.920743465423584
 +17676 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 17.423735857009888
 +18177 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 17.92453360557556
 +18684 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 18.431763410568237
 +19186 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 18.933747053146362
 +19687 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 19.43473505973816
 +20195 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 19.94274401664734
 +20698 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 20.445739030838013
 +21202 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 20.94961166381836
 +21705 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 21.45273733139038
 +22213 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 21.96073269844055
  +22716 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 22.463761806488037
  +23219 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 22.96674084663391
  +23723 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 23.470768928527832
  +24233 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 23.980743169784546
  +24736 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 24.483736753463745
  +25239 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 24.9867525100708
  +25740 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 25.487753868103027
  +26244 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 25.991734981536865
  +26745 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 26.492750644683838
  +27248 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 26.995726823806763
 +27752 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 27.49959707260132
 +28265 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 28.01280903816223
 +28773 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 28.52074670791626
 +29276 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 29.023734092712402
 +29777 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 29.524730682373047
 +30282 7FE0EAB16B48 fail2ban                  HEAVY   wait-time: 30.029735803604126
 +30282 7FE0EAB16B48 fail2ban                  ERROR Failed to start server
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/fail2ban/client/fail2banclient.py", line 326, in __processStartStreamAfterWait
    if not self.__waitOnServer(): # pragma: no cover
  File "/usr/lib/python3.10/site-packages/fail2ban/client/fail2banclient.py", line 364, in __waitOnServer
    raise ServerExecutionException("Failed to start server")
fail2ban.client.fail2bancmdline.ServerExecutionException: Failed to start server
 +30284 7FE0EAB16B48 fail2ban                  ERROR Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it
 +30286 7FE0EAB16B48 fail2ban                  DEBUG Exit with code 255

The command stalls for 30 seconds, at which point a timeout seems to abort it.

Quick Fix: Setting a lower ulimit -n value (normally 1024 on a Linux OS; Docker containers instead set ulimit -n to over a billion by default) works around the bug (assuming the failure is due to attempting to perform roughly a billion system calls to close file descriptors).

Steps to reproduce

  1. With Docker installed, an Alpine container can reproduce this on a Fedora 36 VM guest (although I don't expect the host OS to matter much, at least for Linux hosts).
  2. docker run --rm -it alpine ash is sufficient, then install the fail2ban package: apk add fail2ban.
  3. fail2ban-client start will otherwise fail early due to the default SSH jail shipped by Alpine. Disable it in the config or remove it: rm /etc/fail2ban/jail.d/alpine-ssh.conf.
  4. Run fail2ban-client start; it should fail after an internal 30-second timeout and output the error shown above.

If you were not able to reproduce it for some reason, verify these container and host results:

# Docker container (alpine)
$ getconf -a | grep '^OPEN_MAX'

OPEN_MAX                           1073741816
# NOTE: `_POSIX_OPEN_MAX` on alpine is similar but set to 20,
# on Debian bullseye it's the same `1073741816` value.


# VM guest Fedora 36 (Docker host)
$ getconf -a | grep '^OPEN_MAX'

OPEN_MAX                           1024
# NOTE: `_POSIX_OPEN_MAX` is also 1024 in this case.

# NOTE: `ulimit -n` and `sysctl fs.nr_open` also output the same value.
# While `ulimit -n` will reflect an adjusted value,
# `sysctl fs.nr_open` will always be `1073741816`, even outside the container.
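
The same limit can also be checked from Python itself (i.e. the value the daemonizing process would actually see); a quick sketch using the standard resource module:

import resource

# Inside an unrestricted Docker container this tends to print something like
# (1073741816, 1073741816); on a typical host the soft limit is usually 1024.
print(resource.getrlimit(resource.RLIMIT_NOFILE))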

Expected behavior

Command should be successful:

$ fail2ban-client start

Server ready

Observed behavior

I recently debugged a similar problem with another package running in a Docker container; I believe it's the same issue here.

I am not sure where to look in the fail2ban source code, or whether it's an upstream dependency of fail2ban that is responsible (a possibility).

The problem is very likely due to an attempt to iterate through all file descriptors and close them while starting the server as a daemon, or some similar action.

AFAIK this is a common step when starting a daemon service, with the naive approach being to detect the FD limit from the system and iterate through every possible descriptor with a close() call.
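
For reference, the typical naive pattern looks roughly like this (a generic sketch of that daemonization step, not fail2ban's exact code):

import os
import resource

# Naive daemonization step: query the FD limit and close every FD below it.
# With Docker's default LimitNOFILE this can mean ~2^30 close() attempts.
maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
if maxfd == resource.RLIM_INFINITY:
    maxfd = 1024  # arbitrary fallback for the sketch
for fd in range(0, maxfd):
    try:
        os.close(fd)
    except OSError:  # FD wasn't open; ignore
        pass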

Any additional information

The cause (if I'm correct) is a fault of Docker's.

When running a command as the root user (common, as it's the implicit default) within the Docker container, ulimit -n and similar methods will report a value exceeding 1 billion. On actual host systems (and when running commands as a different user in the Docker container), a more sane limit of 1024 is usually reported (less on macOS AFAIK).

Docker (or rather containerd, I believe) apparently needs this high limit to support many containers running at the same time, but the containers themselves rarely need it AFAIK. A workaround is to either set the Docker daemon config (if it's possible to configure, e.g. not permitted in GitHub CI?) or set it per container with the --ulimit 'nofile=1024' option.

Within the running reproduction container, you can also lower this limit internally once (it'll fail if you try to adjust it again later) with ulimit -n 1024. The same command will then run successfully as expected.

The failure has nothing to do with what the error output suggests. In the other project that had this bug, using it within Docker resulted in an approximately 10-minute startup time (vs <1 second); it was a million times slower to start up due to this issue.

In the production container (Debian Bullseye), the issue is reproducible locally while I develop, but doesn't appear to be caught by our GitHub CI tests. That is probably due to GitHub configuring the Docker daemon to use a lower limit, or potentially the Ubuntu 20.04 environment used; I have not investigated that yet. (The limit comes from systemd service files in /usr/lib/systemd/system/ setting FD limits via LimitNOFILE=: Fedora sets docker.service and containerd.service to infinity, while Manjaro sets docker.service to 1048576.)

The issue itself has been known upstream in Docker for several years now, with multiple Python projects reported as hitting this issue when used in Docker containers.


Possible fix

If you can identify where this issue is occurring (in fail2ban, an upstream dependency, or Python itself), a fix could be:

  • To use the close_range() syscall if available, instead of issuing an individual close() syscall per FD.
  • As that may not be as widely available, a solution Red Hat took for rpm (which also had this problem) was to iterate through the entries in /proc/self/fd (see the sketch below).
  • Another project iterated through a fixed range of 1024 and stopped at the first error from close() (FD doesn't exist), assuming that no other FDs would exist beyond that point.

Meanwhile, a valid workaround for users is to add the --ulimit 'nofile=1024' option to their container config.
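
As a rough sketch of the /proc/self/fd approach (Linux-specific; the helper name, keep set, and 1024 fallback below are placeholders for illustration, not fail2ban code):

import os

def close_open_fds(keep=(0, 1, 2)):
    # Close only FDs that are actually open, by listing /proc/self/fd,
    # instead of iterating the entire (potentially ~1 billion) FD range.
    try:
        fds = [int(name) for name in os.listdir("/proc/self/fd")]
    except FileNotFoundError:
        # No /proc available: fall back to a bounded range.
        fds = list(range(3, 1024))
    for fd in fds:
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:  # already closed (e.g. the FD used for the listing itself)
            pass

For what it's worth, newer Python releases may also avoid the per-FD overhead by using the close_range() syscall inside os.closerange() where available, but the huge range would still need to be avoided on older interpreters.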
