Description
UPDATE:
I've since investigated and discovered the cause is here:
`fail2ban/fail2ban/server/server.py`, lines 917 to 935 at commit ff7fe57:
```python
# urandom should not be closed in Python 3.4.0. Fixed in 3.4.1
# http://bugs.python.org/issue21207
if sys.version_info[0:3] == (3, 4, 0): # pragma: no cover
	urandom_fd = os.open("/dev/urandom", os.O_RDONLY)
	for fd in range(0, maxfd):
		try:
			if not os.path.sameopenfile(urandom_fd, fd):
				os.close(fd)
		except OSError: # ERROR (ignore)
			pass
	os.close(urandom_fd)
else:
	os.closerange(0, maxfd)
# Redirect the standard file descriptors to /dev/null.
os.open("/dev/null", os.O_RDONLY)  # standard input (0)
os.open("/dev/null", os.O_RDWR)    # standard output (1)
os.open("/dev/null", os.O_RDWR)    # standard error (2)
return (True,)
```
- In environments like Docker, the FD limit may commonly be approx 1 million or 1 billion depending on the OS configuration (at least with systemd service files setting `LimitNOFILE=` for `docker.service`).
- Even though only a few standard FDs may be open, the excessive range will be iterated, consuming heavy CPU. In my experience this was 8 minutes for `os.closerange()` and approx 70 minutes for the alternative `os.close()` loop.
A workaround when possible is to set a lower limit for the container with the docker run option --ulimit. Some users have reported not being able to use this due to lack of permissions required in their environment.
Various solutions have been suggested in the discussion below, but the easiest approach may be to set a sane maximum to close, and log when that maximum is exceeded so users know to explicitly request a larger/unrestricted range if needed.
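A minimal sketch of that idea (the helper name and cap value are my own, for illustration only, not anything in fail2ban):

```python
import logging
import os
import resource

# Hypothetical cap -- chosen for illustration only.
MAXFD_CAP = 65536

def close_fds_capped(maxfd=None, cap=MAXFD_CAP):
    """Close inherited FDs up to a sane cap, logging when the real limit is larger."""
    if maxfd is None:
        # Soft limit on open file descriptors for this process.
        maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    if maxfd > cap:
        logging.warning(
            "RLIMIT_NOFILE is %d; only closing FDs below %d "
            "(request a larger/unrestricted range explicitly if needed)",
            maxfd, cap)
        maxfd = cap
    os.closerange(0, maxfd)
```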
Reproduction
This is taken from a later comment. With Docker installed this command should fail (at least on linux hosts with systemd):
$ docker run --rm -it alpine ash -c 'apk add fail2ban && rm /etc/fail2ban/jail.d/alpine-ssh.conf && fail2ban-client start && ps aux | grep fail2ban'
# ...
2022-08-16 05:20:19,777 fail2ban [11]: ERROR Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it

While this will be successful (the --ulimit option sets a reduced limit that is possible to iterate through before the 30 second default timeout):
$ docker run --rm -it --ulimit 'nofile=1024' alpine ash -c 'apk add fail2ban && rm /etc/fail2ban/jail.d/alpine-ssh.conf && fail2ban-client start && ps aux | grep fail2ban'
# ...
Server ready
# ...
14 root 0:00 {fail2ban-server} /usr/bin/python3 /usr/bin/fail2ban-server --async -b -s /var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ban.pid --loglevel INFO --logtarget /var/log/fail2ban.log --syslogsocket auto

If one knows how much time is needed to be successful, raising the timeout limit also works: fail2ban-client --timeout 600 start (10 minutes). Otherwise, when the command times out, the fail2ban-server process continues closing file descriptors and isn't killed by the timeout.
Original report
Environment:
Running within a Docker container (Debian Bullseye for production, but have reproduced in Alpine). Docker daemon is running on a VM guest of Fedora 36 (hypervisor VMware Workstation 16.3) on an Arch Linux host system (x86_64).
- Fail2Ban version: fail2ban-0.11.2-r2
- OS, including release name/version: Alpine Linux v3.16 (Docker container)
- Fail2Ban installed via OS/distribution mechanisms
- You have not applied any additional foreign patches to the codebase
- Some customizations were done to the configuration (provide details below if so)
The issue:
Failure occurs when attempting to start fail2ban in a Docker container:
$ fail2ban-client start
2022-08-16 04:16:46,049 fail2ban [28]: ERROR Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it

Log output (with `-vvv -x`):
$ fail2ban-client -vvv -x start
+ 141 7FE0EAB16B48 fail2ban.configreader INFO Loading configs for fail2ban under /etc/fail2ban
+ 143 7FE0EAB16B48 fail2ban.configreader DEBUG Reading configs for fail2ban under /etc/fail2ban
+ 144 7FE0EAB16B48 fail2ban.configreader DEBUG Reading config files: /etc/fail2ban/fail2ban.conf
+ 145 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/fail2ban.conf']
+ 145 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/fail2ban.conf
+ 146 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/fail2ban.conf']
+ 148 7FE0EAB16B48 fail2ban.configparserinc TRACE Shared file: /etc/fail2ban/fail2ban.conf
+ 149 7FE0EAB16B48 fail2ban INFO Using socket file /var/run/fail2ban/fail2ban.sock
+ 149 7FE0EAB16B48 fail2ban INFO Using pid file /var/run/fail2ban/fail2ban.pid, [INFO] logging to /var/log/fail2ban.log
+ 150 7FE0EAB16B48 fail2ban HEAVY -- ['ping', 0.1] failed -- FileNotFoundError(2, 'No such file or directory')
+ 151 7FE0EAB16B48 fail2ban.configreader INFO Loading configs for jail under /etc/fail2ban
+ 152 7FE0EAB16B48 fail2ban.configreader DEBUG Reading configs for jail under /etc/fail2ban
+ 154 7FE0EAB16B48 fail2ban.configreader DEBUG Reading config files: /etc/fail2ban/jail.conf, /etc/fail2ban/jail.d/alpine-ssh.conf
+ 155 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/jail.conf']
+ 155 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/jail.conf
+ 169 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/paths-debian.conf']
+ 171 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/paths-debian.conf
+ 173 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/paths-common.conf']
+ 174 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/paths-common.conf
+ 188 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/paths-overrides.local']
+ 194 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/paths-overrides.local
+ 200 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/jail.d/alpine-ssh.conf']
+ 206 7FE0EAB16B48 fail2ban.configparserinc TRACE Reading file: /etc/fail2ban/jail.d/alpine-ssh.conf
+ 207 7FE0EAB16B48 fail2ban.configparserinc INFO Loading files: ['/etc/fail2ban/paths-common.conf', '/etc/fail2ban/paths-debian.conf', '/etc/fail2ban/jail.conf', '/etc/fail2ban/jail.d/alpine-ssh.conf']
+ 207 7FE0EAB16B48 fail2ban.configparserinc TRACE Shared file: /etc/fail2ban/paths-common.conf
+ 207 7FE0EAB16B48 fail2ban.configparserinc TRACE Shared file: /etc/fail2ban/paths-debian.conf
+ 207 7FE0EAB16B48 fail2ban.configparserinc TRACE Shared file: /etc/fail2ban/jail.conf
+ 208 7FE0EAB16B48 fail2ban.configparserinc TRACE Shared file: /etc/fail2ban/jail.d/alpine-ssh.conf
+ 249 7FE0EAB16B48 fail2ban DEBUG async starting of server in 23, fork: True - 24
+ 252 7FE0EAB16B48 fail2ban HEAVY __waitOnServer: (True, 30)
+ 253 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.0009500980377197266
+ 267 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.014755487442016602
+ 287 7FE0EAB16B48 fail2ban DEBUG async starting of server in 24, fork: True - 0
+ 290 7FE0EAB16B48 fail2ban DEBUG Starting '/usr/bin/fail2ban-server' with args ['fail2ban-server', '--async', '-b', '-s', '/var/run/fail2ban/fail2ban.sock', '-p', '/var/run/fail2ban/fail2ban.pid', '-x', '-vvv', '--loglevel', 'INFO', '--logtarget', '/var/log/fail2ban.log', '--syslogsocket', 'auto']
+ 295 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.04277944564819336
+ 348 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.09576845169067383
+ 69 7FBB6DACAB48 fail2ban INFO Using socket file /var/run/fail2ban/fail2ban.sock
+ 73 7FBB6DACAB48 fail2ban INFO Using pid file /var/run/fail2ban/fail2ban.pid, [INFO] logging to /var/log/fail2ban.log
+ 74 7FBB6DACAB48 fail2ban DEBUG direct starting of server in 24, deamon: True
+ 457 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.2051231861114502
+ 151 7FBB6DACAB48 fail2ban.server INFO Starting in daemon mode
+ 153 7FBB6DACAB48 fail2ban DEBUG Exit with code 0
+ 658 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.4056081771850586
+ 1065 7FE0EAB16B48 fail2ban HEAVY wait-time: 0.8127536773681641
+ 1567 7FE0EAB16B48 fail2ban HEAVY wait-time: 1.3147380352020264
+ 2073 7FE0EAB16B48 fail2ban HEAVY wait-time: 1.8207433223724365
+ 2574 7FE0EAB16B48 fail2ban HEAVY wait-time: 2.321540355682373
+ 3075 7FE0EAB16B48 fail2ban HEAVY wait-time: 2.8223278522491455
+ 3581 7FE0EAB16B48 fail2ban HEAVY wait-time: 3.3287596702575684
+ 4084 7FE0EAB16B48 fail2ban HEAVY wait-time: 3.8317525386810303
+ 4587 7FE0EAB16B48 fail2ban HEAVY wait-time: 4.3347508907318115
+ 5088 7FE0EAB16B48 fail2ban HEAVY wait-time: 4.835473299026489
+ 5591 7FE0EAB16B48 fail2ban HEAVY wait-time: 5.338402271270752
+ 6091 7FE0EAB16B48 fail2ban HEAVY wait-time: 5.838769435882568
+ 6599 7FE0EAB16B48 fail2ban HEAVY wait-time: 6.346776962280273
+ 7102 7FE0EAB16B48 fail2ban HEAVY wait-time: 6.849779844284058
+ 7606 7FE0EAB16B48 fail2ban HEAVY wait-time: 7.353386640548706
+ 8109 7FE0EAB16B48 fail2ban HEAVY wait-time: 7.856667757034302
+ 8610 7FE0EAB16B48 fail2ban HEAVY wait-time: 8.35775375366211
+ 9113 7FE0EAB16B48 fail2ban HEAVY wait-time: 8.860753774642944
+ 9614 7FE0EAB16B48 fail2ban HEAVY wait-time: 9.361480474472046
+10118 7FE0EAB16B48 fail2ban HEAVY wait-time: 9.865747928619385
+10620 7FE0EAB16B48 fail2ban HEAVY wait-time: 10.367749452590942
+11127 7FE0EAB16B48 fail2ban HEAVY wait-time: 10.874731302261353
+11628 7FE0EAB16B48 fail2ban HEAVY wait-time: 11.375521421432495
+12131 7FE0EAB16B48 fail2ban HEAVY wait-time: 11.878766298294067
+12634 7FE0EAB16B48 fail2ban HEAVY wait-time: 12.381748676300049
+13140 7FE0EAB16B48 fail2ban HEAVY wait-time: 12.887734413146973
+13643 7FE0EAB16B48 fail2ban HEAVY wait-time: 13.390751361846924
+14152 7FE0EAB16B48 fail2ban HEAVY wait-time: 13.89975881576538
+14653 7FE0EAB16B48 fail2ban HEAVY wait-time: 14.400509119033813
+15156 7FE0EAB16B48 fail2ban HEAVY wait-time: 14.903644323348999
+15660 7FE0EAB16B48 fail2ban HEAVY wait-time: 15.407699823379517
+16161 7FE0EAB16B48 fail2ban HEAVY wait-time: 15.908511638641357
+16666 7FE0EAB16B48 fail2ban HEAVY wait-time: 16.413676023483276
+17173 7FE0EAB16B48 fail2ban HEAVY wait-time: 16.920743465423584
+17676 7FE0EAB16B48 fail2ban HEAVY wait-time: 17.423735857009888
+18177 7FE0EAB16B48 fail2ban HEAVY wait-time: 17.92453360557556
+18684 7FE0EAB16B48 fail2ban HEAVY wait-time: 18.431763410568237
+19186 7FE0EAB16B48 fail2ban HEAVY wait-time: 18.933747053146362
+19687 7FE0EAB16B48 fail2ban HEAVY wait-time: 19.43473505973816
+20195 7FE0EAB16B48 fail2ban HEAVY wait-time: 19.94274401664734
+20698 7FE0EAB16B48 fail2ban HEAVY wait-time: 20.445739030838013
+21202 7FE0EAB16B48 fail2ban HEAVY wait-time: 20.94961166381836
+21705 7FE0EAB16B48 fail2ban HEAVY wait-time: 21.45273733139038
+22213 7FE0EAB16B48 fail2ban HEAVY wait-time: 21.96073269844055
+22716 7FE0EAB16B48 fail2ban HEAVY wait-time: 22.463761806488037
+23219 7FE0EAB16B48 fail2ban HEAVY wait-time: 22.96674084663391
+23723 7FE0EAB16B48 fail2ban HEAVY wait-time: 23.470768928527832
+24233 7FE0EAB16B48 fail2ban HEAVY wait-time: 23.980743169784546
+24736 7FE0EAB16B48 fail2ban HEAVY wait-time: 24.483736753463745
+25239 7FE0EAB16B48 fail2ban HEAVY wait-time: 24.9867525100708
+25740 7FE0EAB16B48 fail2ban HEAVY wait-time: 25.487753868103027
+26244 7FE0EAB16B48 fail2ban HEAVY wait-time: 25.991734981536865
+26745 7FE0EAB16B48 fail2ban HEAVY wait-time: 26.492750644683838
+27248 7FE0EAB16B48 fail2ban HEAVY wait-time: 26.995726823806763
+27752 7FE0EAB16B48 fail2ban HEAVY wait-time: 27.49959707260132
+28265 7FE0EAB16B48 fail2ban HEAVY wait-time: 28.01280903816223
+28773 7FE0EAB16B48 fail2ban HEAVY wait-time: 28.52074670791626
+29276 7FE0EAB16B48 fail2ban HEAVY wait-time: 29.023734092712402
+29777 7FE0EAB16B48 fail2ban HEAVY wait-time: 29.524730682373047
+30282 7FE0EAB16B48 fail2ban HEAVY wait-time: 30.029735803604126
+30282 7FE0EAB16B48 fail2ban ERROR Failed to start server
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/fail2ban/client/fail2banclient.py", line 326, in __processStartStreamAfterWait
if not self.__waitOnServer(): # pragma: no cover
File "/usr/lib/python3.10/site-packages/fail2ban/client/fail2banclient.py", line 364, in __waitOnServer
raise ServerExecutionException("Failed to start server")
fail2ban.client.fail2bancmdline.ServerExecutionException: Failed to start server
+30284 7FE0EAB16B48 fail2ban ERROR Could not start server. Maybe an old socket file is still present. Try to remove /var/run/fail2ban/fail2ban.sock. If you used fail2ban-client to start the server, adding the -x option will do it
+30286 7FE0EAB16B48 fail2ban DEBUG Exit with code 255

The command stalls for 30 seconds until a timeout appears to abort it.
Quick Fix: Setting a lower ulimit -n value (normally 1024 for a Linux OS; Docker containers instead set ulimit -n to over a billion by default) works around the bug (assuming the failure is due to attempting to perform a billion system calls to close file descriptors).
Steps to reproduce
- With Docker installed, an Alpine container can reproduce this on a Fedora 36 VM guest (although I don't expect the host OS to matter much, at least for Linux hosts). `docker run --rm -it alpine ash` is sufficient, then install the `fail2ban` package: `apk add fail2ban`.
- The command will fail early due to the default jail being shipped for SSH. Disable it in the config or remove it: `rm /etc/fail2ban/jail.d/alpine-ssh.conf`.
- Run the command `fail2ban-client start`; it should fail with a timeout internally after 30 seconds and output the error shown above.
If you were not able to reproduce for some reason, verify these container and host results:
# Docker container (alpine)
$ getconf -a | grep '^OPEN_MAX'
OPEN_MAX 1073741816
# NOTE: `_POSIX_OPEN_MAX` on alpine is similar but set to 20,
# on Debian bullseye it's the same `1073741816` value.
# VM guest Fedora 36 (Docker host)
$ getconf -a | grep '^OPEN_MAX'
OPEN_MAX 1024
# NOTE: `_POSIX_OPEN_MAX` is also 1024 in this case.
# NOTE: Inside the container, `ulimit -n` and `sysctl fs.nr_open` also output the same value.
# While `ulimit -n` will reflect an adjusted value,
# `sysctl fs.nr_open` will always be `1073741816`, even outside the container.
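The same limit can also be checked from Python, which is roughly the value the daemonizing code above ends up iterating over (a small diagnostic snippet of my own, not fail2ban code; `resource` is in the standard library):

```python
import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)  # e.g. 1073741816 1073741816 inside the container, 1024 1024 on the host
```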
Expected behavior
Command should be successful:
$ fail2ban-client start
Server ready

Observed behavior
I recently debugged a similar problem with another package running in a Docker container; I believe it's the same issue here.
I am not sure where to look in the fail2ban source code, or whether it's in an upstream dependency of fail2ban (a possibility).
The problem is very likely due to an attempt to iterate through all file descriptors and close them while starting the server as a daemon, or some similar action.
AFAIK this is a common step when starting a daemon service, with the naive approach being to detect the FD limit from the system and iterate through every descriptor with a close() call.
Any additional information
The cause (if I'm correct) is a fault of Docker's defaults.
When running a command as the root user (common, as it's the implicit default) within the Docker container, ulimit -n and similar methods will report a value exceeding 1 billion. On actual host systems (and when running commands as a different user in the Docker container), a more sane limit of 1024 is usually reported (less on macOS AFAIK).
Docker (or rather containerd, I believe) apparently needs this high limit to support many containers running at the same time, but the containers themselves rarely need it AFAIK. A workaround is either to set the Docker daemon config (if possible to configure, e.g. not permitted in GitHub CI?) or to set it per container with the --ulimit 'nofile=1024' option.
Within the running container for reproduction, you can also adjust this limit internally once (it'll fail if you try to adjust it again later) with ulimit -n 1024. Now the same command will run successfully as expected.
The failure has nothing to do with what the error output is suggesting. In the other project that had this bug, using it within Docker resulted in approximately 10 minute startup time (vs <1 second). It was a million times slower starting up due to this issue.
In the production container (Debian Bullseye), the issue is reproducible locally while I develop, but doesn't appear to be caught by our GitHub CI tests. That is probably due to GitHub configuring the Docker daemon to use a lower limit, or potentially the Ubuntu 20.04 environment used; I have not investigated that yet. (The high limit comes from systemd services configured in /usr/lib/systemd/system/ setting FD limits via LimitNOFILE=. Fedora sets docker.service and containerd.service to infinity, while Manjaro sets docker.service to 1048576.)
The issue itself is known upstream in Docker for several years now, with multiple python projects reported as hitting this issue when used in Docker containers.
Possible fix
If you can identify where this issue is occurring (in fail2ban, an upstream dependency, or Python itself), a fix could be:
- To use `close_range()` if available, instead of the `close()` syscall loop (see the sketch after this list).
- As that may not be as widely available, a solution Red Hat took for `rpm` (which also had this problem) was to iterate through the entries in `/proc/self/fd`.
- Another project iterated a fixed 1024 range and continued until the first error from `close()` (FD doesn't exist), assuming that no other FDs would exist beyond that.
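A rough sketch combining the first two ideas (my own illustration under those assumptions, not what fail2ban currently does; `os.close_range()` exists since Python 3.10):

```python
import os

def close_open_fds(min_fd=3, max_fd=2**20):
    """Best-effort close of inherited FDs without looping over the full RLIMIT_NOFILE range.

    Sketch only: prefer os.close_range() where available, otherwise close just the
    FDs actually listed under /proc/self/fd (Linux) instead of every possible FD.
    """
    if hasattr(os, "close_range"):
        os.close_range(min_fd, max_fd)
        return
    try:
        open_fds = [int(name) for name in os.listdir("/proc/self/fd")]
    except FileNotFoundError:
        # No procfs (e.g. non-Linux): fall back to a small fixed range, as in the third idea.
        open_fds = list(range(min_fd, 1024))
    for fd in open_fds:
        if fd < min_fd:
            continue
        try:
            os.close(fd)
        except OSError:
            pass
```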
Meanwhile, a valid workaround for users is to add the --ulimit 'nofile=1024' option to their container config.