
Fix wsgi.server shutdown for in-flight requests#912

Merged
4383 merged 8 commits into eventlet:master from tipabu:wsgi-set-idle
Mar 12, 2024

Conversation

@tipabu
Contributor

@tipabu tipabu commented Feb 5, 2024

Cloning from #590 (which still has some relevant discussion) -- the source repo got deleted, which prevents me from resolving the conflicts on the original PR. Original description (mostly) follows:


While doing some work on OpenStack Swift, I noticed that in-flight
requests were getting their sockets closed by eventlet when the listen
socket was closed. In that case, there is a server process which bound
the listen socket and N fork'ed off worker processes which all run the
wsgi.server loop calling accept() and handling requests.

I noticed that nothing was ever setting connection state to
STATE_REQUEST (i.e. unused constant in previous patch). Upon closer
inspection, it turns out that the socket cleanup code for STATE_IDLE
sockets got applied to the socket with an in-progress request. This
caused code using the socket to get an EPIPE and exit the request's
greenthread without completing the request.
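The fix amounts to tracking per-connection state so that shutdown-time cleanup only touches sockets that are idle between requests. Below is a minimal stdlib sketch of the idea; the state names echo the constants discussed here, but the data structure and helper functions are illustrative, not eventlet's actual code:

```python
import socket

# Illustrative connection states (named after the constants discussed above).
STATE_IDLE = "idle"        # between requests: safe to close on shutdown
STATE_REQUEST = "request"  # request in flight: must be left alone
STATE_CLOSE = "close"      # finished, scheduled for close

connections = {}  # sock -> state

def start_request(sock):
    connections[sock] = STATE_REQUEST

def finish_request(sock):
    connections[sock] = STATE_IDLE

def shutdown_idle():
    # On listen-socket close, shut down only idle keep-alive sockets;
    # in-flight requests keep their sockets and finish normally.
    for sock, state in list(connections.items()):
        if state == STATE_IDLE:
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass
            connections[sock] = STATE_CLOSE
```

Without the STATE_REQUEST check, the cleanup pass would also shut down the in-flight socket, producing exactly the EPIPE described above.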

With some instrumentation in the code, without this patch's fix, the new
test observed the code doing this:

(7544) wsgi starting up on http://127.0.0.1:36745/
(7544) calling accept
(7544) accept got <eventlet.greenio.base.GreenSocket object at 0x7f7650d73290>, ('127.0.0.1', 50194)
(7544) accepted ('127.0.0.1', 50194)
(7544) completed req ('127.0.0.1', 50194)
(7544) calling accept
(7544) got exception 22
(7544) re-raising!
(7544) wsgi exiting, calling pool.waitall()
Traceback (most recent call last):
File "/tmp/swiftstack-eventlet/eventlet/wsgi.py", line 602, in handle_one_response
write(b'')
File "/tmp/swiftstack-eventlet/eventlet/wsgi.py", line 541, in write
wfile.flush()
File "/usr/lib64/python2.7/socket.py", line 303, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
File "/tmp/swiftstack-eventlet/eventlet/greenio/base.py", line 403, in sendall
tail = self.send(data, flags)
File "/tmp/swiftstack-eventlet/eventlet/greenio/base.py", line 397, in send
return self._send_loop(self.fd.send, data, flags)
File "/tmp/swiftstack-eventlet/eventlet/greenio/base.py", line 384, in _send_loop
return send_method(data, *args)
error: [Errno 32] Broken pipe

127.0.0.1 - - [31/Oct/2019 01:08:05] "PUT /foo-bar HTTP/1.1" 200 0 0.000696
(7544) wsgi exited, is_accepting=True

In the new tests, not having the fix manifests as all greenthreads
exiting before the client finishes the in-flight request.

dbishop and others added 5 commits October 30, 2019 18:49

For the WebsocketWSGI app, it has a private class attribute that tells
the wsgi.server() code to consider its "requests" always idle, which I
believe was the primary intent of the original patch.
...which apparently won't let you interpolate ints into bytestrings.

Also remove some extraneous comments in the test.
Reduce the chance of regression in the future and show that the client socket is
unusable after an in-flight request is finished.
@codecov

codecov bot commented Feb 5, 2024

Codecov Report

Attention: Patch coverage is 77.77778%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 56%. Comparing base (be36320) to head (389972f).

Files Patch % Lines
eventlet/wsgi.py 75% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##           master   #912   +/-   ##
=====================================
  Coverage      56%    56%           
=====================================
  Files          89     89           
  Lines        9741   9748    +7     
  Branches     1812   1812           
=====================================
+ Hits         5463   5471    +8     
- Misses       3903   3905    +2     
+ Partials      375    372    -3     
Flag Coverage Δ
ipv6 23% <11%> (-1%) ⬇️
py310asyncio 53% <77%> (+<1%) ⬆️
py310epolls 53% <77%> (+<1%) ⬆️
py310poll 53% <77%> (+<1%) ⬆️
py310selects 53% <77%> (+<1%) ⬆️
py311asyncio 53% <77%> (+<1%) ⬆️
py311epolls 53% <77%> (+<1%) ⬆️
py312asyncio 50% <77%> (+<1%) ⬆️
py312epolls 51% <77%> (+<1%) ⬆️
py37asyncio 50% <77%> (+<1%) ⬆️
py37epolls 51% <77%> (+<1%) ⬆️
py38asyncio 51% <77%> (+<1%) ⬆️
py38epolls 53% <77%> (+<1%) ⬆️
py38openssl 52% <77%> (+<1%) ⬆️
py38poll 53% <77%> (+<1%) ⬆️
py38selects 53% <77%> (+<1%) ⬆️
py39asyncio 51% <77%> (+<1%) ⬆️
py39dnspython1 51% <77%> (+<1%) ⬆️
py39epolls 53% <77%> (+<1%) ⬆️
py39poll 53% <77%> (+<1%) ⬆️
py39selects 53% <77%> (+<1%) ⬆️

Flags with carried forward coverage won't be shown.

@4383
Member

4383 commented Feb 12, 2024

All the asyncio-based env tests are failing with all the supported versions of Python.
Apparently the error comes from tests/wsgi_test.py::TestHttpd::test_close_idle_connections_listen_socket_closed. The logs show Error: The operation was canceled. This tox behavior seems related to a timeout. The same thing happens locally.

If I strace the process I can see tons of:

accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffd71a6a2e0, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)

The socket file descriptor is not an open file descriptor (https://man.archlinux.org/man/accept4.2.en#EBADF). In other words, a socket is closed when using the asyncio context. This is why this test seems stuck, and this is why the tox-asyncio jobs timed out in our GitHub Actions.

I'm not sure which socket is the problem (the client socket, the server socket, or both). Apparently only one socket seems to be in trouble.

I don't know why this new test is not compatible with asyncio. @itamarst: any idea?

Outside of the asyncio context, I ran local tests by removing the changes made in eventlet/ and running the tests, and I confirm that without this fix the given test test_close_idle_connections_listen_socket_closed fails.

@4383
Member

4383 commented Feb 12, 2024

Here is a more or less complete strace corresponding to the failing test:

connect(15, {sa_family=AF_INET, sin_port=htons(36945), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
epoll_wait(11, [], 2, 0)                = 0
epoll_wait(11, [], 2, 0)                = 0
setsockopt(30, SOL_TCP, TCP_QUICKACK, [1], 4) = 0
recvfrom(30, 0x5649eb516f60, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
newfstatat(30, "", {st_mode=S_IFSOCK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0
epoll_ctl(11, EPOLL_CTL_ADD, 30, {events=EPOLLIN, data={u32=30, u64=140243567116318}}) = 0
sendto(15, "PUT /foo-bar HTTP/1.1\r\nHost: loc"..., 62, 0, NULL, 0) = 62
epoll_wait(11, [{events=EPOLLIN, data={u32=30, u64=140243567116318}}], 3, 0) = 1
sendto(15, "ABCABCABCABCABCABCABCABCABCABC", 30, 0, NULL, 0) = 30
epoll_ctl(11, EPOLL_CTL_DEL, 30, 0x7ffcea0cfd7c) = 0
recvfrom(30, "PUT /foo-bar HTTP/1.1\r\nHost: loc"..., 8192, 0, NULL, NULL) = 92
getsockname(30, {sa_family=AF_INET, sin_port=htons(36945), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendto(30, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 164, 0, NULL, 0) = 164
recvfrom(30, 0x5649eb516f60, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
newfstatat(30, "", {st_mode=S_IFSOCK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0
epoll_ctl(11, EPOLL_CTL_ADD, 30, {events=EPOLLIN, data={u32=30, u64=140243567116318}}) = 0
epoll_wait(11, [], 3, 0)                = 0
recvfrom(15, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 8192, 0, NULL, NULL) = 164
epoll_wait(11, [], 3, 0)                = 0
shutdown(29, SHUT_RDWR)                 = 0
epoll_wait(11, [{events=EPOLLHUP, data={u32=29, u64=140243567116317}}], 3, 0) = 1
close(29)                               = 0
epoll_ctl(11, EPOLL_CTL_DEL, 29, 0x7ffcea0d115c) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffcea0d1b90, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffcea0d1b90, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)
...
accept4(-1, 0x7ffcea0d1b90, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)

@4383
Member

4383 commented Feb 12, 2024

Apparently the test starts to fail at line 1992, pool.waitall(). We can observe the shutdown and the socket close, and then the test starts failing.

@4383 4383 added the bug label Feb 13, 2024
@itamarst
Contributor

I will try to take a look.

@itamarst
Contributor

itamarst commented Feb 16, 2024

If you close a socket, accept() will return EBADF. You have code that runs accept() in a loop, forever (line 1034 of wsgi.py), and if there's EBADF it just... tries again (line 1043 of wsgi.py).

So, yeah, it'll take forever.

You either don't want to close a socket (how could that ever happen in real code?), or you want to add an else clause that breaks the loop on line 1044 of wsgi.py. However, it's possible that some of the errnos in ACCEPT_ERRNO are transient, so I don't know if that's correct behavior for all of them.
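The suggested behavior can be sketched as an accept loop that retries transient errnos but bails out on EBADF; the errno set and names here are illustrative, not eventlet's exact ACCEPT_ERRNO:

```python
import errno
import socket

# Errnos worth retrying. EBADF is deliberately NOT in this set: it means
# the listen fd itself is gone, so accept() can never succeed again.
TRANSIENT_ERRNOS = {errno.ECONNABORTED, errno.EAGAIN, errno.EWOULDBLOCK}

def accept_loop(listen_sock, handle):
    while True:
        try:
            client, addr = listen_sock.accept()
        except OSError as e:
            if e.errno in TRANSIENT_ERRNOS:
                continue   # transient error: try again
            break          # EBADF and friends: the listen socket is unusable
        handle(client, addr)
```

Calling accept_loop() on an already-closed socket then returns immediately instead of spinning forever, which is the behavior the asyncio jobs were missing.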

@4383
Member

4383 commented Feb 19, 2024

Indeed, that could explain why we see an infinite loop.

I wonder if using a weakset to store connections could help better handle the living connections, and hence avoid this case.

Apparently after the for loop (lines 1043-1044) we have a try/except block to close the socket, but it seems that we don't reach that point:

eventlet/eventlet/wsgi.py

Lines 1047 to 1056 in b6f6e7c

try:
    # NOTE: It's not clear whether we want this to leave the
    # socket open or close it. Use cases like Spawning want
    # the underlying fd to remain open, but if we're going
    # that far we might as well not bother closing sock at
    # all.
    sock.close()
except OSError as e:
    if support.get_errno(e) not in BROKEN_SOCK:
        traceback.print_exc()

Another concern I have: why does this problem only appear with the asyncio hub?

If I remember correctly, when I ran local tests I was not able to reproduce this problem with EVENTLET_HUB set to other available hubs.

@4383
Member

4383 commented Feb 19, 2024

I confirm that when I run my local tests against a hub other than asyncio, the test succeeds.

Below is an example with the epolls hub. We can observe that even when we hit an EBADF, we don't loop indefinitely as with the asyncio hub:

ioctl(14, FIONBIO, [1])                 = 0
accept4(13, 0x7ffc11730670, [16], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(11, EPOLL_CTL_ADD, 13, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=13, u64=140479790317581}}) = 0
epoll_ctl(11, EPOLL_CTL_DEL, 12, 0x7ffc1173170c) = 0
getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(37403), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc500f0f000
setsockopt(14, SOL_TCP, TCP_QUICKACK, [1], 4) = 0
recvfrom(14, 0x55ba4cad8e10, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=14, u64=140479790317582}}) = 0
sendto(12, "PUT /foo-bar HTTP/1.1\r\nHost: loc"..., 62, 0, NULL, 0) = 62
epoll_wait(11, [{EPOLLIN, {u32=14, u64=140479790317582}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 14, 0x7ffc1173011c) = 0
recvfrom(14, "PUT /foo-bar HTTP/1.1\r\nHost: loc"..., 8192, 0, NULL, NULL) = 62
getsockname(14, {sa_family=AF_INET, sin_port=htons(37403), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
recvfrom(14, 0x55ba4cad8e10, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=14, u64=140479790317582}}) = 0
sendto(12, "ABCABCABCABCABCABCABCABCABCABC", 30, 0, NULL, 0) = 30
epoll_wait(11, [{EPOLLIN, {u32=14, u64=140479790317582}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 14, 0x7ffc1172fe0c) = 0
recvfrom(14, "ABCABCABCABCABCABCABCABCABCABC", 8192, 0, NULL, NULL) = 30
sendto(14, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 164, 0, NULL, 0) = 164
recvfrom(14, 0x55ba4cad8e10, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=14, u64=140479790317582}}) = 0
recvfrom(12, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 8192, 0, NULL, NULL) = 164
epoll_wait(11, [], 1023, 0)             = 0
shutdown(13, SHUT_RDWR)                 = 0
epoll_wait(11, [{EPOLLHUP, {u32=13, u64=140479790317581}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 13, 0x7ffc1173070c) = 0
accept4(13, 0x7ffc11730670, [16], SOCK_CLOEXEC) = -1 EINVAL (Invalid argument)
shutdown(14, SHUT_RDWR)                 = 0
close(13)                               = 0
epoll_wait(11, [{EPOLLIN|EPOLLHUP, {u32=14, u64=140479790317582}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 14, 0x7ffc1173017c) = 0
recvfrom(14, "", 8192, 0, NULL, NULL)   = 0
shutdown(14, SHUT_RDWR)                 = 0
close(14)                               = 0
shutdown(-1, SHUT_RDWR)                 = -1 EBADF (Bad file descriptor)
munmap(0x7fc500f0f000, 16384)           = 0
close(12)                               = 0
munmap(0x7fc500f13000, 16384)           = 0
dup2(5, 1)                              = 1
dup2(7, 2)                              = 2
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(6, "", 8192)                       = 0
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
ftruncate(6, 0)                         = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(8, "", 8192)                       = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
ftruncate(8, 0)                         = 0
write(1, "\33[32m.\33[0m", 10)          = 10
dup2(6, 1)                              = 1
dup2(8, 2)                              = 2
dup2(5, 1)                              = 1
dup2(7, 2)                              = 2
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(6, "", 8192)                       = 0
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
ftruncate(6, 0)                         = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(8, "", 8192)                       = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
ftruncate(8, 0)                         = 0
ioctl(1, TIOCGWINSZ, {ws_row=17, ws_col=127, ws_xpixel=2032, ws_ypixel=544}) = 0
write(1, "\33[32m                           "..., 115) = 115
chdir("/home/dev/app")                  = 0
stat("/home/dev/app/.pytest_cache/v/cache", {st_mode=S_IFDIR|0755, st_size=50, ...}) = 0
openat(AT_FDCWD, "/home/dev/app/.pytest_cache/v/cache/nodeids", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 12
fstat(12, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ioctl(12, TCGETS, 0x7ffc11731f00)       = -1 ENOTTY (Inappropriate ioctl for device)
lseek(12, 0, SEEK_CUR)                  = 0
lseek(12, 0, SEEK_CUR)                  = 0
write(12, "[\n  \"tests/api_test.py::TestApi:"..., 46647) = 46647
close(12)                               = 0
openat(AT_FDCWD, "/home/dev/app/.pytest_cache/v/cache/lastfailed", O_RDONLY|O_CLOEXEC) = 12
fstat(12, {st_mode=S_IFREG|0644, st_size=341, ...}) = 0
ioctl(12, TCGETS, 0x7ffc11731ed0)       = -1 ENOTTY (Inappropriate ioctl for device)
lseek(12, 0, SEEK_CUR)                  = 0
lseek(12, 0, SEEK_CUR)                  = 0
fstat(12, {st_mode=S_IFREG|0644, st_size=341, ...}) = 0
read(12, "{\n  \"tests/greendns_test.py::Tes"..., 342) = 341
read(12, "", 1)                         = 0
close(12)                               = 0
stat("/home/dev/app/.pytest_cache/v/cache", {st_mode=S_IFDIR|0755, st_size=50, ...}) = 0
openat(AT_FDCWD, "/home/dev/app/.pytest_cache/v/cache/stepwise", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 12
fstat(12, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ioctl(12, TCGETS, 0x7ffc11732130)       = -1 ENOTTY (Inappropriate ioctl for device)
lseek(12, 0, SEEK_CUR)                  = 0
lseek(12, 0, SEEK_CUR)                  = 0
write(12, "[]", 2)                      = 2
close(12)                               = 0
write(1, "\n", 1)                       = 1
ioctl(1, TIOCGWINSZ, {ws_row=17, ws_col=127, ws_xpixel=2032, ws_ypixel=544}) = 0
write(1, "\n", 1)                       = 1
write(1, "\33[32m==========================="..., 164) = 164
close(10)                               = 0
rt_sigaction(SIGBUS, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc501a64d60}, NULL, 8) = 0
rt_sigaction(SIGILL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc501a64d60}, NULL, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc501a64d60}, NULL, 8) = 0
rt_sigaction(SIGABRT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc501a64d60}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc501a64d60}, NULL, 8) = 0
close(9)                                = 0
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(6, "", 8192)                       = 0
lseek(6, 0, SEEK_SET)                   = 0
lseek(6, 0, SEEK_CUR)                   = 0
ftruncate(6, 0)                         = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
read(8, "", 8192)                       = 0
lseek(8, 0, SEEK_SET)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
ftruncate(8, 0)                         = 0
dup2(5, 1)                              = 1
close(5)                                = 0
close(6)                                = 0
dup2(7, 2)                              = 2
close(7)                                = 0
close(8)                                = 0
dup2(3, 0)                              = 0
close(3)                                = 0
close(4)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7fc501a64d60}, {sa_handler=0x7fc501da25f9, sa_mask=[], sa_flags=SA_RESTORER|SA_ONSTACK, sa_restorer=0x7fc501a64d60}, 8) = 0
brk(0x55ba4cbd0000)                     = 0x55ba4cbd0000
sigaltstack(NULL, {ss_sp=0x55ba4c7c1a20, ss_flags=0, ss_size=11824}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
munmap(0x7fc4ff9bd000, 962560)          = 0
munmap(0x7fc502240000, 16384)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

At first glance, I don't see why this patch behaves differently depending on the hub in use.
The functional code and the unit test code seem hub-agnostic.

@4383
Member

4383 commented Feb 19, 2024

Here is an isolation of the strace for both hubs.

With the epolls hub:

recvfrom(14, "ABCABCABCABCABCABCABCABCABCABC", 8192, 0, NULL, NULL) = 30
sendto(14, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 164, 0, NULL, 0) = 164
recvfrom(14, 0x55ba4cad8e10, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=14, u64=140479790317582}}) = 0
recvfrom(12, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 8192, 0, NULL, NULL) = 164
epoll_wait(11, [], 1023, 0)             = 0
shutdown(13, SHUT_RDWR)                 = 0
epoll_wait(11, [{EPOLLHUP, {u32=13, u64=140479790317581}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 13, 0x7ffc1173070c) = 0
accept4(13, 0x7ffc11730670, [16], SOCK_CLOEXEC) = -1 EINVAL (Invalid argument)
shutdown(14, SHUT_RDWR)                 = 0
close(13)                               = 0
epoll_wait(11, [{EPOLLIN|EPOLLHUP, {u32=14, u64=140479790317582}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 14, 0x7ffc1173017c) = 0
recvfrom(14, "", 8192, 0, NULL, NULL)   = 0
shutdown(14, SHUT_RDWR)                 = 0
close(14)                               = 0
shutdown(-1, SHUT_RDWR)                 = -1 EBADF (Bad file descriptor)

With the asyncio hub:

sendto(15, "ABCABCABCABCABCABCABCABCABCABC", 30, 0, NULL, 0) = 30
epoll_ctl(11, EPOLL_CTL_DEL, 30, 0x7ffcea0cfd7c) = 0
recvfrom(30, "PUT /foo-bar HTTP/1.1\r\nHost: loc"..., 8192, 0, NULL, NULL) = 92
getsockname(30, {sa_family=AF_INET, sin_port=htons(36945), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
sendto(30, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 164, 0, NULL, 0) = 164
recvfrom(30, 0x5649eb516f60, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
newfstatat(30, "", {st_mode=S_IFSOCK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0
epoll_ctl(11, EPOLL_CTL_ADD, 30, {events=EPOLLIN, data={u32=30, u64=140243567116318}}) = 0
epoll_wait(11, [], 3, 0)                = 0
recvfrom(15, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 8192, 0, NULL, NULL) = 164
epoll_wait(11, [], 3, 0)                = 0
shutdown(29, SHUT_RDWR)                 = 0
epoll_wait(11, [{events=EPOLLHUP, data={u32=29, u64=140243567116317}}], 3, 0) = 1
close(29)                               = 0
epoll_ctl(11, EPOLL_CTL_DEL, 29, 0x7ffcea0d115c) = -1 EBADF (Bad file descriptor)
accept4(-1, 0x7ffcea0d1b90, [16], SOCK_CLOEXEC) = -1 EBADF (Bad file descriptor)

@tipabu
Contributor Author

tipabu commented Feb 19, 2024

Oh, interesting! I think the test is expecting the accept to fail following shutdown but before close (hence the sleep(0) in between, to give all running greenthreads a chance to notice the shut-down socket and raise errors). Notice the

shutdown(13, SHUT_RDWR)                 = 0
epoll_wait(11, [{EPOLLHUP, {u32=13, u64=140479790317581}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 13, 0x7ffc1173070c) = 0
accept4(13, 0x7ffc11730670, [16], SOCK_CLOEXEC) = -1 EINVAL (Invalid argument)
shutdown(14, SHUT_RDWR)                 = 0
close(13)                               = 0

in the posted strace.

It's not clear to me how (if?) asyncio handles error states. Seems like it'll depend on selector, but SelectSelector explicitly passes an empty list for xlist and PollSelector makes no reference to POLLERR or POLLHUP....
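As a side note, the shutdown-but-not-yet-closed window the test relies on can be reproduced with plain stdlib sockets; on Linux, accept() on a shut-down listen socket fails immediately with EINVAL, matching the accept4 line in the strace above (a standalone sketch, not test code from this PR):

```python
import errno
import socket

lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.bind(("127.0.0.1", 0))
lsock.listen(1)

# shutdown() leaves the fd open (its number stays valid) but makes accept()
# fail right away; this is the window between shutdown and close.
lsock.shutdown(socket.SHUT_RDWR)
try:
    lsock.accept()
    err = None
except OSError as e:
    err = e.errno
lsock.close()
# On Linux, err is errno.EINVAL here, just like accept4() in the strace.
```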

@4383
Member

4383 commented Feb 20, 2024

I think the main difference between the epolls hub and the asyncio hub resides in the management of the event mask that determines which events are monitored for the socket's file descriptor.

In the previous epolls strace we can observe that a file descriptor (14 in the strace) is registered with an epoll_event mask covering these event types: EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP.

Then we request the shutdown of this socket and start waiting for a ready file descriptor. Once this socket is ready, we remove its file descriptor from the interest list of the epfd 11.

shutdown(14, SHUT_RDWR)                 = 0
close(13)                               = 0
epoll_wait(11, [{EPOLLIN|EPOLLHUP, {u32=14, u64=140479790317582}}], 1023, 0) = 1
epoll_ctl(11, EPOLL_CTL_DEL, 14, 0x7ffc1173017c) = 0
recvfrom(14, "", 8192, 0, NULL, NULL)   = 0
shutdown(14, SHUT_RDWR)                 = 0
close(14)       

Apparently, when using the asyncio hub, the epoll_event mask is simply set to EPOLLIN (data other than high-priority data can be read), so high-priority data, error, and hangup events are all ignored, which may explain why it blocks/loops.

https://www.man7.org/linux/man-pages/man2/epoll_ctl.2.html

The complete epoll_ctl call when using the epolls hub:

epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=14, u64=140479790317582}}) = 0

I'm far from being a low-level specialist, and someone more skilled than me should double-check what I say here.
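For what it's worth, the mask difference is easy to observe with the stdlib epoll interface (a Linux-only standalone snippet, not eventlet code): register the full mask the epolls hub uses, close the peer, and the hangup condition shows up in the poll result.

```python
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
# The mask used by the epolls hub, as seen in the strace above. The kernel
# reports EPOLLERR/EPOLLHUP even when not requested, but the hub still has
# to notice and act on them when they show up.
ep.register(a.fileno(),
            select.EPOLLIN | select.EPOLLPRI | select.EPOLLERR | select.EPOLLHUP)

b.close()  # peer goes away: the kernel flags a's fd with EPOLLHUP
events = dict(ep.poll(1))
hup = bool(events.get(a.fileno(), 0) & select.EPOLLHUP)

ep.close()
a.close()
```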

4383 added a commit to 4383/eventlet that referenced this pull request Mar 1, 2024
We recently observed (eventlet#912) a difference in behavior between
the asyncio hub and the other eventlet hubs.

The asyncio hub behaved differently because bad file descriptors
were not properly removed from the list of readers and writers
listeners.

To clean up these bad file descriptors we have to implement
an awaitable wait method specific to the asyncio hub.

This patch should fix eventlet#912.

I prefer to propose this patch as a separated patch
to isolate topics and simplify reviews.

eventlet#912
@4383
Member

4383 commented Mar 1, 2024

The root cause of the error seen when using the asyncio hub is that all the eventlet hubs, apart from the asyncio hub, implement bad file descriptor removal.

In the tests we close the socket, so its file descriptor is also closed; since the asyncio hub does not remove bad file descriptors from its listeners, the server keeps iterating over this socket and trying to access it, which is why we see the infinite loop of accept syscalls.

The other hubs implement mechanisms to remove these bad file descriptors, which is why their tests pass without problems.

I just submitted a patch to implement this missing feature: #924

For now, my PR fixes the problem we face here. Local tests are successful for this use case. However, the asyncio jobs are failing due to a side effect of my proposal; I need to dig deeper to see how to fix that.

I don't think we can consider the asyncio hub production ready while this problem remains unsolved. Hopefully we won't face similar issues elsewhere.

IMO, from an end user's point of view, all hubs should behave the same way.

@itamarst
Contributor

itamarst commented Mar 7, 2024

Reading the man page for accept(), I think the "just ignore EBADF" behavior in wsgi.py is a bug.

@itamarst itamarst requested a review from 4383 March 7, 2024 15:50
@itamarst
Contributor

itamarst commented Mar 7, 2024

@4383 back to you.

try:
    import ssl
    ACCEPT_EXCEPTIONS = (socket.error, ssl.SSLError)
    ACCEPT_ERRNO = {errno.EPIPE, errno.EBADF, errno.ECONNRESET,
Contributor


The man page for accept() on Linux says "EBADF: sockfd is not an open file descriptor." I.e. it's not a recoverable error.

Member

@4383 4383 Mar 8, 2024


Indeed, that seems logic. I think you are right, this kind of error (bad file descriptor), is not recoverable. I don't see how a server could recover from this kind of error.

By doing my previous tests, it seems that in this case (EBADF), the wsgi server is terminated properly without raising more attention than that.

In parallel, I think we should also consider that the wsgi server is only one piece of the architecture of our users. Some people may launch several greenthreads dedicated to various task, and the wsgi server greenthread could be one of them, so if the server exit abruptly, the global process will be affected and then all the other greenthreads would also exit abruptly, then killing all the other background jobs. I think this scenario is more or less the reason behind the implementation of the logic of EBADF management we observe in various hubs.

However, this logic more or less hides wsgi server errors. Without digging too far, developers would only observe this kind of failure on the client side, when requests are no longer successful.

From a monitoring perspective, if developers monitor their wsgi server's process, then with the current behaviour they would not be able to observe a failure on this process, as the overall process could continue to run thanks to the other greenthreads.

I'd argue that when we face this kind of scenario (EBADF), we should fail as early as possible.

IMO, rather than implementing dedicated bad-file-descriptor logic in the various hubs, the wsgi module should exit with an error to draw the developer's attention to the problem.

The current hub behaviour seems more like a convenience to allow polymorphic architectures in the same process: a kind of fault tolerance. I'd rather encourage managing this kind of error with tools like Pacemaker.

Thoughts?
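The monitoring concern described above can be illustrated with plain stdlib threads standing in for greenthreads (all names here are hypothetical, not eventlet API): if the server loop swallows its fatal error, the process stays alive on the strength of the other workers, so process-level monitoring sees nothing wrong.

```python
import threading
import time

def server_loop(state):
    # Stand-in for a wsgi.server accept loop hitting EBADF.
    try:
        raise OSError("accept: bad file descriptor")
    except OSError:
        # Swallowing the error ends only this thread; the process,
        # and therefore any process-level monitor, sees no failure.
        state["server_alive"] = False

def background_job(stop):
    # Stand-in for an unrelated background greenthread.
    while not stop.is_set():
        time.sleep(0.01)

state = {"server_alive": True}
stop = threading.Event()
t_server = threading.Thread(target=server_loop, args=(state,))
t_job = threading.Thread(target=background_job, args=(stop,))
t_server.start()
t_job.start()
t_server.join()
print(state["server_alive"])  # False: the server loop died silently
print(t_job.is_alive())       # True: the process keeps running regardless
stop.set()
t_job.join()
```

This is the failure mode 4383 is describing: the server is effectively dead while the process, and any monitoring attached to it, reports healthy.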

Member

@4383 4383 left a comment


LGTM, removing EBADF from accepted errors fixed the issue we discussed previously. Thanks

@4383
Member

4383 commented Mar 8, 2024

@tipabu: if you are ok with the latest commit added by Itamar, then feel free to push the squash and merge button, else feel free to raise your concerns. Thank you.

@4383
Member

4383 commented Mar 11, 2024

@tipabu, @itamarst: The EBADF changes are not semantically close to the original intent of this PR; they fix a side behavior observed during debugging.

As the commit introduced by Itamar is not really related to the changes initially proposed by this PR, and as I think the EBADF change is an important modification to the behaviour of the hubs and of the wsgi module, I'd suggest using "rebase and merge" rather than "squash and merge".

Indeed, we may well face side effects from the EBADF removal, so for the sake of clarity, and to keep a usable git history (git bisect, changelog), merging this PR that way would surely simplify our lives.

Alternatively, the EBADF changes could be proposed as a separate PR and merged first, but in the end the result would be the same, i.e. a dedicated commit in the git history for the EBADF topic. Using "rebase and merge" here would avoid the gymnastics of creating a new PR.

Any concerns?

@itamarst
Contributor

I don't have a strong opinion; I'd do whatever is easiest.

@4383
Member

4383 commented Mar 12, 2024

Well, even after rebasing master into this branch, GitHub's rebase feature doesn't work. Locally I'm able to properly merge this branch into master with no fast-forward, but apparently it doesn't work in GitHub's plumbing, I don't know why, so let's squash all commits and write a meaningful commit message.

@4383 4383 merged commit 1f3db00 into eventlet:master Mar 12, 2024
@4383 4383 mentioned this pull request Mar 15, 2024
clrpackages pushed a commit to clearlinux-pkgs/pypi-eventlet that referenced this pull request Jul 10, 2024
…rsion 0.36.1

0.36.1
======

* [fix] eventlet.websocket is not always used from eventlet.wsgi, so do not assume eventlet.set_idle exists eventlet/eventlet#949

0.36.0
======

* [fix] Make sure asyncio hub doesn't use greendns for asyncio DNS APIs eventlet/eventlet#938
* [fix] Make asyncio.to_thread work with the same semantics as normal asyncio eventlet/eventlet#930
* [fix] Refactor congruence checks based on assert at runtime eventlet/eventlet#932
* [tests] Run tests on macOS in CI, and some fixes to get it in reasonable state (#list eventlet/eventlet#934
* [fix] Fix wsgi.server shutdown for in-flight requests eventlet/eventlet#912
* [feature] Add debug convenience helpers - asyncio, threads eventlet/eventlet#925
* [fix] Handle errors better. eventlet/eventlet#923

(NEWS truncated at 15 lines)

CVEs fixed in this build:
CVE-2023-29483
