Conversation

@schneems
Contributor

@schneems schneems commented May 30, 2019

Issue link: #1802

Currently, when a SIGTERM is sent to a puma cluster, the signal is trapped and forwarded to all children; the parent then waits for the children to exit before exiting itself. The socket that accepts connections is only closed when the parent process calls `exit 0`. The problem with this flow is that there is a period of time where there are no child processes left to handle an incoming connection, yet the socket is still open, so clients can connect to it. When this happens, the client connects, but the connection is closed with no response. Instead, the desired behavior is for the connection to be rejected. This allows the client to re-connect, or, if there is a load balancer between the client and the puma server, it allows the request to be routed to another node.

This PR fixes the existing behavior by manually closing the socket when SIGTERM is received, before shutting down the worker/child processes. Once the socket is closed, any incoming requests fail to connect and are rejected, which is our desired behavior. Requests that are already in flight can still be responded to.
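
Conceptually, the change looks something like this (a minimal sketch using a toy listener, not the actual Puma code; in Puma the equivalent happens inside the cluster's signal handling):

```ruby
require "socket"

# Toy stand-in for puma's bound listening sockets.
listeners = [TCPServer.new("127.0.0.1", 9292)]

Signal.trap("SIGTERM") do
  # Close the listening sockets first: from this point on, new clients are
  # refused (ECONNREFUSED) instead of connecting to a server whose workers
  # are about to disappear.
  listeners.each(&:close)
  # ...then shut the worker processes down gracefully; requests already
  # in flight can still finish before the children exit.
end
```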

## Test


This behavior is quite difficult to test; you'll notice that the test is far longer than the code change. In this test we send an initial request to an endpoint that sleeps for 1 second. We then signal to the other threads that they can continue. We send the parent process a SIGTERM while simultaneously sending other requests, some of which will arrive after the SIGTERM has been received by the server. When that happens, we want none of the requests to get an `ECONNRESET` error, which would indicate the request was accepted but then closed. Instead we want `ECONNREFUSED`.
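
From the client's point of view, the check boils down to something like this (an illustrative sketch, not the actual integration test; the address is a placeholder):

```ruby
require "socket"

begin
  sock = TCPSocket.new("127.0.0.1", 9292) # placeholder address where puma was bound
  sock.write("GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n")
  sock.read
rescue Errno::ECONNREFUSED
  # Desired: the listener is already closed, so the connection is rejected outright.
rescue Errno::ECONNRESET
  raise "connection was accepted and then dropped without a response"
end
```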

I ran this test in a loop for a few hours and it passes with my patch; it fails immediately if you remove the call to close the listeners.

```
$ while m test/test_integration.rb:235; do :; done
```

## Considerations

This PR only fixes the problem for "cluster" (i.e. multi-worker) mode. When trying to reproduce the test in single mode (removing the `-w 2` config), it already passes. This leads us to believe that either the bug does not exist in single mode, or at the very least reproducing it via a test in single mode requires a different approach.

Co-authored-by: Danny Fallon <[email protected]>
Co-authored-by: Richard Schneeman <[email protected]>
@nateberkopec
Member

The maintainers think this looks good, but it could use feedback. If you're experiencing issues with closed connections during shutdowns, please give this a shot ❤️

@electron0zero

We had this issue; our workaround was to send SIGTERM to nginx (we run puma behind nginx), then sleep for 15 seconds and send SIGTERM to puma.
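
In script form, that workaround looks roughly like the following sketch (the pidfile paths are assumptions; adjust for your setup):

```ruby
# Stop traffic at the proxy first, give in-flight requests time to finish,
# then shut down puma itself.
nginx_pid = Integer(File.read("/run/nginx.pid")) # assumed pidfile location
puma_pid  = Integer(File.read("/tmp/puma.pid"))  # assumed pidfile location

Process.kill("TERM", nginx_pid) # nginx stops sending new requests to puma
sleep 15                        # let puma finish what it has already accepted
Process.kill("TERM", puma_pid)  # now puma can exit without dropping connections
```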

@adamlogic

I see H13 errors all the time when downscaling with Rails Autoscale. I've deployed this patch, and so far so good.

[Screenshot from 2019-06-04]

It's only been a couple hours, I'll post another update tomorrow.

@schneems
Contributor Author

schneems commented Jun 4, 2019

Awesome, thanks for the info @adamlogic!

@adamlogic

After giving it about 24 hours, I'm definitely seeing different behavior, although I can't quite make sense of it. I typically get 1-2 H13 errors during a downscale or restart, spread throughout the day. Yesterday I only saw one instance of H13 errors, but it was a burst of 37 errors. Quite unusual.

[Screenshot from 2019-06-05]

I'm hoping that was an anomaly. Going to keep this patch running for now.

@schneems
Contributor Author

schneems commented Jun 5, 2019

There was also an incident at that time, so it might be related? Thanks for running this, keep me updated.

@schneems
Contributor Author

schneems commented Jun 6, 2019

How did we do on the next 24 hours?

One theory on where this might be going south is that the default behavior for Puma is not to drain the socket backlog. There's a setting for it though. Just a theory, haven't tested.

@adamlogic

No H13s in the last 36+ hours. 👍

[Screenshot from 2019-06-06]

@adamlogic

adamlogic commented Jun 7, 2019

Spoke too soon. Got 4 H13 errors last night.

[Screenshot from 2019-06-07]

Dug into my logs and found this error, which correlates exactly with the H13 errors. This is the only instance of this error in my logs, even though I've had many downscale events.

2019-06-06T21:00:59.137527+00:00 heroku web.2 - - State changed from up to down
2019-06-06T21:01:00.403377+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:00.428773+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:01.404286+00:00 app web.2 - - [4] - Gracefully shutting down workers...
2019-06-06T21:01:01.408456+00:00 app web.2 - - bundler: failed to load command: puma (/app/vendor/bundle/ruby/2.5.0/bin/puma)
2019-06-06T21:01:01.408661+00:00 app web.2 - - Errno::ECHILD: No child processes
2019-06-06T21:01:01.408666+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `waitpid'
2019-06-06T21:01:01.408669+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `block in stop_workers'
2019-06-06T21:01:01.408672+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `each'
2019-06-06T21:01:01.408674+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `stop_workers'
2019-06-06T21:01:01.408676+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:406:in `block in setup_signals'
2019-06-06T21:01:01.408678+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:531:in `rescue in run'
2019-06-06T21:01:01.408680+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:494:in `run'
2019-06-06T21:01:01.408682+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/launcher.rb:186:in `run'
2019-06-06T21:01:01.408684+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cli.rb:80:in `run'
2019-06-06T21:01:01.408686+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/bin/puma:10:in `<top (required)>'
2019-06-06T21:01:01.408689+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `load'
2019-06-06T21:01:01.408691+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `<top (required)>'
2019-06-06T21:01:01.975425+00:00 heroku web.2 - - Process exited with status 1

LMK if there's any more info that would be helpful.

@schneems
Contributor Author

schneems commented Jun 7, 2019

I’ve seen that error before; it happens because the child process already exited before the parent tried to wait on it. When there is no child, we should rescue that exception and keep going.

Heroku sends TERM to all processes, not just the one it spawned, so sometimes the children exit before the parent can send a TERM to them.

I’ll update later today with a fix for that and let you know.
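
A minimal sketch of what that looks like (illustrative, not the actual patch):

```ruby
# Fork a child that exits immediately, reap it, then show that waiting on it
# again raises Errno::ECHILD, which is safe to swallow during shutdown.
pid = fork { exit 0 }
Process.waitpid(pid) # first wait succeeds and reaps the child

begin
  Process.waitpid(pid) # child is already gone
rescue Errno::ECHILD
  # nothing left to wait for; continue the orderly shutdown
end
```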

schneems added 2 commits June 7, 2019 14:07
On Heroku, and potentially other platforms, SIGTERM is sent to ALL processes, not just the parent process. This means that by the time the parent process tries to wait on its children to shut down, they may not exist. When that happens an `Errno::ECHILD` error is raised as seen in this comment puma#1808 (comment).

In that situation we don't really care that there's no child; we want to continue the shutdown process in an orderly fashion, so we can safely ignore the error and move on.
To avoid a `send` call in `cluster.rb` and to indicate to the maintainers that other classes use this method, we will make it public.
@schneems
Contributor Author

schneems commented Jun 7, 2019

@adamlogic I just added 8c78ee2, which should resolve that error. I don't know if that's the cause of the H13s or if it's just covering up some other behavior.

If fixing that error doesn't resolve the H13s, then I would suggest setting `drain_on_shutdown` in your `config/puma.rb` to see if it helps (but please don't set this until after we verify that my above patch works; I would like to rule out one thing at a time, and that's easier to do if we only change one thing).
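
For reference, enabling that option would look something like the sketch below (check the Puma docs for the exact semantics on your version):

```ruby
# config/puma.rb

# Process connections already queued on the socket backlog before the server
# exits, rather than abandoning them during shutdown.
drain_on_shutdown
```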

@adamlogic

👍 Thanks! Just deployed with 84ce04d.

I'll post another update on Monday.

@adamlogic

No H13 errors or `Errno::ECHILD` errors in the past 72 hours. 🎉

[Screenshot from 2019-06-10]

@nateberkopec
Member

That's the cleanest-looking Heroku metrics timeline I've ever seen.

@schneems schneems merged commit 5e64ed9 into puma:master Jun 10, 2019
schneems added a commit to schneems/puma that referenced this pull request Jun 25, 2019
@dentarg dentarg mentioned this pull request Apr 3, 2021