Conversation

@schneems
Contributor

@schneems schneems commented May 30, 2019

Issue link: #1802

Currently, when a SIGTERM is sent to a puma cluster, the signal is trapped and forwarded to all children; the parent then waits for the children to exit before exiting itself. The socket that accepts connections is only closed when the parent process calls `exit 0`. The problem with this flow is that there is a period of time where there are no child processes left to handle an incoming connection, yet the socket is still open, so clients can connect to it. When this happens, the client connects, but the connection is closed with no response. Instead, the desired behavior is for the connection to be rejected. This allows the client to re-connect, or, if there is a load balancer between the client and the puma server, it allows the request to be routed to another node.

This PR fixes the existing behavior by manually closing the socket when SIGTERM is received, before shutting down the worker/child processes. Once the socket is closed, any incoming requests fail to connect and are rejected, which is our desired behavior. Requests that are already in flight can still be responded to.
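
Conceptually, the change looks something like this (a minimal sketch using a toy listener, not the actual Puma code; in Puma the equivalent happens inside the cluster's signal handling):

```ruby
require "socket"

# Toy stand-in for puma's bound listening sockets.
listeners = [TCPServer.new("127.0.0.1", 9292)]

Signal.trap("SIGTERM") do
  # Close the listening sockets first: from this point on, new clients are
  # refused (ECONNREFUSED) instead of connecting to a server whose workers
  # are about to disappear.
  listeners.each(&:close)
  # ...then shut the worker processes down gracefully; requests already
  # in flight can still finish before the children exit.
end
```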

## Test


This behavior is quite difficult to test; you'll notice that the test is far longer than the code change. In this test we send an initial request to an endpoint that sleeps for 1 second. We then signal to the other threads that they can continue. We send the parent process a SIGTERM while simultaneously sending other requests, some of which will arrive after the SIGTERM has been received by the server. When that happens, we want none of the requests to get an `ECONNRESET` error, which would indicate the request was accepted but then closed. Instead we want `ECONNREFUSED`.
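
From the client's point of view, the check boils down to something like this (an illustrative sketch, not the actual integration test; the address is a placeholder):

```ruby
require "socket"

begin
  sock = TCPSocket.new("127.0.0.1", 9292) # placeholder address where puma was bound
  sock.write("GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n")
  sock.read
rescue Errno::ECONNREFUSED
  # Desired: the listener is already closed, so the connection is rejected outright.
rescue Errno::ECONNRESET
  raise "connection was accepted and then dropped without a response"
end
```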

I ran this test in a loop for a few hours and it passes with my patch; it fails immediately if you remove the call to close the listeners.

```
$ while m test/test_integration.rb:235; do :; done
```

## Considerations

This PR only fixes the problem for "cluster" (i.e. multi-worker) mode. When trying to reproduce the test in single mode (removing the `-w 2` config), it already passes. This leads us to believe that either the bug does not exist in single mode, or at the very least reproducing it via a test in single mode requires a different approach.

Co-authored-by: Danny Fallon <[email protected]>
Co-authored-by: Richard Schneeman <[email protected]>
@nateberkopec
Member

The maintainers think this looks good, but it could use feedback. If you're experiencing issues with closed connections during shutdowns, please give this a shot ❤️

@electron0zero

We had this issue; our workaround was to send SIGTERM to nginx (we run puma behind nginx), then sleep for 15 seconds and send SIGTERM to puma.
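
In script form, that workaround looks roughly like the following sketch (the pidfile paths are assumptions; adjust for your setup):

```ruby
# Stop traffic at the proxy first, give in-flight requests time to finish,
# then shut down puma itself.
nginx_pid = Integer(File.read("/run/nginx.pid")) # assumed pidfile location
puma_pid  = Integer(File.read("/tmp/puma.pid"))  # assumed pidfile location

Process.kill("TERM", nginx_pid) # nginx stops sending new requests to puma
sleep 15                        # let puma finish what it has already accepted
Process.kill("TERM", puma_pid)  # now puma can exit without dropping connections
```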

@adamlogic

I see H13 errors all the time when downscaling with Rails Autoscale. I've deployed this patch, and so far so good.

[Screenshot from 2019-06-04]

It's only been a couple hours, I'll post another update tomorrow.

@schneems
Contributor Author

schneems commented Jun 4, 2019

Awesome, thanks for the info @adamlogic!

@adamlogic

After giving it about 24 hours, I'm definitely seeing different behavior, although I can't quite make sense of it. I typically get 1-2 H13 errors during a downscale or restart, spread throughout the day. Yesterday I only saw one instance of H13 errors, but it was a burst of 37 errors. Quite unusual.

[Screenshot from 2019-06-05]

I'm hoping that was an anomaly. Going to keep this patch running for now.

@schneems
Contributor Author

schneems commented Jun 5, 2019

There was also an incident at that time, so it might be related? Thanks for running this, keep me updated.

@schneems
Contributor Author

schneems commented Jun 6, 2019

How did we do on the next 24 hours?

One theory on where this might be going south is that the default behavior for Puma is not to drain the socket backlog. There's a setting for it though. Just a theory, haven't tested.

@adamlogic

No H13s in the last 36+ hours. 👍

[Screenshot from 2019-06-06]

@adamlogic

adamlogic commented Jun 7, 2019

Spoke too soon. Got 4 H13 errors last night.

[Screenshot from 2019-06-07]

Dug into my logs and found this error, which correlates exactly with the H13 errors. This is the only instance of this error in my logs, even though I've had many downscale events.

2019-06-06T21:00:59.137527+00:00 heroku web.2 - - State changed from up to down
2019-06-06T21:01:00.403377+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:00.428773+00:00 heroku web.2 - - Stopping all processes with SIGTERM
2019-06-06T21:01:01.404286+00:00 app web.2 - - [4] - Gracefully shutting down workers...
2019-06-06T21:01:01.408456+00:00 app web.2 - - bundler: failed to load command: puma (/app/vendor/bundle/ruby/2.5.0/bin/puma)
2019-06-06T21:01:01.408661+00:00 app web.2 - - Errno::ECHILD: No child processes
2019-06-06T21:01:01.408666+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `waitpid'
2019-06-06T21:01:01.408669+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `block in stop_workers'
2019-06-06T21:01:01.408672+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `each'
2019-06-06T21:01:01.408674+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:39:in `stop_workers'
2019-06-06T21:01:01.408676+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:406:in `block in setup_signals'
2019-06-06T21:01:01.408678+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:531:in `rescue in run'
2019-06-06T21:01:01.408680+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cluster.rb:494:in `run'
2019-06-06T21:01:01.408682+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/launcher.rb:186:in `run'
2019-06-06T21:01:01.408684+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/lib/puma/cli.rb:80:in `run'
2019-06-06T21:01:01.408686+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bundler/gems/puma-184e1510a97c/bin/puma:10:in `<top (required)>'
2019-06-06T21:01:01.408689+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `load'
2019-06-06T21:01:01.408691+00:00 app web.2 - -   /app/vendor/bundle/ruby/2.5.0/bin/puma:23:in `<top (required)>'
2019-06-06T21:01:01.975425+00:00 heroku web.2 - - Process exited with status 1

LMK if there's any more info that would be helpful.

@schneems
Contributor Author

schneems commented Jun 7, 2019

I’ve seen that error before; it happens because the child process already exited before the parent tried to wait on it. When there is no child, we should rescue that exception and keep going.

Heroku sends TERM to all processes, not just the one it spawned, so sometimes the children exit before the parent can send a TERM to them.

I’ll update later today with a fix for that and let you know.
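
A minimal sketch of what that looks like (illustrative, not the actual patch):

```ruby
# Fork a child that exits immediately, reap it, then show that waiting on it
# again raises Errno::ECHILD, which is safe to swallow during shutdown.
pid = fork { exit 0 }
Process.waitpid(pid) # first wait succeeds and reaps the child

begin
  Process.waitpid(pid) # child is already gone
rescue Errno::ECHILD
  # nothing left to wait for; continue the orderly shutdown
end
```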

schneems added 2 commits June 7, 2019 14:07
On Heroku, and potentially other platforms, SIGTERM is sent to ALL processes, not just the parent process. This means that by the time the parent process tries to wait on its children to shut down, they may not exist. When that happens an `Errno::ECHILD` error is raised as seen in this comment puma#1808 (comment).

In that situation we don't really care that there's no child; we want to continue the shutdown process in an orderly fashion, so we can safely ignore the error and move on.
To avoid a `send` call in `cluster.rb` and to indicate to the maintainers that other classes use this method, we will make it public.
@schneems
Contributor Author

schneems commented Jun 7, 2019

@adamlogic I just added 8c78ee2, which should resolve that error. I don't know if that's the cause of the H13s or if it's just covering up some other behavior.

If fixing that error doesn't resolve the H13s, then I would suggest setting `drain_on_shutdown` in your `config/puma.rb` to see if it helps (but please don't set this until after we verify that my above patch works; I would like to rule out one thing at a time, and that's easier to do if we only change one thing).
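
For reference, enabling that option would look something like the sketch below (check the Puma docs for the exact semantics on your version):

```ruby
# config/puma.rb

# Process connections already queued on the socket backlog before the server
# exits, rather than abandoning them during shutdown.
drain_on_shutdown
```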

@adamlogic

👍 Thanks! Just deployed with 84ce04d.

I'll post another update on Monday.

@adamlogic

No H13 errors or `Errno::ECHILD` errors in the past 72 hours. 🎉

[Screenshot from 2019-06-10]

@nateberkopec
Member

That's the cleanest-looking Heroku metrics timeline I've ever seen.

@schneems schneems merged commit 5e64ed9 into puma:master Jun 10, 2019
schneems added a commit to schneems/puma that referenced this pull request Jun 25, 2019
@dentarg dentarg mentioned this pull request Apr 3, 2021