
Conversation

@evanphx
Member

@evanphx evanphx commented Apr 27, 2017

Talking through this with @nateberkopec, @schneems, @FooBarWidget, and others at RailsConf, we realized that accepting new clients without taking into account the available capacity of the thread pool doesn't improve throughput; it only hurts it in the case of multiple workers. If a worker pauses (or starts up before other workers), then it can accidentally suck up a high number of
clients and leave unused capacity inside the other workers.

This change will smooth out this issue, with a minor penalty to maximum throughput.
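
In code terms, the accept loop now blocks in ThreadPool#wait_until_not_full until the pool has spare capacity before pulling the next client off the socket. A rough sketch of the new check (reconstructed from the excerpts quoted later in this thread; not a verbatim copy of the commit):

def wait_until_not_full
  @mutex.synchronize do
    while true
      return if @shutdown

      # A thread is already idle and waiting for work, so we can accept another client.
      return if @waiting > 0

      # If we can still spin up new threads and there is work queued,
      # then accept more work until we would spin up the max number of threads.
      return if @todo.size < @max - @spawned

      @not_full.wait @mutex
    end
  end
end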

@nateberkopec
Member

Linking #1107

@nateberkopec nateberkopec merged commit 482ea5a into master Apr 28, 2017
@nateberkopec
Member

I think this could really improve performance for a lot of people, so I'm going to start looking at packaging up a new release.

@ivoanjo

ivoanjo commented May 30, 2017

@evanphx and @nateberkopec, thanks for looking into this and looking forward to a puma release :)

@nateberkopec
Member

👍 Since the JRuby bug on master was fixed, I think I can push a 3.9.0 soon.

@waynerobinson

This will also resolve some potential deadlocking issues for APIs that can sometimes end up calling themselves (e.g. authentication systems that use microservices which themselves need authenticating).

👍👍

@joealba

joealba commented Jun 13, 2017

@evanphx @schneems Do you recommend this upgrade for apps that are not yet threadsafe? And does that answer change if the app is also behind the Heroku router?

If I have N dynos with X workers each -- and the Heroku router picks a single busy worker in each of my N dynos, would that request fail due to the combination of this PR and a non-threadsafe app?

(I've tested this theory with ab and 4x higher concurrency than the number of workers, and I saw no failed requests. But I thought it prudent to ask you folks as well.)


@schneems
Contributor

apps that are not yet threadsafe?

I actually don't recommend puma for apps that aren't threadsafe, unless it's a step to getting the app threadsafe. I recommend the rack-freeze gem for threadsafety in the middleware (it's a great gem).

I've seen performance issues with Puma using X workers and 1 thread compared to Unicorn with X workers. I'm not totally sure why, but I've seen this several times. You could also try Passenger.

I don't think this patch would hurt if you're only using 1 thread, but I've seen other perf issues with that setup. I would recommend either putting in the leg work to get the app threadsafe, or moving to a different server.
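
For the middleware piece, pulling in rack-freeze is just a Gemfile entry (hypothetical snippet):

# Gemfile
gem 'rack-freeze'  # freezes middleware so accidental mutation raises instead of racing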


  pool << client
- pool.wait_until_not_full unless queue_requests
+ pool.wait_until_not_full


Was there a reason queue_requests was removed here?

Member

I think the reason is explained adequately in the OP?

# If we can still spin up new threads and there
# is work queued, then accept more work until we would
# spin up the max number of threads.
return if @todo.size < @max - @spawned
@ivanpesin ivanpesin Oct 24, 2017

This is not equivalent to the initial condition. With this change, Puma serving a single-threaded app with "threads 1,1" attempts to pass a request to a worker that is already serving a request. We're seeing occasional deadlocks for requests coming right after a long-running request in our environment with a "threads 1,1", "queue_requests false", and "workers 5" configuration. Rolling this single commit back to the original solves the issue completely.
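
For reference, a config reproducing that setup would look something like this (hypothetical file, settings as described above):

# config/puma.rb
workers 5
threads 1, 1
queue_requests false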

Member

@nateberkopec nateberkopec Nov 7, 2017

In the case of a single-threaded worker, @waiting will be 0, @todo.size will be 1, @max will be 1 and @spawned will be 1, so the condition evaluates to 1 < 1 - 1, i.e. 1 < 0, which is false, so Puma waits here.
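
Spelling that out with those values (a quick sanity check, not code from the PR):

waiting, todo_size, max, spawned = 0, 1, 1, 1

todo_size < max - spawned  # => 1 < 0 => false, so no early return here
waiting > 0                # => false, so no early return here either
# the loop falls through to @not_full.wait, and Puma blocks until the single thread frees up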


The problem is in this line:
+ return if @waiting > 0

This is not how the original condition worked: until @todo.size - @waiting < @max - @spawned or @shutdown. The fact that @waiting > 0 is not enough to decide against doing @not_full.wait @mutex.

I prepared a simple test scenario which demonstrates the problem: https://gist.github.com/ivanpesin/f40a76e51a60a83f1190113fdecddaa5

With this commit, the second request always gets blocked until the first request is served, even though there are available workers. The third request goes to an available worker.

With the original condition, the second request goes to the next available worker, as expected. The condition needs either to be reverted to its original form or to have its logic corrected.
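
To make the divergence concrete, here is a hypothetical pool state (made-up values) where the two checks disagree:

# 5 threads, all spawned; 2 still counted as waiting; 7 items queued.
todo_size, waiting, max, spawned, shutdown = 7, 2, 5, 5, false

# Early exit from this PR: a positive @waiting is treated as spare capacity,
# so wait_until_not_full returns and the server accepts more connections.
waiting > 0                                        # => true

# Original condition: outstanding work (todo - waiting) is not below the
# remaining headroom (max - spawned), so keep blocking.
todo_size - waiting < max - spawned || shutdown    # => 5 < 0 => false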


We also had this issue.

Our setup uses 16 workers with 1 thread each, running behind an nginx server. We were getting high wait times between nginx and Rails, as measured by a middleware similar to the one at #1405 (comment). We partially undid the change in this pull request with the patch below, and our wait times dropped, as can be seen in the attached screenshot.

# Revert ThreadPool#wait_until_not_full to the pre-#1278 condition.
require 'puma/thread_pool'

module PumaMonkeypatch
  def wait_until_not_full
    @mutex.synchronize do
      # Block until outstanding work (@todo.size - @waiting) fits within the
      # remaining thread headroom (@max - @spawned), or we are shutting down.
      until @todo.size - @waiting < @max - @spawned or @shutdown
        @not_full.wait @mutex
      end
    end
  end
end

Puma::ThreadPool.prepend(::PumaMonkeypatch)

[screenshot: wait times between nginx and Rails dropping after applying the patch]

Thoughts on how to proceed? Should we file a new issue?

Member

Should we file a new issue?

Please do.


Thanks. Ticket created: #1471

@eprothro
Contributor

eprothro commented Nov 8, 2017

If I'm understanding this change correctly:

Previously
If queue_requests was set, Puma would eagerly accept connections, irrespective of the availability of worker threads in the pool to process them.

Now
Puma will only accept a connection if there is an available worker thread in the pool.

In effect, this leaves requests unaccepted (in the socket) instead of accepted and waiting for processing by the worker ("in" the queue_requests thread). And ultimately, as stated in the OP, this would result in better load balancing in a multi-worker configuration in certain scenarios.

Around the water cooler, it seems some are concerned that queue_requests has been deprecated or no longer has any effect.

To clarify, my understanding is that there is still benefit in having the queue_requests option, as it allows requests to be received without blocking a worker thread from handling another request that has already been received. That has not changed; only how many requests can be in this state (being handled by the queue_requests thread before being handed to a worker thread) has changed.

Is this correct?

@nateberkopec
Member

@eprothro

To clarify, my understanding is that there is still benefit in having the queue_requests option, as it allows requests to be received without blocking a worker thread from handling another request that has already been received. That has not changed; only how many requests can be in this state (being handled by the queue_requests thread before being handed to a worker thread) has changed.

That is correct.
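
In config terms (hypothetical example; queue_requests remains an ordinary option after this change):

# config/puma.rb
workers 2
threads 1, 5
queue_requests true   # the default; slow clients are buffered off the worker threads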

dannyfallon added a commit to dannyfallon/puma that referenced this pull request Apr 17, 2018
As part of the work in puma#1278 we refactored the conditional in
`ThreadPool#wait_until_not_full` because we found it difficult to reason about
at the RailsConf '17 roundtable. We broke it up a bit and everyone went away
happier. However, it turns out that you need to account for both `@waiting` and
`@todo.size` together to make a determination on whether to accept more work
from the socket, and the refactor was not equivalent to the old code.

There are no guarantees that request threads that have been signalled due to
work being added to the thread pool's queue will become active, take the mutex
and decrement `@waiting` before the main worker's thread runs
`ThreadPool#wait_until_not_full`. If we run that before the request thread
decrements `@waiting`, we end up hitting the early exit of `return if @waiting > 0`,
taking more work off the socket than the pool has capacity for.

The correct way to determine the number of requests we have queued in the pool
that are outstanding is `@todo.size - @waiting`, so I've reverted to the old
behaviour and augmented the comment above it.
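
For reference, the restored check has the same shape as the monkeypatch quoted earlier in the thread, with the comment spelled out:

def wait_until_not_full
  @mutex.synchronize do
    # Outstanding work is @todo.size - @waiting: queued items minus the idle
    # threads that will pick them up. Only stop blocking once that fits within
    # the remaining headroom (@max - @spawned), or on shutdown.
    until @todo.size - @waiting < @max - @spawned or @shutdown
      @not_full.wait @mutex
    end
  end
end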
davidor added a commit to 3scale/puma that referenced this pull request Jul 10, 2020
In PR puma#1278 this was changed to avoid using the "queue_requests" param,
so workers only accept new connections if they have a free thread. This
commit reverts that.

I guess that change was beneficial for the most typical deployment
scenarios, but it has a noticeable performance penalty in the case of
Apisonator, given that it is single-threaded. Even with a moderate number
of connections there are lots of time-outs.
davidor added a commit to 3scale/puma that referenced this pull request Jul 13, 2020
davidor added a commit to 3scale/puma that referenced this pull request Feb 11, 2021
eguzki pushed a commit to 3scale/puma that referenced this pull request Nov 16, 2021