Manage number of new connections per cycle #12178
Conversation
madolson
left a comment
This was originally discussed here: #10524. The main issue we are trying to prevent is TLS connection storms, which aren't being protected by default here. My preference would have just been to update the tlsAcceptHandler to only do 1 per cycle and leave TCP as is. I would like feedback from @yossigo though.
|
I believe the focus here should not be solely on TLS. TLS is the obvious case, but it is definitely not the only consideration. Please read the discussion in #10524 first; all of my argument there is still completely on point.

This boils down to a prioritization issue. When the event loop is saturated (CPU is at 100% at the micro level), prioritization is required. Do we delay new connections, or do we delay existing work? It's a binary decision. We must bias towards processing existing work; otherwise we risk timeouts and retries. By increasing the number of connections accepted per cycle beyond 1, we are choosing to bias towards accepting connections, which is the wrong decision.

Side note: it's also valid to ask whether there are cases where the accept count should be reduced to something LESS THAN 1. This would bias us even further towards existing work. In practice I have found this unnecessary, and IMO it would add needless complexity.

The issue with accept count is most obvious in a TLS storm. Each accept implies expensive negotiation messages. When we accept too many connections, the negotiations can't complete before timing out. Retries happen. Eventually we reach a point where NONE of the connections complete negotiation, and the engine is saturated with connection attempts. By prioritizing negotiation over new accepts, SOME of the connections will complete without timeout, and over a short period of time a TLS storm will dissipate.

REAL WORLD SCENARIOS that I have witnessed:
In summary, if the event loop isn't saturated, this change makes no difference: the event loop spins quickly and connections are established. It's only when the event loop is saturated that connections-per-cycle has any impact, and that impact is immediate and beneficial.

P.S. If it were up to me, I'd remove the configuration and the inner connection loop completely. Over the last 2 years, I've seen no evidence that using a value greater than 1 is ever warranted. But I agree that a tunable parameter is the less controversial option. |
As a note, this is a more controversial option here than in our internal AWS fork, since it's a new config we have to expose, document, and maintain. |
@JimB123 Can you confirm this is also the case when there's a very large spike of inbound connections? I'm concerned the extra syscalls will reduce the accept rate and potentially have an impact on the TCP backlog. |
@yossigo You are correct that there are more syscalls. If accepting connections were the only measure of success, you would be right to be concerned. However, the measure of success should be accepting connections AND doing work. When originally working on this, I did some rough measurements of pure event loop speed and found the event loop to spin much faster than any reasonable connection rate. In the TLS case, the extra overhead is very small in comparison to the TLS negotiation. If I were to TRY to show a degradation in overall performance/throughput, I'd set up a scenario as follows:
I think in this case, there might be some measurable performance delta. However, I don't think we should optimize for such an anti-pattern. |
|
@JimB123 I wasn't suggesting we optimize for it, but rather use it as a test case to understand the worst-case impact. Anyway, based on running unstable: This PR: |
|
Very nice @yossigo. 2%, worst case, is better than I was expecting. Of course, in practice, we're very unlikely to see any throughput impact. The connection (& disconnection) rate would have to be extremely high, and CPU would have to be saturated. |
|
@madolson I'd play it safe and leave it configurable for now. |
|
@yossigo in that case are we okay with a default of 1? Otherwise people will need to be impacted before they know to fix it. |
|
@madolson Yes. |
Force-pushed 087b5cb to d7d748d
madolson
left a comment
@redis/core-team A small change to help prevent connection storms in certain circumstances.
|
I have one more concern: accepting connections from processEventsWhileBlocked is necessary in order to still be responsive to PING and INFO. This isn't normally a problem, since 2mb of rdb file are probably processed quite fast and we'll get many of these calls per second, but there are cases where the rate of such calls to processEventsWhileBlocked drops. |
I think we need to re-benchmark it.
|
We discussed this in a core-team meeting. We dismissed the concern about processEventsWhileBlocked, considering it an edge case that would only cause issues when several types of abuse are combined (monitoring software that re-connects each time, several instances of it, and a very slow rate of processEventsWhileBlocked calls). We conceptually approved this, pending a re-benchmark (on better hardware?) of the impact on clients that re-connect for each command they send. |
|
@AshMosh Can you help with the performance testing above? Ideally we want to test more real-world scenarios than the laptop tests Yossi did. |
|
With max-new-connections-per-cycle=5, reconnect-interval=10:
With max-new-connections-per-cycle=10, reconnect-interval=10:
I also reran with reconnect-interval=1 for a more extreme connection storm simulation.
With max-new-connections-per-cycle=5, reconnect-interval=1:
With max-new-connections-per-cycle=10, reconnect-interval=1:
With max-new-connections-per-cycle=1000, reconnect-interval=1:
|
So I am a bit inclined to update my position and say we should update the PR to use a default value of 10. It seems to still provide most of the protective benefit against TLS-based connection storms while causing minimal detriment to workloads that open a lot of connections. @soloestoy Thoughts? |
|
Anything larger than 1 risks timeouts on your real workload and a death loop. Set up a scenario with a high connection rate, using TLS, and a workload with small timeouts for regular commands. Make sure you drive CPU to 100%. |
|
The decision for now is to split this into two separate configs with two different default values: 1 for TLS and 10 for TCP. The motivation is that we want to be as defensive as possible about not introducing a performance regression for TCP-based workloads that do single-client, single-command patterns (as is common with PHP). |
Force-pushed 63b85da to 4b50a2f
madolson
left a comment
@redis/core-team Any other concerns? I've updated the top comment as well.
|
I have reconsidered: if we are concerned about the potential time consumption of accept, perhaps we can set a time threshold for accept, such as 1000 microseconds. That way we don't need to differentiate between TLS and TCP or worry about the number of accepted connections; using time as the limit could be a better approach. |
|
#12178 (comment) @murphyjacob4 Are the results based on this PR (limiting to 1 or 10) or on unstable? |
|
Additionally, Redis initially accepted only one connection per cycle; this was increased to 1000 starting from version 2.8 (see f3d3c60 for details). Does anyone have knowledge of this history? As mentioned in the commit description, the change was made to optimize scenarios where a connect-disconnect pattern is used frequently. In that scenario, accepting a new connection should also be considered a high-priority event, even when CPU usage is at 100%. Therefore, if we intend to reduce the default value now, we need to exercise caution and carefully consider the implications. |
|
personally, i don't like the use of |
The problem isn't in the accept per se; it's that we're signing up for future work with the TLS negotiation. I think a better alternative, which I've mentioned before, is to limit the number of TLS connections in the handshaking state. That bottlenecks incoming TCP connections much less, since they move through accept immediately. @oranagra I didn't like tcp either, since it also applies to unix, so I'm fine just differentiating it on tls. Your suggestion has tls twice; I still prefer the original name.
There are situations (especially with TLS) in which the engine gets too occupied managing a large number of new connections. Existing connections may time out while the server is processing the new connections' initial TLS handshakes, which may cause new connections to be established, perpetuating the problem. To better manage the tradeoff between new connection rate and other workloads, this change adds configs to manage the maximum number of new connections per event loop cycle, instead of using a predetermined number (currently 1000).

This change introduces two new configurations, max-new-connections-per-cycle and max-new-tls-connections-per-cycle. The default value for TCP connections is 10 per cycle, and the default value for TLS connections is 1 per cycle.

---------

Co-authored-by: Madelyn Olson <[email protected]>
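For reference, the two settings described in the commit message would appear in a redis.conf along these lines (names and defaults as stated above; the comments are my paraphrase, not official documentation):

```conf
# Plain TCP (and unix socket) accepts per event-loop cycle.
max-new-connections-per-cycle 10

# TLS accepts per event-loop cycle; kept at 1 because each accept
# commits the server to an expensive handshake.
max-new-tls-connections-per-cycle 1
```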