SocketAsyncEngine.Unix: remove multiple engines #36693
Conversation
The poll thread doesn't perform much work, so having 1 should be enough.
To be benchmarked to verify this doesn't regress performance.
Pre this change, should it be doing everything under a static lock? For example the field ... would it not be better to use a field-level lock so these class fields can be used in parallel? Otherwise it does defeat the purpose of having multiple engines?
e.g. should it currently be doing something like this?
Good point. If we have multiple engines, they shouldn't use the same lock for getting the socket context.
Worth a try?
Definitely. That lock was meant to be at the instance level.
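To make the suggestion concrete, a minimal sketch of what an instance-level lock could look like. The type and member names (PollEngine, SocketContext, GetOrCreateContext) are placeholders, not the actual SocketAsyncEngine members; this only shows the shape of the change being discussed.

```cs
using System;
using System.Collections.Generic;

// Sketch only, not the actual corefx code: type and member names are placeholders.
// Illustrates moving from a type-wide static lock to a per-engine instance lock,
// so that multiple engines can register sockets in parallel.
internal sealed class SocketContext
{
    public SocketContext(IntPtr handle) => Handle = handle;
    public IntPtr Handle { get; }
}

internal sealed class PollEngine
{
    // Before: a static lock would serialize registration across *all* engines,
    // defeating the point of having more than one.
    // private static readonly object s_lock = new object();

    // After: each engine guards only its own state.
    private readonly object _lock = new object();
    private readonly Dictionary<IntPtr, SocketContext> _contexts = new Dictionary<IntPtr, SocketContext>();

    public SocketContext GetOrCreateContext(IntPtr handle)
    {
        lock (_lock) // instance-level: engines no longer contend with each other
        {
            if (!_contexts.TryGetValue(handle, out SocketContext context))
            {
                context = new SocketContext(handle);
                _contexts.Add(handle, context);
            }
            return context;
        }
    }
}
```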
I've sent builds to @sebastienros and will share the benchmark results when I have them.
We should make sure we have the right set of benchmarks.
I rebased this PR and #36518 on the same commit, which includes the ConcurrentDictionary lookup. That is Baseline. MultiEngine is #36518 and SingleEngine is this PR.
SingleEngine shows the best performance for all benchmarks. imo, we can merge this based on these results. CC @davidfowl
These are for 6 cores / 12 HT. I will run these on 14 cores / 28 HT too to have more insights.
How many connections were in use for the above results, and what's the range as the number of connections increases? e.g. from TE Round 17. Though this PR isn't only having one engine; it's starting with one engine and then allocating a new engine when full? (Though full is defined as ...)
It starts allocating a new engine every 32 used connections.
256 were used, and currently running on Citrine (TE env) with 256 too.
256 / 32 = 8 engines |
Sure? In non-debug:

```cs
bool IsFull { get { return _nextHandle == MaxHandles; } }

IntPtr MaxHandles = IntPtr.Size == 4 ? (IntPtr)int.MaxValue : (IntPtr)long.MaxValue;

if (engine.IsFull)
{
    Current = engine = new SocketAsyncEngine();
}
```

So if you had 68k open sockets on a server it would only have one event loop for all of them? I might be reading it wrong?
Other side of the diff:
That's deleted in this PR though?
Yes, we are going to 1 engine in this PR. The allocation of additional engines that remains is to deal with the exhaustion of unique tokens per engine. It's not meant as a way to distribute the load.
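A rough sketch of that reading. The names (PollEngine, AllocateToken) and the token scheme are assumptions, not the actual implementation; the point is that the remaining engine creation is driven by token exhaustion, which with a 64-bit token space effectively never triggers.

```cs
using System;

// Illustrative sketch, not the corefx implementation (names are assumptions).
// Each engine hands out unique tokens; a new engine is created only when the
// current engine's token space runs out. With a 64-bit token space that branch
// is essentially unreachable, so a single engine ends up serving all sockets.
internal sealed class PollEngine
{
    private static readonly object s_lock = new object();
    private static PollEngine s_current = new PollEngine();

    private static readonly IntPtr MaxHandles =
        IntPtr.Size == 4 ? (IntPtr)int.MaxValue : (IntPtr)long.MaxValue;

    private IntPtr _nextHandle;

    private bool IsFull => _nextHandle == MaxHandles;

    // Returns a token that is unique within the chosen engine.
    public static IntPtr AllocateToken(out PollEngine engine)
    {
        lock (s_lock)
        {
            if (s_current.IsFull)
            {
                // Token exhaustion, not load distribution.
                s_current = new PollEngine();
            }
            engine = s_current;
            IntPtr token = engine._nextHandle;
            engine._nextHandle = (IntPtr)((long)engine._nextHandle + 1);
            return token;
        }
    }
}
```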
Is this to reduce the contention on the ThreadPool, as the poll thread mostly just queues to it? However, it does also take a lock per loop item in the event loop. What's the difference with a larger number of connections? e.g. 16,384, the TE upper limit. An example would be Stack Overflow, which has 68k active sockets per server /cc @NickCraver
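For reference, a simplified sketch of the kind of loop being discussed. The names are made up and this is not the actual SocketAsyncEngine event loop; it only shows the pattern of taking a short per-item lock and handing the work to the ThreadPool.

```cs
using System;
using System.Collections.Generic;
using System.Threading;

// Simplified sketch (made-up names, not the real event loop). The poll thread
// blocks on the OS wait call, then per ready socket takes a short lock to find
// the registration and queues the continuation to the ThreadPool. The poll
// thread itself does very little work, which is the argument for needing only
// one, but the per-item lock is a potential contention point as connections grow.
internal sealed class PollLoopSketch
{
    private readonly object _lock = new object();
    private readonly Dictionary<long, Action> _continuations = new Dictionary<long, Action>();

    public void Register(long token, Action continuation)
    {
        lock (_lock)
        {
            _continuations[token] = continuation;
        }
    }

    // 'waitForEvents' stands in for the platform poll/epoll/kqueue wait.
    public void RunEventLoop(Func<long[]> waitForEvents)
    {
        while (true)
        {
            foreach (long token in waitForEvents())
            {
                Action continuation;
                lock (_lock) // one lock acquisition per ready event
                {
                    _continuations.TryGetValue(token, out continuation);
                }
                if (continuation != null)
                {
                    // Hand the real work off to the ThreadPool immediately.
                    ThreadPool.UnsafeQueueUserWorkItem(_ => continuation(), null);
                }
            }
        }
    }
}
```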
*256 connections clearly look good 😄
On Citrine:
With 1024 connections on 6/12 cores:
1024 connections on Citrine:
From the results, the machine with more cores shows less improvement. SingleEngine is still out-performing MultiEngine.
It's to avoid side-effects of having too many poll threads.
@NickCraver 'active': does that mean those are all actively making requests towards the server (or are there many sleeping connections)? How many 'processors' (...)?
There are two parameters we can play with to see how they affect throughput:
Looks good; thank you for checking it.
Something for a follow-up maybe? The single engine hasn't peaked at 256 and then under-performed against multi-engine at 1024, which was my worry. If multiple engines do have an advantage at higher connection counts, it does seem the threshold for starting a new engine is going to be much higher than the current 32; and since it's still a clear win at 1024 connections, this PR seems good as is, rather than trying to speculatively see if there is an advantage at some point.
Thanks for all the numbers. Do we know what accounts for the 2% RPS dip in PlaintextNonPipelined?
I don't know. My expectation here was to get results more similar to Json (like with the previous benchmark results). I'm not focusing on it because it's the least interesting benchmark.
@davidfowl, considering that most numbers either stay the same or improve and this simplifies the code, assuming the small plaintext non-pipelined dip mentioned is real, are you ok with this?
I like the change, but the fact that non-pipelined performance is still on a downward trend doesn't make me happy. We can merge this for now and watch it.
One more thing we should measure before merging is the proxy scenario. That uses both the HttpClient (sockets based) and server in the same process. That should be interesting. |
Only for plaintext, not for json. This is surprising and not consistent with the earlier benchmark result. We should run the benchmark again.
Do you have such a benchmark available?
I got some benchmark results from @sebastienros that show this PR is reducing performance with 64k connections: -23% RPS for Json. I'm going to make some changes that increase this value, and then we can run another benchmark including 64k connections.
ah ok: the proxy scenario
Let me first have a look at the EventBufferCount before we run another benchmark.
Implemented in this PR.
On https://github.com/tmds/corefx/tree/increase_min_handles I've raised the bar for creating additional engines. That means we should still get the gains we have benchmarked so far. We can benchmark these against one another to see what makes more sense for dealing with high (e.g. 65k) connection counts.
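To illustrate the idea of raising the bar: the constant name and value below are assumptions for illustration, not the branch's actual code. Instead of treating an engine as full after a small number of handles, a much larger threshold keeps typical workloads on a single engine while very high connection counts can still fan out.

```cs
using System;

// Illustrative only; the constant name and value are assumptions, not the
// branch's actual code. The point: rather than considering an engine "full"
// after a small number of handles (32 in the discussion above), require a much
// larger number before adding another engine.
internal static class EngineLimits
{
    public const int MinHandlesPerEngine = 32 * 1024;

    public static int EnginesNeeded(int registeredSockets) =>
        Math.Max(1, (registeredSockets + MinHandlesPerEngine - 1) / MinHandlesPerEngine);
}
```

With these made-up numbers, 1,024 connections would still run on one engine, while 65,536 connections would spread over two.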
I sent some more dlls to @sebastienros for benchmarking.
No benchmark results yet for this branch. It seems there is a bug causing the client connections to be refused. I need to investigate. https://github.com/tmds/corefx/tree/increase_min_handles results are good:
More benchmark results.
No clear winner stands out here. Which variant performs better depends on the benchmark (which is strange to me, especially for Json vs Plaintext vs PlaintextNonPipelined). Fortunes shows a large regression. It's not possible to pick between these without prioritizing benchmarks, so I'm inclined to leave things as is, unless someone decides which benchmarks are more important. I've had great assistance from @sebastienros running these benchmarks for me. However, I'd be much more productive (and not feel like a waste of his time) if I could run the benchmarks myself.
In other words, close this PR? I'm ok with that. Thanks for the investigation.
