This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@tmds (Member) commented Apr 8, 2019

The poll thread doesn't perform much work, so having 1 should be enough.

@tmds (Member Author) commented Apr 8, 2019

cc @stephentoub @geoffkizer

To be benchmarked to verify this doesn't regress performance.

@benaadams (Member) commented Apr 8, 2019

Prior to this change, should it be doing everything under a static lock (`lock (s_lock)`) as it does now, or under a lock per engine?

For example, the field `_handleToContextMap` is accessed under the static lock even though there is one map per engine; this means the maps can only be used single-threaded even though there are many of them.

Would it not be better to use a per-instance lock so these per-engine fields can be used in parallel? Otherwise it defeats the purpose of having multiple engines?
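
For illustration only, a minimal sketch of the difference (the type and member names below are hypothetical, not the actual corefx code):

```cs
using System;
using System.Collections.Generic;

// Hypothetical engine type, used only to illustrate lock granularity;
// this is not the actual SocketAsyncEngine code.
class Engine
{
    // Static lock: every engine serializes on the same object.
    static readonly object s_lock = new object();

    // Instance lock: each engine serializes only access to its own map.
    private readonly object _lock = new object();
    private readonly Dictionary<IntPtr, object> _handleToContextMap = new Dictionary<IntPtr, object>();

    public object GetContextUnderStaticLock(IntPtr handle)
    {
        lock (s_lock) // contended across all engines
        {
            _handleToContextMap.TryGetValue(handle, out object context);
            return context;
        }
    }

    public object GetContextUnderInstanceLock(IntPtr handle)
    {
        lock (_lock) // engines can look up their contexts in parallel
        {
            _handleToContextMap.TryGetValue(handle, out object context);
            return context;
        }
    }
}
```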

@benaadams (Member) commented Apr 8, 2019

e.g. should it currently be doing something like this?

#36697

@tmds (Member Author) commented Apr 8, 2019

> Prior to this change, should it be doing everything under a static lock (`lock (s_lock)`) as it does now, or under a lock per engine?

Good point. If we have multiple engines, they shouldn't use the same lock for getting the socket context.
#36358 changes this lock, but shows little change.
It would be interesting to see if it matters more on top of #36518.

@benaadams (Member)

Worth a try?

@tmds (Member Author) commented Apr 8, 2019

> Worth a try?

Definitely. That lock was meant to be at the instance level.
I think Sébastien is out; I'll send him some builds and we'll have a look at the results when he gets to it.

@stephentoub added the *NO MERGE* label (the PR is not ready for merge yet; see discussion for detailed reasons) on Apr 8, 2019
@tmds (Member Author) commented Apr 13, 2019

I've sent builds to @sebastienros and will share the benchmark results when I have them.

@wfurt (Member) commented Apr 14, 2019

We should make sure we have the right set of benchmarks.
Changes like this may greatly benefit some workloads while hurting others. We have seen highly parallel usage with NuGet restore, for example. We have also seen some interesting behavior with other tests on systems with 46 cores. Even if that is not typical, we should understand how a single engine scales with an increasing number of cores.

@tmds (Member Author) commented Apr 17, 2019

I rebased this PR and #36518 on the same commit, which includes the `ConcurrentDictionary` lookup. That commit is the Baseline; MultiEngine is #36518 and SingleEngine is this PR.

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 393,190 | 96 | 178 | 1.09 | 296 | 273.25 | 0.57 | 1.00 |
| JsonMultiEngine | 396,144 | 96 | 177 | 1.1 | 297 | 271.16 | 0.51 | 1.01 |
| JsonSingleEngine | 421,054 | 95 | 179 | 1.03 | 292 | 265.5 | 0.57 | 1.07 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 2,163,842 | 98 | 65 | 1.6 | 299 | 123.6 | 0.5 | 1.00 |
| PlaintextMultiEngine | 2,202,285 | 98 | 65 | 1.59 | 304 | 120.74 | 0.65 | 1.02 |
| PlaintextSingleEngine | 2,245,635 | 96 | 65 | 1.14 | 296 | 119.09 | 0.46 | 1.04 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 497,360 | 97 | 65 | 1.12 | 307 | 122.19 | 0.57 | 1.00 |
| PlaintextNonPipelinedMultiEngine | 491,759 | 97 | 65 | 1.1 | 287 | 121.02 | 0.45 | 0.99 |
| PlaintextNonPipelinedSingleEngine | 531,290 | 97 | 65 | 1.2 | 288 | 117.93 | 0.51 | 1.07 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 43,400 | 94 | 174 | 5.94 | 304 | 131.8 | 0.91 | 1.00 |
| FortunesMultiEngine | 43,397 | 93 | 174 | 5.91 | 307 | 128.72 | 1.07 | 1.00 |
| FortunesSingleEngine | 43,762 | 94 | 173 | 5.82 | 303 | 131.06 | 1.04 | 1.01 |

SingleEngine shows the best performance for all benchmarks.
The Json and PlaintextNonPipelined benchmarks, which perform the most I/O operations, show +7% RPS.
Plaintext shows +4% and Fortunes +1% RPS.

imo, we can merge this based on these results.

CC @davidfowl

@sebastienros (Member)

These are for 6 cores / 12 HT. I will run them on 14 cores / 28 HT too to get more insight.

@benaadams (Member)

How many connections were in use for the above results, and what's the range as the number of connections increases?

e.g. from TE Round 17

*(image: TechEmpower Round 17 results across connection counts)*

Though this PR isn't only having one engine; it's starting with one engine and then allocating a new engine when full? (Though full is defined as `long.MaxValue` sockets allocated, which likely means one engine?)

@tmds (Member Author) commented Apr 17, 2019

> it's starting with one engine and then allocating a new engine when full? (Though full is defined as `long.MaxValue` sockets allocated, which likely means one engine?)

It allocates a new engine for every 32 connections in use.

@sebastienros (Member)

256 were used, and it's currently running on Citrine (the TE environment) with 256 too.

@tmds (Member Author) commented Apr 17, 2019

> 256 were used, and it's currently running on Citrine (the TE environment) with 256 too.

256 / 32 = 8 engines
The number of engines is also limited to half the number of processors.
256 connections on 6 cores -> 6 engines
256 connections on 14 cores -> 8 engines
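
As a rough sketch of that arithmetic (the constant and helper below are illustrative, not the actual corefx fields):

```cs
using System;

static class EngineCountSketch
{
    // Illustrative: pre-PR, a new engine is brought in roughly every 32 connections,
    // capped at half the processor count (the numbers quoted above).
    const int ConnectionsPerEngine = 32;

    public static int EstimateEngineCount(int connections, int processorCount)
    {
        int byConnections = (connections + ConnectionsPerEngine - 1) / ConnectionsPerEngine;
        int byProcessors = Math.Max(1, processorCount / 2);
        return Math.Min(byConnections, byProcessors);
    }
}

// EstimateEngineCount(256, 12) == 6   // 6 cores / 12 HT
// EstimateEngineCount(256, 28) == 8   // 14 cores / 28 HT
```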

@benaadams (Member)

> It allocates a new engine for every 32 connections in use.

Are you sure? In non-debug builds:

```cs
bool IsFull { get { return _nextHandle == MaxHandles; } }

IntPtr MaxHandles = IntPtr.Size == 4 ? (IntPtr)int.MaxValue : (IntPtr)long.MaxValue;

if (engine.IsFull)
{
    Current = engine = new SocketAsyncEngine();
}
```

So if you had 68k open sockets on a server it would only have one event loop for all of them?

I might be reading it wrong?

@tmds (Member Author) commented Apr 17, 2019

Other side of the diff:

```cs
// Round-robin to the next engine once we have sufficient sockets on this one.
if (!engine.HasLowNumberOfSockets)
{
    s_allocateFromEngine = (s_allocateFromEngine + 1) % EngineCount;
}
```

@benaadams (Member)

That's deleted in this PR though?

@tmds (Member Author) commented Apr 17, 2019

Yes, we are going to one engine in this PR. The allocation of additional engines that remains is there to deal with the exhaustion of unique tokens per engine. It's not meant as a way to distribute the load.

@benaadams (Member)

Is this to reduce contention on the ThreadPool, since the poll thread mostly just queues work to it?

However, it does also take a lock per loop item in `SocketAsyncContext.HandleEvents`.

What's the difference with a larger number of connections, e.g. 16,384, the TE upper limit?

An example would be Stack Overflow, which has 68k active sockets per server /cc @NickCraver

@benaadams (Member) commented Apr 17, 2019

*256 connections clearly look good 😄

@tmds (Member Author) commented Apr 18, 2019

On Citrine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 742,649 | 95 | 548 | 1.71 | 327 | 272.01 | 0.14 | 1.00 |
| JsonMultiEngine | 737,653 | 95 | 597 | 1.8 | 327 | 269.02 | 0.13 | 0.99 |
| JsonSingleEngine | 790,714 | 96 | 580 | 1.27 | 328 | 270.27 | 0.14 | 1.06 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 4,393,343 | 99 | 98 | 1.33 | 329 | 100.15 | 0.14 | 1.00 |
| PlaintextMultiEngine | 4,381,514 | 99 | 100 | 1.23 | 330 | 99.61 | 0.13 | 1.00 |
| PlaintextSingleEngine | 4,473,776 | 98 | 97 | 0.89 | 330 | 100.33 | 0.13 | 1.02 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 883,312 | 99 | 133 | 0.92 | 331 | 100.55 | 0.13 | 1.00 |
| PlaintextNonPipelinedMultiEngine | 867,084 | 99 | 129 | 0.87 | 333 | 99.38 | 0.13 | 0.98 |
| PlaintextNonPipelinedSingleEngine | 865,959 | 100 | 97 | 0.52 | 333 | 100.53 | 0.13 | 0.98 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 35,092 | 96 | 488 | 7.38 | 338 | 109.89 | 0.26 | 1.00 |
| FortunesMultiEngine | 35,951 | 96 | 505 | 7.2 | 341 | 108.91 | 0.25 | 1.02 |
| FortunesSingleEngine | 35,916 | 96 | 519 | 7.22 | 338 | 110.25 | 0.31 | 1.02 |

With 1024 connections on 6/12 cores:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonPlatformBaseline | 641,624 | 97 | 166 | 3.88 | 245 | 243.09 | 0.43 | 1.00 |
| JsonPlatformMultiEngine | 653,136 | 97 | 208 | 4.48 | 240 | 243.36 | 0.53 | 1.02 |
| JsonPlatformSingleEngine | 683,193 | 98 | 213 | 3.68 | 240 | 245.38 | 0.45 | 1.06 |

1024 connections on Citrine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonPlatformBaseline | 968,883 | 98 | 415 | 1.74 | 273 | 244.01 | 0.13 | 1.00 |
| JsonPlatformMultiEngine | 978,002 | 98 | 417 | 1.65 | 273 | 244.78 | 0.13 | 1.01 |
| JsonPlatformSingleEngine | 993,364 | 98 | 428 | 1.44 | 273 | 244.87 | 0.13 | 1.03 |

From these results, the machine with more cores shows less improvement. SingleEngine still outperforms MultiEngine.

> Is this to reduce contention on the ThreadPool, since the poll thread mostly just queues work to it?

It's to avoid side-effects of having too many poll threads.

> An example would be Stack Overflow, which has 68k active sockets per server

@NickCraver by 'active', do you mean those sockets are all actively making requests towards the server (or are there many sleeping connections)? How many 'processors' (`Environment.ProcessorCount`) are on a machine that handles those?

There are two parameters we can play with to see how they affect throughput:

- We can still have multiple engines, but use a lower maximum than `ProcessorCount / 2`.
- We can also look at the effect of `EventBufferCount`, which controls how many events we can get from epoll per call (currently 1024).

I can make some builds where we vary these parameters and see how they affect throughput at a high connection count.
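
For context, a minimal sketch (not the corefx implementation; the struct layout, callback, and names are illustrative, and the layout assumes the packed x86-64 `epoll_event`) of how a poll loop bounded by an `EventBufferCount`-sized buffer drains epoll:

```cs
using System;
using System.Runtime.InteropServices;

// Sketch only: the real engine goes through Interop.Sys and looks up a
// SocketAsyncContext by token before queueing work to the ThreadPool.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct EpollEvent
{
    public uint Events; // EPOLLIN / EPOLLOUT flags
    public long Data;   // token identifying the socket's context
}

static class PollLoopSketch
{
    const int EventBufferCount = 1024; // max events fetched per epoll_wait call

    [DllImport("libc", SetLastError = true)]
    static extern int epoll_wait(int epfd, [In, Out] EpollEvent[] events, int maxevents, int timeout);

    // Assumes the caller created the epoll fd and registered its sockets on it.
    public static void Run(int epollFd, Action<long, uint> dispatch)
    {
        var buffer = new EpollEvent[EventBufferCount];
        while (true)
        {
            // Blocks until at least one socket is ready; returns up to buffer.Length events.
            int count = epoll_wait(epollFd, buffer, buffer.Length, -1);
            if (count <= 0)
            {
                continue; // simplified: real code inspects errno (e.g. EINTR) and shuts down cleanly
            }
            for (int i = 0; i < count; i++)
            {
                dispatch(buffer[i].Data, buffer[i].Events);
            }
        }
    }
}
```

With a single engine, this buffer size bounds how many ready sockets one pass of the loop can pick up, which is why it becomes interesting at very high connection counts.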

@benaadams (Member)

Looks good; thank you for checking it.

> I can make some builds where we vary these parameters and see how they affect throughput at a high connection count.

Something for a follow-up, maybe?

The single engine hasn't peaked at 256 and then started under-performing against multi-engine at 1024, which was my worry. If multiple engines do have an advantage at higher connection counts, it seems the threshold for starting a new engine would have to be much higher than the current 32; and since this is still a clear win at 1024 connections, the PR seems good as is, rather than speculatively trying to find a point where multiple engines have an advantage.

@stephentoub (Member) commented Apr 18, 2019

Thanks for all the numbers. Do we know what accounts for the 2% RPS dip in PlaintextNonPipelined?

@tmds (Member Author) commented Apr 18, 2019

> Thanks for all the numbers. Do we know what accounts for the 2% RPS dip in PlaintextNonPipelined?

I don't know. My expectation here was to get results more similar to Json (as with the previous benchmark results). I'm not focusing on it because it's the least interesting benchmark.

@stephentoub (Member)

@davidfowl, considering that most numbers either stay the same or improve and this simplifies the code, assuming the small plaintext non-pipelined dip mentioned is real, are you ok with this?

@davidfowl (Member)

I like the change, but the fact that non-pipelined performance is still on a downward trend doesn't make me happy. We can merge this for now and watch it.

@davidfowl (Member)

One more thing we should measure before merging is the proxy scenario. That uses both the HttpClient (sockets-based) and the server in the same process. That should be interesting.

cc @sebastienros

@tmds (Member Author) commented Apr 19, 2019

> the fact that non-pipelined performance is still on a downward trend doesn't make me happy

Only for plaintext, not for json. This is surprising and not consistent with the earlier benchmark results. We should run the benchmark again.

> That uses both the HttpClient (sockets-based) and the server in the same process. That should be interesting.

Do you have such a benchmark available?
From the ones used here, Fortunes is the (only) one that also has some client-side usage of Sockets (for DB, not HTTP).

> We can also look at the effect of `EventBufferCount`, which controls how many events we can get from epoll per call (currently 1024).

I got some benchmark results from @sebastienros that show this PR is reducing performance with 64k connections: -23% RPS for Json.

I'm going to make some changes that increase this value, and then we can run another benchmark including 64k connections.

@tmds (Member Author) commented Apr 19, 2019

> Do you have such a benchmark available?

ah ok: the proxy scenario

@tmds (Member Author) commented Apr 19, 2019

Let me first have a look at `EventBufferCount` before we run another benchmark.

@tmds (Member Author) commented Apr 19, 2019

> Let me first have a look at `EventBufferCount` before we run another benchmark.

Implemented in this PR.

> We can still have multiple engines, but use a lower maximum than `ProcessorCount / 2`.

On https://github.com/tmds/corefx/tree/increase_min_handles I've raised the bar for creating additional engines. That means we should still get the gains we have benchmarked so far.

We can benchmark these against one another to see what makes more sense for dealing with a high connection count (e.g. 65k).
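
Roughly, the idea on that branch is something like the following sketch (the constant, its value, and the helper shape are hypothetical; the branch picks its own threshold and wiring):

```cs
using System;

static class EngineSelectionSketch
{
    // Hypothetical threshold, much higher than the previous 32 connections, so an
    // additional engine only comes into play at a genuinely high socket count.
    const int MinHandlesForAdditionalEngine = 32 * 1024;

    static int s_allocateFromEngine;

    // handleCounts[i] = number of sockets currently tracked by engine i.
    public static int PickEngine(int[] handleCounts)
    {
        int current = s_allocateFromEngine;
        if (handleCounts[current] >= MinHandlesForAdditionalEngine)
        {
            // Only round-robin onward once the current engine is loaded enough.
            s_allocateFromEngine = (current + 1) % handleCounts.Length;
        }
        return current;
    }
}
```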

@tmds (Member Author) commented Apr 19, 2019

I sent some more dlls to @sebastienros for benchmarking.

@tmds (Member Author) commented Apr 23, 2019

No benchmark results yet for this branch. It seems there is a bug causing the client connections to be refused. I need to investigate.

The results for https://github.com/tmds/corefx/tree/increase_min_handles are good:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 80,093 | 72 | 892 | 8.55 | 344 | 271.4 | 0.17 | 1.00 |
| JsonMinHandles | 78,910 | 78 | 895 | 6.67 | 344 | 271.07 | 0.15 | 0.99 |

@tmds (Member Author) commented May 9, 2019

More benchmark results.
This is with 65k connections. Because the numbers are lower, I guess this is the 6-core/12-HT machine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 11,151 | 60 | 844 | 7.49 | 278 | 210.82 | 0.66 | 1.00 |
| JsonLargeThreshold | 12,876 | 65 | 888 | 6.8 | 274 | 211.24 | 0.55 | 1.15 |
| JsonSingleEngine | 10,256 | 67 | 831 | 9.86 | 287 | 209.31 | 0.57 | 0.92 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 29,205 | 56 | 438 | 51.61 | 285 | 104.34 | 0.52 | 1.00 |
| PlaintextLargeThreshold | 27,345 | 45 | 420 | 33.9 | 284 | 101.72 | 0.56 | 0.94 |
| PlaintextSingleEngine | 35,265 | 49 | 404 | 30.81 | 282 | 104.74 | 0.67 | 1.21 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 17,611 | 65 | 518 | 11.7 | 281 | 105.18 | 0.48 | 1.00 |
| PlaintextNonPipelinedLargeThreshold | 13,792 | 60 | 512 | 10.29 | 278 | 105.84 | 0.67 | 0.78 |
| PlaintextNonPipelinedSingleEngine | 16,992 | 62 | 523 | 10.88 | 282 | 104.03 | 0.53 | 0.96 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 14,188 | 75 | 1,547 | 69.35 | 285 | 113.41 | 1.11 | 1.00 |
| FortunesLargeThreshold | 10,024 | 82 | 1,461 | 57.64 | 285 | 112.11 | 0.86 | 0.71 |
| FortunesSingleEngine | 8,523 | 83 | 1,526 | 43.49 | 288 | 109.72 | 0.9 | 0.60 |

No clear winner stands out here. Which configuration performs better depends on the benchmark (which is strange to me, especially for Json vs Plaintext vs PlaintextNonPipelined). Fortunes shows a large regression.

It's not possible to pick between these without prioritizing benchmarks. Therefore I'm inclined to leave things as is, unless someone decides which benchmarks are more important.

I've had great assistance from @sebastienros running these benchmarks for me. However, I'd be much more productive (and not feel like I'm wasting his time) if I could run the benchmarks myself.

@stephentoub (Member)

> Therefore I'm inclined to leave things as is, unless someone decides which benchmarks are more important.

In other words, close this PR?

I'm ok with that.

Thanks for the investigation.
