This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@tmds (Member) commented Apr 8, 2019

The poll thread doesn't perform much work, so having 1 should be enough.

@tmds (Member Author) commented Apr 8, 2019

cc @stephentoub @geoffkizer

To be benchmarked to verify this doesn't regress performance.

@benaadams (Member) commented Apr 8, 2019

Prior to this change, should it be doing everything under a static lock (`lock (s_lock)`) as it does now, or under a lock per engine?

For example, the field `_handleToContextMap` is accessed under the static lock even though there is one map per engine; this means the maps can only be used single-threaded even though there are many of them.

Would it not be better to use a per-instance lock so these per-engine fields can be used in parallel? Otherwise it defeats the purpose of having multiple engines?
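
For illustration only, a minimal sketch of the difference (the type and member names below are hypothetical, not the actual corefx code):

```cs
using System;
using System.Collections.Generic;

// Hypothetical engine type, used only to illustrate lock granularity;
// this is not the actual SocketAsyncEngine code.
class Engine
{
    // Static lock: every engine serializes on the same object.
    static readonly object s_lock = new object();

    // Instance lock: each engine serializes only access to its own map.
    private readonly object _lock = new object();
    private readonly Dictionary<IntPtr, object> _handleToContextMap = new Dictionary<IntPtr, object>();

    public object GetContextUnderStaticLock(IntPtr handle)
    {
        lock (s_lock) // contended across all engines
        {
            _handleToContextMap.TryGetValue(handle, out object context);
            return context;
        }
    }

    public object GetContextUnderInstanceLock(IntPtr handle)
    {
        lock (_lock) // engines can look up their contexts in parallel
        {
            _handleToContextMap.TryGetValue(handle, out object context);
            return context;
        }
    }
}
```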

@benaadams (Member) commented Apr 8, 2019

e.g. should it currently be doing something like this?

#36697

@tmds (Member Author) commented Apr 8, 2019

> Prior to this change, should it be doing everything under a static lock (`lock (s_lock)`) as it does now, or under a lock per engine?

Good point. If we have multiple engines, they shouldn't use the same lock for getting the socket context.
#36358 changes this lock, but shows little change.
It would be interesting to see if it matters more on top of #36518.

@benaadams (Member)

Worth a try?

@tmds (Member Author) commented Apr 8, 2019

> Worth a try?

Definitely. That lock was meant to be at the instance level.
I think Sébastien is out; I'll send him some builds and we'll have a look at the results when he gets to it.

@stephentoub added the *NO MERGE* label (the PR is not ready for merge yet; see discussion for detailed reasons) on Apr 8, 2019
@tmds (Member Author) commented Apr 13, 2019

I've sent builds to @sebastienros and will share the benchmark results when I have them.

@wfurt (Member) commented Apr 14, 2019

We should make sure we have the right set of benchmarks.
Changes like this may greatly benefit some workloads while hurting others. We have seen highly parallel usage with NuGet restore, for example. We have also seen some interesting behavior with other tests on systems with 46 cores. Even if that is not typical, we should understand how a single engine scales with an increasing number of cores.

@tmds (Member Author) commented Apr 17, 2019

I rebased this PR and #36518 on the same commit, which includes the `ConcurrentDictionary` lookup. That commit is the Baseline; MultiEngine is #36518 and SingleEngine is this PR.

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 393,190 | 96 | 178 | 1.09 | 296 | 273.25 | 0.57 | 1.00 |
| JsonMultiEngine | 396,144 | 96 | 177 | 1.1 | 297 | 271.16 | 0.51 | 1.01 |
| JsonSingleEngine | 421,054 | 95 | 179 | 1.03 | 292 | 265.5 | 0.57 | 1.07 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 2,163,842 | 98 | 65 | 1.6 | 299 | 123.6 | 0.5 | 1.00 |
| PlaintextMultiEngine | 2,202,285 | 98 | 65 | 1.59 | 304 | 120.74 | 0.65 | 1.02 |
| PlaintextSingleEngine | 2,245,635 | 96 | 65 | 1.14 | 296 | 119.09 | 0.46 | 1.04 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 497,360 | 97 | 65 | 1.12 | 307 | 122.19 | 0.57 | 1.00 |
| PlaintextNonPipelinedMultiEngine | 491,759 | 97 | 65 | 1.1 | 287 | 121.02 | 0.45 | 0.99 |
| PlaintextNonPipelinedSingleEngine | 531,290 | 97 | 65 | 1.2 | 288 | 117.93 | 0.51 | 1.07 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 43,400 | 94 | 174 | 5.94 | 304 | 131.8 | 0.91 | 1.00 |
| FortunesMultiEngine | 43,397 | 93 | 174 | 5.91 | 307 | 128.72 | 1.07 | 1.00 |
| FortunesSingleEngine | 43,762 | 94 | 173 | 5.82 | 303 | 131.06 | 1.04 | 1.01 |

SingleEngine shows the best performance for all benchmarks.
The Json and PlaintextNonPipelined benchmarks, which perform the most I/O operations, show +7% RPS.
Plaintext shows +4% and Fortunes +1% RPS.

imo, we can merge this based on these results.

CC @davidfowl

@sebastienros (Member)

These are for 6 cores / 12 HT. I will run them on 14 cores / 28 HT too to get more insight.

@benaadams (Member)

How many connections were in use for the above results, and what's the range as the number of connections increases?

e.g. from TE Round 17

*(image: TechEmpower Round 17 results across connection counts)*

Though this PR isn't only having one engine; it's starting with one engine and then allocating a new engine when full? (Though full is defined as `long.MaxValue` sockets allocated, which likely means one engine?)

@tmds (Member Author) commented Apr 17, 2019

> it's starting with one engine and then allocating a new engine when full? (Though full is defined as `long.MaxValue` sockets allocated, which likely means one engine?)

It allocates a new engine for every 32 connections in use.

@sebastienros (Member)

256 were used, and it's currently running on Citrine (the TE environment) with 256 too.

@tmds (Member Author) commented Apr 17, 2019

> 256 were used, and it's currently running on Citrine (the TE environment) with 256 too.

256 / 32 = 8 engines
The number of engines is also limited to half the number of processors.
256 connections on 6 cores -> 6 engines
256 connections on 14 cores -> 8 engines
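
As a rough sketch of that arithmetic (the constant and helper below are illustrative, not the actual corefx fields):

```cs
using System;

static class EngineCountSketch
{
    // Illustrative: pre-PR, a new engine is brought in roughly every 32 connections,
    // capped at half the processor count (the numbers quoted above).
    const int ConnectionsPerEngine = 32;

    public static int EstimateEngineCount(int connections, int processorCount)
    {
        int byConnections = (connections + ConnectionsPerEngine - 1) / ConnectionsPerEngine;
        int byProcessors = Math.Max(1, processorCount / 2);
        return Math.Min(byConnections, byProcessors);
    }
}

// EstimateEngineCount(256, 12) == 6   // 6 cores / 12 HT
// EstimateEngineCount(256, 28) == 8   // 14 cores / 28 HT
```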

@benaadams (Member)

> It allocates a new engine for every 32 connections in use.

Are you sure? In non-debug builds:

```cs
bool IsFull { get { return _nextHandle == MaxHandles; } }

IntPtr MaxHandles = IntPtr.Size == 4 ? (IntPtr)int.MaxValue : (IntPtr)long.MaxValue;

if (engine.IsFull)
{
    Current = engine = new SocketAsyncEngine();
}
```

So if you had 68k open sockets on a server it would only have one event loop for all of them?

I might be reading it wrong?

@tmds (Member Author) commented Apr 17, 2019

Other side of the diff:

```cs
// Round-robin to the next engine once we have sufficient sockets on this one.
if (!engine.HasLowNumberOfSockets)
{
    s_allocateFromEngine = (s_allocateFromEngine + 1) % EngineCount;
}
```

@benaadams (Member)

That's deleted in this PR though?

@tmds (Member Author) commented Apr 17, 2019

Yes, we are going to one engine in this PR. The allocation of additional engines that remains is there to deal with the exhaustion of unique tokens per engine. It's not meant as a way to distribute the load.

@benaadams (Member)

Is this to reduce contention on the ThreadPool, since the poll thread mostly just queues work to it?

However, it does also take a lock per loop item in `SocketAsyncContext.HandleEvents`.

What's the difference with a larger number of connections, e.g. 16,384, the TE upper limit?

An example would be Stack Overflow, which has 68k active sockets per server /cc @NickCraver

@benaadams (Member) commented Apr 17, 2019

*256 connections clearly look good 😄

@tmds (Member Author) commented Apr 18, 2019

On Citrine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 742,649 | 95 | 548 | 1.71 | 327 | 272.01 | 0.14 | 1.00 |
| JsonMultiEngine | 737,653 | 95 | 597 | 1.8 | 327 | 269.02 | 0.13 | 0.99 |
| JsonSingleEngine | 790,714 | 96 | 580 | 1.27 | 328 | 270.27 | 0.14 | 1.06 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 4,393,343 | 99 | 98 | 1.33 | 329 | 100.15 | 0.14 | 1.00 |
| PlaintextMultiEngine | 4,381,514 | 99 | 100 | 1.23 | 330 | 99.61 | 0.13 | 1.00 |
| PlaintextSingleEngine | 4,473,776 | 98 | 97 | 0.89 | 330 | 100.33 | 0.13 | 1.02 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 883,312 | 99 | 133 | 0.92 | 331 | 100.55 | 0.13 | 1.00 |
| PlaintextNonPipelinedMultiEngine | 867,084 | 99 | 129 | 0.87 | 333 | 99.38 | 0.13 | 0.98 |
| PlaintextNonPipelinedSingleEngine | 865,959 | 100 | 97 | 0.52 | 333 | 100.53 | 0.13 | 0.98 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 35,092 | 96 | 488 | 7.38 | 338 | 109.89 | 0.26 | 1.00 |
| FortunesMultiEngine | 35,951 | 96 | 505 | 7.2 | 341 | 108.91 | 0.25 | 1.02 |
| FortunesSingleEngine | 35,916 | 96 | 519 | 7.22 | 338 | 110.25 | 0.31 | 1.02 |

With 1024 connections on 6/12 cores:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonPlatformBaseline | 641,624 | 97 | 166 | 3.88 | 245 | 243.09 | 0.43 | 1.00 |
| JsonPlatformMultiEngine | 653,136 | 97 | 208 | 4.48 | 240 | 243.36 | 0.53 | 1.02 |
| JsonPlatformSingleEngine | 683,193 | 98 | 213 | 3.68 | 240 | 245.38 | 0.45 | 1.06 |

1024 connections on Citrine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonPlatformBaseline | 968,883 | 98 | 415 | 1.74 | 273 | 244.01 | 0.13 | 1.00 |
| JsonPlatformMultiEngine | 978,002 | 98 | 417 | 1.65 | 273 | 244.78 | 0.13 | 1.01 |
| JsonPlatformSingleEngine | 993,364 | 98 | 428 | 1.44 | 273 | 244.87 | 0.13 | 1.03 |

From these results, the machine with more cores shows less improvement. SingleEngine still outperforms MultiEngine.

> Is this to reduce contention on the ThreadPool, since the poll thread mostly just queues work to it?

It's to avoid side-effects of having too many poll threads.

> An example would be Stack Overflow, which has 68k active sockets per server

@NickCraver by 'active', do you mean those sockets are all actively making requests towards the server (or are there many sleeping connections)? How many 'processors' (`Environment.ProcessorCount`) are on a machine that handles those?

There are two parameters we can play with to see how they affect throughput:

- We can still have multiple engines, but use a lower maximum than `ProcessorCount / 2`.
- We can also look at the effect of `EventBufferCount`, which controls how many events we can get from epoll per call (currently 1024).

I can make some builds where we vary these parameters and see how they affect throughput at a high connection count.
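
For context, a minimal sketch (not the corefx implementation; the struct layout, callback, and names are illustrative, and the layout assumes the packed x86-64 `epoll_event`) of how a poll loop bounded by an `EventBufferCount`-sized buffer drains epoll:

```cs
using System;
using System.Runtime.InteropServices;

// Sketch only: the real engine goes through Interop.Sys and looks up a
// SocketAsyncContext by token before queueing work to the ThreadPool.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct EpollEvent
{
    public uint Events; // EPOLLIN / EPOLLOUT flags
    public long Data;   // token identifying the socket's context
}

static class PollLoopSketch
{
    const int EventBufferCount = 1024; // max events fetched per epoll_wait call

    [DllImport("libc", SetLastError = true)]
    static extern int epoll_wait(int epfd, [In, Out] EpollEvent[] events, int maxevents, int timeout);

    // Assumes the caller created the epoll fd and registered its sockets on it.
    public static void Run(int epollFd, Action<long, uint> dispatch)
    {
        var buffer = new EpollEvent[EventBufferCount];
        while (true)
        {
            // Blocks until at least one socket is ready; returns up to buffer.Length events.
            int count = epoll_wait(epollFd, buffer, buffer.Length, -1);
            if (count <= 0)
            {
                continue; // simplified: real code inspects errno (e.g. EINTR) and shuts down cleanly
            }
            for (int i = 0; i < count; i++)
            {
                dispatch(buffer[i].Data, buffer[i].Events);
            }
        }
    }
}
```

With a single engine, this buffer size bounds how many ready sockets one pass of the loop can pick up, which is why it becomes interesting at very high connection counts.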

@benaadams (Member)

Looks good; thank you for checking it.

> I can make some builds where we vary these parameters and see how they affect throughput at a high connection count.

Something for a follow-up, maybe?

The single engine hasn't peaked at 256 and then started under-performing against multi-engine at 1024, which was my worry. If multiple engines do have an advantage at higher connection counts, it seems the threshold for starting a new engine would have to be much higher than the current 32; and since this is still a clear win at 1024 connections, the PR seems good as is, rather than speculatively trying to find a point where multiple engines have an advantage.

@stephentoub (Member) commented Apr 18, 2019

Thanks for all the numbers. Do we know what accounts for the 2% RPS dip in PlaintextNonPipelined?

@tmds (Member Author) commented Apr 18, 2019

> Thanks for all the numbers. Do we know what accounts for the 2% RPS dip in PlaintextNonPipelined?

I don't know. My expectation here was to get results more similar to Json (as with the previous benchmark results). I'm not focusing on it because it's the least interesting benchmark.

@stephentoub (Member)

@davidfowl, considering that most numbers either stay the same or improve and this simplifies the code, assuming the small plaintext non-pipelined dip mentioned is real, are you ok with this?

@davidfowl (Member)

I like the change, but the fact that non-pipelined performance is still on a downward trend doesn't make me happy. We can merge this for now and watch it.

@davidfowl (Member)

One more thing we should measure before merging is the proxy scenario. That uses both the HttpClient (sockets-based) and the server in the same process. That should be interesting.

cc @sebastienros

@tmds (Member Author) commented Apr 19, 2019

> the fact that non-pipelined performance is still on a downward trend doesn't make me happy

Only for plaintext, not for json. This is surprising and not consistent with the earlier benchmark results. We should run the benchmark again.

> That uses both the HttpClient (sockets-based) and the server in the same process. That should be interesting.

Do you have such a benchmark available?
From the ones used here, Fortunes is the (only) one that also has some client-side usage of Sockets (for DB, not HTTP).

> We can also look at the effect of `EventBufferCount`, which controls how many events we can get from epoll per call (currently 1024).

I got some benchmark results from @sebastienros that show this PR is reducing performance with 64k connections: -23% RPS for Json.

I'm going to make some changes that increase this value, and then we can run another benchmark including 64k connections.

@tmds (Member Author) commented Apr 19, 2019

> Do you have such a benchmark available?

ah ok: the proxy scenario

@tmds (Member Author) commented Apr 19, 2019

Let me first have a look at `EventBufferCount` before we run another benchmark.

@tmds (Member Author) commented Apr 19, 2019

> Let me first have a look at `EventBufferCount` before we run another benchmark.

Implemented in this PR.

> We can still have multiple engines, but use a lower maximum than `ProcessorCount / 2`.

On https://github.com/tmds/corefx/tree/increase_min_handles I've raised the bar for creating additional engines. That means we should still get the gains we have benchmarked so far.

We can benchmark these against one another to see what makes more sense for dealing with a high connection count (e.g. 65k).
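
Roughly, the idea on that branch is something like the following sketch (the constant, its value, and the helper shape are hypothetical; the branch picks its own threshold and wiring):

```cs
using System;

static class EngineSelectionSketch
{
    // Hypothetical threshold, much higher than the previous 32 connections, so an
    // additional engine only comes into play at a genuinely high socket count.
    const int MinHandlesForAdditionalEngine = 32 * 1024;

    static int s_allocateFromEngine;

    // handleCounts[i] = number of sockets currently tracked by engine i.
    public static int PickEngine(int[] handleCounts)
    {
        int current = s_allocateFromEngine;
        if (handleCounts[current] >= MinHandlesForAdditionalEngine)
        {
            // Only round-robin onward once the current engine is loaded enough.
            s_allocateFromEngine = (current + 1) % handleCounts.Length;
        }
        return current;
    }
}
```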

@tmds (Member Author) commented Apr 19, 2019

I sent some more dlls to @sebastienros for benchmarking.

@tmds (Member Author) commented Apr 23, 2019

No benchmark results yet for this branch. It seems there is a bug causing the client connections to be refused. I need to investigate.

The results for https://github.com/tmds/corefx/tree/increase_min_handles are good:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 80,093 | 72 | 892 | 8.55 | 344 | 271.4 | 0.17 | 1.00 |
| JsonMinHandles | 78,910 | 78 | 895 | 6.67 | 344 | 271.07 | 0.15 | 0.99 |

@tmds (Member Author) commented May 9, 2019

More benchmark results.
This is with 65k connections. Because the numbers are lower, I guess this is the 6-core/12-HT machine:

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JsonBaseline | 11,151 | 60 | 844 | 7.49 | 278 | 210.82 | 0.66 | 1.00 |
| JsonLargeThreshold | 12,876 | 65 | 888 | 6.8 | 274 | 211.24 | 0.55 | 1.15 |
| JsonSingleEngine | 10,256 | 67 | 831 | 9.86 | 287 | 209.31 | 0.57 | 0.92 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextBaseline | 29,205 | 56 | 438 | 51.61 | 285 | 104.34 | 0.52 | 1.00 |
| PlaintextLargeThreshold | 27,345 | 45 | 420 | 33.9 | 284 | 101.72 | 0.56 | 0.94 |
| PlaintextSingleEngine | 35,265 | 49 | 404 | 30.81 | 282 | 104.74 | 0.67 | 1.21 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextNonPipelinedBaseline | 17,611 | 65 | 518 | 11.7 | 281 | 105.18 | 0.48 | 1.00 |
| PlaintextNonPipelinedLargeThreshold | 13,792 | 60 | 512 | 10.29 | 278 | 105.84 | 0.67 | 0.78 |
| PlaintextNonPipelinedSingleEngine | 16,992 | 62 | 523 | 10.88 | 282 | 104.03 | 0.53 | 0.96 |

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FortunesBaseline | 14,188 | 75 | 1,547 | 69.35 | 285 | 113.41 | 1.11 | 1.00 |
| FortunesLargeThreshold | 10,024 | 82 | 1,461 | 57.64 | 285 | 112.11 | 0.86 | 0.71 |
| FortunesSingleEngine | 8,523 | 83 | 1,526 | 43.49 | 288 | 109.72 | 0.9 | 0.60 |

No clear winner stands out here. Which configuration performs better depends on the benchmark (which is strange to me, especially for Json vs Plaintext vs PlaintextNonPipelined). Fortunes shows a large regression.

It's not possible to pick between these without prioritizing benchmarks. Therefore I'm inclined to leave things as is, unless someone decides which benchmarks are more important.

I've had great assistance from @sebastienros running these benchmarks for me. However, I'd be much more productive (and not feel like I'm wasting his time) if I could run the benchmarks myself.

@stephentoub (Member)

> Therefore I'm inclined to leave things as is, unless someone decides which benchmarks are more important.

In other words, close this PR?

I'm ok with that.

Thanks for the investigation.
