
Conversation

@tmds (Member) commented Apr 1, 2019

No description provided.

@tmds (Member Author) commented Apr 1, 2019

This is intended for benchmarking. I've sent compiled dlls to @sebastienros. There's no need to review before we have benchmark results.
My expectation is this will perform better.
If it doesn't, I'll tackle https://github.com/dotnet/corefx/issues/25253.

CC @stephentoub @geoffkizer @davidfowl

@tmds (Member Author) commented Apr 1, 2019

Linux arm_64 release build failed with:

Installing Microsoft.DotNet.Arcade.Sdk 1.0.0-beta.19177.11.
  Zipping directory "/__w/1/s/artifacts/bin/testhost/netcoreapp-Linux-Release-arm64/" to "/__w/1/s/artifacts/helix/runtime/test-runtime-netcoreapp-Linux-Release-arm64.zip".
  Results will be available from https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F36518~2Fmerge/test~2Ffunctional~2Fcli~2F/20190401.1
  Sent Helix Job b296e0b8-d9c4-422a-99ec-91c2ac54b993
  Waiting for completion of job b296e0b8-d9c4-422a-99ec-91c2ac54b993
  Job b296e0b8-d9c4-422a-99ec-91c2ac54b993 is completed with 211 finished work items.
/home/vsts_azpcontainer/.nuget/packages/microsoft.dotnet.helix.sdk/2.0.0-beta.19177.11/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(67,5): error : This task can only be used on completed jobs. There are 117 of 211 unfinished work items. [/__w/1/s/eng/sendtohelix.proj]

Build FAILED.

/home/vsts_azpcontainer/.nuget/packages/microsoft.dotnet.helix.sdk/2.0.0-beta.19177.11/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(67,5): error : This task can only be used on completed jobs. There are 117 of 211 unfinished work items. [/__w/1/s/eng/sendtohelix.proj]
    0 Warning(s)
    1 Error(s)

Time Elapsed 00:06:07.49
Build failed (exit code '1').

@stephentoub (Member) commented Apr 1, 2019

Perform socket operation on epoll thread
My expectation is this will perform better.

Can you elaborate on this? IIRC, this is close to what we previously had before @geoffkizer changed it explicitly to do the socket call on the thread that would then be processing the results; I believe this was done not only to offload work from the epoll thread, but also to improve locality of access and cache coherency with the follow-on work accessing the data. Geoff had done some extensive benchmarking at the time.
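
For context, here is a minimal sketch of the two models being compared. This is illustrative only: the method names and the byte[] hand-off are assumptions, not the actual SocketAsyncEngine code.

```csharp
// Illustrative sketch only -- not the actual SocketAsyncEngine code.
// PerformRecv and the byte[] hand-off are assumptions for illustration.
using System;
using System.Threading;

static class DispatchModels
{
    // This PR's model: do the socket call on the epoll thread itself,
    // then hand the already-read data to the threadpool.
    public static void OnReadableReadOnEpollThread(Action<byte[]> handleData)
    {
        byte[] data = PerformRecv();                         // syscall on the epoll thread
        ThreadPool.QueueUserWorkItem(_ => handleData(data));
    }

    // The post-#23115 model: queue only the readiness notification; the
    // threadpool thread issues the recv, so the follow-on work touches the
    // freshly read buffer on the same core (the locality argument above).
    public static void OnReadableReadOnWorkerThread(Action<byte[]> handleData)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            byte[] data = PerformRecv();                     // syscall on the worker thread
            handleData(data);
        });
    }

    private static byte[] PerformRecv() => Array.Empty<byte>(); // stand-in for recv(2)
}
```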

@davidsh added this to the 3.0 milestone Apr 1, 2019
@davidsh added the os-linux (Linux OS, any supported distro) label Apr 1, 2019
@davidfowl (Member)

@tmds there's a new draft PR feature you can use instead of [WIP]

@tmds (Member Author) commented Apr 1, 2019

I raised some concerns about the benchmarking (see #23115 (comment) and below). Different benchmarks I had done previously showed that performing those operations on the epoll thread is preferable. We can see how the change affects the ASP.NET Core benchmarks and decide based on that.

@stephentoub (Member)

We can see how the change affects the ASP.NET Core benchmarks and decide based on that.

If we want to decide based on that, we should make sure to run much more than just plaintext.

I'd also like @geoffkizer to weigh in.

@tmds (Member Author) commented Apr 4, 2019

Benchmark results:

Plaintext: -0.41%
Plaintext non pipelined: -0.56%
Json: +1.97%
Fortunes: +1.02%

The plaintext benchmarks perform worse. This is not a complete surprise: they match most closely the type of benchmark @geoffkizer performed in #23115.

I'm going to try some things, run some more benchmarks, and then share those results too.

@tmds (Member Author) commented Apr 5, 2019

Because the epoll threads seemed to be the bottleneck causing the plaintext perf regression, I increased the number of epoll threads to match ProcessorCount.
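
A rough sketch of that setup, one event loop per processor; the names and the elided loop body are invented for illustration, not the corefx implementation:

```csharp
// Rough sketch of the experiment: one epoll thread per processor.
// RunEpollLoop is a stand-in; the real engine lives in SocketAsyncEngine.
using System;
using System.Threading;

static class MultiEngineExperiment
{
    public static void Start()
    {
        int engineCount = Environment.ProcessorCount;   // the setting used above
        for (int i = 0; i < engineCount; i++)
        {
            var thread = new Thread(RunEpollLoop)
            {
                IsBackground = true,
                Name = "epoll-" + i
            };
            thread.Start();
        }
    }

    private static void RunEpollLoop()
    {
        // epoll_wait(2) and event processing elided.
    }
}
```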

Benchmark results:

Json: -1.77% (latency in ms: 1.12 -> 1.09)
Plaintext: -1.96% (latency in ms: 1.57 -> 1.61)
PlaintextNonPipelined: -1.39% (latency in ms: 1.06 -> 1.15)
Fortunes: -0.27% (latency in ms: 2.44 -> 2.18)

All of these are performing worse. I think this is caused by two things: additional context switches due to the extra threads, and a bigger mismatch between the epoll threads that bring in the work and the threadpool that does the work, which causes more items to wait in the threadpool queue.

It's possible to avoid this for cases that can complete synchronously by not going to the threadpool (that is: invoking the user callback from the epoll thread).
Note: we can't make this the default behaviour, because the risk is too high that someone blocks the epoll thread; it would require an API change to specifically request inline dispatch.
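
A rough sketch of what such an opt-in inline dispatch could look like; the enum and member names are invented for illustration and are not an API proposal:

```csharp
// Hypothetical sketch of an opt-in inline completion mode; the enum and
// member names are invented for illustration, not a corefx API proposal.
using System;
using System.Threading;

enum CompletionDispatch
{
    ThreadPool, // default: a blocking callback cannot stall the epoll thread
    Inline      // opt-in: run the callback on the epoll thread, saving a hop
}

sealed class CompletionDispatcher
{
    public CompletionDispatch Mode { get; set; } = CompletionDispatch.ThreadPool;

    public void Complete(Action userCallback)
    {
        if (Mode == CompletionDispatch.Inline)
        {
            // Avoids the threadpool queue/dequeue for synchronously-completable
            // work, but a blocking callback here stalls every socket served by
            // this epoll thread -- the risk noted above.
            userCallback();
        }
        else
        {
            ThreadPool.QueueUserWorkItem(_ => userCallback());
        }
    }
}
```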

Benchmark results:

Json: +7.73% (latency in ms: 1.12 -> 1.59)
Plaintext: +0.28% (latency in ms: 1.57 -> 1.58)
PlaintextNonPipelined: +4.64% (latency in ms: 1.06 -> 1.5)
Fortunes: -0.27% (latency in ms: 2.44 -> 2.1)

Based on these benchmarks, this is my conclusion:

  • For dealing with epoll thread exhaustion, using multiple epoll threads or dispatching the socket operation to the threadpool are competitive options.
  • By performing the socket operations on the epoll thread we can do some things that more significantly improve performance. But as long as we are not doing any of them, I think it makes sense to stick to the simpler implementation, which is: use a single epoll thread and dispatch to the threadpool.

I'm leaving it to @davidfowl and others to say whether adding an inline dispatch mode for socket operations is something they would consume to get the perf gain and can safely be made available to end-users (that is: avoiding the risk that users end up blocking the epoll thread).

One last thought: after looking once more at the TechEmpower benchmark results, I would like to see more benchmarks that show latency at a fixed throughput (wrk2-style) instead of RPS at max CPU (wrk-style).

@sebastienros thank you for benchmarking these changes!

@stephentoub (Member)

I think it makes sense to stick to the simpler implementation, which is: use a single epoll thread and dispatch to the threadpool.

Are there benchmarking results for that change? I didn't see them above, but maybe I missed it. Right now we will use multiple epoll threads on larger core counts, right?

whether adding an inline dispatch mode for socket operations is something they would consume to get the perf gain and can safely be made available to end-users

This is something we contemplated in the 2.1 timeframe. From my perspective currently, it would really just help with a vanity metric. Most systems would not be able to set that safely, and for most real-world usage the resulting difference in throughput would be negligible; it's only in these microbenchmarks that the difference shows up. My concern would then be that folks see "oh, I can set this same flag that gives these other benchmarks a boost" even when it's not appropriate from a reliability perspective, and their system actually tanks in production. I'm open to others' opinions, of course.

@tmds (Member Author) commented Apr 5, 2019

Are there benchmarking results for that change? I didn't see them above, but maybe I missed it. Right now we will use multiple epoll threads on larger core counts, right?

Yes, we are using multiple epoll threads, and https://github.com/dotnet/corefx/issues/25253 tracks removing them now that #23115 has happened.
No benchmark has been performed yet. Both were implemented as a way of avoiding the epoll thread becoming a bottleneck. With the epoll thread dispatching all 'real' work to the threadpool, I don't think it can be a bottleneck, but we need to run a benchmark when implementing the change to confirm that assumption.

@sebastienros (Member)

@tmds we do support wrk2 too, which uses fixed rates, and also bombardier, which gives better average latencies; the ones reported by wrk are usually odd.

@tmds (Member Author) commented Apr 8, 2019

Closing this one based on the benchmark results not showing a significant gain.
#36693 is for removing multiple epoll threads.

@tmds (Member Author) commented Apr 8, 2019

@benaadams pointed out that there is a global lock that introduces contention between the poll threads when they try to find the SocketContexts to process events. I'll fix that, and we can see if it has a measurable impact.
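
One plausible shape for that fix is a lock-free, handle-keyed lookup; a minimal sketch, assuming a ConcurrentDictionary (this is a guess at the shape of the change, not the actual corefx fix):

```csharp
// Minimal sketch of removing a shared lock from the event-to-context lookup.
// The ConcurrentDictionary keyed on the socket handle is an assumption about
// the shape of the fix, not the actual corefx change.
using System;
using System.Collections.Concurrent;

sealed class SocketContext { /* per-socket state */ }

sealed class EpollEngine
{
    // Lock-free reads: epoll threads resolve events to contexts without
    // contending on a global mutex.
    private readonly ConcurrentDictionary<IntPtr, SocketContext> _contexts =
        new ConcurrentDictionary<IntPtr, SocketContext>();

    public void Register(IntPtr handle, SocketContext context) =>
        _contexts[handle] = context;

    public void OnEvent(IntPtr handle)
    {
        if (_contexts.TryGetValue(handle, out SocketContext context))
        {
            // Process readiness for this context (elided).
        }
    }
}
```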

