
Conversation

@tmds (Member) commented Apr 1, 2019

No description provided.

@tmds (Member Author) commented Apr 1, 2019

This is intended for benchmarking. I've sent compiled dlls to @sebastienros. There's no need to review before we have benchmark results.
My expectation is this will perform better.
If it doesn't, I'll tackle https://github.com/dotnet/corefx/issues/25253.

CC @stephentoub @geoffkizer @davidfowl

@tmds (Member Author) commented Apr 1, 2019

Linux arm_64 release build failed with:

Installing Microsoft.DotNet.Arcade.Sdk 1.0.0-beta.19177.11.
  Zipping directory "/__w/1/s/artifacts/bin/testhost/netcoreapp-Linux-Release-arm64/" to "/__w/1/s/artifacts/helix/runtime/test-runtime-netcoreapp-Linux-Release-arm64.zip".
  Results will be available from https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F36518~2Fmerge/test~2Ffunctional~2Fcli~2F/20190401.1
  Sent Helix Job b296e0b8-d9c4-422a-99ec-91c2ac54b993
  Waiting for completion of job b296e0b8-d9c4-422a-99ec-91c2ac54b993
  Job b296e0b8-d9c4-422a-99ec-91c2ac54b993 is completed with 211 finished work items.
/home/vsts_azpcontainer/.nuget/packages/microsoft.dotnet.helix.sdk/2.0.0-beta.19177.11/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(67,5): error : This task can only be used on completed jobs. There are 117 of 211 unfinished work items. [/__w/1/s/eng/sendtohelix.proj]

Build FAILED.

/home/vsts_azpcontainer/.nuget/packages/microsoft.dotnet.helix.sdk/2.0.0-beta.19177.11/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(67,5): error : This task can only be used on completed jobs. There are 117 of 211 unfinished work items. [/__w/1/s/eng/sendtohelix.proj]
    0 Warning(s)
    1 Error(s)

Time Elapsed 00:06:07.49
Build failed (exit code '1').

@stephentoub (Member) commented Apr 1, 2019

Perform socket operation on epoll thread
My expectation is this will perform better.

Can you elaborate on this? IIRC, this is close to what we previously had before @geoffkizer changed it explicitly to do the socket call on the thread that would then be processing the results; I believe this was done not only to offload work from the epoll thread, but also to improve locality of access and cache coherency with the follow-on work accessing the data. Geoff had done some extensive benchmarking at the time.
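
For context, here is a minimal sketch of the two models being compared. This is illustrative only: the method names and the byte[] hand-off are assumptions, not the actual SocketAsyncEngine code.

```csharp
// Illustrative sketch only -- not the actual SocketAsyncEngine code.
// PerformRecv and the byte[] hand-off are assumptions for illustration.
using System;
using System.Threading;

static class DispatchModels
{
    // This PR's model: do the socket call on the epoll thread itself,
    // then hand the already-read data to the threadpool.
    public static void OnReadableReadOnEpollThread(Action<byte[]> handleData)
    {
        byte[] data = PerformRecv();                         // syscall on the epoll thread
        ThreadPool.QueueUserWorkItem(_ => handleData(data));
    }

    // The post-#23115 model: queue only the readiness notification; the
    // threadpool thread issues the recv, so the follow-on work touches the
    // freshly read buffer on the same core (the locality argument above).
    public static void OnReadableReadOnWorkerThread(Action<byte[]> handleData)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            byte[] data = PerformRecv();                     // syscall on the worker thread
            handleData(data);
        });
    }

    private static byte[] PerformRecv() => Array.Empty<byte>(); // stand-in for recv(2)
}
```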

@davidsh added this to the 3.0 milestone Apr 1, 2019
@davidsh added the os-linux (Linux OS, any supported distro) label Apr 1, 2019
@davidfowl (Member)

@tmds there's a new draft PR feature you can use instead of [WIP]

@tmds (Member Author) commented Apr 1, 2019

I raised some concerns about the benchmarking (see #23115 (comment) and below). Different benchmarks I had done previously showed that performing those operations on the epoll thread is preferable. We can see how the change affects the ASP.NET Core benchmarks and decide based on that.

@stephentoub (Member)

We can see how the change affects the ASP.NET Core benchmarks and decide based on that.

If we want to decide based on that, we should make sure to run much more than just plaintext.

I'd also like @geoffkizer to weigh in.

@tmds (Member Author) commented Apr 4, 2019

Benchmark results:

Plaintext: -0.41%
Plaintext non pipelined: -0.56%
Json: +1.97%
Fortunes: +1.02%

The plaintext benchmarks perform worse. This is not a complete surprise: they match most closely the type of benchmark @geoffkizer performed in #23115.

I'm going to try some things, run some more benchmarks, and then share those results too.

@tmds (Member Author) commented Apr 5, 2019

Because the epoll threads seemed to be the bottleneck causing the plaintext perf regression, I increased the number of epoll threads to match ProcessorCount.
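
A rough sketch of that setup, one event loop per processor; the names and the elided loop body are invented for illustration, not the corefx implementation:

```csharp
// Rough sketch of the experiment: one epoll thread per processor.
// RunEpollLoop is a stand-in; the real engine lives in SocketAsyncEngine.
using System;
using System.Threading;

static class MultiEngineExperiment
{
    public static void Start()
    {
        int engineCount = Environment.ProcessorCount;   // the setting used above
        for (int i = 0; i < engineCount; i++)
        {
            var thread = new Thread(RunEpollLoop)
            {
                IsBackground = true,
                Name = "epoll-" + i
            };
            thread.Start();
        }
    }

    private static void RunEpollLoop()
    {
        // epoll_wait(2) and event processing elided.
    }
}
```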

Benchmark results:

Json: -1.77% (latency in ms: 1.12 -> 1.09)
Plaintext: -1.96% (latency in ms: 1.57 -> 1.61)
PlaintextNonPipelined: -1.39% (latency in ms: 1.06 -> 1.15)
Fortunes: -0.27% (latency in ms: 2.44 -> 2.18)

All of these are performing worse. I think this is caused by two things: additional context switches due to the extra threads, and a bigger mismatch between the epoll threads that bring in the work and the threadpool that does the work, which causes more items to wait in the threadpool queue.

It's possible to avoid this for cases that can complete synchronously by not going to the threadpool (that is: invoking the user callback from the epoll thread).
Note: we can't make this the default behaviour, because the risk is too high that someone blocks the epoll thread; it would require an API change to specifically request inline dispatch.
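
A rough sketch of what such an opt-in inline dispatch could look like; the enum and member names are invented for illustration and are not an API proposal:

```csharp
// Hypothetical sketch of an opt-in inline completion mode; the enum and
// member names are invented for illustration, not a corefx API proposal.
using System;
using System.Threading;

enum CompletionDispatch
{
    ThreadPool, // default: a blocking callback cannot stall the epoll thread
    Inline      // opt-in: run the callback on the epoll thread, saving a hop
}

sealed class CompletionDispatcher
{
    public CompletionDispatch Mode { get; set; } = CompletionDispatch.ThreadPool;

    public void Complete(Action userCallback)
    {
        if (Mode == CompletionDispatch.Inline)
        {
            // Avoids the threadpool queue/dequeue for synchronously-completable
            // work, but a blocking callback here stalls every socket served by
            // this epoll thread -- the risk noted above.
            userCallback();
        }
        else
        {
            ThreadPool.QueueUserWorkItem(_ => userCallback());
        }
    }
}
```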

Benchmark results:

Json: +7.73% (latency in ms: 1.12 -> 1.59)
Plaintext: +0.28% (latency in ms: 1.57 -> 1.58)
PlaintextNonPipelined: +4.64% (latency in ms: 1.06 -> 1.5)
Fortunes: -0.27% (latency in ms: 2.44 -> 2.1)

Based on these benchmarks, this is my conclusion:

  • For dealing with epoll thread exhaustion, using multiple epoll threads or dispatching the socket operation to the threadpool are competitive options.
  • By performing the socket operations on the epoll thread we can do some things that more significantly improve performance. But as long as we are not doing any of them, I think it makes sense to stick to the simpler implementation, which is: use a single epoll thread and dispatch to the threadpool.

I'm leaving it to @davidfowl and others to say whether adding an inline dispatch mode for socket operations is something they would consume to get the perf gain and can safely be made available to end-users (that is: avoiding the risk that users end up blocking the epoll thread).

One last thought: after looking once more at the TechEmpower benchmark results, I would like to see more benchmarks that show latency at a fixed throughput (wrk2-style) instead of RPS at max CPU (wrk-style).

@sebastienros thank you for benchmarking these changes!

@stephentoub (Member)

I think it makes sense to stick to the simpler implementation, which is: use a single epoll thread and dispatch to the threadpool.

Are there benchmarking results for that change? I didn't see them above, but maybe I missed it. Right now we will use multiple epoll threads on larger core counts, right?

whether adding an inline dispatch mode for socket operations is something they would consume to get the perf gain and can safely be made available to end-users

This is something we contemplated in the 2.1 timeframe. From my perspective currently, it would really just help with a vanity metric. Most systems would not be able to set that safely, and for most real-world usage the resulting difference in throughput would be negligible; it's only in these microbenchmarks that the difference shows up. My concern would then be that folks see "oh, I can set this same flag that gives these other benchmarks a boost" even when it's not appropriate from a reliability perspective, and their system actually tanks in production. I'm open to others' opinions, of course.

@tmds (Member Author) commented Apr 5, 2019

Are there benchmarking results for that change? I didn't see them above, but maybe I missed it. Right now we will use multiple epoll threads on larger core counts, right?

Yes, we are using multiple epoll threads, and https://github.com/dotnet/corefx/issues/25253 tracks removing them now that #23115 has happened.
No benchmark has been performed yet. Both were implemented as a way of avoiding the epoll thread becoming a bottleneck. With the epoll thread dispatching all 'real' work to the threadpool, I don't think it can be a bottleneck, but we need to run a benchmark when implementing the change to confirm that assumption.

@sebastienros (Member)

@tmds we do support wrk2 too, which uses fixed rates, and also bombardier, which gives better average latencies; the ones reported by wrk are usually odd.

@tmds (Member Author) commented Apr 8, 2019

Closing this one based on the benchmark results not showing a significant gain.
#36693 is for removing multiple epoll threads.

@tmds (Member Author) commented Apr 8, 2019

@benaadams pointed out that there is a global lock that introduces contention between the poll threads when they try to find the SocketContexts to process events. I'll fix that, and we can see if it has a measurable impact.
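
One plausible shape for that fix is a lock-free, handle-keyed lookup; a minimal sketch, assuming a ConcurrentDictionary (this is a guess at the shape of the change, not the actual corefx fix):

```csharp
// Minimal sketch of removing a shared lock from the event-to-context lookup.
// The ConcurrentDictionary keyed on the socket handle is an assumption about
// the shape of the fix, not the actual corefx change.
using System;
using System.Collections.Concurrent;

sealed class SocketContext { /* per-socket state */ }

sealed class EpollEngine
{
    // Lock-free reads: epoll threads resolve events to contexts without
    // contending on a global mutex.
    private readonly ConcurrentDictionary<IntPtr, SocketContext> _contexts =
        new ConcurrentDictionary<IntPtr, SocketContext>();

    public void Register(IntPtr handle, SocketContext context) =>
        _contexts[handle] = context;

    public void OnEvent(IntPtr handle)
    {
        if (_contexts.TryGetValue(handle, out SocketContext context))
        {
            // Process readiness for this context (elided).
        }
    }
}
```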

