Skip to content

Conversation

@kouvel
Copy link
Contributor

@kouvel kouvel commented Aug 21, 2021

  • Currently up to proc count thread requests are made, one when each work item is enqueued and one when each work item is dequeued. Some of the conditions for making a thread request are speculative, making it difficult to reduce the cap without running into issues (see Thread pool - Fix a couple of things that were reverted #8951 (comment)).
  • When the thread pool is not fully loaded, more threads are requested than necessary, causing more threads to wake up and compete for work items, and this shows a perf degradation at some point as load is decreased
  • Fixed by using a similar solution to this. With this change, at most one thread is requested at a time. A thread pool thread requests another thread for parallelization of work after dequeuing the first work item and does not request any more threads, leaving it up to whichever thread services the new thread request to parallelize further.
  • After this, there was a regression in ASP.NET benchmarks on arm64 with default number of connections, see Maybe spin-waits in the thread pool can be tuned better for arm/arm64 #39559 for more info. Increased the spin-waiting duration on thread pool worker threads to compensate. Spin-waiting longer is preferrable to busy-waiting and competing with other threads for work items, and hitting a full wait and waking up frequently.
  • The change seems to solve the perf cliff seen at some point as load is decreased, especially on arm64. A portion of the perf improvements seen on arm64 at lower load is due to the increase in spin-waiting. On x64 at full load there's not much change in perf, maybe a small improvement in FortunesPlatform and Fortunes. On arm64 at full load there seem to be some improvements in Json/Fortunes-like benchmarks.

Fixes #39559

@kouvel kouvel added this to the 7.0.0 milestone Aug 21, 2021
@kouvel kouvel requested review from janvorli and mangod9 August 21, 2021 18:18
@kouvel kouvel self-assigned this Aug 21, 2021
@ghost
Copy link

ghost commented Aug 21, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details
  • Currently up to proc count thread requests are made, one when each work item is enqueued and one when each work item is dequeued. Some of the conditions for making a thread request are speculative, making it difficult to reduce the cap without running into issues (see Thread pool - Fix a couple of things that were reverted #8951 (comment)).
  • When the thread pool is not fully loaded, more threads are requested than necessary, causing more threads to wake up and compete for work items, and this shows a perf degradation at some point as load is decreased
  • Fixed by using a similar solution to this. With this change, at most one thread is requested at a time. A thread pool thread requests another thread for parallelization of work after dequeuing the first work item and does not request any more threads, leaving it up to whichever thread services the new thread request to parallelize further.
  • After this, there was a regression in ASP.NET benchmarks on arm64 with default number of connections, see Maybe spin-waits in the thread pool can be tuned better for arm/arm64 #39559 for more info. Increased the spin-waiting duration on thread pool worker threads to compensate. Spin-waiting longer is preferrable to busy-waiting and competing with other threads for work items, and hitting a full wait and waking up frequently.
  • The change seems to solve the perf cliff seen at some point as load is decreased, especially on arm64. A portion of the perf improvements seen on arm64 at lower load is due to the increase in spin-waiting. On x64 at full load there's not much change in perf, maybe a small improvement in FortunesPlatform and Fortunes. On arm64 at full load there seem to be some improvements in Json/Fortunes-like benchmarks.

Fixes #39559

Author: kouvel
Assignees: kouvel
Labels:

area-System.Threading

Milestone: 7.0.0

@kouvel
Copy link
Contributor Author

kouvel commented Aug 21, 2021

The first commit is a non-functional change if it would be easier to review the commits separately.

@kouvel
Copy link
Contributor Author

kouvel commented Aug 21, 2021

Perf results

14-core 28-thread x64 processor

Test Connections Before After Diff
PlaintextPlatform 1024 11778884.9 11812301.7 0.3%
. 128 9345579.8 9602478.2 2.7%
. 64 5967326.6 6351665.8 6.4%
. 32 3370511.5 3699054.9 9.7%
. 16 1876116.3 2026980.7 8.0%
JsonPlatform 512 1282081.7 1282930.9 0.1%
. 128 904762.4 935978.1 3.5%
. 64 553516.6 597037.8 7.9%
. 32 312666.4 345400.7 10.5%
. 16 176371.1 202064.6 14.6%
FortunesPlatform 512 360573.2 362949.5 0.7%
. 128 293084.4 300313.4 2.5%
. 64 189093.8 196991.2 4.2%
. 32 111167.8 118190.1 6.3%
. 16 66368.9 68662.4 3.5%
Plaintext 256 4831994.3 4823856.7 -0.2%
. 128 4368565.4 4408030.3 0.9%
. 64 3444864.3 3460070.8 0.4%
. 32 2251174.7 2265688.4 0.6%
. 16 1155317.8 1262074.0 9.2%
Json 256 1004373.5 1009589.5 0.5%
. 128 851413.4 871906.0 2.4%
. 64 509635.9 564147.1 10.7%
. 32 302007.1 331372.8 9.7%
. 16 171973.3 186460.9 8.4%
Fortunes 256 360573.2 362949.5 0.7%
. 128 279174.8 286798.8 2.7%
. 64 180147.9 184358.2 2.3%
. 32 107639.5 113557.6 5.5%
. 16 63590.0 66166.3 4.1%

32-core arm64 processor

Test Connections Before After Diff
PlaintextPlatform 1024 6952199.3 6954765.0 0.0%
. 64 4692982.7 4905584.7 4.5%
. 32 1911023.9 3179312.0 66.4%
. 16 736905.6 1635691.6 122.0%
. 8 467633.3 846071.1 80.9%
JsonPlatform 512 644628.6 671099.2 4.1%
. 64 418614.3 437252.8 4.5%
. 32 290946.9 297934.9 2.4%
. 16 72023.7 164891.6 128.9%
. 8 44577.3 77018.2 72.8%
FortunesPlatform 512 95295.3 106695.5 12.0%
. 64 29753.1 58789.1 97.6%
. 32 19522.5 43028.3 120.4%
. 16 13108.5 29179.6 122.6%
. 8 8834.0 18195.0 106.0%
Plaintext 256 2625118.8 2644431.0 0.7%
. 64 1923785.4 1963492.3 2.1%
. 32 1340738.3 1361157.3 1.5%
. 16 285234.3 788014.0 176.3%
. 8 177535.4 415107.9 133.8%
Json 256 429996.6 437539.6 1.8%
. 64 346248.3 351532.4 1.5%
. 32 240169.7 243038.3 1.2%
. 16 50730.1 132630.4 161.4%
. 8 34659.1 62971.0 81.7%
Fortunes 256 71565.5 83441.6 16.6%
. 64 23091.1 42949.7 86.0%
. 32 15635.9 31614.4 102.2%
. 16 11090.8 22191.8 100.1%
. 8 7481.5 14996.7 100.4%

Koundinya Veluri added 2 commits August 26, 2021 15:48
- Currently up to proc count yet-to-be-serviced thread requests are made when each work item is enqueued and when each work item is dequeued. Some of the conditions for making a thread request are speculative, making it difficult to reduce the cap without running into issues (see #8951 (comment)).
- When the thread pool is not fully loaded, more threads are requested than necessary, causing more threads to wake up and compete for work items, and this shows a perf degradation at some point as load is decreased
- Fixed by using a similar solution to https://github.com/dotnet/runtime/blob/50576e326d1015906608e3c06670344e335c3225/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs#L209. With this change, at most one thread is requested at a time. A thread pool thread requests another thread for parallelization of work after dequeuing the first work item and does not request any more threads, leaving it up to whichever thread services the new thread request to parallelize further.
- After this, there was a regression in ASP.NET benchmarks on arm64 with default number of connections, see #39559 for more info. Increased the spin-waiting duration on thread pool worker threads to compensate. Spin-waiting longer is preferrable to busy-waiting and competing with other threads for work items, and hitting a full wait and waking up frequently.
- The change seems to solve the perf cliff seen at some point as load is decreased, especially on arm64. A portion of the perf improvements seen on arm64 at lower load is due to the increase in spin-waiting.

Fixes #39559
@kouvel
Copy link
Contributor Author

kouvel commented Aug 26, 2021

Rebased

@kouvel kouvel closed this Sep 1, 2021
@kouvel kouvel reopened this Sep 1, 2021
@kouvel
Copy link
Contributor Author

kouvel commented Sep 9, 2021

Failure is known and unrelated

@kouvel kouvel merged commit 32fed09 into dotnet:main Sep 9, 2021
@kouvel kouvel deleted the RequestFewerThreads branch September 9, 2021 17:03
@kunalspathak
Copy link
Contributor

kunalspathak commented Sep 14, 2021

Nice improvements on windows-x64: dotnet/perf-autofiling-issues#1366 and dotnet/perf-autofiling-issues#1339

@kunalspathak
Copy link
Contributor

Ubuntu/x64 improvements: dotnet/perf-autofiling-issues#1330

@kunalspathak
Copy link
Contributor

Arm64 improvements: dotnet/perf-autofiling-issues#1448

@EgorBo
Copy link
Member

EgorBo commented Sep 23, 2021

arm64 improvements: dotnet/perf-autofiling-issues#1543

@EgorBo
Copy link
Member

EgorBo commented Sep 23, 2021

arm64 improvements: dotnet/perf-autofiling-issues#1542

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
@sebastienros
Copy link
Member

Can these changes explain a perf improvement on Windows for Fortunes?
We are going from 236K to 280K on Windows with a range that contains this PR: ef85762...c21a701

@kouvel
Copy link
Contributor Author

kouvel commented Nov 9, 2021

Can these changes explain a perf improvement on Windows for Fortunes?

Possibly, it seems like the likely change out of the set. Fortunes queues fewer work items to the worker pool, so it would probably benefit more from this change compared to the faster platform benchmarks, but typically thread pool changes (at least on Linux) have not changed perf much on Fortunes, so not sure. Things work a bit differently on Windows though, as the IO pool threads may be competing for time with the worker pool threads.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Maybe spin-waits in the thread pool can be tuned better for arm/arm64

5 participants