This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

kouvel commented Aug 30, 2017

Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor

  • The delay incurred by YieldProcessor is measured once lazily at run-time
  • Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration (a rough sketch of the idea follows this list).
  • Changed Thread.SpinWait to use YieldProcessorNormalized
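
A rough C# illustration of the normalization idea follows (the actual measurement lives in the runtime's native code); the class name, target duration, and measurement constants below are illustrative assumptions, not the PR's implementation.

using System;
using System.Diagnostics;
using System.Threading;

// Sketch only: approximates the "measure once lazily, then yield for a fixed duration" idea.
internal static class YieldProcessorNormalizationSketch
{
    // How many Thread.SpinWait(1)-style yields add up to roughly the target delay.
    private static int s_yieldsPerNormalizedYield = 1;
    private static bool s_measured;

    public static void YieldProcessorNormalized()
    {
        if (!s_measured)
            Measure();
        Thread.SpinWait(s_yieldsPerNormalizedYield);
    }

    private static void Measure()
    {
        const int MeasureYieldCount = 100000;
        const double TargetNs = 37.5; // ~125 cycles at ~3.3 GHz - an assumed target, not the PR's constant

        // Time a fixed number of yields, then compute how many yields approximate the target delay.
        var sw = Stopwatch.StartNew();
        Thread.SpinWait(MeasureYieldCount);
        sw.Stop();

        double nsPerYield = sw.Elapsed.TotalMilliseconds * 1000000.0 / MeasureYieldCount;
        s_yieldsPerNormalizedYield = Math.Max(1, (int)Math.Round(TargetNs / nsPerYield));
        s_measured = true;
    }
}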

Thread.SpinWait divide count by 7 experiment

  • At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in the PR below (a hedged sketch of the experiment follows this list).
  • Not too many regressions, and the overall perf is roughly as expected - not much change on the Sandy Bridge processor, a significant improvement on the Skylake processor.
    • I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
    • ReaderWriterLockSlim seems to perform worse on Skylake; the current spin heuristics are not translating well
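
As a hedged illustration only, the experiment is conceptually equivalent to the wrapper below (a hypothetical helper; the experiment actually changed Thread.SpinWait itself, and the constant 7 comes from the Sandy Bridge measurement above).

using System;
using System.Threading;

internal static class SpinWaitDivideBy7Experiment
{
    // Experiment only (later abandoned): with each iteration now being a normalized yield
    // (== 7 old YieldProcessor calls on Sandy Bridge), dividing the requested count by 7 keeps
    // the total delay on that processor roughly the same as before normalization.
    public static void SpinWaitScaled(int iterations)
    {
        Thread.SpinWait(Math.Max(1, iterations / 7));
    }
}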

Spin tuning

  • At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
  • General observations
    • YieldProcessor stage
      • At this stage, in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last spin iteration, the YieldProcessor spin amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
      • Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
      • Long spins on YieldProcessor do in some cases help to reduce contention in high-contention scenarios, effectively taking some threads out of the picture with a long delay. Sleep(1) works much better for that but has a much higher delay, so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor spin. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
      • Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
    • Yield/Sleep(0) stage
      • Observed a couple of issues:
        • When there are no threads to switch to, Yield and Sleep(0) become no-ops and turn the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
        • If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
      • I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor; it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold (a sketch of the resulting SpinOnce strategy follows this list).
    • Sleep(1) stage
      • I didn't see any benefit in the tests from interleaving Sleep(1) calls with some Yield/Sleep(0) calls; perf actually seemed to be a bit worse. If the Sleep(1) stage is reached, there is probably a lot of contention, and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in between seems to add back some of that contention.
        • Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
      • For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
        • Spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available) benefit from not doing Sleep(1) at all and from spinning longer in the other stages
        • Infinite spin loops usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors such as how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
        • Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
  • SpinWait - Tweaked the spin strategy as mentioned above
  • ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
  • Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
  • SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
  • SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
  • ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
  • The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
    • On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
    • On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.
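
A minimal sketch of the SpinOnce strategy described above follows; the threshold constants and spin-count growth are illustrative assumptions, not the values used in the PR.

using System;
using System.Threading;

// Sketch only: interleaves Yield/Sleep(0) with short SpinWait calls, caps the spin duration,
// and falls back to Sleep(1) on every iteration once the Sleep(1) threshold is reached.
internal struct SpinWaitSketch
{
    private const int YieldThreshold = 10;    // assumed: when to start yielding
    private const int Sleep1Threshold = 20;   // assumed: when to switch to Sleep(1) only
    private const int MaxSpinExponent = 7;    // assumed: cap on the per-call spin count (2^7)

    private int _count;

    public void SpinOnce()
    {
        if (_count >= YieldThreshold &&
            (_count >= Sleep1Threshold || (_count - YieldThreshold) % 2 == 0))
        {
            if (_count >= Sleep1Threshold)
            {
                // Heavy contention: take this thread out of the mix for a while.
                Thread.Sleep(1);
            }
            else if ((_count - YieldThreshold) % 4 == 2)
            {
                // Occasionally let lower-priority threads run too.
                Thread.Sleep(0);
            }
            else
            {
                Thread.Yield();
            }
        }
        else
        {
            // YieldProcessor stage, interleaved with the yields above: grow the spin duration,
            // but cap it so a single SpinOnce never becomes a very long busy-spin.
            Thread.SpinWait(1 << Math.Min(_count, MaxSpinExponent));
        }

        _count = (_count == int.MaxValue) ? YieldThreshold : _count + 1;
    }
}

The PR also adds an internal SpinOnce overload that takes the Sleep(1) threshold as a parameter, so individual callers can tune when the Sleep(1) stage begins.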

kouvel added the area-System.Threading and tenet-performance labels Aug 30, 2017
kouvel added this to the 2.1.0 milestone Aug 30, 2017
kouvel self-assigned this Aug 30, 2017
kouvel (Author) commented Aug 30, 2017

Numbers from Thread.SpinWait divide count by 7 experiment:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score        Right score       ∆ Score %
------------------------------------------  ----------------  ----------------  ---------
BarrierSyncRate 1Pc                            131.65 ±0.13%     146.93 ±0.32%     11.61%
ConcurrentQueueThroughput 1Pc               16092.28 ±12.34%  16040.14 ±11.77%     -0.32%
ConcurrentStackThroughput 1Pc                39470.21 ±0.42%   38814.31 ±0.95%     -1.66%
CountdownEventAddCountSignalThroughput 1Pc    7088.57 ±1.33%    6872.57 ±0.76%     -3.05%
MresWaitDrainRate 1Pc                          556.26 ±0.71%     555.32 ±0.51%     -0.17%
MresWaitDrainRate 1Pc Delay                    524.40 ±1.22%     525.69 ±0.67%      0.24%
MresWaitDrainRate 2Pc                          679.15 ±0.52%     683.60 ±1.00%      0.66%
MresWaitDrainRate 2Pc Delay                    668.11 ±1.76%     673.72 ±0.98%      0.84%
MresWaitLatency 1Pc                            505.88 ±0.38%     497.54 ±0.88%     -1.65%
MresWaitLatency 1Pc Delay                      442.80 ±0.51%     442.81 ±0.71%      0.00%
SemaphoreSlimLatency 1Pc                       114.14 ±0.50%     110.93 ±1.42%     -2.81%
SemaphoreSlimLatency 1Pc Delay                 117.67 ±0.49%     118.51 ±0.48%      0.71%
SemaphoreSlimLatency 2Pc                        70.93 ±1.04%      76.91 ±0.88%      8.43%
SemaphoreSlimLatency 2Pc Delay                  91.47 ±1.90%     100.60 ±0.78%      9.99%
SemaphoreSlimThroughput 1Pc                   455.94 ±60.07%    183.44 ±25.25%    -59.77%
SemaphoreSlimWaitDrainRate 1Pc                  71.31 ±0.95%      71.86 ±1.89%      0.77%
SemaphoreSlimWaitDrainRate 1Pc Delay            68.33 ±2.13%      69.88 ±1.63%      2.27%
SemaphoreSlimWaitDrainRate 2Pc                  93.52 ±2.42%      91.41 ±2.91%     -2.26%
SemaphoreSlimWaitDrainRate 2Pc Delay            86.89 ±2.17%      90.84 ±0.71%      4.55%
SpinLockLatency 1Pc                            286.86 ±1.00%     284.45 ±0.95%     -0.84%
SpinLockLatency 1Pc Delay                      216.97 ±0.72%     212.25 ±0.69%     -2.18%
SpinLockLatency 2Pc                            142.92 ±2.09%     149.15 ±1.39%      4.36%
SpinLockLatency 2Pc Delay                       75.96 ±4.76%      80.89 ±4.59%      6.49%
SpinLockThroughput 1Pc                       44828.02 ±0.48%   44630.26 ±0.33%     -0.44%
------------------------------------------  ----------------  ----------------  ---------
Total                                          418.59 ±5.39%     408.67 ±2.77%     -2.37%

RwSB vs RwS                         Left score       Right score       ∆ Score %
----------------------------------  ---------------  ----------------  ---------
Concurrency_OnlyReadersPcx01        23249.72 ±0.26%   22543.68 ±0.44%     -3.04%
Concurrency_OnlyReadersPcx04        23244.97 ±0.11%   22974.79 ±0.52%     -1.16%
Concurrency_OnlyReadersPcx16        22999.13 ±0.06%   22638.62 ±0.79%     -1.57%
Concurrency_OnlyReadersPcx64        15791.12 ±4.44%   16368.50 ±5.28%      3.66%
Concurrency_OnlyWritersPcx01        22007.28 ±0.97%   23679.77 ±0.70%      7.60%
Concurrency_OnlyWritersPcx04        21556.55 ±1.41%   23412.23 ±1.16%      8.61%
Concurrency_OnlyWritersPcx16        21269.57 ±1.14%   23823.90 ±0.76%     12.01%
Concurrency_OnlyWritersPcx64        22935.28 ±1.47%   21531.70 ±1.55%     -6.12%
Concurrency_Pcx01Readers_01Writers  10900.17 ±2.26%   11174.93 ±6.12%      2.52%
Concurrency_Pcx01Readers_02Writers  7239.27 ±12.43%   6809.70 ±10.01%     -5.93%
Concurrency_Pcx04Readers_01Writers  16317.54 ±3.14%   13822.82 ±7.76%    -15.29%
Concurrency_Pcx04Readers_02Writers  14166.41 ±5.33%   10170.40 ±9.35%    -28.21%
Concurrency_Pcx04Readers_04Writers  15039.61 ±5.99%   15157.21 ±7.66%      0.78%
Concurrency_Pcx16Readers_01Writers  9526.18 ±17.75%  10408.31 ±18.95%      9.26%
Concurrency_Pcx16Readers_02Writers  7491.28 ±17.99%   3812.76 ±34.55%    -49.10%
Concurrency_Pcx16Readers_04Writers  8238.04 ±18.55%   17888.85 ±9.92%    117.15%
Concurrency_Pcx16Readers_08Writers  15473.47 ±7.52%   17661.60 ±7.00%     14.14%
Concurrency_Pcx64Readers_01Writers   1621.35 ±7.95%   2945.33 ±30.30%     81.66%
Concurrency_Pcx64Readers_02Writers  6725.27 ±21.51%   6391.88 ±26.82%     -4.96%
Concurrency_Pcx64Readers_04Writers  15767.55 ±7.69%  15424.21 ±10.83%     -2.18%
Concurrency_Pcx64Readers_08Writers  15849.51 ±6.85%   17318.19 ±6.69%      9.27%
Concurrency_Pcx64Readers_16Writers  21549.13 ±3.39%   20514.35 ±3.81%     -4.80%
----------------------------------  ---------------  ----------------  ---------
Total                               13437.83 ±6.98%   13773.15 ±9.71%      2.50%

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
BarrierSyncRate 1Pc                          5693.90 ±3.09%   9364.70 ±1.50%     64.47%
ConcurrentQueueThroughput 1Pc               37904.54 ±4.21%  25803.97 ±6.12%    -31.92%
ConcurrentStackThroughput 1Pc               47125.33 ±0.11%  48910.94 ±0.17%      3.79%
CountdownEventAddCountSignalThroughput 1Pc  34265.28 ±0.53%  13560.28 ±1.37%    -60.43%
MresWaitDrainRate 1Pc                         338.35 ±0.82%    699.29 ±0.35%    106.68%
MresWaitDrainRate 1Pc Delay                   342.14 ±0.50%    657.57 ±0.49%     92.19%
MresWaitDrainRate 2Pc                         634.33 ±0.19%    984.25 ±0.47%     55.16%
MresWaitDrainRate 2Pc Delay                   612.98 ±0.09%    884.72 ±0.48%     44.33%
MresWaitLatency 1Pc                           414.26 ±0.49%    610.93 ±0.36%     47.48%
MresWaitLatency 1Pc Delay                     454.59 ±1.06%    578.27 ±0.31%     27.21%
SemaphoreSlimLatency 1Pc                      351.74 ±0.48%    253.97 ±1.18%    -27.80%
SemaphoreSlimLatency 1Pc Delay                207.57 ±1.14%    167.06 ±0.82%    -19.52%
SemaphoreSlimLatency 2Pc                       51.93 ±3.80%     47.84 ±6.23%     -7.88%
SemaphoreSlimLatency 2Pc Delay                 46.49 ±3.01%     31.28 ±4.39%    -32.71%
SemaphoreSlimThroughput 1Pc                 14368.74 ±0.89%  14531.60 ±1.36%      1.13%
SemaphoreSlimWaitDrainRate 1Pc                 21.04 ±1.99%     65.49 ±1.59%    211.30%
SemaphoreSlimWaitDrainRate 1Pc Delay           21.28 ±2.45%     61.86 ±1.77%    190.74%
SemaphoreSlimWaitDrainRate 2Pc                 25.50 ±0.43%     86.66 ±0.47%    239.85%
SemaphoreSlimWaitDrainRate 2Pc Delay           25.19 ±0.43%     83.42 ±0.47%    231.10%
SpinLockLatency 1Pc                           337.00 ±0.47%    392.09 ±0.72%     16.35%
SpinLockLatency 1Pc Delay                     326.97 ±1.27%    342.38 ±1.18%      4.71%
SpinLockLatency 2Pc                           164.61 ±2.36%    173.41 ±2.06%      5.35%
SpinLockLatency 2Pc Delay                     148.40 ±3.75%    148.05 ±3.77%     -0.24%
SpinLockThroughput 1Pc                      55420.72 ±0.32%  58856.20 ±0.48%      6.20%
------------------------------------------  ---------------  ---------------  ---------
Total                                         536.33 ±1.42%    687.48 ±1.60%     28.18%

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         27479.34 ±0.15%   25137.00 ±1.24%     -8.52%
Concurrency_OnlyReadersPcx04         27464.91 ±0.17%   27044.91 ±0.27%     -1.53%
Concurrency_OnlyReadersPcx16         26662.72 ±0.52%   26741.52 ±0.67%      0.30%
Concurrency_OnlyReadersPcx64         26062.34 ±0.37%   26194.72 ±0.56%      0.51%
Concurrency_OnlyWritersPcx01         27062.37 ±1.15%   25318.99 ±1.46%     -6.44%
Concurrency_OnlyWritersPcx04         23594.37 ±3.73%   22894.77 ±3.36%     -2.97%
Concurrency_OnlyWritersPcx16         27225.09 ±1.94%   22369.25 ±2.08%    -17.84%
Concurrency_OnlyWritersPcx64        17451.93 ±11.31%   20954.91 ±3.54%     20.07%
Concurrency_Pcx01Readers_01Writers   7739.63 ±10.32%    9596.75 ±7.24%     23.99%
Concurrency_Pcx01Readers_02Writers   4714.65 ±14.40%   6436.92 ±16.55%     36.53%
Concurrency_Pcx04Readers_01Writers   9490.11 ±12.25%  11517.93 ±11.15%     21.37%
Concurrency_Pcx04Readers_02Writers   5379.94 ±17.06%    9158.23 ±8.38%     70.23%
Concurrency_Pcx04Readers_04Writers   5575.55 ±25.88%   5534.47 ±15.56%     -0.74%
Concurrency_Pcx16Readers_01Writers   7841.46 ±16.61%   8558.61 ±27.37%      9.15%
Concurrency_Pcx16Readers_02Writers   4355.57 ±12.71%   5121.77 ±32.87%     17.59%
Concurrency_Pcx16Readers_04Writers   2760.20 ±15.98%   5366.95 ±20.87%     94.44%
Concurrency_Pcx16Readers_08Writers   4930.88 ±26.49%   4958.81 ±26.60%      0.57%
Concurrency_Pcx64Readers_01Writers   9728.04 ±17.06%    229.29 ±11.63%    -97.64%
Concurrency_Pcx64Readers_02Writers   5646.32 ±14.26%     185.42 ±9.17%    -96.72%
Concurrency_Pcx64Readers_04Writers    4520.20 ±5.61%    716.59 ±50.04%    -84.15%
Concurrency_Pcx64Readers_08Writers    4649.90 ±8.66%   1404.82 ±41.50%    -69.79%
Concurrency_Pcx64Readers_16Writers   5842.78 ±14.01%   5082.68 ±24.73%    -13.01%
----------------------------------  ----------------  ----------------  ---------
Total                                9705.62 ±10.83%   6630.10 ±15.71%    -31.69%

kouvel (Author) commented Aug 30, 2017

Numbers from this PR:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score        Right score      ∆ Score %
------------------------------------------  ----------------  ---------------  ---------
BarrierSyncRate 1Pc                            131.65 ±0.13%    156.62 ±0.09%     18.97%
ConcurrentQueueThroughput 1Pc               16092.28 ±12.34%  37180.17 ±1.16%    131.04%
ConcurrentStackThroughput 1Pc                39470.21 ±0.42%  40290.88 ±0.94%      2.08%
CountdownEventAddCountSignalThroughput 1Pc    7088.57 ±1.33%  33169.51 ±1.76%    367.93%
MresWaitDrainRate 1Pc                          556.26 ±0.71%    630.82 ±0.42%     13.40%
MresWaitDrainRate 1Pc Delay                    524.40 ±1.22%    591.94 ±1.39%     12.88%
MresWaitDrainRate 2Pc                          679.15 ±0.52%    787.68 ±1.46%     15.98%
MresWaitDrainRate 2Pc Delay                    668.11 ±1.76%    793.68 ±0.27%     18.80%
MresWaitLatency 1Pc                            505.88 ±0.38%    571.14 ±0.55%     12.90%
MresWaitLatency 1Pc Delay                      442.80 ±0.51%    563.03 ±0.47%     27.15%
SemaphoreSlimLatency 1Pc                       114.14 ±0.50%    228.34 ±1.02%    100.06%
SemaphoreSlimLatency 1Pc Delay                 117.67 ±0.49%    170.44 ±3.62%     44.84%
SemaphoreSlimLatency 2Pc                        70.93 ±1.04%    220.51 ±1.75%    210.89%
SemaphoreSlimLatency 2Pc Delay                  91.47 ±1.90%    142.45 ±2.65%     55.74%
SemaphoreSlimThroughput 1Pc                   455.94 ±60.07%   914.93 ±12.88%    100.67%
SemaphoreSlimWaitDrainRate 1Pc                  71.31 ±0.95%    474.65 ±7.27%    565.60%
SemaphoreSlimWaitDrainRate 1Pc Delay            68.33 ±2.13%   293.44 ±18.24%    329.45%
SemaphoreSlimWaitDrainRate 2Pc                  93.52 ±2.42%    597.74 ±3.01%    539.17%
SemaphoreSlimWaitDrainRate 2Pc Delay            86.89 ±2.17%    605.61 ±2.29%    596.97%
SpinLockLatency 1Pc                            286.86 ±1.00%    291.97 ±1.29%      1.78%
SpinLockLatency 1Pc Delay                      217.72 ±0.81%    209.04 ±2.73%     -3.99%
SpinLockLatency 2Pc                            142.92 ±2.09%    269.79 ±1.03%     88.77%
SpinLockLatency 2Pc Delay                       75.96 ±4.76%    179.64 ±1.71%    136.49%
SpinLockThroughput 1Pc                       44828.02 ±0.48%  48026.99 ±0.77%      7.14%
------------------------------------------  ----------------  ---------------  ---------
Total                                          418.65 ±5.39%    799.70 ±2.96%     91.02%

RwSB vs RwS                         Left score       Right score       ∆ Score %
----------------------------------  ---------------  ----------------  ---------
Concurrency_OnlyReadersPcx01        23249.72 ±0.26%   23281.16 ±0.11%      0.14%
Concurrency_OnlyReadersPcx04        23244.97 ±0.11%   22884.67 ±0.17%     -1.55%
Concurrency_OnlyReadersPcx16        22999.13 ±0.06%   22711.34 ±0.13%     -1.25%
Concurrency_OnlyReadersPcx64        15791.12 ±4.44%   15103.34 ±2.92%     -4.36%
Concurrency_OnlyWritersPcx01        22007.28 ±0.97%   23716.49 ±0.33%      7.77%
Concurrency_OnlyWritersPcx04        21556.55 ±1.41%   23451.14 ±0.42%      8.79%
Concurrency_OnlyWritersPcx16        21269.57 ±1.14%   23611.47 ±0.40%     11.01%
Concurrency_OnlyWritersPcx64        22935.28 ±1.47%   22004.41 ±0.59%     -4.06%
Concurrency_Pcx01Readers_01Writers  10900.17 ±2.26%   10834.01 ±4.16%     -0.61%
Concurrency_Pcx01Readers_02Writers  7239.27 ±12.43%    7426.22 ±8.20%      2.58%
Concurrency_Pcx04Readers_01Writers  16317.54 ±3.14%   16688.11 ±2.54%      2.27%
Concurrency_Pcx04Readers_02Writers  14166.41 ±5.33%   12205.80 ±3.45%    -13.84%
Concurrency_Pcx04Readers_04Writers  15039.61 ±5.99%   10169.08 ±3.46%    -32.38%
Concurrency_Pcx16Readers_01Writers  9526.18 ±17.75%   16779.35 ±5.23%     76.14%
Concurrency_Pcx16Readers_02Writers  7491.28 ±17.99%   14328.17 ±5.72%     91.26%
Concurrency_Pcx16Readers_04Writers  8238.04 ±18.55%   12386.79 ±5.58%     50.36%
Concurrency_Pcx16Readers_08Writers  15473.47 ±7.52%   14103.08 ±8.87%     -8.86%
Concurrency_Pcx64Readers_01Writers   1621.35 ±7.95%   15344.30 ±6.25%    846.39%
Concurrency_Pcx64Readers_02Writers  6725.27 ±21.51%   13493.06 ±5.50%    100.63%
Concurrency_Pcx64Readers_04Writers  15767.55 ±7.69%   11061.25 ±6.63%    -29.85%
Concurrency_Pcx64Readers_08Writers  15849.51 ±6.85%   12762.56 ±9.17%    -19.48%
Concurrency_Pcx64Readers_16Writers  21549.13 ±3.39%  12718.30 ±11.93%    -40.98%
----------------------------------  ---------------  ----------------  ---------
Total                               13437.83 ±6.98%   15420.22 ±4.23%     14.75%

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score       Right score      ∆ Score %
------------------------------------------  ---------------  ---------------  ---------
BarrierSyncRate 1Pc                          5693.90 ±3.09%   7549.66 ±6.62%     32.59%
ConcurrentQueueThroughput 1Pc               37904.54 ±4.21%  47712.92 ±1.50%     25.88%
ConcurrentStackThroughput 1Pc               47125.33 ±0.11%  49046.62 ±0.26%      4.08%
CountdownEventAddCountSignalThroughput 1Pc  34265.28 ±0.53%  24094.85 ±3.12%    -29.68%
MresWaitDrainRate 1Pc                         338.35 ±0.82%    781.15 ±0.25%    130.87%
MresWaitDrainRate 1Pc Delay                   342.14 ±0.50%    774.52 ±0.49%    126.37%
MresWaitDrainRate 2Pc                         634.33 ±0.19%    981.16 ±0.15%     54.68%
MresWaitDrainRate 2Pc Delay                   612.98 ±0.09%    964.77 ±0.43%     57.39%
MresWaitLatency 1Pc                           414.26 ±0.49%    890.84 ±0.79%    115.04%
MresWaitLatency 1Pc Delay                     454.59 ±1.06%    844.91 ±0.43%     85.86%
SemaphoreSlimLatency 1Pc                      351.74 ±0.48%    285.31 ±1.06%    -18.89%
SemaphoreSlimLatency 1Pc Delay                207.57 ±1.14%    234.78 ±2.05%     13.11%
SemaphoreSlimLatency 2Pc                       51.93 ±3.80%    280.19 ±1.01%    439.55%
SemaphoreSlimLatency 2Pc Delay                 46.49 ±3.01%    226.19 ±2.62%    386.58%
SemaphoreSlimThroughput 1Pc                 14368.74 ±0.89%   5504.80 ±3.43%    -61.69%
SemaphoreSlimWaitDrainRate 1Pc                 21.04 ±1.99%   176.38 ±18.92%    738.46%
SemaphoreSlimWaitDrainRate 1Pc Delay           21.28 ±2.45%   130.11 ±21.55%    511.58%
SemaphoreSlimWaitDrainRate 2Pc                 25.50 ±0.43%    517.83 ±0.15%   1930.81%
SemaphoreSlimWaitDrainRate 2Pc Delay           25.41 ±0.31%    466.44 ±0.50%   1735.61%
SpinLockLatency 1Pc                           337.00 ±0.47%    410.08 ±1.63%     21.68%
SpinLockLatency 1Pc Delay                     326.97 ±1.27%    347.44 ±2.04%      6.26%
SpinLockLatency 2Pc                           164.61 ±2.36%    357.89 ±2.32%    117.42%
SpinLockLatency 2Pc Delay                     148.40 ±3.75%    321.58 ±1.31%    116.69%
SpinLockThroughput 1Pc                      55420.72 ±0.32%  61147.67 ±0.72%     10.33%
------------------------------------------  ---------------  ---------------  ---------
Total                                         536.52 ±1.41%   1141.26 ±3.22%    112.71%

RwSB vs RwS                         Left score        Right score       ∆ Score %
----------------------------------  ----------------  ----------------  ---------
Concurrency_OnlyReadersPcx01         27479.34 ±0.15%   27099.63 ±0.26%     -1.38%
Concurrency_OnlyReadersPcx04         27464.91 ±0.17%   26101.59 ±0.85%     -4.96%
Concurrency_OnlyReadersPcx16         26662.72 ±0.52%   26892.08 ±0.16%      0.86%
Concurrency_OnlyReadersPcx64         26062.34 ±0.37%   25022.53 ±0.32%     -3.99%
Concurrency_OnlyWritersPcx01         27062.37 ±1.15%   28764.28 ±0.36%      6.29%
Concurrency_OnlyWritersPcx04         23594.37 ±3.73%   28707.49 ±0.29%     21.67%
Concurrency_OnlyWritersPcx16         27225.09 ±1.94%   24213.02 ±5.62%    -11.06%
Concurrency_OnlyWritersPcx64        17451.93 ±11.31%   26971.54 ±1.39%     54.55%
Concurrency_Pcx01Readers_01Writers   7739.63 ±10.32%    9620.09 ±8.70%     24.30%
Concurrency_Pcx01Readers_02Writers   4714.65 ±14.40%   11722.87 ±8.26%    148.65%
Concurrency_Pcx04Readers_01Writers   9490.11 ±12.25%   11005.10 ±7.46%     15.96%
Concurrency_Pcx04Readers_02Writers   5379.94 ±17.06%    7972.85 ±9.15%     48.20%
Concurrency_Pcx04Readers_04Writers   5575.55 ±25.88%   10421.93 ±9.10%     86.92%
Concurrency_Pcx16Readers_01Writers   7841.46 ±16.61%  14165.26 ±10.45%     80.65%
Concurrency_Pcx16Readers_02Writers   4355.57 ±12.71%   9627.95 ±12.92%    121.05%
Concurrency_Pcx16Readers_04Writers   2760.20 ±15.98%   5689.54 ±27.34%    106.13%
Concurrency_Pcx16Readers_08Writers   4930.88 ±26.49%  12242.60 ±13.48%    148.28%
Concurrency_Pcx64Readers_01Writers   9728.04 ±17.06%  10822.92 ±20.22%     11.25%
Concurrency_Pcx64Readers_02Writers   5646.32 ±14.26%   8372.09 ±19.54%     48.28%
Concurrency_Pcx64Readers_04Writers    4520.20 ±5.61%   8471.40 ±21.00%     87.41%
Concurrency_Pcx64Readers_08Writers    4649.90 ±8.66%   6870.99 ±22.40%     47.77%
Concurrency_Pcx64Readers_16Writers   5842.78 ±14.01%   8183.58 ±10.72%     40.06%
----------------------------------  ----------------  ----------------  ---------
Total                                9705.62 ±10.83%   13739.79 ±9.92%     41.57%

kouvel requested review from stephentoub and vancem August 30, 2017 00:08
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than MaxSpinCountBeforeWait
Member

What is MaxSpinCountBeforeWait?

Author

Oops renamed that one, will fix

// (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with
// Thread.SpinWait. Otherwise, the following issues occur:
// - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a
// busy -spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may
Member

Nit: extra space in "busy -spin"

// contention), they may switch between one another, delaying work that can make progress.
if ((
_count >= YieldThreshold &&
(_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
Member

Nit: the formatting here reads strangely to me

Author

It's formatted similarly to:

if (a ||
    b)

where

a ==
    (
        c &&
        d
    )

This is how I typically format multi-line expressions, trying to align parentheses and putting each type of expression (&& or ||) separately, one condition per line unless the whole expression fits on one line. What would you suggest instead? I can separate parts of it into locals if you prefer.

stephentoub (Member) commented:

Thanks, @kouvel. Do you have any throughput numbers on the thread pool with this change?

kouvel (Author) commented Aug 30, 2017

The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait; I can do that if you would like.

kouvel (Author) commented Aug 30, 2017

Code used for Spin perf:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

internal class Program
{
    private static readonly int ProcessorCount = Environment.ProcessorCount;

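    // args[0] = test name; args[1] = thread count, either "<n>T" or a processor-count multiplier
    // "<m>PcT"; the burst-work tests take an additional "<m>PcWi" work item count multiplier.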
    private static void Main(string[] args)
    {
        int ai = 1;
        int threadCount;
        if (args[ai].EndsWith("PcT"))
        {
            double pcMultiplier;
            if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcT".Length), out pcMultiplier))
                return;
            threadCount = Math.Max(1, (int)Math.Round(ProcessorCount * pcMultiplier));
        }
        else if (args[ai].EndsWith("T"))
        {
            if (!int.TryParse(args[ai].Substring(0, args[ai].Length - "T".Length), out threadCount))
                return;
        }
        else
            return;
        ++ai;

        switch (args[0])
        {
            case "MresWaitDrainRate":
                MresWaitDrainRate(threadCount);
                break;
            case "MresWaitLatency":
                MresWaitLatency(threadCount);
                break;
            case "SemaphoreSlimWaitDrainRate":
                SemaphoreSlimWaitDrainRate(threadCount);
                break;
            case "SemaphoreSlimLatency":
                SemaphoreSlimLatency(threadCount);
                break;
            case "SemaphoreSlimThroughput":
                SemaphoreSlimThroughput(threadCount);
                break;
            case "SpinLockLatency":
                SpinLockLatency(threadCount);
                break;
            case "SpinLockThroughput":
                SpinLockThroughput(threadCount);
                break;
            case "ConcurrentBagThroughput":
                ConcurrentBagThroughput(threadCount);
                break;
            case "ConcurrentBagFairness":
                ConcurrentBagFairness(threadCount);
                break;
            case "ConcurrentQueueThroughput":
                ConcurrentQueueThroughput(threadCount);
                break;
            case "ConcurrentQueueFairness":
                ConcurrentQueueFairness(threadCount);
                break;
            case "ConcurrentStackThroughput":
                ConcurrentStackThroughput(threadCount);
                break;
            case "ConcurrentStackFairness":
                ConcurrentStackFairness(threadCount);
                break;
            case "BarrierSyncRate":
                BarrierSyncRate(threadCount);
                break;
            case "CountdownEventSyncRate":
                CountdownEventSyncRate(threadCount);
                break;
            case "ThreadPoolSustainedWorkThroughput":
                ThreadPoolSustainedWorkThroughput(threadCount);
                break;
            case "ThreadPoolBurstWorkThroughput":
                {
                    if (ai >= args.Length || !args[ai].EndsWith("PcWi"))
                        return;
                    double workItemCountPcMultiplier;
                    if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcWi".Length), out workItemCountPcMultiplier))
                        return;
                    int maxWorkItemCount = Math.Max(1, (int)Math.Round(ProcessorCount * workItemCountPcMultiplier));

                    ThreadPoolBurstWorkThroughput(threadCount, maxWorkItemCount);
                    break;
                }
            case "TaskSustainedWorkThroughput":
                TaskSustainedWorkThroughput(threadCount);
                break;
            case "TaskBurstWorkThroughput":
                {
                    if (ai >= args.Length || !args[ai].EndsWith("PcWi"))
                        return;
                    double workItemCountPcMultiplier;
                    if (!double.TryParse(args[ai].Substring(0, args[ai].Length - "PcWi".Length), out workItemCountPcMultiplier))
                        return;
                    int maxWorkItemCount = Math.Max(1, (int)Math.Round(ProcessorCount * workItemCountPcMultiplier));

                    TaskBurstWorkThroughput(threadCount, maxWorkItemCount);
                    break;
                }
            case "MonitorEnterExitThroughput_ThinLock":
                MonitorEnterExitThroughput(1, false, false);
                break;
            case "MonitorEnterExitThroughput_AwareLock":
                MonitorEnterExitThroughput(1, false, true);
                break;
            case "MonitorReliableEnterExitThroughput_ThinLock":
                MonitorReliableEnterExitThroughput(1, false, false);
                break;
            case "MonitorReliableEnterExitThroughput_AwareLock":
                MonitorReliableEnterExitThroughput(1, false, true);
                break;
            case "MonitorTryEnterExitWhenUnlockedThroughput_ThinLock":
                MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(1);
                break;
            case "MonitorTryEnterExitWhenUnlockedThroughput_AwareLock":
                MonitorTryEnterExitWhenUnlockedThroughput_AwareLock(1);
                break;
            case "MonitorTryEnterWhenLockedThroughput_ThinLock":
                MonitorTryEnterWhenLockedThroughput_ThinLock(1);
                break;
            case "MonitorTryEnterWhenLockedThroughput_AwareLock":
                MonitorTryEnterWhenLockedThroughput_AwareLock(1);
                break;
            case "MonitorReliableEnterExitLatency":
                MonitorReliableEnterExitLatency(threadCount);
                break;
            case "MonitorEnterExitThroughput":
                MonitorEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorReliableEnterExitThroughput":
                MonitorReliableEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorTryEnterExitThroughput":
                MonitorTryEnterExitThroughput(threadCount, true, false);
                break;
            case "MonitorReliableEnterExit1PcTOtherWorkThroughput":
                MonitorReliableEnterExit1PcTOtherWorkThroughput(threadCount);
                break;
            case "MonitorReliableEnterExitRoundRobinThroughput":
                MonitorReliableEnterExitRoundRobinThroughput(threadCount);
                break;
            case "MonitorReliableEnterExitFairness":
                MonitorReliableEnterExitFairness(threadCount);
                break;
            case "BufferMemoryCopyThroughput":
                BufferMemoryCopyThroughput(threadCount);
                break;
        }
    }

    [ThreadStatic]
    private static Random t_rng;

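    // MresWaitDrainRate: waiter threads block on a ManualResetEventSlim; a signal thread sets it
    // after a random short delay, and one operation is counted each time all waiters have woken.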
    private static void MresWaitDrainRate(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWokenCount = 0;
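        // Counters are spaced 16 ints (64 bytes) apart so each thread's slot sits on its own cache line.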
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var e = new ManualResetEventSlim(false);

        ThreadStart waitThreadStart = () =>
        {
            var localThreadCount = threadCount;
            var localThreadOperationCounts = threadOperationCounts;
            startTest.WaitOne();
            var allWaitersWoken = allWaitersWoken0;
            while (true)
            {
                e.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == localThreadCount)
                {
                    ++localThreadOperationCounts[16];
                    waiterWokenCount = 0;
                    e.Reset();
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var rng = new Random(0);
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                e.Set();
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

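    // MresWaitLatency: the signal thread briefly pulses the ManualResetEventSlim (Set then Reset)
    // after a random short delay; each waiter that wakes during the pulse counts one operation.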
    private static void MresWaitLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var continueWaitThreads = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var e = new ManualResetEventSlim(false);

        ParameterizedThreadStart waitThreadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            startTest.WaitOne();
            while (true)
            {
                e.Wait();
                ++localThreadOperationCounts[threadIndex];
                continueWaitThreads.WaitOne();
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var rng = new Random(0);
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                continueWaitThreads.Reset();
                e.Set();
                e.Reset();
                continueWaitThreads.Set();
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts);
    }

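    // SemaphoreSlimWaitDrainRate: waiter threads block on a SemaphoreSlim; a signal thread releases
    // threadCount permits after a random short delay, and one operation is counted per full drain.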
    private static void SemaphoreSlimWaitDrainRate(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var allWaitersWoken0 = new ManualResetEvent(false);
        var allWaitersWoken1 = new ManualResetEvent(false);
        int waiterWokenCount = 0;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(0);

        ThreadStart waitThreadStart = () =>
        {
            var localThreadCount = threadCount;
            var localThreadOperationCounts = threadOperationCounts;
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                ss.Wait();
                if (Interlocked.Increment(ref waiterWokenCount) == localThreadCount)
                {
                    ++localThreadOperationCounts[16];
                    waiterWokenCount = 0;
                    (allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0).Reset();
                    allWaitersWoken.Set();
                }
                else
                    allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        };
        var waitThreads = new Thread[threadCount];
        for (int i = 0; i < waitThreads.Length; ++i)
        {
            var t = new Thread(waitThreadStart);
            t.IsBackground = true;
            t.Start();
            waitThreads[i] = t;
        }

        var signalThread = new Thread(() =>
        {
            var localThreadCount = threadCount;
            var rng = new Random(0);
            var allWaitersWoken = allWaitersWoken0;
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                ss.Release(localThreadCount);
                allWaitersWoken.WaitOne();
                allWaitersWoken = allWaitersWoken == allWaitersWoken0 ? allWaitersWoken1 : allWaitersWoken0;
            }
        });
        signalThread.IsBackground = true;
        signalThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

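    // SemaphoreSlimLatency: a SemaphoreSlim(1) used as a lock; after releasing, each thread waits
    // until another thread has acquired the lock, so the score reflects lock hand-off latency.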
    private static void SemaphoreSlimLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(1);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                ss.Wait();
                previousLockThreadId = threadId;
                Delay(d0);
                ss.Release();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

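    // SemaphoreSlimThroughput: a SemaphoreSlim(1) used as a lock; each thread repeatedly acquires
    // and releases it, with random short delays inside and outside the lock.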
    private static void SemaphoreSlimThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var ss = new SemaphoreSlim(1);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                ss.Wait();
                Delay(d0);
                ss.Release();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

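    // SpinLockLatency: like SemaphoreSlimLatency but with a SpinLock; measures how quickly the lock
    // changes hands between threads.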
    private static void SpinLockLatency(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new SpinLock(enableThreadOwnerTracking: false);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                bool lockTaken = false;
                m.Enter(ref lockTaken);
                previousLockThreadId = threadId;
                Delay(d0);
                m.Exit();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

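    // SpinLockThroughput: each thread repeatedly enters and exits a shared SpinLock, with random
    // short delays inside and outside the lock.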
    private static void SpinLockThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new SpinLock(enableThreadOwnerTracking: false);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(threadIndex);
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                bool lockTaken = false;
                m.Enter(ref lockTaken);
                Delay(d0);
                m.Exit();
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

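    // ConcurrentBagThroughput: each thread repeatedly adds an item to a shared ConcurrentBag and
    // takes one back, with random short delays between operations.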
    private static void ConcurrentBagThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cb = new ConcurrentBag<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCb = cb;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCb.Add(threadId);
                Delay(d0);
                int item;
                localCb.TryTake(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

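    // ConcurrentBagFairness: like ConcurrentBagThroughput, but also accumulates per-thread wait
    // durations (biased against long waits) to gauge how fairly threads make progress.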
    private static void ConcurrentBagFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cb = new ConcurrentBag<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCb = cb;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCb.Add(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCb.TryTake(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

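    // ConcurrentQueueThroughput: each thread repeatedly enqueues to and dequeues from a shared
    // ConcurrentQueue, with random short delays between operations.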
    private static void ConcurrentQueueThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cq = new ConcurrentQueue<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCq = cq;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCq.Enqueue(threadId);
                Delay(d0);
                int item;
                localCq.TryDequeue(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ConcurrentQueueFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cq = new ConcurrentQueue<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCq = cq;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCq.Enqueue(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCq.TryDequeue(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

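    // ConcurrentStackThroughput: each thread repeatedly pushes to and pops from a shared
    // ConcurrentStack, with random short delays between operations.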
    private static void ConcurrentStackThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cs = new ConcurrentStack<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localCs = cs;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                localCs.Push(threadId);
                Delay(d0);
                int item;
                localCs.TryPop(out item);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ConcurrentStackFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var cs = new ConcurrentStack<int>();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localCs = cs;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                localCs.Push(threadId);
                var stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d0);

                startTicks = Clock.Ticks;
                int item;
                localCs.TryPop(out item);
                stopTicks = Clock.Ticks;
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

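    // BarrierSyncRate: all threads repeatedly synchronize on a Barrier; one operation is counted
    // per round, with a random short delay injected between rounds.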
    private static void BarrierSyncRate(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var delayComplete0 = new ManualResetEvent(false);
        var delayComplete1 = new ManualResetEvent(false);
        int syncThreadCount = threadCount;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var b = new Barrier(threadCount);

        var rng = new Random(0);
        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadCount = threadCount;
            var localDelayComplete0 = delayComplete0;
            var localDelayComplete1 = delayComplete1;
            var localThreadOperationCounts = threadOperationCounts;
            var localB = b;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                localB.SignalAndWait();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    syncThreadCount = localThreadCount;
                    localDelayComplete1.Reset();
                    localDelayComplete0.Set();
                }
                else
                    localDelayComplete0.WaitOne();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    ++localThreadOperationCounts[16];
                    Delay(RandomShortDelay(rng));
                    syncThreadCount = localThreadCount;
                    localDelayComplete0.Reset();
                    localDelayComplete1.Set();
                }
                else
                    localDelayComplete1.WaitOne();
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void CountdownEventSyncRate(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var delayComplete0 = new ManualResetEvent(false);
        var delayComplete1 = new ManualResetEvent(false);
        int syncThreadCount = threadCount;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var cde = new CountdownEvent(threadCount * 2);

        var rng = new Random(0);
        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadCount = threadCount;
            var localDelayComplete0 = delayComplete0;
            var localDelayComplete1 = delayComplete1;
            var localThreadOperationCounts = threadOperationCounts;
            var localCde = cde;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                localCde.Signal(2);
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    syncThreadCount = localThreadCount;
                    localDelayComplete1.Reset();
                    localDelayComplete0.Set();
                }
                else
                    localDelayComplete0.WaitOne();
                if (Interlocked.Decrement(ref syncThreadCount) == 0)
                {
                    ++localThreadOperationCounts[16];
                    Delay(RandomShortDelay(rng));
                    syncThreadCount = localThreadCount;
                    localCde.Reset(localThreadCount * 2);
                    localDelayComplete0.Reset();
                    localDelayComplete1.Set();
                }
                else
                    localDelayComplete1.WaitOne();
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void ThreadPoolSustainedWorkThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        WaitCallback workItemStart = null;
        workItemStart = data =>
        {
            ThreadPool.QueueUserWorkItem(workItemStart);
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
        };

        var producerThread = new Thread(() =>
        {
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            int initialWorkItemCount = ProcessorCount + threadCount * 4;
            for (int i = 0; i < initialWorkItemCount; ++i)
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void ThreadPoolBurstWorkThroughput(int threadCount, int maxWorkItemCount)
    {
        var startTest = new ManualResetEvent(false);
        var workComplete = new AutoResetEvent(false);
        int workItemCountToQueue = 0;
        int workItemCountToComplete = 0;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        WaitCallback workItemStart = null;
        workItemStart = data =>
        {
            int n = Interlocked.Add(ref workItemCountToQueue, -2);
            if (n >= -1)
            {
                var localWorkItemStart = workItemStart;
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
                if (n >= 0)
                    ThreadPool.QueueUserWorkItem(localWorkItemStart);
            }
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
            if (Interlocked.Decrement(ref workItemCountToComplete) == 0)
                workComplete.Set();
        };

        var producerThread = new Thread(() =>
        {
            var localMaxWorkItemCount = maxWorkItemCount;
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            while (true)
            {
                workItemCountToQueue = localMaxWorkItemCount - 1;
                workItemCountToComplete = localMaxWorkItemCount;
                ThreadPool.QueueUserWorkItem(localWorkItemStart);
                workComplete.WaitOne();
            }
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void TaskSustainedWorkThroughput(int threadCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        Action workItemStart = null;
        workItemStart = () =>
        {
            Task.Run(workItemStart);
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
        };

        Action initialWorkItemStart = () =>
        {
            var localWorkItemStart = workItemStart;
            for (int i = 0; i < 4; ++i)
                Task.Run(localWorkItemStart);
        };

        var producerThread = new Thread(() =>
        {
            var localThreadCount = threadCount;
            var localInitialWorkItemStart = initialWorkItemStart;
            startTest.WaitOne();
            int initialWorkItemCount = ProcessorCount + threadCount;
            for (int i = 0; i < initialWorkItemCount; ++i)
                Task.Run(localInitialWorkItemStart);
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void TaskBurstWorkThroughput(int threadCount, int maxWorkItemCount)
    {
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        ThreadPool.SetMinThreads(threadCount, threadCount);
        ThreadPool.SetMaxThreads(threadCount, threadCount);

        Action<object> workItemStart = null;
        workItemStart = async data =>
        {
            Task t0 = null, t1 = null;
            int toQueue = (int)data;
            if (toQueue > 1)
            {
                var localWorkItemStart = workItemStart;
                --toQueue;
                t0 = new Task(localWorkItemStart, toQueue - toQueue / 2);
                t0.Start();
                t1 = new Task(localWorkItemStart, toQueue / 2);
                t1.Start();
            }
            else if (toQueue != 0)
            {
                t0 = new Task(workItemStart, 0);
                t0.Start();
            }
            var rng = t_rng;
            if (rng == null)
                t_rng = rng = new Random(0);
            Delay(RandomShortDelay(rng));
            Interlocked.Increment(ref threadOperationCounts[16]);
            if (t0 != null)
            {
                await t0;
                if (t1 != null)
                    await t1;
            }
        };

        var producerThread = new Thread(() =>
        {
            var localMaxWorkItemCount = maxWorkItemCount;
            var localWorkItemStart = workItemStart;
            startTest.WaitOne();
            while (true)
            {
                var t = new Task(localWorkItemStart, localMaxWorkItemCount - 1);
                t.Start();
                t.Wait();
            }
        });
        producerThread.IsBackground = true;
        producerThread.Start();

        Run(startTest, threadOperationCounts, hasOneResult: true);
    }

    private static void MonitorReliableEnterExitLatency(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        int previousLockThreadId = -1;
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            int threadId = Environment.CurrentManagedThreadId;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localM)
                {
                    previousLockThreadId = threadId;
                    Delay(d0);
                }
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                while (previousLockThreadId == threadId)
                    Delay(4);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    lock (localM)
                        Delay(d0);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    lock (localM)
                    {
                    }
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    Monitor.Enter(localM);
                    Delay(d0);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    Monitor.Enter(localM);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterExitThroughput(int threadCount, bool delay, bool convertToAwareLock)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        if (convertToAwareLock)
            Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localDelay = delay;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = localDelay ? new Random(threadIndex) : null;
            threadReady.Set();
            if (convertToAwareLock)
            {
                Monitor.Enter(localM);
                Monitor.Exit(localM);
            }
            startTest.WaitOne();
            if (localDelay)
            {
                while (true)
                {
                    var d0 = RandomShortDelay(rng);
                    var d1 = RandomShortDelay(rng);
                    if (!Monitor.TryEnter(localM, -1))
                        return;
                    Delay(d0);
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                    Delay(d1);
                }
            }
            else
            {
                while (true)
                {
                    if (!Monitor.TryEnter(localM, -1))
                        return;
                    Monitor.Exit(localM);
                    ++localThreadOperationCounts[threadIndex];
                }
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        if (convertToAwareLock)
        {
            Thread.Sleep(50);
            Monitor.Exit(m);
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExit1PcTOtherWorkThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var otherWorkThreadOperationCounts = new int[(ProcessorCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            var rng = new Random((int)data);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localM)
                    Delay(d0);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        ParameterizedThreadStart otherWorkThreadStart = data =>
        {
            int threadIndex = (int)data;
            var localOtherWorkThreadOperationCounts = otherWorkThreadOperationCounts;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                Delay(RandomShortDelay(rng));
                ++localOtherWorkThreadOperationCounts[threadIndex];
            }
        };
        var otherWorkThreads = new Thread[ProcessorCount];
        for (int i = 0; i < otherWorkThreads.Length; ++i)
        {
            var t = new Thread(otherWorkThreadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            otherWorkThreads[i] = t;
        }

        RunWithOtherWork(startTest, threadOperationCounts, otherWorkThreadOperationCounts);
    }

    private static void MonitorReliableEnterExitRoundRobinThroughput(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var mutexes = new object[threadCount];
        for (int i = 0; i < mutexes.Length; ++i)
            mutexes[i] = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localMutexes = mutexes;
            int mutexCount = localMutexes.Length;
            int mutexIndex = (threadIndex / 16 - 1) % mutexCount;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);
                lock (localMutexes[mutexIndex])
                    Delay(d0);
                ++localThreadOperationCounts[threadIndex];
                Delay(d1);
                mutexIndex = (mutexIndex + 1) % mutexCount;
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorReliableEnterExitFairness(int threadCount)
    {
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var threadWaitDurationsUs = new double[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localThreadWaitDurationsUs = threadWaitDurationsUs;
            var localM = m;
            var rng = new Random(threadIndex);
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                var d0 = RandomShortDelay(rng);
                var d1 = RandomShortDelay(rng);

                var startTicks = Clock.Ticks;
                long stopTicks;
                lock (localM)
                {
                    stopTicks = Clock.Ticks;
                    Delay(d0);
                }
                ++localThreadOperationCounts[threadIndex];
                localThreadWaitDurationsUs[threadIndex] +=
                    BiasWaitDurationUsAgainstLongWaits(Clock.TicksToUs(stopTicks - startTicks));
                Delay(d1);
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        RunFairness(startTest, threadOperationCounts, threadWaitDurationsUs);
    }

    private static void MonitorTryEnterExitWhenUnlockedThroughput_ThinLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                if (!Monitor.TryEnter(localM))
                    return;
                Monitor.Exit(localM);
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterExitWhenUnlockedThroughput_AwareLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            Monitor.Enter(localM);
            Monitor.Exit(localM);
            startTest.WaitOne();
            while (true)
            {
                if (!Monitor.TryEnter(localM))
                    return;
                Monitor.Exit(localM);
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Thread.Sleep(50);
        Monitor.Exit(m);

        Run(startTest, threadOperationCounts);
    }

    private static void MonitorTryEnterWhenLockedThroughput_ThinLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                if (Monitor.TryEnter(localM))
                    return;
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts);
        Monitor.Exit(m);
    }

    private static void MonitorTryEnterWhenLockedThroughput_AwareLock(int threadCount)
    {
        threadCount = 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];
        var m = new object();

        Monitor.Enter(m);

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var localM = m;
            threadReady.Set();
            if (Monitor.TryEnter(localM, 50))
                return;
            startTest.WaitOne();
            while (true)
            {
                if (Monitor.TryEnter(localM))
                    return;
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Thread.Sleep(50);

        Run(startTest, threadOperationCounts);
        Monitor.Exit(m);
    }

    private static unsafe void BufferMemoryCopyThroughput(int maxBytes)
    {
        const int threadCount = 1;
        int minBytes = maxBytes <= 8 ? 1 : maxBytes / 2 + 1;
        var threadReady = new AutoResetEvent(false);
        var startTest = new ManualResetEvent(false);
        var threadOperationCounts = new int[(threadCount + 1) * 16];

        ParameterizedThreadStart threadStart = data =>
        {
            int threadIndex = (int)data;
            var localThreadOperationCounts = threadOperationCounts;
            var rng = new Random(0);
            var src = stackalloc byte[maxBytes];
            var dst = stackalloc byte[maxBytes];
            for (int i = 0; i < maxBytes; ++i)
                src[i] = (byte)rng.Next();
            threadReady.Set();
            startTest.WaitOne();
            while (true)
            {
                Buffer.MemoryCopy(src, dst, maxBytes, rng.Next(minBytes, maxBytes + 1));
                ++localThreadOperationCounts[threadIndex];
            }
        };
        var threads = new Thread[threadCount];
        for (int i = 0; i < threads.Length; ++i)
        {
            var t = new Thread(threadStart);
            t.IsBackground = true;
            t.Start((i + 1) * 16);
            threadReady.WaitOne();
            threads[i] = t;
        }

        Run(startTest, threadOperationCounts, iterations: 1);
    }

    private static void Run(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        bool hasOneResult = false,
        int iterations = 4)
    {
        var sw = new Stopwatch();
        int threadCount = threadOperationCounts.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var operationCounts = new long[threadCount];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < threadCount; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < threadCount; ++i)
                operationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < threadCount; ++i)
                operationCounts[i] -= afterWarmupOperationCounts[i];

            double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
            Console.WriteLine("Score: {0:0.000000}", score);
        }
    }

    private static void RunWithOtherWork(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        int[] otherWorkThreadOperationCounts,
        int iterations = 4)
    {
        var sw = new Stopwatch();
        int threadCount = threadOperationCounts.Length / 16 - 1;
        int otherWorkThreadCount = otherWorkThreadOperationCounts.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var otherWorkAfterWarmupOperationCounts = new long[otherWorkThreadCount];
        var operationCounts = new long[threadCount];
        var otherWorkOperationCounts = new long[otherWorkThreadCount];
        var operationCountSums = new double[2];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < afterWarmupOperationCounts.Length; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < otherWorkAfterWarmupOperationCounts.Length; ++i)
                otherWorkAfterWarmupOperationCounts[i] = otherWorkThreadOperationCounts[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < operationCounts.Length; ++i)
                operationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < otherWorkOperationCounts.Length; ++i)
                otherWorkOperationCounts[i] = otherWorkThreadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < operationCounts.Length; ++i)
                operationCounts[i] -= afterWarmupOperationCounts[i];
            for (int i = 0; i < otherWorkOperationCounts.Length; ++i)
                otherWorkOperationCounts[i] -= otherWorkAfterWarmupOperationCounts[i];

            operationCountSums[0] = operationCounts.Sum();
            operationCountSums[1] = otherWorkOperationCounts.Sum();
            double score = operationCountSums.GeometricMean(1, otherWorkThreadCount) / sw.Elapsed.TotalMilliseconds;
            Console.WriteLine("Score: {0:0.000000}", score);
        }
    }

    private static void RunFairness(
        ManualResetEvent startTest,
        int[] threadOperationCounts,
        double[] threadWaitDurationsUs,
        int iterations = 4)
    {
        var sw = new Stopwatch();
        int threadCount = threadWaitDurationsUs.Length / 16 - 1;
        var afterWarmupOperationCounts = new long[threadCount];
        var afterWarmupWaitDurationsUs = new double[threadCount];
        var operationCounts = new long[threadCount];
        var waitDurationsUs = new double[threadCount];
        startTest.Set();

        // Warmup

        Thread.Sleep(100);

        //while (true)
        for (int j = 0; j < iterations; ++j)
        {
            for (int i = 0; i < threadCount; ++i)
                afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];
            for (int i = 0; i < threadCount; ++i)
                afterWarmupWaitDurationsUs[i] = threadWaitDurationsUs[(i + 1) * 16];

            // Measure

            sw.Restart();
            Thread.Sleep(500);
            sw.Stop();

            for (int i = 0; i < threadCount; ++i)
            {
                int ti = (i + 1) * 16;
                operationCounts[i] = threadOperationCounts[ti];
                waitDurationsUs[i] = threadWaitDurationsUs[ti];
            }
            for (int i = 0; i < threadCount; ++i)
            {
                operationCounts[i] -= afterWarmupOperationCounts[i];
                waitDurationsUs[i] -= afterWarmupWaitDurationsUs[i];
            }

            double averageWaitDurationUs = Math.Sqrt(waitDurationsUs.Sum() / operationCounts.Sum());
            if (averageWaitDurationUs < 1)
                averageWaitDurationUs = 1;
            double score = 100_000 / averageWaitDurationUs;
            Console.WriteLine($"Score: {score:0.000000}");
        }
    }

    private static double BiasWaitDurationUsAgainstLongWaits(double waitDurationUs) =>
        waitDurationUs <= 1 ? 1 : waitDurationUs * waitDurationUs;

    internal static class Clock
    {
        private static readonly long s_swFrequency = Stopwatch.Frequency;
        private static readonly double s_swFrequencyDouble = s_swFrequency;

        public static long Ticks => Stopwatch.GetTimestamp();
        public static double TicksToS(long ticks) => ticks / s_swFrequencyDouble;
        public static double TicksToMs(long ticks) => ticks * 1000 / s_swFrequencyDouble;
        public static double TicksToUs(long ticks) => ticks * (1000 * 1000) / s_swFrequencyDouble;
    }

    private static uint RandomShortDelay(Random rng) => (uint)rng.Next(4, 10);
    private static uint RandomMediumDelay(Random rng) => (uint)rng.Next(10, 15);
    private static uint RandomLongDelay(Random rng) => (uint)rng.Next(15, 20);

    private static int[] s_delayValues = new int[32];

    // Burns CPU for a short, controlled time by computing a small Fibonacci number recursively;
    // the result is accumulated into a padded static slot so the work cannot be optimized away.
    private static void Delay(uint n)
    {
        Interlocked.MemoryBarrier();
        s_delayValues[16] += (int)Fib(n);
    }

    private static uint Fib(uint n)
    {
        if (n <= 1)
            return n;
        return Fib(n - 2) + Fib(n - 1);
    }
}

@kouvel
Copy link
Author

kouvel commented Aug 30, 2017

Code used for ReaderWriterLockSlim perf:

            // Fragment of a larger test method: rw (a ReaderWriterLockSlim), readerThreadCount,
            // and writerThreadCount are assumed to be defined by the surrounding code.
            var sw = new Stopwatch();
            var scores = new double[16];
            var startThreads = new ManualResetEvent(false);
            bool stop = false;

            var counts = new int[64];
            var readerThreads = new Thread[readerThreadCount];
            ThreadStart readThreadStart =
                () =>
                {
                    startThreads.WaitOne();
                    while (!stop)
                    {
                        rw.EnterReadLock();
                        rw.ExitReadLock();
                        Interlocked.Increment(ref counts[16]);
                    }
                };
            for (int i = 0; i < readerThreadCount; ++i)
            {
                readerThreads[i] = new Thread(readThreadStart);
                readerThreads[i].IsBackground = true;
                readerThreads[i].Start();
            }

            var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
            var writerThreads = new Thread[writerThreadCount];
            ThreadStart writeThreadStart =
                () =>
                {
                    startThreads.WaitOne();
                    while (!stop)
                    {
                        rw.EnterWriteLock();
                        rw.ExitWriteLock();
                        Interlocked.Increment(ref counts[32]);
                    }
                };
            for (int i = 0; i < writerThreadCount; ++i)
            {
                writerThreads[i] = new Thread(writeThreadStart);
                writerThreads[i].IsBackground = true;
                writerThreads[i].Start();
            }

            startThreads.Set();

            // Warmup

            Thread.Sleep(4000);

            // Actual run
            for (int i = 0; i < scores.Length; ++i)
            {
                counts[16] = 0;
                counts[32] = 0;
                Interlocked.MemoryBarrier();

                sw.Restart();
                Thread.Sleep(500);
                sw.Stop();

                int readCount = counts[16];
                int writeCount = counts[32];

                double elapsedMs = sw.Elapsed.TotalMilliseconds;
                scores[i] =
                    new double[]
                    {
                        Math.Max(1, (readCount + writeCount) / elapsedMs),
                        Math.Max(1, writeCount / elapsedMs)
                    }.GeometricMean(readerThreadCount, writerThreadCount);
            }

            return scores;

@stephentoub
Copy link
Member

stephentoub commented Aug 30, 2017

The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait, I can do that if you would like.

ThreadPool's global queue is a ConcurrentQueue, and CQ uses System.Threading.SpinWait when there are contentions on various operations, including dequeues.

@kouvel
Copy link
Author

kouvel commented Aug 30, 2017

Ah ok, I included ConcurrentQueue, I'll add a test for thread pool as well

@kouvel
Copy link
Author

kouvel commented Aug 30, 2017

Updated code above with the added thread pool throughput test. Looks like there's no change:

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):

Spin                                        Left score      Right score     ∆ Score  ∆ Score %
------------------------------------------  --------------  --------------  -------  ---------
ThreadPoolThroughput 1Pc                    7322.26 ±0.65%  7443.96 ±0.73%   121.71      1.66%
ThreadPoolThroughput 2Pc                    7377.70 ±0.63%  7467.42 ±0.82%    89.72      1.22%
ThreadPoolThroughput 4Pc                    7329.01 ±0.75%  7330.87 ±1.00%     1.86      0.03%
------------------------------------------  --------------  --------------  -------  ---------

Core i7-6700 (Skylake, 4-core, 8-thread):

Spin                                        Left score      Right score     ∆ Score  ∆ Score %
------------------------------------------  --------------  --------------  -------  ---------
ThreadPoolThroughput 1Pc                    9434.79 ±0.55%  9484.14 ±0.54%    49.35      0.52%
ThreadPoolThroughput 2Pc                    9384.44 ±0.41%  9376.15 ±0.41%    -8.30     -0.09%
ThreadPoolThroughput 4Pc                    9390.46 ±0.62%  9387.43 ±0.75%    -3.03     -0.03%
------------------------------------------  --------------  --------------  -------  ---------

@kouvel
Copy link
Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel
Copy link
Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness

@kouvel
Copy link
Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel
Copy link
Author

kouvel commented Aug 31, 2017

@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@kouvel kouvel requested a review from tarekgh August 31, 2017 09:54
/// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
Copy link
Member

35

Did we get this number from experimenting with different scenarios? Just curious how we came up with it. And does the number of processors not matter?

Copy link
Author

I experimented with ManualResetEventSlim to get an initial number, applied the same number to other similar situations, and then tweaked it up and down to see what was working. Spinning less can lead to early waiting and more context switching; spinning more can decrease latency but may use up some CPU time unnecessarily. It also depends on the situation: for SemaphoreSlim I had to double the spin iterations because the waiting there is a lot more expensive. The likelihood of the spin succeeding and the length of the eventual wait matter too, but those are not accounted for here.
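
For readers following the thread, here is a minimal sketch of the spin-before-wait shape this constant feeds into (illustrative only, not the actual ManualResetEventSlim code; the predicate, event, and parameter names are hypothetical):

private static bool SpinThenWait(Func<bool> signaled, ManualResetEventSlim waitEvent, int spinCount, int millisecondsTimeout)
{
    // Spin briefly in the hope that the condition becomes true without a context switch.
    var spinner = new SpinWait();
    while (spinner.Count < spinCount)
    {
        if (signaled())
            return true;        // the spin paid off
        spinner.SpinOnce();     // short pause; later iterations may yield the processor
    }

    // Spinning did not pay off; do a proper wait so the thread stops burning CPU.
    return signaled() || waitEvent.Wait(millisecondsTimeout);
}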

Copy link
Author

I don't think including the number of processors (N) works well. Multiplying by N increases the spinning on each thread by N, so the total spinning across N threads increases by N^2. When more processors are contending on a resource, it may even be better to spin less and wait sooner to reduce contention, since with more processors something like a mutex naturally has more potential for contention.
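
As a concrete (hypothetical) illustration: with 8 processors and a base of 35 iterations, scaling by N would have each of 8 contending threads spin 280 iterations, roughly 2,240 iterations of combined busy work per contention episode, compared with about 280 combined when the count stays fixed.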

// usually better for that.
//
int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration;
if (_count <= 30 && (1 << _count) < n)
Copy link
Member

30

would be nice to comment how we choose this number.

Copy link
Author

I'll add a reference to Thread::InitializeYieldProcessorNormalized that describes and calculates it
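
Independent of how the exact value was tuned, one mechanical constraint is worth noting for readers: _count feeds a left shift of a 32-bit int, so without a bound the shifted value would go negative once _count reaches 31. A rough sketch of the shape (illustrative, not the actual SpinWait code; the parameter names are stand-ins):

private static void ExponentialSpin(int count, int optimalMaxSpinWaitsPerSpinIteration)
{
    // Grow the per-iteration spin exponentially, but never shift past 30 bits and never
    // make a single Thread.SpinWait call exceed the measured optimum.
    int iterations = (count <= 30 && (1 << count) < optimalMaxSpinWaitsPerSpinIteration)
        ? (1 << count)
        : optimalMaxSpinWaitsPerSpinIteration;
    Thread.SpinWait(iterations);
}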

{
get
{
if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
Copy link
Member

s_optimalMaxSpinWaitsPerSpinIteration

Looks this one can be converted to readonly field initialized with GetOptimalMaxSpinWaitsPerSpinIterationInternal() so we can avoid checking 0 value.

Copy link
Author

I didn't want to do that, since the first call would trigger the measurement, which takes about 10 ms. Static construction of RuntimeThread probably happens during startup for most apps.
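
A minimal sketch of the lazy, measure-once shape being described (illustrative, not the actual RuntimeThread code; MeasureOptimalMaxSpinWaits is a hypothetical stand-in for the ~10 ms measurement):

private static int s_optimalMaxSpinWaitsPerSpinIteration; // 0 means "not measured yet"

internal static int OptimalMaxSpinWaitsPerSpinIteration
{
    get
    {
        int value = s_optimalMaxSpinWaitsPerSpinIteration;
        if (value != 0)
            return value;                          // fast path once measured

        // The first use pays the measurement cost instead of app startup paying it.
        value = MeasureOptimalMaxSpinWaits();      // hypothetical, returns a value >= 1
        Interlocked.CompareExchange(ref s_optimalMaxSpinWaitsPerSpinIteration, value, 0);
        return s_optimalMaxSpinWaitsPerSpinIteration;
    }
}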

}

return IsCompleted;
return false;
Copy link
Member

return false;

Is it possible that, between exiting the loop and executing the return, the task gets into a completed state? I am asking to know whether we should keep returning IsCompleted.

Copy link
Author

Functionally it doesn't make any difference; the caller will do the right thing. Previously it made sense to check IsCompleted before returning because the loop would have stopped immediately after a wait, but it was redundant to check IsCompleted first in the loop because it had already been checked immediately before the loop. So I changed the loop to wait first and check later; now the loop exits right after checking IsCompleted, and it would be redundant to check it again before returning.
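
A rough sketch of the restructured loop being described (illustrative, not the actual Task code):

private static bool SpinUntilCompleted(Task task, int spinCount)
{
    // Wait first, check last: when this returns true the completion was just observed, and
    // when it falls through, the caller does a proper wait and re-checks the state anyway,
    // so returning task.IsCompleted again here would be redundant.
    var spinner = new SpinWait();
    for (int i = 0; i < spinCount; ++i)
    {
        spinner.SpinOnce();
        if (task.IsCompleted)
            return true;
    }
    return false;
}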

Copy link
Member

@tarekgh tarekgh left a comment

:shipit:

@kouvel kouvel added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 31, 2017
kouvel added a commit to dotnet/corert that referenced this pull request Aug 31, 2017
PR dotnet/coreclr#13670 adds a new member to RuntimeThread that is used by SpinWait. SpinWait is shared, so this PR adds the new member to CoreRT first to avoid a break.
@kouvel kouvel removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 31, 2017
@kouvel
Copy link
Author

kouvel commented Aug 31, 2017

Are there any other concerns with this change that should be addressed? As with any change to spin heuristics, there probably will be some regressions here and there since there are tradeoffs, which may have to be dealt with. If we're ok with that I'll go ahead and merge.

@tarekgh
Copy link
Member

tarekgh commented Aug 31, 2017

no concerns from my side.

@kouvel kouvel merged commit 03bf95c into dotnet:master Sep 1, 2017
@kouvel kouvel deleted the SpinFix branch September 1, 2017 20:09
kouvel added a commit that referenced this pull request Sep 5, 2017
In #13670, by mistake I made the spin loop infinite; that is now fixed.

As a result the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low throughput problem:
- Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
- Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() to not wake up more waiters than necessary (a rough sketch follows this list).
- Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
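
A rough sketch of the waiter-wakeup accounting described above, with hypothetical names (the actual SemaphoreSlim implementation differs in detail):

// Called while holding lockObj. newCount is the semaphore count after this release, waitCount is
// the number of threads blocked in Wait, and countOfWaitersPulsedToWake tracks threads that were
// already pulsed but have not yet woken.
private static int PulseWaiters(object lockObj, int newCount, int waitCount, ref int countOfWaitersPulsedToWake)
{
    int waitersToNotify = Math.Min(newCount, waitCount) - countOfWaitersPulsedToWake;
    if (waitersToNotify > 0)
    {
        countOfWaitersPulsedToWake += waitersToNotify;
        for (int i = 0; i < waitersToNotify; ++i)
            Monitor.Pulse(lockObj); // wake only as many waiters as can actually acquire a permit
    }
    return waitersToNotify;
}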
kouvel added a commit that referenced this pull request Sep 26, 2017
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely

Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now, the remaining differences are:
  - There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
  - Register usage
    - x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
    - x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
    - I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR #13670
@kouvel kouvel mentioned this pull request Sep 27, 2017
MichalStrehovsky added a commit to MichalStrehovsky/corert that referenced this pull request Apr 13, 2019