-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Add normalized equivalent of YieldProcessor, retune some spin loops #13670
Conversation
|
Numbers from Thread.SpinWait divide count by 7 experiment: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread): Core i7-6700 (Skylake, 4-core, 8-thread): |
|
Numbers from this PR: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread): Core i7-6700 (Skylake, 4-core, 8-thread): |
| /// when the resource becomes available. | ||
| /// </summary> | ||
| internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35; | ||
| internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than MaxSpinCountBeforeWait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is MaxSpinCountBeforeWait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops renamed that one, will fix
| // (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with | ||
| // Thread.SpinWait. Otherwise, the following issues occur: | ||
| // - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a | ||
| // busy -spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: extra space in "busy -spin"
| // contention), they may switch between one another, delaying work that can make progress. | ||
| if (( | ||
| _count >= YieldThreshold && | ||
| (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the formatting here reads strangely to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's formatted similarly to:
if (a ||
b)where
a ==
(
c &&
d
)This is how I typically format multi-line expressions, trying to align parentheses and putting each type of expression (&& or ||) separately, one condition per line unless the whole expression fits on one line. What would you suggest instead? I can separate parts of it into locals if you prefer.
|
Thanks, @kouvel. Do you have any throughput numbers on the thread pool with this change? |
|
The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait, I can do that if you would like. |
|
Code used for Spin perf: |
|
Code used for ReaderWriterLockSlim perf: var sw = new Stopwatch();
var scores = new double[16];
var startThreads = new ManualResetEvent(false);
bool stop = false;
var counts = new int[64];
var readerThreads = new Thread[readerThreadCount];
ThreadStart readThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterReadLock();
rw.ExitReadLock();
Interlocked.Increment(ref counts[16]);
}
};
for (int i = 0; i < readerThreadCount; ++i)
{
readerThreads[i] = new Thread(readThreadStart);
readerThreads[i].IsBackground = true;
readerThreads[i].Start();
}
var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
var writerThreads = new Thread[writerThreadCount];
ThreadStart writeThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterWriteLock();
rw.ExitWriteLock();
Interlocked.Increment(ref counts[32]);
}
};
for (int i = 0; i < writerThreadCount; ++i)
{
writerThreads[i] = new Thread(writeThreadStart);
writerThreads[i].IsBackground = true;
writerThreads[i].Start();
}
startThreads.Set();
// Warmup
Thread.Sleep(4000);
// Actual run
for(int i = 0; i < scores.Length; ++i)
{
counts[16] = 0;
counts[32] = 0;
Interlocked.MemoryBarrier();
sw.Restart();
Thread.Sleep(500);
sw.Stop();
int readCount = counts[16];
int writeCount = counts[32];
double elapsedMs = sw.Elapsed.TotalMilliseconds;
scores[i] =
new double[]
{
Math.Max(1, (readCount + writeCount) / elapsedMs),
Math.Max(1, writeCount / elapsedMs)
}.GeometricMean(readerThreadCount, writerThreadCount);
}
return scores; |
ThreadPool's global queue is a ConcurrentQueue, and CQ uses System.Threading.SpinWait when there are contentions on various operations, including dequeues. |
|
Ah ok, I included ConcurrentQueue, I'll add a test for thread pool as well |
|
Updated code above with the added thread pool throughput test. Looks like there's no change: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread): Core i7-6700 (Skylake, 4-core, 8-thread): |
|
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness |
|
@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness |
|
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness |
|
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness |
| /// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled | ||
| /// when the resource becomes available. | ||
| /// </summary> | ||
| internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
35 [](start = 105, length = 2)
did we get this number from experimenting different scenarios? just curious how we come up with this number. and it doesn't matter the number of processors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I experimented with ManualResetEventSlim to get an initial number, applied the same number to other similar situations, and then tweaked up and down to see what was working. Spinning less can lead to early waiting and more context switching, spinning more can decrease latency but may use up some CPU time unnecessarily. Depends on the situation too, like for SemaphoreSlim I had to double the spin iterations because the waiting there is a lot more expensive. Also depends on the likelihood of the spin being successful and how long the wait would be but those are not accounted for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think including number of processors (N) works well. Multiplying by N increases spinning on each thread by N, so total spinning across N threads is increased by N^2. When there are more processors contending on a resource, it may even be better to spin less and wait sooner to reduce contention since with more processors something like a mutex has the natural possibility of having more contention.
| // usually better for that. | ||
| // | ||
| int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration; | ||
| if (_count <= 30 && (1 << _count) < n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30 [](start = 30, length = 2)
would be nice to comment how we choose this number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a reference to Thread::InitializeYieldProcessorNormalized that describes and calculates it
| { | ||
| get | ||
| { | ||
| if (s_optimalMaxSpinWaitsPerSpinIteration != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s_optimalMaxSpinWaitsPerSpinIteration [](start = 20, length = 37)
Looks this one can be converted to readonly field initialized with GetOptimalMaxSpinWaitsPerSpinIterationInternal() so we can avoid checking 0 value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want do that since the first call would trigger the measurement that takes about 10 ms. Static construction of RuntimeThread probably happens during startup for most apps.
| } | ||
|
|
||
| return IsCompleted; | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false; [](start = 11, length = 14)
Is it possible between exiting the loop and executing the return, the task can get into completed state? I am asking to know if we should keep returning IsCompleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally it doesn't make any difference, the caller will do the right thing. Previously it made sense to check IsCompleted before returning because the loop would have stopped immediately after a wait. But previously it was redundant to check IsCompleted first in the loop because it was already checked immediately before before the loop. So I just changed the loop to wait first and check later, now the loop exits right after checking IsCompleted and it would be redundant to check it again before returning.
tarekgh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
PR dotnet/coreclr#13670 adds a new member to RuntimeThread is that used by SpinWait. SpinWait is shared, so this PR is adding the new member to CoreRT first to avoid a break.
|
Are there any other concerns with this change that should be addressed? As with any change to spin heuristics, there probably will be some regressions here and there since there are tradeoffs, which may have to be dealt with. If we're ok with that I'll go ahead and merge. |
|
no concerns from my side. |
In #13670, by mistake I made the spin loop infinite, that is now fixed. As a result the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low throughput problem: - Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state. - Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() to not wake up more waiters than necessary. - Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely
Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now, the remaining differences are:
- There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
- Register usage
- x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
- x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
- I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR #13670
Part of fix for https://github.com/dotnet/coreclr/issues/13388
Normalized equivalent of YieldProcessor
Thread.SpinWait divide count by 7 experiment
Spin tuning