-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Remove Monitor asm helpers #14146
Remove Monitor asm helpers #14146
Conversation
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely
Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now, the remaining differences are:
- There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
- Register usage
- x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
- x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
- I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR #13670
Numbers (no PGO):
Windows x64
```
Spin Left score Right score ∆ Score %
------------------------------------------------ --------------- --------------- ---------
MonitorEnterExitLatency 2T 800.56 ±0.33% 821.97 ±0.30% 2.67%
MonitorEnterExitLatency 4T 1533.25 ±0.34% 1553.82 ±0.13% 1.34%
MonitorEnterExitLatency 7T 1676.14 ±0.26% 1678.14 ±0.18% 0.12%
MonitorEnterExitThroughput Delay 1T 5174.77 ±0.25% 5125.56 ±0.27% -0.95%
MonitorEnterExitThroughput Delay 2T 4982.38 ±0.22% 4937.79 ±0.19% -0.90%
MonitorEnterExitThroughput Delay 4T 4720.41 ±0.37% 4694.09 ±0.24% -0.56%
MonitorEnterExitThroughput Delay 7T 3741.20 ±0.33% 3778.06 ±0.20% 0.99%
MonitorEnterExitThroughput_AwareLock 1T 63445.04 ±0.20% 61540.28 ±0.23% -3.00%
MonitorEnterExitThroughput_ThinLock 1T 59720.83 ±0.20% 59754.62 ±0.12% 0.06%
MonitorReliableEnterExitLatency 2T 809.31 ±0.23% 809.58 ±0.41% 0.03%
MonitorReliableEnterExitLatency 4T 1569.47 ±0.45% 1577.43 ±0.71% 0.51%
MonitorReliableEnterExitLatency 7T 1681.65 ±0.25% 1678.01 ±0.20% -0.22%
MonitorReliableEnterExitThroughput Delay 1T 4956.40 ±0.41% 4957.46 ±0.24% 0.02%
MonitorReliableEnterExitThroughput Delay 2T 4794.52 ±0.18% 4756.23 ±0.25% -0.80%
MonitorReliableEnterExitThroughput Delay 4T 4560.22 ±0.25% 4522.03 ±0.35% -0.84%
MonitorReliableEnterExitThroughput Delay 7T 3902.19 ±0.55% 3875.81 ±0.13% -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T 61944.11 ±0.20% 58083.95 ±0.08% -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T 59632.31 ±0.25% 58972.48 ±0.07% -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T 62345.13 ±0.14% 57159.99 ±0.14% -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T 59725.76 ±0.15% 58050.35 ±0.16% -2.81%
------------------------------------------------ --------------- --------------- ---------
Total 6795.49 ±0.28% 6723.21 ±0.23% -1.06%
```
Windows x86
```
Spin Left score Right score ∆ Score %
------------------------------------------------ --------------- --------------- ---------
MonitorEnterExitLatency 2T 958.97 ±0.37% 987.28 ±0.32% 2.95%
MonitorEnterExitLatency 4T 1675.18 ±0.41% 1704.64 ±0.08% 1.76%
MonitorEnterExitLatency 7T 1825.49 ±0.09% 1769.50 ±0.12% -3.07%
MonitorEnterExitThroughput Delay 1T 5083.01 ±0.27% 5047.10 ±0.37% -0.71%
MonitorEnterExitThroughput Delay 2T 4854.54 ±0.13% 4825.31 ±0.14% -0.60%
MonitorEnterExitThroughput Delay 4T 4628.89 ±0.17% 4579.92 ±0.56% -1.06%
MonitorEnterExitThroughput Delay 7T 4125.52 ±0.48% 4096.78 ±0.20% -0.70%
MonitorEnterExitThroughput_AwareLock 1T 61841.28 ±0.13% 57429.31 ±0.44% -7.13%
MonitorEnterExitThroughput_ThinLock 1T 59746.69 ±0.19% 57971.43 ±0.10% -2.97%
MonitorReliableEnterExitLatency 2T 983.26 ±0.22% 998.25 ±0.33% 1.52%
MonitorReliableEnterExitLatency 4T 1758.10 ±0.14% 1723.63 ±0.19% -1.96%
MonitorReliableEnterExitLatency 7T 1832.24 ±0.08% 1776.61 ±0.10% -3.04%
MonitorReliableEnterExitThroughput Delay 1T 5023.19 ±0.05% 4980.49 ±0.08% -0.85%
MonitorReliableEnterExitThroughput Delay 2T 4846.04 ±0.03% 4792.58 ±0.11% -1.10%
MonitorReliableEnterExitThroughput Delay 4T 4608.14 ±0.09% 4574.90 ±0.06% -0.72%
MonitorReliableEnterExitThroughput Delay 7T 4123.20 ±0.10% 4075.92 ±0.11% -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T 57951.11 ±0.11% 57006.12 ±0.21% -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T 58006.06 ±0.18% 58018.28 ±0.07% 0.02%
MonitorTryEnterExitThroughput_AwareLock 1T 60701.63 ±0.04% 53374.77 ±0.15% -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T 58169.82 ±0.05% 56023.58 ±0.69% -3.69%
------------------------------------------------ --------------- --------------- ---------
Total 7037.46 ±0.17% 6906.42 ±0.22% -1.86%
```
Linux x64
```
Spin repeater Left score Right score ∆ Score %
----------------------------------------------- --------------- --------------- ---------
MonitorEnterExitLatency 2T 3755.92 ±1.51% 3802.80 ±0.62% 1.25%
MonitorEnterExitLatency 4T 3448.14 ±1.69% 3493.84 ±1.58% 1.33%
MonitorEnterExitLatency 7T 2593.97 ±0.13% 2655.21 ±0.15% 2.36%
MonitorEnterExitThroughput Delay 1T 4854.52 ±0.12% 4873.08 ±0.11% 0.38%
MonitorEnterExitThroughput Delay 2T 4659.19 ±0.85% 4695.61 ±0.38% 0.78%
MonitorEnterExitThroughput Delay 4T 4163.01 ±1.46% 4190.94 ±1.37% 0.67%
MonitorEnterExitThroughput Delay 7T 3012.69 ±0.45% 3123.75 ±0.32% 3.69%
MonitorEnterExitThroughput_AwareLock 1T 56665.09 ±0.16% 58524.86 ±0.24% 3.28%
MonitorEnterExitThroughput_ThinLock 1T 57476.36 ±0.68% 57573.08 ±0.61% 0.17%
MonitorReliableEnterExitLatency 2T 3952.35 ±0.45% 3937.80 ±0.49% -0.37%
MonitorReliableEnterExitLatency 4T 3001.75 ±1.02% 3008.55 ±0.76% 0.23%
MonitorReliableEnterExitLatency 7T 2456.20 ±0.65% 2479.78 ±0.09% 0.96%
MonitorReliableEnterExitThroughput Delay 1T 4907.10 ±0.85% 4940.83 ±0.23% 0.69%
MonitorReliableEnterExitThroughput Delay 2T 4750.81 ±0.62% 4725.81 ±0.87% -0.53%
MonitorReliableEnterExitThroughput Delay 4T 4329.93 ±1.18% 4360.67 ±1.04% 0.71%
MonitorReliableEnterExitThroughput Delay 7T 3180.52 ±0.27% 3255.88 ±0.51% 2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T 54559.89 ±0.09% 55785.74 ±0.20% 2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T 55936.06 ±0.36% 55519.74 ±0.80% -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T 52694.96 ±0.18% 54282.77 ±0.12% 3.01%
MonitorTryEnterExitThroughput_ThinLock 1T 54942.18 ±0.24% 55031.84 ±0.38% 0.16%
----------------------------------------------- --------------- --------------- ---------
Total 8326.45 ±0.65% 8420.07 ±0.54% 1.12%
```
|
My estimation is that these regressions are small and unlikely to materialize into real-world regressions. It would simplify and ease maintenance a bit to remove the asm, but since it looks like the register allocation issues would not be resolved easily, I'm not sure if we want to remove the asm code at this time. Thoughts? |
jkotas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these minor regressions are acceptable - within noise. It is also a step towards turning on FEATURE_IMPLICIT_TLS that will allow the GetThread call to be inlined and likely recover some of it.
|
|
|
Both of those take the same path ( |
|
Here are the numbers for TryEnter(obj, 0) when the monitor is locked. The regression is magnified since it stresses the hottest fast path without anything else: Windows x64 Windows x86 The x86 asm has a bug, it's missing the (timeOut == 0) check and spinning. Linux x64 |
|
Since it sounds like we're generally ok with this, I'm going to go ahead and merge |
|
I have looked at this a bit. I was able to repro the approximate 6-8% regression on a empty C# lock{} statement on windows without contention, which is the scenario which showed the regression in the data above. This is a bit concerning as some lock synchronization does match that scenario (very little done inside the lock and no contention). However I do see the rather large amount of complexity that is removed by doing this, as well as harmonization of Windows/Linux. I do think it is best to see if we can't get this back by tweeking the C++ code, however I have not gotten to the point of offering suggestions there. I am OK with this being checked in... |
|
A couple of other things we discussed offline:
I don't immediately see any other opportunities to tweak the C++ code to improve register usage, but I'll take a closer look later. Will also try to find out why it's not using some volatile registers. |
|
Note that the data that I collected shows the lock instructions themselves as taking a good portion of the extra cost. That does not make sense in a simple perf model of the routine, so there are subtle things going on. It will require some care to tease it apart. |
|
Could it be the I assume |
|
I didn't see much time spent around GetThread(). I'll profile it again and will try to see if any rearranging around the lock cmpxchg would help. |
|
Figured out why nonvolatile registers are being used, not sure why I didn't notice it before. There are calls and it has to save arguments in nonvolatile registers around the calls. Indeed it does have something to do with GetThread(), which is the last call that is not easily inlined. At the moment, force-inlining the spin loops is producing worse code, but once GetThread() can be inlined, I can try to reshuffle to code to inline spin loops and have code generated that more closely matches the asm. |
|
including @jkotas There is magic associated with GetThread. Basically it is implemented in assembly code (see amd64\JitHelpers_Fast.asm) to be a 'slow but correct' version and then written to at runtime (see MakeOptimizedTlsGetter) that is faster. This faster version requires that there are not too many thread local variables at the time the runtime is loaded, and if so it reduces it to a two instructions (but whose value is only known at runtime (the TLS slot it got). mov rax,qword ptr gs:[14A8h] The inlining is inhibited by the fact that it is assembly code (and that fact we modify it at runtime). There is not an obvious way to get around this. (if C++ has a way of declaring that certain methods don't trash nonvolatile registers, that would certainly do it) One possible way out of this (but I am not sure it is worth it), is to have a very small assembly thunk that calls GetThread and then calls the main routine. |
|
On Unix and Windows ARM64, we are using the regular C++ thread statics that get inlined by the C++ just fine. The code for this is under I have been working on enabling |
Closes https://github.com/dotnet/coreclr/issues/13979
Perf:
Numbers (no PGO):
Windows x64
Windows x86
Linux x64