Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@kouvel
Copy link

@kouvel kouvel commented Sep 23, 2017

Closes https://github.com/dotnet/coreclr/issues/13979

  • Removed asm helpers on Windows and used portable C++ helpers instead
  • Rearranged fast path code to improve them a bit and match the asm more closely

Perf:

  • The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now, the remaining differences are:
    • There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
    • Register usage
      • x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
      • x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
      • I believe this is the main remaining gap and PGO is not helping with this
  • On Linux, perf is >= before for the most part
  • Perf tests used for below are updated in PR Add normalized equivalent of YieldProcessor, retune some spin loops #13670

Numbers (no PGO):

Windows x64

Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          800.56 ±0.33%    821.97 ±0.30%      2.67%
MonitorEnterExitLatency 4T                         1533.25 ±0.34%   1553.82 ±0.13%      1.34%
MonitorEnterExitLatency 7T                         1676.14 ±0.26%   1678.14 ±0.18%      0.12%
MonitorEnterExitThroughput Delay 1T                5174.77 ±0.25%   5125.56 ±0.27%     -0.95%
MonitorEnterExitThroughput Delay 2T                4982.38 ±0.22%   4937.79 ±0.19%     -0.90%
MonitorEnterExitThroughput Delay 4T                4720.41 ±0.37%   4694.09 ±0.24%     -0.56%
MonitorEnterExitThroughput Delay 7T                3741.20 ±0.33%   3778.06 ±0.20%      0.99%
MonitorEnterExitThroughput_AwareLock 1T           63445.04 ±0.20%  61540.28 ±0.23%     -3.00%
MonitorEnterExitThroughput_ThinLock 1T            59720.83 ±0.20%  59754.62 ±0.12%      0.06%
MonitorReliableEnterExitLatency 2T                  809.31 ±0.23%    809.58 ±0.41%      0.03%
MonitorReliableEnterExitLatency 4T                 1569.47 ±0.45%   1577.43 ±0.71%      0.51%
MonitorReliableEnterExitLatency 7T                 1681.65 ±0.25%   1678.01 ±0.20%     -0.22%
MonitorReliableEnterExitThroughput Delay 1T        4956.40 ±0.41%   4957.46 ±0.24%      0.02%
MonitorReliableEnterExitThroughput Delay 2T        4794.52 ±0.18%   4756.23 ±0.25%     -0.80%
MonitorReliableEnterExitThroughput Delay 4T        4560.22 ±0.25%   4522.03 ±0.35%     -0.84%
MonitorReliableEnterExitThroughput Delay 7T        3902.19 ±0.55%   3875.81 ±0.13%     -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T   61944.11 ±0.20%  58083.95 ±0.08%     -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T    59632.31 ±0.25%  58972.48 ±0.07%     -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T        62345.13 ±0.14%  57159.99 ±0.14%     -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T         59725.76 ±0.15%  58050.35 ±0.16%     -2.81%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              6795.49 ±0.28%   6723.21 ±0.23%     -1.06%

Windows x86

Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          958.97 ±0.37%    987.28 ±0.32%      2.95%
MonitorEnterExitLatency 4T                         1675.18 ±0.41%   1704.64 ±0.08%      1.76%
MonitorEnterExitLatency 7T                         1825.49 ±0.09%   1769.50 ±0.12%     -3.07%
MonitorEnterExitThroughput Delay 1T                5083.01 ±0.27%   5047.10 ±0.37%     -0.71%
MonitorEnterExitThroughput Delay 2T                4854.54 ±0.13%   4825.31 ±0.14%     -0.60%
MonitorEnterExitThroughput Delay 4T                4628.89 ±0.17%   4579.92 ±0.56%     -1.06%
MonitorEnterExitThroughput Delay 7T                4125.52 ±0.48%   4096.78 ±0.20%     -0.70%
MonitorEnterExitThroughput_AwareLock 1T           61841.28 ±0.13%  57429.31 ±0.44%     -7.13%
MonitorEnterExitThroughput_ThinLock 1T            59746.69 ±0.19%  57971.43 ±0.10%     -2.97%
MonitorReliableEnterExitLatency 2T                  983.26 ±0.22%    998.25 ±0.33%      1.52%
MonitorReliableEnterExitLatency 4T                 1758.10 ±0.14%   1723.63 ±0.19%     -1.96%
MonitorReliableEnterExitLatency 7T                 1832.24 ±0.08%   1776.61 ±0.10%     -3.04%
MonitorReliableEnterExitThroughput Delay 1T        5023.19 ±0.05%   4980.49 ±0.08%     -0.85%
MonitorReliableEnterExitThroughput Delay 2T        4846.04 ±0.03%   4792.58 ±0.11%     -1.10%
MonitorReliableEnterExitThroughput Delay 4T        4608.14 ±0.09%   4574.90 ±0.06%     -0.72%
MonitorReliableEnterExitThroughput Delay 7T        4123.20 ±0.10%   4075.92 ±0.11%     -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T   57951.11 ±0.11%  57006.12 ±0.21%     -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T    58006.06 ±0.18%  58018.28 ±0.07%      0.02%
MonitorTryEnterExitThroughput_AwareLock 1T        60701.63 ±0.04%  53374.77 ±0.15%    -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T         58169.82 ±0.05%  56023.58 ±0.69%     -3.69%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              7037.46 ±0.17%   6906.42 ±0.22%     -1.86%

Linux x64

Spin repeater                                    Left score       Right score      ∆ Score %
-----------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                        3755.92 ±1.51%   3802.80 ±0.62%      1.25%
MonitorEnterExitLatency 4T                        3448.14 ±1.69%   3493.84 ±1.58%      1.33%
MonitorEnterExitLatency 7T                        2593.97 ±0.13%   2655.21 ±0.15%      2.36%
MonitorEnterExitThroughput Delay 1T               4854.52 ±0.12%   4873.08 ±0.11%      0.38%
MonitorEnterExitThroughput Delay 2T               4659.19 ±0.85%   4695.61 ±0.38%      0.78%
MonitorEnterExitThroughput Delay 4T               4163.01 ±1.46%   4190.94 ±1.37%      0.67%
MonitorEnterExitThroughput Delay 7T               3012.69 ±0.45%   3123.75 ±0.32%      3.69%
MonitorEnterExitThroughput_AwareLock 1T          56665.09 ±0.16%  58524.86 ±0.24%      3.28%
MonitorEnterExitThroughput_ThinLock 1T           57476.36 ±0.68%  57573.08 ±0.61%      0.17%
MonitorReliableEnterExitLatency 2T                3952.35 ±0.45%   3937.80 ±0.49%     -0.37%
MonitorReliableEnterExitLatency 4T                3001.75 ±1.02%   3008.55 ±0.76%      0.23%
MonitorReliableEnterExitLatency 7T                2456.20 ±0.65%   2479.78 ±0.09%      0.96%
MonitorReliableEnterExitThroughput Delay 1T       4907.10 ±0.85%   4940.83 ±0.23%      0.69%
MonitorReliableEnterExitThroughput Delay 2T       4750.81 ±0.62%   4725.81 ±0.87%     -0.53%
MonitorReliableEnterExitThroughput Delay 4T       4329.93 ±1.18%   4360.67 ±1.04%      0.71%
MonitorReliableEnterExitThroughput Delay 7T       3180.52 ±0.27%   3255.88 ±0.51%      2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T  54559.89 ±0.09%  55785.74 ±0.20%      2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T   55936.06 ±0.36%  55519.74 ±0.80%     -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T       52694.96 ±0.18%  54282.77 ±0.12%      3.01%
MonitorTryEnterExitThroughput_ThinLock 1T        54942.18 ±0.24%  55031.84 ±0.38%      0.16%
-----------------------------------------------  ---------------  ---------------  ---------
Total                                             8326.45 ±0.65%   8420.07 ±0.54%      1.12%

- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely

Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now, the remaining differences are:
  - There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
  - Register usage
    - x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
    - x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
    - I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR #13670

Numbers (no PGO):

Windows x64

```
Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          800.56 ±0.33%    821.97 ±0.30%      2.67%
MonitorEnterExitLatency 4T                         1533.25 ±0.34%   1553.82 ±0.13%      1.34%
MonitorEnterExitLatency 7T                         1676.14 ±0.26%   1678.14 ±0.18%      0.12%
MonitorEnterExitThroughput Delay 1T                5174.77 ±0.25%   5125.56 ±0.27%     -0.95%
MonitorEnterExitThroughput Delay 2T                4982.38 ±0.22%   4937.79 ±0.19%     -0.90%
MonitorEnterExitThroughput Delay 4T                4720.41 ±0.37%   4694.09 ±0.24%     -0.56%
MonitorEnterExitThroughput Delay 7T                3741.20 ±0.33%   3778.06 ±0.20%      0.99%
MonitorEnterExitThroughput_AwareLock 1T           63445.04 ±0.20%  61540.28 ±0.23%     -3.00%
MonitorEnterExitThroughput_ThinLock 1T            59720.83 ±0.20%  59754.62 ±0.12%      0.06%
MonitorReliableEnterExitLatency 2T                  809.31 ±0.23%    809.58 ±0.41%      0.03%
MonitorReliableEnterExitLatency 4T                 1569.47 ±0.45%   1577.43 ±0.71%      0.51%
MonitorReliableEnterExitLatency 7T                 1681.65 ±0.25%   1678.01 ±0.20%     -0.22%
MonitorReliableEnterExitThroughput Delay 1T        4956.40 ±0.41%   4957.46 ±0.24%      0.02%
MonitorReliableEnterExitThroughput Delay 2T        4794.52 ±0.18%   4756.23 ±0.25%     -0.80%
MonitorReliableEnterExitThroughput Delay 4T        4560.22 ±0.25%   4522.03 ±0.35%     -0.84%
MonitorReliableEnterExitThroughput Delay 7T        3902.19 ±0.55%   3875.81 ±0.13%     -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T   61944.11 ±0.20%  58083.95 ±0.08%     -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T    59632.31 ±0.25%  58972.48 ±0.07%     -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T        62345.13 ±0.14%  57159.99 ±0.14%     -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T         59725.76 ±0.15%  58050.35 ±0.16%     -2.81%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              6795.49 ±0.28%   6723.21 ±0.23%     -1.06%
```

Windows x86

```
Spin                                              Left score       Right score      ∆ Score %
------------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                          958.97 ±0.37%    987.28 ±0.32%      2.95%
MonitorEnterExitLatency 4T                         1675.18 ±0.41%   1704.64 ±0.08%      1.76%
MonitorEnterExitLatency 7T                         1825.49 ±0.09%   1769.50 ±0.12%     -3.07%
MonitorEnterExitThroughput Delay 1T                5083.01 ±0.27%   5047.10 ±0.37%     -0.71%
MonitorEnterExitThroughput Delay 2T                4854.54 ±0.13%   4825.31 ±0.14%     -0.60%
MonitorEnterExitThroughput Delay 4T                4628.89 ±0.17%   4579.92 ±0.56%     -1.06%
MonitorEnterExitThroughput Delay 7T                4125.52 ±0.48%   4096.78 ±0.20%     -0.70%
MonitorEnterExitThroughput_AwareLock 1T           61841.28 ±0.13%  57429.31 ±0.44%     -7.13%
MonitorEnterExitThroughput_ThinLock 1T            59746.69 ±0.19%  57971.43 ±0.10%     -2.97%
MonitorReliableEnterExitLatency 2T                  983.26 ±0.22%    998.25 ±0.33%      1.52%
MonitorReliableEnterExitLatency 4T                 1758.10 ±0.14%   1723.63 ±0.19%     -1.96%
MonitorReliableEnterExitLatency 7T                 1832.24 ±0.08%   1776.61 ±0.10%     -3.04%
MonitorReliableEnterExitThroughput Delay 1T        5023.19 ±0.05%   4980.49 ±0.08%     -0.85%
MonitorReliableEnterExitThroughput Delay 2T        4846.04 ±0.03%   4792.58 ±0.11%     -1.10%
MonitorReliableEnterExitThroughput Delay 4T        4608.14 ±0.09%   4574.90 ±0.06%     -0.72%
MonitorReliableEnterExitThroughput Delay 7T        4123.20 ±0.10%   4075.92 ±0.11%     -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T   57951.11 ±0.11%  57006.12 ±0.21%     -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T    58006.06 ±0.18%  58018.28 ±0.07%      0.02%
MonitorTryEnterExitThroughput_AwareLock 1T        60701.63 ±0.04%  53374.77 ±0.15%    -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T         58169.82 ±0.05%  56023.58 ±0.69%     -3.69%
------------------------------------------------  ---------------  ---------------  ---------
Total                                              7037.46 ±0.17%   6906.42 ±0.22%     -1.86%
```

Linux x64

```
Spin repeater                                    Left score       Right score      ∆ Score %
-----------------------------------------------  ---------------  ---------------  ---------
MonitorEnterExitLatency 2T                        3755.92 ±1.51%   3802.80 ±0.62%      1.25%
MonitorEnterExitLatency 4T                        3448.14 ±1.69%   3493.84 ±1.58%      1.33%
MonitorEnterExitLatency 7T                        2593.97 ±0.13%   2655.21 ±0.15%      2.36%
MonitorEnterExitThroughput Delay 1T               4854.52 ±0.12%   4873.08 ±0.11%      0.38%
MonitorEnterExitThroughput Delay 2T               4659.19 ±0.85%   4695.61 ±0.38%      0.78%
MonitorEnterExitThroughput Delay 4T               4163.01 ±1.46%   4190.94 ±1.37%      0.67%
MonitorEnterExitThroughput Delay 7T               3012.69 ±0.45%   3123.75 ±0.32%      3.69%
MonitorEnterExitThroughput_AwareLock 1T          56665.09 ±0.16%  58524.86 ±0.24%      3.28%
MonitorEnterExitThroughput_ThinLock 1T           57476.36 ±0.68%  57573.08 ±0.61%      0.17%
MonitorReliableEnterExitLatency 2T                3952.35 ±0.45%   3937.80 ±0.49%     -0.37%
MonitorReliableEnterExitLatency 4T                3001.75 ±1.02%   3008.55 ±0.76%      0.23%
MonitorReliableEnterExitLatency 7T                2456.20 ±0.65%   2479.78 ±0.09%      0.96%
MonitorReliableEnterExitThroughput Delay 1T       4907.10 ±0.85%   4940.83 ±0.23%      0.69%
MonitorReliableEnterExitThroughput Delay 2T       4750.81 ±0.62%   4725.81 ±0.87%     -0.53%
MonitorReliableEnterExitThroughput Delay 4T       4329.93 ±1.18%   4360.67 ±1.04%      0.71%
MonitorReliableEnterExitThroughput Delay 7T       3180.52 ±0.27%   3255.88 ±0.51%      2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T  54559.89 ±0.09%  55785.74 ±0.20%      2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T   55936.06 ±0.36%  55519.74 ±0.80%     -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T       52694.96 ±0.18%  54282.77 ±0.12%      3.01%
MonitorTryEnterExitThroughput_ThinLock 1T        54942.18 ±0.24%  55031.84 ±0.38%      0.16%
-----------------------------------------------  ---------------  ---------------  ---------
Total                                             8326.45 ±0.65%   8420.07 ±0.54%      1.12%
```
@kouvel kouvel added this to the 2.1.0 milestone Sep 23, 2017
@kouvel kouvel self-assigned this Sep 23, 2017
@kouvel
Copy link
Author

kouvel commented Sep 23, 2017

My estimation is that these regressions are small and unlikely to materialize into real-world regressions. It would simplify and ease maintenance a bit to remove the asm, but since it looks like the register allocation issues would not be resolved easily, I'm not sure if we want to remove the asm code at this time. Thoughts?

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these minor regressions are acceptable - within noise. It is also a step towards turning on FEATURE_IMPLICIT_TLS that will allow the GetThread call to be inlined and likely recover some of it.

@benaadams
Copy link
Member

Monitor.TryEnter(obj, 0) or Monitor.TryEnter(obj, 0, ref locktaken) on a held lock are ones to pay attention to also - though likely similar behaviour

@kouvel
Copy link
Author

kouvel commented Sep 23, 2017

Both of those take the same path (JIT_MonTryEnter_Portable), I will add tests for them

@kouvel
Copy link
Author

kouvel commented Sep 25, 2017

Here are the numbers for TryEnter(obj, 0) when the monitor is locked. The regression is magnified since it stresses the hottest fast path without anything else:

Windows x64

Spin                                                    Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   62478.43 ±0.12%   57283.48 ±0.05%     -8.31%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    59957.64 ±0.05%   58220.28 ±0.05%     -2.90%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        261757.49 ±0.07%  263728.85 ±0.05%      0.75%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         358791.08 ±0.10%  282192.04 ±0.08%    -21.35%
------------------------------------------------------  ----------------  ----------------  ---------

Windows x86

Spin                                                    Left score       Right score       ∆ Score %    
------------------------------------------------------  ---------------  ----------------  -------------
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T  60439.02 ±0.06%   58743.22 ±0.05%         -2.81%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T   58054.67 ±0.04%   54146.54 ±0.04%         -6.73%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T            0.17 ±0.11%  263171.32 ±0.06%  154820492.22%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T             0.17 ±0.06%  282132.46 ±0.04%  165506080.89%
------------------------------------------------------  ---------------  ----------------  -------------

The x86 asm has a bug, it's missing the (timeOut == 0) check and spinning.

Linux x64

Spin                                                    Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   52888.78 ±0.09%   53626.11 ±0.16%      1.39%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    55162.14 ±0.11%   55233.58 ±0.08%      0.13%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        213186.82 ±0.08%  217280.85 ±0.06%      1.92%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         230271.52 ±0.10%  230525.31 ±0.12%      0.11%
------------------------------------------------------  ----------------  ----------------  ---------

@kouvel
Copy link
Author

kouvel commented Sep 26, 2017

Since it sounds like we're generally ok with this, I'm going to go ahead and merge

@vancem
Copy link

vancem commented Sep 26, 2017

I have looked at this a bit. I was able to repro the approximate 6-8% regression on a empty C# lock{} statement on windows without contention, which is the scenario which showed the regression in the data above. This is a bit concerning as some lock synchronization does match that scenario (very little done inside the lock and no contention).

However I do see the rather large amount of complexity that is removed by doing this, as well as harmonization of Windows/Linux. I do think it is best to see if we can't get this back by tweeking the C++ code, however I have not gotten to the point of offering suggestions there.

I am OK with this being checked in...

@kouvel
Copy link
Author

kouvel commented Sep 26, 2017

A couple of other things we discussed offline:

  • C++ helpers are not using tail calls. Seems nontrivial to fix, may help to avoid one register save/restore, but not sure.
  • If we can avoid the "lockTaken" parameter in MonReliableEnter and MonTryEnter, it would improve slightly (equivalent to calling MonEnter, which has one less register save/restore). It looks like this could be because lock(m) { ... } is translated to:
    try { Monitor.Enter(m, ref lockTaken); ... }
    finally { if(lockTaken) Monitor.Exit(m); }
    If it could be written as this instead, it could save a register:
    Monitor.Enter(m);
    try { ... }
    finally { Monitor.Exit(m); }
    I'll file an issue.

I don't immediately see any other opportunities to tweak the C++ code to improve register usage, but I'll take a closer look later. Will also try to find out why it's not using some volatile registers.

@kouvel kouvel merged commit 8f0ac5d into dotnet:master Sep 26, 2017
@kouvel kouvel deleted the RemoveMonitorAsm branch September 26, 2017 20:14
@vancem
Copy link

vancem commented Sep 26, 2017

Note that the data that I collected shows the lock instructions themselves as taking a good portion of the extra cost. That does not make sense in a simple perf model of the routine, so there are subtle things going on. It will require some care to tease it apart.

@benaadams
Copy link
Member

Could it be the pCurThread = GetThread(); that's used before acquiring the lock in the C++ @jkotas was referring to #14146 (review)?

I assume DWORD tid = pCurThread->GetThreadId(); would be lightweight?

@kouvel
Copy link
Author

kouvel commented Sep 26, 2017

I didn't see much time spent around GetThread(). I'll profile it again and will try to see if any rearranging around the lock cmpxchg would help.

@kouvel
Copy link
Author

kouvel commented Oct 6, 2017

Figured out why nonvolatile registers are being used, not sure why I didn't notice it before. There are calls and it has to save arguments in nonvolatile registers around the calls. Indeed it does have something to do with GetThread(), which is the last call that is not easily inlined. At the moment, force-inlining the spin loops is producing worse code, but once GetThread() can be inlined, I can try to reshuffle to code to inline spin loops and have code generated that more closely matches the asm.

@vancem
Copy link

vancem commented Oct 6, 2017

including @jkotas

There is magic associated with GetThread. Basically it is implemented in assembly code (see amd64\JitHelpers_Fast.asm) to be a 'slow but correct' version and then written to at runtime (see MakeOptimizedTlsGetter) that is faster. This faster version requires that there are not too many thread local variables at the time the runtime is loaded, and if so it reduces it to a two instructions (but whose value is only known at runtime (the TLS slot it got).

mov rax,qword ptr gs:[14A8h]
ret

The inlining is inhibited by the fact that it is assembly code (and that fact we modify it at runtime).

There is not an obvious way to get around this. (if C++ has a way of declaring that certain methods don't trash nonvolatile registers, that would certainly do it)

One possible way out of this (but I am not sure it is worth it), is to have a very small assembly thunk that calls GetThread and then calls the main routine.

@jkotas
Copy link
Member

jkotas commented Oct 6, 2017

On Unix and Windows ARM64, we are using the regular C++ thread statics that get inlined by the C++ just fine. The code for this is under FEATURE_IMPLICIT_TLS. The regular C++ statics have an extra indirection on Windows, but there is no magic required to access them and C++ compiler understands how to inline the access and optimized around it so it is a general win.

I have been working on enabling FEATURE_IMPLICIT_TLS for the rest of Windows platforms. It would fix this problem, and some other ones too - e.g. managed thread statics would be faster on Windows. There are a lot of places where the current GetThread magic needs to be undone or redone ... https://github.com/jkotas/coreclr/tree/tls

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove asm versions of Monitor helpers and use portable C++ versions instead

5 participants