Skip to content

Update SKIP_ALLOC_FRAME to match 3.1, 5.0 and 6.0 instruction sequences in function prolog#45184

Merged
echesakov merged 12 commits intodotnet:masterfrom
echesakov:Runtime_45090
Jan 5, 2021
Merged

Update SKIP_ALLOC_FRAME to match 3.1, 5.0 and 6.0 instruction sequences in function prolog#45184
echesakov merged 12 commits intodotnet:masterfrom
echesakov:Runtime_45090

Conversation

@echesakov
Copy link
Contributor

@echesakov echesakov commented Nov 25, 2020

First, let's confirm that the added test fails. It does

The added test is designed in a way that exposes an issue as in #45090.

In order to do this, AllocLocal methods must have esp-based frames. Most of the methods would require to call a stack probe helper, but some would use inlined stack probing sequence as it is demonstrated in an example below.

There are three cases that should be considered in SKIP_ALLOC_FRAME in order for the unwinder to work properly:

  1. frameSize <= 0xC00 - a function prolog contains sub esp, frameSize - no probing necessary
; Assembly listing for method Runtime_45090._00000800:AllocLocal(int):int
; Emitting BLENDED_CODE for Pentium 4 - Windows
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  ecx
;  V01 loc0         [V01,T01] (  1,  1   )  struct (2052) [esp+0x00]   do-not-enreg[SF] ld-addr-op
;
; Lcl frame size = 2052

G_M11545_IG01:              ;; offset=0000H
       81EC04080000 sub      esp, 0x804
                                                ;; bbWeight=1    PerfScore 0.25
G_M11545_IG02:              ;; offset=0006H
       8BC1         mov      eax, ecx
       030424       add      eax, dword ptr [esp]
                                                ;; bbWeight=1    PerfScore 1.25
G_M11545_IG03:              ;; offset=000BH
       81C404080000 add      esp, 0x804
       C3           ret
                                                ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 18, prolog size 6, PerfScore 4.55, instruction count 5 (MethodHash=ccc1d2e6) for method Runtime_45090._00000800:AllocLocal(int):int
; ============================================================
  1. 0xC00 < frameSize < 0x1000 - a function prolog contains sub esp, frameSize followed by test eax, [esp] - the latter ensure that [esp+1024] always points to either already committed page or guard page.
; Assembly listing for method Runtime_45090._00000E00:AllocLocal(int):int
; Emitting BLENDED_CODE for Pentium 4 - Windows
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  ecx
;  V01 loc0         [V01,T01] (  1,  1   )  struct (3588) [esp+0x00]   do-not-enreg[SF] ld-addr-op
;
; Lcl frame size = 3588

G_M17060_IG01:              ;; offset=0000H
       81EC040E0000 sub      esp, 0xE04
       850424       test     eax, dword ptr [esp]
                                                ;; bbWeight=1    PerfScore 2.25
G_M17060_IG02:              ;; offset=0009H
       8BC1         mov      eax, ecx
       030424       add      eax, dword ptr [esp]
                                                ;; bbWeight=1    PerfScore 1.25
G_M17060_IG03:              ;; offset=000EH
       81C4040E0000 add      esp, 0xE04
       C3           ret
                                                ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 21, prolog size 9, PerfScore 6.85, instruction count 6 (MethodHash=9d80bd5b) for method Runtime_45090._00000E00:AllocLocal(int):int
; ============================================================

frameSize > 0x1000 - a function prolog contains a call to stack probe helper

; Assembly listing for method Runtime_45090._00005000:AllocLocal(int):int
; Emitting BLENDED_CODE for Pentium 4 - Windows
; optimized code
; esp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  ecx
;  V01 loc0         [V01,T01] (  1,  1   )  struct (20484) [esp+0x00]   do-not-enreg[SF] ld-addr-op
;
; Lcl frame size = 20484

G_M55284_IG01:              ;; offset=0000H
       8D8424FCAFFFFF lea      eax, [esp-5004H]
       E8CADD6E5A   call     CORINFO_HELP_STACK_PROBE
       8BE0         mov      esp, eax
                                                ;; bbWeight=1    PerfScore 1.75
G_M55284_IG02:              ;; offset=000EH
       8BC1         mov      eax, ecx
       030424       add      eax, dword ptr [esp]
                                                ;; bbWeight=1    PerfScore 1.25
G_M55284_IG03:              ;; offset=0013H
       81C404500000 add      esp, 0x5004
       C3           ret
                                                ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 26, prolog size 14, PerfScore 6.85, instruction count 7 (MethodHash=3727280b) for method Runtime_45090._00005000:AllocLocal(int):int
; ============================================================

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@echesakov
Copy link
Contributor Author

@janvorli @jkotas Can you please take a look? What would you suggest for testing the change other than running GCStress test legs?

@echesakov echesakov marked this pull request as ready for review November 25, 2020 19:45
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be hard-coded number instead of the OS page size to avoid problem with OS page size changing between AOT time and JIT time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. Thanks for catching this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if I take an approach as discussed in another PR - that this should match target page size for JIT-ed code and 0x1000 for R2R code, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime assumes that the unwinding algorithm is same for all code flavors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unwinding algorithm is part of R2R contract. This will require bumping the major R2R format version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never changed the R2R format version in the past. Is there any guideline I should follow to accomplish that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is about bumping the version number in https://github.com/dotnet/runtime/blob/master/src/coreclr/inc/readytorun.h and other places this file points to.

There are two ways how to evolve the R2R contract:

  • Backward compatible: It means keeping both old and new version of the unwinding algorithm, and choosing the right algortihm based on the R2R format version in the given image
  • Non-backward compatible (ie breaking): We will fallback to JIT for the older R2R images.
    We prefer the backward compatible evolution when we have a choice.

@dotnet/crossgen-contrib Do you have thoughts on whether we want to do R2R version break for 6.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the simplest thing to do for this one may be to do support both old and new unwind sequences. The unwinding inspects the instruction stream and so it should be possible to handle both without passing any explicit versions around.

You will still want to change these numbers to:

#define READYTORUN_MAJOR_VERSION 0x0005
#define READYTORUN_MINOR_VERSION 0x0001

but that will still be backward compatible change.

Copy link
Contributor Author

@echesakov echesakov Dec 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for suggestion, Jan! I will rework this and add support for the old unwind sequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still validating the changes but so far I haven't seen any issues with 3.1 and 5.0 R2R images.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to also include READYTORUN_MAJOR_VERSION bump that I have mentioned above.

Copy link
Contributor Author

@echesakov echesakov Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding me to do the bump - I updated both readytorun.h and ModuleHeaders.cs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is special about this test that it cannot be written in C#?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably can. But it was simpler for me to express conditions for AllocLocal in order to have esp-based frame.

Copy link
Contributor Author

@echesakov echesakov Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the test in C# - turned out that it was enough to have a method as virtual to make inliner cooperate and not inline the method while keeping its frame esp based (I couldn't use NoInlining attribute since the method would have JIT_FLAG_FRAMED flag set and becomes ebp based).

@echesakov
Copy link
Contributor Author

The test seem to expose some limitation in Mono runtime

  Starting:    JIT.Regression.XUnitWrapper
    JIT/Regression/JitBlue/Runtime_45090/Runtime_45090/Runtime_45090.sh [FAIL]
      
      Unhandled Exception:
      System.InvalidProgramException: Unable to run method 'int Runtime_45090.Program:Main (string[])': locals size too big.
      [ERROR] FATAL UNHANDLED EXCEPTION: System.InvalidProgramException: Unable to run method 'int Runtime_45090.Program:Main (string[])': locals size too big.
      
      Return code:      1
      Raw output file:      /private/tmp/helix/working/A65B09FB/w/ADB70993/e/JIT/Regression/Reports/JIT.Regression/JitBlue/Runtime_45090/Runtime_45090/Runtime_45090.output.txt
      Raw output:
      BEGIN EXECUTION
      /tmp/helix/working/A65B09FB/p/corerun Runtime_45090.dll ''
      Expected: 100
      Actual: 1
      END EXECUTION - FAILED

@jkotas
Copy link
Member

jkotas commented Nov 25, 2020

What would you suggest for testing the change other than running GCStress test legs?

You can try to build a test that hits this crash even without GC stress.

@echesakov
Copy link
Contributor Author

What would you suggest for testing the change other than running GCStress test legs?

You can try to build a test that hits this crash even without GC stress.

@jkotas Do you have any suggestion how to approach this? I was thinking about spawning a large number of threads all running AllocLocal methods in a loop and have a dedicated thread that would trigger GC.Collect once in a while, so the unwinder would walk the stacks of stopped AllocLocal methods.

@jkotas
Copy link
Member

jkotas commented Dec 1, 2020

Yes, this sounds like the right approach.

@ViktorHofer
Copy link
Member

// Auto-generated message

69e114c which was merged 12/7 removed the intermediate src/coreclr/src/ folder. This PR needs to be updated as it touches files in that directory which causes conflicts.

To update your commits you can use this bash script: https://gist.github.com/ViktorHofer/6d24f62abdcddb518b4966ead5ef3783. Feel free to use the comment section of the gist to improve the script for others.

@echesakov
Copy link
Contributor Author

Yes, this sounds like the right approach.

I tried this approach by running methods allocating a local frame in different threads while triggering GC at the same time.
Although such test does fail sometimes, I could not make it deterministic and fail always, so the current implementation of the test still uses GCStress.

I also extended the test cases and I believe I covered all the different frame allocation methods that JIT can use on win-x86:
PushReg

G_M21059_IG01:              ;; offset=0000H
       50           push     eax
       33C0         xor      eax, eax
       890424       mov      dword ptr [esp], eax

NoProbe

G_M27846_IG01:              ;; offset=0000H
       57           push     edi
       81EC00010000 sub      esp, 256

InlineProbe

G_M31598_IG01:              ;; offset=0000H
       57           push     edi
       81EC000E0000 sub      esp, 0xE00
       850424       test     eax, dword ptr [esp]

HelperProbe

G_M56481_IG01:              ;; offset=0000H
       57           push     edi
       8D842400F0FFFF lea      eax, [esp-1000H]
       E849E90673   call     CORINFO_HELP_STACK_PROBE
       8BE0         mov      esp, eax

@echesakov echesakov closed this Dec 11, 2020
@echesakov echesakov reopened this Dec 11, 2020
@JulieLeeMSFT
Copy link
Member

Waiting for #10184 fix for full validation.

@echesakov
Copy link
Contributor Author

Given that GCStress=8 doesn't currently work with R2R methods I tested the change manually.

Test methodology is to call one of the implementations of VirtMethodEspBasedFrame method in a thread and trigger GC from another thread. Then wait for debugger to stop at SKIP_ALLOC_FRAME and check that the final base+offset computed by that function points at an instruction after the stack allocation/probing instructions in a prolog. Repeat this for all implementations of the method and for R2R code produced by 3.1, 5.0 and master - total 24 scenarios.

@echesakov
Copy link
Contributor Author

Triggered runtime-coreclr gcstress0x3-gcstress0xc pipeline to validate the change and that it fixes #45090

@echesakov
Copy link
Contributor Author

echesakov commented Dec 18, 2020

gcstress windows x86 is clean - suggesting that the change fixes the original issue.

There are failures in gcstress windows x64 and gcstress linux arm pipelines - I triggered another gcstress on top of master.
Those were probably already fixed and appear in the first run since I triggered that first gcstress run manually in AzDO and hence didn't do rebasing on top of master.

@echesakov echesakov changed the title Update SKIP_ALLOC_FRAME to match new intrstruction sequence in function prolog Update SKIP_ALLOC_FRAME to match 3.1, 5.0 and 6.0 instruction sequences in function prolog Dec 18, 2020
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is JIT_FLAG_PROF_REJIT_NOPS still going to work with this change?

JIT_FLAG_PROF_REJIT_NOPS is not used currently. It may be best to delete it instead of keeping partially broken implementation around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JIT_FLAG_PROF_REJIT_NOPS is not used currently. It may be best to delete it instead of keeping partially broken implementation around.

@jkotas Would it be fine with you if I undo the change to the jit file in this PR and come back with another PR removing JIT_FLAG_PROF_REJIT_NOPS flag and all the related infrastructure?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #46560 to address this

@echesakov
Copy link
Contributor Author

Rebased on top of latest master and reverted the change src/coreclr/jit/codegenxarch.cpp.

@echesakov echesakov merged commit 8b8a3fc into dotnet:master Jan 5, 2021
@echesakov echesakov deleted the Runtime_45090 branch January 5, 2021 01:07
@ghost ghost locked as resolved and limited conversation to collaborators Feb 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test failure: CheckInstrBytePattern(base[offset ] & 0xFD, 0x81, base[offset]) && CheckInstrBytePattern(base[offset+1] & 0xC0, 0xC0, base[offset+1])

5 participants