Conversation

@jkotas (Member) commented Jul 25, 2020

No description provided.

@jkotas requested a review from AaronRobinsonMSFT July 25, 2020 16:55
## Steps to set up a new experiment

- Pick a good name for your experiment and create a branch for it in dotnet/runtimelab.
- If the experiment is expected to require changes to the .NET runtime itself, it should be branched off of [dotnet/runtimelab:runtime-master](https://github.com/dotnet/runtimelab/tree/runtime-master) that is manually maintained mirror of [dotnet/runtime:master](https://github.com/dotnet/runtime/tree/master) branch.
@jkotas (Member, Author):

These will need to be updated for the master->main rename once it happens.

## Steps to set up a new experiment

- Pick a good name for your experiment and create a branch for it in dotnet/runtimelab.
- If the experiment is expected to require changes to the .NET runtime itself, it should be branched off of [dotnet/runtimelab:runtime-master](https://github.com/dotnet/runtimelab/tree/runtime-master) that is manually maintained mirror of [dotnet/runtime:master](https://github.com/dotnet/runtime/tree/master) branch.
Reviewer (Member):

is manually -> "is a manually"

- Otherwise, the experiment should be branched off of [dotnet/runtimelab:master](https://github.com/dotnet/runtimelab/tree/master) to get the required boilerplate such as LICENSE.TXT.
- Submit a PR to update the [README.MD](https://github.com/dotnet/runtimelab/blob/master/README.md#active-experimental-projects) with the name of your branch and a brief description of the experiment. (Example: [#19](https://github.com/dotnet/runtimelab/pull/19/files))
- Create a label `area-<your experiment name>` for tagging issues. The label should use color `#d4c5f9`.
- If you experiment is branched from dotnet/runtime:
Reviewer (Member):

If you experiment -> "If your experiment"

@jkotas merged commit 4072ab9 into dotnet:master Jul 25, 2020
@jkotas deleted the CreateAnExperiment branch August 1, 2020 05:49
jkoritzinsky pushed a commit to jkoritzinsky/runtime that referenced this pull request Sep 17, 2021
runtimelab-bot pushed a commit that referenced this pull request Jan 13, 2022
…(#63598)

* Fix native frame unwind in syscall on arm64 for VS4Mac crash report.

Add an arm64 version of StepWithCompactNoEncoding for syscall leaf-node wrappers that have a compact encoding of 0.

Fix ReadCompactEncodingRegister so it actually decrements the addr.

Change StepWithCompactEncodingArm64 to match what macOS libunwind does for framed and frameless stepping.

arm64 can have frames with the same SP (but different IPs). Increment SP for this condition so createdump's unwind
loop doesn't break out on the "SP not increasing" check and the frames are added to the thread frame list in the
correct order.

Handle getting the unwind info for tail-called functions like this:

__ZL14PROCEndProcessPvji:
   36630:       f6 57 bd a9     stp     x22, x21, [sp, #-48]!
   36634:       f4 4f 01 a9     stp     x20, x19, [sp, #16]
   36638:       fd 7b 02 a9     stp     x29, x30, [sp, #32]
   3663c:       fd 83 00 91     add     x29, sp, #32
...
   367ac:       e9 01 80 52     mov     w9, #15
   367b0:       7f 3e 02 71     cmp     w19, #143
   367b4:       20 01 88 1a     csel    w0, w9, w8, eq
   367b8:       2e 00 00 94     bl      _PROCAbort
_TerminateProcess:
-> 367bc:       22 00 80 52     mov     w2, #1
   367c0:       9c ff ff 17     b       __ZL14PROCEndProcessPvji

Looking up the IP (367bc) returns an (incorrect) frameless encoding with nothing on the stack (so an incorrect LR is used to unwind). To fix this,
get the unwind info for PC - 1, which points into PROCEndProcess and has the correct unwind info. This matches how lldb unwinds this frame.

Always add module segments to the IP lookup list instead of checking the module regions.

Strip pointer authentication bits on PC/LR.
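
A minimal sketch of the PC - 1 lookup in C# (illustrative only: the actual createdump code is C++, and the table, types, and encodings below are assumptions, not this PR's code):

```csharp
using System;
using System.Collections.Generic;

record struct UnwindRange(ulong Start, ulong End, uint CompactEncoding);

static class UnwindLookup
{
    // Hypothetical table of (function start, function end, compact encoding),
    // standing in for the module's real unwind-info sections.
    static readonly List<UnwindRange> Ranges = new()
    {
        new(0x36630, 0x367bc, 0x04000000), // __ZL14PROCEndProcessPvji (framed; encoding assumed)
        new(0x367bc, 0x367c4, 0x02000000), // _TerminateProcess (frameless; encoding assumed)
    };

    // For an address that lands on a tail-call target's first instruction
    // (like 367bc above), looking up PC - 1 attributes the address to the
    // function containing the branch, which is how lldb unwinds this frame.
    public static UnwindRange? Find(ulong pc, bool isReturnAddress)
    {
        ulong lookup = isReturnAddress ? pc - 1 : pc;
        foreach (UnwindRange r in Ranges)
            if (lookup >= r.Start && lookup < r.End)
                return r;
        return null;
    }
}
```
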
runtimelab-bot pushed a commit that referenced this pull request Feb 8, 2022
# Local heap optimizations on Arm64

1. When the allocated local heap space is not required to be zeroed (for sizes up to 64 bytes), do not emit a zeroing sequence. Instead, do stack probing and adjust the stack pointer:

```diff
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
+            ldr     wzr, [sp],#-64
```

2. For sizes less than one `PAGE_SIZE`, use `ldr wzr, [sp], #-amount`, which probes at `[sp]` and allocates the space at the same time. This saves one instruction for such local heap allocations:

```diff
-            ldr     wzr, [sp]
-            sub     sp, sp, #208
+            ldr     wzr, [sp],#-208
```

Use `ldp tmpReg, xzr, [sp], #-amount` when the offset is not encodable by the post-index variant of `ldr`:
```diff
-            ldr     wzr, [sp]
-            sub     sp, sp, #512
+            ldp     x0, xzr, [sp],#-512
```

3. Allow non-loop zeroing (i.e. unrolled sequence) for sizes up to 128 bytes (i.e. up to `LCLHEAP_UNROLL_LIMIT`). This frees up two internal integer registers for such cases:

```diff
-            mov     w11, #128
-                                               ;; bbWeight=0.50 PerfScore 0.25
-G_M44913_IG19:        ; gcrefRegs=00F9 {x0 x3 x4 x5 x6 x7}, byrefRegs=0000 {}, byref, isz
             stp     xzr, xzr, [sp,#-16]!
-            subs    x11, x11, #16
-            bne     G_M44913_IG19
+            stp     xzr, xzr, [sp,#-112]!
+            stp     xzr, xzr, [sp,#16]
+            stp     xzr, xzr, [sp,#32]
+            stp     xzr, xzr, [sp,#48]
+            stp     xzr, xzr, [sp,#64]
+            stp     xzr, xzr, [sp,#80]
+            stp     xzr, xzr, [sp,#96]
```

4. Do zeroing in ascending order of the effective address:

```diff
-            mov     w7, #96
-G_M49279_IG13:
             stp     xzr, xzr, [sp,#-16]!
-            subs    x7, x7, #16
-            bne     G_M49279_IG13
+            stp     xzr, xzr, [sp,#-80]!
+            stp     xzr, xzr, [sp,#16]
+            stp     xzr, xzr, [sp,#32]
+            stp     xzr, xzr, [sp,#48]
+            stp     xzr, xzr, [sp,#64]
```

In the example, the zeroing is done at addresses `[initialSp-16]`, `[initialSp-96]`, `[initialSp-80]`, `[initialSp-64]`, `[initialSp-48]`, `[initialSp-32]`. The idea here is to allow a CPU to detect the sequential `memset`-to-zero pattern and switch into write-streaming mode.
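
For context, these local heap (localloc) allocations typically come from `stackalloc`. A minimal sketch of code where item 1 can apply (illustrative; the method and sizes are not from this PR), since `[SkipLocalsInit]` waives the zero-initialization requirement:

```csharp
using System.Runtime.CompilerServices;

static class LocallocExample
{
    // With [SkipLocalsInit] the JIT is not required to zero the stackalloc'd
    // space, so when this allocation stays a 64-byte localloc, items 1/2
    // above let the JIT emit a single probe that also adjusts SP instead of
    // a zeroing sequence.
    [SkipLocalsInit]
    static unsafe int Fill(int seed)
    {
        int* buffer = stackalloc int[16]; // 64 bytes
        for (int i = 0; i < 16; i++)
            buffer[i] = seed + i;
        return buffer[15];
    }
}
```
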
yowl pushed a commit to yowl/runtimelab that referenced this pull request Mar 2, 2023
* Support Arm64 "constructed" constants in SuperPMI asm diffs

SuperPMI asm diffs try to ignore constants that can change between
multiple replays, such as addresses that the replay engine must generate
rather than simply hand back from the collected data.

Often, addresses have associated relocations generated during replay.
SuperPMI can use these relocations to adjust the constants to allow
two replays to match. However, there are cases on Arm64 where an address
both doesn't report a relocation and is "constructed" using multiple
`mov`/`movk` instructions.

One case is the `allocPgoInstrumentationBySchema()` API, which returns
a pointer to a PGO data buffer. An address within this buffer is
constructed via a sequence such as:
```
mov     x0, #63408
movk    x0, #23602, lsl #16
movk    x0, #606, lsl #32
```

When SuperPMI replays this API, it constructs a new buffer and returns that
pointer, which is used to construct various actual addresses that are
generated as "constructed" constants, shown above.

This change "de-constructs" the constants and looks them up in the replay
address map. If base and diff match the mapped constants, there is no asm diff.

* Fix 32-bit build

I don't think we fully support 64-bit replay on a 32-bit host, but this
fix at least makes this case possible.

* Support more general mov/movk sequence

Allow JIT1 and JIT2 to have different sequences of
mov/movk[/movk[/movk]] that map to the same address in the
address map. That is, the replay constant might require a different
set of instructions (e.g., if a `movk` is missing because its constant
is zero).
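
A sketch of the "de-construction" step in C# (illustrative; the real change is in SuperPMI's C++ code). The field positions are the standard A64 MOVZ/MOVK encoding: imm16 in bits [20:5], hw shift in bits [22:21]:

```csharp
using System;

static class MovMovkDecoder
{
    // Reassembles the 64-bit constant built by a mov (MOVZ) + movk sequence.
    // Example (the sequence from the commit message above):
    //   0xD29EF600  mov  x0, #63408
    //   0xF2AB8640  movk x0, #23602, lsl #16
    //   0xF2C04BC0  movk x0, #606,   lsl #32
    // => 0x0000025E5C32F7B0
    public static ulong Reconstruct(ReadOnlySpan<uint> instrs)
    {
        ulong value = 0;
        foreach (uint insn in instrs)
        {
            ulong imm16 = (insn >> 5) & 0xFFFF;          // imm16, bits [20:5]
            int shift = (int)((insn >> 21) & 0x3) * 16;  // hw, bits [22:21]
            if ((insn & 0x7F800000u) == 0x52800000u)     // MOVZ: starts the value
                value = imm16 << shift;
            else if ((insn & 0x7F800000u) == 0x72800000u) // MOVK: patches 16 bits
                value = (value & ~(0xFFFFUL << shift)) | (imm16 << shift);
        }
        return value;
    }
}
```

De-constructed this way, the base and diff constants can both be looked up in the replay address map before comparing.
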
jakobbotsch added a commit to jakobbotsch/runtimelab that referenced this pull request Apr 9, 2025
Based on the new `FIELD_LIST` support for returns, this PR adds support for the
JIT to combine smaller fields via bitwise operations when they are returned,
instead of spilling them to the stack.

win-x64 examples:
```csharp
static int? Test()
{
    return Environment.TickCount;
}
```

```diff
        call     System.Environment:get_TickCount():int
-       mov      dword ptr [rsp+0x24], eax
-       mov      byte  ptr [rsp+0x20], 1
-       mov      rax, qword ptr [rsp+0x20]
-						;; size=19 bbWeight=1 PerfScore 4.00
+       mov      eax, eax
+       shl      rax, 32
+       or       rax, 1
+						;; size=15 bbWeight=1 PerfScore 2.00
```
(the `mov eax, eax` is unnecessary, but not that simple to get rid of)

```csharp
static (int x, float y) Test(int x, float y)
{
    return (x, y);
}
```

```diff
-       mov      dword ptr [rsp], ecx
-       vmovss   dword ptr [rsp+0x04], xmm1
-       mov      rax, qword ptr [rsp]
+       vmovd    eax, xmm1
+       shl      rax, 32
+       mov      ecx, ecx
+       or       rax, rcx
 						;; size=13 bbWeight=1 PerfScore 3.00
```
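
The tuple case corresponds to this bit-packing, sketched in C# (hypothetical helper, not PR code):

```csharp
using System;

static class ReturnPacking
{
    // Packs (int x, float y) into the single 64-bit return register the way
    // the win-x64 codegen above does.
    public static ulong PackIntFloat(int x, float y)
    {
        ulong lo = (uint)x;                                         // mov ecx, ecx (zero-extend)
        ulong hi = (ulong)BitConverter.SingleToUInt32Bits(y) << 32; // vmovd + shl 32
        return hi | lo;                                             // or rax, rcx
    }
}
```
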

An arm64 example:
```csharp
static Memory<int> ToMemory(int[] arr)
{
    return arr.AsMemory();
}
```

```diff
 G_M45070_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            str     xzr, [fp, #0x10]	// [V03 tmp2]
-						;; size=12 bbWeight=1 PerfScore 2.50
-G_M45070_IG02:  ;; offset=0x000C
+						;; size=8 bbWeight=1 PerfScore 1.50
+G_M45070_IG02:  ;; offset=0x0008
             cbz     x0, G_M45070_IG06
 						;; size=4 bbWeight=1 PerfScore 1.00
-G_M45070_IG03:  ;; offset=0x0010
-            str     x0, [fp, #0x10]	// [V07 tmp6]
-            str     wzr, [fp, #0x18]	// [V08 tmp7]
-            ldr     x0, [fp, #0x10]	// [V07 tmp6]
-            ldr     w0, [x0, #0x08]
-            str     w0, [fp, #0x1C]	// [V09 tmp8]
-						;; size=20 bbWeight=0.80 PerfScore 6.40
-G_M45070_IG04:  ;; offset=0x0024
-            ldp     x0, x1, [fp, #0x10]	// [V03 tmp2], [V03 tmp2+0x08]
-						;; size=4 bbWeight=1 PerfScore 3.00
-G_M45070_IG05:  ;; offset=0x0028
-            ldp     fp, lr, [sp], #0x20
+G_M45070_IG03:  ;; offset=0x000C
+            ldr     w1, [x0, #0x08]
+						;; size=4 bbWeight=0.80 PerfScore 2.40
+G_M45070_IG04:  ;; offset=0x0010
+            mov     w1, w1
+            mov     x2, xzr
+            orr     x1, x2, x1,  LSL #32
+						;; size=12 bbWeight=1 PerfScore 2.00
+G_M45070_IG05:  ;; offset=0x001C
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-G_M45070_IG06:  ;; offset=0x0030
-            str     xzr, [fp, #0x10]	// [V07 tmp6]
-            str     xzr, [fp, #0x18]
+G_M45070_IG06:  ;; offset=0x0024
+            mov     x0, xzr
+            mov     w1, wzr
             b       G_M45070_IG04
-						;; size=12 bbWeight=0.20 PerfScore 0.60
+						;; size=12 bbWeight=0.20 PerfScore 0.40
```
(sneak peek -- this codegen requires some supplementary changes, and there are
additional opportunities here)

This is the return counterpart to #112740. That PR has a bunch of regressions
that make it look like we need to support returns/call arguments first, before
we try to support parameters.

There are a few follow-ups here:
- Support for float->float insertions (when a float value needs to be returned
  as the 1st, 2nd, .... field of a SIMD register)
- Support for coalescing memory loads, particularly because the fields of the
  `FIELD_LIST` come from a promoted struct that ended up DNER. In those cases we
  should be able to recombine the fields back to a single large field, instead
  of combining them with bitwise operations.
- Support for constant folding the bitwise insertions. This requires some more
  constant folding support in lowering.
- The JIT has lots of (now outdated) restrictions based around multi-reg returns
  that get in the way. Lifting these should improve things considerably.
yowl pushed a commit to yowl/runtimelab that referenced this pull request Nov 8, 2025
…740)

Recent work now allows us to finally add support for the backend to
extract fields out of parameters without spilling them to the stack.
Previously this was only supported when the fields mapped cleanly to
registers.

A win-x64 example:
```csharp
static int Foo(int? foo)
{
    return foo.HasValue ? foo.Value : 0;
}
```
```diff
 ; Method Program:Foo(System.Nullable`1[int]):int (FullOpts)
 G_M19236_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=0.50 PerfScore 0.50
+						;; size=0 bbWeight=0.50 PerfScore 0.00

-G_M19236_IG02:  ;; offset=0x0005
-       movzx    rcx, cl
-       xor      eax, eax
-       test     ecx, ecx
-       cmovne   eax, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=0.50 PerfScore 1.38
+G_M19236_IG02:  ;; offset=0x0000
+       movzx    rax, cl
+       shr      rcx, 32
+       xor      edx, edx
+       test     eax, eax
+       mov      eax, edx
+       cmovne   eax, ecx
+						;; size=16 bbWeight=0.50 PerfScore 0.88

-G_M19236_IG03:  ;; offset=0x0011
+G_M19236_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=0.50 PerfScore 0.50
-; Total bytes of code: 18
-
+; Total bytes of code: 17
```
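
What the new codegen does, expressed in C# over the raw register value (hypothetical helper; on win-x64 the `int?` argument travels in one 64-bit register, `HasValue` in the low byte and `Value` in the upper 32 bits):

```csharp
static class NullableUnpack
{
    // Mirrors the movzx/shr/cmovne sequence in the diff above.
    public static int Foo(ulong rcx)
    {
        bool hasValue = (byte)rcx != 0; // movzx rax, cl
        int value = (int)(rcx >> 32);   // shr rcx, 32
        return hasValue ? value : 0;    // xor/test/cmovne
    }
}
```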

Another win-x64 example:
```csharp
static float Sum(PointF p)
{
    return p.X + p.Y;
}
```

```diff
 ; Method Program:Sum(System.Drawing.PointF):float (FullOpts)
 G_M48891_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00

-G_M48891_IG02:  ;; offset=0x0005
-       vmovss   xmm0, dword ptr [rsp+0x08]
-       vaddss   xmm0, xmm0, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=1 PerfScore 8.00
+G_M48891_IG02:  ;; offset=0x0000
+       vmovd    xmm0, ecx
+       shr      rcx, 32
+       vmovd    xmm1, ecx
+       vaddss   xmm0, xmm0, xmm1
+						;; size=16 bbWeight=1 PerfScore 7.50

-G_M48891_IG03:  ;; offset=0x0011
+G_M48891_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code: 18
+; Total bytes of code: 17
```
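
Likewise for the `PointF` case (hypothetical helper; both floats arrive packed in one integer register):

```csharp
using System;

static class PointFUnpack
{
    // Mirrors the vmovd/shr/vmovd/vaddss sequence in the diff above.
    public static float Sum(ulong rcx)
    {
        float x = BitConverter.UInt32BitsToSingle((uint)rcx);         // vmovd xmm0, ecx
        float y = BitConverter.UInt32BitsToSingle((uint)(rcx >> 32)); // shr + vmovd xmm1, ecx
        return x + y;                                                 // vaddss
    }
}
```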

An arm64 example:
```csharp
static bool Test(Memory<int> mem)
{
    return mem.Length > 10;
}
```

```diff
 ; Method Program:Test(System.Memory`1[int]):ubyte (FullOpts)
 G_M53448_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            stp     x0, x1, [fp, #0x10]	// [V00 arg0], [V00 arg0+0x08]
-						;; size=12 bbWeight=1 PerfScore 2.50
+						;; size=8 bbWeight=1 PerfScore 1.50

-G_M53448_IG02:  ;; offset=0x000C
-            ldr     w0, [fp, #0x1C]	// [V00 arg0+0x0c]
+G_M53448_IG02:  ;; offset=0x0008
+            lsr     x0, x1, #32
             cmp     w0, #10
             cset    x0, gt
-						;; size=12 bbWeight=1 PerfScore 3.00
+						;; size=12 bbWeight=1 PerfScore 2.00

-G_M53448_IG03:  ;; offset=0x0018
-            ldp     fp, lr, [sp], #0x20
+G_M53448_IG03:  ;; offset=0x0014
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 32
+; Total bytes of code: 28
```

Float -> float extractions that do not map cleanly are still not supported, but
should be doable (via vector register extractions). Float -> int extractions are
not supported, but I'm not sure we see these.

This is often not a code size improvement, but typically a perfscore
improvement. Also, this seems to have some bad interactions with call arguments,
since they do not yet support something similar, but hopefully that can be
improved separately.