Conversation

@jkotas (Member) commented Jul 25, 2020

No description provided.

@jkotas requested a review from AaronRobinsonMSFT July 25, 2020 16:55
## Steps to set up a new experiment

- Pick a good name for your experiment and create a branch for it in dotnet/runtimelab.
- If the experiment is expected to require changes to the .NET runtime itself, it should be branched off of [dotnet/runtimelab:runtime-master](https://github.com/dotnet/runtimelab/tree/runtime-master) that is manually maintained mirror of [dotnet/runtime:master](https://github.com/dotnet/runtime/tree/master) branch.
@jkotas (Member, Author):

These will need to be updated for the master->main rename once it happens.

## Steps to set up a new experiment

- Pick a good name for your experiment and create a branch for it in dotnet/runtimelab.
- If the experiment is expected to require changes to the .NET runtime itself, it should be branched off of [dotnet/runtimelab:runtime-master](https://github.com/dotnet/runtimelab/tree/runtime-master) that is manually maintained mirror of [dotnet/runtime:master](https://github.com/dotnet/runtime/tree/master) branch.
Reviewer (Member):

is manually -> "is a manually"

- Otherwise, the experiment should be branched off of [dotnet/runtimelab:master](https://github.com/dotnet/runtimelab/tree/master) to get the required boilerplate such as LICENSE.TXT.
- Submit a PR to update the [README.MD](https://github.com/dotnet/runtimelab/blob/master/README.md#active-experimental-projects) with the name of your branch and a brief description of the experiment. (Example: [#19](https://github.com/dotnet/runtimelab/pull/19/files))
- Create a label `area-<your experiment name>` for tagging issues. The label should use color `#d4c5f9`.
- If you experiment is branched from dotnet/runtime:
Reviewer (Member):

If you experiment -> "If your experiment"

@jkotas merged commit 4072ab9 into dotnet:master Jul 25, 2020
@jkotas deleted the CreateAnExperiment branch August 1, 2020 05:49
jkoritzinsky pushed a commit to jkoritzinsky/runtime that referenced this pull request Sep 17, 2021
runtimelab-bot pushed a commit that referenced this pull request Jan 13, 2022
…(#63598)

* Fix native frame unwind in syscall on arm64 for VS4Mac crash report.

Add an arm64 version of StepWithCompactNoEncoding for syscall leaf-node wrappers that have a compact encoding of 0.

Fix ReadCompactEncodingRegister so it actually decrements the addr.

Change StepWithCompactEncodingArm64 to match what macOS libunwind does for framed and frameless stepping.

arm64 can have frames with the same SP (but different IPs). Increment SP for this condition so createdump's unwind
loop doesn't break out on the "SP not increasing" check and the frames are added to the thread frame list in the
correct order.

Handle getting the unwind info for tail-called functions like this:

__ZL14PROCEndProcessPvji:
   36630:       f6 57 bd a9     stp     x22, x21, [sp, #-48]!
   36634:       f4 4f 01 a9     stp     x20, x19, [sp, #16]
   36638:       fd 7b 02 a9     stp     x29, x30, [sp, #32]
   3663c:       fd 83 00 91     add     x29, sp, #32
...
   367ac:       e9 01 80 52     mov     w9, #15
   367b0:       7f 3e 02 71     cmp     w19, #143
   367b4:       20 01 88 1a     csel    w0, w9, w8, eq
   367b8:       2e 00 00 94     bl      _PROCAbort
_TerminateProcess:
-> 367bc:       22 00 80 52     mov     w2, #1
   367c0:       9c ff ff 17     b       __ZL14PROCEndProcessPvji

Looking up the IP (367bc) returns an (incorrect) frameless encoding with nothing on the stack (so an incorrect LR is used to unwind). To fix this,
get the unwind info for PC - 1, which points into PROCEndProcess and has the correct unwind info. This matches how lldb unwinds this frame.

Always add module segments to the IP lookup list instead of checking the module regions.

Strip pointer authentication bits on PC/LR.
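
A minimal sketch of the PC - 1 lookup in C# (illustrative only: the actual createdump code is C++, and the table, types, and encodings below are assumptions, not this PR's code):

```csharp
using System;
using System.Collections.Generic;

record struct UnwindRange(ulong Start, ulong End, uint CompactEncoding);

static class UnwindLookup
{
    // Hypothetical table of (function start, function end, compact encoding),
    // standing in for the module's real unwind-info sections.
    static readonly List<UnwindRange> Ranges = new()
    {
        new(0x36630, 0x367bc, 0x04000000), // __ZL14PROCEndProcessPvji (framed; encoding assumed)
        new(0x367bc, 0x367c4, 0x02000000), // _TerminateProcess (frameless; encoding assumed)
    };

    // For an address that lands on a tail-call target's first instruction
    // (like 367bc above), looking up PC - 1 attributes the address to the
    // function containing the branch, which is how lldb unwinds this frame.
    public static UnwindRange? Find(ulong pc, bool isReturnAddress)
    {
        ulong lookup = isReturnAddress ? pc - 1 : pc;
        foreach (UnwindRange r in Ranges)
            if (lookup >= r.Start && lookup < r.End)
                return r;
        return null;
    }
}
```
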
runtimelab-bot pushed a commit that referenced this pull request Feb 8, 2022
# Local heap optimizations on Arm64

1. When the allocated local heap space is not required to be zeroed (for sizes up to 64 bytes), do not emit a zeroing sequence. Instead, do stack probing and adjust the stack pointer:

```diff
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
-            stp     xzr, xzr, [sp,#-16]!
+            ldr     wzr, [sp],#-64
```

2. For sizes less than one `PAGE_SIZE`, use `ldr wzr, [sp], #-amount`, which probes at `[sp]` and allocates the space at the same time. This saves one instruction for such local heap allocations:

```diff
-            ldr     wzr, [sp]
-            sub     sp, sp, #208
+            ldr     wzr, [sp],#-208
```

Use `ldp tmpReg, xzr, [sp], #-amount` when the offset is not encodable by the post-index variant of `ldr`:
```diff
-            ldr     wzr, [sp]
-            sub     sp, sp, #512
+            ldp     x0, xzr, [sp],#-512
```

3. Allow non-loop zeroing (i.e. unrolled sequence) for sizes up to 128 bytes (i.e. up to `LCLHEAP_UNROLL_LIMIT`). This frees up two internal integer registers for such cases:

```diff
-            mov     w11, #128
-                                               ;; bbWeight=0.50 PerfScore 0.25
-G_M44913_IG19:        ; gcrefRegs=00F9 {x0 x3 x4 x5 x6 x7}, byrefRegs=0000 {}, byref, isz
             stp     xzr, xzr, [sp,#-16]!
-            subs    x11, x11, #16
-            bne     G_M44913_IG19
+            stp     xzr, xzr, [sp,#-112]!
+            stp     xzr, xzr, [sp,#16]
+            stp     xzr, xzr, [sp,#32]
+            stp     xzr, xzr, [sp,#48]
+            stp     xzr, xzr, [sp,#64]
+            stp     xzr, xzr, [sp,#80]
+            stp     xzr, xzr, [sp,#96]
```

4. Do zeroing in ascending order of the effective address:

```diff
-            mov     w7, #96
-G_M49279_IG13:
             stp     xzr, xzr, [sp,#-16]!
-            subs    x7, x7, #16
-            bne     G_M49279_IG13
+            stp     xzr, xzr, [sp,#-80]!
+            stp     xzr, xzr, [sp,#16]
+            stp     xzr, xzr, [sp,#32]
+            stp     xzr, xzr, [sp,#48]
+            stp     xzr, xzr, [sp,#64]
```

In the example, the zeroing is done at addresses `[initialSp-16]`, `[initialSp-96]`, `[initialSp-80]`, `[initialSp-64]`, `[initialSp-48]`, `[initialSp-32]`. The idea here is to allow a CPU to detect the sequential `memset`-to-zero pattern and switch into write-streaming mode.
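
For context, these local heap (localloc) allocations typically come from `stackalloc`. A minimal sketch of code where item 1 can apply (illustrative; the method and sizes are not from this PR), since `[SkipLocalsInit]` waives the zero-initialization requirement:

```csharp
using System.Runtime.CompilerServices;

static class LocallocExample
{
    // With [SkipLocalsInit] the JIT is not required to zero the stackalloc'd
    // space, so when this allocation stays a 64-byte localloc, items 1/2
    // above let the JIT emit a single probe that also adjusts SP instead of
    // a zeroing sequence.
    [SkipLocalsInit]
    static unsafe int Fill(int seed)
    {
        int* buffer = stackalloc int[16]; // 64 bytes
        for (int i = 0; i < 16; i++)
            buffer[i] = seed + i;
        return buffer[15];
    }
}
```
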
yowl pushed a commit to yowl/runtimelab that referenced this pull request Mar 2, 2023
* Support Arm64 "constructed" constants in SuperPMI asm diffs

SuperPMI asm diffs try to ignore constants that can change between
multiple replays, such as addresses that the replay engine must generate
rather than simply hand back from the collected data.

Often, addresses have associated relocations generated during replay.
SuperPMI can use these relocations to adjust the constants to allow
two replays to match. However, there are cases on Arm64 where an address
both doesn't report a relocation and is "constructed" using multiple
`mov`/`movk` instructions.

One case is the `allocPgoInstrumentationBySchema()` API, which returns
a pointer to a PGO data buffer. An address within this buffer is
constructed via a sequence such as:
```
mov     x0, #63408
movk    x0, #23602, lsl #16
movk    x0, #606, lsl #32
```

When SuperPMI replays this API, it constructs a new buffer and returns that
pointer, which is used to construct various actual addresses that are
generated as "constructed" constants, shown above.

This change "de-constructs" the constants and looks them up in the replay
address map. If base and diff match the mapped constants, there is no asm diff.

* Fix 32-bit build

I don't think we fully support 64-bit replay on a 32-bit host, but this
fix at least makes this case possible.

* Support more general mov/movk sequence

Allow JIT1 and JIT2 to have different sequences of
mov/movk[/movk[/movk]] that map to the same address in the
address map. That is, the replay constant might require a different
set of instructions (e.g., if a `movk` is missing because its constant
is zero).
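
A sketch of the "de-construction" step in C# (illustrative; the real change is in SuperPMI's C++ code). The field positions are the standard A64 MOVZ/MOVK encoding: imm16 in bits [20:5], hw shift in bits [22:21]:

```csharp
using System;

static class MovMovkDecoder
{
    // Reassembles the 64-bit constant built by a mov (MOVZ) + movk sequence.
    // Example (the sequence from the commit message above):
    //   0xD29EF600  mov  x0, #63408
    //   0xF2AB8640  movk x0, #23602, lsl #16
    //   0xF2C04BC0  movk x0, #606,   lsl #32
    // => 0x0000025E5C32F7B0
    public static ulong Reconstruct(ReadOnlySpan<uint> instrs)
    {
        ulong value = 0;
        foreach (uint insn in instrs)
        {
            ulong imm16 = (insn >> 5) & 0xFFFF;          // imm16, bits [20:5]
            int shift = (int)((insn >> 21) & 0x3) * 16;  // hw, bits [22:21]
            if ((insn & 0x7F800000u) == 0x52800000u)     // MOVZ: starts the value
                value = imm16 << shift;
            else if ((insn & 0x7F800000u) == 0x72800000u) // MOVK: patches 16 bits
                value = (value & ~(0xFFFFUL << shift)) | (imm16 << shift);
        }
        return value;
    }
}
```

De-constructed this way, the base and diff constants can both be looked up in the replay address map before comparing.
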
jakobbotsch added a commit to jakobbotsch/runtimelab that referenced this pull request Apr 9, 2025
Based on the new `FIELD_LIST` support for returns, this PR adds support for the
JIT to combine smaller fields via bitwise operations when they are returned,
instead of spilling them to the stack.

win-x64 examples:
```csharp
static int? Test()
{
    return Environment.TickCount;
}
```

```diff
        call     System.Environment:get_TickCount():int
-       mov      dword ptr [rsp+0x24], eax
-       mov      byte  ptr [rsp+0x20], 1
-       mov      rax, qword ptr [rsp+0x20]
-						;; size=19 bbWeight=1 PerfScore 4.00
+       mov      eax, eax
+       shl      rax, 32
+       or       rax, 1
+						;; size=15 bbWeight=1 PerfScore 2.00
```
(the `mov eax, eax` is unnecessary, but not that simple to get rid of)

```csharp
static (int x, float y) Test(int x, float y)
{
    return (x, y);
}
```

```diff
-       mov      dword ptr [rsp], ecx
-       vmovss   dword ptr [rsp+0x04], xmm1
-       mov      rax, qword ptr [rsp]
+       vmovd    eax, xmm1
+       shl      rax, 32
+       mov      ecx, ecx
+       or       rax, rcx
 						;; size=13 bbWeight=1 PerfScore 3.00
```
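
The tuple case corresponds to this bit-packing, sketched in C# (hypothetical helper, not PR code):

```csharp
using System;

static class ReturnPacking
{
    // Packs (int x, float y) into the single 64-bit return register the way
    // the win-x64 codegen above does.
    public static ulong PackIntFloat(int x, float y)
    {
        ulong lo = (uint)x;                                         // mov ecx, ecx (zero-extend)
        ulong hi = (ulong)BitConverter.SingleToUInt32Bits(y) << 32; // vmovd + shl 32
        return hi | lo;                                             // or rax, rcx
    }
}
```
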

An arm64 example:
```csharp
static Memory<int> ToMemory(int[] arr)
{
    return arr.AsMemory();
}
```

```diff
 G_M45070_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            str     xzr, [fp, #0x10]	// [V03 tmp2]
-						;; size=12 bbWeight=1 PerfScore 2.50
-G_M45070_IG02:  ;; offset=0x000C
+						;; size=8 bbWeight=1 PerfScore 1.50
+G_M45070_IG02:  ;; offset=0x0008
             cbz     x0, G_M45070_IG06
 						;; size=4 bbWeight=1 PerfScore 1.00
-G_M45070_IG03:  ;; offset=0x0010
-            str     x0, [fp, #0x10]	// [V07 tmp6]
-            str     wzr, [fp, #0x18]	// [V08 tmp7]
-            ldr     x0, [fp, #0x10]	// [V07 tmp6]
-            ldr     w0, [x0, #0x08]
-            str     w0, [fp, #0x1C]	// [V09 tmp8]
-						;; size=20 bbWeight=0.80 PerfScore 6.40
-G_M45070_IG04:  ;; offset=0x0024
-            ldp     x0, x1, [fp, #0x10]	// [V03 tmp2], [V03 tmp2+0x08]
-						;; size=4 bbWeight=1 PerfScore 3.00
-G_M45070_IG05:  ;; offset=0x0028
-            ldp     fp, lr, [sp], #0x20
+G_M45070_IG03:  ;; offset=0x000C
+            ldr     w1, [x0, #0x08]
+						;; size=4 bbWeight=0.80 PerfScore 2.40
+G_M45070_IG04:  ;; offset=0x0010
+            mov     w1, w1
+            mov     x2, xzr
+            orr     x1, x2, x1,  LSL #32
+						;; size=12 bbWeight=1 PerfScore 2.00
+G_M45070_IG05:  ;; offset=0x001C
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-G_M45070_IG06:  ;; offset=0x0030
-            str     xzr, [fp, #0x10]	// [V07 tmp6]
-            str     xzr, [fp, #0x18]
+G_M45070_IG06:  ;; offset=0x0024
+            mov     x0, xzr
+            mov     w1, wzr
             b       G_M45070_IG04
-						;; size=12 bbWeight=0.20 PerfScore 0.60
+						;; size=12 bbWeight=0.20 PerfScore 0.40
```
(sneak peek -- this codegen requires some supplementary changes, and there are
additional opportunities here)

This is the return counterpart to #112740. That PR has a bunch of regressions
that make it look like we need to support returns/call arguments first, before
we try to support parameters.

There are a few follow-ups here:
- Support for float->float insertions (when a float value needs to be returned
  as the 1st, 2nd, .... field of a SIMD register)
- Support for coalescing memory loads, particularly because the fields of the
  `FIELD_LIST` come from a promoted struct that ended up DNER. In those cases we
  should be able to recombine the fields back to a single large field, instead
  of combining them with bitwise operations.
- Support for constant folding the bitwise insertions. This requires some more
  constant folding support in lowering.
- The JIT has lots of (now outdated) restrictions based around multi-reg returns
  that get in the way. Lifting these should improve things considerably.
yowl pushed a commit to yowl/runtimelab that referenced this pull request Nov 8, 2025
…740)

Recent work now allows us to finally add support for the backend to
extract fields out of parameters without spilling them to the stack.
Previously this was only supported when the fields mapped cleanly to
registers.

A win-x64 example:
```csharp
static int Foo(int? foo)
{
    return foo.HasValue ? foo.Value : 0;
}
```
```diff
 ; Method Program:Foo(System.Nullable`1[int]):int (FullOpts)
 G_M19236_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=0.50 PerfScore 0.50
+						;; size=0 bbWeight=0.50 PerfScore 0.00

-G_M19236_IG02:  ;; offset=0x0005
-       movzx    rcx, cl
-       xor      eax, eax
-       test     ecx, ecx
-       cmovne   eax, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=0.50 PerfScore 1.38
+G_M19236_IG02:  ;; offset=0x0000
+       movzx    rax, cl
+       shr      rcx, 32
+       xor      edx, edx
+       test     eax, eax
+       mov      eax, edx
+       cmovne   eax, ecx
+						;; size=16 bbWeight=0.50 PerfScore 0.88

-G_M19236_IG03:  ;; offset=0x0011
+G_M19236_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=0.50 PerfScore 0.50
-; Total bytes of code: 18
-
+; Total bytes of code: 17
```
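
What the new codegen does, expressed in C# over the raw register value (hypothetical helper; on win-x64 the `int?` argument travels in one 64-bit register, `HasValue` in the low byte and `Value` in the upper 32 bits):

```csharp
static class NullableUnpack
{
    // Mirrors the movzx/shr/cmovne sequence in the diff above.
    public static int Foo(ulong rcx)
    {
        bool hasValue = (byte)rcx != 0; // movzx rax, cl
        int value = (int)(rcx >> 32);   // shr rcx, 32
        return hasValue ? value : 0;    // xor/test/cmovne
    }
}
```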

Another win-x64 example:
```csharp
static float Sum(PointF p)
{
    return p.X + p.Y;
}
```

```diff
 ; Method Program:Sum(System.Drawing.PointF):float (FullOpts)
 G_M48891_IG01:  ;; offset=0x0000
-       mov      qword ptr [rsp+0x08], rcx
-						;; size=5 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00

-G_M48891_IG02:  ;; offset=0x0005
-       vmovss   xmm0, dword ptr [rsp+0x08]
-       vaddss   xmm0, xmm0, dword ptr [rsp+0x0C]
-						;; size=12 bbWeight=1 PerfScore 8.00
+G_M48891_IG02:  ;; offset=0x0000
+       vmovd    xmm0, ecx
+       shr      rcx, 32
+       vmovd    xmm1, ecx
+       vaddss   xmm0, xmm0, xmm1
+						;; size=16 bbWeight=1 PerfScore 7.50

-G_M48891_IG03:  ;; offset=0x0011
+G_M48891_IG03:  ;; offset=0x0010
        ret
 						;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code: 18
+; Total bytes of code: 17
```
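
Likewise for the `PointF` case (hypothetical helper; both floats arrive packed in one integer register):

```csharp
using System;

static class PointFUnpack
{
    // Mirrors the vmovd/shr/vmovd/vaddss sequence in the diff above.
    public static float Sum(ulong rcx)
    {
        float x = BitConverter.UInt32BitsToSingle((uint)rcx);         // vmovd xmm0, ecx
        float y = BitConverter.UInt32BitsToSingle((uint)(rcx >> 32)); // shr + vmovd xmm1, ecx
        return x + y;                                                 // vaddss
    }
}
```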

An arm64 example:
```csharp
static bool Test(Memory<int> mem)
{
    return mem.Length > 10;
}
```

```diff
 ; Method Program:Test(System.Memory`1[int]):ubyte (FullOpts)
 G_M53448_IG01:  ;; offset=0x0000
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            stp     x0, x1, [fp, #0x10]	// [V00 arg0], [V00 arg0+0x08]
-						;; size=12 bbWeight=1 PerfScore 2.50
+						;; size=8 bbWeight=1 PerfScore 1.50

-G_M53448_IG02:  ;; offset=0x000C
-            ldr     w0, [fp, #0x1C]	// [V00 arg0+0x0c]
+G_M53448_IG02:  ;; offset=0x0008
+            lsr     x0, x1, #32
             cmp     w0, #10
             cset    x0, gt
-						;; size=12 bbWeight=1 PerfScore 3.00
+						;; size=12 bbWeight=1 PerfScore 2.00

-G_M53448_IG03:  ;; offset=0x0018
-            ldp     fp, lr, [sp], #0x20
+G_M53448_IG03:  ;; offset=0x0014
+            ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 32
+; Total bytes of code: 28
```

Float -> float extractions that do not map cleanly are still not supported, but
should be doable (via vector register extractions). Float -> int extractions are
not supported, but I'm not sure we see these.

This is often not a code size improvement, but typically a perfscore
improvement. Also, this seems to have some bad interactions with call arguments,
since they do not yet support something similar, but hopefully that can be
improved separately.