Have CSE correctly track float/simd and mask registers for their own enregistration #117118

tannergooding · 2025-06-28T16:53:13Z

This updates CSE to correctly track float/simd and mask registers for their own enregistration.

Previously the logic attempted to skip the handling of these types or even misclassified simd/mask under the integer enregistration limits. This could lead to weird diffs and register selection in some cases due to suboptimal CSE considerations.

As is typical with these changes, it's "difficult" to assess whether its an improvement or not just by looking at the diffs, because the CSE decisions impact other optimizations, register selection, etc. The diffs are primarily impacting methods that use floating-point or SIMD and generally look positive overall.

In general, we're ending up with fewer things marked as CSE moderate and instead more being marked as CSE aggressive instead. This is because we're no longer eating up the CSE budget by classifying everything as integer. This leads to many cases with more total CSEs:

-;  V94 cse0         [V94,T11] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
-;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: moderate"
-;  V96 cse2         [V96,T14] (  4, 15.84)   simd8  ->  d18         "CSE #07: moderate"
-;  V97 cse3         [V97,T15] (  4, 15.84)   simd8  ->  d19         "CSE #08: moderate"
+;  V94 cse0         [V94,T12] (  3,  1.50)    long  ->   x1         "CSE #05: moderate"
+;  V95 cse1         [V95,T02] (  3, 12   )    long  ->   x5         "CSE #02: aggressive"
+;  V96 cse2         [V96,T15] (  4, 15.84)   simd8  ->  d18         "CSE #07: aggressive"
+;  V97 cse3         [V97,T16] (  4, 15.84)   simd8  ->  d19         "CSE #08: aggressive"
+;  V98 cse4         [V98,T03] (  3, 11.88)    long  ->   x4         "CSE #06: aggressive"

This in turn shifts some temp numbers and impacts everything else in the method. Most of which look to be positive changes, such as:

-						;; size=236 bbWeight=1 PerfScore 81.00
+						;; size=220 bbWeight=1 PerfScore 73.00

Which is because we no longer reload some constants and do other computations unnecessarily:

; gcrRegs +[x19]
             ldr     d16, [@RWD00]
             ldr     d17, [@RWD08]
-            fmin    d16, d16, d17
-            str     d16, [fp, #0x30]	// [V08 cse0]
-            str     d16, [x19, #0x08]
-            ldr     d17, [@RWD16]
-            ldr     d18, [@RWD24]
-            fmin    d17, d17, d18
-            str     d17, [fp, #0x28]	// [V09 cse1]
-            str     d17, [x19, #0x10]
-            ldr     d18, [@RWD00]
-            ldr     d19, [@RWD08]
-            fmax    d18, d18, d19
-            str     d18, [fp, #0x20]	// [V10 cse2]
-            str     d18, [x19, #0x18]
+            fmin    d18, d16, d17
+            str     d18, [fp, #0x30]	// [V08 cse0]
+            str     d18, [x19, #0x08]
             ldr     d19, [@RWD16]
             ldr     d20, [@RWD24]
-            fmax    d19, d19, d20
-            str     d19, [fp, #0x18]	// [V11 cse3]
-            str     d19, [x19, #0x20]
+            fmin    d21, d19, d20
+            str     d21, [fp, #0x28]	// [V09 cse1]
+            str     d21, [x19, #0x10]
+            fmax    d16, d16, d17
+            str     d16, [fp, #0x20]	// [V10 cse2]
+            str     d16, [x19, #0x18]
+            fmax    d17, d19, d20
+            str     d17, [fp, #0x18]	// [V11 cse3]
+            str     d17, [x19, #0x20]

There's some tuning to be done, however, because there's some places where things like zero constants are being CSE'd where it'd be better to have morph mark against that so we can still optimize accordingly (which should be a simple fix):

             mov     w3, wzr
 						;; size=16 bbWeight=0.50 PerfScore 1.00
 G_M48160_IG04:        ; bbWeight=3.96, gcrefRegs=0000 {}, byrefRegs=0005 {x0 x2}, byref, isz
-            ldr     d18, [x0, w3, UXTW #3]
-            ldr     d19, [x2, w3, UXTW #3]
+            mov     w4, w3
+            ldr     d18, [x0, x4, LSL #3]
+            ldr     d19, [x2, x4, LSL #3]

There are many similar cases on x64 and with some similar fine tuning that could be done.

…regCount

src/coreclr/jit/optcse.cpp

tannergooding · 2025-06-28T21:33:44Z

CC. @dotnet/jit-contrib. This should be ready for review.

There's a minor (+0.02% for most, but up to +0.04% on x64 Windows) throughput hit, but this gets us more into a correct shape for CSE considerations and removes some historical quirks where we were eating the integer CSE budget due to floating-point and mask types being in the method. So I think it's worth taking and then doing some follow-up targeted fixes (likely around not CSE'ing certain containable constants) where possible.

risc-vv · 2025-06-28T21:35:38Z

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

risc-vv · 2025-06-28T21:35:58Z

RISC-V Release-CLR-QEMU: 9086 / 9116 (99.67%)

=======================
      passed: 9086
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9716
VIRTUAL time: 35h 20min 51s 144ms
   REAL time: 36min 7s 681ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9087 / 9117 (99.67%)

=======================
      passed: 9087
      failed: 2
     skipped: 600
      killed: 28
------------------------
 TOTAL tests: 9717
VIRTUAL time: 10h 43min 39s 782ms
   REAL time: 43min 36s 331ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 335773 / 336836 (99.68%)

=======================
      passed: 335773
      failed: 1054
     skipped: 39
      killed: 9
------------------------
 TOTAL tests: 336875
VIRTUAL time: 32h 34min 59s 596ms
   REAL time: 1h 11min 18s 986ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 499778 / 501520 (99.65%)

=======================
      passed: 499778
      failed: 1734
     skipped: 39
      killed: 8
------------------------
 TOTAL tests: 501559
VIRTUAL time: 20h 55min 7s 21ms
   REAL time: 2h 18min 24s 878ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 8cde53693255d7d0fa117f8c1bae90bfd35974f2
CI: bf2351b4b801509d156b00d7a6d660aed10fd64f
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release

tannergooding · 2025-06-30T18:37:02Z

@jakobbotsch I requested your review since you touch CSE most frequently.

src/coreclr/jit/optcse.cpp

jakobbotsch

Looks reasonable to me. Pinged @AndyAyersMS on a question related to the CSE ML experiment.

risc-vv · 2025-07-01T17:10:57Z

bed117e is being scheduled for building and testing

GIT: bed117e4ea9353fa48b66c9ccb88415dfd91583d
REPO: dotnet/runtime
BRANCH: main

tannergooding added the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 28, 2025

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 28, 2025

dotnet-policy-service bot assigned tannergooding Jun 28, 2025

Don't have CSE track simd or mask registers as part of the integer en…

6707408

…regCount

tannergooding force-pushed the optcse-int branch from 3352e17 to 6707408 Compare June 28, 2025 17:12

Have optcse track non-integer enregistration counts

8cde536

tannergooding force-pushed the optcse-int branch from ee5c6bb to 8cde536 Compare June 28, 2025 19:22

tannergooding changed the title ~~Don't have CSE track simd or mask registers as part of the integer enregCount~~ Have CSE correctly track simd and mask registers for their own enregistration Jun 28, 2025

tannergooding changed the title ~~Have CSE correctly track simd and mask registers for their own enregistration~~ Have CSE correctly track float/simd and mask registers for their own enregistration Jun 28, 2025