@EgorBo
Member

@EgorBo EgorBo commented Dec 22, 2024

Currently, the JIT always gives up on unrolling Memset (Fill) for anything other than byte elements or a zero value. This PR enables it for non-zero values of all primitive types.

void Test(Span<char> span) => span.Slice(0, 10).Fill('x');

Current codegen:

; Method Program:Test(System.Span`1[ushort]):this (FullOpts)
G_M11560_IG01:
       sub      rsp, 40
G_M11560_IG02:
       cmp      dword ptr [rdx+0x08], 10
       jl       SHORT G_M11560_IG04
       mov      rcx, bword ptr [rdx]
       mov      edx, 10
       mov      r8d, 120
       call     [System.SpanHelpers:Fill[ushort](byref,ulong,ushort)]
       nop      
G_M11560_IG03:
       add      rsp, 40
       ret      
G_M11560_IG04:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
; Total bytes of code: 43

New codegen:

; Method Program:Test(System.Span`1[ushort]):this (FullOpts)
G_M11560_IG01:
       sub      rsp, 40
G_M11560_IG02:
       cmp      dword ptr [rdx+0x08], 10
       jl       SHORT G_M11560_IG04
       mov      rax, bword ptr [rdx]
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       vmovups  xmmword ptr [rax], xmm0
       mov      dword ptr [rax+0x10], 0x780078
G_M11560_IG03:
       add      rsp, 40
       ret      
G_M11560_IG04:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
RWD00  	dq	0078007800780078h, 0078007800780078h
; Total bytes of code: 44
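
To read the new codegen: ten chars occupy 20 bytes, which are covered by one 16-byte vector store (eight chars, loaded from RWD00) plus one 4-byte store at offset 0x10 (two chars, 0x780078, since 'x' is 0x0078). Below is a minimal hand-written C# sketch of the same store pattern, for illustration only. The names UnrolledFillSketch and FillTenUnrolled are hypothetical, and this is not how the JIT implements the expansion internally.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class UnrolledFillSketch
{
    // Hypothetical helper (not part of the PR): fills the first 10 chars of
    // 'span' with 'value' using the same two stores as the new codegen.
    public static void FillTenUnrolled(Span<char> span, char value)
    {
        // Same bounds check as span.Slice(0, 10) in the original Test method.
        Span<char> dst = span.Slice(0, 10);
        ref byte start = ref Unsafe.As<char, byte>(ref MemoryMarshal.GetReference(dst));

        // One 16-byte store covers chars 0..7.
        Vector128<ushort> v = Vector128.Create((ushort)value);
        Unsafe.WriteUnaligned(ref start, v);

        // One 4-byte store at byte offset 0x10 covers chars 8..9.
        // Both halves are identical, so endianness does not matter here.
        uint pair = ((uint)value << 16) | value;
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref start, 16), pair);
    }
}
```

Calling `UnrolledFillSketch.FillTenUnrolled(buffer, 'x')` on a 12-element char array should leave the first ten elements set to 'x' and the last two untouched.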

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 22, 2024
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo EgorBo marked this pull request as ready for review December 23, 2024 02:16
@EgorBo
Member Author

EgorBo commented Jan 15, 2025

PTAL @jakobbotsch @dotnet/jit-contrib

This PR unrolls Fill (memset idiom) for more than just 0 or byte. Not too many diffs in our BCL https://dev.azure.com/dnceng-public/public/_build/results?buildId=914818&view=ms.vss-build-web.run-extensions-tab (obviously, unrolling is mostly a size regression).

@EgorBo EgorBo requested a review from jakobbotsch January 15, 2025 11:37
@jkotas
Member

jkotas commented Jan 15, 2025

Some of the diffs do not look good.

For example, windows x64 - System.IO.Compression.HuffmanTree:GetStaticLiteralTreeLength:

-       vmovups  ymm0, ymmword ptr [reloc @RWD00]
-       vmovdqu  ymmword ptr [rcx], ymm0
-       vmovdqu  ymmword ptr [rcx+0x20], ymm0
-       vmovdqu  ymmword ptr [rcx+0x40], ymm0
-       vmovdqu  ymmword ptr [rcx+0x60], ymm0
-       vmovdqu  xmmword ptr [rcx+0x80], xmm0
+       vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+       vmovups  ymmword ptr [rcx], ymm0
+       vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+       vmovups  ymmword ptr [rcx+0x20], ymm0
+       vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+       vmovups  ymmword ptr [rcx+0x40], ymm0
+       vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+       vmovups  ymmword ptr [rcx+0x60], ymm0
+       vmovups  xmm0, xmmword ptr [reloc @RWD00]
+       vmovups  xmmword ptr [rcx+0x80], xmm0

We are emitting redundant vbroadcasti128 instructions now.

@EgorBo
Member Author

EgorBo commented Jan 15, 2025

> Some of the diffs do not look good.

Yeah, it's an unfortunate downstream issue. I've pushed a workaround for it.

The issue is that my phase takes Fill<T> and expands it inline into a sequence of stores (of type T). It then relies on another, unrelated phase in Lower to coalesce all these stores efficiently. It seems that for byte[] we are left with

*(a + 0) = <const v256>
*(a + 32) = <const v256>
...
*(a + n) = <const v256>

and by then it's too late to do CSE for those constant vectors. The broadcast instruction appears as an optimization for loading a constant v256: if its lower and upper halves are the same, only one 128-bit half has to be stored, which saves some space in the data section.
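
The condition described above (lower and upper halves of a 256-bit constant being identical) can be sketched in C# as follows. This only illustrates the check. BroadcastSketch and HalvesMatch are hypothetical names, and the JIT performs its own version of this test on constant nodes, not through these APIs.

```csharp
using System;
using System.Runtime.Intrinsics;

static class BroadcastSketch
{
    // Hypothetical helper: true when a 256-bit constant could be rebuilt by
    // broadcasting a single 128-bit half, so only 16 bytes need to be stored.
    public static bool HalvesMatch(Vector256<ushort> constant) =>
        constant.GetLower() == constant.GetUpper();
}
```

A Fill pattern such as `Vector256.Create((ushort)0x78)` always satisfies this check, which is why each coalesced 256-bit store in the diff above ends up with its own vbroadcasti128.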

As a workaround, I've disabled the expansion for bytes, since our Lower already knows how to do it for bytes (byte memset expansion). I'll re-enable it once I figure out how to perform CSE in LowerStoreIndirCoalescing (in a separate PR).
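
For readers following the store sequences in this thread, here is a simplified model of how a fixed-size fill can be covered by progressively narrower stores once the per-element stores have been coalesced. It is an assumption-laden sketch: StorePlanSketch and Plan are hypothetical names, the greedy strategy is chosen only because it reproduces the two listings in this conversation, and the real logic in Lower may differ (for example, by using overlapping stores).

```csharp
using System;
using System.Collections.Generic;

static class StorePlanSketch
{
    // Hypothetical model, not JIT code: covers 'totalBytes' with the widest
    // store that still fits, halving the width when it no longer does.
    // Assumes totalBytes > 0 and maxWidth is a power of two.
    public static List<(int Offset, int Width)> Plan(int totalBytes, int maxWidth)
    {
        var stores = new List<(int Offset, int Width)>();
        int offset = 0;
        while (offset < totalBytes)
        {
            int width = maxWidth;
            while (width > totalBytes - offset)
            {
                width /= 2; // fall back to a narrower store for the tail
            }
            stores.Add((offset, width));
            offset += width;
        }
        return stores;
    }
}
```

With these assumptions, `Plan(20, 16)` yields a 16-byte store at offset 0 and a 4-byte store at offset 16, matching the Span<char> example, and `Plan(144, 32)` yields four 32-byte stores followed by a 16-byte store at offset 0x80, matching the HuffmanTree diff.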

@EgorBo EgorBo merged commit 5048656 into dotnet:main Jan 16, 2025
109 checks passed
@EgorBo EgorBo deleted the unroll-more-memset-fill branch January 16, 2025 16:00
@github-actions github-actions bot locked and limited conversation to collaborators Feb 16, 2025