JIT: unroll more memset patterns #110893
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
PTAL @jakobbotsch @dotnet/jit-contrib. This PR unrolls Fill (the memset idiom) for more than just 0 or byte. There are not too many diffs in our BCL: https://dev.azure.com/dnceng-public/public/_build/results?buildId=914818&view=ms.vss-build-web.run-extensions-tab (obviously, unrolling is mostly a size regression).
Some of the diffs do not look good. For example, windows x64 - System.IO.Compression.HuffmanTree:GetStaticLiteralTreeLength:
- vmovups ymm0, ymmword ptr [reloc @RWD00]
- vmovdqu ymmword ptr [rcx], ymm0
- vmovdqu ymmword ptr [rcx+0x20], ymm0
- vmovdqu ymmword ptr [rcx+0x40], ymm0
- vmovdqu ymmword ptr [rcx+0x60], ymm0
- vmovdqu xmmword ptr [rcx+0x80], xmm0
+ vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+ vmovups ymmword ptr [rcx], ymm0
+ vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+ vmovups ymmword ptr [rcx+0x20], ymm0
+ vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+ vmovups ymmword ptr [rcx+0x40], ymm0
+ vbroadcasti128 ymm0, xmmword ptr [reloc @RWD00]
+ vmovups ymmword ptr [rcx+0x60], ymm0
+ vmovups xmm0, xmmword ptr [reloc @RWD00]
+ vmovups xmmword ptr [rcx+0x80], xmm0
We are emitting redundant
Yeah, it's a bit of an unfortunate downstream issue. I've pushed a workaround to fix it. The issue is that my phase runs too late to do CSE for those const vectors. The broadcast instruction appears as an optimization for [...]. As a workaround, I've disabled the expansion for bytes, since our Lower already knows how to do it for bytes (byte memset expansion), but I'll enable it again once I figure out how to perform CSE in LowerStoreIndirCoalescing (in a separate PR).
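To make the redundancy concrete, here is a toy sketch of what a CSE-like cleanup would do to a sequence shaped like the one in the diff: a repeated load of the same constant into the same register is dropped when nothing has overwritten that register in between. This is not the JIT's CSE and not how LowerStoreIndirCoalescing works; `Instr` and `remove_redundant_loads` are hypothetical names invented for illustration, and the model only covers a single straight-line block.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy instruction (hypothetical, not a JIT data structure): either loads a
// named constant into a register or stores a register to a memory offset.
struct Instr {
    enum Kind { LoadConst, Store } kind;
    std::string reg;      // register written (LoadConst) or read (Store)
    std::string constant; // constant label for LoadConst, unused for Store
    int offset;           // destination offset for Store, unused for LoadConst
};

// Drops a LoadConst when the target register already holds the same constant.
// Stores do not modify registers, so they never invalidate the tracked state.
std::vector<Instr> remove_redundant_loads(const std::vector<Instr>& code)
{
    std::unordered_map<std::string, std::string> held; // register -> constant
    std::vector<Instr> out;
    for (const Instr& ins : code) {
        if (ins.kind == Instr::LoadConst) {
            auto it = held.find(ins.reg);
            if (it != held.end() && it->second == ins.constant) {
                continue; // same value is already in the register
            }
            held[ins.reg] = ins.constant; // register now holds this constant
        }
        out.push_back(ins);
    }
    return out;
}
```

Fed four load/store pairs that all load the same constant into the same register, the pass keeps the first load and the four stores, which is the shape of the original (pre-PR) codegen in the diff.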
Currently, the JIT always gives up on unrolling Memset (Fill) for anything other than byte-sized elements or a zero value. This PR enables it for all primitive types with non-zero fill values.
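As a conceptual illustration of what unrolling a non-byte fill means, the sketch below builds the repeated element pattern once and then writes it with a fixed number of wide copies when the length is known at compile time. It is written in C++ for illustration only and is not the JIT's implementation; `fill_unrolled`, the 32-byte chunk size (a stand-in for one `ymm` store), and the choice of a narrower copy for the tail are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not JIT code: fills Count 32-bit elements with `value`.
// Because Count is a compile-time constant, a compiler can turn the loop below
// into a fixed sequence of wide stores, which is the effect of unrolling.
template <std::size_t Count>
void fill_unrolled(std::uint32_t* dst, std::uint32_t value)
{
    constexpr std::size_t kChunkBytes = 32; // assumed width of one wide store
    constexpr std::size_t kTotalBytes = Count * sizeof(std::uint32_t);

    // Build the repeated pattern once: `value` copied across the whole chunk.
    unsigned char pattern[kChunkBytes];
    for (std::size_t i = 0; i < kChunkBytes; i += sizeof(value)) {
        std::memcpy(pattern + i, &value, sizeof(value));
    }

    // Emit one full-width copy per complete chunk.
    unsigned char* out = reinterpret_cast<unsigned char*>(dst);
    std::size_t offset = 0;
    for (; offset + kChunkBytes <= kTotalBytes; offset += kChunkBytes) {
        std::memcpy(out + offset, pattern, kChunkBytes);
    }

    // Tail: copy only the remaining bytes. The offset is a multiple of the
    // chunk size, so the pattern prefix lines up with the element boundary.
    if (offset < kTotalBytes) {
        std::memcpy(out + offset, pattern, kTotalBytes - offset);
    }
}
```

For example, `fill_unrolled<36>(buf, 0xDEADBEEFu)` writes 144 bytes as four full chunks followed by a 16-byte tail. The pattern is materialized once and reused for every copy, which is the property the review comments above are concerned with.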
Current codegen:
New codegen: