JIT: Shrink data section for const vector loads #114040
BruceForstall merged 14 commits into dotnet:main from
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@tannergooding I've updated this based on the approach we discussed on Discord, including using broadcast loads where possible.
Latest diffs look good. It's overall a small code size regression, but generally we're trading +1 byte of code for -8 or more bytes of data.
```cpp
case NI_AVX2_BroadcastVector128ToVector256:
case NI_AVX512F_BroadcastVector128ToVector512:
case NI_AVX512F_BroadcastVector256ToVector512:
    if (GetAuxiliaryJitType() == CORINFO_TYPE_PTR)
```
Most of these deletions revert questionable changes from #92017, which attempted to solve the same problem, but only for uncontained CNS_VEC directly consumed by STOREIND. Those cases are handled better by this PR.
```diff
 case TYP_DOUBLE:
 {
-    result->f64[arg1] = static_cast<float>(arg2);
+    result->f64[arg1] = arg2;
```
This, and the change in valuenum.cpp, are unrelated fixes I noticed while reading how the SIMD constant folding works.
Yep, that's pretty typical. I'd actually like it if we had a way for SPMI to track the diff of the method local data section as well, to help account for such changes. Possibly @EgorBo or @jakobbotsch have ideas on whether that's feasible.
tannergooding left a comment
CC @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review
Ping @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review
/azp run runtime-coreclr outerloop
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
Noticed this while looking at the codegen for floating->integral casts, where we use `vfixupimms[sd]`, a SIMD scalar instruction that takes a control table as a vector. The table value for that instruction is almost always constant, and it's the operand that supports a memory load, so we end up emitting a full SIMD16 vector to the data section for what ends up being a DWORD or QWORD reloc. This change identifies vector constants contained by instructions that read only a scalar and shrinks the emitted data section value down to the scalar size.

We were also emitting full vectors to the data section for uncontained CNS_VEC to be loaded into a register for use. #92017 had previously addressed compressing large vectors down to a SIMD16 broadcast when they were immediately consumed by STOREIND. This PR compresses down as far as a 4-byte scalar, and does so for all uncontained CNS_VEC loads.
Diffs