Sve2 Scatters need a temp register for indices#124865
Sve2 Scatters need a temp register for indices#124865a74nh wants to merge 1 commit intodotnet:mainfrom
Conversation
Fixes dotnet#124750 There are three forms of scatter instructions supported by CoreCLR * Vector of addresses * A single address plus a vector of indices (vector length offsets) * A single address plus a vector of byte offsets There are encodings for all of these in SVE1. SVE2 duplicates all the scatter instructions, providing non temporal versions of them. The encodings all match SVE1, except for the indices version, which is missing. This can be replicated by simply shifting the offsets before calling the instruction (and is exactly what happens in the equivalent C++ instrinsics). Therefore, ensure there is a temp register to hold the shifted value.
| // Build any immediates | ||
| BuildHWIntrinsicImmediate(intrinsicTree, intrin); | ||
|
|
||
| // Build any additional special cases |
There was a problem hiding this comment.
I really don't like special casing here (there are no other special cases in the function). Ideally I'd add a hwintrinsic flag, but we're running out of space for them.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
| break; | ||
| } | ||
|
|
||
| case NI_Sve2_Scatter16BitWithByteOffsetsNarrowingNonTemporal: |
There was a problem hiding this comment.
Split these out to make the code easier to read
|
@dotnet/arm64-contrib @jakobbotsch |
There was a problem hiding this comment.
Pull request overview
This PR updates the ARM64 JIT’s SVE2 non-temporal scatter codegen/LSRA to account for the missing “base + indices” encoding in SVE2 by materializing byte offsets via a shifted temporary register.
Changes:
- Adds LSRA handling to reserve an internal FP/SIMD temp for certain SVE2 non-temporal scatters.
- Updates SVE2 scatter non-temporal codegen to shift indices into a temp register before emitting the store.
- Splits SVE2 “with byte offsets” scatter intrinsics into a separate codegen case that does not require index conversion.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lsraarm64.cpp | Reserves an internal float/SIMD temp register for selected SVE2 non-temporal scatter intrinsics. |
| src/coreclr/jit/hwintrinsiccodegenarm64.cpp | Converts indices to byte offsets via lsl into a temp register for SVE2 non-temporal scatters; separates byte-offset variants. |
| // SVE2 instruction only directly support byte offsets. Convert indices to bytes. | ||
| regNumber tempReg = internalRegisters.GetSingle(node, RBM_ALLFLOAT); | ||
| if (intrin.id == NI_Sve2_Scatter16BitNarrowingNonTemporal) | ||
| { | ||
| GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 1, opt); | ||
| GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, tempReg, op3Reg, 1, opt); |
There was a problem hiding this comment.
tempReg is always used as the offsets register for the indices-form SVE2 scatters, but it is only initialized for the 16-bit, 32-bit, and 64-bit (ScatterNonTemporal) cases. For NI_Sve2_Scatter8BitNarrowingNonTemporal, no shift/move is emitted, so tempReg contains an unrelated value when passed to the scatter instruction (and LSRA also doesn’t reserve an internal temp for this intrinsic). Handle the 8-bit case by using op3Reg directly (no temp needed) or by explicitly copying/initializing tempReg before use and ensuring LSRA reserves it.
| case NI_Sve2_Scatter16BitNarrowingNonTemporal: | ||
| case NI_Sve2_Scatter32BitNarrowingNonTemporal: | ||
| case NI_Sve2_ScatterNonTemporal: | ||
| buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates()); |
There was a problem hiding this comment.
LSRA reserves an internal float register for these SVE2 scatter intrinsics unconditionally, but codegen only needs the temp for the indices-form overload (!varTypeIsSIMD(intrin.op2->gtType)). Reserving an unused internal register for the vector-of-addresses form increases register pressure and can cause unnecessary spills; consider gating buildInternalFloatRegisterDefForNode on the operand type (and only for the IDs that actually need shifting).
| buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates()); | |
| // For these SVE2 scatter intrinsics, an internal float register is only | |
| // required for the indices-form overload (where the second operand is | |
| // not a SIMD type). Avoid reserving it for the vector-of-addresses | |
| // form to reduce register pressure. | |
| if (!varTypeIsSIMD(intrin.op2->gtType)) | |
| { | |
| buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates()); | |
| } |
| assert(intrin.numOperands == 4); | ||
|
|
||
| // Calculate the byte offsets if using indices. | ||
| // SVE2 instruction only directly support byte offsets. Convert indices to bytes. |
There was a problem hiding this comment.
Grammar: "instruction only directly support" should be "instruction only directly supports".
| // SVE2 instruction only directly support byte offsets. Convert indices to bytes. | |
| // SVE2 instruction only directly supports byte offsets. Convert indices to bytes. |
Fixes #124750
There are three forms of scatter instructions supported by CoreCLR
There are encodings for all of these in SVE1.
SVE2 duplicates all the scatter instructions, providing non temporal versions of them. The encodings all match SVE1, except for the indices version, which is missing. This can be replicated by simply shifting the offsets before calling the instruction (and is exactly what happens in the equivalent C++ instrinsics).
Therefore, ensure there is a temp register to hold the shifted value.