
Sve2 Scatters need a temp register for indices #124865

Open
a74nh wants to merge 1 commit into dotnet:main from a74nh:ntscatter2_github

@a74nh a74nh commented Feb 25, 2026

Fixes #124750

There are three forms of scatter instruction supported by CoreCLR:

  • Vector of addresses
  • A single address plus a vector of indices (vector length offsets)
  • A single address plus a vector of byte offsets

There are encodings for all of these in SVE1.

SVE2 duplicates all the scatter instructions, providing non-temporal versions of them. The encodings all match SVE1, except for the indices version, which is missing. The missing form can be emulated by shifting the indices left to turn them into byte offsets before issuing the instruction (which is exactly what happens in the equivalent C++ intrinsics).

Therefore, ensure there is a temp register to hold the shifted value.
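
The index-to-byte-offset conversion described above can be illustrated with a small scalar model in C++. This is only a sketch, not the JIT's code: `scatterByByteOffsets`, `scatterByIndices`, and the `log2ElemSize` parameter are hypothetical names made up for the example, and the `std::vector` temporary merely stands in for the temp register.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar model of a scatter that only understands byte offsets
// (standing in for the SVE2 non-temporal form).
template <typename T>
void scatterByByteOffsets(unsigned char* base,
                          const std::vector<uint64_t>& byteOffsets,
                          const std::vector<T>& values)
{
    for (size_t i = 0; i < values.size(); i++)
    {
        std::memcpy(base + byteOffsets[i], &values[i], sizeof(T));
    }
}

// Index form emulated on top of the byte-offset form: each index is shifted
// left by log2(sizeof(T)) into a temporary, so the caller's indices are
// left untouched.
template <typename T>
void scatterByIndices(unsigned char* base,
                      const std::vector<uint64_t>& indices,
                      const std::vector<T>& values,
                      unsigned log2ElemSize)
{
    std::vector<uint64_t> temp(indices.size()); // plays the role of the temp register
    for (size_t i = 0; i < indices.size(); i++)
    {
        temp[i] = indices[i] << log2ElemSize;
    }
    scatterByByteOffsets(base, temp, values);
}
```

Writing the shifted values into a separate temporary, rather than back into the index vector, mirrors the reason a temp register is wanted: the original indices stay intact for any later use.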

Copilot AI review requested due to automatic review settings February 25, 2026 17:45
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 25, 2026
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 25, 2026
// Build any immediates
BuildHWIntrinsicImmediate(intrinsicTree, intrin);

// Build any additional special cases
Contributor Author

I really don't like special casing here (there are no other special cases in the function). Ideally I'd add a hwintrinsic flag, but we're running out of space for them.

@dotnet-policy-service

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

break;
}

case NI_Sve2_Scatter16BitWithByteOffsetsNarrowingNonTemporal:
Contributor Author

Split these out to make the code easier to read

a74nh commented Feb 25, 2026

@dotnet/arm64-contrib @jakobbotsch

Copilot AI left a comment

Pull request overview

This PR updates the ARM64 JIT’s SVE2 non-temporal scatter codegen/LSRA to account for the missing “base + indices” encoding in SVE2 by materializing byte offsets via a shifted temporary register.

Changes:

  • Adds LSRA handling to reserve an internal FP/SIMD temp for certain SVE2 non-temporal scatters.
  • Updates SVE2 scatter non-temporal codegen to shift indices into a temp register before emitting the store.
  • Splits SVE2 “with byte offsets” scatter intrinsics into a separate codegen case that does not require index conversion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/coreclr/jit/lsraarm64.cpp | Reserves an internal float/SIMD temp register for selected SVE2 non-temporal scatter intrinsics. |
| src/coreclr/jit/hwintrinsiccodegenarm64.cpp | Converts indices to byte offsets via lsl into a temp register for SVE2 non-temporal scatters; separates byte-offset variants. |

Comment on lines +2442 to +2446
// SVE2 instruction only directly support byte offsets. Convert indices to bytes.
regNumber tempReg = internalRegisters.GetSingle(node, RBM_ALLFLOAT);
if (intrin.id == NI_Sve2_Scatter16BitNarrowingNonTemporal)
{
-GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 1, opt);
+GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, tempReg, op3Reg, 1, opt);
Copilot AI Feb 25, 2026

tempReg is always used as the offsets register for the indices-form SVE2 scatters, but it is only initialized for the 16-bit, 32-bit, and 64-bit (ScatterNonTemporal) cases. For NI_Sve2_Scatter8BitNarrowingNonTemporal, no shift/move is emitted, so tempReg contains an unrelated value when passed to the scatter instruction (and LSRA also doesn’t reserve an internal temp for this intrinsic). Handle the 8-bit case by using op3Reg directly (no temp needed) or by explicitly copying/initializing tempReg before use and ensuring LSRA reserves it.
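
The first option in the comment above (use the index register directly when no shift is needed) might look roughly like the following model. It is a sketch under assumptions, not the PR's code: `ScatterKind`, `Reg`, `indexShift`, and `selectOffsetsReg` are hypothetical names, and only the shift of 1 for the 16-bit case appears in the diff; the values 0, 2, and 3 are assumed from the element sizes.

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's intrinsic ids and register numbers.
enum class ScatterKind { Narrow8, Narrow16, Narrow32, Full64 };
using Reg = int;

// Shift needed to turn an element index into a byte offset.
// Only the 16-bit value (1) is taken from the diff; the others are assumed.
unsigned indexShift(ScatterKind kind)
{
    switch (kind)
    {
        case ScatterKind::Narrow8:  return 0;
        case ScatterKind::Narrow16: return 1;
        case ScatterKind::Narrow32: return 2;
        case ScatterKind::Full64:   return 3;
    }
    return 0;
}

struct Emitted
{
    bool emitsShift; // whether a shift into the temp would be emitted
    Reg  offsetsReg; // register handed to the scatter instruction
};

// Hand the temp to the scatter only when it has actually been written.
Emitted selectOffsetsReg(ScatterKind kind, Reg indicesReg, Reg tempReg)
{
    if (indexShift(kind) == 0)
    {
        // Byte-sized elements: the indices already are byte offsets.
        return {false, indicesReg};
    }
    return {true, tempReg};
}
```

With this shape the 8-bit case never reads the temp, so it does not matter that no temp is reserved for it.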

Copilot uses AI. Check for mistakes.
case NI_Sve2_Scatter16BitNarrowingNonTemporal:
case NI_Sve2_Scatter32BitNarrowingNonTemporal:
case NI_Sve2_ScatterNonTemporal:
buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates());
Copilot AI Feb 25, 2026

LSRA reserves an internal float register for these SVE2 scatter intrinsics unconditionally, but codegen only needs the temp for the indices-form overload (!varTypeIsSIMD(intrin.op2->gtType)). Reserving an unused internal register for the vector-of-addresses form increases register pressure and can cause unnecessary spills; consider gating buildInternalFloatRegisterDefForNode on the operand type (and only for the IDs that actually need shifting).

Suggested change
-buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates());
+// For these SVE2 scatter intrinsics, an internal float register is only
+// required for the indices-form overload (where the second operand is
+// not a SIMD type). Avoid reserving it for the vector-of-addresses
+// form to reduce register pressure.
+if (!varTypeIsSIMD(intrin.op2->gtType))
+{
+    buildInternalFloatRegisterDefForNode(intrinsicTree, internalFloatRegCandidates());
+}
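
One way to keep the reservation and its use from drifting apart is to have both phases consult the same predicate. The model below is a sketch only: `ScatterKind`, `needsIndexTemp`, `internalRegsReserved`, and `internalRegsConsumed` are hypothetical names and do not exist in the JIT, and the `op2IsVector` flag stands in for the `varTypeIsSIMD(intrin.op2->gtType)` check in the suggestion.

```cpp
#include <cassert>

// Hypothetical stand-in for the three intrinsic ids listed in the LSRA case.
enum class ScatterKind { Narrow16, Narrow32, Full };

// Single source of truth: a temp is needed only for the "base + indices"
// form, i.e. when the second operand is a scalar base address rather than
// a vector of addresses.
bool needsIndexTemp(ScatterKind /*kind*/, bool op2IsVector)
{
    return !op2IsVector;
}

// Model of register allocation: how many internal registers get reserved.
int internalRegsReserved(ScatterKind kind, bool op2IsVector)
{
    return needsIndexTemp(kind, op2IsVector) ? 1 : 0;
}

// Model of codegen: how many internal registers it will ask for.
int internalRegsConsumed(ScatterKind kind, bool op2IsVector)
{
    return needsIndexTemp(kind, op2IsVector) ? 1 : 0;
}
```

Because both models call `needsIndexTemp`, the number of registers reserved always equals the number consumed, and the vector-of-addresses form reserves none.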

assert(intrin.numOperands == 4);

-// Calculate the byte offsets if using indices.
+// SVE2 instruction only directly support byte offsets. Convert indices to bytes.
Copilot AI Feb 25, 2026

Grammar: "instruction only directly support" should be "instruction only directly supports".

Suggested change
-// SVE2 instruction only directly support byte offsets. Convert indices to bytes.
+// SVE2 instruction only directly supports byte offsets. Convert indices to bytes.

Development

Successfully merging this pull request may close these issues.

JIT: SVE2 non-temporal scatter store tests are failing
