
Conversation


@a74nh a74nh commented Sep 13, 2024

Spotted during implementation of #107459

The single-register version of AdvSimd `LoadAndInsertScalar` is:

```csharp
Vector128<X> LoadAndInsertScalar(Vector128<X> value, byte index, X* address);
```

Currently, the code builds op1 as an address and op3 as an RMW node:

```cpp
BuildAddrUses(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildDelayFreeUses(intrin.op3, (tgtPrefOp2 ? intrin.op2 : intrin.op1), candidates);
```

This instead should be:

```cpp
tgtPrefUse = BuildUse(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildAddrUses(intrin.op3);
```

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 13, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 13, 2024
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@a74nh a74nh marked this pull request as ready for review September 13, 2024 11:09

a74nh commented Sep 13, 2024

Checked in a debugger that the right functions are now called.

@dotnet/arm64-contrib @kunalspathak


Contributor Author:

@kunalspathak: As suggested offline, I added some additional checks in the delay-slot logic. If the register type of the node does not match that of the delay-slot register, then do not add the delay-free use. This fixes the case where we call `BuildDelayFreeUses()` for op2, which is an integer value.

Instead, we could do these checks in `BuildHWIntrinsic()` so that it does not call `BuildDelayFreeUses()`, but then `BuildDelayFreeUses()` would still need the same checks turned into asserts. So it seemed simpler to do the checks inside `BuildDelayFreeUses()`.

This will work nicely with the `BuildHWIntrinsic()` rewrite PR too.

Contributor:

Ideally this should be an assert, and the callers should not be calling delay-free uses for a different register type. Did you turn it into an assert and see how many places are hit?

Contributor Author:

It's difficult to test with an assert because 1) all the hwintrinsic tests will fail at the first assert, and 2) many of these issues occur in ConstantExpected APIs, and for a lot of those we only test using hardcoded constants, meaning the assert will never be hit.

Scanning the SVE API, all the methods we have that are RMW and have a ConstantExpected are:

AddRotateComplex
DotProductBySelectedScalar
ExtractVector
FusedMultiplyAddBySelectedScalar
FusedMultiplySubtractBySelectedScalar
MultiplyAddRotateComplex
MultiplyAddRotateComplexBySelectedScalar
MultiplyBySelectedScalar
SaturatingDecrementBy16BitElementCount
SaturatingDecrementBy32BitElementCount
SaturatingDecrementBy64BitElementCount
SaturatingDecrementBy8BitElementCount
SaturatingDecrementByActiveElementCount
SaturatingIncrementBy16BitElementCount
SaturatingIncrementBy32BitElementCount
SaturatingIncrementBy64BitElementCount
SaturatingIncrementBy8BitElementCount
SaturatingIncrementByActiveElementCount
ShiftRightArithmeticForDivide
TrigonometricMultiplyAddCoefficient

Then there are the AdvSimd ones, and possibly others.

Contributor Author:

This is much easier to fix properly in the LSRA rewrite PR, as the check can just be added into the main `for` loop in `BuildHWIntrinsic()`:

```cpp
else if (delayFreeOp != nullptr && TheExtraChecksFromThisPR.....)
{
    srcCount += BuildDelayFreeUses(operand, delayFreeOp, candidates);
}
else
{
    srcCount += BuildOperandUses(operand, candidates);
}
```

@kunalspathak
Contributor

/azp run Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak kunalspathak left a comment

Can you double-check the complete diff for this or similar methods that have zero-size diffs? I'm wondering why previously we decided to have a copy of x0 in both v0 and v17, but now we don't need it?

[screenshot of the asm diff]

```csharp
for (var i = 0; i < Op1ElementCount; i++) { _data1[i] = {NextValueOp1}; }
Unsafe.CopyBlockUnaligned(ref Unsafe.As<{Op1VectorType}<{Op1BaseType}>, byte>(ref _fld1), ref Unsafe.As<{Op1BaseType}, byte>(ref _data1[0]), (uint)Unsafe.SizeOf<{Op1VectorType}<{Op1BaseType}>>());

_fld2 = {ElementIndex};
```
Contributor:

A lot of the tests that take an immediate value are missing this coverage. We should fix it some day.

Contributor Author:

Added an issue: #108060


Contributor:

Ideally this should be an assert, and the callers should not be calling delay-free uses for a different register type. Did you turn it into an assert and see how many places are hit?

```cpp
// Multi-register nodes should not go via this route.
assert(!node->IsMultiRegNode());
if (rmwNode == nullptr || varTypeUsesSameRegType(rmwNode->TypeGet(), node->TypeGet()) ||
    (rmwNode->IsMultiRegNode() && varTypeUsesFloatReg(node->TypeGet())))
```
Contributor:

I did not understand the `&& varTypeUsesFloatReg(node->TypeGet())` here. Did you see code where having a delay-free use is OK for multi-reg nodes that are GPR (if there is such a thing)?

Contributor Author:

If the RMW node is a multi-register node, then it will always use vector registers.

The normal node could be a general register or a vector register. We want to discount the case where the node is a general register, so `varTypeUsesFloatReg()` confirms the node uses an FP, vector or SVE vector register.

E.g.:

```csharp
(Vector128<sbyte> Value1, Vector128<sbyte> Value2) LoadAndInsertScalar((Vector128<sbyte>, Vector128<sbyte>) values, [ConstantExpected(Max = (byte)(15))] byte index, sbyte* address)
```

The RMW node is the multi-register `values`.
This code makes sure `index` is not delay-slotted.

Contributor:

I am wondering if we should have an assert that if it is a multi-reg node, then the node should be a float register?

Contributor Author:

Added some asserts.

That then broke things, because for delayFreeMultiple intrinsics we do:

```cpp
BuildDelayFreeUses(use.GetNode(), intrinsicTree);
```

which passes the entire intrinsicTree down as the RMW node, which feels wrong. Fixed that to pass op1 instead.

@kunalspathak
Contributor

> Wondering why previously we decided to have a copy of x0 in both v0 and v17, but now, we don't need it?

Other than this and the comment in #107786 (comment), we should be good.

@a74nh
Contributor Author

a74nh commented Sep 23, 2024

> Can you double check the complete diff for this or similar methods that has zero-size diffs? Wondering why previously we decided to have copy of x0 in both v0 and v17, but now, we don't need it?

[screenshot of the asm diff]

#37@69 is the function return value; #35@68 is the def of the INS.

In base:

```
Interval 17: simd16 (interfering uses) RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[]

<RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
<RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last delay wt=100.00>
<RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[allFloat] minReg=1 wt=400.00>
<RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
<RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>
```

Because of this conflict, it allocates the def of INS to d17 and then plants an extra MOV to move from the def to the function return.

In diff, the conflict no longer exists:

```
Interval 17: simd16 RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[]

<RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
<RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last wt=100.00>
<RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[d0] minReg=1 wt=400.00>
<RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
<RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>
```

This means it is able to allocate d0 for the def of the INS. At codegen, it has to plant a MOV from d17 to d0 before generating the INS.

full log for base

full log for diff

@kunalspathak
Contributor

@a74nh - seems the unit tests are failing. Can you take a look?

@a74nh
Contributor Author

a74nh commented Sep 24, 2024

> @a74nh - seems the unit tests are failing. can you take a look?

I've restored the change to the call to BuildDelayFreeUses() and removed an assert. That seems the safest option for now; the rewrite PR will fix it better anyway.

@kunalspathak kunalspathak left a comment

LGTM

@kunalspathak kunalspathak merged commit d22c74d into dotnet:main Sep 26, 2024
@a74nh a74nh deleted the lsra_insertscalar_github branch September 26, 2024 13:50
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar

* Fix comment

* Expand LoadAndInsertScalar testing

* Only delayfree when register types match

* fix formatting

* Fix comment

* Check that multiregister nodes use fp registers

* Revert rmw assert
@github-actions github-actions bot locked and limited conversation to collaborators Oct 27, 2024

Labels

arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
