
Conversation


@a74nh a74nh commented Sep 13, 2024

Spotted during implementation of #107459

The single-register version of AdvSimd `LoadAndInsertScalar` is:

```csharp
Vector128<X> LoadAndInsertScalar(Vector128<X> value, byte index, X* address);
```

Currently, the code builds op1 as an address and op3 as an RMW node:

```cpp
BuildAddrUses(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildDelayFreeUses(intrin.op3, (tgtPrefOp2 ? intrin.op2 : intrin.op1), candidates);
```

This instead should be:

```cpp
tgtPrefUse = BuildUse(intrin.op1);
BuildDelayFreeUses(intrin.op2, intrin.op1, candidates);
BuildAddrUses(intrin.op3);
```

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 13, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 13, 2024
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@a74nh a74nh marked this pull request as ready for review September 13, 2024 11:09

a74nh commented Sep 13, 2024

Checked in a debugger that the right functions are now called.

@dotnet/arm64-contrib @kunalspathak


Contributor Author:

@kunalspathak: As suggested offline, I added some additional checks in the delay-slot logic. If the register type of the node does not match that of the delay-slot register, then do not add the delay-free use. This fixes the case where we call `BuildDelayFreeUses()` for op2, which is an integer value.

Instead, we could do these checks in `BuildHWIntrinsic()` so that it does not call `BuildDelayFreeUses()`, but then `BuildDelayFreeUses()` would still need the same checks turned into asserts. So it seemed simpler to do the checks inside `BuildDelayFreeUses()`.

This will work nicely with the `BuildHWIntrinsic()` rewrite PR too.

Contributor:

Ideally this should be an assert, and the callers should not be calling delay-free uses for a different register type. Did you turn it into an assert and see how many places are hit?

Contributor Author:

It's difficult to test with an assert because 1) all the hwintrinsic tests will fail at the first assert, and 2) many of these issues occur in ConstantExpected APIs, and for a lot of those we only test using hardcoded constants, meaning the assert will never be hit.

Scanning the SVE API, all the methods we have that are RMW and have a ConstantExpected are:

AddRotateComplex
DotProductBySelectedScalar
ExtractVector
FusedMultiplyAddBySelectedScalar
FusedMultiplySubtractBySelectedScalar
MultiplyAddRotateComplex
MultiplyAddRotateComplexBySelectedScalar
MultiplyBySelectedScalar
SaturatingDecrementBy16BitElementCount
SaturatingDecrementBy32BitElementCount
SaturatingDecrementBy64BitElementCount
SaturatingDecrementBy8BitElementCount
SaturatingDecrementByActiveElementCount
SaturatingIncrementBy16BitElementCount
SaturatingIncrementBy32BitElementCount
SaturatingIncrementBy64BitElementCount
SaturatingIncrementBy8BitElementCount
SaturatingIncrementByActiveElementCount
ShiftRightArithmeticForDivide
TrigonometricMultiplyAddCoefficient

Then there are the AdvSimd ones, and possibly others.

Contributor Author:

This is much easier to fix properly in the LSRA rewrite PR, as the check can just be added into the main `for` loop in `BuildHWIntrinsic()`:

```cpp
else if (delayFreeOp != nullptr && TheExtraChecksFromThisPR.....)
{
    srcCount += BuildDelayFreeUses(operand, delayFreeOp, candidates);
}
else
{
    srcCount += BuildOperandUses(operand, candidates);
}
```

@kunalspathak
Contributor

/azp run Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak kunalspathak left a comment

Can you double-check the complete diff for this or similar methods that have zero-size diffs? I'm wondering why previously we decided to have a copy of x0 in both v0 and v17, but now we don't need it?

[screenshot of the asm diff]

```csharp
for (var i = 0; i < Op1ElementCount; i++) { _data1[i] = {NextValueOp1}; }
Unsafe.CopyBlockUnaligned(ref Unsafe.As<{Op1VectorType}<{Op1BaseType}>, byte>(ref _fld1), ref Unsafe.As<{Op1BaseType}, byte>(ref _data1[0]), (uint)Unsafe.SizeOf<{Op1VectorType}<{Op1BaseType}>>());

_fld2 = {ElementIndex};
```
Contributor:

A lot of the tests that take an immediate value are missing this coverage. We should fix it some day.

Contributor Author:

Added an issue: #108060


Contributor:

Ideally this should be an assert, and the callers should not be calling delay-free uses for a different register type. Did you turn it into an assert and see how many places are hit?

```cpp
// Multi-register nodes should not go via this route.
assert(!node->IsMultiRegNode());
if (rmwNode == nullptr || varTypeUsesSameRegType(rmwNode->TypeGet(), node->TypeGet()) ||
    (rmwNode->IsMultiRegNode() && varTypeUsesFloatReg(node->TypeGet())))
```
Contributor:

I did not understand the `&& varTypeUsesFloatReg(node->TypeGet())` here. Did you see code where having a delay-free use is OK for multi-reg nodes that are GPR (if there is such a thing)?

Contributor Author:

If the RMW node is a multi-register node, then it will always use vector registers.

The normal node could be a general register or a vector register. We want to discount the case where the node is a general register, so `varTypeUsesFloatReg()` confirms the node uses an FP, vector or SVE vector register.

E.g.:

```csharp
(Vector128<sbyte> Value1, Vector128<sbyte> Value2) LoadAndInsertScalar((Vector128<sbyte>, Vector128<sbyte>) values, [ConstantExpected(Max = (byte)(15))] byte index, sbyte* address)
```

The RMW node is the multi-register `values`.
This code makes sure `index` is not delay-slotted.

Contributor:

I am wondering if we should have an assert that if it is a multi-reg node, then the node should be a float register?

Contributor Author:

Added some asserts.

That then broke things, because for delayFreeMultiple intrinsics we do:

```cpp
BuildDelayFreeUses(use.GetNode(), intrinsicTree);
```

which passes the entire intrinsicTree down as the RMW node, which feels wrong. Fixed that to pass op1 instead.

@kunalspathak
Contributor

> Wondering why previously we decided to have a copy of x0 in both v0 and v17, but now, we don't need it?

Other than this and the comment in #107786 (comment), we should be good.

@a74nh
Contributor Author

a74nh commented Sep 23, 2024

> Can you double check the complete diff for this or similar methods that has zero-size diffs? Wondering why previously we decided to have copy of x0 in both v0 and v17, but now, we don't need it?

[screenshot of the asm diff]

#37@69 is the function return value; #35@68 is the def of the INS.

In base:

```
Interval 17: simd16 (interfering uses) RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[]

<RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
<RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last delay wt=100.00>
<RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[allFloat] minReg=1 wt=400.00>
<RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
<RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>
```

Because of this conflict, it allocates the def of INS to d17 and then plants an extra MOV to move from the def to the function return.

In diff, the conflict no longer exists:

```
Interval 17: simd16 RefPositions {#35@68 #37@69} physReg:NA Preferences=[d0] Aversions=[]

<RefPosition #33  @67  RefTypeUse <Ivl:13> BB02 regmask=[allFloat] minReg=1 last wt=100.00>
<RefPosition #34  @67  RefTypeUse <Ivl:16> BB02 regmask=[x0-xip0 x19-x28] minReg=1 last wt=100.00>
<RefPosition #35  @68  RefTypeDef <Ivl:17> HWINTRINSIC BB02 regmask=[d0] minReg=1 wt=400.00>
<RefPosition #36  @69  RefTypeFixedReg <Reg:d0 > BB02 regmask=[d0] minReg=1 wt=100.00>
<RefPosition #37  @69  RefTypeUse <Ivl:17> BB02 regmask=[d0] minReg=1 last fixed wt=100.00>
```

This means it is able to allocate d0 for the def of the INS. At codegen, it has to plant a MOV from d17 to d0 before generating the INS.

full log for base

full log for diff

@kunalspathak
Contributor

@a74nh - seems the unit tests are failing. Can you take a look?

@a74nh
Contributor Author

a74nh commented Sep 24, 2024

> @a74nh - seems the unit tests are failing. can you take a look?

I've restored the change to the call to BuildDelayFreeUses() and removed an assert. That seems the safest option for now; the rewrite PR will fix it better anyway.

@kunalspathak kunalspathak left a comment

LGTM

@kunalspathak kunalspathak merged commit d22c74d into dotnet:main Sep 26, 2024
@a74nh a74nh deleted the lsra_insertscalar_github branch September 26, 2024 13:50
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar

* Fix comment

* Expand LoadAndInsertScalar testing

* Only delayfree when register types match

* fix formatting

* Fix comment

* Check that multiregister nodes use fp registers

* Revert rmw assert
@github-actions github-actions bot locked and limited conversation to collaborators Oct 27, 2024

Labels

arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
