Skip to content

Conversation

@am11
Copy link
Member

@am11 am11 commented Dec 6, 2024

Right now, clang does not see any difference between JIT_LMul and regular multiplication on arm32 or x86: https://godbolt.org/z/vj96KfnY4. This patch fixes the intended optimization when high 32 bits are zero.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Dec 6, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


if ((val1High == 0) && (val2High == 0))
return Mul32x32To64(val1, val2);
return (UINT64)((UINT32)val1 * (UINT32)val2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have the same semantics? 32 bit multiplication will presumably truncate the upper bits of the result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's addressing the issue where Clang doesn't distinguish between the original and regular multiplication on 32-bit platforms due to a redundant cast (uint64->uint32->uint64 on each op separately). Now it's uint32 on each op, and uint64 cast (which is not strictly needed, just being explicit) on the overall result.

Copy link
Member

@jakobbotsch jakobbotsch Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code does not compute the same value as it did before. Casting a 32-bit result to 64-bits will leave the upper 32 bits zero always, but the result of 32x32 bit multiplication can be >= 2^32.

From Godbolt I do not see why the new version would be more efficient than the old version. They both do umull + two "multiply-add" operations. If clang did the branch it would be able to do just a single umull in the "upper bits zero" case, but seems like Clang has determined that the branchy version is not better than the version that does the superfluous multiply-add operations.

On x86 it's more important as "32x32 -> 64" multiplication can be done with a single instruction, while the "64x64 -> 64" version falls back to a helper call, at least for MSVC.

Copy link
Member Author

@am11 am11 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code only runs on linux for arm and x86. Do you mean we should update this function to Regular_LMul from godbolt, since there is no difference? This optimization is same as done in handwritten assembly for win-x86

PUBLIC JIT_LMul

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing the special case is ok given that it has no impact on Clang's codegen. Alternatively, we would want to figure out how to get Clang to emit the branchy version with only a single umull. I am fine with either.

@jakobbotsch
Copy link
Member

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch jakobbotsch merged commit 64874e4 into dotnet:main Jan 8, 2025
90 checks passed
@am11 am11 deleted the patch-25 branch January 9, 2025 00:34
@github-actions github-actions bot locked and limited conversation to collaborators Feb 8, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants