
Conversation

@kunalspathak (Contributor) commented Apr 15, 2022

Do not restrict SIMD registers to memory that is 16B aligned.
Experiment to see how many cases we miss out on with today's restriction.

Motivation: https://godbolt.org/z/eb53xPvYT
Related discussion: #67326 (comment)
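
To make the idea concrete without following the godbolt link, here is a small standalone sketch (illustrative only, not code from this PR or from the linked example; the helper name copy_block_16 is made up): an unrolled block copy can move data in 16-byte chunks even when the pointers are not known to be 16-byte aligned, since Arm64 permits unaligned SIMD loads/stores for normal memory. Compilers typically lower the fixed-size 16-byte memcpy below to a q-register load/store, which is the code shape CpBlkUnroll emits when it is allowed to use SIMD registers.

    // Standalone illustration (not JIT code): copy a block in 16-byte chunks
    // without requiring 16-byte-aligned pointers.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    static void copy_block_16(void* dst, const void* src, std::size_t size)
    {
        auto*       d = static_cast<std::uint8_t*>(dst);
        const auto* s = static_cast<const std::uint8_t*>(src);

        std::size_t i = 0;
        for (; i + 16 <= size; i += 16)
        {
            std::memcpy(d + i, s + i, 16); // one 16-byte (q-register sized) move; no alignment requirement
        }
        for (; i < size; i++)
        {
            d[i] = s[i]; // scalar tail
        }
    }

    int main()
    {
        std::uint8_t src[40];
        std::uint8_t dst[40] = {};
        for (std::size_t i = 0; i < sizeof(src); i++)
        {
            src[i] = static_cast<std::uint8_t>(i);
        }
        copy_block_16(dst + 1, src + 1, 39); // deliberately misaligned destination
        std::printf("%d\n", dst[39] == 39);  // prints 1
        return 0;
    }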

@ghost assigned kunalspathak Apr 15, 2022
@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 15, 2022
@ghost commented Apr 15, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Do not restrict SIMD registers to memory that is 16B aligned.
Experiment to see how many cases we miss out on with today's restriction.

Motivation: https://godbolt.org/z/eb53xPvYT

Author: kunalspathak
Assignees: kunalspathak
Labels: area-CodeGen-coreclr
Milestone: -

@EgorChesakov (Contributor)

You will need to adjust LSRA. Otherwise, it won't allocate SIMD register(s) when src/dst doesn't use sp/fp as base register.
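
To spell out the dependency, a rough sketch of the contract being described (approximate, not the actual codegenarm64.cpp sources; the exact helper usage is an assumption): CodeGen can only consume an internal SIMD temporary that LSRA reserved for the node, so if the codegen heuristic starts taking the SIMD path for addresses that are not sp/fp-based while LSRA still predicts the scalar path, there is no float register to hand out.

    // Sketch only -- approximate shape of the CodeGen side, not the exact sources.
    if (canUse16ByteWideInstrs)
    {
        // If LSRA never called buildInternalFloatRegisterDefForNode for this node
        // (because its own check predicted the scalar path), no float register was
        // reserved here and the JIT asserts.
        regNumber simdTmpReg = blkNode->ExtractTempReg(RBM_ALLFLOAT);
        // ... emit q-register loads/stores through simdTmpReg ...
    }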

@kunalspathak (Contributor, Author) commented Apr 15, 2022

Thanks @EgorChesakov. I believe that might be why the SPC compilation is failing, although from what I understand this was just a heuristic and we shouldn't need to adjust LSRA.

@EgorChesakov (Contributor)

You need to adjust the following logic for InitBlock:

    if (isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES))
    {
        // For larger block sizes CodeGen can choose to use 16-byte SIMD instructions.
        buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
    }

and for CopyBlock:

    bool canUse16ByteWideInstrs = isSrcAddrLocal && isDstAddrLocal && (size >= 2 * FP_REGSIZE_BYTES);

    // Note that the SIMD registers allocation is speculative - LSRA doesn't know at this point
    // whether CodeGen will use SIMD registers (i.e. if such instruction sequence will be more optimal).
    // Therefore, it must allocate an additional integer register anyway.
    if (canUse16ByteWideInstrs)
    {
        buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
        buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
    }

@kunalspathak (Contributor, Author)

The (code size) diffs look very promising. I am inclined to take this PR and monitor how the MicroBenchmarks perform. If there are just a handful of cases that are negatively impacted, we can reconsider the heuristics.

benchmarks.run.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 11793820 (overridden on cmd)
Total bytes of diff: 11791688 (overridden on cmd)
Total bytes of delta: -2132 (-0.02 % of base)
diff is an improvement.
relative diff is an improvement.

coreclr_tests.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 121897680 (overridden on cmd)
Total bytes of diff: 121843412 (overridden on cmd)
Total bytes of delta: -54268 (-0.04 % of base)
diff is an improvement.
relative diff is an improvement.

libraries.crossgen2.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 48036464 (overridden on cmd)
Total bytes of diff: 48023944 (overridden on cmd)
Total bytes of delta: -12520 (-0.03 % of base)
diff is an improvement.
relative diff is an improvement.

libraries.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 60496980 (overridden on cmd)
Total bytes of diff: 60486588 (overridden on cmd)
Total bytes of delta: -10392 (-0.02 % of base)
diff is an improvement.
relative diff is an improvement.

libraries_tests.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 136917816 (overridden on cmd)
Total bytes of diff: 136898512 (overridden on cmd)
Total bytes of delta: -19304 (-0.01 % of base)
diff is an improvement.
relative diff is an improvement.

Full report: https://dev.azure.com/dnceng/public/_build/results?buildId=1724373&view=ms.vss-build-web.run-extensions-tab

@kunalspathak changed the title from "WIP Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers" to "Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers" Apr 19, 2022
@kunalspathak marked this pull request as ready for review April 19, 2022 05:48
@kunalspathak (Contributor, Author)

@dotnet/jit-contrib

@kunalspathak (Contributor, Author)

ping.

@AndyAyersMS (Member)

Improvements: dotnet/perf-autofiling-issues#4992
