-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Change loop cloning conditions codegen #59886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change loop cloning conditions codegen #59886
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsAdd an option, statically determined at compile time, This leads to code size reduction in the condition code, as well
|
|
This is a revival of #55863 |
|
The asm diffs are very good: aspnet.run.windows.x64.checked.mch:Detail diffsbenchmarks.run.windows.x64.checked.mch:Detail diffscoreclr_tests.pmi.windows.x64.checked.mch:Detail diffslibraries.crossgen2.windows.x64.checked.mch:Detail diffslibraries.pmi.windows.x64.checked.mch:Detail diffslibraries_tests.pmi.windows.x64.checked.mch:Detail diffs |
|
The microbenchmarks numbers are more mixed: Improvements (from largest to smallest, based on ratio)
Regressions (from largest to smallest, based on ratio)
This is from a run of: with a lot of manual data report post-processing. |
|
From my understanding it might regress harder on arm64 where branch prediction is not that great |
Do you have a pointer to data or spec on this? In this case, simple static "predict forward conditional branch not taken" is sufficient. |
|
A rerun of the "regressions" shows NewtR with no regression (it has no diffs), Trap with a 5% improvement (!), BenchNumericSortJagged, and BenchEmFloatClass also with improvements. |
Very good. |
There are many cases where that is the case. For "Trap", however, it once showed a 10% regression, then showed a 5% improvement, with the same JIT code, and in neither case are there any asm diffs in the benchmark! It makes it hard to trust the numbers on any of the benchmarks. |
Add an option, statically determined at compile time, to generate loop cloning loop choice conditions into one block per condition instead of using bitwise `and` operators that execute all non-dependent conditions in a single block. Set the condition to use the one-block-per-condition mode. This leads to code size reduction in the generated condition code, generally performance improvements in the microbenchmarks, as well as making it easier to understand.
|
Rerunning XposMatrix shows a consistent 7% improvement... but no asm diffs. |
|
I can only consistently reproduce regressions in 2 benchmarks:
In general, I could imagine regressions due to several possibilities: |
91fa5eb to
f1091c8
Compare
|
@AndyAyersMS @dotnet/jit-contrib PTAL |
|
Sample diffs: Baseline: mov edx, eax
not edx
shr edx, 31
mov r8d, r11d
not r8d
shr r8d, 31
and edx, r8d
cmp esi, r11d
setge r8b
movzx r8, r8b
test edx, r8d
je SHORT G_M64169_IG09Diff: test eax, eax
jl SHORT G_M64169_IG09
test r11d, r11d
jl SHORT G_M64169_IG09
cmp esi, r11d
jl SHORT G_M64169_IG09 |
AndyAyersMS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
What's below are some random ideas for follow-up.
How complicated do these things get? I wonder, since we generally expect that all these predicates evaluate true, for cases where we have > 3 conditions if building a somewhat-balanced tree of ANDs might have worked out better and have shorter critical path overall, eg
AND(AND(p1,p2),AND(p3,p4))
instead of
AND(AND(AND(AND(p1, p2), p3), p4)
But it still might lose out to the branch form, assuming the branches are highly predictable (which they should be).
Also if there can be indirs in these trees I wonder if "scheduling" the order might help any.
|
The bitwise-and code could be improved in another way (at least on 64-bit / x64). Here's a case: We could get rid of all the |
Add an option, statically determined at compile time,
to generate loop cloning loop choice conditions into one block
per condition instead of using bitwise and operators to always
execute all non-dependent conditions in a single block.
This leads to code size reduction in the condition code, as well
as making it easier to understand.