-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New MIR opt pass simplify_pow_of_two
#114254
Conversation
(rustbot has picked a reviewer for you, use r? to override) |
Some changes occurred to the CTFE / Miri engine cc @rust-lang/miri Some changes occurred to MIR optimizations cc @rust-lang/wg-mir-opt |
0dd63be
to
100171c
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment has been minimized.
This comment has been minimized.
This isn't the first time I feel like we need some sort of "if this condition is always true after optimizations always pick this branch, otherwise discard it completly" |
939f2bf
to
6a490b0
Compare
This comment has been minimized.
This comment has been minimized.
Is the goal of this MIR transform to improve final codegen? If it is, we should have something under |
There's no reason it can't be, but there are many ways to implement I don't know much about LLVM though. Maybe it already optimizes its own |
r? @Compiler |
This comment has been minimized.
This comment has been minimized.
6db3f53
to
bfa1165
Compare
This comment has been minimized.
This comment has been minimized.
&& let Some(def_id) = func.const_fn_def().map(|def| def.0) | ||
&& let def_path = tcx.def_path(def_id) | ||
&& tcx.crate_name(def_path.krate) == sym::core | ||
// FIXME(Centri3): I feel like we should do this differently... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps check that it's an associated method of an integer primitive type?
Using names is usually the wrong way to identify things.
(Arguably maybe it should be a lang item if it's recognized semantically, but that would probably need one lang item per type, which would be a mess...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lang items would make sense, though yeah, there'd be tons of them... I tried with diagnostic items before and it was a bit of pain due to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InstSimplify looks at the declaring type for one of its things, so maybe you can use something like that?
rust/compiler/rustc_mir_transform/src/instsimplify.rs
Lines 203 to 219 in 89acdae
// Only bother looking more if it's easy to know what we're calling | |
let Some((fn_def_id, fn_args)) = func.const_fn_def() else { return }; | |
// Clone needs one subst, so we can cheaply rule out other stuff | |
if fn_args.len() != 1 { | |
return; | |
} | |
// These types are easily available from locals, so check that before | |
// doing DefId lookups to figure out what we're actually calling. | |
let arg_ty = args[0].ty(self.local_decls, self.tcx); | |
let ty::Ref(_region, inner_ty, Mutability::Not) = *arg_ty.kind() else { return }; | |
if !inner_ty.is_trivially_pure_clone_copy() { | |
return; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd still need to make sure the method name is pow
, though that should be a lot better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most graceful way to do this is to add a diagnostic item: https://rustc-dev-guide.rust-lang.org/diagnostics/diagnostic-items.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does
lang
require it to not be const?
No, but we could have a single generic lang item that uses trait bounds for Add
, Mul
and such. Then all the integer impls could call the generic lang item, and we could make our optimization work on the generic lang item. Though that would still have to happen post-inlining
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLVM has an llvm.is.constant intrinsic, which basically does what you want is_known_to_be_const to do. If you export that intrinsic and add a code path for handling power-of-two guarded by it to the pow implementation, that should get optimized fine on the LLVM level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would still get the power used and all at runtime though, afaict, as you can't use a non-constant in a const
block (which would be self
here, which as far as rust's concerned isn't a constant despite it being so). I believe this requires at least a pow
call anyway so if it does it at runtime it'll be slower regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the trick is that if you add a
if llvm_is_constant(self) {
if self.is_power_of_two() {
...
}
}
then, since LLVM knows self
is a constant, it will be able to optimize self.is_power_of_two()
into true
or false
, and then either optimize out your logic or the rest of pow
depending on the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news: I did that and the generated code is even better with it, and it 100% works! Will open a separate PR for that and close this one (though not rn as the intrinsic will cause rustc to segfault if called on non-i32s, will need some more work 🙃)
bfa1165
to
9b29907
Compare
This comment has been minimized.
This comment has been minimized.
The job Click to see the possible cause of the failure (guessed by this bot)
|
Not entirely sure what's wrong there. With stage1 it works fine but the output seems to differ on stage2, maybe it needs to be more generic in regards to local names? (I see other codegen tests do this) Same issue in the mir tests too, but that can be worked around by using only after rather than diff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Very summary review. I'll get back to this PR in a few days.
@@ -1872,6 +1872,10 @@ impl<'tcx> Region<'tcx> { | |||
|
|||
/// Constructors for `Ty` | |||
impl<'tcx> Ty<'tcx> { | |||
pub fn new_bool(tcx: TyCtxt<'tcx>) -> Ty<'tcx> { | |||
Ty::new(tcx, TyKind::Bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pre-interned as tcx.types.bool
@@ -546,6 +547,7 @@ fn run_optimization_passes<'tcx>(tcx: TyCtxt<'tcx>, body: &mut Body<'tcx>) { | |||
&lower_slice_len::LowerSliceLenCalls, // has to be done before inlining, otherwise actual call will be almost always inlined. Also simple, so can just do first | |||
&unreachable_prop::UnreachablePropagation, | |||
&uninhabited_enum_branching::UninhabitedEnumBranching, | |||
&simplify_pow_of_two::SimplifyPowOfTwo, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably go after const prop, to increase the likelihood to have constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's after inlining, with the current way it works there's a chance the inliner will inline the call, and thus we won't detect it.
// already entirely optimized away | ||
&& power_used != 0.0 | ||
// `-inf` would be `0.pow()` | ||
&& power_used.is_finite() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you split this huge chain into 'let ... else { continue}' ?
.. | ||
} = &term.kind | ||
&& let Some(def_id) = func.const_fn_def().map(|def| def.0) | ||
&& let def_path = tcx.def_path(def_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the Def path ? The crate is already 'def_id.krate'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Artifact from how it worked before
} | ||
&& let power_used = f32::log2(recv_val as f32) | ||
// Precision loss means it's not a power of two | ||
&& power_used == (power_used as u32) as f32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use floats instead of ilog2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's unfortunately no distinction between whether it's power of two there. i32::ilog2(4)
and i32::log2(5)
both return 2.
Oh nvm, this can almost certainly be followed by a call to 2.pow
and see if that matches the original value
recv_ty, | ||
) = recv_const.literal | ||
&& recv_ty.is_integral() | ||
&& tcx.item_name(def_id) == sym::pow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This detection mecanism is brittle. The compiler should be as independent as possible from the exact paths in core. Can diagnostic items be used for this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with it, but if so we should use lang items instead as pointed out by #114254 (comment). But this feels like it'd be very verbose and slow, as we'd need 12 lang items for them all and then check whether it's any (rather than just if the item name is pow
).
This is already performed by LLVM, both with debug assertions and without (even at -Copt-level=1): https://rust.godbolt.org/z/ssGz4Gz13 I don't believe the complexity is worth it in a MIR opt, unless we can show it significantly improves debug build performance. Considering |
That's only if it's a constant: https://rust.godbolt.org/z/Pcfd8YPz4, if it's not it won't be at all optimized away. So this would also majorly improve performance on release mode if a The more realistic case of https://rust.godbolt.org/z/Eezfaoa7W is still quite a lot longer. |
In case it is not a constant, your optimization also does not apply, right? |
It does apply if |
🤦 ok, I completely misread this opt. Let me wake up and try again 😆 sorry |
LLVM only has float For float math the peephole opts LLVM does can be found in https://llvm.org/doxygen/SimplifyLibCalls_8cpp_source.html (search for https://stackoverflow.com/questions/2398442/why-isnt-int-powint-base-int-exponent-in-the-standard-c-libraries is very enlightening on why integer power ops don't exist in C/C++, so I'd assume LLVM doesn't really have any optimizations for it. |
256u32.pow(a) | ||
} | ||
|
||
// EMIT_MIR simplify_pow_of_two_no_overflow_checks.slow_256_i.SimplifyPowOfTwo.after.mir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// EMIT_MIR simplify_pow_of_two_no_overflow_checks.slow_256_i.SimplifyPowOfTwo.after.mir | |
// EMIT_MIR simplify_pow_of_two_no_overflow_checks.slow_256_i.SimplifyPowOfTwo.diff |
makes it easier for us to see what the optimization does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: ah, but it doesn't really optimize away anything, just replace a function... oh well 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually wanted to use diff as well, just for clarity's sake though it seems from stage1 to stage2 it changes an unwind unreachable
to unwind continue
. It's the reason why CI was failing before. Any idea why that happens/how it can be prevented? Because I'd definitely prefer diff, even just for the highlighting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably not a stage1/stage2 difference, but a problem with the fact that we are running mir-opt tests for panic=unwind and panic=abort and are expecting the results to match. You can use // EMIT_MIR_FOR_EACH_PANIC_STRATEGY
to generate separate output files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, that makes sense, thanks!
This detects calls to
x.pow
wherex
is a power of two and an integer. This can use shl instead, with some nuances. Unfortunately modifyingpow
instead results in this check being done at runtime, and also getting the power used, thus is even slower than without it.Supersedes the clippy lint rust-lang/rust-clippy#11057
I haven't benchmarked this yet but from my testing, without debug assertions this has 0 branches, thus is branchless, and only has a couple instructions, so I'm 99% confident this is faster. With debug assertions, I'm not sure,
custom_mir
doesn't really like assertions 😅For calls where
x
is not 2, the rhs for shl can beexp * <power used to get x>
.Hope I did everything right ^^