Overview
Today's performance-oriented processors have fully pipelined and vectorized FMA units. However, there are still processors with non-pipelined division and sqrt units. This is unlikely to change soon, because it can be more effective to spend the transistor budget on FMA units than on division and sqrt units.
This plugin implements transforms that remove division and sqrt operations and replace them with other instructions to make programs run faster. The transforms mainly apply to expressions that combine a comparison with either a division or a sqrt. They are applied only if fast-math options are specified.
The following transforms A and B reduce the number of divisions by bringing fractions to a common denominator.
A. a/b + c/d -> (ad + bc) / (bd)
B. x/y + z > a/b + c -> (xb - ay) / (yb) + z > c
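As a rough illustration of transforms A and B, the sketch below writes each expression by hand before and after the rewrite. The function names are hypothetical and are not taken from the plugin. The rewritten forms can round or overflow differently from the originals, which is presumably why the plugin requires fast-math options.

```c
#include <stdbool.h>

/* Hypothetical names; a hand-written sketch, not the plugin's output. */

/* Transform A, before: two divisions. */
double sum_before(double a, double b, double c, double d) {
    return a / b + c / d;
}

/* Transform A, after: one division over the common denominator b*d. */
double sum_after(double a, double b, double c, double d) {
    return (a * d + b * c) / (b * d);
}

/* Transform B, before: one division on each side of the comparison. */
bool cmp_before(double x, double y, double z, double a, double b, double c) {
    return x / y + z > a / b + c;
}

/* Transform B, after: a/b is moved to the left-hand side and merged
   with x/y, leaving a single division. */
bool cmp_after(double x, double y, double z, double a, double b, double c) {
    return (x * b - a * y) / (y * b) + z > c;
}
```

For inputs that are exactly representable, such as a = 1, b = 2, c = 1, d = 4, both versions of transform A give 0.75.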
The following transform C removes division from a combination of division and comparison.
C. a/b + c > d -> b < 0 ? (a - b(d - c) < 0) : (-(a - b(d - c)) < 0)
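Transform C follows from multiplying both sides of a/b > d - c by b, which flips the comparison when b is negative. A minimal sketch in C, written with a plain ternary operator and hypothetical function names, could look like this:

```c
#include <stdbool.h>

/* Hypothetical names; a hand-written sketch, not the plugin's output. */

/* Before: the comparison depends on the result of a division. */
bool c_before(double a, double b, double c, double d) {
    return a / b + c > d;
}

/* After: no division. t has the sign of a - b*(d - c); the direction
   of the test depends on the sign of b. */
bool c_after(double a, double b, double c, double d) {
    double t = a - b * (d - c);
    return b < 0 ? (t < 0) : (-t < 0);
}
```

The two functions are expected to agree only under fast-math assumptions, for example no NaN or infinity inputs and a nonzero b.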
The following transform D removes sqrt from a combination of sqrt and comparison.
D. w sqrt(x) + y > z -> w >= 0 ? ((z < y) | (wwx > (z-y)(z-y))) : ((z <= y) & (wwx < (z-y)(z-y)))
The ternary operators in transforms C and D are actually realized by copying the sign bits of FP numbers.
See the example transforms for how the actual transformations are performed.
To check the correctness of the transforms, I randomly generated C source code like the following and compared the return values of the non-transformed and transformed binaries.
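One way to picture a branch-free version of transform C is to fold the sign of b into the value being compared. The sketch below does this by XOR-ing the sign bit of b into a - b(d - c). This is an assumption made for illustration; the exact instruction sequence the plugin emits is not described here, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns v with its sign flipped when the sign
   bit of s is set, and v unchanged otherwise. */
static double xor_sign(double v, double s) {
    uint64_t vb, sb;
    memcpy(&vb, &v, sizeof vb);
    memcpy(&sb, &s, sizeof sb);
    vb ^= sb & UINT64_C(0x8000000000000000); /* keep only the sign bit of s */
    memcpy(&v, &vb, sizeof v);
    return v;
}

/* Branch-free form of transform C (illustrative assumption):
   b > 0 tests t > 0, and b < 0 tests -t > 0, which is t < 0. */
bool c_branchless(double a, double b, double c, double d) {
    double t = a - b * (d - c);
    return xor_sign(t, b) > 0;
}
```

Unlike the test `b < 0`, this version treats b = -0.0 as negative, a difference that only matters for inputs where the original division is already degenerate.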
bool f4c(double a0, double a1, double a2, double a3) {
  return (-asqrt(((-0.754564) / (a0 - ((-(asqrt((a2 - a0)) + ((-0.443820) + (-a1)))) + a0))))) <= (-(-(0.128679)));
}
asqrt(x) is sqrt(fabs(x)).
I evaluated how the transforms affect code performance on processors with the Zen 2, Coffee Lake, and Goldmont+ microarchitectures. The results show improvements in both latency and throughput in most cases on the Zen 2 and Goldmont+ processors. See the results for details.