-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Fix SQRT, RCP, RSQRT PerfScores #50813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| result.insThroughput = PERFSCORE_THROUGHPUT_3C; | ||
| result.insLatency += PERFSCORE_LATENCY_12C; | ||
| result.insThroughput = PERFSCORE_THROUGHPUT_1C; | ||
| result.insLatency += PERFSCORE_LATENCY_4C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On https://uops.info/table.html, I see the following for Skylake:
sqrtsd 64, TP 1.00 / 4.50, LAT [≤13.0;≤19]
sqrtpd 128, TP 1.00 / 4.50, LAT [≤13.0;≤19]
sqrtpd 256, TP 1.00 / 9.00, LAT [≤13.0;≤20]
sqrtss 32, TP 1.00 / 3.00, LAT [≤12.0;≤13]
sqrtps 128, TP 1.00 / 3.00, LAT [≤12.0;≤13]
sqrtps 256, TP 1.00 / 6.00, LAT [≤12.0;≤13]
rsqrtss 32, TP 1.00 / 1.00, LAT [≤4.0;≤5]
rsqrtps 128, TP 1.00 / 1.00, LAT [≤4.0;≤5]
rsqrtps 256, TP 1.00 / 1.00, LAT [≤4.0;≤5]
rcpss 32, TP 1.00 / 1.00, LAT 4
rcpps 128, TP 1.00 / 1.00, LAT 4
rcpps 256, TP 1.00 / 1.00, LAT 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, it looks like the 256-bit variants can be twice as expensive on throughput (as least for non-reciprocal cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I saw that also when I checked the values, thought of even coding the argument size check, but then decided against it based on the fact that it wouldn't have any practical effect since the PerfScore is dominated by the larger latency value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, just thought I'd call it out 😄
It might be worth an up-for-grabs issue to get these correctly tracked. Given that uops.info has everything available in an xml file (https://uops.info/xml.html), it would likely be possible to have a small generator that fills in a metadata header file (much like is used for hwintrinsiclistxarch.h) so this can be automated and is trivial to update to newer or older micro-architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too familiar with PerfScore code, but such an approach might be quite complicated - the uops.info measurements are made for each variant of encoding/reg-vs-mem/op-size and latencies are for each combination of input and output parameters. Currently PerfScore values are more of an art than science (e.g. latency for DIV is just set to the arithmetic mean of min/max, while in practice it might actually be on the lower end for 128:64 div as the upper 64bits are basically always zero) or the latency for MUL rdx:rax,reg is set to 4 (high half of output used) as it's the common use case, not 3 (only low half used) which is only rarely emitted.
|
@briansull PTAL |
briansull
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good,
Thanks for your contribution
These were incorrectly grouped and thus
sqrtPerfScores were way too low.