-
Notifications
You must be signed in to change notification settings - Fork 43
Inefficient x64 codegen for float->int truncation #173
Comments
I've tried to use the old trick of forcing the floating-point value's exponent to be able to extract the integer value directly - in my case the floating point value is non-negative so I can implement it like this: #if 1
zf = wasm_f32x4_add(zf, wasm_f32x4_splat(1 << 23));
zf = wasm_v128_and(zf, wasm_i32x4_splat(0x007FFFFF));
#else
zf = wasm_i32x4_trunc_saturate_f32x4(zf);
#endif And this reduced the cost of my function from 4.8 ms to 3.9 ms, so the cost in cycles appears to be a bit larger than the cost in instructions. I am assuming that one |
As per the sync meeting, the ideal solution here seems to be some sort of "fast" semantics under which the results in edge conditions (out of bounds values and NaN) can be loosened and potentially differ per architecture - but this is outside of the scope of the SIMD proposal right now. Thus I am looking into potential options, if any, of alternative specifications. As far as I can tell, IEEE doesn't exactly define the behavior for these edge conditions - it suggests that "the invalid operation exception shall be signaled if this situation cannot otherwise be indicated" but does not seem to mandate any specific behavior when exceptions are not available / silenced. Among the interesting architectures, ARM, PowerPC and MIPS all have a single instruction that behaves as the currently specified WASM SIMD instruction does (NaNs convert to 0; out of range values convert to INT_MAX or INT_MIN depending on the sign), and none of them have alternative instructions that seem relevant here. x64 is the odd one out, where NaNs and out of range values all convert to INT_MIN. So, given that we want the exact same behavior for all architectures, I see a few options.
; replaces values >INT_MAX (or NaN) with INT_MAX
vmaxps dst, src, [rip + OFFSET_THAT_CONTAINS_INT_MAX_AS_FLOAT]
; replaces NaN with 0
vcmpeqps temp, src, src
vpand dst, dst, temp
; numbers <INT_MIN will automatically get converted to INT_MIN
vcvttps2dq dst, dst This is 4 instructions instead of 7, the critical path is shorter as well (3 dependent instructions, probably 4 uops since one instruction has a memory access). It does use a memory access, without it we'd need to spend a couple extra instructions synthesizing INT_MAX.
; replaces values >INT_MAX (or NaN) with INT_MAX
vmaxps dst, src, [rip + OFFSET_THAT_CONTAINS_INT_MAX_AS_FLOAT]
; numbers <INT_MIN will automatically get converted to INT_MIN
vcvttps2dq dst, dst For this to work on other architectures, they need to convert NaNs to INT_MAX which requires 3 extra instructions (compare, load INT_MAX, select), bringing the total to 4 for ARM/PowerPC/MIPS. This seems strictly worse than option 2. It seems to me that basically any fixup on other architectures would require 3 extra instructions so I suspect that the best route right now is to optimize v8 codegen with the path I sketched above assuming I didn't miss anything important, and the best route in the future is to specify a "fast" instruction that gives indeterminate results for OOB/NaNs, or really just NaNs since saturation is easy to handle as per above. Comments welcome. |
Could we perhaps use an immediate value to control this behavior? I can see 4 cases as follow:
We need 2 bit flags: to determine if we should saturate and/or replace NaN with 0.
Using I agree that due to the abstraction layer and the common need to handle the edge cases, having instructions to handle this is the best way to go but perhaps control flags would be more flexible and clearer. |
For the current SIMD proposal the goals are to only standardize instructions where the behavior is exactly the same between different architectures, with a possible exception of denormal handling which can already be inconsistent between different WASM implementations. Controlling the exact behavior with flags is an interesting idea but when the flags are set to "don't care" we still get into the realm of different behavior on different ISAs which, as I understand it, is an interesting future direction for "fast" semantics but out of scope for the current proposal. |
Correction to the above: on architectures with a maximum instruction that doesn’t propagate NaN - such as ARM’s fmaxnm - option 3 would result in 3-instruction sequence (same as x64, but with a dedicated load). This is a touch more balanced than the current instruction, although the semantics are slightly odd. |
Thanks @zeux for detailing this so clearly. I'd like to also agree that option 2 seems to be the best way to tackle this. Specifically for V8, being able to use RIP relative addressing for constants in memory is on our roadmap as it makes code generation better for other operations as well. Would be open to prototyping option 3, if there is interest in getting some numbers. |
Closing this issue as there are no spec changes that are expected to come out of this issue. This is the tracking issue for V8 changes that better codegen will depend on. |
Unfortunately, x64 codegen - at least as employed by v8 - for i32x4.trunc_sat_f32x4_s is really elaborate:
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/compiler/backend/ia32/code-generator-ia32.cc#L2083-L2100
This is 7 instructions for what could be 1 instruction in x64 if NaN handling or overflow behavior didn't have to match the specified one.
Is there any way this can be improved? I don't have a specific suggestion, but this costs ~10% of instructions (not sure how to measure cycle impact accurately) on one of my functions.
The text was updated successfully, but these errors were encountered: