This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Inefficient x64 codegen for float->int truncation #173

Closed
zeux opened this issue Jan 13, 2020 · 7 comments

Comments

@zeux
Contributor

zeux commented Jan 13, 2020

Unfortunately, the x64 codegen for i32x4.trunc_sat_f32x4_s - at least as employed by v8 - is really elaborate:

https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/compiler/backend/ia32/code-generator-ia32.cc#L2083-L2100

This is 7 instructions for what could be a single instruction on x64 if the NaN handling and overflow behavior didn't have to match the specified semantics.

Is there any way this can be improved? I don't have a specific suggestion, but this accounts for ~10% of the instructions (I'm not sure how to measure the cycle impact accurately) in one of my functions.
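
For reference, here is a scalar sketch of the per-lane semantics the spec requires (my own illustration: NaN converts to 0, out-of-range values saturate):

#include <math.h>
#include <stdint.h>

// Scalar reference for one lane of i32x4.trunc_sat_f32x4_s (illustration only).
static int32_t trunc_sat_f32_s(float x) {
    if (isnan(x)) return 0;                      // NaN lanes become 0
    if (x <= (float)INT32_MIN) return INT32_MIN; // saturate negative overflow
    if (x >= 2147483648.0f) return INT32_MAX;    // saturate positive overflow (2^31)
    return (int32_t)x;                           // in range: the cast truncates toward zero
}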

@zeux
Contributor Author

zeux commented Jan 13, 2020

I've tried the old trick of forcing the floating-point value's exponent so that the integer value can be extracted directly from the mantissa - in my case the floating-point value is non-negative, so I can implement it like this:

	#if 1
		// Force the exponent to 23 so that the (rounded) integer value lands in the
		// low 23 mantissa bits, then extract those bits with the mask. Valid for
		// non-negative inputs below 2^23; note that the add rounds to nearest
		// rather than truncating.
		zf = wasm_f32x4_add(zf, wasm_f32x4_splat(1 << 23));
		zf = wasm_v128_and(zf, wasm_i32x4_splat(0x007FFFFF));
	#else
		zf = wasm_i32x4_trunc_saturate_f32x4(zf);
	#endif

And this reduced the cost of my function from 4.8 ms to 3.9 ms, so the cost in cycles appears to be somewhat larger than the cost in instructions. I am assuming that a single VCVTTPS2DQ would have roughly the same cost as this two-instruction replacement, but I can't verify that claim.
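
To show the trick in isolation (a hypothetical standalone sketch, not code from my function): adding 2^23 forces the exponent so that the rounded integer value lands in the low 23 mantissa bits, which the mask then extracts. Note that it rounds to nearest rather than truncating and requires a non-negative input below 2^23.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int32_t float_to_int_via_bias(float x) {
    float biased = x + 8388608.0f;         // 1 << 23; the exponent is now 23
    uint32_t bits;
    memcpy(&bits, &biased, sizeof(bits));  // reinterpret the float bits
    return (int32_t)(bits & 0x007FFFFF);   // low 23 mantissa bits hold the value
}

int main(void) {
    printf("%d\n", (int)float_to_int_via_bias(3.2f)); // prints 3
    printf("%d\n", (int)float_to_int_via_bias(3.7f)); // prints 4 (rounded, not truncated)
    return 0;
}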

@zeux
Contributor Author

zeux commented Jan 15, 2020

As per the sync meeting, the ideal solution here seems to be some sort of "fast" semantics under which the results for edge conditions (out-of-bounds values and NaN) can be loosened and potentially differ per architecture - but this is outside the scope of the SIMD proposal right now. Thus I am looking into what options, if any, exist for alternative specifications.

As far as I can tell, IEEE doesn't exactly define the behavior for these edge conditions - it suggests that "the invalid operation exception shall be signaled if this situation cannot otherwise be indicated" but does not seem to mandate any specific behavior when exceptions are not available / silenced.

Among the architectures of interest, ARM, PowerPC, and MIPS all have a single instruction that behaves as the currently specified WASM SIMD instruction does (NaNs convert to 0; out-of-range values convert to INT_MAX or INT_MIN depending on the sign), and none of them have alternative instructions that seem relevant here. x64 is the odd one out: there, NaNs and out-of-range values all convert to INT_MIN.

So, given that we want the exact same behavior for all architectures, I see a few options.

  1. Do nothing. In this option we will maintain the status quo. Applications that require fast float->int conversions and are okay with rounding can use the float->int trick; there are ways to adapt it to be almost-truncating depending on the exact behavior you want around precise values. This may be an acceptable workaround - it is acceptable for my application.

  2. Find a way to lower the existing instruction more efficiently on x64. The path that seems potentially viable here is to clamp to INT_MAX and separately fix up NaNs:

; replaces values >INT_MAX (or NaN) with INT_MAX
vminps dst, src, [rip + OFFSET_THAT_CONTAINS_INT_MAX_AS_FLOAT]
; replaces NaN with 0
vcmpeqps temp, src, src
vpand dst, dst, temp
; numbers <INT_MIN will automatically get converted to INT_MIN
vcvttps2dq dst, dst

This is 4 instructions instead of 7, and the critical path is shorter as well (3 dependent instructions, probably 4 uops since one instruction has a memory operand). It does use a memory access; without it we'd need to spend a couple of extra instructions synthesizing INT_MAX. (A C intrinsics sketch of this sequence is included after this list.)

  3. Change the semantics to make the x64 fallback easier at the cost of other ISAs. The fastest way to get a saturating float->int on x64 seems to be to declare that NaN results in INT_MAX, so that just two instructions are sufficient and we don't need to do a NaN fixup:

; replaces values >INT_MAX (or NaN) with INT_MAX
vminps dst, src, [rip + OFFSET_THAT_CONTAINS_INT_MAX_AS_FLOAT]
; numbers <INT_MIN will automatically get converted to INT_MIN
vcvttps2dq dst, dst

For this to work on other architectures, they would need to convert NaNs to INT_MAX, which requires 3 extra instructions (compare, load INT_MAX, select), bringing the total to 4 for ARM/PowerPC/MIPS. This seems strictly worse than option 2.
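
For concreteness, here is what the option 2 sequence could look like when expressed with SSE intrinsics (my own untested sketch, not part of any engine; since INT_MAX is not exactly representable as a float, the clamp constant has to be the largest float below 2^31, so positive overflow saturates to 2147483520 rather than exactly INT_MAX - an approximation of the exact trunc_sat result):

#include <emmintrin.h>

// Sketch of the option 2 lowering (untested; approximates the specified saturation).
static __m128i trunc_sat_f32x4_s_sse(__m128 src) {
    const __m128 clamp = _mm_set1_ps(2147483520.0f); // largest float below 2^31
    __m128 t = _mm_min_ps(src, clamp);               // vminps: values >= 2^31 (and NaN) -> clamp
    __m128 not_nan = _mm_cmpeq_ps(src, src);         // vcmpeqps: NaN lanes -> all-zeros mask
    t = _mm_and_ps(t, not_nan);                      // vpand: NaN lanes -> +0.0f
    return _mm_cvttps_epi32(t);                      // vcvttps2dq: values < -2^31 -> INT_MIN
}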

It seems to me that basically any fixup on other architectures would require 3 extra instructions, so I suspect that the best route right now is to optimize the v8 codegen along the path I sketched above (assuming I didn't miss anything important), and the best route in the future is to specify a "fast" instruction that gives indeterminate results for out-of-bounds values and NaNs - or really just NaNs, since saturation is easy to handle as per the above.

Comments welcome.

@nfrechette

Could we perhaps use an immediate value to control this behavior?

I can see 4 cases, as follows:

  • values are all known to be within range and valid and no out-of-range handling is needed (fastest, just coerce, don't care about out of range/NaN)
  • values might be out of range and need saturation but are valid (need to saturate but not sanitize NaN)
  • values are all within range but perhaps invalid (no need to saturate but sanitizing NaN is required)
  • values might be out of range and perhaps invalid (safest, make no assumption about input, slowest).

We need 2 flag bits to determine whether we should saturate and/or replace NaN with 0.

i32x4.trunc_sat_f32x4_s(a: v128) -> v128 could become i32x4.convert_f32x4(a: v128, ImmFlagx2: flags) -> v128
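
Purely to illustrate the encoding idea (hypothetical names, not a concrete proposal), the two flag bits could look like this:

// Hypothetical flag bits for the proposed i32x4.convert_f32x4 immediate.
enum convert_flags {
    CONVERT_FAST     = 0,                                   // assume in-range, non-NaN input
    CONVERT_SATURATE = 1 << 0,                              // clamp out-of-range values
    CONVERT_ZERO_NAN = 1 << 1,                              // replace NaN lanes with 0
    CONVERT_STRICT   = CONVERT_SATURATE | CONVERT_ZERO_NAN  // current trunc_sat behavior
};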

Using trunc_sat as a prefix might paint us into a corner if we later decide to add new "fast" intrinsics that skip saturation/sanitization. It already breaks the symmetry with int->float conversion.

I agree that, due to the abstraction layer and the common need to handle the edge cases, having instructions to handle this is the best way to go, but perhaps control flags would be more flexible and clearer.

@zeux
Contributor Author

zeux commented Jan 15, 2020

For the current SIMD proposal, the goal is to only standardize instructions whose behavior is exactly the same across architectures, with the possible exception of denormal handling, which can already be inconsistent between different WASM implementations. Controlling the exact behavior with flags is an interesting idea, but when the flags are set to "don't care" we still get into the realm of behavior that differs per ISA, which, as I understand it, is an interesting future direction for "fast" semantics but out of scope for the current proposal.

@zeux
Contributor Author

zeux commented Jan 15, 2020

Correction to the above: on architectures with min/max instructions that don't propagate NaN - such as ARM's fmaxnm/fminnm - option 3 would result in a 3-instruction sequence (same as x64, but with a dedicated constant load). This is a touch more balanced than the current instruction, although the semantics are slightly odd.
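
For illustration, an AArch64 NEON sketch of what that 3-instruction sequence could look like (my own untested reading, using the NaN-suppressing min; as on x64, the clamp constant is the largest float below 2^31, so NaN and positive overflow produce 2147483520 rather than exactly INT_MAX):

#include <arm_neon.h>

// Sketch of option 3 on AArch64 (untested): one NaN-suppressing min against the
// clamp constant, then the saturating truncating convert.
static int32x4_t trunc_option3_neon(float32x4_t src) {
    float32x4_t clamped = vminnmq_f32(src, vdupq_n_f32(2147483520.0f)); // NaN -> clamp
    return vcvtq_s32_f32(clamped); // fcvtzs truncates toward zero and saturates
}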

@dtig
Member

dtig commented Jan 16, 2020

Thanks @zeux for detailing this so clearly. I'd also like to agree that option 2 seems to be the best way to tackle this. Specifically for V8, being able to use RIP-relative addressing for constants in memory is on our roadmap, as it improves code generation for other operations as well. We would be open to prototyping option 3 if there is interest in getting some numbers.

@dtig
Member

dtig commented Dec 11, 2020

Closing this issue, as there are no spec changes expected to come out of it. This is the tracking issue for the V8 changes that better codegen will depend on.

@dtig dtig closed this as completed Dec 11, 2020