Skip to content

Conversation

@rootjalex
Copy link
Member

We can get much better codegen for double saturating narrows from i32 on x86 and wasm. HVX and ARM backends both already do this. Also added tests to simd_op_check

Fixes #7069

Here's an example from x86 (wasm is similar):

Var x("x");
ImageParam in(Int(32), 1);
Func f("f");

f(x) = u8_sat(in(x));
f.vectorize(x, 32);

Target x86("x86-64-linux-avx-avx2-fma-sse41");
f.compile_to_assembly("vpack.asm", f.infer_arguments(), x86);

Previously:

	vpminsd	-96(%rax,%rcx,4), %ymm0, %ymm2
	vpminsd	-64(%rax,%rcx,4), %ymm0, %ymm3
	vpminsd	-32(%rax,%rcx,4), %ymm0, %ymm4
	vpminsd	(%rax,%rcx,4), %ymm0, %ymm5
	vpmaxsd	%ymm1, %ymm2, %ymm2
	vpmaxsd	%ymm1, %ymm3, %ymm3
	vpackusdw	%ymm3, %ymm2, %ymm2
	vpmaxsd	%ymm1, %ymm4, %ymm3
	vpmaxsd	%ymm1, %ymm5, %ymm4
	vpackusdw	%ymm4, %ymm3, %ymm3
	vpermq	$216, %ymm3, %ymm3              # ymm3 = ymm3[0,2,1,3]
	vpermq	$216, %ymm2, %ymm2              # ymm2 = ymm2[0,2,1,3]
	vpackuswb	%ymm3, %ymm2, %ymm2
	vpermq	$216, %ymm2, %ymm2              # ymm2 = ymm2[0,2,1,3]
	vmovdqu	%ymm2, (%r8,%rcx)

Now:

	vmovdqu	-96(%rax,%rcx,4), %ymm0
	vmovdqu	-32(%rax,%rcx,4), %ymm1
	vpackssdw	-64(%rax,%rcx,4), %ymm0, %ymm0
	vpackssdw	(%rax,%rcx,4), %ymm1, %ymm1
	vpermq	$216, %ymm0, %ymm0              # ymm0 = ymm0[0,2,1,3]
	vpermq	$216, %ymm1, %ymm1              # ymm1 = ymm1[0,2,1,3]
	vpackuswb	%ymm1, %ymm0, %ymm0
	vpermq	$216, %ymm0, %ymm0              # ymm0 = ymm0[0,2,1,3]
	vmovdqu	%ymm0, (%r8,%rcx)

@steven-johnson
Copy link
Contributor

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rootjalex
Copy link
Member Author

@steven-johnson I see the same failure on #7279 , so I don't think either PR is responsible for the failure
https://buildbot.halide-lang.org/master/#/builders/42/builds/692

@steven-johnson
Copy link
Contributor

steven-johnson commented Jan 17, 2023

@steven-johnson I see the same failure on #7279 , so I don't think either PR is responsible for the failure https://buildbot.halide-lang.org/master/#/builders/42/builds/692

great, it's probably another LLVM injection :-/

Let me try to confirm that first

@steven-johnson
Copy link
Contributor

The predicated-load failure isn't happening for me locally with top-of-tree LLVM, so maybe it's a temporary flake; I'm forcing rebuilds on the x64 bots to see if it recurs

@rootjalex
Copy link
Member Author

Only failing test appears unrelated. @steven-johnson think it’s good to go?

@steven-johnson
Copy link
Contributor

Failure is vectorized_gpu_allocation, which I've never seen before as a flake or even an ordinary failure, so let me retry it just a bit first.

@steven-johnson
Copy link
Contributor

The failure is now in our old friend, correctness_atomics, aka "Mr. Flaky", so I think we're good to go

@rootjalex rootjalex merged commit bafd60f into main Jan 20, 2023
@rootjalex rootjalex deleted the rootjalex/x86-double-sat branch January 20, 2023 18:03
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* better x86 double sat-cast + add test

* fix wasm too + test

Co-authored-by: Steven Johnson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[x86] Codegen should split up double saturating-narrows

4 participants