This code, compiled with -O3 -march=znver4:
#include <immintrin.h>
#include <stdint.h>

void narrow_u32x16x4_to_u8x64(uint8_t* dst, __m512i x0, __m512i x1, __m512i x2, __m512i x3) {
    __m512i inds = _mm512_set_epi8(
        124, 120, 116, 112, 108, 104, 100, 96,
        92, 88, 84, 80, 76, 72, 68, 64,
        60, 56, 52, 48, 44, 40, 36, 32,
        28, 24, 20, 16, 12, 8, 4, 0,
        124, 120, 116, 112, 108, 104, 100, 96,
        92, 88, 84, 80, 76, 72, 68, 64,
        60, 56, 52, 48, 44, 40, 36, 32,
        28, 24, 20, 16, 12, 8, 4, 0
    );
    __m512i x01 = _mm512_permutex2var_epi8(x0, inds, x1);
    __m512i x23 = _mm512_permutex2var_epi8(x2, inds, x3);
    __m512i x0123 = _mm512_mask_blend_epi64(0xF0, x01, x23);
    _mm512_storeu_si512(dst, x0123);
}
produces:
narrow_u32x16x4_to_u8x64:
        vmovdqa64 zmm4, zmmword ptr [rip + .LCPI0_0]
        vmovdqa64 zmm5, zmmword ptr [rip + .LCPI0_1]
        vpshufb zmm1, zmm1, zmm4
        vpshufb zmm0, zmm0, zmm5
        vpshufb zmm3, zmm3, zmm4
        vpshufb zmm2, zmm2, zmm5
        vporq zmm0, zmm0, zmm1
        vporq zmm1, zmm2, zmm3
        vpmovsxbd zmm3, xmmword ptr [rip + .LCPI0_3]
        vpermi2d zmm3, zmm0, zmm1
        vmovdqu64 zmmword ptr [rdi], zmm3
        vzeroupper
        ret
instead of the more direct version that gcc produces:
narrow_u32x16x4_to_u8x64:
        vmovdqa64 zmm4, ZMMWORD PTR .LC0[rip]
        kmovb k1, BYTE PTR .LC1[rip]
        vpermt2b zmm0, zmm4, zmm1
        vpermi2b zmm4, zmm2, zmm3
        vmovdqa64 zmm0{k1}, zmm4
        vmovdqu64 ZMMWORD PTR [rdi], zmm0
        ret
The code implements a general 64-element u32 to u8 narrow. It should have 2x higher throughput than vpmovdb, which is what clang currently emits via autovectorization, on both Intel and AMD. It also allows merging multiple results with a blend instead of an insert, and a blend can run on more ports. So the autovectorized narrowing is perhaps a separate thing that could be improved. I believe similar approaches should get a ~2x throughput boost for all narrowing conversions, on both Intel and AMD.
https://godbolt.org/z/ax7Yda7Ps
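For anyone checking the index table by hand, here is a minimal scalar sketch of what the intrinsics version above computes. It is only a portable model, not what either compiler emits. The names permutex2var_epi8_model and narrow_model are hypothetical helpers introduced for illustration. The model assumes the documented behaviour of the two-source byte permute: bits 0..5 of each index select a byte, and bit 6 selects the second source.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm512_permutex2var_epi8(a, idx, b) (hypothetical helper):
   for each output byte, bits 0..5 of the index pick a byte position and
   bit 6 picks the source (0 = a, 1 = b). */
static void permutex2var_epi8_model(uint8_t out[64], const uint8_t a[64],
                                    const uint8_t idx[64], const uint8_t b[64]) {
    for (int i = 0; i < 64; i++) {
        const uint8_t *src = (idx[i] & 64) ? b : a;
        out[i] = src[idx[i] & 63];
    }
}

/* Scalar model of narrow_u32x16x4_to_u8x64 (hypothetical helper):
   x[v] stands in for the 16 u32 lanes of vector xv. */
static void narrow_model(uint8_t dst[64], const uint32_t x[4][16]) {
    uint8_t inds[64], bytes[4][64], x01[64], x23[64];

    /* Same table as _mm512_set_epi8 above, read from element 0 upward:
       0, 4, ..., 124 in the low 32 bytes, then the same again. */
    for (int i = 0; i < 64; i++)
        inds[i] = (uint8_t)(4 * (i % 32));

    /* Lay each vector out as 64 bytes, lowest byte of each lane first. */
    for (int v = 0; v < 4; v++)
        for (int j = 0; j < 16; j++)
            for (int k = 0; k < 4; k++)
                bytes[v][4 * j + k] = (uint8_t)(x[v][j] >> (8 * k));

    permutex2var_epi8_model(x01, bytes[0], inds, bytes[1]);
    permutex2var_epi8_model(x23, bytes[2], inds, bytes[3]);

    /* _mm512_mask_blend_epi64(0xF0, x01, x23): qwords 0..3 come from x01,
       qwords 4..7 come from x23. */
    memcpy(dst, x01, 32);
    memcpy(dst + 32, x23 + 32, 32);
}
```

In this model, output byte i ends up as the low byte of lane i % 16 of vector i / 16, which is the truncating narrow the issue describes.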