[X86] Suboptimal code for AVX-512 narrowing / blended permutex2var

This code, compiled with clang at `-O3 -march=znver4`:

```c
#include <immintrin.h>
#include <stdint.h>

void narrow_u32x16x4_to_u8x64(uint8_t* dst, __m512i x0, __m512i x1, __m512i x2, __m512i x3) {
    __m512i inds = _mm512_set_epi8(
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0,
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0
    );
    __m512i x01 = _mm512_permutex2var_epi8(x0, inds, x1);
    __m512i x23 = _mm512_permutex2var_epi8(x2, inds, x3);
    __m512i x0123 = _mm512_mask_blend_epi64(0xF0, x01, x23);
    _mm512_storeu_si512(dst, x0123);
}
```
produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, zmmword ptr [rip + .LCPI0_0]
        vmovdqa64       zmm5, zmmword ptr [rip + .LCPI0_1]
        vpshufb zmm1, zmm1, zmm4
        vpshufb zmm0, zmm0, zmm5
        vpshufb zmm3, zmm3, zmm4
        vpshufb zmm2, zmm2, zmm5
        vporq   zmm0, zmm0, zmm1
        vporq   zmm1, zmm2, zmm3
        vpmovsxbd       zmm3, xmmword ptr [rip + .LCPI0_3]
        vpermi2d        zmm3, zmm0, zmm1
        vmovdqu64       zmmword ptr [rdi], zmm3
        vzeroupper
        ret
```
instead of the more direct version that gcc produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, ZMMWORD PTR .LC0[rip]
        kmovb   k1, BYTE PTR .LC1[rip]
        vpermt2b        zmm0, zmm4, zmm1
        vpermi2b        zmm4, zmm2, zmm3
        vmovdqa64       zmm0{k1}, zmm4
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        ret
```

The code implements a general 64-element `u32` to `u8` narrow. It should have 2x higher throughput than the `vpmovdb` sequence that clang currently emits when autovectorizing, on both Intel and AMD. It also allows merging multiple results via a blend instead of an insert, and a blend can run on more ports. So clang's autovectorized narrowing is perhaps a separate thing that could be improved: I believe similar approaches should get a ~2x throughput boost for all narrowing conversions, on both Intel and AMD.
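
For readers who want to check that the shuffle indices and the `0xF0` blend mask really amount to a plain truncating narrow, here is a minimal scalar sketch of the same data flow. The helper names (`permutex2var_epi8_model`, `mask_blend_epi64_model`, `to_le_bytes`, `narrow_model`, `narrow_model_matches_truncation`) are hypothetical and not part of any library. They only model the documented per-byte behaviour of `_mm512_permutex2var_epi8` and `_mm512_mask_blend_epi64` on plain arrays, and say nothing about performance or about what either compiler emits.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm512_permutex2var_epi8 (vpermi2b / vpermt2b): for each
   output byte, bit 6 of the index picks the source (0 = a, 1 = b) and
   bits 5:0 pick the byte within that source. */
static void permutex2var_epi8_model(uint8_t out[64], const uint8_t a[64],
                                    const uint8_t idx[64], const uint8_t b[64]) {
    for (int i = 0; i < 64; i++) {
        const uint8_t *src = (idx[i] & 0x40) ? b : a;
        out[i] = src[idx[i] & 0x3F];
    }
}

/* Scalar model of _mm512_mask_blend_epi64: 64-bit lane j comes from b when
   mask bit j is set, otherwise from a. */
static void mask_blend_epi64_model(uint8_t out[64], uint8_t mask,
                                   const uint8_t a[64], const uint8_t b[64]) {
    for (int j = 0; j < 8; j++) {
        const uint8_t *src = ((mask >> j) & 1) ? b : a;
        memcpy(out + 8 * j, src + 8 * j, 8);
    }
}

/* Lay out 16 u32 values as 64 little-endian bytes, as they sit in a zmm register. */
static void to_le_bytes(uint8_t out[64], const uint32_t v[16]) {
    for (int i = 0; i < 16; i++)
        for (int k = 0; k < 4; k++)
            out[4 * i + k] = (uint8_t)(v[i] >> (8 * k));
}

/* Same data flow as narrow_u32x16x4_to_u8x64, on plain arrays (hypothetical helper). */
void narrow_model(uint8_t dst[64], const uint32_t x0[16], const uint32_t x1[16],
                  const uint32_t x2[16], const uint32_t x3[16]) {
    uint8_t inds[64], b0[64], b1[64], b2[64], b3[64], x01[64], x23[64];
    for (int i = 0; i < 64; i++)
        inds[i] = (uint8_t)(4 * (i % 32));  /* 0, 4, ..., 124, repeated twice */
    to_le_bytes(b0, x0);
    to_le_bytes(b1, x1);
    to_le_bytes(b2, x2);
    to_le_bytes(b3, x3);
    permutex2var_epi8_model(x01, b0, inds, b1);
    permutex2var_epi8_model(x23, b2, inds, b3);
    /* Low 32 bytes from x01, high 32 bytes from x23. */
    mask_blend_epi64_model(dst, 0xF0, x01, x23);
}

/* Returns 1 if the model agrees with a plain (uint8_t) cast on a test pattern. */
int narrow_model_matches_truncation(void) {
    uint32_t x[4][16];
    uint8_t dst[64];
    for (int v = 0; v < 4; v++)
        for (int i = 0; i < 16; i++)
            x[v][i] = 0x01020300u * (uint32_t)(v + 1) + (uint32_t)(16 * v + i) * 3u;
    narrow_model(dst, x[0], x[1], x[2], x[3]);
    for (int v = 0; v < 4; v++)
        for (int i = 0; i < 16; i++)
            if (dst[16 * v + i] != (uint8_t)x[v][i])
                return 0;
    return 1;
}
```

Each permute fills both halves of its result with the same 32 narrowed bytes, which is why a single 64-bit-granular blend is enough to combine `x01` and `x23`.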

https://godbolt.org/z/ax7Yda7Ps

[X86] Suboptimal code for AVX-512 narrowing / blended permutex2var #137422
