[X86] Suboptimal code for AVX-512 narrowing / blended permutex2var #137422

@dzaima

This code, compiled via -O3 -march=znver4:

#include <immintrin.h>
#include <stdint.h>

void narrow_u32x16x4_to_u8x64(uint8_t* dst, __m512i x0, __m512i x1, __m512i x2, __m512i x3) {
    __m512i inds = _mm512_set_epi8(
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0,
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0
    );
    __m512i x01 = _mm512_permutex2var_epi8(x0, inds, x1);
    __m512i x23 = _mm512_permutex2var_epi8(x2, inds, x3);
    __m512i x0123 = _mm512_mask_blend_epi64(0xF0, x01, x23);
    _mm512_storeu_si512(dst, x0123);
}

produces:

narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, zmmword ptr [rip + .LCPI0_0]
        vmovdqa64       zmm5, zmmword ptr [rip + .LCPI0_1]
        vpshufb zmm1, zmm1, zmm4
        vpshufb zmm0, zmm0, zmm5
        vpshufb zmm3, zmm3, zmm4
        vpshufb zmm2, zmm2, zmm5
        vporq   zmm0, zmm0, zmm1
        vporq   zmm1, zmm2, zmm3
        vpmovsxbd       zmm3, xmmword ptr [rip + .LCPI0_3]
        vpermi2d        zmm3, zmm0, zmm1
        vmovdqu64       zmmword ptr [rdi], zmm3
        vzeroupper
        ret

instead of the more direct version that gcc produces:

narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, ZMMWORD PTR .LC0[rip]
        kmovb   k1, BYTE PTR .LC1[rip]
        vpermt2b        zmm0, zmm4, zmm1
        vpermi2b        zmm4, zmm2, zmm3
        vmovdqa64       zmm0{k1}, zmm4
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        ret

The code implements a general 64-element u32-to-u8 narrow. It should have 2x higher throughput, on both Intel and AMD, than the vpmovdb sequence that clang's autovectorizer currently emits. It also lets multiple results be merged with a blend instead of an insert, and the blend can run on more ports. So that's perhaps a separate thing that could be improved. I believe similar approaches should give a ~2x throughput boost for all narrowing conversions, on both Intel and AMD.
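To check that the index table and the 0xF0 blend mask really produce a plain truncating narrow, the two intrinsics can be modeled in scalar C. This is a minimal sketch and not part of the original report. All function names below are hypothetical. It models only the byte-selection semantics (index bit 6 picks the source, bits 5:0 pick the byte) and says nothing about performance.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm512_permutex2var_epi8(a, idx, b): bit 6 of each index
   selects the source (0 = a, 1 = b), bits 5:0 select the byte within it. */
static void permutex2var_epi8_model(uint8_t out[64], const uint8_t a[64],
                                    const uint8_t idx[64], const uint8_t b[64]) {
    for (int i = 0; i < 64; i++) {
        const uint8_t *src = (idx[i] & 0x40) ? b : a;
        out[i] = src[idx[i] & 0x3F];
    }
}

/* Scalar model of _mm512_mask_blend_epi64(k, a, b): 64-bit lane j is taken
   from b when bit j of k is set, otherwise from a. */
static void mask_blend_epi64_model(uint8_t out[64], uint8_t k,
                                   const uint8_t a[64], const uint8_t b[64]) {
    for (int j = 0; j < 8; j++) {
        const uint8_t *src = ((k >> j) & 1) ? b : a;
        memcpy(out + 8 * j, src + 8 * j, 8);
    }
}

/* Models the function from the report. src[16*v .. 16*v+15] stands in for
   the u32 lanes of x0..x3 (v = 0..3); this flat layout is an assumption. */
static void narrow_u32x16x4_to_u8x64_model(uint8_t dst[64],
                                           const uint32_t src[64]) {
    uint8_t inds[64], x[4][64], x01[64], x23[64];

    /* Same table as the _mm512_set_epi8 call, written lowest element first:
       0, 4, ..., 124, then the same 32 values again. */
    for (int i = 0; i < 64; i++)
        inds[i] = (uint8_t)((4 * i) & 127);

    /* Little-endian byte image of each vector, as laid out on x86. */
    for (int i = 0; i < 64; i++)
        for (int b = 0; b < 4; b++)
            x[i / 16][4 * (i % 16) + b] = (uint8_t)(src[i] >> (8 * b));

    permutex2var_epi8_model(x01, x[0], inds, x[1]);
    permutex2var_epi8_model(x23, x[2], inds, x[3]);
    mask_blend_epi64_model(dst, 0xF0, x01, x23);
}

/* Returns 1 if the model agrees with plain truncation on a test pattern. */
int model_matches_truncation(void) {
    uint32_t src[64];
    uint8_t dst[64];
    for (int i = 0; i < 64; i++)
        src[i] = 0x9E3779B9u * (uint32_t)(i + 1);
    narrow_u32x16x4_to_u8x64_model(dst, src);
    for (int i = 0; i < 64; i++)
        if (dst[i] != (uint8_t)src[i])
            return 0;
    return 1;
}
```

Each permute fills both 256-bit halves with the same 32 narrowed bytes, so the blend only has to take the low half from x01 and the high half from x23.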

https://godbolt.org/z/ax7Yda7Ps
