A manual 16-bit ld4 (a normal load followed by a deinterleaving shuffle) is recognized and lowered to ld4. For some reason the same is not true for ld2: the manual version is lowered to ldr + ext + uzp1 + uzp2 instead of a single ld2.
https://godbolt.org/z/danjGfMb9
manual2:
ldr q1, [x0]
ext v2.16b, v1.16b, v1.16b, #8
uzp1 v0.4h, v1.4h, v2.4h
uzp2 v1.4h, v1.4h, v2.4h
ret
intrin2:
ld2 { v0.4h, v1.4h }, [x0]
ret
manual4:
ld4 { v0.4h, v1.4h, v2.4h, v3.4h }, [x0]
stp d0, d1, [x8]
stp d2, d3, [x8, #16]
ret
intrin4:
ld4 { v0.4h, v1.4h, v2.4h, v3.4h }, [x0]
stp d0, d1, [x8]
stp d2, d3, [x8, #16]
ret
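To make clear that the manual2 sequence computes the same result as ld2, here is a small Python model of the instructions involved. It is only a sketch at halfword granularity: the helper names (ext, uzp1, uzp2, manual2, ld2) are made up for illustration and model the instruction semantics on plain lists, not real registers.

```python
def ext(vn, vm, imm):
    """Model EXT: take len(vn) elements starting at index imm of vn:vm."""
    return (vn + vm)[imm:imm + len(vn)]

def uzp1(vn, vm):
    """Model UZP1: even-indexed elements of the concatenation vn:vm."""
    return (vn + vm)[0::2]

def uzp2(vn, vm):
    """Model UZP2: odd-indexed elements of the concatenation vn:vm."""
    return (vn + vm)[1::2]

def manual2(mem):
    """Model the ldr/ext/uzp1/uzp2 sequence on 8 halfwords."""
    v1 = list(mem)                # ldr q1, [x0]
    v2 = ext(v1, v1, 4)           # ext #8 bytes == 4 halfwords: swaps the halves
    lo1, lo2 = v1[:4], v2[:4]     # the .4h views use only the low 64 bits
    return uzp1(lo1, lo2), uzp2(lo1, lo2)

def ld2(mem):
    """Model ld2 {v0.4h, v1.4h}: deinterleave 8 halfwords into two vectors."""
    return list(mem[0::2]), list(mem[1::2])

data = [10, 11, 12, 13, 14, 15, 16, 17]
print(manual2(data) == ld2(data))
```

So the four-instruction sequence is correct, just longer than the single ld2 the backend could have emitted.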
The issue is that the VectorCombinePass turns
%0 = shufflevector <8 x i16> %tmp.sroa.0.0.copyload.i, <8 x i16> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%1 = shufflevector <8 x i16> %tmp.sroa.0.0.copyload.i, <8 x i16> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%2 = bitcast <4 x i16> %0 to <8 x i8>
%3 = bitcast <4 x i16> %1 to <8 x i8>
into
%0 = bitcast <8 x i16> %tmp.sroa.0.0.copyload.i to <16 x i8>
%1 = shufflevector <16 x i8> %0, <16 x i8> poison, <8 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9, i32 12, i32 13>
%2 = bitcast <8 x i16> %tmp.sroa.0.0.copyload.i to <16 x i8>
%3 = shufflevector <16 x i8> %2, <16 x i8> poison, <8 x i32> <i32 2, i32 3, i32 6, i32 7, i32 10, i32 11, i32 14, i32 15>
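which presumably breaks the ld2 pattern recognition, since the deinterleave now appears as byte shuffles with paired indices rather than as stride-2 shuffles on i16 elements.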
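The relation between the two shuffle forms can be checked with a short Python sketch. It assumes little-endian lane order for the bitcast (as on the AArch64 target in question), and the helper names (widen_mask, bitcast_i16_to_i8, shuffle) are hypothetical, not LLVM APIs. It shows that moving the bitcast before the shuffle rescales each i16 index m into the byte index pair 2m, 2m+1.

```python
def widen_mask(mask, scale):
    """Rescale a shuffle mask for elements that are `scale` times narrower."""
    return [m * scale + k for m in mask for k in range(scale)]

def bitcast_i16_to_i8(values):
    """Little-endian bitcast of i16 lanes to i8 lanes (assumed lane order)."""
    out = []
    for v in values:
        out += [v & 0xFF, (v >> 8) & 0xFF]
    return out

def shuffle(vec, mask):
    """Single-input shufflevector: pick vec[i] for each mask index i."""
    return [vec[i] for i in mask]

even, odd = [0, 2, 4, 6], [1, 3, 5, 7]
print(widen_mask(even, 2))  # [0, 1, 4, 5, 8, 9, 12, 13]
print(widen_mask(odd, 2))   # [2, 3, 6, 7, 10, 11, 14, 15]

# shuffle-then-bitcast equals bitcast-then-shuffle with the widened mask
v = [0x0100, 0x0302, 0x0504, 0x0706, 0x0908, 0x0B0A, 0x0D0C, 0x0F0E]
for m in (even, odd):
    before = bitcast_i16_to_i8(shuffle(v, m))
    after = shuffle(bitcast_i16_to_i8(v), widen_mask(m, 2))
    assert before == after
```

The widened masks match the ones in the transformed IR above, so the rewrite preserves the values; what changes is only the form that a later deinterleave matcher would see.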