aarch64: Reimplememnt vmovn/vmovl intrinsics with builtins instead
Turns out __builtin_convertvector is not as good a fit for the widening
and narrowing intrinsics as I had hoped.
During the veclower phase we lower most of it to bitfield operations and
hope DCE cleans it back up into
vector pack/unpack and extend operations. I received reports that in
more complex cases GCC fails to do that
and we're left with many vector extract operations that clutter the
output.
I think veclower can be improved on that front, but for GCC 10 I'd like
to just implement these builtins
with a good old RTL builtin rather than inline asm.
gcc/
* config/aarch64/aarch64-simd.md (aarch64_<su>xtl<mode>):
Define.
(aarch64_xtn<mode>): Likewise.
* config/aarch64/aarch64-simd-builtins.def (sxtl, uxtl, xtn):
Define
builtins.
* config/aarch64/arm_neon.h (vmovl_s8): Reimplement using
builtin.
(vmovl_s16): Likewise.
(vmovl_s32): Likewise.
(vmovl_u8): Likewise.
(vmovl_u16): Likewise.
(vmovl_u32): Likewise.
(vmovn_s16): Likewise.
(vmovn_s32): Likewise.
(vmovn_s64): Likewise.
(vmovn_u16): Likewise.
(vmovn_u32): Likewise.
(vmovn_u64): Likewise.