gallivm: support avx512 (16x32) in interleave2_half
lp_build_interleave2_half was not doing the right thing for avx512-style
16-wide loads.
This path is hit in the swr driver with a 16-wide vertex shader. It is
called from lp_build_transpose_aos, when doing texel fetches and the
fetched data needs to be transposed to one component per output register.
Special-case the post-load swizzle operations for avx512 16x32 (16-wide
32-bit values) so that we move the xyzw components correctly to the outputs.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>