gallivm: provide soa fetch path handling formats with more than 32bit
This previously always fell back to AoS conversion. Even for 4-float formats
(which is the optimal case by far for that fallback case) this was suboptimal,
since it meant the conversion couldn't be done with 256bit vectors. While this
may still only be partly possible for some formats, (unless there's AVX2
support) at least the transpose can be done with half the unpacks
(and before using the transpose for AoS fallbacks, it was worse still).
With less than 4 channels, things got way worse with the AoS fallback
quickly even with 128bit vectors.
The strategy is pretty much the same as the existing one for formats
which fit into 32 bits, except there's now multiple vectors to be
fetched (2 or 4 to be exact), which need to be shuffled first (if it's 4
vectors, this amounts to a transpose, for 2 it's a bit different),
then the unpack is done the same (with the exception that the shift
of the channels is now modulo 32, and we need to select the right
vector).
In fact the most complex part about it is to get the shuffles right
for separating into lo/hi parts for AVX/AVX2...
This also makes use of the new ability of gather to use provided type
information, which we abuse to outsmart llvm so we get decent shuffles,
and to fetch 3x32bit vectors without having to ZExt the scalar.
And just because we can, we handle double formats too, albeit they are
a bit different (draw sometimes needs to handle that).
v2: fix typo float/int bug (generating inefficient code).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>