gallivm: optimize yuv decoding
this is more a proof to show vector shifts on x86 with per-element shift count
are evil. Since we can avoid the shift with a single compare/select, use that
instead. Replaces more than 20 instructions (and slow ones at that) with about 3,
and cuts compiled shader size with mesa's yuvsqure demo by over 10%
(no performance measurements done - but selection is blazing fast).
Might want to revisit that for future cpus - unfortunately AVX won't have vector
shifts neither, but AMD's XOP will, but even in that case using selection here
is probably not slower.