llvmpipe: use single swizzled tile
Use a single swizzled tile per colorbuf (and per thread) to avoid
accumulating large amounts of cached swizzled data.
Now that the SSE3 code has been merged to master, the performance delta
of this change is minimal, the main benefit is reduced memory usage
due to no longer keeping swizzled copies of render targets.
It's clear from the performance of the in-place version of this code
that there is still quite a bit of time being spent swizzling &
unswizzling, but it's not clear exactly how to reduce that.