i965/tiled_memcpy: ytiled_to_linear a cache line at a time
Similar to the transformation applied to linear_to_ytiled, also align
each readback from the ytiled source to a cacheline (i.e. transfer a
whole cacheline from the source before moving on to the next column).
This will allow us to utilize movntqda (_mm_stream_si128) in a
subsequent patch to obtain near WB readback performance when accessing
the uncached ytiled memory, an order of magnitude improvement.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>