i965: Push most TES inputs in SIMD8 mode.
Using the push model for inputs is much more efficient than pulling
inputs - the hardware can simply copy a large chunk into URB registers
at thread creation time, rather than having the thread send messages to
request data from the L3 cache. Unfortunately, it's possible to have
more TES inputs than fit in registers, so we have to fall back to the
pull model in some cases.
However, it turns out that most tessellation evaluation shaders are
fairly simple, and don't use many inputs. An arbitrary cut-off of
32 vec4 slots (16 registers) is more than sufficient to ensure that
100% of TES inputs are pushed for Shadow of Mordor, Unigine Heaven,
GPUTest/TessMark, and SynMark.
Note that unlike most SIMD8 stages, this actually reads packed vec4
data, since that is what our vec4 TCS programs write.
Improves performance in GPUTest's tessmark_x64 microbenchmark
by 93.4426% +/- 5.35541% (n = 25) on my Lenovo X250 at 1024x768.
Improves performance in Synmark's Gl40TerrainFlyTess microbenchmark
by 22.74% +/- 0.309394% (n = 5).
Improves performance in Shadow of Mordor at low settings with
tessellation enabled at 1280x720 by 2.12197% +/- 0.478553% (n = 4).
shader-db statistics for files containing tessellation shaders:
total instructions in shared programs: 184358 -> 181181 (-1.72%)
instructions in affected programs: 27971 -> 24794 (-11.36%)
helped: 226
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>