i965/fs: Improve performance of shaders that start out with a discard.
I had tried this in the past, but ran into trouble with applications
that sample from undiscarded pixels in the same subspan. To fix that
issue, only jump to the end for an entire subspan at a time.
Improves GLbenchmark 2.7 (1024x768) performance by 7.9 +/- 1.5% (n=8).
v2: Drop the br variable in the jump instruction -- if I ever do jumps
pre-gen6, it'll be a different code block anyway since we don't have
HALT until gen6.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>