mesa.git
11 years ago gallium/u_blitter: fix is_blit_generic_supported() stencil checking
Brian Paul [Fri, 5 Apr 2013 17:21:09 +0000 (11:21 -0600)]
gallium/u_blitter: fix is_blit_generic_supported() stencil checking

Don't check if there's sampler support for stencil if we're not
going to actually blit/copy stencil values.  Fixes the case where
we mistakenly said we can't support a blit of depth values from
S8Z24 to X8Z24.

Also, rename the is_stencil variable to dst_has_stencil to improve
readability.

NOTE: This is a candidate for the stable branches.

Reviewed-by: Marek Olšák <maraeo@gmail.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago Honor GLX_DONT_CARE in MATCH_MASK
Alexander Monakov [Mon, 1 Apr 2013 21:38:27 +0000 (01:38 +0400)]
Honor GLX_DONT_CARE in MATCH_MASK

NOTE: This is a candidate for stable branches.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47478
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62999
Bugzilla: http://bugs.winehq.org/show_bug.cgi?id=26763

11 years ago freedreno: use autogenerated register defs
Rob Clark [Fri, 5 Apr 2013 16:54:37 +0000 (12:54 -0400)]
freedreno: use autogenerated register defs

Switch to use the envytools generated headers for register/bitfield
definitions.  This is the first step in preparing to add a3xx support,
since it avoids having conflicting names for a3xx and a2xx registers.
And since I'm using envytools for a3xx it is simpler to just use it for
everything.

This shouldn't cause any functional change, it is really just a lot of
renaming.

Signed-off-by: Rob Clark <robdclark@gmail.com>
11 years ago st/wgl: Install our windows message hook to threads created before the ICD is loaded.
José Fonseca [Thu, 4 Apr 2013 19:27:39 +0000 (20:27 +0100)]
st/wgl: Install our windows message hook to threads created before the ICD is loaded.

Otherwise we will not receive destroy windows events, causing framebuffers
to leak.

This happens particularly with java and jogl.

Tested with java + jogl, MATLAB.

VMware Internal Bug Number: 1013086.

Reviewed-by: Brian Paul <brianp@vmware.com>
11 years ago llvmpipe: Work without sse2 if llvm is new enough
Adam Jackson [Thu, 4 Apr 2013 21:16:22 +0000 (17:16 -0400)]
llvmpipe: Work without sse2 if llvm is new enough

At least on llvm 3.2 this appears to work fine.  Tested on an Athlon XP
2600+, which has sse and 3dnow but not sse2.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Signed-off-by: Adam Jackson <ajax@redhat.com>
11 years ago winsys/radeon: add command stream replay dump for faulty lockup v3
Jerome Glisse [Wed, 27 Mar 2013 15:04:29 +0000 (11:04 -0400)]
winsys/radeon: add command stream replay dump for faulty lockup v3

Build time option, set RADEON_CS_DUMP_ON_LOCKUP to 1 in radeon_drm_cs.h to
enable it.

When enabled, after each cs submission the code tries to detect a lockup by
waiting for one of the cs's buffers to become idle. After a timeout it
assumes the cs triggered a lockup and writes a radeon_lockup.c file in the
current directory that has all the information needed to replay the cs.

To build this file:
gcc -O0 -g radeon_lockup.c -ldrm -o radeon_lockup -I/usr/include/libdrm

v2: Add radeon_ctx.h file to mesa git tree
v3: Slightly improve dumped file for easier editing, only dump first faulty cs

Signed-off-by: Jerome Glisse <jglisse@redhat.com>
11 years ago st/xlib: add HUD support for xlib/GLX
Brian Paul [Thu, 4 Apr 2013 20:06:51 +0000 (14:06 -0600)]
st/xlib: add HUD support for xlib/GLX

For the softpipe and llvmpipe drivers.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago gallium/hud: add GALLIUM_HUD_PERIOD env var
Brian Paul [Thu, 4 Apr 2013 22:37:56 +0000 (16:37 -0600)]
gallium/hud: add GALLIUM_HUD_PERIOD env var

To set the graph update rate, in seconds.  The default update rate
has also been changed to 1/2 second.
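
A minimal sketch of how such an environment variable might be read. The
helper name hud_period_seconds() and its validation rules are assumptions
for illustration, not the actual gallium/hud code; only the variable name
and the 1/2 second default come from the commit message.

```c
#include <stdlib.h>

/* Hypothetical helper (not the real HUD code): returns the graph update
 * period in seconds, falling back to the 1/2 second default when the
 * variable is unset, empty, unparsable, or not positive. */
static double hud_period_seconds(void)
{
    const double default_period = 0.5;
    const char *value = getenv("GALLIUM_HUD_PERIOD");
    char *end = NULL;
    double period;

    if (value == NULL || *value == '\0')
        return default_period;

    period = strtod(value, &end);

    /* Reject input with no digits, trailing garbage, or a period <= 0. */
    if (end == value || *end != '\0' || period <= 0.0)
        return default_period;

    return period;
}
```

With this sketch, running an application with GALLIUM_HUD_PERIOD=2.5 in
the environment would yield a period of 2.5 seconds.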

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years ago gallium/hud: initialize sampler state
Brian Paul [Thu, 4 Apr 2013 22:24:40 +0000 (16:24 -0600)]
gallium/hud: initialize sampler state

The default wrap mode (PIPE_TEX_WRAP_REPEAT) is incompatible with
unnormalized texcoords (at least for softpipe).

v2: use PIPE_TEX_WRAP_CLAMP_TO_EDGE

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years ago glsl: Add an optimization pass to flatten simple nested if blocks.
Kenneth Graunke [Thu, 4 Apr 2013 06:56:57 +0000 (23:56 -0700)]
glsl: Add an optimization pass to flatten simple nested if blocks.

GLBenchmark 2.7's shaders contain conditional blocks like:

if (x) {
    if (y) {
        ...
    }
}

where the outer conditional's then clause contains exactly one statement
(the nested if) and there are no else clauses.  This can easily be
optimized into:

if (x && y) {
    ...
}

This saves a few instructions in GLBenchmark 2.7:

    total instructions in shared programs: 11833 -> 11649 (-1.55%)
    instructions in affected programs:     8234 -> 8050 (-2.23%)

It also helps CS:GO slightly (-0.05%/-0.22%).  More importantly,
however, it simplifies the control flow graph, which could enable other
optimizations.
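
The transformation can be sketched on a toy statement tree. This is only
an illustration under assumed types: struct expr, struct stmt and
flatten_nested_if() are hypothetical and are not the GLSL IR classes or
the pass added by this commit.

```c
#include <stddef.h>

enum expr_kind { EXPR_VAR, EXPR_AND };

/* Toy expression node (hypothetical, not GLSL IR). */
struct expr {
    enum expr_kind kind;
    const char *name;          /* used by EXPR_VAR */
    struct expr *lhs, *rhs;    /* used by EXPR_AND */
};

/* Toy statement node (hypothetical, not GLSL IR). */
struct stmt {
    int is_if;                 /* nonzero for an if statement */
    struct expr *cond;         /* condition when is_if is set */
    struct stmt *then_body;    /* first statement of the then clause */
    struct stmt *else_body;    /* first statement of the else clause */
    struct stmt *next;         /* next statement in the same block */
};

/* Rewrites "if (x) { if (y) { ... } }" into "if (x && y) { ... }".
 * and_node is caller-provided storage for the new condition.
 * Returns 1 if the statement was flattened, 0 if it did not qualify. */
static int flatten_nested_if(struct stmt *outer, struct expr *and_node)
{
    struct stmt *inner;

    if (!outer->is_if || outer->else_body != NULL)
        return 0;

    /* The then clause must hold exactly one statement: an if without else. */
    inner = outer->then_body;
    if (inner == NULL || inner->next != NULL ||
        !inner->is_if || inner->else_body != NULL)
        return 0;

    and_node->kind = EXPR_AND;
    and_node->name = NULL;
    and_node->lhs = outer->cond;
    and_node->rhs = inner->cond;

    outer->cond = and_node;
    outer->then_body = inner->then_body;
    return 1;
}
```

Any statement that fails the checks is left untouched, which mirrors the
"exactly one statement, no else clauses" restriction described above.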

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
11 years ago i965: Use a variable for the push constant size in kB.
Kenneth Graunke [Wed, 3 Apr 2013 04:11:51 +0000 (21:11 -0700)]
i965: Use a variable for the push constant size in kB.

This clarifies that the offset of 2 is actually 16 kB / 8 kB units.
It also keys both computations off of a single variable, which should
make it easier to change in the future.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
11 years ago i965: Turn brw->urb.vs_size and gs_size into local variables.
Kenneth Graunke [Wed, 3 Apr 2013 04:11:50 +0000 (21:11 -0700)]
i965: Turn brw->urb.vs_size and gs_size into local variables.

These variables are only used within a single function, so we may as
well make them local variables.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
11 years ago i965: Remove BRW_NEW_WM_INPUT_DIMENSIONS dirty bit.
Kenneth Graunke [Wed, 13 Mar 2013 05:16:37 +0000 (22:16 -0700)]
i965: Remove BRW_NEW_WM_INPUT_DIMENSIONS dirty bit.

This was only produced by the brw_wm_input_dimensions atom, which was
removed in the previous commit.  So there's no need for the dirty bit.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
11 years ago i965: Delete brw_vs_constval.c and the brw_wm_input_sizes atom.
Kenneth Graunke [Wed, 13 Mar 2013 04:12:08 +0000 (21:12 -0700)]
i965: Delete brw_vs_constval.c and the brw_wm_input_sizes atom.

This was only used to compute proj_attrib_mask, which was removed by the
previous commit.  That makes this dead code.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
11 years ago i965: Remove now dead brw_wm_prog_key::proj_attrib_mask field.
Kenneth Graunke [Wed, 13 Mar 2013 04:09:35 +0000 (21:09 -0700)]
i965: Remove now dead brw_wm_prog_key::proj_attrib_mask field.

The previous commit removed the last user of this field, so there's no
longer any point in setting it.  Removing this should eliminate
state-dependent recompiles, and make the precompile more reliable.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
11 years ago i965: Remove fixed-function texture projection avoidance optimization.
Kenneth Graunke [Wed, 13 Mar 2013 04:09:19 +0000 (21:09 -0700)]
i965: Remove fixed-function texture projection avoidance optimization.

This optimization attempts to avoid extra attribute interpolation
instructions for texture coordinates where the W-component is 1.0.

Unfortunately, it requires a lot of complexity: the brw_wm_input_sizes
state atom (all the brw_vs_constval.c code) needs to run on each draw.
It computes the input_size_masks array, then uses that to compute
proj_attrib_mask.  Differences in proj_attrib_mask can cause
state-dependent fragment shader recompiles.  We also often fail to guess
proj_attrib_mask for the fragment shader precompile, causing us to
needlessly compile it twice.

Furthermore, this optimization only applies to fixed-function programs;
it does not help modern GLSL-based programs at all.  Generally, older
fixed-function programs run fine on modern hardware anyway.

The optimization has existed in some form since the initial commit.  When
we rewrote the fragment shader backend, we dropped it for a while.  Eric
readded it in commit eb30820f268608cf451da32de69723036dddbc62 as part of
an attempt to cure a ~1% performance regression caused by converting the
fixed-function fragment shader generation code from Mesa IR to GLSL IR.
However, no performance data was included in the commit message, so it's
unclear whether or not it was successful.

Time has passed, so I decided to re-measure this.  Surprisingly,
Eric's OpenArena timedemo actually runs /faster/ after removing this and
the brw_wm_input_sizes atom.  On Ivybridge at 1024x768, I measured a
1.39532% +/- 0.91833% increase in FPS (n = 55).  On Ironlake, there was
no statistically significant difference (n = 37).

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
11 years ago i965: Use ctx->Stencil._WriteEnabled in DEPTH_STENCIL_STATE.
Kenneth Graunke [Tue, 2 Apr 2013 17:28:07 +0000 (10:28 -0700)]
i965: Use ctx->Stencil._WriteEnabled in DEPTH_STENCIL_STATE.

This is the same computation as the _WriteEnabled flag, so we may as
well use it.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
11 years ago i965: Fix stencil write enable flag in 3DSTATE_DEPTH_BUFFER on Gen7+.
Kenneth Graunke [Tue, 2 Apr 2013 17:29:37 +0000 (10:29 -0700)]
i965: Fix stencil write enable flag in 3DSTATE_DEPTH_BUFFER on Gen7+.

ctx->Stencil.WriteMask is a statically sized array of 3 elements.
Checking it against 0 actually is a NULL check, and can never fail,
which meant that we always said stencil writes were enabled.
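
The pitfall can be reproduced in isolation. In this sketch struct
stencil_state and all function names are hypothetical stand-ins; only the
3-element WriteMask array comes from the commit message, and the "fixed"
test is an assumed simplification of what a derived flag might compute.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the stencil state (hypothetical). */
struct stencil_state {
    bool enabled;
    unsigned WriteMask[3];
};

static bool pointer_is_nonnull(const unsigned *p)
{
    return p != NULL;
}

/* Buggy pattern: the array decays to a pointer to its first element, so
 * this tests an address, not the mask values, and is always true.
 * Writing "s->WriteMask != 0" inline behaves the same way. */
static bool stencil_writes_buggy(const struct stencil_state *s)
{
    return pointer_is_nonnull(s->WriteMask);
}

/* Assumed fix for illustration: look at the mask values themselves. */
static bool stencil_writes_fixed(const struct stencil_state *s)
{
    size_t i;

    if (!s->enabled)
        return false;

    for (i = 0; i < 3; i++) {
        if (s->WriteMask[i] != 0)
            return true;
    }
    return false;
}
```

With every mask set to zero the buggy test still reports writes as
enabled, while the value-based test does not.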

Use the new core Mesa derived state flag to fix this.

NOTE: This is a candidate for stable branches.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
11 years ago mesa: Add new ctx->Stencil._WriteEnabled derived state flag.
Kenneth Graunke [Tue, 2 Apr 2013 17:22:18 +0000 (10:22 -0700)]
mesa: Add new ctx->Stencil._WriteEnabled derived state flag.

i965 needs to know whether stencil writes are enabled in several places,
and gets the test wrong sometimes.  While we could create a function to
compute this, it seems generally useful enough to warrant a new piece of
derived state.  Also, all the plumbing is already in place.

NOTE: This is a candidate for stable branches.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
11 years ago gallivm: some minor cube map cleanup
Roland Scheidegger [Thu, 4 Apr 2013 21:20:49 +0000 (23:20 +0200)]
gallivm: some minor cube map cleanup

The ar_ge_as_at variable was just very very confusing since the condition
was actually the other way around (as_at_ge_ar). So change the condition
(and the selects depending on it) to match the variable name.
And also change the chosen major axis in case the coord values are the
same. OpenGL doesn't care one bit which one is chosen in this case but
it looks like dx10 would require z chosen over y, and y chosen over x
(previously did x chosen over y, y chosen over z). Since it's all the
same effort just honor dx10's wishes. (Though actually, for some preferred
orderings, we could save one (or two with derivatives) selects since the
tnewx and tnewz (and the corresponding dmax values) are the same.)
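
The tie-breaking order described above (z over y, y over x) can be
sketched in scalar C. The names here are hypothetical and the real code
builds vectorized LLVM IR with selects rather than branching like this.

```c
enum cube_axis { AXIS_X = 0, AXIS_Y = 1, AXIS_Z = 2 };

static float abs_float(float v)
{
    return v < 0.0f ? -v : v;
}

/* Picks the major axis of a cube map coordinate. When magnitudes are
 * equal, z is chosen over y and y over x. */
static enum cube_axis cube_major_axis(float x, float y, float z)
{
    const float ax = abs_float(x);
    const float ay = abs_float(y);
    const float az = abs_float(z);

    /* ">=" makes z win ties against both x and y. */
    if (az >= ax && az >= ay)
        return AXIS_Z;

    /* ">=" makes y win a tie against x. */
    if (ay >= ax)
        return AXIS_Y;

    return AXIS_X;
}
```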

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago i965: Ask the register allocator to round-robin through registers.
Eric Anholt [Sat, 1 Dec 2012 00:34:09 +0000 (16:34 -0800)]
i965: Ask the register allocator to round-robin through registers.

The way we were allocating registers before, packing into low register
numbers for Ironlake, resulted in an overly-constrained dependency graph
for instruction scheduling.  Improves GLBenchmark 2.1 performance by
4.5% +/- 0.7% (n=26).  No difference on my old GLSL demo (n=20).  No
difference on nexuiz (n=15).
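
A toy comparison of the two selection policies, assuming a flat pool of
eight registers. The real allocator is graph-coloring based; struct
reg_pool and these functions are hypothetical and only illustrate why
lowest-first packing keeps reusing the same register while round-robin
spreads values out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 8

struct reg_pool {
    bool in_use[NUM_REGS];
    unsigned next;             /* where the round-robin search starts */
};

/* Lowest-first: always hands out the lowest free register. */
static int alloc_lowest(struct reg_pool *pool)
{
    unsigned r;

    for (r = 0; r < NUM_REGS; r++) {
        if (!pool->in_use[r]) {
            pool->in_use[r] = true;
            return (int) r;
        }
    }
    return -1;                 /* pool exhausted */
}

/* Round-robin: starts searching after the most recently allocated one. */
static int alloc_round_robin(struct reg_pool *pool)
{
    unsigned i;

    for (i = 0; i < NUM_REGS; i++) {
        unsigned r = (pool->next + i) % NUM_REGS;

        if (!pool->in_use[r]) {
            pool->in_use[r] = true;
            pool->next = (r + 1) % NUM_REGS;
            return (int) r;
        }
    }
    return -1;                 /* pool exhausted */
}

static void release_reg(struct reg_pool *pool, int r)
{
    assert(r >= 0 && r < NUM_REGS);
    pool->in_use[r] = false;
}
```

After an allocate/release/allocate sequence, lowest-first returns the same
register twice, which makes the second value depend on the first one's
register; round-robin returns a different register.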

v2: Fix off-by-one bug that made the change only work for 16-wide on i965.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years ago llvmpipe: implement ucmp
Zack Rusin [Thu, 4 Apr 2013 04:15:13 +0000 (21:15 -0700)]
llvmpipe: implement ucmp

and add a test for it

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago Avoid spurious GCC warnings in STATIC_ASSERT() macro.
Paul Berry [Tue, 2 Apr 2013 16:51:47 +0000 (09:51 -0700)]
Avoid spurious GCC warnings in STATIC_ASSERT() macro.

GCC 4.8 now warns about typedefs that are local to a scope and not
used anywhere within that scope.  This produced spurious warnings with
the STATIC_ASSERT() macro (which used a typedef to provoke a compile
error in the event of an assertion failure).

This patch switches to a simpler technique that avoids the warning.
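
For illustration, here is the typedef-based style next to one typedef-free
alternative that uses only standard C. Both macro bodies are a sketch and
are not claimed to be the exact text of this patch.

```c
/* Typedef-based style: the array size becomes -1 when COND is false,
 * which is a compile error. The typedef itself is never used, which is
 * what triggers the unused-local-typedef warning inside a function. */
#define STATIC_ASSERT_TYPEDEF(COND) \
    typedef char static_assert_failed_[(COND) ? 1 : -1]

/* Typedef-free alternative: take sizeof an array type whose size is
 * 1 when COND holds and -1 (a compile error) when it does not. */
#define STATIC_ASSERT(COND) \
    do { (void) sizeof(char[1 - 2 * !(COND)]); } while (0)

/* Example use inside a function body. */
static int sizes_are_sane(void)
{
    STATIC_ASSERT(sizeof(char) == 1);
    STATIC_ASSERT(sizeof(int) >= 2);
    return 1;
}
```

Changing either condition to something false, such as
sizeof(char) == 2, should make the translation unit fail to compile.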

v2: Avoid GCC-specific syntax.  Also update p_compiler.h.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years ago freedreno: document debug flag
Erik Faye-Lund [Tue, 26 Mar 2013 13:48:45 +0000 (14:48 +0100)]
freedreno: document debug flag

Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Brian Paul <brianp@vmware.com>
11 years ago st/wgl: add HUD support
Brian Paul [Wed, 3 Apr 2013 19:46:40 +0000 (13:46 -0600)]
st/wgl: add HUD support

v2: fix a few minor issues spotted by Jose.

Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago st/wgl: make stw_current_context() non-static
Brian Paul [Wed, 3 Apr 2013 19:45:47 +0000 (13:45 -0600)]
st/wgl: make stw_current_context() non-static

Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago util: add debug_memory_check_block(), debug_memory_tag()
Brian Paul [Wed, 3 Apr 2013 19:36:50 +0000 (13:36 -0600)]
util: add debug_memory_check_block(), debug_memory_tag()

The former just checks that the given block is valid by checking
the header and footer.

The latter sets the memory block's tag.  With extra debug code, we
can use that for monitoring/checking particular allocations.
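
A header/footer check of this kind can be sketched as follows. The block
layout, magic values and tracked_* names are assumptions made up for the
example; they are not the layout or API of the util debug memory code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_MAGIC 0x12345678u   /* hypothetical */
#define FOOTER_MAGIC 0x87654321u   /* hypothetical */

struct block_header {
    unsigned magic;
    unsigned tag;
    size_t size;                   /* size of the user area in bytes */
};

struct block_footer {
    unsigned magic;
};

/* Allocates header + user area + footer and returns the user area. */
static void *tracked_malloc(size_t size)
{
    struct block_footer ftr = { FOOTER_MAGIC };
    struct block_header *hdr =
        malloc(sizeof *hdr + size + sizeof ftr);

    if (hdr == NULL)
        return NULL;

    hdr->magic = HEADER_MAGIC;
    hdr->tag = 0;
    hdr->size = size;
    /* The footer may be unaligned, so copy it instead of casting. */
    memcpy((char *) (hdr + 1) + size, &ftr, sizeof ftr);
    return hdr + 1;
}

/* Returns 1 if both the header and the footer are intact, else 0. */
static int tracked_check_block(const void *ptr)
{
    const struct block_header *hdr =
        (const struct block_header *) ptr - 1;
    struct block_footer ftr;

    if (hdr->magic != HEADER_MAGIC)
        return 0;

    memcpy(&ftr, (const char *) ptr + hdr->size, sizeof ftr);
    return ftr.magic == FOOTER_MAGIC;
}

/* Stores a caller-chosen tag in the block's header. */
static void tracked_tag(void *ptr, unsigned tag)
{
    struct block_header *hdr = (struct block_header *) ptr - 1;

    assert(hdr->magic == HEADER_MAGIC);
    hdr->tag = tag;
}

static void tracked_free(void *ptr)
{
    if (ptr != NULL)
        free((struct block_header *) ptr - 1);
}
```

A write one byte past the end of the user area lands in the footer, so a
later tracked_check_block() call reports the block as corrupted.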

Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago gallium/hud: replace malloc w/ MALLOC
Brian Paul [Wed, 3 Apr 2013 19:33:38 +0000 (13:33 -0600)]
gallium/hud: replace malloc w/ MALLOC

To match the FREE() call used later.  Fixes things on Windows.

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years ago r600g/llvm: Workaround for wrong tex.offset_*
Vincent Lejeune [Wed, 3 Apr 2013 19:19:22 +0000 (21:19 +0200)]
r600g/llvm: Workaround for wrong tex.offset_*

11 years ago gallivm: honor explicit derivatives values for cube maps.
Roland Scheidegger [Wed, 3 Apr 2013 22:56:23 +0000 (00:56 +0200)]
gallivm: honor explicit derivatives values for cube maps.

This is trivial now, though need to make sure we pass all the necessary
derivative values (which is 3 each for ddx/ddy not 2).
Passes piglit arb_shader_texture_lod-texgradcube test.

v2: add the forgotten abs() for all incoming derivatives (discovered
by new piglit arb_shader_texture_lod-texgradcube test, though more by
luck as it was failing only for exactly one pixel...).

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago gallivm: do per-pixel cube face selection (finally!!!)
Roland Scheidegger [Wed, 3 Apr 2013 01:26:22 +0000 (03:26 +0200)]
gallivm: do per-pixel cube face selection (finally!!!)

This proved to be tricky, the problem is that after selection/mirroring
we cannot calculate reasonable derivatives (if not all pixels in a quad
end up on the same face the derivatives could get "randomly" exceedingly
large).
However, it is actually quite easy to simply calculate the derivatives
before selection/mirroring and then transform them similar to
the cube coordinates (they only need selection/projection, but not
mirroring as we're not interested in the sign bit, of course). While
there is a tiny bit more work to do (need to calculate derivs for 3
coords instead of 2, and additional selects) it also simplifies things
somewhat for the coord selection itself (as we save some broadcast aos
shuffles, and we don't need to calculate the average vector) - hence if
derivatives aren't needed this should actually be faster.
Also, this has the benefit that this will (trivially) work for explicit
derivatives too, which we completely ignored before that (will be in a
separate commit for better trackability).
Note that while the way for getting rho looks very different, it should
result in "nearly" the same values as before (the "nearly" is only because
before the code would choose the face based on an "average" vector and hence
the derivatives calculated according to this face, where now (for implicit
derivatives) the derivatives are projected on the face selected for the
first (top-left) pixel in a quad, so not necessarily the same face).
The transformation done might not quite be state-of-the-art, calculating
length(dx,dy) as max(dx,dy) certainly isn't either but this stays the
same as before (that is I think a better transform would _somehow_ take
the "derivative major axis" into account so that derivative changes in
the major axis wouldn't get ignored).
Should solve some accuracy problems with cubemaps (can easily be seen with
the cubemap demo when switching wrapping/filtering), though we still don't
do seamless filtering to fix it completely (so not per-sample but per-pixel
is certainly better than per-quad and already sufficient for accurate
results with nearest tex filter).

As for performance, it seems to be a tiny bit faster too (maybe 3% or so
with cubemap demo). Which I'd have expected with nearest/nearest filtering
where this will be less instructions, but the difference seems to actually
be larger with linear/linear_mipmap_linear where it is slightly more
instructions, probably the code appears less serialized allowing better
scheduling (on a sandy bridge cpu). It actually seems to be now at least
as fast as the old path using a conditional when using 128bit vectors too
(that is probably more a result of testing with a newer cpu though), for now
that old path is still there but unused.
No piglit regressions.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago gallivm: minor rho calculation optimization for 1 or 3 coords
Roland Scheidegger [Wed, 3 Apr 2013 00:49:56 +0000 (02:49 +0200)]
gallivm: minor rho calculation optimization for 1 or 3 coords

Using a different packing for the single coord case should save a shuffle.
Plus some minor style fixes.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago gallivm: use f16c hw support for float->half and half->float conversion
Roland Scheidegger [Tue, 2 Apr 2013 23:06:52 +0000 (01:06 +0200)]
gallivm: use f16c hw support for float->half and half->float conversion

Should be way faster of course on cpus supporting this (includes AMD
Bulldozer and Jaguar cores, Intel Ivy Bridge and up (except budget models)).
Passes piglit fbo-blending-formats GL_ARB_texture_float -auto on Ivy Bridge.

Reviewed-by: Brian Paul <brianp@vmware.com>
11 years ago draw/llvmpipe: allow independent so attachments to the vs
Zack Rusin [Sat, 30 Mar 2013 13:21:41 +0000 (06:21 -0700)]
draw/llvmpipe: allow independent so attachments to the vs

When geometry shaders are present, one needs to be able to create
an empty geometry shader with stream output that needs to be
resolved later and attached to the currently bound vertex shader.
Let's add support for it to llvmpipe and draw. draw allows attaching
independent stream output info to any vertex shader and llvmpipe
resolves at draw time which vertex shader the given empty geometry
shader should be linked to.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago llvmpipe: reset so buffers when not appending
Zack Rusin [Sat, 30 Mar 2013 07:21:03 +0000 (00:21 -0700)]
llvmpipe: reset so buffers when not appending

We need to reset the internal state of the so buffers or we'll
keep appending even though we're not supposed to.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw: remove unused function
Zack Rusin [Sat, 30 Mar 2013 07:20:05 +0000 (00:20 -0700)]
draw: remove unused function

we use draw_set_mapped_so_targets nowadays

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw/llvm: use an enum instead of magic numbers
Zack Rusin [Sat, 30 Mar 2013 02:33:34 +0000 (19:33 -0700)]
draw/llvm: use an enum instead of magic numbers

I think this was there before and got accidentally
removed during a merge. Same code as for the GS
context, which is also using an enum instead of
hardcoded numbers.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw/gs: cleanup some debugging code
Zack Rusin [Sat, 30 Mar 2013 00:18:42 +0000 (17:18 -0700)]
draw/gs: cleanup some debugging code

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw/so: maintain an exact number of written vertices
Zack Rusin [Fri, 29 Mar 2013 11:52:29 +0000 (04:52 -0700)]
draw/so: maintain an exact number of written vertices

It's quite helpful during the rendering when we know
exactly the count of the vertices available in the
buffer.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw: Implement support for primitive id
Zack Rusin [Fri, 29 Mar 2013 11:50:32 +0000 (04:50 -0700)]
draw: Implement support for primitive id

We were largely ignoring primitive id.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw/so: Fix bogus assert
Zack Rusin [Thu, 28 Mar 2013 03:13:13 +0000 (20:13 -0700)]
draw/so: Fix bogus assert

We do support so with multiple primitives.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago draw/gs: Fix memory corruption with multiple primitives
Zack Rusin [Thu, 28 Mar 2013 03:11:16 +0000 (20:11 -0700)]
draw/gs: Fix memory corruption with multiple primitives

We were flushing with incorrect number of primitives. TGSI exec
can only work with a single primitive at a time. Plus the fetching
with multiple primitives on llvm paths wasn't copying the last
element.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago gallivm: cleanup the gs interface
Zack Rusin [Wed, 27 Mar 2013 11:27:59 +0000 (04:27 -0700)]
gallivm: cleanup the gs interface

Instead of void pointers use a base interface.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
11 years ago svga: add new memory-used HUD query
Brian Paul [Wed, 3 Apr 2013 16:23:57 +0000 (10:23 -0600)]
svga: add new memory-used HUD query

To track the amount of memory used by all pipe_resources (textures
and buffers).

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago util: add new util_resource_size() function in u_resource.[ch]
Brian Paul [Wed, 3 Apr 2013 16:23:16 +0000 (10:23 -0600)]
util: add new util_resource_size() function in u_resource.[ch]

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago util: move functions from u_resource.c to u_transfer.c
Brian Paul [Wed, 3 Apr 2013 16:21:34 +0000 (10:21 -0600)]
util: move functions from u_resource.c to u_transfer.c

The functions are prototyped in u_transfer.h and are related to the
other functions in u_transfer.c.

The next patch will re-use the u_resource.c file for new code.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago r600g/llvm: Do not override llvm provided stack_size
Vincent Lejeune [Wed, 3 Apr 2013 16:39:18 +0000 (18:39 +0200)]
r600g/llvm: Do not override llvm provided stack_size

11 years ago r600g/llvm: Do not change cf_alu inst when adding alus
Vincent Lejeune [Tue, 2 Apr 2013 17:19:24 +0000 (19:19 +0200)]
r600g/llvm: Do not change cf_alu inst when adding alus

11 years ago radeonsi: add more cases for copying unsupported formats to resource_copy_region
Marek Olšák [Tue, 2 Apr 2013 22:47:06 +0000 (18:47 -0400)]
radeonsi: add more cases for copying unsupported formats to resource_copy_region

Ported from r600g commit:

8891b2f9c91b2f6c8625184c23a10b8e55875dc0

Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
NOTE: This is a candidate for the 9.1 branch.

11 years ago svga: add HUD queries for number of draw calls, number of fallbacks
Brian Paul [Mon, 1 Apr 2013 23:51:43 +0000 (17:51 -0600)]
svga: add HUD queries for number of draw calls, number of fallbacks

The fallbacks count is the number of drawing calls that use a "draw"
module fallback, such as polygon stipple.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago svga: refactor occlusion query code
Brian Paul [Mon, 1 Apr 2013 23:49:31 +0000 (17:49 -0600)]
svga: refactor occlusion query code

This is in preparation for adding new query types for the HUD.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years ago gallium/hud: try L8 texture for font if I8 format isn't supported
Brian Paul [Mon, 1 Apr 2013 22:44:50 +0000 (16:44 -0600)]
gallium/hud: try L8 texture for font if I8 format isn't supported

11 years ago svga: add case for PIPE_CAP_QUERY_PIPELINE_STATISTICS
Brian Paul [Wed, 3 Apr 2013 14:19:44 +0000 (08:19 -0600)]
svga: add case for PIPE_CAP_QUERY_PIPELINE_STATISTICS

11 years ago st/mesa: rewrite comment in st_manager.c
Brian Paul [Tue, 2 Apr 2013 20:33:42 +0000 (14:33 -0600)]
st/mesa: rewrite comment in st_manager.c

11 years ago nv50,nvc0: remove MS resolve formats hack
Christoph Bumiller [Wed, 3 Apr 2013 11:19:15 +0000 (13:19 +0200)]
nv50,nvc0: remove MS resolve formats hack

Mesa now allows BlitFramebuffer resolve between RGBA and BGRA.

11 years ago nvc0: fix 128 bit compressed storage type selection
Christoph Bumiller [Tue, 2 Apr 2013 23:17:46 +0000 (01:17 +0200)]
nvc0: fix 128 bit compressed storage type selection

11 years ago nvc0: place staging textures in GART and map them directly
Christoph Bumiller [Tue, 2 Apr 2013 22:18:55 +0000 (00:18 +0200)]
nvc0: place staging textures in GART and map them directly

11 years ago nv50: account for pesky prefetch in size calculation of linear textures
Christoph Bumiller [Tue, 2 Apr 2013 22:18:29 +0000 (00:18 +0200)]
nv50: account for pesky prefetch in size calculation of linear textures

11 years ago nvc0: honour scaled coordinates setting for linear textures
Christoph Bumiller [Tue, 2 Apr 2013 14:24:06 +0000 (16:24 +0200)]
nvc0: honour scaled coordinates setting for linear textures

11 years ago nvc0: fix for 2d engine R source formats writing RRR1 and not R001
Christoph Bumiller [Sat, 30 Mar 2013 20:28:30 +0000 (21:28 +0100)]
nvc0: fix for 2d engine R source formats writing RRR1 and not R001

11 years ago nv50,nvc0: disable DEPTH_RANGE_NEAR/FAR clipping during blit
Christoph Bumiller [Sun, 31 Mar 2013 20:10:02 +0000 (22:10 +0200)]
nv50,nvc0: disable DEPTH_RANGE_NEAR/FAR clipping during blit

We send position.z == 0, DEPTH_RANGE may be some arbitrary range
not including 0 (for example in piglit's hiz tests).

11 years ago st/mesa: fix bitmap,drawpix,drawtex for PIPE_CAP_TGSI_TEXCOORD
Christoph Bumiller [Sat, 30 Mar 2013 13:57:21 +0000 (14:57 +0100)]
st/mesa: fix bitmap,drawpix,drawtex for PIPE_CAP_TGSI_TEXCOORD

NOTE: Changed the semantic index for the drawtex coordinate to
be the texture unit index instead of always 0.
Not sure if this is correct but since the value seems to depend
on the unit it would make sense to use different varying slots.

11 years ago nouveau: accelerate buffer copies in resource_copy_region
Christoph Bumiller [Sat, 30 Mar 2013 14:55:20 +0000 (15:55 +0100)]
nouveau: accelerate buffer copies in resource_copy_region

11 years ago nvc0: demagic some of the NVE4_COMPUTE_UPLOAD methods
Christoph Bumiller [Mon, 1 Apr 2013 19:46:24 +0000 (21:46 +0200)]
nvc0: demagic some of the NVE4_COMPUTE_UPLOAD methods

It's actually the same as P2MF.

11 years ago nvc0: read PM counters for each warp scheduler separately
Christoph Bumiller [Tue, 2 Apr 2013 16:24:45 +0000 (18:24 +0200)]
nvc0: read PM counters for each warp scheduler separately

11 years ago nvc0: add some metrics to driver specific queries
Christoph Bumiller [Mon, 1 Apr 2013 15:25:40 +0000 (17:25 +0200)]
nvc0: add some metrics to driver specific queries

11 years ago nvc0: add some driver statistics queries
Christoph Bumiller [Fri, 29 Mar 2013 15:30:58 +0000 (16:30 +0100)]
nvc0: add some driver statistics queries

11 years ago nvc0: disable compressed storage type 0xdb for now
Christoph Bumiller [Sun, 31 Mar 2013 18:10:23 +0000 (20:10 +0200)]
nvc0: disable compressed storage type 0xdb for now

Single-sample color compression doesn't seem that useful anyway.

11 years ago nvc0: use correct hw query for PRIMITIVES_GENERATED
Christoph Bumiller [Fri, 29 Mar 2013 14:11:16 +0000 (15:11 +0100)]
nvc0: use correct hw query for PRIMITIVES_GENERATED

It was the same as SO_STATISTICS[1] before.

11 years ago nvc0: use fence to check state of queries that don't write sequence
Christoph Bumiller [Fri, 29 Mar 2013 12:50:44 +0000 (13:50 +0100)]
nvc0: use fence to check state of queries that don't write sequence

This still isn't optimal, since the fence will signal a bit late,
but better than checking on the bo, which may never be ready if it
is shared (which is likely).

11 years ago gallium/hud: add support for PIPE_QUERY_PIPELINE_STATISTICS
Christoph Bumiller [Fri, 29 Mar 2013 12:56:35 +0000 (13:56 +0100)]
gallium/hud: add support for PIPE_QUERY_PIPELINE_STATISTICS

Also, renamed "pixels-rendered" to "samples-passed" because the
occlusion counter increments even if colour and depth writes are
disabled, or (on some implementations) for killed fragments that
passed the depth test when PS early_fragment_tests is set.

11 years ago gallium/docs: fix definition of PIPE_QUERY_SO_STATISTICS
Christoph Bumiller [Fri, 29 Mar 2013 13:30:49 +0000 (14:30 +0100)]
gallium/docs: fix definition of PIPE_QUERY_SO_STATISTICS

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years ago gallium: add PIPE_CAP_QUERY_PIPELINE_STATISTICS
Christoph Bumiller [Fri, 29 Mar 2013 12:02:49 +0000 (13:02 +0100)]
gallium: add PIPE_CAP_QUERY_PIPELINE_STATISTICS

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years ago i965: Reduce code duplication in handling of depth, stencil, and HiZ.
Paul Berry [Tue, 26 Mar 2013 20:24:43 +0000 (13:24 -0700)]
i965: Reduce code duplication in handling of depth, stencil, and HiZ.

This patch consolidates duplicate code in the brw_depthbuffer and
gen7_depthbuffer state atoms.  Previously, these state atoms contained
5 chunks of code for emitting the _3DSTATE_DEPTH_BUFFER packet (3 for
Gen4-6 and 2 for Gen7).  Also a lot of logic for determining the
appropriate buffer setup was duplicated between the Gen4-6 and Gen7
functions.

This refactor splits the code into three separate functions:
brw_emit_depthbuffer(), which determines the appropriate buffer setup
in a mostly generation-independent way, brw_emit_depth_stencil_hiz(),
which emits the appropriate state packets for Gen4-6, and
gen7_emit_depth_stencil_hiz(), which emits the appropriate state
packets for Gen7.

Tested using Piglit on Gen5-7 (no regressions).

v2: Re-word some comments.  Fix an assertion that incorrectly
prohibited packed depth/stencil formats on Gen6 (these are allowed
provided that HiZ is disabled).

Reviewed-by: Chad Versace <chad.versace@linux.intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years agoRevert "glsl: Replace constant-index vector array accesses with swizzles"
Paul Berry [Tue, 2 Apr 2013 16:35:32 +0000 (09:35 -0700)]
Revert "glsl: Replace constant-index vector array accesses with swizzles"

This reverts commit dbf94d105a48b7aafb2c8cf64d8b4392d87efea1, which
was working around a bug in the handling of array indexing when
constant folding built-in functions.  Now that the constant folding
bug has been fixed, the workaround is no longer needed.

11 years agoglsl: Fix array indexing when constant folding built-in functions.
Paul Berry [Fri, 29 Mar 2013 20:34:51 +0000 (13:34 -0700)]
glsl: Fix array indexing when constant folding built-in functions.

Mesa constant-folds built-in functions by using a miniature GLSL
interpreter (see
ir_function_signature::constant_expression_evaluate_expression_list()).
This interpreter had a bug in its handling of array indexing, which
caused expressions like "m[i][j]" (where m is a matrix) to be handled
incorrectly.  Specifically, it incorrectly treated j as indexing into
the whole matrix (rather than indexing just into the vector m[i]); as
a result the offset computed for m[i] was lost and m[i][j] was treated
as m[j][0].
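
The indexing error described above can be sketched in C as follows. This is
only an illustration, not the interpreter code: it assumes a column-major
matrix stored as a flat array, and the helper names offset_correct() and
offset_buggy() are hypothetical.

```c
#include <assert.h>

/* Illustrative only: a column-major matrix stored as a flat array, where
 * m[col] is a column vector and m[col][row] is one component of it. */

/* Correct: keep the offset of column m[col], then index within that column. */
static unsigned offset_correct(unsigned col, unsigned row, unsigned rows)
{
    assert(row < rows);
    return col * rows + row;
}

/* Buggy: the second index is applied to the whole matrix and the offset
 * computed for m[col] is lost, so m[col][row] resolves to m[row][0]. */
static unsigned offset_buggy(unsigned col, unsigned row, unsigned rows)
{
    (void)col; /* the column offset is discarded, which is the bug */
    return row * rows;
}
```

For a mat3, m[1][2] should read element 5 of the flat array; the buggy form
reads element 6, which is m[2][0].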

Fixes piglit tests inverse-mat[234].{vert,frag}.

NOTE: This is a candidate for the 9.1 and 9.0 branches.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57436

11 years agogallivm: bring back optimized but incorrect float to smallfloat optimizations
Roland Scheidegger [Tue, 2 Apr 2013 15:47:30 +0000 (17:47 +0200)]
gallivm: bring back optimized but incorrect float to smallfloat optimizations

Conceptually the same as previously done in float_to_half.
Should cut down number of instructions from 14 to 10 or so, but
will promote some NaNs to Infs, so it's disabled.
It gets a bit tricky though handling all the cases correctly...
Passes basic tests either way (there are no tests covering special
cases, but some manual tests injecting them seemed promising).

v2: style and comment fixes suggested by Jose

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years agogallivm: consolidate code for float-to-half and float-to-packed conversion.
Roland Scheidegger [Tue, 2 Apr 2013 15:41:44 +0000 (17:41 +0200)]
gallivm: consolidate code for float-to-half and float-to-packed conversion.

This replaces the existing float-to-half implementation.
There are definitely a couple of differences - the old implementation
had unspecified(?) rounding behavior, and could at least in theory
construct Inf values out of NaNs. NaNs and Infs should now always be
properly propagated, and rounding behavior is now towards zero
(note this means too large but non-Infinity values get propagated to max
representable value, not Infinity).
The implementation will definitely not match util code, however (which
does nearest rounding, which also means too large values will get
propagated to Infinity).
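
The rounding behavior described above can be illustrated with a scalar C
sketch. This is not the gallivm code (which builds vectorized LLVM IR); the
function name float_to_half_rtz() is hypothetical, and the sketch assumes
IEEE 754 binary32 floats. It truncates the mantissa (round towards zero),
keeps NaNs as NaNs and Infs as Infs, and clamps too-large finite values to
the largest representable half instead of Infinity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper: float -> half bits, rounding towards zero. */
static uint16_t float_to_half_rtz(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret float as bits */

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp  = (bits >> 23) & 0xffu;     /* biased float exponent */
    uint32_t mant = bits & 0x7fffffu;         /* 23-bit float mantissa */

    if (exp == 0xffu) {
        /* Inf stays Inf; NaN stays NaN (never collapses to Inf). */
        return (uint16_t)(sign | (mant ? 0x7e00u : 0x7c00u));
    }

    int32_t e = (int32_t)exp - 127 + 15;      /* rebias for half */

    if (e >= 31) {
        /* Too large but finite: clamp to max representable half. */
        return (uint16_t)(sign | 0x7bffu);
    }
    if (e <= 0) {
        /* Result is a half denormal or zero. */
        if (e < -10)
            return sign;                      /* underflows to signed zero */
        mant |= 0x800000u;                    /* restore implicit leading 1 */
        return (uint16_t)(sign | (mant >> (14 - e)));
    }

    /* Normal case: drop the low 13 mantissa bits (truncation). */
    return (uint16_t)(sign | ((uint32_t)e << 10) | (mant >> 13));
}
```

With this sketch, 65520.0f converts to 0x7bff (the largest finite half),
whereas round-to-nearest would produce Infinity (0x7c00).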

Also fix a bogus round mask probably leading to rounding bugs...
v2: fix a logic bug in handling infs/nans.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
11 years agor600g: don't reserve more stack space than required v5
Vadim Girlin [Tue, 2 Apr 2013 15:33:40 +0000 (19:33 +0400)]
r600g: don't reserve more stack space than required v5

A reduced stack size allows more threads to run in some cases,
improving performance for the shaders that use stack (that is, for the
shaders with control flow instructions). E.g. with unigine-based apps.

v4: implement exact computation taking into account wavefront size
v5: add cases for RV620, RS880

Signed-off-by: Vadim Girlin <vadimgirlin@gmail.com>
11 years agor600g: fix range handling for tgsi input declarations v2
Vadim Girlin [Tue, 2 Apr 2013 15:32:26 +0000 (19:32 +0400)]
r600g: fix range handling for tgsi input declarations v2

Signed-off-by: Vadim Girlin <vadimgirlin@gmail.com>
11 years agogallium/hud: do .xxxx swizzling for the font texture in the fragment shader
Marek Olšák [Tue, 2 Apr 2013 01:30:09 +0000 (03:30 +0200)]
gallium/hud: do .xxxx swizzling for the font texture in the fragment shader

This allows using L8 and R8 for the font if I8 isn't supported.

Tested-by: Brian Paul <brianp@vmware.com>
11 years agohud: flush/unmap the vertex buffer before drawing
Brian Paul [Mon, 1 Apr 2013 22:46:06 +0000 (16:46 -0600)]
hud: flush/unmap the vertex buffer before drawing

The VMware svga driver is picky about making sure the VBO is unmapped
before drawing.

Reviewed-by: Marek Olšák <maraeo@gmail.com>
11 years agodraw: use pipe_transfer_unmap() to match pipe_transfer_map()
Brian Paul [Mon, 1 Apr 2013 22:44:01 +0000 (16:44 -0600)]
draw: use pipe_transfer_unmap() to match pipe_transfer_map()

11 years agogallivm: fix signed small float to float conversion
Roland Scheidegger [Tue, 2 Apr 2013 11:20:24 +0000 (13:20 +0200)]
gallivm: fix signed small float to float conversion

Introduced by 5f41e08cf39d585d600aa506cdcd2f5380c60ddd,
just a silly typo.
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=62921.

11 years agoradeonsi: add instance divisor support v3
Christian König [Fri, 22 Mar 2013 14:59:22 +0000 (15:59 +0100)]
radeonsi: add instance divisor support v3

v2: reduce key size, don't copy key around too much.
v3: remove key size reduction

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
11 years agoradeonsi: add start instance support
Christian König [Thu, 21 Mar 2013 17:30:23 +0000 (18:30 +0100)]
radeonsi: add start instance support

This works differently than on R600: we need to add the start instance manually.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
11 years agoradeonsi: add instanceid support
Christian König [Thu, 21 Mar 2013 17:02:52 +0000 (18:02 +0100)]
radeonsi: add instanceid support

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
11 years agoradeon/llvm: move system value fetching to common code
Christian König [Thu, 21 Mar 2013 16:37:37 +0000 (17:37 +0100)]
radeon/llvm: move system value fetching to common code

This should be used by both SI and R600.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
11 years agoradeonsi: Handle arbitrary 2-byte formats in resource_copy_region
Michel Dänzer [Wed, 27 Mar 2013 11:43:32 +0000 (12:43 +0100)]
radeonsi: Handle arbitrary 2-byte formats in resource_copy_region

Fixes mplayer -vo vdpau OSD.

NOTE: This is a candidate for the 9.1 branch.

Reported-by: Igor Vagulin <igor.vagulin@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Tested-by: Christian König <christian.koenig@amd.com>
11 years agonvc0: Fix fd leak in nvc0_create_decoder
Maarten Lankhorst [Sun, 24 Mar 2013 13:37:41 +0000 (14:37 +0100)]
nvc0: Fix fd leak in nvc0_create_decoder

NOTE: This is a candidate for the 9.0 and 9.1 branches.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
11 years agoGLSL: fix lower_jumps to report progress properly
Aras Pranckevicius [Fri, 1 Mar 2013 10:05:11 +0000 (12:05 +0200)]
GLSL: fix lower_jumps to report progress properly

A fix for lower_jumps progress reporting, very much like the similar fix in
c1e591eed.

NOTE: This is a candidate for stable branches.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
11 years agoi965/fs: Allow CSE on pre-gen7 varying-index uniform loads
Eric Anholt [Wed, 20 Mar 2013 00:45:02 +0000 (17:45 -0700)]
i965/fs: Allow CSE on pre-gen7 varying-index uniform loads

All the other expression types allowed here have inst->mlen == 0, and this
one has implied MRF writes for all of its payload, so nothing else in the
implementation should need to change.

Reduces SEND messages for loading from pull constants in kwin's Lanczos
shader from 16 to 6.  (Due to a deficiency in constant propagation, I
can't use the hack I did in the previous commit to test the performance
change)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
NOTE: This is a candidate for the 9.1 branch.

11 years agoi965/fs: Use LD messages for pre-gen7 varying-index uniform loads
Eric Anholt [Mon, 18 Mar 2013 17:16:42 +0000 (10:16 -0700)]
i965/fs: Use LD messages for pre-gen7 varying-index uniform loads

This comes at a minor performance cost at the moment (-3.2% +/- 0.2%, n=14 on
my GM45 forced to load all uniforms through the varying-index path), but we
get a whole vec4 at a time to reuse in the next commit.

v2: Fix comment about channels in the other message.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
NOTE: This is a candidate for the 9.1 branch.

11 years agoi965/fs: Don't double-emit SEND dependency workarounds at control flow.
Eric Anholt [Wed, 20 Mar 2013 00:36:10 +0000 (17:36 -0700)]
i965/fs: Don't double-emit SEND dependency workarounds at control flow.

We weren't setting needs_dep[i] in the loops, so we'd continue on to
potentially add the same workaround MOVs to the later basic block
boundaries, too.  We can either set needs_dep[i] to exit through the
normal path, or we can just return since we know we're done.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years agoi965/fs: Bake regs_written into the IR instead of recomputing it later.
Eric Anholt [Mon, 18 Mar 2013 18:30:57 +0000 (11:30 -0700)]
i965/fs: Bake regs_written into the IR instead of recomputing it later.

For sampler messages, it depends on the target gen, and on gen4
SIMD16-sampler-on-SIMD8-execution we were returning 4 instead of 8 like we
should.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
NOTE: This is a candidate for the 9.1 branch.

11 years agoi965/fs: Clean up the setup of gen4 simd16 message destinations.
Eric Anholt [Mon, 18 Mar 2013 18:26:17 +0000 (11:26 -0700)]
i965/fs: Clean up the setup of gen4 simd16 message destinations.

I think this makes it much more obvious what's going on here.

NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years agoi965/fs: Do CSE on gen7's varying-index pull constant loads.
Eric Anholt [Fri, 15 Mar 2013 21:43:28 +0000 (14:43 -0700)]
i965/fs: Do CSE on gen7's varying-index pull constant loads.

This is our first CSE on a regs_written() > 1 instruction, so it takes a
bit of extra fixup.  Reduces the number of loads on kwin's Lanczos shader
from 12 to 2.

v2: Fix compiler warning (false positive on possibly-uninitialized variable)

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> (v1)
NOTE: This is a candidate for the 9.1 branch.

11 years agoi965/fs: Improve performance of varying-index uniform loads on IVB.
Eric Anholt [Wed, 13 Mar 2013 21:48:55 +0000 (14:48 -0700)]
i965/fs: Improve performance of varying-index uniform loads on IVB.

Like we have done for the VS and for constant-index uniform loads, we use
the sampler engine to get caching in front of the L3 to avoid tickling the
IVB L3 bug.  This is also a bit of a functional change, as we're now
loading a vec4 instead of a single dword, though we're not taking
advantage of the other 3 components of the vec4 (yet).

With the driver hacked to always take the varying-index path for all
uniforms, improves performance of my old GLSL demo by 315% +/- 2% (n=4).
This is a major fix for some blur shaders in compositors from the
varying-index uniforms support I introduced in 9.1.

v2: Move old offset computation into the pre-gen7 path.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
NOTE: This is a candidate for the 9.1 branch.

11 years agoi965/fs: Avoid inappropriate optimization with regs_written > 1.
Eric Anholt [Fri, 15 Mar 2013 21:31:46 +0000 (14:31 -0700)]
i965/fs: Avoid inappropriate optimization with regs_written > 1.

Right now we don't have anything with regs_written() > 1 and !inst->mlen,
but that's about to change.

NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
11 years agoi965: Make the fragment shader pull constants index by dwords, not vec4s.
Eric Anholt [Thu, 14 Mar 2013 21:41:37 +0000 (14:41 -0700)]
i965: Make the fragment shader pull constants index by dwords, not vec4s.

We want to load vec4s, since loading a vec4 instead of a dword is
basically no increased latency.  But for variable indexed access, the
previous requirement of aligned vec4s for a sampler LD was hard to
implement.

Note that this change only affects those messages that use the surface
format, like sampler LDs, and not the untyped data cache loads we've
used in other cases.

No significant performance difference on my GLSL demo with uniforms forced
to take the varying pull constants path (n=4).

NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>