Kenneth Graunke [Thu, 4 Apr 2013 06:56:57 +0000 (23:56 -0700)]
glsl: Add an optimization pass to flatten simple nested if blocks.
GLBenchmark 2.7's shaders contain conditional blocks like:
if (x) {
if (y) {
...
}
}
where the outer conditional's then clause contains exactly one statement
(the nested if) and there are no else clauses. This can easily be
optimized into:
if (x && y) {
...
}
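The rewrite above can be sketched in C on a toy IR. The `expr`/`stmt` types and `flatten_simple_if()` are hypothetical illustrations, not Mesa's `ir_if` classes or the actual pass. The sketch checks only the conditions named above: the outer then clause holds exactly one statement, that statement is an if, and neither level has an else clause.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy IR, not Mesa's GLSL IR. */
typedef struct expr {
    const char *name;        /* non-NULL for a leaf variable */
    struct expr *lhs, *rhs;  /* operands of a logical AND when name == NULL */
} expr;

typedef struct stmt {
    int is_if;               /* 0 for any other kind of statement */
    expr *cond;
    struct stmt **then_body;
    size_t then_count;
    size_t else_count;       /* only whether the else clause is empty matters */
} stmt;

static expr *make_and(expr *a, expr *b)
{
    expr *e = malloc(sizeof *e);
    assert(e != NULL);
    e->name = NULL;
    e->lhs = a;
    e->rhs = b;
    return e;
}

/* Rewrites "if (x) { if (y) { ... } }" into "if (x && y) { ... }".
 * Returns 1 if the statement was changed, 0 otherwise. */
static int flatten_simple_if(stmt *outer)
{
    if (!outer->is_if || outer->else_count != 0 || outer->then_count != 1)
        return 0;
    stmt *inner = outer->then_body[0];
    if (!inner->is_if || inner->else_count != 0)
        return 0;
    outer->cond = make_and(outer->cond, inner->cond);
    outer->then_body = inner->then_body;
    outer->then_count = inner->then_count;
    return 1;
}
```

Calling `flatten_simple_if()` repeatedly until it returns 0 also collapses deeper nests such as `if (x) { if (y) { if (z) ... } }` into `if ((x && y) && z)`.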
This saves a few instructions in GLBenchmark 2.7:
total instructions in shared programs: 11833 -> 11649 (-1.55%)
instructions in affected programs: 8234 -> 8050 (-2.23%)
It also helps CS:GO slightly (-0.05%/-0.22%). More importantly,
however, it simplifies the control flow graph, which could enable other
optimizations.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Kenneth Graunke [Wed, 3 Apr 2013 04:11:51 +0000 (21:11 -0700)]
i965: Use a variable for the push constant size in kB.
This clarifies that the offset of 2 is actually 16 kB in 8 kB units.
It also keys both computations off of a single variable, which should
make it easier to change in the future.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Kenneth Graunke [Wed, 3 Apr 2013 04:11:50 +0000 (21:11 -0700)]
i965: Turn brw->urb.vs_size and gs_size into local variables.
These variables are only used within a single function, so we may as
well make them local variables.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Kenneth Graunke [Wed, 13 Mar 2013 05:16:37 +0000 (22:16 -0700)]
i965: Remove BRW_NEW_WM_INPUT_DIMENSIONS dirty bit.
This was only produced by the brw_wm_input_dimensions atom, which was
removed in the previous commit. So there's no need for the dirty bit.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Kenneth Graunke [Wed, 13 Mar 2013 04:12:08 +0000 (21:12 -0700)]
i965: Delete brw_vs_constval.c and the brw_wm_input_sizes atom.
This was only used to compute proj_attrib_mask, which was removed by the
previous commit. That makes this dead code.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Kenneth Graunke [Wed, 13 Mar 2013 04:09:35 +0000 (21:09 -0700)]
i965: Remove now dead brw_wm_prog_key::proj_attrib_mask field.
The previous commit removed the last user of this field, so there's no
longer any point in setting it. Removing this should eliminate
state-dependent recompiles, and make the precompile more reliable.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Kenneth Graunke [Wed, 13 Mar 2013 04:09:19 +0000 (21:09 -0700)]
i965: Remove fixed-function texture projection avoidance optimization.
This optimization attempts to avoid extra attribute interpolation
instructions for texture coordinates where the W-component is 1.0.
Unfortunately, it requires a lot of complexity: the brw_wm_input_sizes
state atom (all the brw_vs_constval.c code) needs to run on each draw.
It computes the input_size_masks array, then uses that to compute
proj_attrib_mask. Differences in proj_attrib_mask can cause
state-dependent fragment shader recompiles. We also often fail to guess
proj_attrib_mask for the fragment shader precompile, causing us to
needlessly compile it twice.
Furthermore, this optimization only applies to fixed-function programs;
it does not help modern GLSL-based programs at all. Generally, older
fixed-function programs run fine on modern hardware anyway.
The optimization has existed in some form since the initial commit. When
we rewrote the fragment shader backend, we dropped it for a while. Eric
re-added it in commit
eb30820f268608cf451da32de69723036dddbc62 as part of
an attempt to cure a ~1% performance regression caused by converting the
fixed-function fragment shader generation code from Mesa IR to GLSL IR.
However, no performance data was included in the commit message, so it's
unclear whether or not it was successful.
Time has passed, so I decided to re-measure this. Surprisingly,
Eric's OpenArena timedemo actually runs /faster/ after removing this and
the brw_wm_input_sizes atom. On Ivybridge at 1024x768, I measured a
1.39532% +/- 0.91833% increase in FPS (n = 55). On Ironlake, there was
no statistically significant difference (n = 37).
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Kenneth Graunke [Tue, 2 Apr 2013 17:28:07 +0000 (10:28 -0700)]
i965: Use ctx->Stencil._WriteEnabled in DEPTH_STENCIL_STATE.
This is the same computation as the _WriteEnabled flag, so we may as
well use it.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Kenneth Graunke [Tue, 2 Apr 2013 17:29:37 +0000 (10:29 -0700)]
i965: Fix stencil write enable flag in 3DSTATE_DEPTH_BUFFER on Gen7+.
ctx->Stencil.WriteMask is a statically sized array of 3 elements.
Checking it against 0 actually is a NULL check, and can never fail,
which meant that we always said stencil writes were enabled.
Use the new core Mesa derived state flag to fix this.
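To illustrate the bug: an array member decays to a pointer, so comparing it with 0 tests the array's address (never NULL) rather than the mask values. The sketch uses a hypothetical `stencil_state` struct shaped like the three-element WriteMask. How two-sided stencil is flagged and which element holds the back-face mask are assumptions here, not Mesa's exact `_WriteEnabled` computation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the stencil attribute state. */
struct stencil_state {
    bool enabled;
    bool two_sided;          /* assumption: back-face state is in use */
    unsigned back_face;      /* assumption: index (1 or 2) of the back-face slot */
    unsigned write_mask[3];  /* same shape as ctx->Stencil.WriteMask */
};

/* Buggy form (not compiled here): "s->enabled && s->write_mask != 0"
 * compares the array's address with NULL, so it is true whenever
 * stencil is enabled, regardless of the mask values. */

/* Fixed form: inspect the mask values themselves. */
static bool stencil_write_enabled(const struct stencil_state *s)
{
    assert(s->back_face < 3);
    if (!s->enabled)
        return false;
    if (s->write_mask[0] != 0)
        return true;
    return s->two_sided && s->write_mask[s->back_face] != 0;
}
```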
NOTE: This is a candidate for stable branches.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Kenneth Graunke [Tue, 2 Apr 2013 17:22:18 +0000 (10:22 -0700)]
mesa: Add new ctx->Stencil._WriteEnabled derived state flag.
i965 needs to know whether stencil writes are enabled in several places,
and gets the test wrong sometimes. While we could create a function to
compute this, it seems generally useful enough to warrant a new piece of
derived state. Also, all the plumbing is already in place.
NOTE: This is a candidate for stable branches.
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Paul Berry <stereotype441@gmail.com>
Roland Scheidegger [Thu, 4 Apr 2013 21:20:49 +0000 (23:20 +0200)]
gallivm: some minor cube map cleanup
The ar_ge_as_at variable was just very very confusing since the condition
was actually the other way around (as_at_ge_ar). So change the condition
(and the selects depending on it) to match the variable name.
And also change the chosen major axis in case the coord values are the
same. OpenGL doesn't care one bit which one is chosen in this case but
it looks like dx10 would require z chosen over y, and y chosen over x
(previously did x chosen over y, y chosen over z). Since it's all the
same effort just honor dx10's wishes. (Though actually, for some preferred
orderings, we could save one (or two with derivatives) selects since the
tnewx and tnewz (and the corresponding dmax values) are the same.)
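The new tie-breaking order can be sketched for a single pixel in C. `cube_major_axis()` is a hypothetical scalar illustration of the comparison order described above (z wins ties against x and y, y wins ties against x). It is not gallivm's vectorized code, which expresses the decision with selects.

```c
#include <math.h>

enum cube_axis { CUBE_AXIS_X, CUBE_AXIS_Y, CUBE_AXIS_Z };

/* s, t, r are the x, y, z cube map coordinates. */
static enum cube_axis cube_major_axis(float s, float t, float r)
{
    float as = fabsf(s), at = fabsf(t), ar = fabsf(r);

    if (ar >= as && ar >= at)
        return CUBE_AXIS_Z;   /* z chosen over y and over x on ties */
    if (at >= as)
        return CUBE_AXIS_Y;   /* y chosen over x on ties */
    return CUBE_AXIS_X;
}
```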
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Eric Anholt [Sat, 1 Dec 2012 00:34:09 +0000 (16:34 -0800)]
i965: Ask the register allocator to round-robin through registers.
The way we were allocating registers before, packing into low register
numbers for Ironlake, resulted in an overly-constrained dependency graph
for instruction scheduling. Improves GLBenchmark 2.1 performance by
4.5% +/- 0.7% (n=26). No difference on my old GLSL demo (n=20). No
difference on nexuiz (n=15).
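The difference between the two policies can be sketched as follows. These helpers are hypothetical and far simpler than Mesa's graph-coloring register allocator. They only show why always handing out the lowest free register keeps reusing the same few registers, which creates false dependencies for the scheduler, while round-robin spreads short-lived values across the register file.

```c
#include <stdbool.h>

/* Lowest-numbered free register, or -1 if none is free. */
static int pick_reg_lowest(const bool *in_use, int nregs)
{
    for (int r = 0; r < nregs; r++)
        if (!in_use[r])
            return r;
    return -1;
}

/* First free register at or after *next, wrapping around; advances *next
 * past the chosen register. Returns -1 if none is free. */
static int pick_reg_round_robin(const bool *in_use, int nregs, int *next)
{
    for (int k = 0; k < nregs; k++) {
        int r = (*next + k) % nregs;
        if (!in_use[r]) {
            *next = (r + 1) % nregs;
            return r;
        }
    }
    return -1;
}
```

With three short-lived values that each die before the next one is allocated, the lowest-first policy returns register 0 every time, while round-robin returns 0, 1, 2.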
v2: Fix off-by-one bug that made the change only work for 16-wide on i965.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Zack Rusin [Thu, 4 Apr 2013 04:15:13 +0000 (21:15 -0700)]
llvmpipe: implement ucmp
and add a test for it
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Paul Berry [Tue, 2 Apr 2013 16:51:47 +0000 (09:51 -0700)]
Avoid spurious GCC warnings in STATIC_ASSERT() macro.
GCC 4.8 now warns about typedefs that are local to a scope and not
used anywhere within that scope. This produced spurious warnings with
the STATIC_ASSERT() macro (which used a typedef to provoke a compile
error in the event of an assertion failure).
This patch switches to a simpler technique that avoids the warning.
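One typedef-free way to build such a macro is sketched below. A negative array size inside `sizeof` forces a compile error when the condition is false, and nothing is declared, so there is no unused local typedef to warn about. This is an assumed reconstruction of the general technique, not necessarily the exact text of the patch.

```c
#include <stdint.h>

/* Compiles to nothing when COND is true; when COND is a false constant
 * expression, the array size 1 - 2 = -1 is a compile-time error. */
#define STATIC_ASSERT(COND) \
    do { \
        (void) sizeof(char [1 - 2 * !(COND)]); \
    } while (0)

static int check_layout(void)
{
    STATIC_ASSERT(sizeof(uint32_t) == 4);
    STATIC_ASSERT(sizeof(uint16_t) * 2 == sizeof(uint32_t));
    return 1;
}
```

COND must be an integer constant expression. A non-constant condition silently turns the array into a variable-length array, which is not diagnosed at compile time.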
v2: Avoid GCC-specific syntax. Also update p_compiler.h.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Erik Faye-Lund [Tue, 26 Mar 2013 13:48:45 +0000 (14:48 +0100)]
freedreno: document debug flag
Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Brian Paul <brianp@vmware.com>
Brian Paul [Wed, 3 Apr 2013 19:46:40 +0000 (13:46 -0600)]
st/wgl: add HUD support
v2: fix a few minor issues spotted by Jose.
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 19:45:47 +0000 (13:45 -0600)]
st/wgl: make stw_current_context() non-static
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 19:36:50 +0000 (13:36 -0600)]
util: add debug_memory_check_block(), debug_memory_tag()
The former just checks that the given block is valid by checking
the header and footer.
The latter sets the memory block's tag. With extra debug code, we
can use that for monitoring/checking particular allocations.
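The header/footer check can be sketched as a debug allocator that brackets each block with magic values and keeps a tag in the header. All names here (`dbg_malloc()`, `dbg_check_block()`, `dbg_tag()`, the magic constants) are hypothetical. This is not the gallium util implementation, whose layout may differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DBG_HEADER_MAGIC 0x48454144u  /* hypothetical guard values */
#define DBG_FOOTER_MAGIC 0x464f4f54u

struct dbg_header {
    uint32_t magic;
    uint32_t tag;
    size_t size;
};

/* Layout: [header][size user bytes][4-byte footer]. */
static void *dbg_malloc(size_t size)
{
    struct dbg_header *hdr = malloc(sizeof *hdr + size + sizeof(uint32_t));
    if (!hdr)
        return NULL;
    hdr->magic = DBG_HEADER_MAGIC;
    hdr->tag = 0;
    hdr->size = size;
    uint32_t footer = DBG_FOOTER_MAGIC;
    /* The footer may be unaligned, so copy it with memcpy. */
    memcpy((char *)(hdr + 1) + size, &footer, sizeof footer);
    return hdr + 1;
}

/* Returns true if both guard values around the block are intact. */
static bool dbg_check_block(const void *ptr)
{
    const struct dbg_header *hdr = (const struct dbg_header *)ptr - 1;
    uint32_t footer;
    if (hdr->magic != DBG_HEADER_MAGIC)
        return false;
    memcpy(&footer, (const char *)ptr + hdr->size, sizeof footer);
    return footer == DBG_FOOTER_MAGIC;
}

static void dbg_tag(void *ptr, uint32_t tag)
{
    ((struct dbg_header *)ptr - 1)->tag = tag;
}

static uint32_t dbg_get_tag(const void *ptr)
{
    return ((const struct dbg_header *)ptr - 1)->tag;
}

static void dbg_free(void *ptr)
{
    if (ptr)
        free((struct dbg_header *)ptr - 1);
}
```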
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 19:33:38 +0000 (13:33 -0600)]
gallium/hud: replace malloc w/ MALLOC
To match the FREE() call used later. Fixes things on Windows.
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Vincent Lejeune [Wed, 3 Apr 2013 19:19:22 +0000 (21:19 +0200)]
r600g/llvm: Workaround for wrong tex.offset_*
Roland Scheidegger [Wed, 3 Apr 2013 22:56:23 +0000 (00:56 +0200)]
gallivm: honor explicit derivatives values for cube maps.
This is trivial now, though we need to make sure we pass all the necessary
derivative values (which is 3 each for ddx/ddy not 2).
Passes piglit arb_shader_texture_lod-texgradcube test.
v2: add the forgotten abs() for all incoming derivatives (discovered
by new piglit arb_shader_texture_lod-texgradcube test, though more by
luck as it was failing only for exactly one pixel...).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Roland Scheidegger [Wed, 3 Apr 2013 01:26:22 +0000 (03:26 +0200)]
gallivm: do per-pixel cube face selection (finally!!!)
This proved to be tricky; the problem is that after selection/mirroring
we cannot calculate reasonable derivatives (if not all pixels in a quad
end up on the same face the derivatives could get "randomly" exceedingly
large).
However, it is actually quite easy to simply calculate the derivatives
before selection/mirroring and then transform them similar to
the cube coordinates (they only need selection/projection, but not
mirroring as we're not interested in the sign bit, of course). While
there is a tiny bit more work to do (need to calculate derivs for 3
coords instead of 2, and additional selects) it also simplifies things
somewhat for the coord selection itself (as we save some broadcast aos
shuffles, and we don't need to calculate the average vector) - hence if
derivatives aren't needed this should actually be faster.
Also, this has the benefit that this will (trivially) work for explicit
derivatives too, which we completely ignored before that (will be in a
separate commit for better trackability).
Note that while the way for getting rho looks very different, it should
result in "nearly" the same values as before (the "nearly" is only because
before the code would choose the face based on an "average" vector and hence
the derivatives calculated according to this face, where now (for implicit
derivatives) the derivatives are projected on the face selected for the
first (top-left) pixel in a quad, so not necessarily the same face).
The transformation done might not quite be state-of-the-art, calculating
length(dx,dy) as max(dx,dy) certainly isn't either, but this stays the
same as before (that is I think a better transform would _somehow_ take
the "derivative major axis" into account so that derivative changes in
the major axis wouldn't get ignored).
Should solve some accuracy problems with cubemaps (can easily be seen with
the cubemap demo when switching wrapping/filtering), though we still don't
do seamless filtering to fix it completely (so not per-sample but per-pixel
is certainly better than per-quad and already sufficient for accurate
results with nearest tex filter).
As for performance, it seems to be a tiny bit faster too (maybe 3% or so
with cubemap demo). I'd have expected that with nearest/nearest filtering,
where this results in fewer instructions, but the difference actually seems
to be larger with linear/linear_mipmap_linear, where it is slightly more
instructions; probably the code is less serialized, allowing better
scheduling (on a Sandy Bridge CPU). It actually seems to be now at least
as fast as the old path using a conditional when using 128bit vectors too
(that is probably more a result of testing with a newer cpu though), for now
that old path is still there but unused.
No piglit regressions.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Roland Scheidegger [Wed, 3 Apr 2013 00:49:56 +0000 (02:49 +0200)]
gallivm: minor rho calculation optimization for 1 or 3 coords
Using a different packing for the single coord case should save a shuffle.
Plus some minor style fixes.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Roland Scheidegger [Tue, 2 Apr 2013 23:06:52 +0000 (01:06 +0200)]
gallivm: use f16c hw support for float->half and half->float conversion
Should be way faster, of course, on CPUs supporting this (includes AMD
Bulldozer and Jaguar cores, Intel Ivy Bridge and up (except budget models)).
Passes piglit fbo-blending-formats GL_ARB_texture_float -auto on Ivy Bridge.
Reviewed-by: Brian Paul <brianp@vmware.com>
Zack Rusin [Sat, 30 Mar 2013 13:21:41 +0000 (06:21 -0700)]
draw/llvmpipe: allow independent so attachments to the vs
When geometry shaders are present, one needs to be able to create
an empty geometry shader with stream output that needs to be
resolved later and attached to the currently bound vertex shader.
Let's add support for it to llvmpipe and draw. draw allows attaching
independent stream output info to any vertex shader and llvmpipe
resolves at draw time which vertex shader the given empty geometry
shader should be linked to.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Sat, 30 Mar 2013 07:21:03 +0000 (00:21 -0700)]
llvmpipe: reset so buffers when not appending
We need to reset the internal state of the so buffers or we'll
keep appending even though we're not supposed to.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Sat, 30 Mar 2013 07:20:05 +0000 (00:20 -0700)]
draw: remove unused function
we use draw_set_mapped_so_targets nowadays
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Sat, 30 Mar 2013 02:33:34 +0000 (19:33 -0700)]
draw/llvm: use an enum instead of magic numbers
I think this was there before and got accidentally
removed during a merge. Same code as for the GS
context, which is also using an enum instead of
hardcoded numbers.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Sat, 30 Mar 2013 00:18:42 +0000 (17:18 -0700)]
draw/gs: cleanup some debugging code
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Fri, 29 Mar 2013 11:52:29 +0000 (04:52 -0700)]
draw/so: maintain an exact number of written vertices
It's quite helpful during rendering to know
exactly how many vertices are available in the
buffer.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Fri, 29 Mar 2013 11:50:32 +0000 (04:50 -0700)]
draw: Implement support for primitive id
We were largely ignoring primitive id.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Thu, 28 Mar 2013 03:13:13 +0000 (20:13 -0700)]
draw/so: Fix bogus assert
We do support so with multiple primitives.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Thu, 28 Mar 2013 03:11:16 +0000 (20:11 -0700)]
draw/gs: Fix memory corruption with multiple primitives
We were flushing with an incorrect number of primitives. TGSI exec
can only work with a single primitive at a time. Plus the fetching
with multiple primitives on llvm paths wasn't copying the last
element.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Zack Rusin [Wed, 27 Mar 2013 11:27:59 +0000 (04:27 -0700)]
gallivm: cleanup the gs interface
Instead of void pointers use a base interface.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 16:23:57 +0000 (10:23 -0600)]
svga: add new memory-used HUD query
To track the amount of memory used by all pipe_resources (textures
and buffers).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 16:23:16 +0000 (10:23 -0600)]
util: add new util_resource_size() function in u_resource.[ch]
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Brian Paul [Wed, 3 Apr 2013 16:21:34 +0000 (10:21 -0600)]
util: move functions from u_resource.c to u_transfer.c
The functions are prototyped in u_transfer.h and are related to the
other functions in u_transfer.c.
The next patch will re-use the u_resource.c file for new code.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Vincent Lejeune [Wed, 3 Apr 2013 16:39:18 +0000 (18:39 +0200)]
r600g/llvm: Do not override llvm provided stack_size
Vincent Lejeune [Tue, 2 Apr 2013 17:19:24 +0000 (19:19 +0200)]
r600g/llvm: Do not change cf_alu inst when adding alus
Marek Olšák [Tue, 2 Apr 2013 22:47:06 +0000 (18:47 -0400)]
radeonsi: add more cases for copying unsupported formats to resource_copy_region
Ported from r600g commit:
8891b2f9c91b2f6c8625184c23a10b8e55875dc0
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
NOTE: This is a candidate for the 9.1 branch.
Brian Paul [Mon, 1 Apr 2013 23:51:43 +0000 (17:51 -0600)]
svga: add HUD queries for number of draw calls, number of fallbacks
The fallbacks count is the number of drawing calls that use a "draw"
module fallback, such as polygon stipple.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Brian Paul [Mon, 1 Apr 2013 23:49:31 +0000 (17:49 -0600)]
svga: refactor occlusion query code
This is in preparation for adding new query types for the HUD.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Brian Paul [Mon, 1 Apr 2013 22:44:50 +0000 (16:44 -0600)]
gallium/hud: try L8 texture for font if I8 format isn't supported
Brian Paul [Wed, 3 Apr 2013 14:19:44 +0000 (08:19 -0600)]
svga: add case for PIPE_CAP_QUERY_PIPELINE_STATISTICS
Brian Paul [Tue, 2 Apr 2013 20:33:42 +0000 (14:33 -0600)]
st/mesa: rewrite comment in st_manager.c
Christoph Bumiller [Wed, 3 Apr 2013 11:19:15 +0000 (13:19 +0200)]
nv50,nvc0: remove MS resolve formats hack
Mesa now allows BlitFramebuffer resolve between RGBA and BGRA.
Christoph Bumiller [Tue, 2 Apr 2013 23:17:46 +0000 (01:17 +0200)]
nvc0: fix 128 bit compressed storage type selection
Christoph Bumiller [Tue, 2 Apr 2013 22:18:55 +0000 (00:18 +0200)]
nvc0: place staging textures in GART and map them directly
Christoph Bumiller [Tue, 2 Apr 2013 22:18:29 +0000 (00:18 +0200)]
nv50: account for pesky prefetch in size calculation of linear textures
Christoph Bumiller [Tue, 2 Apr 2013 14:24:06 +0000 (16:24 +0200)]
nvc0: honour scaled coordinates setting for linear textures
Christoph Bumiller [Sat, 30 Mar 2013 20:28:30 +0000 (21:28 +0100)]
nvc0: fix for 2d engine R source formats writing RRR1 and not R001
Christoph Bumiller [Sun, 31 Mar 2013 20:10:02 +0000 (22:10 +0200)]
nv50,nvc0: disable DEPTH_RANGE_NEAR/FAR clipping during blit
We send position.z == 0, but DEPTH_RANGE may be some arbitrary range
not including 0 (for example in piglit's hiz tests).
Christoph Bumiller [Sat, 30 Mar 2013 13:57:21 +0000 (14:57 +0100)]
st/mesa: fix bitmap,drawpix,drawtex for PIPE_CAP_TGSI_TEXCOORD
NOTE: Changed the semantic index for the drawtex coordinate to
be the texture unit index instead of always 0.
Not sure if this is correct but since the value seems to depend
on the unit it would make sense to use different varying slots.
Christoph Bumiller [Sat, 30 Mar 2013 14:55:20 +0000 (15:55 +0100)]
nouveau: accelerate buffer copies in resource_copy_region
Christoph Bumiller [Mon, 1 Apr 2013 19:46:24 +0000 (21:46 +0200)]
nvc0: demagic some of the NVE4_COMPUTE_UPLOAD methods
It's actually the same as P2MF.
Christoph Bumiller [Tue, 2 Apr 2013 16:24:45 +0000 (18:24 +0200)]
nvc0: read PM counters for each warp scheduler separately
Christoph Bumiller [Mon, 1 Apr 2013 15:25:40 +0000 (17:25 +0200)]
nvc0: add some metrics to driver specific queries
Christoph Bumiller [Fri, 29 Mar 2013 15:30:58 +0000 (16:30 +0100)]
nvc0: add some driver statistics queries
Christoph Bumiller [Sun, 31 Mar 2013 18:10:23 +0000 (20:10 +0200)]
nvc0: disable compressed storage type 0xdb for now
Single-sample color compression doesn't seem that useful anyway.
Christoph Bumiller [Fri, 29 Mar 2013 14:11:16 +0000 (15:11 +0100)]
nvc0: use correct hw query for PRIMITIVES_GENERATED
It was the same as SO_STATISTICS[1] before.
Christoph Bumiller [Fri, 29 Mar 2013 12:50:44 +0000 (13:50 +0100)]
nvc0: use fence to check state of queries that don't write sequence
This still isn't optimal, since the fence will signal a bit late,
but better than checking on the bo, which may never be ready if it
is shared (which is likely).
Christoph Bumiller [Fri, 29 Mar 2013 12:56:35 +0000 (13:56 +0100)]
gallium/hud: add support for PIPE_QUERY_PIPELINE_STATISTICS
Also, renamed "pixels-rendered" to "samples-passed" because the
occlusion counter increments even if colour and depth writes are
disabled, or (on some implementations) for killed fragments that
passed the depth test when PS early_fragment_tests is set.
Christoph Bumiller [Fri, 29 Mar 2013 13:30:49 +0000 (14:30 +0100)]
gallium/docs: fix definition of PIPE_QUERY_SO_STATISTICS
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Christoph Bumiller [Fri, 29 Mar 2013 12:02:49 +0000 (13:02 +0100)]
gallium: add PIPE_CAP_QUERY_PIPELINE_STATISTICS
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Paul Berry [Tue, 26 Mar 2013 20:24:43 +0000 (13:24 -0700)]
i965: Reduce code duplication in handling of depth, stencil, and HiZ.
This patch consolidates duplicate code in the brw_depthbuffer and
gen7_depthbuffer state atoms. Previously, these state atoms contained
5 chunks of code for emitting the _3DSTATE_DEPTH_BUFFER packet (3 for
Gen4-6 and 2 for Gen7). Also a lot of logic for determining the
appropriate buffer setup was duplicated between the Gen4-6 and Gen7
functions.
This refactor splits the code into three separate functions:
brw_emit_depthbuffer(), which determines the appropriate buffer setup
in a mostly generation-independent way, brw_emit_depth_stencil_hiz(),
which emits the appropriate state packets for Gen4-6, and
gen7_emit_depth_stencil_hiz(), which emits the appropriate state
packets for Gen7.
Tested using Piglit on Gen5-7 (no regressions).
v2: Re-word some comments. Fix an assertion that incorrectly
prohibited packed depth/stencil formats on Gen6 (these are allowed
provided that HiZ is disabled).
Reviewed-by: Chad Versace <chad.versace@linux.intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Paul Berry [Tue, 2 Apr 2013 16:35:32 +0000 (09:35 -0700)]
Revert "glsl: Replace constant-index vector array accesses with swizzles"
This reverts commit
dbf94d105a48b7aafb2c8cf64d8b4392d87efea1, which
was working around a bug in the handling of array indexing when
constant folding built-in functions. Now that the constant folding
bug has been fixed, the workaround is no longer needed.
Paul Berry [Fri, 29 Mar 2013 20:34:51 +0000 (13:34 -0700)]
glsl: Fix array indexing when constant folding built-in functions.
Mesa constant-folds built-in functions by using a miniature GLSL
interpreter (see
ir_function_signature::constant_expression_evaluate_expression_list()).
This interpreter had a bug in its handling of array indexing, which
caused expressions like "m[i][j]" (where m is a matrix) to be handled
incorrectly. Specifically, it incorrectly treated j as indexing into
the whole matrix (rather than indexing just into the vector m[i]); as
a result the offset computed for m[i] was lost and m[i][j] was treated
as m[j][0].
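The difference can be sketched on a column-major mat3 stored as nine floats, where `m[i]` is column i. Both helpers are hypothetical. The second only reproduces the symptom described above (the offset for m[i] is lost and j selects a column); it is not the interpreter's actual code.

```c
#include <assert.h>

/* Correct: m[i] selects column i (offset i * rows), then j selects
 * a component within that column vector. */
static float mat_index(const float *m, int rows, int i, int j)
{
    assert(i >= 0 && j >= 0 && j < rows);
    const float *column = m + i * rows;
    return column[j];
}

/* The buggy behaviour: the offset computed for m[i] is dropped and j is
 * applied to the whole matrix, so m[i][j] reads m[j][0]. */
static float mat_index_buggy(const float *m, int rows, int i, int j)
{
    (void)i;
    return m[j * rows];
}
```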
Fixes piglit tests inverse-mat[234].{vert,frag}.
NOTE: This is a candidate for the 9.1 and 9.0 branches.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57436
Roland Scheidegger [Tue, 2 Apr 2013 15:47:30 +0000 (17:47 +0200)]
gallivm: bring back optimized but incorrect float to smallfloat optimizations
Conceptually the same as previously done in float_to_half.
Should cut down the number of instructions from 14 to 10 or so, but
will promote some NaNs to Infs, so it's disabled.
It gets a bit tricky though handling all the cases correctly...
Passes basic tests either way (though there are no tests testing special
cases, but some manual tests injecting them seemed promising).
v2: style and comment fixes suggested by Jose
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Roland Scheidegger [Tue, 2 Apr 2013 15:41:44 +0000 (17:41 +0200)]
gallivm: consolidate code for float-to-half and float-to-packed conversion.
This replaces the existing float-to-half implementation.
There are definitely a couple of differences - the old implementation
had unspecified(?) rounding behavior, and could at least in theory
construct Inf values out of NaNs. NaNs and Infs should now always be
properly propagated, and rounding behavior is now towards zero
(note this means too large but non-Infinity values get propagated to max
representable value, not Infinity).
The implementation will definitely not match util code, however (which
does nearest rounding, which also means too large values will get
propagated to Infinity).
Also fix a bogus round mask probably leading to rounding bugs...
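The behavior described above (round toward zero, NaN and Inf propagated, finite overflow clamped to the largest half value 65504 rather than Infinity) can be sketched as a scalar C function. This is a hypothetical reference model, not gallivm's vectorized LLVM IR, and the quiet-NaN pattern it returns (0x7e00) is an assumption.

```c
#include <stdint.h>
#include <string.h>

static uint16_t float_to_half_rtz(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp = (bits >> 23) & 0xffu;
    uint32_t mant = bits & 0x7fffffu;

    if (exp == 0xffu)              /* Inf stays Inf, NaN stays NaN */
        return (uint16_t)(sign | (mant ? 0x7e00u : 0x7c00u));
    if (exp == 0)                  /* float zero/denormal: below half range */
        return sign;

    int e = (int)exp - 127;        /* unbiased exponent */
    if (e > 15)                    /* too large: max finite half, not Inf */
        return (uint16_t)(sign | 0x7bffu);
    if (e >= -14)                  /* normal half: truncate the mantissa */
        return (uint16_t)(sign | ((uint32_t)(e + 15) << 10) | (mant >> 13));

    /* Half denormal: value (1.m) * 2^e expressed in units of 2^-24. */
    int shift = -e - 1;
    if (shift >= 24)
        return sign;
    return (uint16_t)(sign | ((mant | 0x800000u) >> shift));
}
```

Round toward zero shows up for 1.000732421875f (1 + 0.75 half-ulp): this model returns 0x3c00 (1.0), whereas round-to-nearest, as in the util code, would give 0x3c01.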
v2: fix a logic bug in handling infs/nans.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Vadim Girlin [Tue, 2 Apr 2013 15:33:40 +0000 (19:33 +0400)]
r600g: don't reserve more stack space than required v5
A reduced stack size allows running more threads in some cases,
improving performance for the shaders that use stack (that is, for the
shaders with control flow instructions). E.g. with unigine-based apps.
v4: implement exact computation taking into account wavefront size
v5: add cases for RV620, RS880
Signed-off-by: Vadim Girlin <vadimgirlin@gmail.com>
Vadim Girlin [Tue, 2 Apr 2013 15:32:26 +0000 (19:32 +0400)]
r600g: fix range handling for tgsi input declarations v2
Signed-off-by: Vadim Girlin <vadimgirlin@gmail.com>
Marek Olšák [Tue, 2 Apr 2013 01:30:09 +0000 (03:30 +0200)]
gallium/hud: do .xxxx swizzling for the font texture in the fragment shader
This allows using L8 and R8 for the font if I8 isn't supported.
Tested-by: Brian Paul <brianp@vmware.com>
Brian Paul [Mon, 1 Apr 2013 22:46:06 +0000 (16:46 -0600)]
hud: flush/unmap the vertex buffer before drawing
The VMware svga driver is picky about making sure the VBO is unmapped
before drawing.
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Brian Paul [Mon, 1 Apr 2013 22:44:01 +0000 (16:44 -0600)]
draw: use pipe_transfer_unmap() to match pipe_transfer_map()
Roland Scheidegger [Tue, 2 Apr 2013 11:20:24 +0000 (13:20 +0200)]
gallivm: fix signed small float to float conversion
Introduced by
5f41e08cf39d585d600aa506cdcd2f5380c60ddd,
just a silly typo.
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=62921.
Christian König [Fri, 22 Mar 2013 14:59:22 +0000 (15:59 +0100)]
radeonsi: add instance divisor support v3
v2: reduce key size, don't copy key around too much.
v3: remove key size reduction
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Christian König [Thu, 21 Mar 2013 17:30:23 +0000 (18:30 +0100)]
radeonsi: add start instance support
This works differently than on R600; we need to add the start instance manually.
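With instanced arrays, the element fetched for a per-instance attribute is the start instance plus the instance ID divided by the divisor, while a divisor of 0 means the attribute advances per vertex. A minimal sketch of that index arithmetic follows; `attrib_fetch_index()` is a hypothetical helper. The commit says radeonsi must add the start instance itself, and this scalar C shows only the arithmetic, not how the driver emits it.

```c
/* divisor == 0: ordinary per-vertex attribute.
 * divisor  > 0: per-instance attribute advancing every `divisor`
 *               instances, offset by the start (base) instance. */
static unsigned attrib_fetch_index(unsigned vertex_index, unsigned instance_id,
                                   unsigned divisor, unsigned start_instance)
{
    if (divisor == 0)
        return vertex_index;
    return start_instance + instance_id / divisor;
}
```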
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
Christian König [Thu, 21 Mar 2013 17:02:52 +0000 (18:02 +0100)]
radeonsi: add instanceid support
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
Christian König [Thu, 21 Mar 2013 16:37:37 +0000 (17:37 +0100)]
radeon/llvm: move system value fetching to common code
This should be used by both SI and R600.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Tested-by: Michel Dänzer <michel.daenzer@amd.com>
Michel Dänzer [Wed, 27 Mar 2013 11:43:32 +0000 (12:43 +0100)]
radeonsi: Handle arbitrary 2-byte formats in resource_copy_region
Fixes mplayer -vo vdpau OSD.
NOTE: This is a candidate for the 9.1 branch.
Reported-by: Igor Vagulin <igor.vagulin@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Tested-by: Christian König <christian.koenig@amd.com>
Maarten Lankhorst [Sun, 24 Mar 2013 13:37:41 +0000 (14:37 +0100)]
nvc0: Fix fd leak in nvc0_create_decoder
NOTE: This is a candidate for the 9.0 and 9.1 branches.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Aras Pranckevicius [Fri, 1 Mar 2013 10:05:11 +0000 (12:05 +0200)]
GLSL: fix lower_jumps to report progress properly
A fix for lower_jumps progress reporting, very much like the similar fix
in c1e591eed.
NOTE: This is a candidate for stable branches.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Eric Anholt [Wed, 20 Mar 2013 00:45:02 +0000 (17:45 -0700)]
i965/fs: Allow CSE on pre-gen7 varying-index uniform loads
All the other expression types allowed here have inst->mlen == 0, and this
one has implied MRF writes for all of its payload, so nothing else in the
implementation should need to change.
Reduces SEND messages for loading from pull constants in kwin's Lanczos
shader from 16 to 6. (Due to a deficiency in constant propagation, I
can't use the hack I did in the previous commit to test the performance
change.)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
NOTE: This is a candidate for the 9.1 branch.
Eric Anholt [Mon, 18 Mar 2013 17:16:42 +0000 (10:16 -0700)]
i965/fs: Use LD messages for pre-gen7 varying-index uniform loads
This comes at a minor performance cost at the moment (-3.2% +/- 0.2%, n=14 on
my GM45 forced to load all uniforms through the varying-index path), but we
get a whole vec4 at a time to reuse in the next commit.
v2: Fix comment about channels in the other message.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
NOTE: This is a candidate for the 9.1 branch.
Eric Anholt [Wed, 20 Mar 2013 00:36:10 +0000 (17:36 -0700)]
i965/fs: Don't double-emit SEND dependency workarounds at control flow.
We weren't setting needs_dep[i] in the loops, so we'd continue on to
potentially add the same workaround MOVs to the later basic block
boundaries, too. We can either set needs_dep[i] to exit through the
normal path, or we can just return since we know we're done.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Eric Anholt [Mon, 18 Mar 2013 18:30:57 +0000 (11:30 -0700)]
i965/fs: Bake regs_written into the IR instead of recomputing it later.
For sampler messages, it depends on the target gen, and on gen4
SIMD16-sampler-on-SIMD8-execution we were returning 4 instead of 8 like we
should.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
NOTE: This is a candidate for the 9.1 branch.
Eric Anholt [Mon, 18 Mar 2013 18:26:17 +0000 (11:26 -0700)]
i965/fs: Clean up the setup of gen4 simd16 message destinations.
I think this makes it much more obvious what's going on here.
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Eric Anholt [Fri, 15 Mar 2013 21:43:28 +0000 (14:43 -0700)]
i965/fs: Do CSE on gen7's varying-index pull constant loads.
This is our first CSE on a regs_written() > 1 instruction, so it takes a
bit of extra fixup. Reduces the number of loads on kwin's Lanczos shader
from 12 to 2.
v2: Fix compiler warning (false positive on possibly-uninitialized variable)
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org> (v1)
NOTE: This is a candidate for the 9.1 branch.
Eric Anholt [Wed, 13 Mar 2013 21:48:55 +0000 (14:48 -0700)]
i965/fs: Improve performance of varying-index uniform loads on IVB.
Like we have done for the VS and for constant-index uniform loads, we use
the sampler engine to get caching in front of the L3 to avoid tickling the
IVB L3 bug. This is also a bit of a functional change, as we're now
loading a vec4 instead of a single dword, though we're not taking
advantage of the other 3 components of the vec4 (yet).
With the driver hacked to always take the varying-index path for all
uniforms, improves performance of my old GLSL demo by 315% +/- 2% (n=4).
This is a major fix for some blur shaders in compositors that were
affected by the varying-index uniform support I introduced in 9.1.
v2: Move old offset computation into the pre-gen7 path.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=61554
NOTE: This is a candidate for the 9.1 branch.
Eric Anholt [Fri, 15 Mar 2013 21:31:46 +0000 (14:31 -0700)]
i965/fs: Avoid inappropriate optimization with regs_written > 1.
Right now we don't have anything with regs_written() > 1 and !inst->mlen,
but that's about to change.
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Eric Anholt [Thu, 14 Mar 2013 21:41:37 +0000 (14:41 -0700)]
i965: Make the fragment shader pull constants index by dwords, not vec4s.
We want to load vec4s, since loading a vec4 instead of a dword adds
basically no latency. But for variable-indexed access, the previous
requirement of aligned vec4s for a sampler LD was hard to implement.
Note that this change only affects messages that use the surface format,
like sampler LDs, and not the untyped data cache loads we've used in other
cases.
No significant performance difference on my GLSL demo with uniforms forced
to take the varying pull constants path (n=4).
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
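As an illustration only (not the driver's code), the sketch below shows the extra address arithmetic that vec4-granular indexing implies: an arbitrary dword offset must be split into an aligned vec4 index plus a component select, while a dword-indexed interface can take the offset as-is. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical address of one dword inside an aligned vec4. */
struct vec4_slot {
   unsigned vec4_index; /* aligned vec4 to load */
   unsigned component;  /* dword within that vec4 (0..3) */
};

/* vec4-indexed addressing: split a dword offset into an aligned vec4
 * index and a component select. */
static struct vec4_slot split_dword_offset(unsigned dword_offset)
{
   struct vec4_slot s = { dword_offset / 4, dword_offset % 4 };
   return s;
}

/* Inverse mapping back to a dword offset, the form a dword-indexed
 * interface consumes directly. */
static unsigned dword_offset_of(struct vec4_slot s)
{
   assert(s.component < 4);
   return s.vec4_index * 4 + s.component;
}
```

For example, dword offset 6 maps to vec4 1, component 2.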
Eric Anholt [Wed, 20 Mar 2013 17:46:20 +0000 (10:46 -0700)]
i965: Make the constant surface interface take a normal byte size.
This puts the rounding-up logic into the function itself instead of all
the callers having to manage it. Also drop an "unused" comment in gen4,
as the stride *is* used for texbos (and will be for uniforms soon).
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
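A minimal sketch of the interface change, assuming the surface is described in 16-byte (vec4) elements. The constant and function names are hypothetical. The point is that callers pass a plain byte size and the rounding up lives in one place.

```c
#include <assert.h>

/* Assumed element size for illustration: one vec4 of 32-bit values. */
#define SKETCH_ELEMENT_SIZE 16u

/* Takes a normal byte size and rounds up internally, so a partial
 * trailing vec4 is still covered and callers don't each do the math. */
static unsigned constant_surface_elements(unsigned size_in_bytes)
{
   assert(size_in_bytes > 0);
   return (size_in_bytes + SKETCH_ELEMENT_SIZE - 1) / SKETCH_ELEMENT_SIZE;
}
```

Under this assumption, a 20-byte buffer (five floats) needs two elements.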
Eric Anholt [Wed, 13 Mar 2013 19:27:17 +0000 (12:27 -0700)]
i965/fs: Move varying uniform offset computation into the helper func.
I'm going to want to change the math for gen7 using sampler LD
instructions in a way that gets CSE to occur like we'd hope.
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Eric Anholt [Wed, 13 Mar 2013 19:17:25 +0000 (12:17 -0700)]
i965/fs: Remove creation of a MOV instruction that's never used.
We weren't inserting it into the list, so it did nothing. This line was
replaced by the MOV/MUL block above.
NOTE: This is a candidate for the 9.1 branch.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Eric Anholt [Fri, 15 Mar 2013 21:21:30 +0000 (14:21 -0700)]
i965/fs: Allow constant propagation into MACH.
This happens quite a bit with varying-index uniform loads. We could also
do better by avoiding the MACH entirely, but there's no reason not to at
least take this step.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Vincent Lejeune [Mon, 1 Apr 2013 21:50:20 +0000 (23:50 +0200)]
r600g/llvm: Update LLVM_REVISION.txt
Vincent Lejeune [Sat, 30 Mar 2013 01:09:15 +0000 (02:09 +0100)]
r600g/llvm: Use stack_size provided from llvm.
Vincent Lejeune [Sat, 30 Mar 2013 19:05:45 +0000 (20:05 +0100)]
r600g/llvm: uses function attribute to pass shader type
Vincent Lejeune [Tue, 26 Mar 2013 14:00:18 +0000 (15:00 +0100)]
r600g/llvm: Add support for cf_alu native encode
Haixia Shi [Mon, 1 Apr 2013 20:24:55 +0000 (13:24 -0700)]
ACTIVE_UNIFORM_MAX_LENGTH should include 3 extra characters for arrays.
If the active uniform is an array, then the length of the uniform name should
include the three extra characters for the "[0]" suffix, which is required by
the GL 4.2 spec to be appended to the uniform name in glGetActiveUniform().
This avoids the situation where the output buffer does not have enough space
to hold the "[0]" suffix, resulting in an incomplete array specification like
"foobar[0".
NOTE: This is a candidate for the 9.1 branch.
Change-Id: I41e87ba347a7169eec8c575596cc3416adbe0728
Signed-off-by: Haixia Shi <hshi@chromium.org>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
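A hedged sketch of the length computation described above, using a hypothetical `uniform_info` record rather than Mesa's own data structures. GL_ACTIVE_UNIFORM_MAX_LENGTH counts the null terminator, and this change adds 3 characters for the "[0]" suffix when the uniform is an array, so a buffer of the reported size can hold "foobar[0]" in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical uniform record for illustration. */
struct uniform_info {
   const char *name;
   bool is_array;
};

/* Longest name including its null terminator, plus 3 characters for
 * the "[0]" suffix appended to array uniform names. */
static size_t active_uniform_max_length(const struct uniform_info *u,
                                        size_t count)
{
   assert(u != NULL || count == 0);
   size_t max_len = 0;
   for (size_t i = 0; i < count; i++) {
      size_t len = strlen(u[i].name) + 1; /* +1 for '\0' */
      if (u[i].is_array)
         len += 3;                         /* "[0]" */
      if (len > max_len)
         max_len = len;
   }
   return max_len;
}
```

For an array uniform named "foobar", this reports 10: nine characters for "foobar[0]" plus the terminator.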