git.libre-soc.org Git - mesa.git/log

glsl_parser_extras.cpp: fixup gl vs mem contexts again.

This should fix:
https://bugs.freedesktop.org/show_bug.cgi?id=58039

Tested-by: Darxus on bug 58039
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

i965: Move BRW_MAX_GRF and similar defines to brw_reg.h.

These don't really belong in brw_structs.h.

Reviewed-by: Eric Anholt <eric@anholt.net>

i965: Split struct brw_reg out from brw_eu.h into its own header.

struct brw_instruction and the related instruction emitting code won't
be useful on Gen8+, as the instruction encoding changed. However, the
struct brw_reg code is still extremely valuable.

While we're at it, fix up some style points:
- s/GLuint/unsigned/g
- s/GLint/int/g
- s/GLshort/int16_t/g
- s/GLushort/uint16_t/g
- s/INLINE/inline/g
- Replace tabs with spaces
- Put return types on a separate line from the function name/parameters
- Remove trailing whitespace
- Remove extraneous whitespace around function parameters

Reviewed-by: Eric Anholt <eric@anholt.net>

docs: add ARB_texture_buffer_object_rgb32

Signed-off-by: Dave Airlie <airlied@redhat.com>

st/mesa: add texture buffer object rgb32 support.

This checks if the pipe driver can support RGB32 formats.

Reviewed-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

mesa: add support for ARB_texture_buffer_object_rgb32

This adds the extensions + the tex buffer support for checking
the formats.

There is a piglit test enhancement sent to that list.

Reviewed-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

glsl: avoid using gl context as a memory context

Not sure what was going on here, but running piglit with debug builds
might be a good plan :-)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>

i965: Add missing autoconf bits so test_vec4_register_coalesce will build

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Eric Anholt <eric@anholt.net>

i965: Generalize VS compute-to-MRF for compute-to-another-GRF, too.

No statistically significant performance difference on glbenchmark 2.7
(n=60). It reduces cycles spent in the vertex shader by 3.3% +/- 0.8%
(n=5), but that's only about .3% of all cycles spent according to the
fixed shader_time.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/vs: Extend opt_compute_to_mrf to handle limited "reswizzling"

The way our visitor works, scalar expression/swizzle results that get
stored in channels other than .x will have an intermediate MOV from
their result in the .x channel to the real .y (or whatever) channel, and
similarly for vec2/vec3 results.

By knowing how to adjust DP4-type instructions for optimizing out a
swizzled MOV, we can reduce instructions in common matrix multiplication
cases.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/vs: Add a unit test for opt_compute_to_mrf().

The compute-to-mrf code is really twitchy, and it's hard to construct
GLSL testcases for it. This unit test is also really hard to work with
(for example, if your instruction is removed by dead code elimination,
you end up inspecting something irrelevant), but I did use it for
debugging some of the commits to follow.

I called it test_vec4_register_coalesce because the compute-to-mrf code
is about to morph into that.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Drop an unnecessary _safe on a list walk.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Add a note explaining a detail of register_coalesce_2().

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Also consider HALTs a potential block end.

The final halt of the fragment shader turns off the remaining channels,
then jumps such that everything is turned back on. So, we can have our
last ENDIF of the shader point at that directly.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Jump to the end of the next outer conditional block on ENDIFs.

From the Ivybridge PRM, Volume 4, Part 3, section 6.24 (page 172):

"The endif instruction is also used to hop out of nested conditionals by
jumping to the end of the next outer conditional block when all
channels are disabled."

Also:
"Pseudocode:
Evaluate(WrEn);
if ( WrEn == 0 ) {  // all channels false
   Jump(IP + JIP);
}"

First, ENDIF re-enables any channels that were disabled because they
didn't match the conditional.  If any channels are active, it proceeds
to the next instruction (IP + 16).  However, if they're all disabled,
there's no point in walking through all of the instructions that have no
effect---it can jump to the next instruction that might re-enable some
channels (an ELSE, ENDIF, or WHILE).

Previously, we always set JIP on ENDIF instructions to 2 (which is
measured in 8-byte units).  This made it do Jump(IP + 16), which just
meant it would go to the next instruction even if all channels were off.

It turns out that walking over instructions while all the channels are
disabled like this is worse than just instruction dispatch overhead: if
there are texturing messages, it still costs a couple hundred cycles to
not-actually-read from the texture results.

This patch finds the next instruction that could re-enable channels and
sets JIP accordingly.
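
A minimal sketch of that search, assuming a simplified program where
every instruction is 16 bytes (2 JIP units of 8 bytes each). The Opcode
enum and the endif_jip() helper are hypothetical names used only for
illustration; this is not the driver's actual code, and it leaves out
HALT and the real instruction encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified opcode set; not the real i965 encoding.
enum class Opcode { ALU, SEND, IF, ELSE, ENDIF, DO, WHILE };

// True for instructions that might re-enable some channels.
bool may_reenable_channels(Opcode op)
{
    return op == Opcode::ELSE || op == Opcode::ENDIF || op == Opcode::WHILE;
}

// Returns the JIP value (in 8-byte units) for the ENDIF at endif_idx.
// Falls back to 2 (the next instruction) if no later block end exists.
int endif_jip(const std::vector<Opcode> &prog, std::size_t endif_idx)
{
    assert(endif_idx < prog.size() && prog[endif_idx] == Opcode::ENDIF);

    for (std::size_t i = endif_idx + 1; i < prog.size(); i++) {
        if (may_reenable_channels(prog[i]))
            return static_cast<int>(i - endif_idx) * 2;
    }
    return 2;
}
```

With an inner ENDIF followed by two ordinary instructions and then the
outer ENDIF, the inner ENDIF gets JIP = 6 instead of the old fixed 2.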

Reviewed-by: Eric Anholt <eric@anholt.net>

i965: expose ARB_texture_cube_map_array

V3: Put enable in an existing block rather than making a new
one for no good reason.

Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Fix setup for textureGrad(samplerCubeArray, coord, dPdx, dPdy)

Caught by tex_grad-01.frag.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Move the failure for gen7 16-wide intdiv to emit_math().

The cube map array code adds another caller of emit_math(), which
needs this check.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: fs: Add fixup for textureSize on Gen6/7

V2: Moved up into emit(ir_texture *) to avoid duplication and fix
ordering for Gen7; Gen6 math quirks moved into previous patches.

Tested on Gen6 only; passes all the cube_map_array piglits.

V3: Fixed weird whitespace
V4: Use sampler->type; otherwise broken on arrays of samplers.
v5: Minor style fixes (by anholt)

Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: fs: fix gen6+ math operands in one place

V4: Fix various style nits as pointed out by Eric, and expand IMM
operands on both Gen6 and Gen7.
v5: minor style nits (by anholt)

Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: vs: Add fixup for textureSize with cube array samplers

V3: Fixed weird whitespace
V4: Use sampler's type rather than variable's type; otherwise broken
with arrays of samplers. (Thanks Eric)
v5: Fix a couple more style nits (by anholt)

Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/vs: Fix gen6+ math operand quirks in one place

This causes immediate values to get moved to a temp on gen7, which is needed
for an upcoming change but hadn't happened in the visitor until then.

v2: Drop gen > 7 checks (doesn't exist), and style-fix comments (changes by
anholt).

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Add various plumbing for cubemap arrays

V4: Fixed style nits

Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Add empirically-determined instruction latencies for gen7.

v2: Actually switch on the other math instructions mentioned in the
comment.
v3: Add timing data for textureSize(), and clean up some long comment
lines.

Testing shader_time of fs16 shaders on a few frames of various apps:
nexuiz improved by 2.9% +/- 1.5% (n=10)
no difference on GLB2.5 (n=36, outliers removed)
no difference on GLB2.7 (n=25)
etqw improved by 2.6% +/- 2.2% (n=25)
no difference on lightsmark (n=25)

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Fix the clock increment in scheduling.

I've tested this to be true with various ALU ops on gen7 (with the
exception of MADs, which go at either 3 or 4 cycles per dispatch).

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Move the old gen4 bspec-based scheduling info to a helper func.

For gen7 everything changes, and we have actual information on latency.

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Set up gen7 UBO loads as sends from GRFs.

This gives the instruction scheduler a chance to schedule between the
loads, whereas before it was restricted due to the dependencies between
the MRFs for setting them up.

For one shader in gles3conform, it goes from getting stuck in register
allocation for as long as anybody's bothered to leave it running down
to 23 seconds, thanks to the LIFO scheduling.

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Before reg alloc, schedule instructions to reduce live ranges.

This came from an idea by Ben Segovia.  16-wide pixel shaders are very
important for latency hiding on i965, so we want to try really hard to
get them.  If scheduling an instruction makes some set of instructions
available, those are probably the ones that make the instruction's
result dead.  By choosing those first, we'll have a tendency to reduce
the amount of live data as opposed to creating more.

Previously, we were sometimes getting this behavior out of the
scheduler, which was what produced the scheduler's original performance
wins on lightsmark.  Unfortunately, that was mostly an accident of the
lame instruction latency information that I had, which made it
impossible to fix the actual scheduling for performance.  Now that we've
fixed the scheduling for setup for register allocation, we can safely
update the latency parameters for the final schedule.

In shader-db, we lose 37 16-wide shaders, but gain 90 new ones.  4
shaders that were spilling change how many registers spill, for a
reduction of 70/3899 instructions.

v2: Simplify the new loop.

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Add some optional debug printfs to scheduling.

Seeing when instructions become available to schedule is really useful.

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Schedule instructions both before and after register allocation.

Acked-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Make sure that the shader_time report at context destroy happens.

Otherwise, you end up with some report from within a second of context
destroy, which is not what you really want for testing the impact of
changes.

i965: Print a total time for the different shader stages.

Sometimes I've got a patch for a performance optimization that's not
showing a statistically significant performance difference on reported
FPS, but still seems like a good idea because it ought to reduce time
spent in the shader. If I can see the total number of cycles spent in
the shader stage being optimized, it may show that the patch is still
worthwhile (or point out that it's actually broken in some way).

i965: Scale shader_time to compensate for resets.

Some shaders experience resets more than others, which skews the numbers
reported. Attempt to correct for this by linearly scaling according to
the number of resets that happen.

Note that this will not be accurate if invocations of shaders have varying
times and longer invocations are more likely to reset. However, this
should at least be better than the previous situation.
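
A sketch of the linear scaling, assuming we track how many invocations
managed to record a time ("written") and how many were lost to a reset
("resets"). The function and parameter names are hypothetical and only
illustrate the arithmetic.

```cpp
#include <cstdint>

// Extrapolate the total time of a shader from the invocations that
// recorded a time to all invocations, including those lost to resets.
double scale_shader_time(std::uint64_t total_time,
                         std::uint64_t written,
                         std::uint64_t resets)
{
    if (written == 0)
        return 0.0; // no samples, nothing to extrapolate from

    double mean = static_cast<double>(total_time) /
                  static_cast<double>(written);
    return mean * static_cast<double>(written + resets);
}
```

This treats every lost invocation as if it took the mean recorded time,
which is exactly the assumption the note above warns about.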

i965: Adjust the split between shader_time_end() and shader_time_write().

I'm about to emit other kinds of writes besides time deltas, and it
turns out with the frequency of resets, we couldn't really use the old
time delta write() function more than once in a shader.

glsl/linker: Pack between varyings.

This patch implements varying packing between varyings.

Previously, each varying occupied components 0 through N-1 of its
assigned varying slot, so there was no way to pack two varyings into
the same slot.  For example, if the varyings were a float, a vec2, a
vec3, and another vec2, they would be stored as follows:

<----slot1----> <----slot2----> <----slot3----> <----slot4---->  slots
  *   *   *   *   *   *   *   *   *   *   *   *   *   *   *   *
flt  x   x   x  <vec2->  x   x  <--vec3--->  x  <vec2->  x   x   varyings

(Each * represents a varying component, and the "x"s represent wasted
space).

This change packs the varyings together to eliminate wasted space
between varyings, like so:

<----slot1----> <----slot2----> <----slot3----> <----slot4---->  slots
  *   *   *   *   *   *   *   *   *   *   *   *   *   *   *   *
<vec2-> <vec2-> flt <--vec3--->  x   x   x   x   x   x   x   x   varyings

Note that we take advantage of the sort order introduced in previous
patches (vec4's first, then vec2's, then scalars, then vec3's) to
minimize how often a varying is "double parked" (split across varying
slots).
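
As a rough illustration of the packed layout, the sketch below assigns
consecutive components to varyings that are already sorted and all
belong to one packing class. PackedLocation and pack_between() are
hypothetical names, not the linker's real data structures.

```cpp
#include <cassert>
#include <vector>

struct PackedLocation {
    unsigned slot; // index of the vec4 varying slot
    unsigned frac; // first component used within that slot (0..3)
};

// sizes holds the component count (1..4) of each varying, in the
// order in which the varyings are to be packed.
std::vector<PackedLocation> pack_between(const std::vector<unsigned> &sizes)
{
    std::vector<PackedLocation> out;
    unsigned next = 0; // next free component, counted across all slots

    for (unsigned n : sizes) {
        assert(n >= 1 && n <= 4);
        out.push_back({next / 4, next % 4});
        next += n;
    }
    return out;
}
```

For the sorted example above (vec2, vec2, float, vec3) this yields
slot 0 component 0, slot 0 component 2, slot 1 component 0 and slot 1
component 1, matching the second diagram.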

Reviewed-by: Eric Anholt <eric@anholt.net>
v2: Skip varying packing if ctx->Const.DisableVaryingPacking is true.

glsl/linker: Pack within compound varyings.

This patch implements varying packing within varyings that are
composed of multiple vectors of size less than 4 (e.g. arrays of
vec2's, or matrices with height less than 4).

Previously, such varyings used up a full 4-wide varying slot for each
constituent vector, meaning that some of the components of each
varying slot went unused.  For example, a mat4x3 would be stored as
follows:

<----slot1----> <----slot2----> <----slot3----> <----slot4---->  slots
  *   *   *   *   *   *   *   *   *   *   *   *   *   *   *   *
<-column1->  x  <-column2->  x  <-column3->  x  <-column4->  x   matrix

(Each * represents a varying component, and the "x"s represent wasted
space).  In addition to wasting precious varying components, this
layout complicated transform feedback, since the constituents of the
varying are expected to be output to the transform feedback buffer
contiguously (e.g. without gaps between the columns, in the case of a
matrix).

This change packs the constituents of each varying together so that
all wasted space is at the end.  For the mat4x3 example, this looks
like so:

<----slot1----> <----slot2----> <----slot3----> <----slot4---->  slots
  *   *   *   *   *   *   *   *   *   *   *   *   *   *   *   *
<-column1-> <-column2-> <-column3-> <-column4->  x   x   x   x   matrix

Note that matrix columns 2 and 3 now cross a boundary between varying
slots (a characteristic I call "double parking" of a varying).
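
The placement of each column can be sketched as follows. The names
ColumnPlacement and place_column() are made up for this example, and
columns are numbered from 0 here, whereas the diagrams count from 1.

```cpp
#include <cassert>

struct ColumnPlacement {
    unsigned slot;      // varying slot holding the column's first component
    unsigned frac;      // component within that slot (0..3)
    bool double_parked; // true if the column spills into the next slot
};

// Place column number `column` of a matrix with `rows` components per
// column, packing the columns back to back.
ColumnPlacement place_column(unsigned column, unsigned rows)
{
    assert(rows >= 1 && rows <= 4);

    unsigned first = column * rows;
    unsigned frac = first % 4;
    return {first / 4, frac, frac + rows > 4};
}
```

For a mat4x3 (rows = 3), columns 1 and 2 in this 0-based numbering are
double parked, which are the "columns 2 and 3" of the diagram.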

We don't bother trying to eliminate the wasted space at the end of the
varying, since the patch that follows will take care of that.

Since compiler back-ends don't (yet) support this packed layout, the
lower_packed_varyings function is used to rewrite the shader into a
form where each varying occupies a full varying slot.  Later, if we
add native back-end support for varying packing, we can make this
lowering pass optional.

Reviewed-by: Eric Anholt <eric@anholt.net>
v2: Skip varying packing if ctx->Const.DisableVaryingPacking is true.

gallium: Disable varying packing on hardware with <=8 texture indirections.

In practice this will disable varying packing on R300, R400, i915g,
and nv30.

Reviewed-by: Marek Olšák <maraeo@gmail.com>

mesa: Add an option so driver can opt out of varying packing.

On hardware that supports a limited number of texture indirections,
varying packing will consume an extra texture indirection, since ALU
operations are needed in the fragment shader to unpack the varyings
before any texturing can be done.

This patch introduces a new driver option,
ctx->Const.DisableVaryingPacking, which can be used by a driver to opt
out of varying packing if the extra texture indirection is costly
enough to outweigh the advantages of packing varyings.

Reviewed-by: Marek Olšák <maraeo@gmail.com>

glsl: Add a lowering pass for packing varyings.

This lowering pass generates GLSL code that manually packs varyings
into vec4 slots, for the benefit of back-ends that don't support
packed varyings natively.

No functional change--the lowering pass is not yet used.

Reviewed-by: Eric Anholt <eric@anholt.net>
v2: Don't use ir_hierarchical_visitor--just loop over instructions
directly. Also, make the names of the packed varyings include the
names of the original varyings that were packed into them.

glsl/linker: Sort varyings by packing class, then vector size.

This patch paves the way for varying packing by adding a sorting step
before varying assignment, which sorts the varyings into an order that
increases the likelihood of being able to find an efficient packing.

First, varyings are sorted into "packing classes" by considering
attributes that can't be mixed during varying packing--at the moment
this includes base type (float/int/uint/bool) and interpolation mode
(smooth/noperspective/flat/centroid), though later we will hopefully
be able to relax some of these restrictions. The number of packing
classes places an upper limit on the amount of space that must be
wasted by varying packing, since in theory a shader might have 4n+1
components worth of varyings in each of m packing classes, resulting
in 3m components worth of wasted space.
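
The bound can be checked with a small sketch, assuming each packing
class is padded out to a whole vec4 slot because classes cannot share
one. wasted_components() is a hypothetical helper written for this
example only.

```cpp
#include <vector>

// components_per_class[i] is the total number of varying components
// in packing class i. Returns the padding needed to round every class
// up to a multiple of 4 components.
unsigned wasted_components(const std::vector<unsigned> &components_per_class)
{
    unsigned wasted = 0;
    for (unsigned c : components_per_class)
        wasted += (4 - c % 4) % 4;
    return wasted;
}
```

Three classes holding 5, 9 and 13 components (each of the form 4n+1)
waste 3 components apiece, 9 in total, which is the 3m worst case.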

Then, within each packing class, varyings are sorted by vector size,
with vec4's coming first, then vec2's, then scalars, and then finally
vec3's. The motivation for this order is that it ensures that the
only vectors that might be "double parked" (with part of the vector in
one varying slot and the remainder in another) are vec3's.
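
A sketch of such an ordering, using an integer as a stand-in for the
packing class. Varying, size_rank() and sort_for_packing() are
hypothetical names and do not mirror the linker's actual code.

```cpp
#include <algorithm>
#include <vector>

struct Varying {
    unsigned packing_class; // stand-in for base type + interpolation mode
    unsigned components;    // 1 (scalar) through 4 (vec4)
};

// Lower rank sorts first: vec4, then vec2, then scalar, then vec3.
unsigned size_rank(unsigned components)
{
    switch (components) {
    case 4:  return 0;
    case 2:  return 1;
    case 1:  return 2;
    default: return 3; // vec3
    }
}

void sort_for_packing(std::vector<Varying> &varyings)
{
    std::stable_sort(varyings.begin(), varyings.end(),
                     [](const Varying &a, const Varying &b) {
                         if (a.packing_class != b.packing_class)
                             return a.packing_class < b.packing_class;
                         return size_rank(a.components) <
                                size_rank(b.components);
                     });
}
```

Within a class, vec4's keep the running offset a multiple of 4 and
vec2's keep it even, so a vec2 always lands on component 0 or 2 and a
scalar always fits; only a vec3 can straddle a slot boundary.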

Note that the varyings aren't actually packed yet, merely placed in an
order that will facilitate packing.

Reviewed-by: Eric Anholt <eric@anholt.net>

glsl/linker: Subdivide the first phase of varying assignment.

This patch further subdivides the loop that assigns varying locations
into two phases: one phase to match up the varyings between shader
stages, and one phase to assign them varying locations.

In between the two phases the matched varyings are stored in a new
data structure called varying_matches. This will free us to be able
to assign varying locations in any order, which will pave the way for
packing varyings.

Note that the new varying_matches::assign_locations() function returns
the number of varying slots that were used; this return value will be
used in a future patch.

Reviewed-by: Eric Anholt <eric@anholt.net>

glsl/linker: Defer recording transform feedback locations.

This patch subdivides the loop that assigns varying locations into two
phases: one phase to match up varyings between shader stages (and
assign them varying locations), and a second phase to record the
varying assignments for use by transform feedback.

This paves the way for varying packing, which will require us to
further subdivide the first phase.

In addition, it lets us avoid a clumsy O(n^2) algorithm, since we can
now record the locations of all transform feedback varyings in a
single pass through the tfeedback_decls array, rather than have to
iterate through the array after assigning each varying.

Reviewed-by: Eric Anholt <eric@anholt.net>

glsl: Create a field to store fractional varying locations.

Currently, the location of each varying is recorded in ir_variable as
a multiple of the size of a vec4. In order to pack varyings, we need
to be able to record, e.g. that a vec2 is stored in the second half of
a varying slot rather than the first half.

This patch introduces a field ir_variable::location_frac, which
represents the offset within a vec4 where a varying's value is stored.
Varyings that are not subject to packing will always have a
location_frac value of zero.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

glsl/linker: Make separate ir_variable field to mean "unmatched".

Previously, the linker used a value of -1 in ir_variable::location to
denote a generic input or output of the shader that had not yet been
matched up to a variable in another pipeline stage.

This patch introduces a new ir_variable field,
is_unmatched_generic_inout, for that purpose.

In future patches, this will allow us to separate the process of
matching varyings between shader stages from the processes of
assigning locations to those varyings. That will in turn pave the way
for packing varyings.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

glsl/linker: Always invalidate shader ins/outs, even in corner cases.

Previously, link_invalidate_variable_locations() was only called
during assign_attribute_or_color_locations() and
assign_varying_locations(). This meant that in the corner case when
there was only a vertex shader, and varyings were being captured by
transform feedback, link_invalidate_variable_locations() wasn't being
called for the varyings.

This patch migrates the calls to link_invalidate_variable_locations()
to link_shaders(), so that they will be called in all circumstances.
In addition, it modifies the call semantics so that
link_invalidate_variable_locations() need only be called once per
shader stage (rather than once for inputs and once for outputs).

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

glsl/lower_clip_distance: Update symbol table.

This patch modifies the clip distance lowering pass so that the new
symbol it generates (gl_ClipDistanceMESA) is added to the shader's
symbol table.

This will allow a later patch to modify the linker so that it finds
transform feedback varyings using the symbol table rather than having
to iterate through all the declarations in the shader.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

android: build fix for libmesa_glsl_utils

hash_table.c compilation requires ralloc.h include path

Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Chad Versace <chad.versace@linux.intel.com>

mesa: minor indentation fixes in texcompress_etc.c

mesa: remove old swrast-based compressed texel fetch code

swrast: use new core Mesa compressed texel fetch functions

mesa: reimplement _mesa_decompress_image() using new tex fetch code

mesa: added _mesa_get_compressed_fetch_func()

mesa: add new texel fetch code for etc formats

mesa: add new texel fetch code for rgtc formats

mesa: add new texel fetch code for fxt formats

mesa: add new texel fetch code for dxt formats

mesa: add compressed_fetch_func typedef

This is a first step in removing the swrast-related code in core
Mesa's texture compression files.

swrast: merge get_texel_fetch_func() and set_fetch_functions()

No real need for separate functions anymore.

swrast: make _mesa_get_texel_fetch_func() static

Not called from any other file.

draw/llvmpipe: fix transform feedback position + enable other extensions

This builds on the previous draw/softpipe patch.

llvmpipe does streamout calls after the clip/viewport stages, but we
have the pre-clip position stored for later use. So when we are doing
transform feedback and the output being captured is the position, grab
the value from the stored pre-clip position.

The perfect fix is probably to add a codegen transform feedback
stage in between the shader and clip stages, but this is good enough
for now.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

draw: add support for later transform feedback extensions

This adds support to draw for the new features of transform feedback.

a) fix count_from_stream_output, using max_index+1 for now but it looks
like it should be valid as it's derived from the vertex elements/vbo.

b) fix striding and dst offsets in output buffers - was just wrong before.

c) fix crash if tfb is suspended (so.num_targets == 0)

This also enables the new features on softpipe. It should be possible
to enable them on llvmpipe as well after this commit, but would need
to schedule piglit runs.

Signed-off-by: Dave Airlie <airlied@redhat.com>

clover: Fix build since removal of pipe_surface::usage

by commit 25409c6da8163d9acb386511aef0c11577c7aadb

r600g/radeonsi: Silence warnings

Reviewed-by: Tom Stellard <thomas.stellard@amd.com>

clover: Add support for compiler flags

Reviewed-by: Francisco Jerez <currojerez@riseup.net>

clover: Don't erase build info of devices not being built

Every call to _cl_program::build() was erasing the binaries and logs for
every device associated with the program. This is incorrect because
it is possible to build a program for only a subset of devices and so
any device not being built should not have this information erased.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>

r600g: use load_ar checks with llvm output.

Reviewed-by: Tom Stellard <thomas.stellard@amd.com>

build: Fix AX_PROG_{CC,CXX}_FOR_BUILD macros

Override the cross_compiling and ac_tool_prefix variables by reassigning
to them instead of redefining the macros. Redefining them will actually
cause the variable names to be replaced instead of their content.

Furthermore push the definition of CPPFLAGS before running the checks
for the build tools to avoid the host CPPFLAGS from leaking into the
build CPPFLAGS.

While at it drop the redefinition of AC_TRY_COMPILER which hasn't been
used since autoconf 2.50 and make sure that all definitions are properly
popped when done (LDFLAGS, ac_cv_prog_CPP, ac_cv_prog_CXXCPP).

Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>

gallivm: fix texel fetch for array textures

Since we don't call lp_build_sample_common() in the texel fetch path we missed
the layer fixup code. If someone had tried to do texelFetch with array
textures it would have crashed for sure.
Not really tested (the piglit test that uses texelFetch with array samplers
can't be run with llvmpipe for now).

Reviewed-by: José Fonseca <jfonseca@vmware.com>

mesa: Fix computation of default vertex attrib stride for 2_10_10_10 formats.

Previously, if the client program didn't specify a stride when setting
up a vertex attribute, we used _mesa_sizeof_type() to compute the size
of the type, and multiplied it by the number of components.

This didn't work for the 2_10_10_10 formats, since _mesa_sizeof_type()
returns -1 for those types, resulting in all kinds of havoc, since it
was causing the hardware to be programmed with a negative stride
value.

This patch adds a new function _mesa_bytes_per_vertex_attrib(), which
is similar to the existing function _mesa_bytes_per_pixel(), but which
computes the size of a vertex attribute based on the type and the
number of components. For packed formats (currently only the 2_10_10_10
formats), it verifies that the number of components is correct and
returns the size of the packed format. For unpacked formats, it
returns the size of the type times the number of components.
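
The shape of such a helper might look like the sketch below. It uses a
made-up AttribType enum instead of the real GL type tokens, and it is
not the actual _mesa_bytes_per_vertex_attrib() implementation.

```cpp
#include <cassert>

// Hypothetical subset of vertex attribute types.
enum class AttribType { Byte, Short, Int, Float, Packed_2_10_10_10 };

// Returns the size in bytes of one vertex attribute, or -1 if the
// component count is not valid for a packed type.
int bytes_per_vertex_attrib(int comps, AttribType type)
{
    assert(comps >= 1 && comps <= 4);

    switch (type) {
    case AttribType::Byte:
        return comps * 1;
    case AttribType::Short:
        return comps * 2;
    case AttribType::Int:
    case AttribType::Float:
        return comps * 4;
    case AttribType::Packed_2_10_10_10:
        // All four components share a single 32-bit word.
        return comps == 4 ? 4 : -1;
    }
    return -1; // unknown type
}
```

A packed attribute therefore gets a default stride of 4 bytes rather
than a negative value derived from a per-component size of -1.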

In addition, this patch adds an assertion so that if we ever forget to
update _mesa_bytes_per_vertex_attrib() when adding a new vertex
format, we'll see the problem quickly rather than having to debug a
subtle conformance test failure.

Fixes GLES3 conformance tests
vertex_type_2_10_10_10_rev_{conversion,divisor,stride_pointer}.test.

Reviewed-by: Brian Paul <brianp@vmware.com>

mesa/uniform_query: Don't write to *params if there is an error

The GL 3.1 and ES 3.0 specs say of glGetActiveUniformsiv:
"If an error occurs, nothing will be written to params."

So, make a pass through the indices and check that they're valid before
the pass that actually writes to params. Checking pname happens on the
first iteration of the second loop.
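
The validate-then-write pattern, reduced to a sketch. Here
get_active_uniform_sizes() and its arguments are hypothetical stand-ins
for the real query, and the pname check is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes one value per index into params. On an invalid index it
// returns false and leaves params completely untouched.
bool get_active_uniform_sizes(const std::vector<int> &uniform_sizes,
                              const std::vector<unsigned> &indices,
                              std::vector<int> &params)
{
    assert(params.size() >= indices.size());

    // Pass 1: validate every index before writing anything.
    for (unsigned idx : indices) {
        if (idx >= uniform_sizes.size())
            return false;
    }

    // Pass 2: all indices are known to be valid, so writing is safe.
    for (std::size_t i = 0; i < indices.size(); i++)
        params[i] = uniform_sizes[indices[i]];

    return true;
}
```

A single-pass version would already have written params[0] by the time
it noticed that a later index was out of range.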

Fixes es3conform's getactiveuniformsiv_for_nonexistent_uniform_indices
test.

NOTE: This is a candidate for the 9.0 branch.
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

mesa: print unsigned values with %u

Otherwise messages say silly things like
glGetActiveUniformBlockiv(block index -1 >= 0)
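
For illustration, a sketch of the corrected formatting, assuming a
32-bit unsigned int. format_block_index() is a hypothetical helper, not
the Mesa error path itself.

```cpp
#include <cstdio>
#include <string>

// Format an out-of-range block index using %u, so that a large
// unsigned value is not shown as a negative number.
std::string format_block_index(unsigned index, unsigned count)
{
    char buf[64];
    std::snprintf(buf, sizeof(buf), "block index %u >= %u", index, count);
    return std::string(buf);
}
```

An index of 0xFFFFFFFF then reads as 4294967295 instead of -1.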

Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Fix disassembly of jump targets on Gen7.

Gen7 stores the JIP/UIP bits in different places.

Reviewed-by: Eric Anholt <eric@anholt.net>

i965: Make try_rewrite_rhs_to_dst compare VGRF size to regs written.

try_rewrite_rhs_to_dst is a quick optimization to avoid generating new
temporaries (and MOVs from those temporaries to the dest) for every
expression tree we visit.  By generating better code in simple cases, we
reduce the burden on later optimization passes like register coalescing.

Previously, we compared inst->regs_written() to lhs->vector_elements
to make sure the instruction generating our value wrote the same number
of components as our destination register.

However, this fails in some cases.  One example is texturing (which
produces a vec4) into gl_FragData[i].  Technically, gl_FragData[i] is
also a vec4.  However, the destination VGRF actually has size 4n (where
n is the size of the array).

split_virtual_grfs() can't split VGRFs that are used by SEND messages
which require contiguous destination registers (like texturing), and
register allocation needs all VGRFs to have sizes between 1 and 4.

Amnesia: The Dark Descent hits this case: a texturing instruction
(4 components) gets rewritten to the gl_FragData output register
(which was 4*3 = 12 components), causing the register allocator to
hit the "we rely on split_virtual_grfs" assertion.

This makes it possible to play Amnesia.

Reviewed-by: Eric Anholt <eric@anholt.net>

configure.ac: Disable compiler optimizations when --enable-debug is set

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Reviewed-by: Dan Nicholson <dbn.lists@gmail.com>
Reviewed-by: Chad Versace <chad.versace@linux.intel.com>

softpipe: remove unused corner0 variable

llvmpipe: remove unneeded draw_flush() call

This is redundant since we're calling draw_bind_fragment_shader()
which already does a flush.

v2: the redundant flush in llvmpipe_set_constant_buffer() has
already been removed by commit 3427466e6dbbb8db7c1ecda6b3859ca1cc5827a3

Reviewed-by: José Fonseca <jfonseca@vmware.com>

r600g: suballocate memory for fetch shaders from a large buffer

Fetch shaders are usually destroyed at the context destruction by the state
tracker, so we can put them all in a large buffer without wasting memory.

This reduces the number of relocations sent to the kernel a little bit.

Tested-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

r600g: suballocate memory for the STRMOUT_BUFFER_FILLED_SIZE register

Instead of having a 4-byte buffer for each streamout target, we suballocate
each dword from a 4K buffer.

This further reduces the overall number of relocations.
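
The idea can be sketched as a simple bump allocator over one parent
buffer. The Suballocator class below is hypothetical and is not the
gallium/util allocator; starting a new parent buffer is only simulated
by a counter.

```cpp
#include <cassert>
#include <cstdint>

class Suballocator {
public:
    // alignment must be a power of two.
    Suballocator(std::uint32_t buffer_size, std::uint32_t alignment)
        : size_(buffer_size), align_(alignment), offset_(0), buffers_(1)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    // Returns the byte offset of the new range in the current buffer.
    std::uint32_t alloc(std::uint32_t bytes)
    {
        assert(bytes <= size_);

        std::uint32_t start = (offset_ + align_ - 1) & ~(align_ - 1);
        if (start + bytes > size_) {
            buffers_++; // stand-in for allocating a fresh parent buffer
            start = 0;
        }
        offset_ = start + bytes;
        return start;
    }

    unsigned buffers() const { return buffers_; }

private:
    std::uint32_t size_;
    std::uint32_t align_;
    std::uint32_t offset_;
    unsigned buffers_;
};
```

With a 4K parent buffer and 4-byte allocations, 1024 dwords share one
buffer before a second one is needed.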

Tested-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

gallium/util: add a simple allocator for suballocating from a large buffer

Tested-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

r600g: use u_upload_mgr for allocating staging transfer buffers

u_upload_mgr suballocates memory from a large buffer and maps the allocated
range (unsynchronized), which is perfect for short-lived staging buffers.

This reduces the number of relocations sent to the kernel.

Tested-by: Aaron Watry <awatry@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

winsys/radeon: don't use BIND flags, add a flag for the cache bufmgr instead

st/dri: add a way to force MSAA on with an environment variable

There are 2 ways. I prefer the former:
GALLIUM_MSAA=n
__GL_FSAA_MODE=n

Tested with ETQW, which doesn't support MSAA on Linux. This is
the only way to get MSAA there.

Reviewed-by: Brian Paul <brianp@vmware.com>
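
A minimal sketch of how a sample count might be read from the two variables named above. The helper names are hypothetical, and the lookup order (GALLIUM_MSAA first, then __GL_FSAA_MODE) is an assumption based only on the stated preference, not on the code in this commit.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Parses a decimal sample count; returns 0 (no forced MSAA) for a
 * missing, empty, malformed, or non-positive value. */
static int
parse_sample_count(const char *value)
{
   char *end;
   long n;

   if (value == NULL || *value == '\0')
      return 0;

   n = strtol(value, &end, 10);
   if (*end != '\0' || n <= 0 || n > INT_MAX)
      return 0;
   return (int)n;
}

/* Reads one environment variable as a sample count. */
static int
env_sample_count(const char *name)
{
   assert(name != NULL);
   return parse_sample_count(getenv(name));
}

/* Assumed precedence: GALLIUM_MSAA wins, __GL_FSAA_MODE is the fallback. */
static int
forced_msaa_samples(void)
{
   int n = env_sample_count("GALLIUM_MSAA");

   if (n == 0)
      n = env_sample_count("__GL_FSAA_MODE");
   return n;
}
```

Under these assumptions, running an application with GALLIUM_MSAA=4 in its environment would make forced_msaa_samples() return 4.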

mesa: don't advertise ARB_texture_buffer_object in legacy contexts

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

mesa: disallow creation of GL 3.1 compatibility contexts

Death to driver-specific hacks!

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

gallium: remove pipe_surface::usage

Not really used by anybody now.

Reviewed-by: Brian Paul <brianp@vmware.com>

svga: stop using pipe_surface::usage

There are only 2 possible usages: render target and depth stencil.
Both can be derived from the surface format, so the flag is redundant.

And it's going away...

Reviewed-by: Brian Paul <brianp@vmware.com>
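
To illustrate why the flag is redundant, here is a sketch that derives the usage from the format alone. The enums and function names are hypothetical stand-ins with a handful of example formats, not the gallium or svga definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced format list for illustration only. */
enum sketch_format {
   SKETCH_FORMAT_B8G8R8A8_UNORM,
   SKETCH_FORMAT_R16G16B16A16_FLOAT,
   SKETCH_FORMAT_Z16_UNORM,
   SKETCH_FORMAT_Z24_UNORM_S8_UINT,
   SKETCH_FORMAT_S8_UINT,
   SKETCH_FORMAT_COUNT
};

/* The only two usages a surface can have. */
enum sketch_usage {
   SKETCH_USAGE_RENDER_TARGET,
   SKETCH_USAGE_DEPTH_STENCIL
};

static bool
sketch_format_is_depth_or_stencil(enum sketch_format format)
{
   assert(format < SKETCH_FORMAT_COUNT);

   switch (format) {
   case SKETCH_FORMAT_Z16_UNORM:
   case SKETCH_FORMAT_Z24_UNORM_S8_UINT:
   case SKETCH_FORMAT_S8_UINT:
      return true;
   default:
      return false;
   }
}

/* With only two possible usages, the format decides: depth/stencil
 * formats give a depth-stencil surface, everything else a render target. */
static enum sketch_usage
sketch_usage_from_format(enum sketch_format format)
{
   return sketch_format_is_depth_or_stencil(format)
      ? SKETCH_USAGE_DEPTH_STENCIL
      : SKETCH_USAGE_RENDER_TARGET;
}
```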

gallium/util: move util_try_blit_via_copy_region to u_surface.c

Reviewed-by: Brian Paul <brianp@vmware.com>

gallium/cso: don't use the pipe_error return type where it's not needed

Reviewed-by: Brian Paul <brianp@vmware.com>

gallium: manage render condition in cso_context and fix postprocessing w/ it

Reviewed-by: Brian Paul <brianp@vmware.com>

st/mesa: remove a weird msaa hack

It doesn't work and it's not clear how it's supposed to work.

Reviewed-by: Brian Paul <brianp@vmware.com>

softpipe: implement seamless cubemap support. (v1.1)

This adds seamless sampling for cubemap boundaries if requested.

The corner case averaging is messy but seems like it should be spec
compliant.

The face direction stuff is also a bit messy, I've no idea if that could
or should be simpler, or even if all my directions are fully correct!

v1.1: update comments, drop unneeded seamless calls for nearest, fix
if statement layout.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

gallium: fix cap warnings for tbo cap.

Signed-off-by: Dave Airlie <airlied@redhat.com>

glsl_to_tgsi: emit multi-level structs and arrays properly.

This follows the code from the i965 driver, and emits the structs
and arrays recursively.

This fixes an assert in the two UBO tests
fs-struct-copy-complicated and
vs-struct-copy-complicated

These tests now pass on softpipe, with no regressions.

Signed-off-by: Dave Airlie <airlied@redhat.com>

llvmpipe: don't use user constant buffers

This fixes some use-after-free issues. I haven't measured any real
performance difference with a handful of Mesa demos.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>

llvmpipe: support pipe_resource-based constant buffers

Before this we only supported user-based constant buffers.

First, we basically plumb pipe_constant_buffer objects through llvmpipe
rather than pipe_resource objects.

Second, update llvmpipe_set_constant_buffer() and try_update_scene_state()
so they understand both resource- and user-based constant buffers.

The problem with user constant buffers is the potential for use-after-free,
as seen in some WebGL tests. The next patch will flip the switch for
resource-based const buffers.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
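
The use-after-free hazard can be sketched as follows: if a driver keeps only the caller's pointer, the data may be freed before it is read, whereas copying into driver-owned storage at bind time removes that dependency. The struct and function below are hypothetical and heavily simplified; they are not llvmpipe's code, and heap memory stands in for a resource.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a constant buffer binding. */
struct const_binding {
   void *storage;   /* driver-owned copy of the constants, or NULL */
   size_t size;     /* size of the copy in bytes */
};

/* Copies the user data into driver-owned storage, so the caller may free
 * or reuse its pointer as soon as this returns. Passing NULL or a zero
 * size unbinds. Returns 0 on success, -1 on allocation failure (in which
 * case the previous binding is left untouched). */
static int
bind_user_constants(struct const_binding *b, const void *user_data,
                    size_t size)
{
   void *copy = NULL;

   assert(b != NULL);

   if (user_data != NULL && size != 0) {
      copy = malloc(size);
      if (copy == NULL)
         return -1;
      memcpy(copy, user_data, size);
   }

   free(b->storage);   /* release the previous binding, if any */
   b->storage = copy;
   b->size = copy ? size : 0;
   return 0;
}
```

The trade-off in this sketch is one extra copy per bind in exchange for the binding no longer depending on the lifetime of the caller's memory.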

util: add util_copy_constant_buffer() helper function

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>

i965/fs: Improve performance of shaders that start out with a discard.

I had tried this in the past, but ran into trouble with applications
that sample from undiscarded pixels in the same subspan.  To fix that
issue, only jump to the end for an entire subspan at a time.

Improves GLbenchmark 2.7 (1024x768) performance by 7.9 +/- 1.5% (n=8).

v2: Drop the br variable in the jump instruction -- if I ever do jumps
    pre-gen6, it'll be a different code block anyway since we don't have
    HALT until gen6.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
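
The "entire subspan at a time" rule can be sketched on a bitmask of discarded pixels. A subspan is a 2x2 block of pixels; the assumption here is that each subspan occupies four consecutive bits of the mask. The function name is hypothetical, and this models the decision on the CPU rather than showing the generated shader code.

```c
#include <assert.h>
#include <stdint.h>

#define PIXELS_PER_SUBSPAN 4

/* Given a mask with one bit per pixel (1 = discarded), returns the mask of
 * pixels that may jump to the end of the shader: only pixels whose whole
 * subspan has been discarded. A partially discarded subspan keeps running,
 * so its live pixels can still sample from their neighbours. */
static uint32_t
subspans_ready_to_exit(uint32_t discarded, unsigned num_pixels)
{
   uint32_t result = 0;
   unsigned first;

   assert(num_pixels > 0 && num_pixels <= 32);
   assert(num_pixels % PIXELS_PER_SUBSPAN == 0);

   for (first = 0; first < num_pixels; first += PIXELS_PER_SUBSPAN) {
      uint32_t subspan = UINT32_C(0xF) << first;

      if ((discarded & subspan) == subspan)
         result |= subspan;
   }
   return result;
}
```

For example, with 8 pixels and a discard mask of 0xF3, only the upper subspan (0xF0) may exit early; the two discarded pixels in the lower subspan must keep executing alongside their two live neighbours.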

i965/fs: Rewrite discards to use a flag subreg to track discarded pixels.

This makes much more sense on gen6+, and will also prove useful for
early exit of shaders on discard.

v2: fix up a stale comment from before converting gen4-5.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/fs: Add an instruction flag for choosing the flag subregister.

We're going to redo discard handling to track discards in the other flag
subregister, saving instructions in the discard and allowing predicated
jumps out to the end of the shader.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: Let brw_flag_reg() choose the flag reg and subreg.

We're about to start using the f0.1 subregister.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>