git.libre-soc.org Git - mesa.git/log

i965/vec4: Optimize sqrt+inv into rsq.

Transform

   sqrt a, b
   rcp  c, a

into

   sqrt a, b
   rsq  c, b

In most cases the sqrt's result is still used, so the improvement here
is that we've broken a dependency between these instructions. Leads to
80 fewer INV instructions and 80 more RSQ.

Occasionally the sqrt's result is no longer used, leading to:

instructions in affected programs:     5005 -> 4949 (-1.12%)

Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

i965/vec4: Call opt_algebraic after opt_cse.

The next patch adds an algebraic optimization for the pattern

   sqrt a, b
   rcp  c, a

and turns it into

   sqrt a, b
   rsq  c, b

but many vertex shaders do

   a = sqrt(b);
   var1 /= a;
   var2 /= a;

which generates

   sqrt a, b
   rcp  c, a
   rcp  d, a

If we apply the algebraic optimization before CSE, we'll end up with

   sqrt a, b
   rsq  c, b
   rcp  d, a

Applying CSE combines the RCP instructions, preventing this from
happening.

No shader-db changes.

Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

i965/fs: Extend predicated break pass to predicate WHILE.

Helps a handful of programs in Serious Sam 3 that use do-while loops.

instructions in affected programs: 16114 -> 16075 (-0.24%)

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

gallivm: Fix build for LLVM 3.2

Do not rely on LLVMMCJITMemoryManagerRef being available.
The c binding to the memory manager objects only appeared
on llvm-3.4.
The change is based on an initial patch of Brian Paul.

Reviewed-by: Brian Paul <brianp@vmware.com>
Tested-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Mathias Froehlich <Mathias.Froehlich@web.de>

freedreno: destroy transfer pool after blitter

Blitter can still have transfers hanging around which it frees in
util_blitter_destroy(). So let it clean up before we yank the
transfer_pool from under it.

Signed-off-by: Rob Clark <robclark@freedesktop.org>

freedreno/lowering: fix token calculation for lowering

Indirect registers consume an additional token. Try to clean up the
token calculation math a bit, and fix it at the same time.

Signed-off-by: Rob Clark <robclark@freedesktop.org>

i965/fs: Don't make a name for a vector splitting temporary

If the name is just going to get dropped, don't bother making it. If
the name is made, release it sooner (rather than later).

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Don't make a name for the function return variable

If the name is just going to get dropped, don't bother making it. If
the name is made, release it sooner (rather than later).

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Don't allocate a name for ir_var_temporary variables

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 74 40,578,719,715       67,762,208       62,263,404     5,498,804            0
After  (32-bit): 52 40,565,579,466       66,359,800       61,187,818     5,171,982            0

Before (64-bit): 74 37,129,541,061       95,195,160       87,369,671     7,825,489            0
After  (64-bit): 76 37,134,691,404       93,271,352       85,900,223     7,371,129            0

A real savings of 1.0MiB on 32-bit and 1.4MiB on 64-bit.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Use ir_var_temporary for compiler generated temporaries

These few places were using ir_var_auto for seemingly no reason. The
names were not added to the symbol table.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Add context-level controls for whether temporaries have real names

No change Valgrind massif results for a trimmed apitrace of dota2.

v2: Minor rebase on _mesa_init_constants changes.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Never put ir_var_temporary variables in the symbol table

Later patches will give every ir_var_temporary the same name in release
builds. Adding a bunch of variables named "compiler_temp" to the symbol
table can only cause problems.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Add the possibility for ir_variable to have a non-ralloced name

Specifically, ir_var_temporary variables constructed with a NULL name
will all have the name "compiler_temp" in static storage.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

glsl: Store ir_variable_data::_num_state_slots and ::binding in 16-bits each

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 44 40,577,049,140       68,118,608       62,441,063     5,677,545            0
After  (32-bit): 71 40,583,408,411       67,761,528       62,263,519     5,498,009            0

Before (64-bit): 63 37,122,829,194       95,153,008       87,333,600     7,819,408            0
After  (64-bit): 67 37,123,303,706       95,150,544       87,333,600     7,816,944            0

A real savings of 173KiB on 32-bit and no change on 64-bit.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Squish ir_variable::max_ifc_array_access and ::state_slots together

At least one of these pointers must be NULL, and we can determine which
will be NULL by looking at other fields.  Use this information to store
both pointers in the same location.

If anyone can think of a better name for the union than "u", I'm all
ears.

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 63 40,574,239,515       68,117,280       62,618,607     5,498,673            0
After  (32-bit): 44 40,577,049,140       68,118,608       62,441,063     5,677,545            0

Before (64-bit): 53 37,126,451,468       95,150,256       87,711,304     7,438,952            0
After  (64-bit): 63 37,122,829,194       95,153,008       87,333,600     7,819,408            0

A real savings of 173KiB on 32-bit and 368KiB on 64-bit.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Make ir_variable::num_state_slots and ir_variable::state_slots private

Also move num_state_slots inside ir_variable_data for better packing.

The payoff for this will come in a few more patches.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Make ir_variable::max_ifc_array_access private

The payoff for this will come in a few more patches.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Store ir_variable::depth_layout using 3 bits

warn_extension_index was moved to improve packing.

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 73 40,580,476,304       68,488,400       62,796,151     5,692,249            0
After  (32-bit): 73 40,575,751,558       68,116,528       62,618,607     5,497,921            0

Before (64-bit): 71 37,124,890,613       95,889,584       88,089,008     7,800,576            0
After  (64-bit): 62 37,123,578,526       95,150,784       87,711,304     7,439,480            0

A real savings of 173KiB on 32-bit and 368KiB on 64-bit.

v2: Use the enum name with the bit-field and remove the extra casts.
Suggested by Ken.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]
Reviewed-by: Tapani Pälli <tapani.palli@intel.com> [v1]

glsl: Replace ir_variable::warn_extension pointer with an 8-bit index

Also move the new warn_extension_index into ir_variable::data.  This
enables slightly better packing.

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 82 40,580,040,531       68,488,992       62,973,695     5,515,297            0
After  (32-bit): 73 40,580,476,304       68,488,400       62,796,151     5,692,249            0

Before (64-bit): 65 37,124,013,542       95,892,768       88,466,712     7,426,056            0
After  (64-bit): 71 37,124,890,613       95,889,584       88,089,008     7,800,576            0

A real savings of 173KiB on 32-bit and 368KiB on 64-bit.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Use accessors for ir_variable::warn_extension

The payoff for this will come in the next patch.

No change Valgrind massif results for a trimmed apitrace of dota2.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>

glsl: Eliminate unused built-in variables after compilation

After compilation (and before linking) we can eliminate quite a few
built-in variables.  Basically, any uniform or constant (e.g.,
gl_MaxVertexTextureImageUnits) that isn't used (with one exception) can
be eliminated.  System values, vertex shader inputs (with one
exception), and fragment shader outputs that are not used and not
re-declared in the shader text can also be removed.

gl_ModelViewProjectMatrix and gl_Vertex are used by the built-in
function ftransform.  There are some complications with eliminating
these variables (see the comment in the patch), so they are not
eliminated.

Valgrind massif results for a trimmed apitrace of dota2:

                  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
Before (32-bit): 46 40,661,487,174       75,116,800       68,854,065     6,262,735            0
After  (32-bit): 50 40,564,927,443       69,185,408       63,683,871     5,501,537            0

Before (64-bit): 64 37,200,329,700      104,872,672       96,514,546     8,358,126            0
After  (64-bit): 59 36,822,048,449       96,526,888       89,113,000     7,413,888            0

A real savings of 4.9MiB on 32-bit and 7.0MiB on 64-bit.

v2: Don't remove any built-in with Transpose in the name.

v3: Fix comment typo noticed by Anuj.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Suggested-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Anuj Phogat <anuj.phogat@gmail.com>
Cc: Eric Anholt <eric@anholt.net>

glsl: Validate that built-in uniforms have backing state

All built-in uniforms are supposed to be backed by some GL state. The
state_slots field describes this backing state.

This helped me track down a bug in a later patch.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Anuj Phogat <anuj.phogat@gmail.com>

vc4: Don't forget to store stencil along with depth when storing either.

Otherwise, we'd replace the stencil in our packed depth/stencil with 0s.
Fixes about 50 piglit tests.

llvmpipe: Reuse llvmpipes LLVMContext in the draw context.

Reuse the LLVMContext already allocated in llvmpipe_context
for draw_llvm if ppossible. This should decrease the memory
footprint of an llvmpipe context.

v2: Fix compile with llvm disabled.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Signed-off-by: Mathias Froehlich <Mathias.Froehlich@web.de>

llvmpipe: Make a llvmpipe OpenGL context thread safe.

This fixes the remaining problem with the recently introduced
global jit memory manager. This change again uses a memory manager
that is local to gallivm_state. This implementation still frees
the majority of the memory immediately after compilation.
Only the generated code is deferred until this code is no longer used.

This change and the previous one using private LLVMContext instances
I can now safely run several independent OpenGL contexts driven
by llvmpipe from different threads.

v3: Rebase on llvm-3.6 compile fixes.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Signed-off-by: Mathias Froehlich <Mathias.Froehlich@web.de>

llvmpipe: Use two LLVMContexts per OpenGL context instead of a global one.

This is one step to make llvmpipe thread safe as mandated by the OpenGL
standard. Using the global LLVMContext is obviously a problem for
that kind of use pattern. The patch introduces two LLVMContext
instances that are private to an OpenGL context and used for all
compiles. One is put into struct draw_llvm and the other
one into struct llvmpipe_context.

Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Signed-off-by: Mathias Froehlich <Mathias.Froehlich@web.de>

i965/brw_reg: Make the accumulator register take an explicit width.

The big pile of patches I just pushed regresses about 25 piglit tests on
SNB. This fixes the regressions.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>

llvmpipe: move lp_jit_screen_init() call after allocation of screen object

The screen argument isn't actually used by lp_jit_screen_init() at this
time, but let's move the call so that we pass a valid pointer.

v2: don't leak screen if lp_jit_screen_init() fails.

Reviewed-by: Roland Scheidegger <sroland@vmware.com>

tgsi: fix Semantic.Name assignment in tgsi_transform_input_decl()

Assign the sem_name parameter, not TGSI_SEMANTIC_GENERIC.
Fixes polygon stipple regression.

Reviewed-by: Charmaine Lee <charmainel@vmware.com>

util: simplify PIPE_TEXTURE_CUBE case in util_max_layer()

For cube resources, the array_size value should be 6. So handle
that case as we do for array texture resources. But assert that
array_size==6 just to be safe.

Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>

softpipe: don't special case PIPE_TEXTURE_CUBE in softpipe_resource_layout()

As with the previous patch for llvmpipe.

Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>

llvmpipe: remove special case for PIPE_TEXTURE_CUBE in llvmpipe_texture_layout()

layers (aka array_size) should be 6 for cube textures so we don't need
to special-case it. But add an assertion just to be safe.

Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>

gallium: add doc note about cube textures and can_create_resource()

Just to be clear, and echo the description for resource_create().

Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>

st/mesa: remove unneded PIPE_TEXTURE_CUBE check in st_texture_create()

Earlier in the function we assert layers==6 for PIPE_TEXTURE_CUBE so
there's no reason to special-case the pt.array_size = layers assignment.

Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>

mesa: Drop the always-software-primitive-restart paths.

The core sw primitive restart code is still around, because i965 uses it
in some cases, but there are no drivers that want it on all the time.

Reviewed-by: Rob Clark <robdclark@gmail.com>

gallium: Drop software-only primitive restart support.

The drivers not flagging primitive restart support are r300 swtcl, svga,
nv30, and vc4.

The point of primitive restart is to slightly reduce draw call overhead
for apps by batching multiple draws. If we do an extra pass to read the
index buffer and split back into multiple draws, we've entirely missed the
point. This is particularly bad for drivers that otherwise have hardware
IB reads, where the readback is probably uncached.

Reviewed-by: Rob Clark <robdclark@gmail.com>

i965/fs: Properly calculate the number of instructions in calculate_register_pressure

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the GRF for FB writes on gen >= 7

   On gen 7, the MRF was removed and we gained the ability to do send
   instructions directly from the GRF.  This commit enables that
   functinoality for FB writes.

   v2: Make handling of components more sane.

i965/fs: Force a high register for the final FB write

   v2: Renamed the array for the range mappings and added a comment

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Handle COMPR4 in LOAD_PAYLOAD

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Constant propagate into LOAD_PAYLOAD

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Add split_virtual_grfs and compute_to_mrf after lower_load_payload

If we are going to use LOAD_PAYLOAD operations to fill MRF registers, then
we will need this.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Add a an optional source to the FS_OPCODE_FB_WRITE instruction

Previously, we were use the base_mrf parameter of fs_inst to store the MRF
location. In preparation for doing FB writes from the GRF, we now also
allow you to set inst->base_mrf to -1 and provide a source register.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the GRF for UNTYPED_SURFACE_READ instructions

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the GRF for UNTYPED_ATOMIC instructions

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Add a function for getting a component of a 8 or 16-wide register

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the instruction execution size directly for texture generation

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use exec_size instead of force_uncompressed in dump_instruction

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use instruction execution sizes instead of heuristics

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use instruction execution sizes to set compression state

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Remove unneeded uses of force_uncompressed

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Derive force_uncompressed from instruction exec_size

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Make fs_reg::effective_width take fs_inst* instead of fs_visitor*

   Now that we have execution sizes, we can use that instead of the
   dispatch width.  This way it also works for 8-wide instructions in
   SIMD16.

i965/fs: Make effective_width a variable instead of a function

i965/fs: Preserve effective width in constant propagation

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Better guess the width of LOAD_PAYLOAD

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Add an exec_size field to fs_inst

   This will, eventually, allow us to manage execution sizes of
   instructions in a much more natural way from the fs_visitor level.

i965/fs: Explicitly set instruction execute size a couple of places

i965/blorp: Explicitly set instruction execute sizes

   Since blorp is all 16-wide and nothing isn't, in general, very careful
   about register width, we'll just set it all explicitly.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Determine partial writes based on the destination width

Now that we track both halves of a 16-wide vgrf, we no longer need to worry
about force_sechalf or force_uncompressed. The only real issue is if the
destination is too small.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Fix a bug in register coalesce

This commit fixes a bug in register coalesce that happens when one register
is moved to another the proper number of times but the channels are
re-arranged. When this happens, the previous code would happily coalesce
the registers regardless of the fact that the channel mappins were wrong.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Rework GEN5 texturing code to use fs_reg and offset()

Now that offset() can properly handle MRF registers, we can use an MRF
fs_reg and let offset() handle incrementing it correctly for different
dispatch widths. While this doesn't have any noticeable effect currently,
it does ensure that the destination register is 16-wide which will be
necessary later when we start detecting execution sizes based on source and
destination registers.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs_reg: Allocate double the number of vgrfs in SIMD16 mode

This is actually the squash of a bunch of different changes.  Individual
commit titles follow:

i965/fs: Always 2-align registers SIMD16 for gen <= 5

i965/fs: Use the register width when applying offsets

   This reworks both byte_offset() and offset() to be more intelligent.
   The byte_offset() function now supports offsets bigger than 32. The
   offset() function uses the byte_offset() function together with the
   register width and the type size to offset the register by the correct
   amount.

i965/fs: Change regs_read to be in hardware registers

i965/fs: Change regs_written to be actual hardware registers

i965/fs: Properly handle register widths in LOAD_PAYLOAD

   The LOAD_PAYLOAD instruction is a bit special because it collects a
   bunch of registers (with possibly different widths) into a single
   payload block.  Once the payload is constructed, it's treated as a
   single block of data and most of the information such as register widths
   doesn't matter anymore.  In particular, the offset of any particular
   source register is the accumulation of the sizes of the previous source
   registers.

i965/fs: Properly set writemasks in LOAD_PAYLOAD

i965/fs: Handle register widths in demote_pull_constants

i965/fs: Get rid of implicit register doubling in the allocator

i965/fs: Reserve enough registers for PLN instructions

i965/fs: Make sources and destinations interfere in 16-wide

i965/fs: Properly handle register widths in CSE

i965/fs: Properly handle register widths in register_coalesce

i965/fs: Properly handle widths in copy propagation

i965/fs: Properly handle register widths in VARYING_PULL_CONSTANT_LOAD

i965/fs: Properly handle register widths and odd register sizes in spilling

i965/fs: Don't waste a register on texture lookups for gen >= 7

   Previously, we were waisting a register in SIMD16 mode because we could
   only allocate registers in pairs.  Now that we can allocate and address
   odd-sized registers, let's get rid of this special-case.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Handle printing of registers better.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965: Explicitly set widths on gen5 math instruction destinations.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Make half() divide the register width by 2 and use it more

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Add a concept of a width to fs_reg

Every register in i965 assembly implicitly has a concept of a "width".
Usually, this is derived from the execution size of the instruction.
However, when writing a compiler it turns out that it is frequently a
useful to have the width explicitly in the register and derive the
execution size of the instruction from the widths of the registers used in
it.

This commit adds a width field to fs_reg along with an effective_width()
helper function.  The effective_width() function tells you how wide the
register effectively is when used in an instruction.  For example, uniform
values have width 1 since the data is not actually repeated, but when used
in an instruction they take on the width of the instruction.  However, for
some instructions (LOAD_PAYLOAD being the notable exception), the width is
not the same.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: A little harmless refactoring of register_coalesce

Just pass the visitor into is_copy_payload() and is_coalesce_candidate()
instead of a register size and the virtual_grf_sizes array. Among other
things, this makes the code more obvious because you don't have to figure
out where src_size came from.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/brw_reg: Add a firsthalf function and use it in the generator

Right now, this function is a no-op but it indicates that we intend to only
use the first half of the 16-wide register.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Copy propagate partial reads.

This commit reworks copy propagation a bit to support propagating the
copying of partial registers.  This comes up every time we have pull
constants because we do a pull constant read immediately followed by a move
to splat the one component of the out to 8 or 16-wide.  This allows us to
eliminate the copy and simply use the one component of the register.

Shader DB results:

total instructions in shared programs: 5044937 -> 5044428 (-0.01%)
instructions in affected programs:     66112 -> 65603 (-0.77%)
GAINED:                                0
LOST:                                  0

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Refactor fs_inst::is_send_from_grf()

A switch statement is much easier to read/edit than a big giant or
statement.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Clean up emit_fb_writes

This splits emit_fb_writes into two functions: emit_fb_writes and
emit_single_fb_write. This reduces the amount of duplicated code in
emit_fb_writes and makes the register number fiddling less arcane.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Print BAD_FILE registers in dump_instruction

Sometimes these show up in LOAD_PAYLOAD instructions and it's nice to be
able to see them.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Make compact_virtual_grfs an optimization pass

Previously we disabled compact_virtual_grfs when dumping optimizations.
The idea here was to make it easier to diff the dumped shader because you
didn't have a sudden renaming. However, sometimes a bug is affected by
compact_virtual_grfs and, when this happens, you want to keep dumping
instructions with compact_virtual_grfs enabled. By turning it into an
optimization pass and dumping it along with the others, we retain the
ability to diff because you can just diff against the compact_virtual_grf
output.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i964/fs: Make immediate fs_reg constructors explicit

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Make null_reg_* const members of fs_visitor instead of globals

We also set the register width equal to the dispatch width. Right now,
this is effectively a no-op since we don't do anything with it. However,
it will be important once we add an actual width field to fs_reg.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the var_from_vgrf helper function instead of doing it manually

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Fix a bug with dead_code_eliminate on large writes

Previously, if an instruction wrote to more than one register, we
implicitly assumed that it filled the entire register. We never hit this
before because the only time we did multi-register writes was things like
texturing which always wrote to all of the registers. However, with the
upcoming ability to do 16-wide instructions in SIMD8 and things of that
nature, we can have multi-register writes at offsets and we'll hit this.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use the UW type for the destination of VARYING_PULL_CONSTANT_LOAD instructions

Using a floating-point type doesn't usually cause hangs on my HSW, but the
simulator complains about it quite a bit.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Use offset a lot more places

We have this wonderful offset() function for advancing registers, but we're
not using it. Using offset() allows us to do some sanity checking and
avoid manually touching fs_reg::reg_offset. In a few commits, we will make
offset do even more nifty things for us.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: fix a comment in compact_virtual_grfs

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Rewrite fs_visitor::split_virtual_grfs

The original vgrf splitting code was written with the assumption that vgrfs
came in two types: those that can be split into single registers and those
that can't be split at all It was very conservative and bailed as soon as
more than one element of a register was read or written. This won't work
once we start allowing a regular MOV or ADD operation to operate on
multiple registers. This rewrite allows for the case where a vgrf of size
5 may appropriately be split in to one register of size 1 and two registers
of size 2.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Acked-by: Matt Turner <mattst88@gmail.com>

i965/fs_live_variables: Use var_from_vgrf insead of repeating the calculation

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>

i965/fs: Manually generate the meta fast-clear shader

Previously, we were generating the fast-clear shader from GLSL.  The
problem is that fast clears require that we use a replicated write rather
than a regular write instruction.  In order to get this we had a
complicated and somewhat fragile optimization pass that looked for places
where we can use a replicated write and used it.  Since replicated writes
have a lot of restrictions, we only ever use them for fast-clear
operations.

This commit replaces the optimization pass with a function that just
generates the shader we want.  This is a) less code, b) less fragile than
the optimization pass, and c) generates a more efficient shader.

Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
Acked-by: Kenneth Graunke <kenneth@whitecape.org>

radeonsi: Pass the slice size to si_dma_copy_buffer

Otherwise some parts of tiled slices can be missed.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: Catch more cases that can't be handled by si_dma_copy_buffer/tile

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: Fix si_dma_copy(_tile) for compressed formats

Fixes GPUVM faults when running the piglit test "getteximage-formats
init-by-rendering" with R600_DEBUG=forcedma on SI.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: Fix tiling mode index for stencil resources

We are currently only dealing with depth-only or stencil-only resources
here, not with resources having both depth and stencil[0]. In both cases,
the tiling mode index is in the tile_mode field, not in the
stencil_tile_mode field.

[0] Add an assertion for that.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

ilo: fix format of edge flag pointer

The VE format of edge flag pointers was changed in
780ce576bb1781f027797039693b98253ee4813e.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>

ilo: add a pass to finalize ilo_ve_state

Add finalize_vertex_elements() to finalize ilo_ve_state. This fixes a
potential issue with URB entry allocation for VS and move the complexity of
gen6_3DSTATE_VERTEX_ELEMENTS() to the new function.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>

ilo: precalculate aligned depth buffer size

To replace the hacky zs_align_surface().

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>

ilo: use dynamic bo for rectlist vertices

The size is always 24 bytes. We can upload them to the dynamic buffer.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>

st/xa: Fix regression in xa_yuv_planar_blit()

Commit "st/xa: scissor to help tilers" broke xa_yuv_planar_blit() and vmwgfx
textured video. Fix this by implementing scissors also in the yuv draw path.

Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Sinclair Yeh <syeh@vmware.com>
Cc: Rob Clark <robclark@freedesktop.org>
Cc: "10.2 10.3" <mesa-stable@lists.freedesktop.org>

i965: Delete intel_chipset.h.

Unused; it was replaced by include/pci_ids/i965_pci_ids.h long ago.

Acked-by: Matt Turner <mattst88@gmail.com>

driconf: Correct and update Catalan translation

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Eric Anholt <eric@anholt.net>

driconf: Update Spanish translation

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Eric Anholt <eric@anholt.net>

driconf: Synchronize po files

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Eric Anholt <eric@anholt.net>

vc4: Don't try to do stores to buffers that aren't bound.

The code was kind of mixed up what buffers were getting stored in the case
that a resolve bit was unset (which are set based on the GL state at draw
time) and the buffer wasn't actually bound. In particular, depth-only
rendering would store the color buffer contents, which happen to be
pointing at the depth buffer.

Thanks to clearing out the resolve bits for things we really can't
resolve, now I can drop the safety checks for buffer presence around the
actual stores.

Fixes 42 piglit tests.

vc4: Shove some depth comparison bits down to where they're used.

i965: Use BRW_MATH_DATA_SCALAR when source regioning is scalar.

Notice the mistaken (but harmless) argument swapping in brw_math_invert().

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965/compaction: Move variable declarations to their uses.

Tested-by: Mark Janes <mark.a.janes@intel.com>

i965/compaction: Simplify jump target code.

My attempts to clarify the code with _compacted/_uncompacted prefixed
variables apparently failed. Hopefully this is clearer.

In any case, the previous code wasn't clear enough to gcc to let it
optimize division by a power of two into a shift. No problems now.

Also, the previous code (in the ADD case) didn't work on 32-bit x86, due
to complicated set of interactions best summed up as unsigned division
and compiler optimizations.

Tested-by: Mark Janes <mark.a.janes@intel.com>

freedreno/a3xx: re-emit shaders on variant change

We need to keep track if a state change other than frag/vert shader
state will trigger us to need a different shader variant, and if
necessary mark the appropriate shader state as dirty. Otherwise we will
forget to re-emit the shader state.

Signed-off-by: Rob Clark <robclark@freedesktop.org>

freedreno/ir3: add some cmdline args

Signed-off-by: Rob Clark <robclark@freedesktop.org>

freedreno/a3xx: add support to emulate GL_CLAMP

Signed-off-by: Rob Clark <robclark@freedesktop.org>