git.libre-soc.org Git - mesa.git/log

nir/algebraic: Simplify fsqrt domain guard

All Gen7+ platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17228376 -> 17225365 (-0.02%)
instructions in affected programs: 280732 -> 277721 (-1.07%)
helped: 1072
HURT: 0
helped stats (abs) min: 1 max: 12 x̄: 2.81 x̃: 2
helped stats (rel) min: 0.16% max: 5.10% x̄: 1.43% x̃: 1.07%
95% mean confidence interval for instructions value: -2.92 -2.70
95% mean confidence interval for instructions %-change: -1.48% -1.37%
Instructions are helped.

total cycles in shared programs: 360935690 -> 360842788 (-0.03%)
cycles in affected programs: 7838017 -> 7745115 (-1.19%)
helped: 1569
HURT: 69
helped stats (abs) min: 1 max: 1198 x̄: 63.53 x̃: 20
helped stats (rel) min: 0.06% max: 26.17% x̄: 3.44% x̃: 2.12%
HURT stats (abs) min: 1 max: 2820 x̄: 98.22 x̃: 47
HURT stats (rel) min: 0.05% max: 16.67% x̄: 3.50% x̃: 2.31%
95% mean confidence interval for cycles value: -63.55 -49.89
95% mean confidence interval for cycles %-change: -3.33% -2.96%
Cycles are helped.

No changes on any other platform.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

nir/search: Don't compare 8-bit or 1-bit constants with floats

Without this, adding an algebraic rule like

(('bcsel', ('flt', a, 0.0), 0.0, ...), ...),

will cause assertion failures inside nir_src_comp_as_float in
GTF-GL46.gtf21.GL.lessThan.lessThan_vec3_frag (and related tests) from
the OpenGL CTS and shaders/closed/steam/witcher-2/511.shader_test from
shader-db.

All of these cases have some code that ends up like

('bcsel', ('flt', a, 0.0), 'b@1', ...)

When the 'b@1' is tested, nir_src_comp_as_float fails because there's
no such thing as a 1-bit float.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

nir/algebraic: Recognize open-coded fsat with modifiers

This change also enables a later change (nir/algebraic: Replace
1-fsat(a) with fsat(1-a)) to affect more shaders.

Almost all of the affected shaders are in Bioshock Infinite, and all of
those shaders all require GLSL 4.10.

All Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17228584 -> 17228376 (<.01%)
instructions in affected programs: 31438 -> 31230 (-0.66%)
helped: 105
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 1.98 x̃: 1
helped stats (rel) min: 0.08% max: 1.53% x̄: 0.73% x̃: 0.70%
95% mean confidence interval for instructions value: -2.20 -1.76
95% mean confidence interval for instructions %-change: -0.80% -0.67%
Instructions are helped.

total cycles in shared programs: 360936431 -> 360935690 (<.01%)
cycles in affected programs: 420100 -> 419359 (-0.18%)
helped: 71
HURT: 21
helped stats (abs) min: 1 max: 160 x̄: 19.28 x̃: 10
helped stats (rel) min: <.01% max: 9.78% x̄: 0.95% x̃: 0.48%
HURT stats (abs) min: 1 max: 198 x̄: 29.90 x̃: 10
HURT stats (rel) min: 0.05% max: 8.36% x̄: 1.24% x̃: 0.90%
95% mean confidence interval for cycles value: -16.77 0.66
95% mean confidence interval for cycles %-change: -0.85% -0.06%
Inconclusive result (value mean confidence interval includes 0).

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

nir/algebraic: Push unary operations into source operands of fsat source

Pushing a unary operation, like fneg, into the operation that generates
its operand allows the fsat to be applied to the inner instruction
instead of on a separate instruction that performs the unary operation.
This changes

    fmul ssa_100, ssa_99, ssa_98
    fmov.sat ssa_101, -ssa_100

into

    fmul.sat ssa_100, -ssa_99, ssa_98

Ice Lake, Skylake, and Broadwell had similar results. (Ice Lake shown)
total instructions in shared programs: 17228658 -> 17228584 (<.01%)
instructions in affected programs: 3163 -> 3089 (-2.34%)
helped: 49
HURT: 0
helped stats (abs) min: 1 max: 2 x̄: 1.51 x̃: 2
helped stats (rel) min: 0.58% max: 9.09% x̄: 3.69% x̃: 3.51%
95% mean confidence interval for instructions value: -1.66 -1.37
95% mean confidence interval for instructions %-change: -4.37% -3.00%
Instructions are helped.

total cycles in shared programs: 360937144 -> 360936431 (<.01%)
cycles in affected programs: 24029 -> 23316 (-2.97%)
helped: 47
HURT: 2
helped stats (abs) min: 4 max: 18 x̄: 15.34 x̃: 16
helped stats (rel) min: 0.69% max: 6.18% x̄: 3.78% x̃: 4.27%
HURT stats (abs)   min: 4 max: 4 x̄: 4.00 x̃: 4
HURT stats (rel)   min: 0.34% max: 0.67% x̄: 0.50% x̃: 0.50%
95% mean confidence interval for cycles value: -16.05 -13.05
95% mean confidence interval for cycles %-change: -4.07% -3.15%
Cycles are helped.

All Gen7 and earlier platforms had similar results. (Haswell shown)
total instructions in shared programs: 13536059 -> 13535884 (<.01%)
instructions in affected programs: 8797 -> 8622 (-1.99%)
helped: 150
HURT: 0
helped stats (abs) min: 1 max: 2 x̄: 1.17 x̃: 1
helped stats (rel) min: 0.40% max: 11.11% x̄: 3.51% x̃: 1.96%
95% mean confidence interval for instructions value: -1.23 -1.11
95% mean confidence interval for instructions %-change: -3.97% -3.05%
Instructions are helped.

total cycles in shared programs: 357696119 -> 357694193 (<.01%)
cycles in affected programs: 50216 -> 48290 (-3.84%)
helped: 109
HURT: 14
helped stats (abs) min: 2 max: 92 x̄: 18.97 x̃: 16
helped stats (rel) min: 0.26% max: 19.09% x̄: 7.37% x̃: 5.37%
HURT stats (abs)   min: 2 max: 26 x̄: 10.14 x̃: 5
HURT stats (rel)   min: 0.18% max: 4.73% x̄: 1.84% x̃: 0.92%
95% mean confidence interval for cycles value: -19.27 -12.05
95% mean confidence interval for cycles %-change: -7.34% -5.31%
Cycles are helped.

Reviewed-by: Matt Turner <mattst88@gmail.com>

nir/algebraic: Recognize open-coded flrp(a, b, fsat(c))

All Gen6+ GPUs had similar results. (Skylake shown)
total instructions in shared programs: 15336712 -> 15336622 (<.01%)
instructions in affected programs: 3952 -> 3862 (-2.28%)
helped: 24
HURT: 0
helped stats (abs) min: 3 max: 5 x̄: 3.75 x̃: 4
helped stats (rel) min: 1.75% max: 2.70% x̄: 2.34% x̃: 2.46%
95% mean confidence interval for instructions value: -4.06 -3.44
95% mean confidence interval for instructions %-change: -2.47% -2.22%
Instructions are helped.

total cycles in shared programs: 355722052 -> 355721235 (<.01%)
cycles in affected programs: 27326 -> 26509 (-2.99%)
helped: 20
HURT: 4
helped stats (abs) min: 1 max: 227 x̄: 44.75 x̃: 14
helped stats (rel) min: 0.12% max: 22.95% x̄: 3.83% x̃: 1.23%
HURT stats (abs) min: 2 max: 64 x̄: 19.50 x̃: 6
HURT stats (rel) min: 0.21% max: 3.63% x̄: 1.24% x̃: 0.55%
95% mean confidence interval for cycles value: -61.61 -6.47
95% mean confidence interval for cycles %-change: -5.59% -0.39%
Cycles are helped.

No changes on Ice Lake, Iron Lake, or GM45.

Reviewed-by: Matt Turner <mattst88@gmail.com>

intel/fs: Allow cmod propagation to instructions with saturate modifier

v2: Add unit tests.  Suggested by Matt.

All Intel GPUs had similar results. (Ice Lake shown)
total instructions in shared programs: 17229441 -> 17228658 (<.01%)
instructions in affected programs: 159574 -> 158791 (-0.49%)
helped: 489
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 1.60 x̃: 1
helped stats (rel) min: 0.07% max: 2.70% x̄: 0.61% x̃: 0.59%
95% mean confidence interval for instructions value: -1.72 -1.48
95% mean confidence interval for instructions %-change: -0.64% -0.58%
Instructions are helped.

total cycles in shared programs: 360944149 -> 360937144 (<.01%)
cycles in affected programs: 1072195 -> 1065190 (-0.65%)
helped: 254
HURT: 27
helped stats (abs) min: 2 max: 234 x̄: 30.51 x̃: 9
helped stats (rel) min: 0.04% max: 8.99% x̄: 0.75% x̃: 0.24%
HURT stats (abs)   min: 2 max: 83 x̄: 27.56 x̃: 24
HURT stats (rel)   min: 0.09% max: 3.79% x̄: 1.28% x̃: 1.16%
95% mean confidence interval for cycles value: -30.11 -19.75
95% mean confidence interval for cycles %-change: -0.70% -0.41%
Cycles are helped.

Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]

nir/algebraic: Add missing ffma(-1, a, b) pattern

All Gen7+ platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17229439 -> 17229377 (<.01%)
instructions in affected programs: 9859 -> 9797 (-0.63%)
helped: 41
HURT: 0
helped stats (abs) min: 1 max: 6 x̄: 1.51 x̃: 1
helped stats (rel) min: 0.08% max: 11.54% x̄: 1.65% x̃: 0.67%
95% mean confidence interval for instructions value: -1.88 -1.14
95% mean confidence interval for instructions %-change: -2.48% -0.81%
Instructions are helped.

total cycles in shared programs: 360944145 -> 360942989 (<.01%)
cycles in affected programs: 178167 -> 177011 (-0.65%)
helped: 36
HURT: 19
helped stats (abs) min: 1 max: 222 x̄: 38.03 x̃: 5
helped stats (rel) min: 0.01% max: 31.01% x̄: 4.01% x̃: 0.45%
HURT stats (abs) min: 1 max: 34 x̄: 11.21 x̃: 6
HURT stats (rel) min: 0.03% max: 2.74% x̄: 0.72% x̃: 0.50%
95% mean confidence interval for cycles value: -36.01 -6.02
95% mean confidence interval for cycles %-change: -4.18% -0.57%
Cycles are helped.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

nir: Mark ffma as 2src_commutative

This doesn't make any real difference now, but future work (not in this
series) will add a LOT of ffma patterns. Having to duplicate all of
them for ffma(a, b, c) and ffma(b, a, c) is just terrible.

No shader-db changes on any Intel platform.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

nir: Add support for 2src_commutative ops that have 3 sources

v2: Instead of handling 3 sources as a special case, generalize with
loops to N sources. Suggested by Jason.

v3: Further generalize by only checking that number of sources is >= 2.
Suggested by Jason.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

nir: Rename commutative to 2src_commutative

The meaning of the new name is that the first two sources are
commutative.  Since this is only currently applied to two-source
operations, there is no change.

A future change will mark ffma as 2src_commutative.

It is also possible that future work will add 3src_commutative for
opcodes like fmin3.

v2: s/commutative_2src/2src_commutative/g.  I had originally considered
this, but I discarded it because I did't want to deal with identifiers
that (should) start with 2.  Jason suggested it in review, so we decided
that _2src_commutative would be used in nir_opcodes.py.  Also add some
comments documenting what 2src_commutative means.  Also suggested by
Jason.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

intel/fs/ra: Spill without destroying the interference graph

Instead of re-building the interference graph every time we spill, we
modify it in place so we can avoid recalculating liveness and the whole
O(n^2) interference graph building process.  We make a simplifying
assumption in order to do so which is that all spill/fill temporary
registers live for the entire duration of the instruction around which
we're spilling.  This isn't quite true because a spill into the source
of an instruction doesn't need to interfere with its destination, for
instance.  Not re-calculating liveness also means that we aren't
adjusting spill costs based on the new liveness.  The combination of
these things results in a bit of churn in spilling.  It takes a large
cut out of the run-time of shader-db on my laptop.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311224 -> 15311360 (<.01%)
    instructions in affected programs: 77027 -> 77163 (0.18%)
    helped: 11
    HURT: 18

    total cycles in shared programs: 355544739 -> 355830749 (0.08%)
    cycles in affected programs: 203273745 -> 203559755 (0.14%)
    helped: 234
    HURT: 190

    total spills in shared programs: 12049 -> 12042 (-0.06%)
    spills in affected programs: 2465 -> 2458 (-0.28%)
    helped: 9
    HURT: 16

    total fills in shared programs: 25112 -> 25165 (0.21%)
    fills in affected programs: 6819 -> 6872 (0.78%)
    helped: 11
    HURT: 16

    Total CPU time (seconds): 2469.68 -> 2360.22 (-4.43%)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Put the VGRFs at the end of the nodes

This is slightly less convenient in some places but it will make it much
easier when we want to start adding nodes dynamically.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Re-arrange interference setup

The old code was arranged by the type of interference being added.  It
would set up payload registers and then add payload interference for all
VGRFs.  It would set up MRFs and add MRF interference for all VGRFs.
This commit re-arranges things to be organized differently.  It first
creates and sets up all RA nodes and then groups interference into two
new categories:  live range and instruction interference.  Once all the
RA nodes have been set up, it walks the list of VGRFs and sets up their
live range interference and then walks the list of instructions and sets
up instruction interference.  This new arrangement will be advantageous
for a future patch but, at the moment, it cuts 2% off the run-time of
shader-db on my laptop.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311224 -> 15311224 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 355544739 -> 355544739 (0.00%)
    cycles in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Do the spill loop inside RA

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Only add MRF hack interference if we're spilling

The only use of the MRF hack these days is for spilling and there we
don't need the precise MRF usage information.  If we're spilling then we
know pretty well how many MRFs are going to be used.  It is possible if
the only things that are spilled have fewer SIMD channels than the
dispatch width of the shader that this may be more MRFs than needed.
That's a risk we're willing to takd.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311100 -> 15311224 (<.01%)
    instructions in affected programs: 16664 -> 16788 (0.74%)
    helped: 1
    HURT: 5

    total cycles in shared programs: 355543197 -> 355544739 (<.01%)
    cycles in affected programs: 731864 -> 733406 (0.21%)
    helped: 3
    HURT: 6

The hurt shaders are all SIMD32 compute shaders where we reserve enough
space for a 32-wide spill/fill but don't need it.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Pull the guts of RA into its own class

This accomplishes two things. First, it makes interfaces which are
really private to RA private to RA. Second, it gives us a place to
store some common stuff as we go through the algorithm.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Move assign_regs further down in the file

It's the main function from which all the other functions are called.
It belongs at the bottom.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Split building the interference graph into a helper

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Initialize grf_used with first_non_payload_grf

There's no reason why we need to use the calculated payload_node_count
value which is just first_non_payload_grf aligned up. The grf_used
value will be aligned up to 16 anyway (which is a much bigger alignment)
before being handed off to hardware.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Stop adding RA interference to too many SENDS nodes

We only have one node per VGRF so this was adding way too much
interference.  No idea how we didn't catch this before.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311100 -> 15311100 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 355468050 -> 355543197 (0.02%)
    cycles in affected programs: 2472492 -> 2547639 (3.04%)
    helped: 17
    HURT: 20

Fixes: 014edff0d20d "intel/fs: Add interference between SENDS sources"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

util/ra: Assert nodes are in-bounds in add_node_interference

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel/fs/ra: Only add dest interference to sources that exist

Fixes: 83dedb6354d "i965: Add src/dst interference for certain"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

util/ra: Don't destroy the graph in ra_allocate()

We want to be able to call ra_allocate() and, when it fails, mutate the
graph and try again rather than re-building the graph from scratch.
This commit moves all the scratch bits except the final register
allocation (which is really an out value not scratch) into sub-structs
named "tmp" to make it clear which things are scratch. It also adds
bits to the ra_select() initialization loop to initialize things (since
we can't trust rzalloc anymore) and copy q_test and forced_reg over.

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Add a helper for resetting a node's interference

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Add helpers for adding nodes to an interference graph

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ralloc: Add helpers for growing zero-initialized memory

Unfortunately, we can't quite follow the standard C conventions for
these because ralloc doesn't know the sizes of pointers.

Reviewed-by: Eric Anholt <eric@anholt.net>

intel/fs: Stop doing extra RA calls

In the last phase of the schedule and RA loop, the RA call is redundant
if we spill.  Immediately afterwards, we're going to see that we
couldn't allocate without spilling and call back into RA and tell it to
go ahead and spill.  We've known about it for a while but we've always
brushed over it on the theory that, if you're going to spill, you'll be
calling RA a bunch anyway and what does one extra RA hurt?  As it turns
out, it hurts more than you'd expect.  Because the RA interference graph
gets sparser with each spill and the RA algorithm is more efficient on
sparser graphs, the RA call that we're duplicating is actually the most
expensive call in the RA-and-spill loop.

There's another extra RA call we do that's a bit harder to see which
this also removes.  If we try to compile a shader that isn't the minimum
dispatch width and it fails to allocate without spilling we call fail()
to set an error but then go ahead and do the first spilling RA pass and
only after that's complete do we detect the fail and bail out.  By
making minimum dispatch widths part of the spill condition, we side-step
this problem.

Getting rid of these extra spills takes the compile time of a nasty
Aztec Ruins shader from about 28 seconds to about 26 seconds on my
laptop.  It also makes shader-db 1.5% faster

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311100 -> 15311100 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 355468050 -> 355468050 (0.00%)
    cycles in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    Total CPU time (seconds): 2524.31 -> 2486.63 (-1.49%)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

util/ra: Improve the performance of ra_simplify

The most expensive part of register allocation is the ra_simplify step
which is a fixed-point algorithm with a worst-case complexity of O(n^2)
which adds the registers to a stack which we then use later to do the
actual allocation.  This commit uses bit sets and changes the core loop
of ra_simplify to first walk 32-node chunks and then walk each chunk.
This lets us skip whole 32-node chunks in one go based on bit operations
and compute the minimum q value potentially 32x as fast.  Of course, the
algorithm still has the same fundamental O(n^2) run-time but the
constant is now much lower.

In the nasty Aztec Ruins compute shader, this shaves a full four seconds
off the 30s compile time for a release build of mesa.  In a debug build
(needed for accurate stack traces), perf says that ra_select takes 20%
of runtime before this patch and only 5-6% of runtime after this patch.
It also makes shader-db runs faster.

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15311100 -> 15311100 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total cycles in shared programs: 355468050 -> 355468050 (0.00%)
    cycles in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    Total CPU time (seconds): 2602.37 -> 2524.31 (-3.00%)

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Only update q_total if the reg is not assigned

We only use q_total if the reg is not assigned so there's no point in
updating it if the reg is not assigned. This has no known perf benefit
but it will reduce churn in a future commit.

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Only update best_optimistic_node if !progress

This shaves about half a second off the 30 second compile time of one of
the compute shaders in Aztec ruins.

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Make in_stack a bitset in the graph

Reviewed-by: Eric Anholt <eric@anholt.net>

util/ra: Get rid of tabs

Reviewed-by: Eric Anholt <eric@anholt.net>

virgl: clean up virgl_res_needs_flush

Add comments and some minor cleanups.

v2: document the function

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Alexandros Frantzis <alexandros.frantzis@collabora.com> (v1)
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>
Signed-off-by: Chia-I Wu <olvaffe@gmail.com>

virgl: comment on a sync issue in transfers

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>

virgl: PIPE_TRANSFER_READ does not imply flush

virgl_res_needs_flush should suffice.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>

virgl: do not skip readback because of explicit flush

Both apps and we (see virgl_buffer_transfer_flush_region) might
flush regions that are unmodified. We have to read back for those
flushes.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>

virgl: remove unused virgl_transfer_inline_write

It currently has no user and is probably incorrect (resource_wait is
required in some more cases). Remove it so that we can focus on
transfers first.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Reviewed-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>

iris/resource: Drop redundant checks for aux support

Drop some checks that are already done by ISL.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

iris/resource: Fall back to no aux if creation fails

No surface requires an auxiliary surface to operate correctly. Fall back
to an uncompressed surface if mesa fails to create and allocate an
auxiliary surface. This enables adding more restrictions to ISL without
having to update iris.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

i965/miptree: Refactor intel_miptree_supports_ccs_e()

Update and rename this function to format_supports_ccs_e() to better
match its behavior.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

i965/miptree: Drop intel_*_supports_hiz()

intel_tiling_supports_hiz() and intel_miptree_supports_hiz() duplicate
much the work done by isl_surf_get_hiz_surf(). Replace them with simple
expressions.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

isl: Add restrictions to isl_surf_get_hiz_surf()

Import some restrictions from intel_tiling_supports_hiz() and
intel_miptree_supports_hiz().

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

i965/miptree: Drop intel_*_supports_ccs()

intel_tiling_supports_ccs() and intel_miptree_supports_ccs() duplicate
much the work done by isl_surf_get_ccs_surf(). Drop them both and index
a boolean array to choose CCS_D in intel_miptree_choose_aux_usage().

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

isl: Add restriction and comments to isl_surf_get_ccs_surf()

Import some restrictions and comments from intel_miptree_supports_ccs().

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

i965/miptree: Drop intel_miptree_supports_mcs()

This function duplicates much the work done by isl_surf_get_mcs_surf().
Replace it with a simple expression.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

isl: Modify restrictions in isl_surf_get_mcs_surf()

Import some restrictions from intel_miptree_supports_mcs() and don't
assume that the caller knows which device generations are supported.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

i965/miptree: Fall back to no aux if creation fails

No surface requires an auxiliary surface to operate correctly. Fall back
to an uncompressed surface if mesa fails to create and allocate an
auxiliary surface. This enables adding more restrictions to ISL without
having to update i965.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

mesa: Set _NEW_VARYING_VP_INPUTS iff varying_vp_inputs are set.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

mesa: Avoid setting _NEW_VARYING_VP_INPUTS in non fixed function mode.

Instead of checking the API variant on entry of set_varying_vp_inputs
to check if we can ever be interrested in fixed function processing
or not, we can check if we are actually fixed function processing.
To check this we can use the immediately updated
gl_context::VertexProgram._VPMode value that tells us if we have a
user provided shader program or if we are in fixed function processing
either through an internal TNL shader of directly through hardware.
When doing so, we also need to recheck the varying_vp_inputs variable
at the time gl_context::VertexProgram._VPMode is set to VP_MODE_FF.
Put asserts at the consumers of gl_context::varying_vp_inputs to make
sure gl_context::VertexProgram._VPMode is set to VP_MODE_FF. By that
gl_context::varying_vp_inputs should be up to date then.

By not looking at the opengl api for this decision we should actually
catch more cases where we can avoid setting a state change flag, including
the ones where we cannot get into VP_MODE_FF by the choice of the api.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

mesa: Fix test for setting the _NEW_VARYING_VP_INPUTS flag.

The precondition stated in the comment is not true. The values mentioned are
only set from _mesa_update_state which in turn may not yet be called.
For now set the _NEW_VARYING_VP_INPUTS flag a bit more often, we will narrow
that down to a minimum again in a later patch.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

mesa: Make _mesa_set_varying_vp_inputs static in state.c.

Is no longer used outside that file.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

mesa: Fix old outdated variable name in a comment.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

mesa/vbo: Update Comment to what is actually happening.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Mathias Fröhlich <Mathias.Froehlich@web.de>

wayland/egl: Ensure correct buffer size when allocating

Whenever a buffer is allocated, e.g. by the first draw call or EGL call after a
buffer swap, make sure the size is up to date. Prior to this commit, we
failed to do so when querying the buffer age, or swapping buffers
without any prior EGL call or draw call.

Signed-off-by: Jonas Ådahl <jadahl@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>

egl: check if a window/pixmap is already used on surface creation

The spec says we can't create another surface if we already created a
surface with the given window or pixmap. Implement this check.

This behavior is exercised by piglit/egl-create-surface.

Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>

egl: store the native surface pointer in struct _egl_surface

Each platform stores this in a different place:
  - platform_drm uses dri2_surf->gbm_surf->base
  - platform_android uses dri2_surf->window
  - platform_wayland uses dri2_surf->wl_win
  - platform_x11 uses dri2_surf->drawable
  - platform_x11_dri3 uses dri3_surf->loader_drawable.drawable
  - haiku doesn't even store it!

We need access to the native surface since the specification asks us
to refuse creating a new surface if there's already an EGLSurface
associated with native_surface.

An alternative to this patch would be to create a new
API.GetNativeWindow callback that each platform would have to
implement. While that's something we can definitely do, I prefer
this approach.

Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>

radv: add support for VK_KHR_uniform_buffer_standard_layout

Nothing to do.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>

softpipe/buffer: load only as many components as the the buffer resource type provides

Otherwise we risk to read past the end of the buffer.

In addition, change the loop counters to unsigned to be consistent
with the types.

Fixes: afa8707ba93a7d226a76319acda2a8dd89524db7
softpipe: add SSBO/shader atomics support.

Signed-off-by: Gert Wollny <gert.wollny@collabora.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>

panfrost: ci: Reduce batch size to 3000

As with the previous value of 5000 we seemed to be reaching OOM in some
circumstances.

Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>

panfrost: ci: Update expectations

Since last Friday, these two tests have been fixed:

dEQP-GLES2.functional.shaders.functions.control_flow.return_in_nested_loop_fragment
dEQP-GLES2.functional.shaders.linkage.varying_7

Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Acked-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>

freedreno: Fix warning on printing a uint64_t using %llx.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

freedreno: Silence compiler warnings about "*" in boolean context.

It sure looks like we just want both of them to be nonzero, and && is
probably going to be cheaper than * anyway.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

freedreno: Silence compiler warnings about uninit 'layers'

My gcc can't see that the uninitialized value from the PIPE_BUFFER case
isn't used from the !PIPE_BUFFER cases later.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

freedreno: Quiet compiler warnings on 64-bit.

__u64 is a ulonglong on x86_64, not uint64_t, so my gcc was complaining
about the wrong type being passed in.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

freedreno: Make emacs indent the way robclark's eclipse does.

The .editorconfig helps with the tabs, but we've got this
two-tabs-from-previous-indentation line continuation style that requires
whacking the c-file-offsets. This will throw emacs warnings when first
opening a file in the directory, press '!' to shut it up for the future.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

freedreno: Make .editorconfig match .dir-locals.el.

The editorconfig takes precedence over dir-locals in emacs26 with
editorconfig enabled, so the /.editorconfig was affecting these
directories.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>

anv: Implement VK_KHR_uniform_buffer_standard_layout

There's no real work to do here since we already support scalar block
layout which is a direct superset of what this extension allows.

Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

vulkan: Update the XML and headers to 1.1.108

Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

tu/entrypoints: Import copy

It's used without being imported

nv50/ir/nir: make use of SYSTEM_VALUE_MAX when iterating read sysvals

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Pierre Moreau <dev@pmoreau.org>

nv50/ir/nir: prefer to shift 1ull instead of 1ll

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Suggested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Pierre Moreau <dev@pmoreau.org>

radv: Clean up signalled and submitted fields from winsys fences.

Other types like syncobj do not need it, so lets make things a bit more uniform.

Also reduce confusion what the signalled/submitted referred to (especially with
imported fences)

Reviewed-by: Dave Airlie <airlied@redhat.com>

radv: bump reported version to 1.1.107

VK_AMD_draw_indirect_count has been promoted with the suffix
changed to KHR.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>

v3d: Use driconf to expose non-MSAA texture limits for Xorg.

The V3D 4.2 HW has a limit to MSAA texture sizes of 4096. With non-MSAA,
we can go up to 7680 (actually probably 8138, but that hasn't been
validated by the HW team). Exposing 7680 in X11 will allow dual 4k displays.

gallium: Redefine the max texture 2d cap from _LEVELS to _SIZE.

The _LEVELS assumes that the max is always power of two. For V3D 4.2, we
can support up to 7680 non-power-of-two MSAA textures, which will let X11
support dual 4k displays on newer hardware.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

mesa: Replace MaxTextureLevels with MaxTextureSize.

In most places (glGetInteger, max_legal_texture_dimensions), we wanted the
number of pixels, not the number of levels. Number of levels is easily
recovered with util_next_power_of_two() and ffs(). More importantly, for
V3D we want to be able to expose a non-power-of-two maximum texture size
to cover 2x4k displays on HW that can't quite do 8192 wide.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

mesa: Remove proxy image checks for maximum level.

We've already verified this by _mesa_legal_texture_dimensions() before
this call.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

mesa: Reuse _mesa_max_texture_levels() instead of open-coding it.

The shared function has some extension presence checks, but other than
that has the same switch statement contents.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

intel/tools: Fix build with glibc < 2.27.

glibc < 2.27 defines OVERFLOW in /usr/include/math.h.

This patch fixes this build error.

In file included from ../include/c99_math.h:37:0,
                 from ../src/util/u_math.h:44,
                 from ../src/mesa/main/macros.h:35,
                 from ../src/intel/compiler/brw_reg.h:47,
                 from ../src/intel/tools/i965_asm.h:32,
                 from ../src/intel/tools/i965_gram.y:29:
src/intel/tools/i965_gram.tab.c:562:5: error: expected identifier before numeric constant
     OVERFLOW = 412,
     ^

Fixes: 70308a5a8a80 ("intel/tools: New i965 instruction assembler tool")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110656
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Acked-by: Eric Engestrom <eric@engestrom.ch>

st/mesa: enable the ST_DEBUG env var in release and debugoptimized builds

Useful for dumping shaders.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

radeonsi: overhaul the vertex fetch fixup mechanism

The overall goal is to support unaligned loads from vertex buffers
natively on SI.

In the unaligned case, we fall back to the general case implementation in
ac_build_opencoded_load_format. Since this function is fully general,
we will also use it going forward for cases requiring fully manual format
conversions of dwords anyway.

This requires a different encoding of the fix_fetch array, which will now
contain the entire format information if a fixup is required.

Having to check the alignment of vertex buffers is awkward. To keep the
impact on the fast path minimal, the si_context will keep track of which
vertex buffers are (not) at least dword-aligned, while the
si_vertex_elements will note which vertex buffers have some (at most dword)
alignment requirement. Vertex buffers should be dword-aligned most of the
time, which allows a fast early-out in almost all cases.

Add the radeonsi_vs_fetch_always_opencode configuration variable for
testing purposes. Note that it can only be used reliably on LLVM >= 9,
because support for byte and short load is required.

v2:
- add a missing check to si_bind_vertex_elements

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: store sctx->vertex_elements in a local in si_shader_selector_key_vs

Purely as a shorthand in the remainder of the function.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

amd/common: add ac_build_opencoded_fetch_format

Implement software emulation of buffer_load_format for all types required
by vertex buffer fetches.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

nir/validate: Use a single set for SSA def validation

The current SSA def validation we do in nir_validate validates three
things:

1. That each SSA def is only ever used in the function in which it is
    defined.

2. That an nir_src exists in an SSA def's use list if and only if it
    points to that SSA def.

3. That each nir_src is in the correct use list (uses or if_uses) based
    on whether it's an if condition or not.

The way we were doing this before was that we had a hash table which
provided a map from SSA def to a small ssa_def_validate_state data
structure which contained a pointer to the nir_function_impl and two
hash sets, one for each use list.  This meant piles of allocation and
creating of little hash sets.  It also meant one hash lookup for each
SSA def plus one per use as well as two per src (because we have to look
up the ssa_def_validate_state and then look up the use.)  It also
involved a second walk over the instructions as a post-validate step.

This commit changes us to use a single low-collision hash set of SSA
sources for all of this by being a bit more clever.  We accomplish the
objectives above as follows:

1. The list is clear when we start validating a function.  If the
    nir_src references an SSA def which is defined in a different
    function, it simply won't be in the set.

2. When validating the SSA defs, we walk the uses and verify that they
    have is_ssa set and that the SSA def points to the SSA def we're
    validating.  This catches the case of a nir_src being in the wrong
    list.  We then put the nir_src in the set and, when we validate the
    nir_src, we assert that it's in the set.  This takes care of any
    cases where a nir_src isn't in the use list.  After checking that
    the nir_src is in the set, we remove it from the set and, at the end
    of nir_function_impl validation, we assert that the set is empty.
    This takes care of any cases where a nir_src is in a use list but
    the instruction is no longer in the shader.

3. When we put a nir_src in the set, we set the bottom bit of the
    pointer to 1 if it's the condition of an if.  This lets us detect
    whether or not a nir_src is in the right list.

When running shader-db with an optimized debug build of mesa on my
laptop, I get the following shader-db CPU times:

   With NIR_VALIDATE=0       3033.34 seconds
   Before this commit       20224.83 seconds
   After this commit         6255.50 seconds

Assuming shader-db is a representative sampling of GLSL shaders, this
means that making this change yields an 81% reduction in the time spent
in nir_validate.  It still isn't cheap but enabling validation now only
increases compile times by 2x instead of 6.6x.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

util/set: Add a helper to resize a set

Often times you don't know how big a set will be and you want the code
to just grow it as needed. However, sometimes you do know and you can
avoid a lot of rehashing if you just specify a size up-front.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

util/set: Add a search_and_add function

This function is identical to _mesa_set_add except that it takes an
extra out parameter that lets the caller detect if a replacement
happened.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

nir/validate: Use a ralloc context for our temporary data

All of our hash tables and sets are already using ralloc. There's
really no good reason why we don't just make a ralloc context rather
than try to remember to clean everything up manually.

Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

lima: add Allwinner H5 support

The H5 hardware variant requires a specific plb_max_blk number. This
value can't be probed at the hardware level.

Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Reviewed-by: Qiang Yu <yuq825@gmail.com>

lima: refactor plb_max_blk

Move plb_max_blk to lima_screen, and add a new debug option:
LIMA_PLB_MAX_BLK

Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Reviewed-by: Qiang Yu <yuq825@gmail.com>

radv: Do not use extra descriptor space for the 3rd plane.

While ImageFormatProperties returns the number of internal descriptors,
it turns out that applications do not need to actually allocate more
descriptors in the descriptor pool.

So if we make descriptors with more planes larger we have to be
convervative and always allocate space for the larger descriptors
which is a waste given the low usage of this ext.

So let us make use of the fact that 3plane formats all have the
same formats & dimensions for the last two planes. This way we
only need the first half of the descriptor of the 3rd plane and
can share the second half of the second plane.

This allows us to use 16 bytes for the descriptor which nicely
fits into the 16 bytes that are unused right next to the sampler.

Fixes: 5564c38212a "radv: Update descriptor sets for multiple planes."
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>

radv: Add support for icd loader interface v4.

Adds support for physical device functions unknown to the loader.

Acked-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>

panfrost/midgard: Handle csel correctly

We use an algebraic pass for the csel optimizations, and use proper
vectorized csel ops (i/fcsel_v) for mixed, rather lowering.

To avoid regressions along the way, we fix an issue with the copy
propagation pass (it should not attempt to propagate constants).
Similarly, we take care to break bundles when using csel to fix some
scheduler corner cases.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>

iris: Implement ARB_indirect_parameters

iris_draw_vbo is divided into two functions to remove unnecessary
operations from the loop. This implementation of ARB_indirect_parameters
takes into account NV_conditional_render by saving MI_PREDICATE_RESULT
at the start of a draw call and restoring it at the end also the result
of NV_conditional_render is taken into account when computing predicates
that limit draw calls for ARB_indirect_parameters in a similar way
to 1952fd8d in ANV.

v2: Optimize indirect draws (suggested by Kenneth Graunke)
v3: (by Kenneth Graunke)
- Fix an issue where indirect draws wouldn't set patch information
before updating the compiled TCS.
- Move some code back to iris_draw_vbo to avoid duplicating it.
- Fix minor indentation issues.

Signed-off-by: Illia Iorin <illia.iorin@globallogic.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

iris: Split iris_update_draw_info into two functions.

Shader draw parameters need updating on each iteration of a multidraw
loop, but the primitive based information only needs to be updated once.

Also, patch information needs to be recorded before filling out the TCS
program key, as it determines the number of HS instances.

nir: Fix wrong sign in lower_rcp

The nested fma calls were supposed to implement

x_new = x + x * (1 - x*src),

but instead current code is equivalent to

x_new = x - x * (1 - x*src).

The result is that Newton-Raphson steps don't improve precision at all.
This patch fixes this problem.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110435
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

intel: drop misleading driver name from gen_get_device_info()

radv: clear vertex bindings while resetting command buffer

Only vertex inputs accessed by vertex shader must have valid buffers
bound.

Signed-off-by: Józef Kucia <joseph.kucia@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Fixes: 5010436e09f "radv: bail out when binding the same vertex buffers"

st/mesa: fix 2 crashes in st_tgsi_lower_yuv

src/mesa/state_tracker/st_tgsi_lower_yuv.c:68: void reg_dst(struct
tgsi_full_dst_register *, const struct tgsi_full_dst_register *, unsigned
int): assertion "dst->Register.WriteMask" failed

The second crash was due to insufficient allocated size for TGSI
instructions.

Cc: 19.0 19.1 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Rob Clark <robdclark@gmail.com>

iris: Use full ways for L3 cache setup on Icelake.

Anuj fixed this in i965 and anv, but the fix never landed in iris.
Fixes tessellation corruption on Icelake. Thanks to Rafael for
bisecting this and tracking it down.

Fixes: d0996d5fab6 iris: Emit default L3 config for the render pipeline
Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>

anv: Fix limits when VK_EXT_descriptor_indexing is used

Update various limits in
VkPhysicalDeviceDescriptorIndexingPropertiesEXT that were previously
zero to their values from VkPhysicalDeviceLimits.  When using
VK_EXT_descriptor_indexing, the former limits will apply to all the
descriptor layout sets -- not only those using the new feature bits.

For the reference, VK_EXT_descriptor_indexing says

    "There are new descriptor set layout and descriptor pool creation
    flags that are required to opt in to the update-after-bind
    functionality, and there are separate maxPerStage* and
    maxDescriptorSet* limits that apply to these descriptor set
    layouts which may be much higher than the pre-existing limits. The
    old limits only count descriptors in non-updateAfterBind
    descriptor set layouts, and the new limits count descriptors in
    all descriptor set layouts in the pipeline layout."

Fixes: 6e230d7607f "anv: Implement VK_EXT_descriptor_indexing"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>