git.libre-soc.org Git - mesa.git/log

winsys/amdgpu: handle cs_add_fence_dependency for deferred/unsubmitted fences

The idea is to fix the following interleaving of operations
that can arise from deferred fences:

Thread 1 / Context 1          Thread 2 / Context 2
--------------------          --------------------
f = deferred flush
<------- application-side synchronization ------->
                               fence_server_sync(f)
                               ...
                               flush()
flush()

We will now stall in fence_server_sync until the flush of context 1
has completed.

This scenario was unlikely to occur previously, because applications
seem to be doing

Thread 1 / Context 1          Thread 2 / Context 2
--------------------          --------------------
f = glFenceSync()
glFlush()
<------- application-side synchronization ------->
                               glWaitSync(f)

... and indeed they probably *have* to use this ordering to avoid
deadlocks in the GLX model, where all GL operations conceptually
go through a single connection to the X server. However, it's less
clear whether applications have to do this with other WSI (i.e. EGL).
Besides, even this sequence of GL commands can be translated into
the Gallium-level sequence outlined above when Gallium threading
and asynchronous flushes are used. So it makes sense to be more
robust.

As a side effect, we no longer busy-wait on submission_in_progress.

We won't enable asynchronous flushes on radeon, but add a
cs_add_fence_dependency stub anyway to document the potential
issue.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

gallium: add PIPE_FLUSH_{TOP,BOTTOM}_OF_PIPE bits

These bits are intended to be used by the ddebug hang detection and are
named in analogy to the Vulkan stage bits (and the corresponding Radeon
pipeline event).

Hang detection needs fences on the granularity of individual commands,
which nothing else really covers. The closest alternative would have
been PIPE_QUERY_GPU_FINISHED, but (a) queries are a per-context object
and we really want a per-screen object, (b) queries don't offer a
wait with timeout, and (c) in any case, PIPE_QUERY_GPU_FINISHED is
meant to imply that GPU caches are flushed, which the new bits
explicitly aren't.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

gallium: add PIPE_FLUSH_ASYNC and PIPE_FLUSH_HINT_FINISH

Also document some subtleties of pipe_context::flush.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

util/u_queue: add util_queue_fence_wait_timeout

v2:
- style fixes
- fix missing timeout handling in futex path

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

threads: update for late C11 changes

C11 threads were changed to use struct timespec instead of xtime, and
thrd_sleep got a second argument.

See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1554.htm and
http://en.cppreference.com/w/c/thread/{thrd_sleep,cnd_timedwait,mtx_timedlock}

Note that cnd_timedwait is spec'd to be relative to TIME_UTC / CLOCK_REALTIME.
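
Because cnd_timedwait takes an absolute TIME_UTC time, a caller with a relative timeout has to convert it first. A minimal sketch of that conversion, assuming a C11 libc that provides timespec_get; the helper names timespec_add_ns and deadline_from_now are made up for illustration and are not part of this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000L

/* Add a relative timeout (in nanoseconds) to a base time and normalize
 * the result so that 0 <= tv_nsec < NS_PER_SEC. */
static struct timespec
timespec_add_ns(struct timespec base, int64_t timeout_ns)
{
   assert(timeout_ns >= 0);

   base.tv_sec += (time_t)(timeout_ns / NS_PER_SEC);
   base.tv_nsec += (long)(timeout_ns % NS_PER_SEC);
   if (base.tv_nsec >= NS_PER_SEC) {
      base.tv_sec += 1;
      base.tv_nsec -= NS_PER_SEC;
   }
   return base;
}

/* Hypothetical helper: absolute TIME_UTC deadline "timeout_ns from now",
 * in the form expected by cnd_timedwait and mtx_timedlock. */
static struct timespec
deadline_from_now(int64_t timeout_ns)
{
   struct timespec now;

   timespec_get(&now, TIME_UTC);
   return timespec_add_ns(now, timeout_ns);
}
```

Since the deadline is based on the realtime clock, a wall-clock adjustment during the wait can lengthen or shorten the effective timeout.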

v2: Fix Windows build errors. Tested with a default Appveyor config
    that uses Visual Studio 2013. Judging from Brian's email and
    random internet sources, Visual Studio 2015 does have timespec
    and timespec_get, hence the _MSC_VER-based guard which I have
    not tested.

Cc: Jose Fonseca <jfonseca@vmware.com>
Cc: Brian Paul <brianp@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)

gallium: remove unused and deprecated u_time.h

Cc: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>

util: move os_time.[ch] to src/util

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: always use async compiles when creating shader/compute states

With Gallium threaded contexts, creating shader/compute states is
effectively a screen operation, so we should not use context state.

In particular, this allows us to avoid using the context's LLVM
TargetMachine.

This isn't an issue yet because u_threaded_context filters out non-async
debug callbacks, and we disable threaded contexts for debug contexts.
However, we may want to change that in the future.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: fix potential use-after-free of debug callbacks

Found by inspection.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: move pipe debug callback to si_context

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

u_queue: add util_queue_finish for waiting for previously added jobs

Schedule one job for every thread, and wait on a barrier inside the job
execution function.

v2: avoid alloca (fixes Windows build error)

Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)

util: move pipe_barrier into src/util and rename to util_barrier

The #if guard is probably not 100% equivalent to the previous PIPE_OS
check, but if anything it should be an over-approximation (are there
pthread implementations without barriers?), so people will get either
a good implementation or compile errors that are easy to fix.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

gallium: add async debug message forwarding helper

v2: use util_vasprintf for Windows portability

Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)

st/mesa: guard sampler views changes with a mutex

Some locking is unfortunately required, because well-formed GL programs
can have multiple threads racing to access the same texture, e.g.: two
threads/contexts rendering from the same texture, or one thread destroying
a context while the other is rendering from or modifying a texture.

Since even the simple mutex caused noticeable slowdowns in the piglit
drawoverhead micro-benchmark, this patch uses a slightly more involved
approach to keep locks out of the fast path:

- the initial lookup of sampler views happens without taking a lock
- a per-texture lock is only taken when we have to modify the sampler
  view(s)
- since each thread mostly operates only on the entry corresponding to
  its context, the main issue is re-allocation of the sampler view array
  when it needs to be grown; this is handled by not freeing the old copy

Old copies of the sampler views array are kept around in a linked list
until the entire texture object is deleted. The total memory wasted
in this way is roughly equal to the size of the current sampler views
array.
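
The grow-without-free scheme can be sketched roughly as follows. This is a simplified, single-threaded illustration with hypothetical names (view_array, texture_grow_views), not the st/mesa code; the real patch additionally needs the per-texture lock around the grow path and suitable memory ordering when publishing the new array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One generation of the array; older generations stay on the next list. */
struct view_array {
   struct view_array *next; /* previous, smaller generation (kept alive) */
   unsigned count;
   int views[];             /* stand-in for the per-context view entries */
};

struct texture {
   struct view_array *current; /* readers dereference this without a lock */
};

/* Grow to new_count entries. The caller is assumed to hold the
 * per-texture lock. The old array is not freed, so a reader that still
 * holds a pointer to it never touches freed memory. */
static int
texture_grow_views(struct texture *tex, unsigned new_count)
{
   struct view_array *old = tex->current;
   unsigned old_count = old ? old->count : 0;
   struct view_array *arr;

   assert(new_count > old_count);

   arr = calloc(1, sizeof(*arr) + new_count * sizeof(arr->views[0]));
   if (!arr)
      return -1;

   arr->count = new_count;
   if (old)
      memcpy(arr->views, old->views, old_count * sizeof(arr->views[0]));
   arr->next = old;
   tex->current = arr; /* real code would publish with a release store */
   return 0;
}

/* All generations are released only when the texture itself goes away. */
static void
texture_destroy(struct texture *tex)
{
   struct view_array *arr = tex->current;

   while (arr) {
      struct view_array *next = arr->next;
      free(arr);
      arr = next;
   }
   tex->current = NULL;
}
```

If the array doubles on each growth, the retired generations have sizes 1, 2, 4, ..., n/2, which sum to just under n, consistent with the memory estimate given above.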

Fixes non-deterministic memory corruption in some
dEQP-EGL.functional.sharing.gles2.multithread.* tests, e.g.
dEQP-EGL.functional.sharing.gles2.multithread.simple.images.texture_source.create_texture_render

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

st/mesa: re-arrange st_finalize_texture

Move the early-out for surface-based textures earlier. This narrows the
scope of the locking added in a follow-up commit.

Fix one remaining case of initializing a surface-based texture
without properly finalizing it.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

gallium: clarify the constraints on sampler_view_destroy

r600 expects the context that created the sampler view to still be alive
(there is a per-context list of sampler views).

svga currently bails when the context of destruction is not the same as
creation.

The GL state tracker, which is the only one that runs into the
multi-context subtleties (due to share groups), already guarantees that
sampler views are destroyed before their context of creation is destroyed.

Most drivers are context-agnostic, so the warning message in
pipe_sampler_view_release doesn't really make sense.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: reduce the scope of sel->mutex in si_shader_select_with_key

We only need the lock to guard changes in the variant linked list. The
actual compilation can happen outside the lock, since we use the ready
fence as a guard.

v2: fix double-unlock

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

radeonsi: use ready fences on all shaders, not just optimized ones

There's a race condition between si_shader_select_with_key and
si_bind_XX_shader:

  Thread 1                         Thread 2
  --------                         --------
  si_shader_select_with_key
    begin compiling the first
    variant
    (guarded by sel->mutex)
                                   si_bind_XX_shader
                                     select first_variant by default
                                     as state->current
                                   si_shader_select_with_key
                                     match state->current and early-out

Since thread 2 never takes sel->mutex, it may go on rendering without a
PM4 for that shader, for example.

The solution taken by this patch is to broaden the scope of
shader->optimized_ready to a fence shader->ready that applies to
all shaders. This does not hurt the fast path (if anything it makes
it faster, because we don't explicitly check is_optimized).

It will also allow reducing the scope of sel->mutex locks, but this is
deferred to a later commit for better bisectability.

Fixes dEQP-EGL.functional.sharing.gles2.multithread.simple.buffers.bufferdata_render

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

u_queue: add a futex-based implementation of fences

Fences are now 4 bytes instead of 96 bytes (on my 64-bit system).

Signaling a fence is a single atomic operation in the fast case plus a
syscall in the slow case.

Testing if a fence is signaled is the same as before (a simple comparison),
but waiting on a fence is now no more expensive than just testing it in
the fast (already signaled) case.

v2:
- style fixes
- use p_atomic_xxx macros with the right barriers

Acked-by: Marek Olšák <marek.olsak@amd.com>

u_queue: add util_queue_fence_reset

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

u_queue: export util_queue_fence_signal

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

u_queue: group fence functions together

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

util/u_atomic: add p_atomic_xchg

The closest to it in the old-style gcc builtins is __sync_lock_test_and_set,
however, that is only guaranteed to work with values 0 and 1 and only
provides an acquire barrier. I also don't know about other OSes, so we
provide a simple & stupid emulation via p_atomic_cmpxchg.
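
A minimal sketch of what such a compare-and-swap based exchange can look like, written with the old-style gcc builtin __sync_val_compare_and_swap. The function name xchg_via_cmpxchg is hypothetical; the actual p_atomic_xchg is a macro and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically store new_value and return the previous value, using only
 * a compare-and-swap primitive. The loop retries whenever another thread
 * changed the value between our read and our swap attempt. */
static inline uint32_t
xchg_via_cmpxchg(uint32_t *ptr, uint32_t new_value)
{
   uint32_t old;

   assert(ptr != NULL);

   old = *(volatile uint32_t *)ptr;
   for (;;) {
      uint32_t seen = __sync_val_compare_and_swap(ptr, old, new_value);
      if (seen == old)
         return old; /* swap succeeded; old is the value we replaced */
      old = seen;    /* lost a race; retry against the value we saw */
   }
}
```

Unlike __sync_lock_test_and_set, the compare-and-swap builtin is documented as a full barrier and works for arbitrary values, at the cost of a retry loop under contention.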

Reviewed-by: Marek Olšák <marek.olsak@amd.com>

util: move futex helpers into futex.h

v2: style fixes

Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)

glsl: Make #pragma STDGL invariant(all) only modify outputs.

According to the GLSL ES 3.20, GLSL 4.50, and GLSL 1.20 specs:

   "To force all output variables to be invariant, use the pragma

       #pragma STDGL invariant(all)

    before all declarations in a shader."

Notably, this is only supposed to affect output variables.  Furthermore,

   "Only variables output from a shader can be candidates for invariance."

It looks like this has been wrong since we first supported the pragma in
2011 (commit 86b4398cd158024f6be9fa830554a11c2a7ebe0c).

Fixes dEQP-GLES2.functional.shaders.preprocessor.pragmas.pragma_fragment.

v2: Now that all cases are identical (other than compute shaders, which
    have no output variables anyway), we can drop the switch statement
    entirely.  We also don't need the current_function == NULL check;
    this was a holdover from when we had a single var_mode_out for both
    function parameters and shader varyings, in the bad old days.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>

i965: expose SRGB visuals and turn on EGL_KHR_gl_colorspace

Patch exposes sRGB visuals and adds DRI integer query support for
__DRI2_RENDERER_HAS_FRAMEBUFFER_SRGB. Further changes make sure that
we mark if the app explicitly wanted sRGB and for these framebuffers
we don't turn sRGB off in intel_gles3_srgb_workaround. This way we
keep compatibility for existing applications relying on default sRGB
and only add more visual support.

With this change, following dEQP tests start to pass:

   dEQP-EGL.functional.wide_color.window_8888_colorspace_srgb
   dEQP-EGL.functional.wide_color.pbuffer_8888_colorspace_srgb

v2: some code cleanup (Emil Velikov)
    update num_formats correctly (reported by deveee@gmail.com)

v3: cleanup, remove redundant is_srgb
    rename explicit_srgb as 'need_srgb' to follow style better

Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com> (v2)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102264
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102354
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102503

glsl: Transform fb buffers are only active if a variable uses them

The GL spec will soon be revised to clarify that a buffer binding for
a transform feedback buffer is only required if a variable is actually
defined to use the buffer binding point. Previously a declaration for
the default transform buffer would make it require a binding even if
nothing was declared to use the default buffer.

Affects:
KHR-GL44/45.enhanced_layouts.xfb_stride_of_empty_list
KHR-GL44/45.enhanced_layouts.xfb_stride_of_empty_list_and_api

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Cc: mesa-stable@lists.freedesktop.org

intel/nir: Use the correct indirect lowering masks in link_shaders

Previously, if we were linking a vec4 VS with a SIMD8/16 FS, we wouldn't
lower indirects on the fragment shader which is wrong. Instead of using
a single indirect mask, take advantage of our new little helper.

Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org

r600g: use SIMPLE_FLOAT for blending to enable some optimizations

Radeonsi also sets this flag. Seems to avoid pulling up the destination
RT value when the dst blend factor is zero if it's not otherwise being
loaded. Among other things, it allows blending to overwrite infinity/NaN
values in the destination RT.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

nv50: make blending work so that zero wins in a multiplication

This matches nvc0 behavior, tested with the fbo-float-nan piglit.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>

glsl: Minor cleanups after previous commit

I think it's more clear to only call emit_access once. The only
difference between the two calls is the value of size_mul used for the
offset parameter... but you really have to look at it to be sure.

The s/is_64bit/is_double/ change is because there are no int64_t or
uint64_t matrix types.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Use more link_calculate_matrix_stride in lower_buffer_access

I was going to squash this with the previous commit, but there's a lot
of churn in that commit.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Use link_calculate_matrix_stride in lower_buffer_access and friends

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Refactor matrix stride calculation into a utility function

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl/linker: Optimize swizzles again after linking

Without this, the SPIR-V generator has to deal with a bunch of junk
like:

(swiz z (swiz xxx (swiz x (var_ref packed:binormal.z,light_dir))))

It seems better to cull that stuff out than to add code to deal with
it. The problem is the way swizzles to and from scalars have to be
handled in SPIR-V.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Combine nop-swizzle optimization with swizzle-swizzle optimization

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Make the swizzle-swizzle optimization greedy

If there is a long sequence of swizzled swizzles, compact all of them
down to a single swizzle.
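
Collapsing a chain works because swizzles compose: applying an outer swizzle to the result of an inner one selects, for each result component, the inner component that the outer one points at. A small standalone sketch of that folding, using hypothetical types rather than the ir_swizzle classes:

```c
#include <assert.h>

#define MAX_SWIZZLE 4

struct swizzle {
   unsigned num;                    /* number of result components, 1..4 */
   unsigned char comp[MAX_SWIZZLE]; /* source component for each result */
};

/* Result of applying outer to the value produced by inner:
 * result[i] = inner.comp[outer.comp[i]]. */
static struct swizzle
swizzle_compose(struct swizzle outer, struct swizzle inner)
{
   struct swizzle r = { .num = outer.num };

   assert(outer.num >= 1 && outer.num <= MAX_SWIZZLE);
   for (unsigned i = 0; i < outer.num; i++) {
      assert(outer.comp[i] < inner.num);
      r.comp[i] = inner.comp[outer.comp[i]];
   }
   return r;
}

/* Fold a whole chain into one swizzle; chain[0] is the outermost. */
static struct swizzle
swizzle_collapse(const struct swizzle *chain, unsigned len)
{
   struct swizzle r;

   assert(len >= 1);
   r = chain[0];
   for (unsigned i = 1; i < len; i++)
      r = swizzle_compose(r, chain[i]);
   return r;
}
```

For a chain such as (swiz z (swiz xxx (swiz x v))), folding from the outside in yields the single swizzle x.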

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>

glsl: Remove program_resource_visitor::visit_field(const glsl_struct_field *)

I could not find any remaining users.

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

glsl: Silence unused parameter warning

glsl/lower_shared_reference.cpp: In member function ‘virtual void
{anonymous}::lower_shared_reference_visitor::insert_buffer_access(void*,
ir_dereference*, const glsl_type*, ir_rvalue*, unsigned int, int)’:

glsl/lower_shared_reference.cpp:244:58: warning: unused parameter
‘channel’ [-Wunused-parameter]
int channel)
^~~~~~~

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

ac/nir: add support for all intrinsics. (v2)

This is derived from tgsi/radeonsi code from the GLSL intrinsics.

This should pre-fix radv for the upcoming spirv patches.

v2: actually use wait_cnt, sleep deprived dad time! (Bas)

Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Dave Airlie <airlied@redhat.com>

amdgpu: use simple mtx

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

mesa: use simple mtx in core mesa

Results from x11perf -copywinwin10 on Eric's SKL:
4.33338% ± 0.905054% (n=40)

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Tested-by: Yogesh Marathe <yogesh.marathe@intel.com>

mesa: Add new fast mtx_t mutex type for basic use cases

While modern pthread mutexes are very fast, they still incur a call to an
external DSO and overhead of the generality and features of pthread mutexes.
Most mutexes in mesa only need lock/unlock, and the idea here is that we can
inline the atomic operation and make the fast case just two instructions.
Mutexes are subtle and finicky to implement, so we carefully copy the
implementation from Ulrich Drepper's well-written and well-reviewed paper:

  "Futexes Are Tricky"
  http://www.akkadia.org/drepper/futex.pdf

We implement "mutex3", which gives us a mutex that has no syscalls on
uncontended lock or unlock.  Further, the uncontended lock boils down to a
cmpxchg and an untaken branch and the uncontended unlock is just a locked decr
and an untaken branch.  We use __builtin_expect() to indicate that contention
is unlikely so that gcc will put the contention code out of the main code
flow.
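
For orientation, the "mutex3" scheme from the paper can be sketched as below. This is a Linux-only illustration with hypothetical names (sketch_mtx and friends) written against the gcc __atomic builtins and the raw futex syscall; it is not the simple_mtx_t code itself, which may use different primitives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* val: 0 = unlocked, 1 = locked without waiters, 2 = locked, maybe waiters */
struct sketch_mtx {
   uint32_t val;
};

static long
sketch_futex_wait(uint32_t *addr, uint32_t expected)
{
   return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long
sketch_futex_wake(uint32_t *addr, int count)
{
   return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}

static inline void
sketch_mtx_lock(struct sketch_mtx *m)
{
   uint32_t c = 0;

   /* Fast path: a single cmpxchg 0 -> 1 and an untaken branch. */
   if (__builtin_expect(__atomic_compare_exchange_n(&m->val, &c, 1, 0,
                                                    __ATOMIC_ACQUIRE,
                                                    __ATOMIC_RELAXED), 1))
      return;

   /* Contended: mark the lock as "has waiters" and sleep until we are
    * the thread that moves it from 0 to 2. */
   if (c != 2)
      c = __atomic_exchange_n(&m->val, 2, __ATOMIC_ACQUIRE);
   while (c != 0) {
      sketch_futex_wait(&m->val, 2);
      c = __atomic_exchange_n(&m->val, 2, __ATOMIC_ACQUIRE);
   }
}

static inline void
sketch_mtx_unlock(struct sketch_mtx *m)
{
   /* Fast path: an atomic decrement 1 -> 0 and an untaken branch. */
   uint32_t c = __atomic_fetch_sub(&m->val, 1, __ATOMIC_RELEASE);

   assert(c != 0); /* unlocking an unlocked mutex is a caller bug */
   if (__builtin_expect(c != 1, 0)) {
      /* There may be waiters: fully release and wake one of them. */
      __atomic_store_n(&m->val, 0, __ATOMIC_RELEASE);
      sketch_futex_wake(&m->val, 1);
   }
}
```

The syscalls are only reached when the state is 2, so a lock that is never contended stays entirely in user space.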

A fast mutex only supports lock/unlock, can't be recursive or used with
condition variables.  We keep the pthread mutex implementation around
for the few places where we use condition variables or recursive locking.
For platforms or compilers where futex and atomics aren't available,
simple_mtx_t falls back to the pthread mutex.

The pthread mutex lock/unlock overhead shows up on benchmarks for CPU bound
applications.  Most CPU bound cases are helped and some of our internal
bind_buffer_object heavy benchmarks gain up to 10%.

Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Signed-off-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

mesa: rework how we free gl_shader_program_data

When I introduced gl_shader_program_data one of the intentions was to
fix a bug where a failed linking attempt freed data required by a
currently active program. However I seem to have failed to finish
hooking up the final steps required to have the data hang around.

Here we create a fresh instance of gl_shader_program_data every
time we link. gl_program has a reference to gl_shader_program_data
so it will be freed once the program is no longer active.

Cc: "17.2 17.3" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Neil Roberts <nroberts@igalia.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102177

glsl: use the correct parent when allocating program data members

Cc: "17.2 17.3" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

glsl: drop cache_fallback

This turned out to be a dead end; it is much easier and less error
prone to just cache the IR used by the driver's backend, e.g. TGSI or
NIR.

Cc: "17.2 17.3" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

i965: properly initialize brw->cs.base.stage to MESA_SHADER_COMPUTE

This has a bit of a surprising effect:

For the render pipeline, the upload_sampler_state_table atom emits
3DSTATE_SAMPLER_STATE_POINTERS_XS.  It tries to avoid this for compute:

   if (GEN_GEN >= 7 && stage_state->stage != MESA_SHADER_COMPUTE) {
      /* Emit a 3DSTATE_SAMPLER_STATE_POINTERS_XS packet. */
      genX(emit_sampler_state_pointers_xs)(brw, stage_state);
   } ...

However, we were failing to initialize brw->cs.base.stage, so it was
left as 0 (MESA_SHADER_VERTEX), causing this condition to break.  We
then emitted 3DSTATE_SAMPLER_STATE_POINTERS_VS in GPGPU mode, when
trying to upload CS samplers.  Nothing good can come of this.
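
The failure mode is generic to zero-initialized structs whose enum's first enumerator is a valid value. A reduced sketch with hypothetical types (not the brw_context layout) shows how the compute check silently passes for an uninitialized stage:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shader stage enum: vertex is value 0. */
enum sketch_stage {
   SKETCH_STAGE_VERTEX = 0,
   SKETCH_STAGE_FRAGMENT,
   SKETCH_STAGE_COMPUTE,
};

struct sketch_stage_state {
   enum sketch_stage stage;
};

struct sketch_context {
   struct sketch_stage_state vs;
   struct sketch_stage_state cs;
};

static struct sketch_context *
sketch_context_create(int init_cs_stage)
{
   /* calloc leaves every stage field at 0, i.e. SKETCH_STAGE_VERTEX. */
   struct sketch_context *ctx = calloc(1, sizeof(*ctx));

   if (!ctx)
      return NULL;
   if (init_cs_stage)
      ctx->cs.stage = SKETCH_STAGE_COMPUTE; /* the missing assignment */
   return ctx;
}

/* Mirrors the shape of the check quoted above: anything that is not
 * marked as compute is treated as a render stage. */
static int
sketch_emits_render_packets(const struct sketch_stage_state *s)
{
   assert(s != NULL);
   return s->stage != SKETCH_STAGE_COMPUTE;
}
```

Without the explicit assignment, the compute state is indistinguishable from the vertex state and takes the render-pipeline path.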

Found by inspection while debugging a GPU hang.  Jordan believes this
helps the Deus Ex: Mankind Divided benchmark mode's stability when
running with shader cache.

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>

intel/nir: Break the linking code into a helper in brw_nir.c

Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org

intel/nir: Add a helper for getting the NoIndirect mask

Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org

nir: Don't print swizzles when there are more than 4 components

... as can happen with various types like mat4, or else we'll smash the
stack writing past the end of components_local[].
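
The shape of the fix is a bounds check before filling a fixed-size component buffer. A standalone sketch of such a guard, with a hypothetical function name and signature rather than the actual nir_print.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Write the component letters for [start, start + num) into out, which
 * must have room for four letters plus the terminator. Returns the
 * number of letters written, or 0 if the range does not fit in xyzw,
 * in which case no swizzle is produced at all. */
static size_t
format_components(char out[5], unsigned start, unsigned num)
{
   static const char letters[4] = { 'x', 'y', 'z', 'w' };

   assert(out != NULL);
   out[0] = '\0';

   /* Types such as mat4 have more than four components: skip them
    * instead of writing past the end of the buffer. */
   if (num == 0 || start >= 4 || num > 4 - start)
      return 0;

   for (unsigned i = 0; i < num; i++)
      out[i] = letters[start + i];
   out[num] = '\0';
   return num;
}
```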

Fixes: 5a0d3e1129b7 ("nir: Print the components referenced for split or
packed shader in/outs.")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

meson: Add threads dependencies to glsl_compiler executable

Fixes compiling the optional standalone glsl compiler.

Reported-by: DrNick (on irc)
Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
Reviewed-and-Tested-by: Eric Engestrom <eric.engestrom@imgtec.com>

glsl: Fix typo fragement -> fragment

Fixes: 94d669b0d2f ("glsl: enforce fragment shader input restrictions in
GLSL ES 3.10")

Signed-off-by: Andreas Boll <andreas.boll.dev@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>

broadcom/vc5: Remove unused v3d_compiler.c

Unused since original import of VC5.

Fixes: ade416d0236 ("broadcom: Add VC5 NIR compiler.")
Signed-off-by: Andreas Boll <andreas.boll.dev@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>

broadcom/vc5: Add vc5_drm.h to the release tarball

Fixes: 45bb8f29571 ("broadcom: Add V3D 3.3 gallium driver called "vc5",
for BCM7268.")

Cc: 17.3 <mesa-stable@lists.freedesktop.org>
Signed-off-by: Andreas Boll <andreas.boll.dev@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>

clover: use the unified check for c++11 instead of the gcc version number

So far clover based its test for compiler support on the version of gcc,
while in reality support for c++11 is required. This patch replaces the
version check by the check unified for all modules that require c++11.

Reviewed-by: Emil Velikov <emil.velikov@collabora.com>

swr: Replace the check for c++11 by the unified version

Reviewed-by: Emil Velikov <emil.velikov@collabora.com>

configure: check for -std=c++11 support and enable st/mesa test accordingly

Add a check that tests whether the c++ compiler supports c++11, either
by default, by adding the compiler flag -std=c++11, or by adding a
compiler flag that the user has specified via the environment variable
CXX11_CXXFLAGS.

The test only does a very shallow check of c++11 support, i.e. it tests
whether __cplusplus >= 201103L holds to confirm language support
by the compiler, and it checks whether the header <tuple> is available
to test the availability of the c++11 standard library.

A make file conditional HAVE_STD_CXX11 is provided that is used in this
patch to enable the test in st/mesa if C++11 support is available.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102665
Acked-by: Emil Velikov <emil.velikov@collabora.com>

configure.ac: append to existing initializer override flags

Currently we were overwriting the existing warning flags, instead of
adding new [as applicable].

Fixes c5d2e2d43f6 ("configure: Test for -Wno-initializer-overrides")
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>

configure.ac: append to existing MSVC compat flags

Currently we were overwriting the existing warning flags, instead of
adding new [as applicable].

v2: Add missing space before -Werror (Eric)

Fixes e4b2b69e828 ("configure: Add and use AX_CHECK_COMPILE_FLAG")
Cc: Matt Turner <mattst88@gmail.com>
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Matt Turner <mattst88@gmail.com> (v1)
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>

meson: Allow building glvnd with EGL and non-dri based GLX

Because meson mirrors the autotools logic, it needs the same changes to
allow building glvnd based egl.

v2: - change if to elif (Eric)

Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Acked-by: Emil Velikov <emil.velikov@collabora.com>

configure.ac: require xcb* for the omx/va/... when using x11 platform

Targets such as omx and va can work w/o anything X related. Mandate the
xcb* dependencies only when the X11 platform is selected.

Reported-by: Lukas Rusak <lorusak@gmail.com>
Fixes: 63e11ac2b5c ("configure: error out if building VA w/o supported
platform")
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Tested-by: Lukas Rusak <lorusak@gmail.com> (v1)

configure.ac: loosen --enable-glvnd check to honour egl

Currently we error out when building GLVND w/o GLX.

That was the original premise before we had EGL. As the commit says,
that error should be reworked to honour both - do so.

v2: Drop noop *);; (Eric)

Reported-by: Lukas Rusak <lorusak@gmail.com>
Fixes: ce562f9e3fa ("EGL: Implement the libglvnd interface for EGL (v3)")
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Tested-by: Lukas Rusak <lorusak@gmail.com> (v1)

egl/android: add a note about .swap_buffers_with_damage

Android implements the API and does the native damage handling itself.
At the same time it
a) does call the vendor's eglSwapBuffersWithDamageKHR
b) does not implement eglSetDamageRegionKHR

There's something strange happening here. For now, simply note the
'lack' of eglSwapBuffersWithDamageKHR support.

Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>

wayland-drm: static inline wayland_drm_buffer_get

The function is effectively a direct function call into
libwayland-server.so.

Thus GBM no longer depends on the wayland-drm static library, making the
build more straightforward. And the resulting binary is a bit smaller.

Note: we need to move struct wayland_drm_callbacks further up,
otherwise we'll get an error since the type is incomplete.

v2: Rebase, beef-up commit message, update meson, move struct
wayland_drm_callbacks.

Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Daniel Stone <daniels@collabora.com> (v1)
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com> # meson bit only
Acked-by: Eric Engestrom <eric.engestrom@imgtec.com> # for the rest
Reviewed-by: Dylan Baker <dylan@pnwbakers.com> # meson

automake: intel: correctly append to the LIBADD variable

Commit 05fc62d89f5 sets the variable, yet it forgot to update the
existing reference to append (instead of assign).

Thus as-is the expat library was discarded from the link chain when
building with Android.

Fixes: 05fc62d89f5 ("automake: intel: move expat handling where it's
used")
Cc: Hongxu Jia <hongxu.jia@windriver.com>
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>

configure: enable the OpenCL ICD by default

Nearly all the distributions* that build Mesa OpenCL enable the ICD,
since building a non-ICD driver has the chance of conflicting with an
existing OpenCL binary (libOpenCL.so).

Furthermore, some applications expect the library to provide
annotated/versioned symbols.

https://lists.freedesktop.org/archives/mesa-dev/2017-September/171093.html

*Fedora, Suse, Arch, Debian, Ubuntu, FreeBSD use the ICD
Gentoo manages the conflicting files via eselect.

Cc: Matt Turner <mattst88@gmail.com>
Cc: Jan Vesely <jan.vesely@rutgers.edu>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-By: Aaron Watry <awatry@gmail.com>

targets/opencl: don't hardcode the icd file install to /etc/...

Use $(sysconfdir) instead of hardcoding /etc.

While the OpenCL spec expects the file in /etc, people building their
stack can override that, esp. !Linux users.

Furthermore this removes a fundamental violation, which results in the
system file being overwritten even as one explicitly sets --prefix
and/or DESTDIR.

Cc: mesa-stable@lists.freedesktop.org
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-By: Aaron Watry <awatry@gmail.com>

amd: add amdgpu_asic_addr.h to the sources list

Otherwise it will be missing from the release tarball

Fixes: 7f33e94e43a ("amd/addrlib: update to latest version")
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>

gallivm: Use new LLVM fast-math-flags API

LLVM 6 changed the API on the fast-math-flags:
https://reviews.llvm.org/rL317488

NOTE: This also enables the new flag 'ApproxFunc' to allow for
approximations for library functions (sin, cos, ...). I'm not completely
convinced that this is something mesa should do.

Signed-off-by: Tobias Droste <tdroste@gmx.de>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-and-Tested-by: Michel Dänzer <michel.daenzer@amd.com>

glsl: add varying resources for arrays of complex types

This patch is mostly a patch done by Ilia Mirkin.

It fixes KHR-GL45.enhanced_layouts.varying_structure_locations.

v2: fix locations for TCS/TES/GS inputs and outputs (Ilia)

CC: Ilia Mirkin <imirkin@alum.mit.edu>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103098
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>

st/glsl_to_nir: use nir_shader_gather_info()

Use the NIR helper rather than the GLSL IR helper to get in/out
masks. This allows us to ignore varyings removed by NIR
optimisations.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

st/glsl_to_nir: generate NIR earlier

We want to use nir_shader_gather_info() because the GLSL IR version might
include varyings that NIR later eliminates. To do this we
need to generate NIR before we start using the in/out bitmasks.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

st/glsl_to_nir: delay adding built-in uniforms to Parameters list

Delaying adding built-in uniforms until after we convert to NIR
gives us a better chance to optimise them away. Also, NIR allows
us to iterate over the uniforms directly, so it should be faster.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

amd/addrlib: update to latest version

This uses C++11 initializer lists.

I just overwrote all Mesa files with internal addrlib and discarded
hunks that we should probably keep, but I might have missed something.

The code depending on ADDR_AM_BUILD is removed. We can add it back next
time if needed.

Acked-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

broadcom/vc5: Flush the job when it grows over 1GB.

Fixes GL_OUT_OF_MEMORY from streaming-texture-leak (and will hopefully
keep piglit from ooming on my no-swap platform, as well).

broadcom/vc5: Do 16-bit unpacking of integer texture returns properly.

We were doing f16 unpacks, which trashed "1" values. Fixes many piglit
texwrap GL_EXT_texture_integer cases.
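
To illustrate why an f16 unpack trashes integer values: the 16-bit
pattern 0x0001 is the integer 1, but read as a half float it is the
smallest subnormal. The following is a minimal standalone sketch, not
the vc5 compiler code; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: unpack the two 16-bit halves of a 32-bit
 * texture return word as unsigned integers. */
static uint32_t unpack_u16_lo(uint32_t word) { return word & 0xffffu; }
static uint32_t unpack_u16_hi(uint32_t word) { return word >> 16; }

/* Sign-extending variants for signed integer formats. The narrowing
 * cast relies on the usual two's-complement behavior. */
static int32_t unpack_i16_lo(uint32_t word) { return (int16_t)(word & 0xffffu); }
static int32_t unpack_i16_hi(uint32_t word) { return (int16_t)(word >> 16); }

/* What an f16 unpack would produce for the same bits.
 * Inf/NaN encodings (exponent 31) are out of scope for this sketch. */
static float half_bits_to_float(uint16_t h)
{
    int sign = (h >> 15) & 1;
    int exp = (h >> 10) & 0x1f;
    int mant = h & 0x3ff;
    assert(exp != 31);

    /* Subnormals are mant * 2^-24; normals are (1024 + mant) * 2^(exp - 25). */
    int e = (exp == 0) ? -24 : exp - 25;
    float v = (exp == 0) ? (float)mant : (float)(mant | 0x400);
    for (; e > 0; e--) v *= 2.0f;
    for (; e < 0; e++) v *= 0.5f;
    return sign ? -v : v;
}
```

With these helpers, unpack_u16_lo(0x00020001) yields the integer 1,
while half_bits_to_float(0x0001) yields a value far below 1.0.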

broadcom/vc5: Fix pausing of transform feedback.

Gallium disables it by removing the streamout buffers, not by binding a
program that doesn't have TF outputs. Fixes piglit
"ext_transform_feedback2/counting with pause"

broadcom/vc5: Add support for GL_RASTERIZER_DISCARD

Fixes piglit discard-drawarrays.

broadcom/vc5: Fix scheduling for a non-SFU R4 write after a dead R4 write.

The v3d_qpu_writes_r*() were only checking for fixed-function accumulator
writes, not normal ALU writes to those regs.

Fixes fs-discard-exit-2 on simulation (but not HW).

broadcom/vc5: Add partial transform feedback query support.

We have to compute the queries in software, so we're counting the
primitives by hand. We still need to make sure to not increment the
PRIMITIVES_EMITTED if we overflowed, but leave that for later.

broadcom/vc5: Add occlusion query support.

Fixes all of piglit's OQ tests.

intel/fs/nir: Return Q types from brw_reg_type_for_bit_size

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>

intel/fs/nir: Use Q immediates for load_const on gen8+

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>

intel/fs/nir: Setup immediates based on type in i2b and f2b

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>

intel/reg: Add helpers for 64-bit integer immediates

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>

compiler/nir_types: Handle vectors in glsl_get_array_element

Most of NIR doesn't allow doing array indexing on a vector (though it
does on a matrix). However, nir_lower_io handles it just fine and this
behavior is needed for shared variables in Vulkan. This commit makes
glsl_get_array_element do something sensible for vector types and makes
nir_validate happy with them.
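
A toy model of the idea, assuming a much simplified type
representation: indexing a matrix yields its column vector, and
indexing a vector now yields its scalar type instead of being
rejected. The toy_* names are hypothetical and are not the real
glsl_type API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a GLSL type. */
enum toy_kind { TOY_SCALAR, TOY_VECTOR, TOY_MATRIX, TOY_ARRAY };

struct toy_type {
    enum toy_kind kind;
    const struct toy_type *element; /* what indexing yields, or NULL */
    unsigned length;                /* components, columns or array size */
};

static const struct toy_type toy_float = { TOY_SCALAR, NULL, 1 };
static const struct toy_type toy_vec4 = { TOY_VECTOR, &toy_float, 4 };
static const struct toy_type toy_mat4 = { TOY_MATRIX, &toy_vec4, 4 };
static const struct toy_type toy_vec4_x3 = { TOY_ARRAY, &toy_vec4, 3 };

/* Type obtained by indexing `t` once. Scalars cannot be indexed. */
static const struct toy_type *toy_get_array_element(const struct toy_type *t)
{
    assert(t != NULL);
    switch (t->kind) {
    case TOY_MATRIX: /* column vector */
    case TOY_VECTOR: /* scalar component: the newly handled case */
    case TOY_ARRAY:  /* array element */
        return t->element;
    case TOY_SCALAR:
        break;
    }
    return NULL;
}
```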

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir: Validate base types on array dereferences

We were already validating that the parent type goes along with the
child type but we weren't actually validating that the parent type is
reasonable. This fixes that.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir,intel/compiler: Use a fixed subgroup size

The GL_ARB_shader_ballot spec says that gl_SubGroupSizeARB is declared
as a uniform.  This means that it cannot change across an invocation
such as a draw call or a compute dispatch.  For compute shaders, we're
ok because we only ever use one dispatch size.  For fragment, however,
the hardware dynamically chooses between SIMD8 and SIMD16 which violates
the spec.  Instead, let's just pick a subgroup size based on the shader
stage.  The fixed size we choose for compute shaders is a bit higher
than strictly needed but there's no real harm in that.  The advantage is
that, if they do anything interesting with the value, NIR will see it as
an immediate and can optimize better.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir/lower_subgroups: Lower ballot intrinsics to the specified bit size

Ballot intrinsics return a bitfield of subgroups.  In GLSL and some
SPIR-V extensions, they return a uint64_t.  In SPV_KHR_shader_ballot,
they return a uvec4.  Also, some back-ends would rather pass around
32-bit values because it's easier than messing with 64-bit all the time.
To solve this mess, we make nir_lower_subgroups take a new parameter
called ballot_bit_size and it lowers whichever thing it gets in from the
source language (uint64_t or uvec4) to a scalar with the specified
number of bits.  This replaces a chunk of the old lowering code.
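
The relationship between the scalar ballot and the two source-language
shapes can be sketched as plain zero-extension. This is a standalone
illustration under that assumption, not the pass itself; the helper
names and the toy_uvec4 struct are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct toy_uvec4 { uint32_t x, y, z, w; };

/* Keep only the low `bits` bits of a ballot (32 or 64 in this sketch). */
static uint64_t toy_ballot_mask(uint64_t ballot, unsigned bits)
{
    assert(bits == 32 || bits == 64);
    return bits == 64 ? ballot : (ballot & 0xffffffffull);
}

/* Scalar ballot of `bits` bits -> GLSL-style uint64_t (zero-extended). */
static uint64_t toy_ballot_to_u64(uint64_t ballot, unsigned bits)
{
    return toy_ballot_mask(ballot, bits);
}

/* Scalar ballot of `bits` bits -> SPIR-V-style uvec4 (zero-padded). */
static struct toy_uvec4 toy_ballot_to_uvec4(uint64_t ballot, unsigned bits)
{
    uint64_t b = toy_ballot_mask(ballot, bits);
    struct toy_uvec4 v = { (uint32_t)b, (uint32_t)(b >> 32), 0, 0 };
    return v;
}
```

A back-end that prefers 32-bit values would pass 32 here and never see
a 64-bit ballot; the upper components are simply zero.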

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir/builder: Add a nir_imm_intN_t helper

This lets you easily build integer immediates of arbitrary bit size.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

nir/lower_system_values: Lower SUBGROUP_*_MASK based on type

The SUBGROUP_*_MASK system values are uint64_t when coming in from GLSL
but uvec4 when coming in from SPIR-V. Lowering based on type allows us
to nicely handle both.
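
For reference, the five masks can be sketched as simple shifts of the
invocation index when held in a uint64_t; the uvec4 form would carry
the same bits in its first two components. This is an illustrative
sketch with hypothetical names, and it ignores any clamping to the
actual subgroup size.

```c
#include <assert.h>
#include <stdint.h>

struct toy_subgroup_masks {
    uint64_t eq, ge, gt, le, lt;
};

/* Masks for invocation `i` of a subgroup of up to 64 invocations. */
static struct toy_subgroup_masks toy_masks_for_invocation(unsigned i)
{
    assert(i < 64);
    struct toy_subgroup_masks m;
    m.eq = 1ull << i;      /* only bit i */
    m.ge = ~0ull << i;     /* bits i and above */
    m.gt = ~1ull << i;     /* bits above i; avoids a shift by 64 */
    m.le = ~m.gt;          /* bits i and below */
    m.lt = ~m.ge;          /* bits below i */
    return m;
}
```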

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir: Make ballot intrinsics variable-size

This way they can return either a uvec4 or a uint64_t. At the moment,
this is a no-op since we still always return a uint64_t.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir: Add a ssa_dest_init_for_type helper

This would be useful in a number of places.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

nir: Add a new subgroups lowering pass

This commit pulls nir_lower_read_invocations_to_scalar along with most
of the guts of nir_opt_intrinsics (which mostly does subgroup lowering)
into a new nir_lower_subgroups pass. There are various other bits of
subgroup lowering that we're going to want to do so it makes a bit more
sense to keep it all together in one pass. We also move it in i965 to
happen after nir_lower_system_values because we want to handle the
subgroup mask system value intrinsics here.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

intel/fs: Don't use automatic exec size inference

The automatic exec size inference can accidentally mess things up if
we're not careful.  For instance, if we have

add(4)    g38.2<4>D    g38.1<8,2,4>D    g38.2<8,2,4>D

then the destination register will end up having a width of 2 with a
horizontal stride of 4 and a vertical stride of 8.  The EU emit code
sees the width of 2 and decides that we really wanted an exec size of 2
which doesn't do what we wanted.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

intel/fs: Explicitly set EXECUTE_1 where needed

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

intel/eu: Explicitly set EXECUTE_1 where needed

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

intel/eu: Make automatic exec sizes a configurable option

We have had a feature in codegen for some time that tries to
automatically infer the execution size of an instruction from the width
of its destination.  For things such as fixed function GS, clipper, and
SF programs, this is very useful because they tend to have lots of
hand-rolled register setup and trying to specify the exec size all the
time would be prohibitive.  For things that come from a higher-level IR,
however, it's easier to just set the right size all the time and the
automatic exec sizes can, in fact, cause problems.  This commit makes it
optional while enabling it by default.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

intel/fs: Rework zero-length URB write handling

Originally we tried to handle this case based on slots_valid.  However,
there are a number of ways that this can go wrong.  For one, we throw
away any trailing slots which either aren't written or are set to
VARYING_SLOT_PAD.  Second, even if PSIZ is a valid slot, we may not
actually write anything there.  Between the lot of these, it was
possible to end up in a case where we tried to do a regular URB write
but ended up with a length of 1 which is invalid.  This commit moves it
to the end and makes it based on a new boolean flag urb_written.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org

intel/compiler/fs: Set up subgroup invocation as a system value

Subgroup invocation is computed using a vector immediate and some
dispatch-aware arithmetic.  Unfortunately, due to the vector arithmetic,
and the fact that it's frequently read 16-wide, it's not something that
can easily be CSEd by the back-end compiler.  There are a few different
possible approaches to this problem:

1) Emit the code to calculate the subgroup invocation on-the-fly and
    trust NIR to do the CSE.  This is what we were doing.

2) Add a back-end instruction for the subgroup ID.  This has the
    advantage of helping the back-end compiler with CSE but has the
    downside of very poor scheduling for the calculation because it has
    to be emitted in the back-end.

3) Emit the calculation at the top of the program and re-use the
    result.  This gets rid of the CSE problem but comes at the cost of
    an extra live register.

This commit switches us from 1) to 3).  We choose to store the subgroup
invocation values as a W type to reduce the impact of the extra live
register.  Trusting NIR and using 1) was fine but we're soon going to
want to use the subgroup invocation value for other things in the
back-end compiler and this makes it much easier to do without having to
worry about CSE problems.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>