From 5346c1167064d6429c6338974c6342f8346fd34b Mon Sep 17 00:00:00 2001 From: Francisco Jerez Date: Thu, 13 Aug 2015 15:02:05 +0300 Subject: [PATCH] i965: Don't tell the hardware about our UAV access. The hardware documentation relating to the UAV HW-assisted coherency mechanism and UAV access enable bits is scarce and sometimes contradictory, and there's quite some guesswork behind this commit, so let me summarize the background first: HSW and later hardware have infrastructure to support a stricter form of data coherency between shader invocations from separate primitives. The mechanism is controlled by the "Accesses UAV" bits on 3DSTATE_VS, _HS, _DS, _GS and _PS (or _PS_EXTRA on BDW+), and the "UAV Coherency Required" bit on the 3DPRIMITIVE command. Regardless of whether "UAV Coherency Required" is set, the hardware fixed-function units will increment a per-stage semaphore for each request received if "Accesses UAV" is set for the same or any lower stage. An implicit DC flush is emitted by the lowermost stage with "Accesses UAV" set once it's done processing the request, this also happens regardless of the value of "UAV Coherency Required". The completion of the DC flush will cause the same stage and all previous ones to decrement the semaphore, marking the UAV accesses for the primitive as coherent with L3. The "UAV Coherency Required" 3DPRIMITIVE bit will cause a pipeline stall before any threads are dispatched for the first FF stage with "Accesses UAV" set until the semaphore is cleared for the same stage. Effectively this guarantees that UAV memory accesses performed by previous primitives from any stage will be strictly ordered (and thanks to the implicit DC flush visible in memory) with UAV accesses from the following primitives. None of this is required by the usual image, atomic counter and SSBO GL APIs which have very relaxed cross-primitive coherency and ordering requirements, so we don't actually ever set the "UAV Coherency Required" bit -- Ordering with respect to shader invocations from previous stages on the same primitive where there is a data dependency is of course already guaranteed as the spec requires, regardless of this mechanism being enabled. We do set the "Accesses UAV" bits though since my commit ac7664e493655e290783c23a0412b9c70936da50 (which this patch partially reverts), mainly because of comments like the following from the BDW PRM: > 3DSTATE_GS >[...] > 12 Accesses UAV > Format: Enable > This field must be set when GS has a UAV access. There are similar comments in the documentation for the other 3DSTATE_*S commands. The "must" part is misleading and unjustified AFAIK. Most of the "Accesses UAV" bits don't seem to have any side effects other than the implicit DC flushes and the related book-keeping in anticipation for a subsequent primitive with "UAV Coherency Required" set, so in most cases they are unnecessary and may incur a performance penalty. There is an exception though. On Gen8+ the PS_EXTRA UAV access bit influences the calculation of the PS UAV-only and ThreadDispatchEnable signals which on previous generations were set explicitly by the driver, so we cannot always avoid enabling it on the PS stage. The primary motivation for this change is that in fact the hardware coherency mechanism is buggy and will cause a rather non-deterministic hang on Gen8 when VS is the only stage with "Accesses UAV" set and the processing of a request terminates immediately after the implicit DC flush is sent for a previous primitive with no additional vertices being emitted for the second primitive, what will cause the hardware to skip sending a second DC flush and cause the VS to stall indefinitely waiting for a response from the DC (BDWGFX HSD 1912017). This hardware bug can be reproduced on current master with the spec@arb_shader_image_load_store@host-mem-barrier@Indirect/RaW piglit subtest (if you have the patience to run it a few dozen times). The proposed workaround is to insert CS STALLs speculatively between 3DPRIMITIVE commands when "Accesses UAV" is enabled for the VS stage only. Because this would affect one of the hottest paths in the driver and likely decrease performance even further due to the unnecessary serialization, and because we don't actually need the implicit DC flushes, it seems better to just disable them. Cc: 11.0 --- src/mesa/drivers/dri/i965/gen7_gs_state.c | 4 +-- src/mesa/drivers/dri/i965/gen7_vs_state.c | 4 +-- src/mesa/drivers/dri/i965/gen7_wm_state.c | 12 ++++++--- src/mesa/drivers/dri/i965/gen8_gs_state.c | 4 +-- src/mesa/drivers/dri/i965/gen8_ps_state.c | 32 ++++++++++++++++++++--- src/mesa/drivers/dri/i965/gen8_vs_state.c | 4 +-- 6 files changed, 41 insertions(+), 19 deletions(-) diff --git a/src/mesa/drivers/dri/i965/gen7_gs_state.c b/src/mesa/drivers/dri/i965/gen7_gs_state.c index 497ecec8e45..8d6d3fe1d34 100644 --- a/src/mesa/drivers/dri/i965/gen7_gs_state.c +++ b/src/mesa/drivers/dri/i965/gen7_gs_state.c @@ -59,9 +59,7 @@ upload_gs_state(struct brw_context *brw) OUT_BATCH(((ALIGN(stage_state->sampler_count, 4)/4) << GEN6_GS_SAMPLER_COUNT_SHIFT) | ((brw->gs.prog_data->base.base.binding_table.size_bytes / 4) << - GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | - (brw->is_haswell && prog_data->base.nr_image_params ? - HSW_GS_UAV_ACCESS_ENABLE : 0)); + GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); if (brw->gs.prog_data->base.base.total_scratch) { OUT_RELOC(stage_state->scratch_bo, diff --git a/src/mesa/drivers/dri/i965/gen7_vs_state.c b/src/mesa/drivers/dri/i965/gen7_vs_state.c index b7e48585482..a18dc697651 100644 --- a/src/mesa/drivers/dri/i965/gen7_vs_state.c +++ b/src/mesa/drivers/dri/i965/gen7_vs_state.c @@ -126,9 +126,7 @@ upload_vs_state(struct brw_context *brw) ((ALIGN(stage_state->sampler_count, 4)/4) << GEN6_VS_SAMPLER_COUNT_SHIFT) | ((brw->vs.prog_data->base.base.binding_table.size_bytes / 4) << - GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | - (brw->is_haswell && prog_data->base.nr_image_params ? - HSW_VS_UAV_ACCESS_ENABLE : 0)); + GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); if (prog_data->base.total_scratch) { OUT_RELOC(stage_state->scratch_bo, diff --git a/src/mesa/drivers/dri/i965/gen7_wm_state.c b/src/mesa/drivers/dri/i965/gen7_wm_state.c index fd6dab5be8b..06d5e65786b 100644 --- a/src/mesa/drivers/dri/i965/gen7_wm_state.c +++ b/src/mesa/drivers/dri/i965/gen7_wm_state.c @@ -113,7 +113,14 @@ upload_wm_state(struct brw_context *brw) else if (prog_data->base.nr_image_params) dw1 |= GEN7_WM_EARLY_DS_CONTROL_PSEXEC; - /* _NEW_BUFFERS | _NEW_COLOR */ + /* The "UAV access enable" bits are unnecessary on HSW because they only + * seem to have an effect on the HW-assisted coherency mechanism which we + * don't need, and the rasterization-related UAV_ONLY flag and the + * DISPATCH_ENABLE bit can be set independently from it. + * C.f. gen8_upload_ps_extra(). + * + * BRW_NEW_FRAGMENT_PROGRAM | BRW_NEW_FS_PROG_DATA | _NEW_BUFFERS | _NEW_COLOR + */ if (brw->is_haswell && !(brw_color_buffer_write_enabled(brw) || writes_depth) && prog_data->base.nr_image_params) @@ -221,9 +228,6 @@ gen7_upload_ps_state(struct brw_context *brw, _mesa_get_min_invocations_per_fragment(ctx, fp, false); assert(min_inv_per_frag >= 1); - if (brw->is_haswell && prog_data->base.nr_image_params) - dw4 |= HSW_PS_UAV_ACCESS_ENABLE; - if (prog_data->prog_offset_16 || prog_data->no_8) { dw4 |= GEN7_PS_16_DISPATCH_ENABLE; if (!prog_data->no_8 && min_inv_per_frag == 1) { diff --git a/src/mesa/drivers/dri/i965/gen8_gs_state.c b/src/mesa/drivers/dri/i965/gen8_gs_state.c index 4195f4cf4a7..d766ca7bebf 100644 --- a/src/mesa/drivers/dri/i965/gen8_gs_state.c +++ b/src/mesa/drivers/dri/i965/gen8_gs_state.c @@ -52,9 +52,7 @@ gen8_upload_gs_state(struct brw_context *brw) ((ALIGN(stage_state->sampler_count, 4)/4) << GEN6_GS_SAMPLER_COUNT_SHIFT) | ((prog_data->base.binding_table.size_bytes / 4) << - GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | - (prog_data->base.nr_image_params ? - HSW_GS_UAV_ACCESS_ENABLE : 0)); + GEN6_GS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); if (brw->gs.prog_data->base.base.total_scratch) { OUT_RELOC64(stage_state->scratch_bo, diff --git a/src/mesa/drivers/dri/i965/gen8_ps_state.c b/src/mesa/drivers/dri/i965/gen8_ps_state.c index a686fed704f..8f0507413a7 100644 --- a/src/mesa/drivers/dri/i965/gen8_ps_state.c +++ b/src/mesa/drivers/dri/i965/gen8_ps_state.c @@ -25,6 +25,7 @@ #include "program/program.h" #include "brw_state.h" #include "brw_defines.h" +#include "brw_wm.h" #include "intel_batchbuffer.h" void @@ -65,8 +66,33 @@ gen8_upload_ps_extra(struct brw_context *brw, if (brw->gen >= 9 && prog_data->pulls_bary) dw1 |= GEN9_PSX_SHADER_PULLS_BARY; - if (_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) || - prog_data->base.nr_image_params) + /* The stricter cross-primitive coherency guarantees that the hardware + * gives us with the "Accesses UAV" bit set for at least one shader stage + * and the "UAV coherency required" bit set on the 3DPRIMITIVE command are + * redundant within the current image, atomic counter and SSBO GL APIs, + * which all have very loose ordering and coherency requirements and + * generally rely on the application to insert explicit barriers when a + * shader invocation is expected to see the memory writes performed by the + * invocations of some previous primitive. Regardless of the value of "UAV + * coherency required", the "Accesses UAV" bits will implicitly cause an in + * most cases useless DC flush when the lowermost stage with the bit set + * finishes execution. + * + * It would be nice to disable it, but in some cases we can't because on + * Gen8+ it also has an influence on rasterization via the PS UAV-only + * signal (which could be set independently from the coherency mechanism in + * the 3DSTATE_WM command on Gen7), and because in some cases it will + * determine whether the hardware skips execution of the fragment shader or + * not via the ThreadDispatchEnable signal. However if we know that + * GEN8_PS_BLEND_HAS_WRITEABLE_RT is going to be set and + * GEN8_PSX_PIXEL_SHADER_NO_RT_WRITE is not set it shouldn't make any + * difference so we may just disable it here. + * + * BRW_NEW_FS_PROG_DATA | BRW_NEW_FRAGMENT_PROGRAM | _NEW_BUFFERS | _NEW_COLOR + */ + if ((_mesa_active_fragment_shader_has_atomic_ops(&brw->ctx) || + prog_data->base.nr_image_params) && + !brw_color_buffer_write_enabled(brw)) dw1 |= GEN8_PSX_SHADER_HAS_UAV; BEGIN_BATCH(2); @@ -91,7 +117,7 @@ upload_ps_extra(struct brw_context *brw) const struct brw_tracked_state gen8_ps_extra = { .dirty = { - .mesa = 0, + .mesa = _NEW_BUFFERS | _NEW_COLOR, .brw = BRW_NEW_CONTEXT | BRW_NEW_FRAGMENT_PROGRAM | BRW_NEW_FS_PROG_DATA | diff --git a/src/mesa/drivers/dri/i965/gen8_vs_state.c b/src/mesa/drivers/dri/i965/gen8_vs_state.c index 8b5048bee7e..28f5adddf14 100644 --- a/src/mesa/drivers/dri/i965/gen8_vs_state.c +++ b/src/mesa/drivers/dri/i965/gen8_vs_state.c @@ -53,9 +53,7 @@ upload_vs_state(struct brw_context *brw) ((ALIGN(stage_state->sampler_count, 4) / 4) << GEN6_VS_SAMPLER_COUNT_SHIFT) | ((prog_data->base.binding_table.size_bytes / 4) << - GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT) | - (prog_data->base.nr_image_params ? - HSW_VS_UAV_ACCESS_ENABLE : 0)); + GEN6_VS_BINDING_TABLE_ENTRY_COUNT_SHIFT)); if (prog_data->base.total_scratch) { OUT_RELOC64(stage_state->scratch_bo, -- 2.30.2