From 54d7907c276f2e5ef428ead58721fd82e4b26f40 Mon Sep 17 00:00:00 2001
From: Alyssa Rosenzweig
Date: Mon, 11 May 2020 10:02:49 -0400
Subject: [PATCH] nir: Propagate *2*16 conversions into vectors
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

If we have code like:

   ('f2f16', ('vec2', ('f2f32', 'a@16'), '#b@32'))

we would like to eliminate the conversions, but the existing rules can't
see into the (heterogeneous) vector. So instead of trying to eliminate
in one pass, we add opts to propagate the f2f16 into the vector. Even if
nothing further happens, this is often a win, since the created vector
is then smaller (half2 instead of float2). Hence the above gets
transformed to

   ('vec2', ('f2f16', ('f2f32', 'a@16')), ('f2f16', '#b@32'))

Then the existing f2f16(f2f32) rule kicks in for the first component,
constant folding handles the second, and we are left with

   ('vec2', 'a@16', '#b@16')

...eliminating all conversions.

v2: Predicate on !options->vectorize_vec2_16bit. As discussed, this
optimization helps greatly on true vector architectures (like Midgard)
but wreaks havoc on more modern SIMD-within-a-register architectures
(like Bifrost and modern AMD), so we predicate on that flag.

v3: Extend to integers as well and add a comment explaining the
transforms.

Results on Midgard (unfortunately a true SIMD architecture):

total instructions in shared programs: 51359 -> 50963 (-0.77%)
instructions in affected programs: 4523 -> 4127 (-8.76%)
helped: 53
HURT: 0
helped stats (abs) min: 1 max: 86 x̄: 7.47 x̃: 6
helped stats (rel) min: 1.71% max: 28.00% x̄: 9.66% x̃: 7.34%
95% mean confidence interval for instructions value: -10.58 -4.36
95% mean confidence interval for instructions %-change: -11.45% -7.88%
Instructions are helped.
total bundles in shared programs: 25825 -> 25670 (-0.60%)
bundles in affected programs: 2057 -> 1902 (-7.54%)
helped: 53
HURT: 0
helped stats (abs) min: 1 max: 26 x̄: 2.92 x̃: 2
helped stats (rel) min: 2.86% max: 30.00% x̄: 8.64% x̃: 8.33%
95% mean confidence interval for bundles value: -3.93 -1.92
95% mean confidence interval for bundles %-change: -10.69% -6.59%
Bundles are helped.

total quadwords in shared programs: 41359 -> 41055 (-0.74%)
quadwords in affected programs: 3801 -> 3497 (-8.00%)
helped: 57
HURT: 0
helped stats (abs) min: 1 max: 57 x̄: 5.33 x̃: 4
helped stats (rel) min: 1.92% max: 21.05% x̄: 8.22% x̃: 6.67%
95% mean confidence interval for quadwords value: -7.35 -3.32
95% mean confidence interval for quadwords %-change: -9.54% -6.90%
Quadwords are helped.

total registers in shared programs: 3849 -> 3807 (-1.09%)
registers in affected programs: 167 -> 125 (-25.15%)
helped: 32
HURT: 1
helped stats (abs) min: 1 max: 3 x̄: 1.34 x̃: 1
helped stats (rel) min: 20.00% max: 50.00% x̄: 26.35% x̃: 20.00%
HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel)   min: 16.67% max: 16.67% x̄: 16.67% x̃: 16.67%
95% mean confidence interval for registers value: -1.54 -1.00
95% mean confidence interval for registers %-change: -29.41% -20.69%
Registers are helped.

total threads in shared programs: 2471 -> 2520 (1.98%)
threads in affected programs: 49 -> 98 (100.00%)
helped: 25
HURT: 0
helped stats (abs) min: 1 max: 2 x̄: 1.96 x̃: 2
helped stats (rel) min: 100.00% max: 100.00% x̄: 100.00% x̃: 100.00%
95% mean confidence interval for threads value: 1.88 2.04
95% mean confidence interval for threads %-change: 100.00% 100.00%
Threads are helped.
total spills in shared programs: 168 -> 168 (0.00%)
spills in affected programs: 0 -> 0
helped: 0
HURT: 0

total fills in shared programs: 186 -> 186 (0.00%)
fills in affected programs: 0 -> 0
helped: 0
HURT: 0

Signed-off-by: Alyssa Rosenzweig
Reviewed-by: Marek Olšák
Part-of:
---
 src/compiler/nir/nir_opt_algebraic.py | 36 +++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/src/compiler/nir/nir_opt_algebraic.py b/src/compiler/nir/nir_opt_algebraic.py
index f4b621cbd65..575024c04bf 100644
--- a/src/compiler/nir/nir_opt_algebraic.py
+++ b/src/compiler/nir/nir_opt_algebraic.py
@@ -1777,6 +1777,42 @@ for op in ['frcp', 'frsq', 'fsqrt', 'fexp2', 'flog2', 'fsign', 'fsin', 'fcos']:
     (('bcsel', a, (op, b), (op + '(is_used_once)', c)), (op, ('bcsel', a, b, c))),
    ]
 
+# This section contains optimizations to propagate downsizing conversions of
+# constructed vectors into vectors of downsized components. Whether this is
+# useful depends on the SIMD semantics of the backend. On a true SIMD machine,
+# this reduces the register pressure of the vector itself and often enables the
+# conversions to be eliminated via other algebraic rules or constant folding.
+# In the worst case on a SIMD architecture, the propagated conversions may be
+# revectorized via nir_opt_vectorize so instruction count is minimally
+# impacted.
+#
+# On a machine with SIMD-within-a-register only, this actually
+# counterintuitively hurts instruction count. These machines are the same that
+# require vectorize_vec2_16bit, so we predicate the optimizations on that flag
+# not being set.
+#
+# Finally for scalar architectures, there should be no difference in generated
+# code since it all ends up scalarized at the end, but it might minimally help
+# compile-times.
+
+for i in range(2, 4 + 1):
+   for T in ('f', 'u', 'i'):
+      vec_inst = ('vec' + str(i),)
+
+      indices = ['a', 'b', 'c', 'd']
+      suffix_in = tuple((indices[j] + '@32') for j in range(i))
+
+      to_16 = '{}2{}16'.format(T, T)
+      to_mp = '{}2{}mp'.format(T, T)
+
+      out_16 = tuple((to_16, indices[j]) for j in range(i))
+      out_mp = tuple((to_mp, indices[j]) for j in range(i))
+
+      optimizations += [
+         ((to_16, vec_inst + suffix_in), vec_inst + out_16, '!options->vectorize_vec2_16bit'),
+         ((to_mp, vec_inst + suffix_in), vec_inst + out_mp, '!options->vectorize_vec2_16bit')
+      ]
+
 # This section contains "late" optimizations that should be run before
 # creating ffmas and calling regular optimizations for the final time.
 # Optimizations should go here if they help code generation and conflict
-- 
2.30.2
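For reviewers who want to see the concrete rules this loop emits without running the full nir_opt_algebraic machinery, the generator can be sketched standalone. This is an illustration only (the `optimizations` list here is a plain Python list, not the one consumed by the rule compiler); it mirrors the patch's loop and prints the first generated rule:

```python
# Standalone sketch of the rule generation added by this patch. The real
# code appends to nir_opt_algebraic.py's `optimizations` list; here we use
# an ordinary list just to inspect the tuples produced.
optimizations = []

for i in range(2, 4 + 1):          # vec2, vec3, vec4
    for T in ('f', 'u', 'i'):      # float, unsigned, signed variants
        vec_inst = ('vec' + str(i),)

        # Source components are 32-bit variables a@32, b@32, ...
        indices = ['a', 'b', 'c', 'd']
        suffix_in = tuple((indices[j] + '@32') for j in range(i))

        to_16 = '{}2{}16'.format(T, T)   # e.g. 'f2f16'
        to_mp = '{}2{}mp'.format(T, T)   # e.g. 'f2fmp'

        # Replacement: the conversion applied per component.
        out_16 = tuple((to_16, indices[j]) for j in range(i))
        out_mp = tuple((to_mp, indices[j]) for j in range(i))

        # Search pattern, replacement, and the predicate string gating
        # the rule on !options->vectorize_vec2_16bit.
        optimizations += [
            ((to_16, vec_inst + suffix_in), vec_inst + out_16,
             '!options->vectorize_vec2_16bit'),
            ((to_mp, vec_inst + suffix_in), vec_inst + out_mp,
             '!options->vectorize_vec2_16bit')
        ]

# First rule (i = 2, T = 'f'): push f2f16 through a vec2.
print(optimizations[0])
# -> (('f2f16', ('vec2', 'a@32', 'b@32')),
#     ('vec2', ('f2f16', 'a'), ('f2f16', 'b')),
#     '!options->vectorize_vec2_16bit')
```

In total the loop contributes 3 vector sizes × 3 type variants × 2 conversion kinds = 18 rules, each turning a downsizing conversion of a constructed 32-bit vector into a vector of per-component conversions, exactly the shape shown in the commit message.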