From ea74a3f548eb321429c371e327e778e63d9128a0 Mon Sep 17 00:00:00 2001
From: Richard Sandiford
Date: Wed, 20 Jan 2021 13:16:30 +0000
Subject: [PATCH] vect: Fix VLA SLP invariant optimisation [PR98535]
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

duplicate_and_interleave is the main fallback way of loading a
repeating sequence of elements into variable-length vectors.
The code handles cases in which the number of elements in the
sequence is potentially several times greater than the number
of elements in a vector.

Let:

- NE be the (compile-time) number of elements in the sequence
- NR be the (compile-time) number of vector results and
- VE be the (run-time) number of elements in each vector

The basic approach is to duplicate each element into a separate
vector, giving NE vectors in total, then use log2(NE) rows of NE
permutes to generate NE results.

In the worst case -- when VE has no known compile-time factor and
NR >= NE -- all of these permutes are necessary.  However, if VE is
known to be a multiple of 2**F, then each of the first F permute rows
produces duplicate results; specifically, the high permute for a given
pair is the same as the low permute.  The code dealt with this by
reusing the low result for the high result.

This part was OK.  However, having duplicate results from one row
meant that the next row did duplicate work.  The redundancies would be
optimised away by later passes, but the code tried to avoid generating
them in the first place.  This is the part that went wrong.

Specifically, NR is typically less than NE when some permutes are
redundant, so the code tried to use NR to reduce the amount of work
performed.  The problem was that, although it correctly calculated a
conservative bound on how many results were needed in each row, it
chose the wrong results for anything other than the final row.

This doesn't usually matter for fully-packed SVE vectors.  We first
try to coalesce smaller elements into larger ones, so normally VE ends
up being 2**VQ (where VQ is the number of 128-bit blocks in an SVE
vector).  In that situation we'd only apply the faulty optimisation to
the final row, i.e. the case it handled correctly.  E.g. for things
like:

void
f (long *x)
{
  for (int i = 0; i < 100; i += 8)
    {
      x[i] += 1;
      x[i + 1] += 2;
      x[i + 2] += 3;
      x[i + 3] += 4;
      x[i + 4] += 5;
      x[i + 5] += 6;
      x[i + 6] += 7;
      x[i + 7] += 8;
    }
}

(already tested by the testsuite), we'd have 3 rows of permutes
producing 4 vector results.  The scheme produced:

  1st row: 8 results from 4 permutes, highs duplicates of lows
  2nd row: 8 results from 8 permutes (half of which are actually
           redundant)
  3rd row: 4 results from 4 permutes

However, coalescing elements is trickier for unpacked vectors, and at
the moment we don't try to do it (see the GET_MODE_SIZE check in
can_duplicate_and_interleave_p).  Unpacked vectors therefore stress
the code in ways that packed vectors didn't.

The patch fixes this by removing the redundancies from each row,
rather than trying to work around them later.  This also removes the
redundant work in the second row of the example above.

gcc/
	PR tree-optimization/98535
	* tree-vect-slp.c (duplicate_and_interleave): Use
	quick_grow_cleared.  If the high and low permutes are the same,
	remove the high permutes from the working set and only continue
	with the low ones.
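
For illustration only (not part of the patch): the scheme above can be
modelled with scalar code.  The sketch below uses fixed NE = 4 and
VE = 8 and made-up zip_lo/zip_hi helpers standing in for the real
lo_permute/hi_permute VEC_PERM_EXPRs; the real code handles run-time
VE via poly_int, so treat this purely as a model of the data flow.
Because VE is a multiple of 2 * in_repeat in every row of this
example, each high result duplicates the corresponding low result:

#include <stdio.h>

#define NE 4	/* elements in the repeating sequence */
#define VE 8	/* elements per vector (fixed here; run-time for SVE) */

/* Interleave the low halves of A and B (like AArch64 ZIP1).  */
static void
zip_lo (const int *a, const int *b, int *out)
{
  for (int i = 0; i < VE / 2; ++i)
    {
      out[i * 2] = a[i];
      out[i * 2 + 1] = b[i];
    }
}

/* Interleave the high halves of A and B (like AArch64 ZIP2).  */
static void
zip_hi (const int *a, const int *b, int *out)
{
  for (int i = 0; i < VE / 2; ++i)
    {
      out[i * 2] = a[VE / 2 + i];
      out[i * 2 + 1] = b[VE / 2 + i];
    }
}

int
main (void)
{
  int in[NE][VE], out[NE][VE];

  /* Duplicate each sequence element into its own vector.  */
  for (int i = 0; i < NE; ++i)
    for (int j = 0; j < VE; ++j)
      in[i][j] = i + 1;

  /* log2(NE) rows of permutes; each input of a row repeats every
     IN_REPEAT elements.  */
  for (int in_repeat = 1; in_repeat < NE; in_repeat *= 2)
    {
      int hi_start = NE / 2;
      for (int i = 0; i < NE / 2; ++i)
	{
	  zip_lo (in[i], in[i + hi_start], out[i * 2]);
	  zip_hi (in[i], in[i + hi_start], out[i * 2 + 1]);
	  /* Here VE % (2 * in_repeat) == 0, so out[i * 2 + 1] always
	     equals out[i * 2]: the redundancy discussed above.  */
	}
      for (int i = 0; i < NE; ++i)
	for (int j = 0; j < VE; ++j)
	  in[i][j] = out[i][j];
    }

  /* Each result now holds the repeating sequence 1 2 3 4 1 2 3 4.  */
  for (int j = 0; j < VE; ++j)
    printf ("%d ", in[0][j]);
  printf ("\n");
  return 0;
}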
---
 .../gcc.target/aarch64/sve/pr98535.c          | 18 +++++++
 gcc/tree-vect-slp.c                           | 49 +++++++++++--------
 2 files changed, 46 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr98535.c

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
new file mode 100644
index 00000000000..6873a38734d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
@@ -0,0 +1,18 @@
+/* { dg-options "-O3 -mtune=neoverse-v1" } */
+
+typedef short a;
+
+typedef struct {
+  a b, c, d, e;
+} f;
+
+f *g;
+
+long h;
+
+void
+i() {
+  f j;
+  for (; h; h++)
+    *g++ = j;
+}
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1787ad74268..4465cf7494e 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5063,7 +5063,7 @@ duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
 
   tree_vector_builder partial_elts;
   auto_vec<tree, 32> pieces (nvectors * 2);
-  pieces.quick_grow (nvectors * 2);
+  pieces.quick_grow_cleared (nvectors * 2);
   for (unsigned int i = 0; i < nvectors; ++i)
     {
       /* (2) Replace ELTS[0:NELTS] with ELTS'[0:NELTS'], where each element of
@@ -5082,53 +5082,60 @@ duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
 
   /* (4) Use a tree of VEC_PERM_EXPRs to create a single VM with the
 	 correct byte contents.
 
-     We need to repeat the following operation log2(nvectors) times:
+     Conceptually, we need to repeat the following operation log2(nvectors)
+     times, where hi_start = nvectors / 2:
 
       out[i * 2] = VEC_PERM_EXPR (in[i], in[i + hi_start], lo_permute);
      out[i * 2 + 1] = VEC_PERM_EXPR (in[i], in[i + hi_start], hi_permute);
 
      However, if each input repeats every N elements and the VF is
-     a multiple of N * 2, the HI result is the same as the LO.  */
+     a multiple of N * 2, the HI result is the same as the LO result.
+     This will be true for the first N1 iterations of the outer loop,
+     followed by N2 iterations for which both the LO and HI results
+     are needed.  I.e.:
+
+       N1 + N2 = log2(nvectors)
+
+     Each "N1 iteration" doubles the number of redundant vectors and the
+     effect of the process as a whole is to have a sequence of nvectors/2**N1
+     vectors that repeats 2**N1 times.  Rather than generate these redundant
+     vectors, we halve the number of vectors for each N1 iteration.  */
   unsigned int in_start = 0;
   unsigned int out_start = nvectors;
-  unsigned int hi_start = nvectors / 2;
-  /* A bound on the number of outputs needed to produce NRESULTS results
-     in the final iteration.  */
-  unsigned int noutputs_bound = nvectors * nresults;
+  unsigned int new_nvectors = nvectors;
   for (unsigned int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
     {
-      noutputs_bound /= 2;
-      unsigned int limit = MIN (noutputs_bound, nvectors);
-      for (unsigned int i = 0; i < limit; ++i)
+      unsigned int hi_start = new_nvectors / 2;
+      unsigned int out_i = 0;
+      for (unsigned int in_i = 0; in_i < new_nvectors; ++in_i)
 	{
-	  if ((i & 1) != 0
+	  if ((in_i & 1) != 0
 	      && multiple_p (TYPE_VECTOR_SUBPARTS (new_vector_type),
 			     2 * in_repeat))
-	    {
-	      pieces[out_start + i] = pieces[out_start + i - 1];
-	      continue;
-	    }
+	    continue;
 
 	  tree output = make_ssa_name (new_vector_type);
-	  tree input1 = pieces[in_start + (i / 2)];
-	  tree input2 = pieces[in_start + (i / 2) + hi_start];
+	  tree input1 = pieces[in_start + (in_i / 2)];
+	  tree input2 = pieces[in_start + (in_i / 2) + hi_start];
 	  gassign *stmt = gimple_build_assign (output, VEC_PERM_EXPR,
 					       input1, input2,
-					       permutes[i & 1]);
+					       permutes[in_i & 1]);
 	  gimple_seq_add_stmt (seq, stmt);
-	  pieces[out_start + i] = output;
+	  pieces[out_start + out_i] = output;
+	  out_i += 1;
 	}
       std::swap (in_start, out_start);
+      new_nvectors = out_i;
     }
 
   /* (5) Use VIEW_CONVERT_EXPR to cast the final VM to the required type.  */
   results.reserve (nresults);
   for (unsigned int i = 0; i < nresults; ++i)
-    if (i < nvectors)
+    if (i < new_nvectors)
       results.quick_push (gimple_build (seq, VIEW_CONVERT_EXPR, vector_type,
 					pieces[in_start + i]));
     else
-      results.quick_push (results[i - nvectors]);
+      results.quick_push (results[i - new_nvectors]);
 }
-- 
2.30.2
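
Postscript, not part of the patch: the new loop's bookkeeping can be
sanity-checked with plain integers.  The names new_nvectors, out_i and
in_repeat below mirror the patch; nvectors = 8 and a compile-time
ve = 4 are made-up example values chosen so that the first two rows
are degenerate and the last is not, i.e. N1 = 2 and N2 = 1 in the
terms of the new comment:

#include <stdio.h>

int
main (void)
{
  int nvectors = 8, ve = 4;	/* made-up example values */
  int new_nvectors = nvectors;

  for (int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
    {
      int out_i = 0;
      for (int in_i = 0; in_i < new_nvectors; ++in_i)
	{
	  /* Stands in for the patch's multiple_p (TYPE_VECTOR_SUBPARTS
	     (new_vector_type), 2 * in_repeat) test, using a
	     compile-time element count for the sake of the model.  */
	  if ((in_i & 1) != 0 && ve % (2 * in_repeat) == 0)
	    continue;		/* HI would duplicate LO: emit nothing.  */
	  out_i += 1;		/* one VEC_PERM_EXPR emitted */
	}
      printf ("in_repeat=%d: %d permutes, working set %d -> %d\n",
	      in_repeat, out_i, new_nvectors, out_i);
      new_nvectors = out_i;
    }
  return 0;
}

This prints 8 -> 4, then 4 -> 2, then 2 -> 2: the two degenerate rows
halve the working set at source instead of emitting duplicate highs,
after which the final row keeps both the LO and HI results, matching
the N1/N2 description in the new comment.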