i386: Add pass_remove_partial_avx_dependency
author H.J. Lu <hongjiu.lu@intel.com>
Fri, 22 Feb 2019 15:54:08 +0000 (15:54 +0000)
committer H.J. Lu <hjl@gcc.gnu.org>
Fri, 22 Feb 2019 15:54:08 +0000 (07:54 -0800)
commit f14322806253059eb49011c2cfdbae4c85ad47e4
tree ec05edc6d020eb2a3f7877c68288cd415f08d978
parent 965779b4ad0f0abbdc8ab0addd2fae14165a08f4

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

vxorp[ds] %xmmN, %xmmN, %xmmN
...
vcvtss2sd f(%rip), %xmmN, %xmmX
...
vcvtsi2ss i(%rip), %xmmN, %xmmY

to avoid a partial XMM register stall.  This patch adds a pass to generate
a single

vxorps %xmmN, %xmmN, %xmmN

at the entry of the nearest common dominator of the basic blocks with SF/DF
conversions, which is in the fake loop that contains the whole function,
instead of generating one

vxorp[ds] %xmmN, %xmmN, %xmmN

for each SF/DF conversion.
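
The placement described above can be sketched on a toy CFG.  This is only
an illustration under stated assumptions, not the code added to i386.c:
the arrays idom[] (immediate dominator of each block, with the entry block
as its own immediate dominator) and loop_depth[] (0 meaning the block is
only in the fake loop that contains the whole function), and all function
names, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Depth of block B in the dominator tree; the entry block has depth 0.
   idom[] is a hypothetical array, with idom[entry] == entry.  */
static int
dom_depth (const int *idom, int b)
{
  int depth = 0;
  while (idom[b] != b)
    {
      b = idom[b];
      depth++;
    }
  return depth;
}

/* Nearest common dominator of blocks A and B: lift the deeper block to
   the depth of the other one, then walk both up until they meet.  */
static int
nearest_common_dominator_2 (const int *idom, int a, int b)
{
  int da = dom_depth (idom, a);
  int db = dom_depth (idom, b);
  while (da > db)
    {
      a = idom[a];
      da--;
    }
  while (db > da)
    {
      b = idom[b];
      db--;
    }
  while (a != b)
    {
      a = idom[a];
      b = idom[b];
    }
  return a;
}

/* Nearest common dominator of the N blocks that contain conversions.  */
int
nearest_common_dominator_of_set (const int *idom, const int *blocks,
                                 size_t n)
{
  assert (n > 0);
  int result = blocks[0];
  for (size_t i = 1; i < n; i++)
    result = nearest_common_dominator_2 (idom, result, blocks[i]);
  return result;
}

/* Simplified assumption: walk up the dominator tree until the block is
   outside every real loop (loop_depth 0).  The entry block must have
   loop_depth 0 so that the walk terminates.  */
int
hoist_out_of_loops (const int *idom, const int *loop_depth, int b)
{
  while (loop_depth[b] > 0)
    b = idom[b];
  return b;
}
```

With idom = {0, 0, 1, 1, 0}, blocks 2 and 3 share block 1 as their nearest
common dominator.  If block 1 is itself inside a loop, the second helper
moves the placement up to block 0, so the single vxorps is emitted outside
every loop.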

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

    ...
    loop:
      if(j < k)
        vxorps    %xmm0, %xmm0, %xmm0
        vcvtss2sd f(%rip), %xmm0, %xmm0
      ...
    loopend
    ...

This is because LCM only moves an instruction when the move is certain to
be beneficial.  For an instruction under a conditional branch, LCM won't
move

   vxorps  %xmm0, %xmm0, %xmm0

out of the loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE            |Improvement|
|500.perlbench_r | 0.55%     |
|538.imagick_r   | 8.43%     |
|544.nab_r       | 0.71%     |

2. LCM

|RATE            |Improvement|
|500.perlbench_r |-0.76%     |
|538.imagick_r   | 7.96%     |
|544.nab_r       |-0.13%     |
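
To make the cost of an in-loop vxorps concrete, here is a minimal sketch
that counts how many zeroing instructions would execute for the badcase.c
loop under each placement.  It is a model for illustration only: the
function names are hypothetical, n is assumed to be non-negative, and
nothing here measures real hardware.

```c
#include <assert.h>

/* Hypothetical model: the zeroing stays next to the conversion inside
   the loop, as in the LCM output above.  One vxorps executes each time
   the "j < k" branch is taken.  */
long
zeroings_in_loop (int n, int k)
{
  assert (n >= 0);
  long count = 0;
  for (int j = 0; j != n; j++)
    if (j < k)
      count++;
  return count;
}

/* Hypothetical model: the zeroing is hoisted to a dominator outside the
   loop.  It executes once per call, even if the conversion never runs.  */
long
zeroings_hoisted (int n, int k)
{
  (void) n;
  (void) k;
  return 1;
}
```

For n = 1000 and k = 600 the in-loop placement executes 600 zeroing
instructions and the hoisted placement executes one.  The hoisted form
still pays for one vxorps when the loop body never runs, which is the
trade-off this pass accepts.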

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -funroll-loops

before

commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
Author: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jan 31 20:06:42 2019 +0000

            PR target/89071
            * config/i386/i386.md (*extendsfdf2): Split out reg->reg
            alternative to avoid partial SSE register stall for TARGET_AVX.
            (truncdfsf2): Ditto.
            (sse4_1_round<mode>2): Ditto.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE        |Improvement|
|500.perlbench_r | 0.55%     |
|502.gcc_r       | 0.14%     |
|505.mcf_r       | 0.08%     |
|523.xalancbmk_r | 0.18%     |
|525.x264_r      |-0.49%     |
|531.deepsjeng_r |-0.04%     |
|541.leela_r     |-0.26%     |
|548.exchange2_r |-0.3%      |
|557.xz_r        |BuildSame  |

|FP RATE         |Improvement|
|503.bwaves_r    |-0.29%     |
|507.cactuBSSN_r | 0.04%     |
|508.namd_r      |-0.74%     |
|510.parest_r    |-0.01%     |
|511.povray_r    | 2.23%     |
|519.lbm_r       | 0.1%      |
|521.wrf_r       | 0.49%     |
|526.blender_r   | 0.13%     |
|527.cam4_r      | 0.65%     |
|538.imagick_r   | 8.43%     |
|544.nab_r       | 0.71%     |
|549.fotonik3d_r | 0.15%     |
|554.roms_r      | 0.08%     |

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

-fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

before:

   text    data     bss     dec     hex filename
2436377    8352    4528 2449257  255f69 imagick_r

after:

   text    data     bss     dec     hex filename
2425249    8352    4528 2438129  2533f1 imagick_r

2. Number of vxorps:

before          after           difference
4948            4135            -19.66%

3. Performance improvement:

|RATE |Improvement|
|538.imagick_r | 5.5%  |
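
The size and vxorps figures above can be cross-checked with a few lines of
arithmetic.  The helpers below are hypothetical and only restate the
numbers already listed: the text section shrinks by 11128 bytes (about
0.46%), and the vxorps count drops by 813.  The reported -19.66% matches
that drop taken relative to the "after" count (813 / 4135); relative to the
"before" count the same drop is about 16.43%.

```c
#include <assert.h>

/* Signed change from BEFORE to AFTER.  */
long
absolute_change (long before, long after)
{
  return after - before;
}

/* DELTA expressed as a percentage of BASE.  BASE must be non-zero.  */
double
percent_of (long delta, long base)
{
  assert (base != 0);
  return 100.0 * (double) delta / (double) base;
}
```

For example, percent_of (absolute_change (4948, 4135), 4135) gives roughly
-19.66, and the same delta over 4948 gives roughly -16.43.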

gcc/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
    Hongtao Liu  <hongtao.liu@intel.com>
    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (avx_partial_xmm_update): New attribute.
	(*extendsfdf2): Add avx_partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float<SWI48:mode><MODEF:mode>2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-22  H.J. Lu  <hongjiu.lu@intel.com>
    Hongtao Liu  <hongtao.liu@intel.com>
    Sunil K Pandey  <sunil.k.pandey@intel.com>

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.

Co-Authored-By: Hongtao Liu <hongtao.liu@intel.com>
Co-Authored-By: Sunil K Pandey <sunil.k.pandey@intel.com>
From-SVN: r269119
gcc/ChangeLog
gcc/config/i386/i386-passes.def
gcc/config/i386/i386-protos.h
gcc/config/i386/i386.c
gcc/config/i386/i386.md
gcc/testsuite/ChangeLog
gcc/testsuite/gcc.target/i386/pr87007-1.c [new file with mode: 0644]
gcc/testsuite/gcc.target/i386/pr87007-2.c [new file with mode: 0644]