# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA's current support for atomic operations has several issues.
These are exacerbated by 3D data-structure processing in 3D shader
binaries, which needs on the order of 10^5 or more atomic locks per
second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

Power ISA's memory fences are also unnecessarily strong, particularly
`isync`, which is used for many `acquire` and stronger fences. `isync`
forces the CPU to do a full pipeline flush, which is unnecessary when
all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```
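
The retry structure of this sequence can be modelled in a few lines of Python. This is only an illustrative sketch: `ReservationMemory`, `load_reserve`, `store_conditional` and `interfere` are hypothetical names invented for the example, the model holds a single doubleword, and it ignores the `sync` and `lwsync` fences entirely. It shows why the loop cannot be dropped: a store from another core between `ldarx` and `stdcx.` cancels the reservation and forces another pass.

```python
class ReservationMemory:
    """Toy one-word memory with a load-reserve / store-conditional pair."""

    def __init__(self, value=0):
        self.value = value
        self.reserved = False

    def load_reserve(self):
        # Models ldarx: read the word and establish a reservation.
        self.reserved = True
        return self.value

    def interfere(self, value):
        # Models a store from another core, which cancels our reservation.
        self.value = value
        self.reserved = False

    def store_conditional(self, value):
        # Models stdcx.: the store only happens if the reservation survived.
        if not self.reserved:
            return False
        self.value = value
        self.reserved = False
        return True


def fetch_add(mem, addend, interference=()):
    """Retry loop mirroring ldarx / add / stdcx. / bne.

    `interference` lists values that another core stores, one per attempt,
    between our load and our store. Returns (old value, attempts used).
    """
    pending = list(interference)
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve()
        if pending:
            mem.interfere(pending.pop(0))
        if mem.store_conditional((old + addend) % 2**64):
            return old, attempts
```

With no interference the loop body runs once; each interfering store costs one extra pass, and the value returned is the one read on the successful pass.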

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Single-instruction atomic operations are useful for implementations
that want to send atomic operations to a shared cache, since executing
them there can be more efficient than moving a whole cache block.
Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

The Atomic Memory Operations defined in Power ISA v3.1 Book II
section 4.5 are still missing better fences, combined operation/fence
instructions, and operations on 8-bit and 16-bit values. They also have
unnecessary restrictions:

* only 32-bit and 64-bit atomic operations are provided.

See [[discussion]] for proposed operations and thoughts
(TODO: remove this sentence).


# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical to `lwat/ldat`
and `stwat/stdat` except that they add guaranteed acquire and release
ordering semantics as well as 8-bit and 16-bit memory widths.

AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew
* stat RS,RA,FC,ew
* stataq RS,RA,FC,ew
* statrl RS,RA,FC,ew
* stataqrl RS,RA,FC,ew

**DRAFT** EXT031 and XO: these are close to the encodings of the
existing atomic memory operations.

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name        | Form      |
|-- | -- | --- | --- |--|--|---- |------|--|------------|-----------|
|31 | RT | RA  | FC  |aq|rl|ew   |000101|/ |lat[aq][rl] | TODO-Form |
|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat[aq][rl]| TODO-Form |
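
As a way of checking the field layout, the draft encoding can be packed into a 32-bit word with a short Python helper. This is a sketch under stated assumptions, not an assembler: `encode_amo`, `LAT_XO` and `STAT_XO` are hypothetical names, the reserved bit 31 is emitted as 0, and bit 21 is taken to be `aq` and bit 22 to be `rl`, following the order of the `[aq][rl]` mnemonic suffixes. Power ISA numbers bit 0 as the most significant bit, so fields are shifted in from left to right.

```python
LAT_XO = 0b000101   # draft XO for lat[aq][rl]
STAT_XO = 0b100101  # draft XO for stat[aq][rl]


def encode_amo(xo, reg, ra, fc, ew, aq=False, rl=False):
    """Pack the draft fields into a 32-bit word (Power ISA bit 0 is the MSB)."""
    fields = [
        ("opcode", 31, 6),    # bits 0-5: EXT031
        ("RT/RS", reg, 5),    # bits 6-10
        ("RA", ra, 5),        # bits 11-15
        ("FC", fc, 5),        # bits 16-20
        ("aq", int(aq), 1),   # bit 21 (assumed position)
        ("rl", int(rl), 1),   # bit 22 (assumed position)
        ("ew", ew, 2),        # bits 23-24
        ("XO", xo, 6),        # bits 25-30
        ("reserved", 0, 1),   # bit 31
    ]
    word = 0
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word
```

For example, `hex(encode_amo(LAT_XO, 3, 4, 0b00000, 3))` gives `0x7c64018a` for a 64-bit `lat 3,4,0,3` under these assumptions.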

* `ew` specifies the memory operation width: values 0, 1, 2 and 3
  select 8, 16, 32 and 64 bits respectively.
* If the `aq` bit is set, then no later atomic memory operations can be
  observed to take place before the AMO, in this or other cores.
  (A Write-after-Read Memory Hazard is created.)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A Read-after-Write Memory Hazard is created.)
* Setting both the `aq` and the `rl` bit makes the AMO
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and a WaR hazard are created
  simultaneously.)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`.
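
The ordering rules for `aq` and `rl` can be restated as a small predicate over two operations in program order. This is a deliberately simplified sketch, not a memory model: `Op` and `may_be_observed_out_of_order` are hypothetical names, and every operation is treated alike, whereas the rules above distinguish atomic memory operations from other memory accesses.

```python
from collections import namedtuple

# An operation with optional acquire/release bits (both clear by default).
Op = namedtuple("Op", ["name", "aq", "rl"], defaults=[False, False])


def may_be_observed_out_of_order(earlier, later):
    """Return True if `later` may be observed before `earlier`.

    `earlier` precedes `later` in program order.
    """
    if earlier.aq:
        # Nothing after an acquire may appear to happen before it.
        return False
    if later.rl:
        # A release may not appear to happen before anything preceding it.
        return False
    return True
```

With both bits set on an operation, the predicate is false whichever side of the pair that operation sits on, which is the "cannot be reordered with respect to earlier or later" case.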

Read functions, v3.1 Book II section 4.5.1 p1071:

| FC    | Registers      | Memory                 | Function                    |
|-------|----------------|------------------------|-----------------------------|
| 00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
| 00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
| 00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
| 00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
| 00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
| 00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
| 00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
| 00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
| 01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
| 10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
| 11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
| 11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
| 11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |

Store functions:

| FC    | Registers | Memory    | Function               |
|-------|-----------|-----------|------------------------|
| 00000 | RS        | mem(EA,s) | Store Add              |
| 00001 | RS        | mem(EA,s) | Store XOR              |
| 00010 | RS        | mem(EA,s) | Store OR               |
| 00011 | RS        | mem(EA,s) | Store AND              |
| 00100 | RS        | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS        | mem(EA,s) | Store Maximum Signed   |
| 00110 | RS        | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS        | mem(EA,s) | Store Minimum Signed   |
| 11000 | RS        | mem(EA,s) | Store Twin             |

These functions are recognised as being part of the
OpenCAPI Specification.