# Draft proposal for improved atomic operations for the Power ISA
* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data-structure processing in 3D shader
binaries, which needs on the order of 10^5 or more atomic locks per
second per SMP core.
## Power ISA's current atomic operations are inefficient
Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for a lot of `acquire` and
stronger fences. `isync` forces the cpu to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.
`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
    sync
loop:
    lwarx 3, 0, 4
    add 9, 3, 5
    stwcx. 9, 0, 4
    bne 0, loop
    isync
# loaded value in r3
```
`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    lwz 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
skip:
    isync
# loaded value in r3
```
`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    lwarx 3, 0, 4
    cmpw 0, 3, 5
    bne 0, not_equal
    stwcx. 6, 0, 4
    bne 0, loop
not_equal:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```
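The operation this sequence implements can be sketched sequentially in Python (a simplified model of the C++11 `compare_exchange_strong` semantics, ignoring concurrency and memory ordering; `mem` is a hypothetical dict standing in for memory):

```python
def compare_exchange_strong(mem, addr, expected, desired):
    """Sequential model of atomic_compare_exchange_strong:
    returns (loaded value, store-occurred flag)."""
    old = mem[addr]          # corresponds to the load-reserve of the old value
    if old == expected:      # the comparison against the compared-to value
        mem[addr] = desired  # the conditional store of the replacement value
        return old, True     # store occurred (cr0.eq set in the asm version)
    return old, False        # comparison failed: no store
```

For example, starting from `mem = {0x10: 5}`, `compare_exchange_strong(mem, 0x10, 5, 7)` returns `(5, True)` and leaves `mem[0x10] == 7`.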
`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    lwz 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
skip:
    isync
# loaded value in r3
```
Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there rather than having to move a whole cache
block. Relying exclusively on
## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):
> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on).
## Power ISA's Atomic-Memory-Operations have issues

PowerISA v3.1 Book II section 4.5: Atomic Memory Operations
They are still missing better fences, combined operation/fence
instructions, and operations on 8/16-bit values, and they have
unnecessary restrictions: for example, there are only 32-bit and 64-bit
atomic operations.
Read operations (v3.1 Book II section 4.5.1 p1071):

| FC | Registers | Memory Operands | Operation |
| -- | --------- | --------------- | --------- |
| 00000 | RT, RT+1 | mem(EA,s) | Fetch and Add |
| 00001 | RT, RT+1 | mem(EA,s) | Fetch and XOR |
| 00010 | RT, RT+1 | mem(EA,s) | Fetch and OR |
| 00011 | RT, RT+1 | mem(EA,s) | Fetch and AND |
| 00100 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Unsigned |
| 00101 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Signed |
| 00110 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Unsigned |
| 00111 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Signed |
| 01000 | RT, RT+1 | mem(EA,s) | Swap |
| 10000 | RT, RT+1, RT+2 | mem(EA,s) | Compare and Swap Not Equal |
| 11000 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
| 11001 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal |
| 11100 | RT | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |
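To make the read-operations table concrete, here is a sketch in Python of the single-register fetch-class operations (my paraphrase of what the table's operation names imply, not the spec's formal definitions; operand width, wrapping, signedness, and the twin-register and bounded forms are not modeled):

```python
def amo_read(fc, mem, ea, operand):
    """Simplified model of the fetch-class read AMOs: applies the
    operation selected by FC to mem[ea] and returns the fetched
    (pre-operation) value, as a fetch-and-* instruction would."""
    old = mem[ea]
    ops = {
        0b00000: lambda a, b: a + b,  # Fetch and Add
        0b00001: lambda a, b: a ^ b,  # Fetch and XOR
        0b00010: lambda a, b: a | b,  # Fetch and OR
        0b00011: lambda a, b: a & b,  # Fetch and AND
        0b00100: max,                 # Fetch and Maximum (signedness not modeled)
        0b00110: min,                 # Fetch and Minimum (signedness not modeled)
        0b01000: lambda a, b: b,      # Swap
    }
    mem[ea] = ops[fc](old, operand)
    return old
```

For example, with `mem = {0: 10}`, `amo_read(0b00000, mem, 0, 5)` returns the old value 10 and leaves `mem[0] == 15`.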
Store operations:

| FC | Register | Memory Operands | Operation |
| -- | -------- | --------------- | --------- |
| 00000 | RS | mem(EA,s) | Store Add |
| 00001 | RS | mem(EA,s) | Store XOR |
| 00010 | RS | mem(EA,s) | Store OR |
| 00011 | RS | mem(EA,s) | Store AND |
| 00100 | RS | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS | mem(EA,s) | Store Maximum Signed |
| 00110 | RS | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS | mem(EA,s) | Store Minimum Signed |
| 11000 | RS | mem(EA,s) | Store Twin |
These operations are recognised as being part of the
OpenCAPI Specification.

The operations it already has that I was going to propose:

* fetch-and-add
* fetch-and-xor
* fetch-and-or
* fetch-and-and
* fetch-and-maximum-unsigned
* fetch-and-maximum-signed
* fetch-and-minimum-unsigned
* fetch-and-minimum-signed
* swap
as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
The spec also basically says that the atomic memory operations are only
intended for cases where you want to do atomic operations on memory but
don't want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.
I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.
imho the cpu should be able to (but not required to) predict whether to
send an atomic operation to the L2-cache/L3-cache/etc./memory or to
execute it directly in the L1 cache. The prediction could be based on
how often that cache block was accessed from different cpus, e.g. by
having a small saturating counter and a last-accessing-cpu field: count
how many times the same cpu accessed the block in a row, executing the
operation in the L1 cache if that count exceeds some limit, and
otherwise doing the operation in the L2/L3/etc.-cache if the limit
wasn't reached or a different cpu accessed the block.
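One possible sketch of that predictor, in Python (purely illustrative: the class name, threshold, and counter width are my own assumptions, not anything from a spec):

```python
class AmoPlacementPredictor:
    """Per-cache-block predictor: execute an AMO in the local L1 cache
    if the same cpu has accessed the block several times in a row,
    otherwise send it to the shared L2/L3/etc. cache."""

    def __init__(self, threshold=4, counter_max=7):
        self.threshold = threshold    # hits-in-a-row needed to choose L1
        self.counter_max = counter_max  # saturation limit of the counter
        self.last_cpu = {}  # block -> last-accessing-cpu field
        self.counter = {}   # block -> small saturating counter

    def predict(self, block, cpu):
        if self.last_cpu.get(block) == cpu:
            # same cpu again: saturate the counter upward
            self.counter[block] = min(self.counter.get(block, 0) + 1,
                                      self.counter_max)
        else:
            # a different cpu touched the block: remember it and reset
            self.last_cpu[block] = cpu
            self.counter[block] = 0
        return "L1" if self.counter[block] >= self.threshold else "L2/L3"
```

With `threshold=2`, three back-to-back accesses by the same cpu switch the prediction to "L1", and a single access by a different cpu switches it back to "L2/L3".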
# TODO: add list of proposed instructions

* lataqrl RT,RA,FC,ew
| 0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31| name | Form |
| -- | -- | --- | --- |--|--|---- |------|--| ---- | ------- |
| NN | RT | RA | FC |lr|sc|ew |xxxxxx|/ | lat[aq][rl]| TODO-Form |
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO.
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
* Setting both the `aq` and the `rl` bit makes the sequence
  sequentially consistent, meaning that
  it cannot be reordered with earlier or later atomic memory operations.
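The intended correspondence between those bits and the C++11 memory orders can be summarised as follows (a sketch of the intent described above; the function name is mine):

```python
def lat_memory_order(aq: bool, rl: bool) -> str:
    """Map the proposed aq/rl bits to the closest C++11 memory order."""
    if aq and rl:
        return "seq_cst"  # both bits: sequentially consistent
    if aq:
        return "acquire"  # later accesses cannot be observed before the AMO
    if rl:
        return "release"  # earlier accesses cannot be observed after the AMO
    return "relaxed"      # neither bit: no ordering guarantee
```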