# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section

## Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data structure processing in 3D shader
binaries needing on the order of 10^5 or more SMP atomic locks per second.

### Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion, because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for a lot of `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4
add 6, 5, 3
stdcx. 6, 0, 4
bne 0, loop
lwsync
# output in r3
```

`atomic_load_seq_cst` is 5 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq
stdcx. 6, 0, 4
bne 0, loop
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- skip
isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there than having to move a whole cache block.
Relying exclusively on
TODO

### Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

### Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

They are still missing better fences, combined operation/fence
instructions, and operations on 8/16-bit values, and they have
unnecessary restrictions: there are only 32-bit and 64-bit atomic
operations.

Read operations (v3.1 Book II section 4.5.1 p1071):

| Function code | Result registers | Memory operands | Operation |
|-------|----------|-----------|----------------------------|
| 00000 | RT, RT+1 | mem(EA,s) | Fetch and Add |
| 00001 | RT, RT+1 | mem(EA,s) | Fetch and XOR |
| 00010 | RT, RT+1 | mem(EA,s) | Fetch and OR |
| 00011 | RT, RT+1 | mem(EA,s) | Fetch and AND |
| 00100 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Unsigned |
| 00101 | RT, RT+1 | mem(EA,s) | Fetch and Maximum Signed |
| 00110 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Unsigned |
| 00111 | RT, RT+1 | mem(EA,s) | Fetch and Minimum Signed |
| 01000 | RT, RT+1 | mem(EA,s) | Swap |
| 10000 | RT, RT+1, RT+2 | mem(EA,s) | Compare and Swap Not Equal |
| 11000 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
| 11001 | RT | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal |
| 11100 | RT | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |

Store operations:

| Function code | Source register | Memory operands | Operation |
|-------|----|-----------|------------------------|
| 00000 | RS | mem(EA,s) | Store Add |
| 00001 | RS | mem(EA,s) | Store XOR |
| 00010 | RS | mem(EA,s) | Store OR |
| 00011 | RS | mem(EA,s) | Store AND |
| 00100 | RS | mem(EA,s) | Store Maximum Unsigned |
| 00101 | RS | mem(EA,s) | Store Maximum Signed |
| 00110 | RS | mem(EA,s) | Store Minimum Unsigned |
| 00111 | RS | mem(EA,s) | Store Minimum Signed |
| 11000 | RS | mem(EA,s) | Store Twin |

These operations are recognised as being part of the
OpenCAPI Specification.

The operations it already has that I was going to propose:

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin

The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory but don't
want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

imho the CPU should be able to (but not be required to) predict whether
to send an atomic operation to the L2/L3/etc. cache or memory, or to
execute it directly in the L1 cache. The prediction could be based on
how often a cache block is accessed by different CPUs, e.g. by keeping a
small saturating counter and a last-accessing-CPU field per block:
count how many times the same CPU accesses the block in a row, execute
the operation in the L1 cache once that count exceeds some limit, and
otherwise do the operation in the L2/L3/etc. cache (when the limit
hasn't been reached, or a different CPU accesses the block).

# TODO: add list of proposed instructions