# Draft proposal for improved atomic operations for the Power ISA

<https://bugs.libre-soc.org/show_bug.cgi?id=236>

## Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D data-structure processing that needs on the
order of 10^5 or more SMP atomic locks per second.

### Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for a lot of `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4
add 6, 5, 3
stdcx. 6, 0, 4
bne 0, loop
lwsync
# output in r3
```
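
For comparison, here is a minimal C11 sketch of the operation this
sequence implements (using `<stdatomic.h>`; the exact instruction
sequence a compiler emits depends on the compiler, options and target):

```c
#include <stdatomic.h>

// seq_cst fetch-add on a 64-bit atomic: on current Power compilers this
// lowers to a sync/ldarx/add/stdcx./bne/lwsync loop like the one above
long fetch_add_seq_cst(_Atomic long *p, long addend) {
    return atomic_fetch_add_explicit(p, addend, memory_order_seq_cst);
}
```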

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq
stdcx. 6, 0, 4
bne 0, loop
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```
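
Again for comparison, a minimal C11 sketch of the corresponding
operation (illustrative only; real codegen varies by compiler, but is
typically a loop like the one shown above):

```c
#include <stdatomic.h>
#include <stdbool.h>

// seq_cst strong compare-exchange on a 64-bit atomic
bool cas_seq_cst(_Atomic long *p, long *expected, long desired) {
    return atomic_compare_exchange_strong_explicit(
        p, expected, desired,
        memory_order_seq_cst, memory_order_seq_cst);
}
```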

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- skip
isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be more
efficient to execute there rather than having to move a whole cache
block. Relying exclusively on load-reserve/store-conditional sequences
prevents that, because the cache block has to be brought into the
issuing core's L1 cache.

### Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

### Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

They are still missing better fences, combined operation/fence
instructions, and operations on 8/16-bit values, as well as having
some unnecessary restrictions:

They only have 32-bit and 64-bit atomic operations.

The operations it already has that I was going to propose (see the C11
mapping sketch after these lists):

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin
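
For reference, most of the first group maps directly onto C11
`<stdatomic.h>` operations; a minimal sketch follows (the
umax/smax/umin/smin operations have no direct C11 library equivalent,
so a compare-exchange loop stands in for them here):

```c
#include <stdatomic.h>
#include <stdint.h>

// Direct C11 equivalents of the proposed fetch-and-op operations
uint64_t f_add(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_add(p, v); }
uint64_t f_xor(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_xor(p, v); }
uint64_t f_or (_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_or (p, v); }
uint64_t f_and(_Atomic uint64_t *p, uint64_t v) { return atomic_fetch_and(p, v); }
uint64_t xchg (_Atomic uint64_t *p, uint64_t v) { return atomic_exchange(p, v); }

// fetch_umax has no C11 equivalent: a compare-exchange loop stands in
uint64_t f_umax(_Atomic uint64_t *p, uint64_t v) {
    uint64_t old = atomic_load(p);
    while (old < v && !atomic_compare_exchange_weak(p, &old, v))
        ;  // on failure, old is updated to the current value; retry
    return old;
}
```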

The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory, but don't
want that memory to be loaded into your L1 cache.

imho that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

imho the CPU should be able to (but not required to) predict whether to
send an atomic operation to the L2/L3/etc. cache or memory, or to
execute it directly in the L1 cache. The prediction could be based on
how often a cache block is accessed from different CPUs, e.g. by having
a small saturating counter and a last-accessing-CPU field per block:
count how many times the same CPU accessed the block in a row, execute
the operation in the L1 cache if that count exceeds some limit, and
otherwise perform it in the L2/L3/etc. cache (when the limit wasn't
reached or a different CPU tried to access the block).
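
A minimal sketch of such a predictor, assuming a hypothetical
per-cache-block counter width and threshold (the field names and the
limit of 8 are illustrative, not part of any specification):

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical predictor state attached to each cache block
typedef struct {
    uint16_t last_cpu;        // id of the last CPU that accessed the block
    uint8_t  same_cpu_count;  // saturating count of repeated accesses
} amo_predictor;

#define SAME_CPU_LIMIT 8      // illustrative threshold

// Returns true if the atomic operation should be executed in the local
// L1 cache, false if it should be forwarded to the L2/L3/etc. cache.
static bool predict_execute_in_l1(amo_predictor *p, uint16_t cpu) {
    if (p->last_cpu == cpu) {
        if (p->same_cpu_count < UINT8_MAX)
            p->same_cpu_count++;          // same CPU again: count it
    } else {
        p->last_cpu = cpu;                // different CPU: reset history
        p->same_cpu_count = 0;
    }
    return p->same_cpu_count > SAME_CPU_LIMIT;
}
```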

# TODO: add list of proposed instructions