openpower/atomics.mdwn

# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]
* <http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html>

TODO:

* investigate Power ISA 3.1 p1077 eh hint

# Motivation

Power ISA currently has some issues with its atomic operations support.
These are exacerbated by 3D data structure processing in 3D shader
binaries, which needs of the order of 10^5 or more atomic locks per
second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions, including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```
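
To make the cost of the retry loop concrete, the following is a minimal Python sketch of the load-reserve / store-conditional pattern used above. It is a toy, single-reservation model: `ReservationMemory`, `fetch_add` and the `interfere` hook are hypothetical names introduced for illustration, and the fences (`sync`, `lwsync`) are not modelled at all.

```python
MASK64 = (1 << 64) - 1


class ReservationMemory:
    """Toy memory with one reservation; fences are not modelled."""

    def __init__(self):
        self.cells = {}
        self.reservation = None

    def load_reserve(self, addr):
        # ldarx: load the value and take a reservation on the address
        self.reservation = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        # plain store by another agent: cancels a matching reservation
        self.cells[addr] = value & MASK64
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):
        # stdcx.: succeeds only while the reservation is still held
        if self.reservation != addr:
            return False
        self.cells[addr] = value & MASK64
        self.reservation = None
        return True


def fetch_add(mem, addr, addend, interfere=None):
    """Retry loop mirroring ldarx / add / stdcx. / bne.

    Returns (old value, number of attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve(addr)
        if interfere is not None and attempts == 1:
            # hypothetical hook: another core writes between load and store
            interfere(mem)
        if mem.store_conditional(addr, old + addend):
            return old, attempts
```

In this model an intervening store from another agent forces a second trip around the loop, which is the behaviour a single fused atomic instruction would avoid.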

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want
to send atomic operations to a shared cache, since they can be executed
more efficiently there than by moving a whole cache block. Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

PowerISA v3.1 Book II section 4.5: Atomic Memory Operations

The existing Atomic Memory Operations still lack better fences,
combined operation/fence instructions, and operations on 8/16-bit
values, and they have unnecessary restrictions:
only 32-bit and 64-bit atomic operations are provided.

see [[discussion]] for proposed operations and thoughts TODO
remove this sentence


# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical
to `lwat/ldat` and `stwat/stdat` except that they add guaranteed
acquire and release ordering semantics as well as 8-bit and
16-bit memory widths.

AT-Form (TODO)

* lat. RT,RA,FC,aq,rl,ew
* stat. RS,RA,FC,aq,rl,ew

**DRAFT** EXT031 and XO: the encodings below are close to those of the
existing atomic memory operations

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name| Form      |
|-- | -- | --- | --- |--|--|---- |------|--|----|-----------|
|31 | RT | RA  | FC  |aq|rl|ew   |000101|Rc|lat | TODO-Form |
|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat| TODO-Form |
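
As a rough illustration of the draft field layout in the table above, the following Python sketch packs and unpacks a 32-bit instruction word. It assumes Power ISA bit numbering (bit 0 is the most significant bit) and takes bits 21 and 22 to be `aq` and `rl`, following the operand order in the mnemonics. `encode_amo` and `decode_amo` are hypothetical helper names, and the XO values are the draft ones from the table, so all of this may change.

```python
PO_EXT031 = 31
XO_LAT = 0b000101   # draft value from the table
XO_STAT = 0b100101  # draft value from the table


def encode_amo(xo, reg, ra, fc, aq, rl, ew, rc=0):
    """Pack the draft fields into a 32-bit word (Power bit 0 is the MSB)."""
    fields = [  # (value, width in bits), most significant field first
        (PO_EXT031, 6), (reg, 5), (ra, 5), (fc, 5),
        (aq, 1), (rl, 1), (ew, 2), (xo, 6), (rc, 1),
    ]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_amo(word):
    """Split a 32-bit word back into the draft fields."""
    return {
        "po": (word >> 26) & 0x3F,   # bits 0..5
        "reg": (word >> 21) & 0x1F,  # bits 6..10 (RT or RS)
        "ra": (word >> 16) & 0x1F,   # bits 11..15
        "fc": (word >> 11) & 0x1F,   # bits 16..20
        "aq": (word >> 10) & 1,      # bit 21
        "rl": (word >> 9) & 1,       # bit 22
        "ew": (word >> 7) & 3,       # bits 23..24
        "xo": (word >> 1) & 0x3F,    # bits 25..30
        "rc": word & 1,              # bit 31
    }
```

For example, `encode_amo(XO_LAT, 3, 4, 0, 1, 1, 3)` would describe a 64-bit Fetch and Add with both ordering bits set, with RT=3 and RA=4.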

* `ew` specifies the memory operation width: values 0/1/2/3 select
  8/16/32/64 bits respectively
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO in this or other cores.
  (A global Write-after-Read Memory Hazard is created)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A global Read-after-Write Memory Hazard is created)
* Setting both the `aq` and the `rl` bit makes the sequence
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and a WaR hazard are created simultaneously)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`
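
The field meanings above can be summarised in a small Python sketch. The `ew` table comes straight from the list; the mapping from C++11 memory orders to `(aq, rl)` pairs is an assumption made for illustration and is not part of the draft. In particular, mapping `acq_rel` to both bits set is simply the conservative choice. All names here are hypothetical.

```python
# ew field value -> memory operation width in bits (from the list above)
EW_TO_BITS = {0: 8, 1: 16, 2: 32, 3: 64}

# Assumed mapping from C++11 memory orders to (aq, rl); not part of the draft.
ORDER_TO_AQ_RL = {
    "relaxed": (0, 0),
    "acquire": (1, 0),
    "release": (0, 1),
    "acq_rel": (1, 1),  # conservative: same bits as seq_cst
    "seq_cst": (1, 1),
}


def ew_for_width(bits):
    """Return the ew field value that selects the given width in bits."""
    for ew, width in EW_TO_BITS.items():
        if width == bits:
            return ew
    raise ValueError(f"unsupported width: {bits}")


def ordering_bits(order):
    """Return the (aq, rl) pair assumed for a C++11 memory order name."""
    return ORDER_TO_AQ_RL[order]
```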

Read functions (v3.1 Book II section 4.5.1, p1071):

|opcode| regs           | memory                 | description                 |
|------|----------------|------------------------|-----------------------------|
|00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
|00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
|00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
|00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
|00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
|00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
|00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
|00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
|01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
|10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
|11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
|11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
|11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |
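
A minimal Python reference sketch of the simple two-register read functions (opcodes 00000 to 01000) may help when checking behaviour at the narrower widths. It assumes that the operand is taken from RT+1 and that the original memory value is returned in RT, which is one reading of the `regs` column and should be checked against the Power ISA text. The compare-and-swap and bounded functions are left out, memory is modelled as a plain dict, and `fetch_op` is a hypothetical name.

```python
def to_signed(value, bits):
    """Interpret an unsigned value of the given width as two's complement."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


# opcode -> function of (old memory value, operand, width) giving new value
FETCH_FUNCTIONS = {
    0b00000: lambda old, src, bits: old + src,      # Fetch and Add
    0b00001: lambda old, src, bits: old ^ src,      # Fetch and XOR
    0b00010: lambda old, src, bits: old | src,      # Fetch and OR
    0b00011: lambda old, src, bits: old & src,      # Fetch and AND
    0b00100: lambda old, src, bits: max(old, src),  # Maximum Unsigned
    0b00101: lambda old, src, bits: max(
        old, src, key=lambda v: to_signed(v, bits)),  # Maximum Signed
    0b00110: lambda old, src, bits: min(old, src),  # Minimum Unsigned
    0b00111: lambda old, src, bits: min(
        old, src, key=lambda v: to_signed(v, bits)),  # Minimum Signed
    0b01000: lambda old, src, bits: src,            # Swap
}


def fetch_op(mem, ea, fc, src, bits):
    """Apply function fc to mem[ea] at the given width; return the old value."""
    mask = (1 << bits) - 1
    old = mem.get(ea, 0) & mask
    mem[ea] = FETCH_FUNCTIONS[fc](old, src & mask, bits) & mask
    return old
```

At 8 bits, for instance, a Fetch and Add of 1 to a location holding 0xFF wraps to 0 in this model, and the signed and unsigned maximum of 0x80 and 0x01 differ.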

Store functions:

|opcode| regs | memory    | description                 |
|------|------|-----------|-----------------------------|
|00000 | RS   | mem(EA,s) | Store Add                   |
|00001 | RS   | mem(EA,s) | Store XOR                   |
|00010 | RS   | mem(EA,s) | Store OR                    |
|00011 | RS   | mem(EA,s) | Store AND                   |
|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
|11000 | RS   | mem(EA,s) | Store Twin                  |

These functions are also recognised as being part of the
OpenCAPI Specification.