# Draft proposal for improved atomic operations for the Power ISA

Links:

* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA currently has some issues with its atomic operations support. These are exacerbated by 3D data-structure processing in 3D shader binaries, which needs on the order of 10^5 or more atomic locks per second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations via macro-op fusion because they would often have to detect and fuse a large number of instructions, including branches. This is contrary to the RISC paradigm.

Power ISA's memory fences are also unnecessarily strong, particularly `isync`, which is used for many `acquire` and stronger fences. `isync` forces the CPU to do a full pipeline flush, which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
sync
loop:
ldarx 3, 0, 4    # load doubleword and reserve
add 6, 5, 3
stdcx. 6, 0, 4   # store conditional, sets cr0.eq on success
bne 0, loop      # retry if the reservation was lost
lwsync
# output in r3
```

`atomic_load_seq_cst` is 5 instructions, including a branch, and an unnecessarily-strong memory fence:

```
# address in r3
sync
ld 3, 0(3)
cmpw 0, 3, 3     # compare the loaded value with itself to create a dependency
bne- 0, skip     # never taken
isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
sync
loop:
ldarx 3, 0, 4
cmpd 0, 3, 5
bne 0, not_eq    # loaded value differs, do not store
stdcx. 6, 0, 4
bne 0, loop      # retry if the reservation was lost
not_eq:
isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an unnecessarily-strong memory fence:

```
# address in r3
ld 3, 0(3)
cmpw 0, 3, 3
bne- 0, skip
isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want to send atomic operations to a shared cache, since they can be more efficient to execute there than moving a whole cache block. Relying exclusively on TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

The existing Atomic Memory Operations still lack better fences, combined operation/fence instructions, and operations on 8/16-bit values. They also carry unnecessary restrictions: only 32-bit and 64-bit atomic operations are provided.

See [[discussion]] for proposed operations and thoughts. TODO: remove this sentence.

# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical to `lwat/ldat` and `stwat/stdat` except that they add guaranteed acquire and release ordering semantics as well as 8-bit and 16-bit memory widths.
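To make the contrast with the multi-instruction sequences in the Motivation section concrete, the following is a minimal sketch of how a compiler might pick one of the proposed mnemonics for a C++11-style fetch-and-op. The helper `select_lat` and the mapping from memory orders to the `aq`/`rl` suffixes are assumptions made for illustration (modelled loosely on the RISC-V convention), not something this draft specifies. The `FC` and `ew` values are taken from the function and width definitions given in this section.

```python
# Hypothetical helper, not part of the draft: the memory-order mapping below
# is an assumption, chosen by analogy with RISC-V's aq/rl bits.
EW_FOR_BITS = {8: 0, 16: 1, 32: 2, 64: 3}  # ew field value per operand width

SUFFIX_FOR_ORDER = {
    "relaxed": "",
    "acquire": "aq",
    "release": "rl",
    "acq_rel": "aqrl",
    "seq_cst": "aqrl",  # the draft states aq+rl is sequentially consistent
}

# Function codes shared with the existing lwat/ldat definitions.
FETCH_FC = {"add": 0b00000, "xor": 0b00001, "or": 0b00010, "and": 0b00011}


def select_lat(op, order, bits):
    """Return assembly text for a single proposed fetch-and-op instruction."""
    mnemonic = "lat" + SUFFIX_FOR_ORDER[order]
    return f"{mnemonic} RT,RA,{FETCH_FC[op]},{EW_FOR_BITS[bits]}"


# A 64-bit atomic_fetch_add_seq_cst would become one instruction:
print(select_lat("add", "seq_cst", 64))  # lataqrl RT,RA,0,3
```

Under these assumptions the six-instruction `atomic_fetch_add_seq_cst` loop collapses to a single `lataqrl`, which is the kind of saving the proposal is aiming for.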
AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew
* stat RS,RA,FC,ew
* stataq RS,RA,FC,ew
* statrl RS,RA,FC,ew
* stataqrl RS,RA,FC,ew

**DRAFT** EXT031 and XO: these are close to the encodings of the existing atomic memory operations.

| 0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31| name | Form |
| -- | -- | --- | --- |--|--|---- |------|--| ---- | ------- |
| 31 | RT | RA | FC |lr|sc|ew |000101|/ | lat[aq][rl]| TODO-Form |
| 31 | RS | RA | FC |lr|sc|ew |100101|/ | stat[aq][rl]| TODO-Form |

* `ew` specifies the memory operation width: values 0/1/2/3 select 8/16/32/64 bits respectively.
* If the `aq` bit is set, then no later atomic memory operations can be observed to take place before the AMO in this or other cores. (A Write-after-Read Memory Hazard is created)
* If the `rl` bit is set, then other cores will not observe the AMO before memory accesses preceding the AMO. (A Read-after-Write Memory Hazard is created)
* Setting both the `aq` and the `rl` bit makes the sequence sequentially consistent, meaning that it cannot be reordered with respect to earlier or later atomic memory operations. (Both a RaW and a WaR hazard are created simultaneously)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat` and `stwat`.

Read functions (Power ISA v3.1 Book II section 4.5.1, p1071):

| FC    | Registers      | Memory operands         | Function                    |
| ----- | -------------- | ----------------------- | --------------------------- |
| 00000 | RT, RT+1       | mem(EA,s)               | Fetch and Add               |
| 00001 | RT, RT+1       | mem(EA,s)               | Fetch and XOR               |
| 00010 | RT, RT+1       | mem(EA,s)               | Fetch and OR                |
| 00011 | RT, RT+1       | mem(EA,s)               | Fetch and AND               |
| 00100 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Unsigned  |
| 00101 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Signed    |
| 00110 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Unsigned  |
| 00111 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Signed    |
| 01000 | RT, RT+1       | mem(EA,s)               | Swap                        |
| 10000 | RT, RT+1, RT+2 | mem(EA,s)               | Compare and Swap Not Equal  |
| 11000 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Bounded |
| 11001 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Equal   |
| 11100 | RT             | mem(EA-s,s) mem(EA, s)  | Fetch and Decrement Bounded |

Store functions:

| FC    | Registers | Memory operands | Function               |
| ----- | --------- | --------------- | ---------------------- |
| 00000 | RS        | mem(EA,s)       | Store Add              |
| 00001 | RS        | mem(EA,s)       | Store XOR              |
| 00010 | RS        | mem(EA,s)       | Store OR               |
| 00011 | RS        | mem(EA,s)       | Store AND              |
| 00100 | RS        | mem(EA,s)       | Store Maximum Unsigned |
| 00101 | RS        | mem(EA,s)       | Store Maximum Signed   |
| 00110 | RS        | mem(EA,s)       | Store Minimum Unsigned |
| 00111 | RS        | mem(EA,s)       | Store Minimum Signed   |
| 11000 | RS        | mem(EA,s)       | Store Twin             |

These functions are recognised as being part of the OpenCAPI Specification.
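As an illustration of how the `ew` widths combine with the simple fetch-and-op function codes (00000 to 01000), here is a small behavioural sketch in Python. It is a single-threaded model only, so it says nothing about atomicity or the `aq`/`rl` ordering. The function name `lat`, the `src` parameter (standing in for the operand that `lwat`/`ldat` take from register RT+1), and the little-endian byte order are assumptions made for the example. The compare-and-swap and bounded increment/decrement codes are deliberately left out.

```python
# Hypothetical behavioural model; not part of the draft specification.
WIDTH_BITS = {0: 8, 1: 16, 2: 32, 3: 64}  # ew value -> operand width in bits


def to_signed(value, bits):
    """Reinterpret an unsigned `bits`-wide integer as two's complement."""
    return value - (1 << bits) if value >> (bits - 1) else value


# Function codes 00000..01000 of the read table: (old, src, bits) -> new value.
FETCH_OPS = {
    0b00000: lambda old, src, bits: old + src,        # Fetch and Add
    0b00001: lambda old, src, bits: old ^ src,        # Fetch and XOR
    0b00010: lambda old, src, bits: old | src,        # Fetch and OR
    0b00011: lambda old, src, bits: old & src,        # Fetch and AND
    0b00100: lambda old, src, bits: max(old, src),    # Maximum Unsigned
    0b00101: lambda old, src, bits: max(old, src, key=lambda v: to_signed(v, bits)),
    0b00110: lambda old, src, bits: min(old, src),    # Minimum Unsigned
    0b00111: lambda old, src, bits: min(old, src, key=lambda v: to_signed(v, bits)),
    0b01000: lambda old, src, bits: src,              # Swap
}


def lat(memory, ea, fc, ew, src):
    """Apply function `fc` to the `ew`-wide element at `ea`; return the old value."""
    bits = WIDTH_BITS[ew]
    size = bits // 8
    mask = (1 << bits) - 1
    old = int.from_bytes(memory[ea:ea + size], "little")  # byte order assumed
    new = FETCH_OPS[fc](old, src & mask, bits) & mask     # wrap to element width
    memory[ea:ea + size] = new.to_bytes(size, "little")
    return old


mem = bytearray(4)
mem[0] = 0xFF
old = lat(mem, 0, 0b00000, 0, 1)  # 8-bit Fetch and Add
print(old, list(mem))             # 255 [0, 0, 0, 0]
```

The 8-bit add wraps within its own byte and leaves the neighbouring bytes untouched, which is the property that distinguishes a true narrow AMO from emulating it with a 32-bit operation and masking.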