# Draft proposal for improved atomic operations for the Power ISA

Links:

* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA currently has some issues with its atomic operations support,
which are exacerbated by 3D Data structure processing in 3D Shader
Binaries needing of the order of 10^5 or greater atomic locks per second
per SMP Core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary to
the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Having single-instruction atomic operations is useful for implementations
that want to send atomic operations to a shared cache, since they can be
more efficient to execute there, rather than having to move a whole
cache block.

Relying exclusively on TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

Power ISA v3.1 Book II section 4.5: Atomic Memory Operations

They still lack better fences, combined operation/fence instructions,
and operations on 8/16-bit values, and they carry unnecessary
restrictions: only 32-bit and 64-bit atomic operations are available.

See [[discussion]] for proposed operations and thoughts.
TODO remove this sentence

# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical to `lwat/ldat`
and `stwat/stdat` except that they add guaranteed acquire and release
ordering semantics as well as 8-bit and 16-bit memory widths.
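For reference, the source-level operations behind the sequences listed under Motivation can be written with C++11 `std::atomic`, as in the minimal sketch below. The helper names are hypothetical and exist only for illustration. The last helper shows the kind of operation (8-bit, with combined acquire and release ordering) that `lwat`/`ldat` do not cover. How a compiler would map any of these onto the draft instructions is not specified here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper names, for illustration only.

// 64-bit fetch-and-add, sequentially consistent; returns the old value.
inline std::uint64_t fetch_add_seq_cst(std::atomic<std::uint64_t>& a,
                                       std::uint64_t v) {
    return a.fetch_add(v, std::memory_order_seq_cst);
}

// 64-bit load with acquire ordering.
inline std::uint64_t load_acquire(const std::atomic<std::uint64_t>& a) {
    return a.load(std::memory_order_acquire);
}

// Strong compare-exchange, sequentially consistent.
// Returns true if the store occurred; on failure `expected` receives
// the value that was actually loaded.
inline bool cas_seq_cst(std::atomic<std::uint64_t>& a,
                        std::uint64_t& expected, std::uint64_t desired) {
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_seq_cst);
}

// 8-bit fetch-and-add with acquire + release ordering.
inline std::uint8_t fetch_add_u8_acq_rel(std::atomic<std::uint8_t>& a,
                                         std::uint8_t v) {
    return a.fetch_add(v, std::memory_order_acq_rel);
}

// Small usage example: each helper returns the previous value.
inline void demo() {
    std::atomic<std::uint64_t> counter{40};
    const std::uint64_t old = fetch_add_seq_cst(counter, 2);
    assert(old == 40);
    assert(load_acquire(counter) == 42);
    (void)old;
}
```

The instruction sequence actually emitted for each helper depends on the compiler and target.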

AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew
* stat RS,RA,FC,ew
* stataq RS,RA,FC,ew
* statrl RS,RA,FC,ew
* stataqrl RS,RA,FC,ew

**DRAFT** EXT031 and XO, these are near to the existing atomic memory operations

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name        | Form       |
|-- | -- | --- | --- |--|--|---- |------|--|------------|------------|
|31 | RT | RA  | FC  |lr|sc|ew   |000101|/ |lat[aq][rl] | TODO-Form  |
|31 | RS | RA  | FC  |lr|sc|ew   |100101|/ |stat[aq][rl]| TODO-Form  |

* `ew` specifies the memory operation width: 0/1/2/3 selects
  8/16/32/64 bits
* If the `aq` bit is set, then no later atomic memory operations
  can be observed to take place before the AMO in this or other cores.
  (A Write-after-Read Memory Hazard is created)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A Read-after-Write Memory Hazard is created)
* Setting both the `aq` and the `rl` bit makes the sequence
  sequentially consistent, meaning that it cannot be reordered with
  respect to earlier or later atomic memory operations.
  (Both a RaW and WaR are simultaneously created)
* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
  and `stwat`

read functions v3.1 book II section 4.5.1 p1071

|opcode| regs           | memory                 | description                 |
|------|----------------|------------------------|-----------------------------|
|00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
|00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
|00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
|00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
|00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
|00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
|00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
|00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
|01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
|10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
|11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
|11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
|11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |

store functions

|opcode| regs | memory    | description                 |
|------|------|-----------|-----------------------------|
|00000 | RS   | mem(EA,s) | Store Add                   |
|00001 | RS   | mem(EA,s) | Store XOR                   |
|00010 | RS   | mem(EA,s) | Store OR                    |
|00011 | RS   | mem(EA,s) | Store AND                   |
|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
|11000 | RS   | mem(EA,s) | Store Twin                  |

These functions are also recognised as being part of the OpenCAPI Specification.
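
To make the draft field layout concrete, here is a minimal sketch that packs the fields from the DRAFT encoding table into a 32-bit instruction word, using Power ISA MSB0 bit numbering (bit 0 is the most significant bit). It rests on assumptions. All names are hypothetical. Bits 21 and 22, which the table labels `lr` and `sc`, are assumed to carry `aq` and `rl` in that order. The XO values are those of the draft table and may change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; field positions follow the DRAFT table above.
enum class Width : std::uint32_t { W8 = 0, W16 = 1, W32 = 2, W64 = 3 };

// Draft XO values (bits 25..30): 000101 for lat, 100101 for stat.
constexpr std::uint32_t XO_LAT  = 0x05;
constexpr std::uint32_t XO_STAT = 0x25;

// Memory operation width in bytes for a given `ew` value.
inline unsigned width_bytes(Width ew) {
    return 1u << static_cast<unsigned>(ew);
}

// Pack the fields into an instruction word.
// A field ending at MSB0 bit b is shifted left by (31 - b).
// ASSUMPTION: bit 21 = aq, bit 22 = rl (the table calls them lr and sc).
inline std::uint32_t encode_amo(std::uint32_t xo, std::uint32_t rt_or_rs,
                                std::uint32_t ra, std::uint32_t fc,
                                bool aq, bool rl, Width ew) {
    assert(xo < 64 && rt_or_rs < 32 && ra < 32 && fc < 32);
    return (31u << 26)                              // bits 0..5: major opcode 31
         | (rt_or_rs << 21)                         // bits 6..10: RT or RS
         | (ra << 16)                               // bits 11..15: RA
         | (fc << 11)                               // bits 16..20: FC
         | (static_cast<std::uint32_t>(aq) << 10)   // bit 21
         | (static_cast<std::uint32_t>(rl) << 9)    // bit 22
         | (static_cast<std::uint32_t>(ew) << 7)    // bits 23..24: ew
         | (xo << 1);                               // bits 25..30: XO, bit 31 = 0
}

// Extract the `ew` field back out of an encoded word.
inline Width decode_ew(std::uint32_t insn) {
    return static_cast<Width>((insn >> 7) & 0x3u);
}
```

Under these assumptions, `encode_amo(XO_LAT, 3, 4, 0, true, true, Width::W8)` would describe an 8-bit Fetch and Add (`FC` = 00000) with both ordering bits set, with RT = r3 and the address in r4.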