# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
* [[atomics/discussion]]

# Motivation

Power ISA currently has some issues with its atomic operations support.
These are exacerbated by 3D data-structure processing in 3D shader
binaries, which needs on the order of 10^5 or more atomic locks per
second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Single-instruction atomic operations are useful for implementations that
want to send atomic operations to a shared cache, since they can be
executed more efficiently there than by moving a whole cache block.
Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

The existing Atomic Memory Operations (AMOs) are specified in
Power ISA v3.1 Book II section 4.5: Atomic Memory Operations.

They still lack better fences, combined operation/fence instructions,
and operations on 8-bit and 16-bit values, and they impose unnecessary
restrictions: only 32-bit and 64-bit atomic operations are provided.

See [[discussion]] for proposed operations and thoughts. TODO
remove this sentence


# DRAFT atomic instructions

These two instructions, `lat` and `stat`, are identical
to `lwat/ldat` and `stwat/stdat` except that they add guaranteed
acquire and release ordering semantics, as well as 8-bit and
16-bit memory widths.

AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew
* stat RS,RA,FC,ew
* stataq RS,RA,FC,ew
* statrl RS,RA,FC,ew
* stataqrl RS,RA,FC,ew

**DRAFT** encoding under EXT031. The XO values are chosen to be near
those of the existing atomic memory operations.

|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name        | Form      |
|-- | -- | --- | --- |--|--|---- |------|--|------------|-----------|
|31 | RT | RA  | FC  |aq|rl|ew   |000101|/ |lat[aq][rl] | TODO-Form |
|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat[aq][rl]| TODO-Form |

* `ew` specifies the memory operation width: the values 0, 1, 2 and 3
  select 8, 16, 32 and 64 bits respectively.
* If the `aq` bit is set,
  then no later atomic memory operations can be observed
  to take place before the AMO, in this or other cores.
  (A Write-after-Read Memory Hazard is created.)
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
  (A Read-after-Write Memory Hazard is created.)
* Setting both the `aq` and the `rl` bit makes the AMO
  sequentially consistent, meaning that
  it cannot be reordered with respect to earlier or later atomic
  memory operations. (Both a RaW and a WaR hazard are created
  simultaneously.)
* `FC` uses the same Function tables as Power ISA v3 `lwat`
  and `stwat`.
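
The draft field layout above can be sketched as a small encoder. This is
only an illustration, not part of the proposal: the helper names
(`field`, `encode_amo`, `mnemonic`) are hypothetical, and it assumes that
bit 21 is the `aq` flag and bit 22 the `rl` flag, in the same order as the
mnemonic suffixes. Bit numbering is MSB0, as elsewhere in the Power ISA.

```python
# Hypothetical encoder for the DRAFT lat/stat layout (illustration only).
# Power ISA numbers bits MSB-first: bit 0 is the most significant of 32.

XO_LAT = 0b000101    # bits 25..30 for lat[aq][rl]
XO_STAT = 0b100101   # bits 25..30 for stat[aq][rl]
EW_BITS = {8: 0, 16: 1, 32: 2, 64: 3}  # ew field value per width in bits


def field(value, first, last):
    """Place value into the MSB0 bit range first..last of a 32-bit word."""
    width = last - first + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in bits {first}..{last}")
    return value << (31 - last)


def encode_amo(store, reg, ra, fc, width, aq=False, rl=False):
    """Encode lat (store=False) or stat (store=True); reg is RT or RS."""
    xo = XO_STAT if store else XO_LAT
    # Bit 31 is reserved in the draft table and is left as 0 here.
    return (field(31, 0, 5)                    # primary opcode (EXT031)
            | field(reg, 6, 10)                # RT for lat, RS for stat
            | field(ra, 11, 15)
            | field(fc, 16, 20)
            | field(int(aq), 21, 21)           # ASSUMED position of aq
            | field(int(rl), 22, 22)           # ASSUMED position of rl
            | field(EW_BITS[width], 23, 24)
            | field(xo, 25, 30))


def mnemonic(store, aq=False, rl=False):
    """Build the mnemonic from the list above, e.g. 'lataqrl'."""
    return ("stat" if store else "lat") + ("aq" if aq else "") + ("rl" if rl else "")
```

Under these assumptions,
`encode_amo(False, 3, 4, 0b00000, 64, aq=True, rl=True)` would be the word
for `lataqrl 3,4,0,3` (a 64-bit Fetch and Add with both ordering bits set).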

Read functions (Power ISA v3.1 Book II section 4.5.1, p1071):

|opcode| regs           | memory                 | description                 |
|------|----------------|------------------------|-----------------------------|
|00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
|00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
|00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
|00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
|00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
|00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
|00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
|00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
|01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
|10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
|11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
|11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
|11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |
Store functions:

|opcode| regs | memory    | description                 |
|------|------|-----------|-----------------------------|
|00000 | RS   | mem(EA,s) | Store Add                   |
|00001 | RS   | mem(EA,s) | Store XOR                   |
|00010 | RS   | mem(EA,s) | Store OR                    |
|00011 | RS   | mem(EA,s) | Store AND                   |
|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
|11000 | RS   | mem(EA,s) | Store Twin                  |

These functions are also recognised as being part of the
OpenCAPI Specification.