# Draft proposal for improved atomic operations for the Power ISA

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)

# Motivation

Power ISA currently has some issues with its atomic operations support.
These are exacerbated by 3D data-structure processing in 3D shader
binaries, which needs on the order of 10^5 or more atomic locks per
second per SMP core.

## Power ISA's current atomic operations are inefficient

Implementations have a hard time recognizing existing atomic operations
via macro-op fusion because they would often have to detect and fuse a
large number of instructions, including branches. This is contrary
to the RISC paradigm.

There is also the issue that Power ISA's memory fences are unnecessarily
strong, particularly `isync`, which is used for many `acquire` and
stronger fences. `isync` forces the CPU to do a full pipeline flush,
which is unnecessary when all that is needed is a memory barrier.

`atomic_fetch_add_seq_cst` is 6 instructions including a loop:

```
# address in r4, addend in r5
    sync
loop:
    ldarx 3, 0, 4
    add 6, 5, 3
    stdcx. 6, 0, 4
    bne 0, loop
    lwsync
# output in r3
```
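To illustrate why the sequence above needs a retry loop, here is a toy Python model of the load-reserve/store-conditional pair. It is only a sketch: `ToyMemory`, its single shared reservation, and the `interfere` hook are hypothetical simplifications, and the `sync`/`lwsync` fences are not modelled at all.

```python
MASK64 = (1 << 64) - 1


class ToyMemory:
    """Toy memory with one reservation (an assumption: real cores each hold their own)."""

    def __init__(self):
        self.data = {}
        self.reserved = None  # address currently reserved, if any

    def ldarx(self, addr):
        # Load and establish a reservation on addr.
        self.reserved = addr
        return self.data.get(addr, 0)

    def store(self, addr, value):
        # Plain store, standing in for another core; it cancels a reservation on addr.
        if self.reserved == addr:
            self.reserved = None
        self.data[addr] = value

    def stdcx(self, addr, value):
        # Store only if the reservation is still held; the reservation is always consumed.
        ok = self.reserved == addr
        if ok:
            self.data[addr] = value
        self.reserved = None
        return ok


def fetch_add(mem, addr, addend, interfere=None):
    """Retry until stdcx. succeeds; returns (old value, number of attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.ldarx(addr)
        if interfere is not None and attempts == 1:
            interfere(mem)  # hypothetical hook: a write lands between ldarx and stdcx.
        if mem.stdcx(addr, (old + addend) & MASK64):
            return old, attempts
```

With no interference the loop body runs once; a store that lands between `ldarx` and `stdcx.` forces a second pass, which is the branch hardware would have to recognise when fusing the sequence.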

`atomic_load_seq_cst` is 5 instructions, including a branch, and an
unnecessarily-strong memory fence:

```
# address in r3
    sync
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

`atomic_compare_exchange_strong_seq_cst` is 7 instructions, including
a loop with 2 branches, and an unnecessarily-strong memory fence:

```
# address in r4, compared-to value in r5, replacement value in r6
    sync
loop:
    ldarx 3, 0, 4
    cmpd 0, 3, 5
    bne 0, not_eq
    stdcx. 6, 0, 4
    bne 0, loop
not_eq:
    isync
# output loaded value in r3, store-occurred flag in cr0.eq
```

`atomic_load_acquire` is 4 instructions, including a branch and an
unnecessarily-strong memory fence:

```
# address in r3
    ld 3, 0(3)
    cmpw 0, 3, 3
    bne- 0, skip
    isync
skip:
# output in r3
```

Having single atomic operations is useful for implementations that want to
send atomic operations to a shared cache, since they can be executed more
efficiently there than by moving a whole cache block. Relying exclusively on
TODO

## Power ISA doesn't align well with C++11 atomics

[P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):

> Existing implementation schemes on Power and ARM are not correct with
> respect to the current memory model definition. These implementation
> schemes can lead to results that are disallowed by the current memory
> model when the user combines acquire/release ordering with seq_cst
> ordering. On some architectures, especially Power and Nvidia GPUs, it
> is expensive to repair the implementations to satisfy the existing
> memory model. Details are discussed in (Lahav et al)
> http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
> on heavily).

## Power ISA's Atomic-Memory-Operations have issues

See Power ISA v3.1 Book II section 4.5: Atomic Memory Operations.

They are still missing better fences, combined operation/fence
instructions, and operations on 8-bit and 16-bit values (only 32-bit and
64-bit atomic operations are provided), and they come with unnecessary
restrictions.

Load (read) operations, v3.1 Book II section 4.5.1, p1071:

| FC    | Registers      | Memory operands         | Operation                   |
| ----- | -------------- | ----------------------- | --------------------------- |
| 00000 | RT, RT+1       | mem(EA,s)               | Fetch and Add               |
| 00001 | RT, RT+1       | mem(EA,s)               | Fetch and XOR               |
| 00010 | RT, RT+1       | mem(EA,s)               | Fetch and OR                |
| 00011 | RT, RT+1       | mem(EA,s)               | Fetch and AND               |
| 00100 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Unsigned  |
| 00101 | RT, RT+1       | mem(EA,s)               | Fetch and Maximum Signed    |
| 00110 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Unsigned  |
| 00111 | RT, RT+1       | mem(EA,s)               | Fetch and Minimum Signed    |
| 01000 | RT, RT+1       | mem(EA,s)               | Swap                        |
| 10000 | RT, RT+1, RT+2 | mem(EA,s)               | Compare and Swap Not Equal  |
| 11000 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Bounded |
| 11001 | RT             | mem(EA,s) mem(EA+s, s)  | Fetch and Increment Equal   |
| 11100 | RT             | mem(EA-s,s) mem(EA, s)  | Fetch and Decrement Bounded |

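Store operations:

| FC    | Register | Memory operand | Operation              |
| ----- | -------- | -------------- | ---------------------- |
| 00000 | RS       | mem(EA,s)      | Store Add              |
| 00001 | RS       | mem(EA,s)      | Store XOR              |
| 00010 | RS       | mem(EA,s)      | Store OR               |
| 00011 | RS       | mem(EA,s)      | Store AND              |
| 00100 | RS       | mem(EA,s)      | Store Maximum Unsigned |
| 00101 | RS       | mem(EA,s)      | Store Maximum Signed   |
| 00110 | RS       | mem(EA,s)      | Store Minimum Unsigned |
| 00111 | RS       | mem(EA,s)      | Store Minimum Signed   |
| 11000 | RS       | mem(EA,s)      | Store Twin             |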

These operations are recognised as being part of the OpenCAPI
Specification. Of them, the ones I was already going to propose are:

* fetch_add
* fetch_xor
* fetch_or
* fetch_and
* fetch_umax
* fetch_smax
* fetch_umin
* fetch_smin
* exchange

as well as a few I wasn't going to propose (they seem less useful to me):

* compare-and-swap-not-equal
* fetch-and-increment-bounded
* fetch-and-increment-equal
* fetch-and-decrement-bounded
* store-twin
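The nine operations in the first list all share one shape: return the old memory value and store a combination of it with a register operand. A minimal Python reference model of that shape is sketched below. The function codes come from the load table above; the dictionary-as-memory, the `amo_fetch_op` and `to_signed` names, and the exact operand handling are assumptions made for illustration, not the ISA's definition.

```python
def to_signed(value, bits):
    """Interpret an unsigned `bits`-wide value as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


# Function code -> (name, combine(old, src, bits)); names follow the list above.
FETCH_OPS = {
    0b00000: ("fetch_add", lambda old, src, bits: old + src),
    0b00001: ("fetch_xor", lambda old, src, bits: old ^ src),
    0b00010: ("fetch_or", lambda old, src, bits: old | src),
    0b00011: ("fetch_and", lambda old, src, bits: old & src),
    0b00100: ("fetch_umax", lambda old, src, bits: max(old, src)),
    0b00101: ("fetch_smax",
              lambda old, src, bits: max(old, src, key=lambda v: to_signed(v, bits))),
    0b00110: ("fetch_umin", lambda old, src, bits: min(old, src)),
    0b00111: ("fetch_smin",
              lambda old, src, bits: min(old, src, key=lambda v: to_signed(v, bits))),
    0b01000: ("exchange", lambda old, src, bits: src),
}


def amo_fetch_op(mem, addr, fc, src, bits=64):
    """Apply one fetch-and-op to mem[addr]; returns the old value (assumed semantics)."""
    mask = (1 << bits) - 1
    _name, combine = FETCH_OPS[fc]
    old = mem[addr] & mask
    mem[addr] = combine(old, src & mask, bits) & mask
    return old
```

For example, with `mem = {0: 0xFFFFFFFF}` and `bits=32`, the unsigned maximum against 1 leaves memory unchanged, while the signed maximum treats the old value as -1 and stores 1.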

The spec also basically says that the atomic memory operations are only
intended for when you want to do atomic operations on memory, but don't
want that memory to be loaded into your L1 cache.

IMHO that restriction is specifically *not* wanted, because there are
plenty of cases where atomic operations should happen in your L1 cache.

I'd guess that part of why those atomic operations weren't included in
gcc or clang as the default implementation of atomic operations (when
the appropriate ISA feature is enabled) is because of that restriction.

IMHO the CPU should be able to (but not be required to) predict whether
to send an atomic operation to the L2 cache, L3 cache, etc., or memory,
or to execute it directly in the L1 cache. The prediction could be based
on how often that cache block is accessed from different CPUs, e.g. by
keeping a small saturating counter and a last-accessing-CPU field. The
counter would track how many times in a row the same CPU has accessed
the block. Once that count exceeds some limit, the operation is done in
the L1 cache. If the limit hasn't been reached, or a different CPU
accesses the block, the operation is done in the L2/L3/etc. cache.
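The predictor idea can be sketched as a small Python model. This is only an illustration under stated assumptions: the class name, the default `limit` and `counter_max` values, and keeping the state in a dictionary keyed by block address are all hypothetical; the proposal does not say where the state lives or how large it is.

```python
class AmoPlacementPredictor:
    """Per-block saturating run-length counter plus last-accessing-CPU field."""

    def __init__(self, limit=3, counter_max=7):
        self.limit = limit              # runs longer than this execute in L1
        self.counter_max = counter_max  # saturation point of the counter
        self.entries = {}               # block address -> (last_cpu, run_length)

    def access(self, block, cpu):
        """Record an atomic access and return where it should execute."""
        last_cpu, count = self.entries.get(block, (None, 0))
        if cpu == last_cpu:
            # Same CPU again: extend the run, saturating at counter_max.
            count = min(count + 1, self.counter_max)
        else:
            # A different CPU touched the block: restart the run.
            count = 1
        self.entries[block] = (cpu, count)
        return "L1" if count > self.limit else "shared"
```

With `limit=2`, three consecutive accesses from the same CPU return `"shared"`, `"shared"`, `"L1"`, and a single access from another CPU drops the block back to `"shared"`.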

# TODO: add list of proposed instructions

AT-Form (TODO)

* lat RT,RA,FC,ew
* lataq RT,RA,FC,ew
* latrl RT,RA,FC,ew
* lataqrl RT,RA,FC,ew

| 0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31| name        | Form      |
| -- | -- | --- | --- |--|--|---- |------|--| ----------- | --------- |
| NN | RT | RA  | FC  |aq|rl|ew   |xxxxxx|/ | lat[aq][rl] | TODO-Form |

* If the `aq` bit is set, then no later atomic memory operations can be
  observed to take place before the AMO.
* If the `rl` bit is set, then other cores will not observe the AMO before
  memory accesses preceding the AMO.
* Setting both the `aq` and the `rl` bits makes the AMO sequentially
  consistent, meaning that it cannot be reordered with earlier or later
  atomic memory operations.
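As a sanity check on the field layout, the sketch below packs the fields from the table into a 32-bit word using Power ISA's MSB0 bit numbering (bit 0 is the most significant). It assumes bit 21 carries `aq` and bit 22 carries `rl`. The primary opcode `NN` and the `xxxxxx` extended opcode are unassigned in this draft, so they are plain parameters here, and `encode_lat` and `lat_mnemonic` are hypothetical helper names.

```python
# (field name, width in bits, shift from the least significant bit)
LAT_FIELDS = [
    ("nn", 6, 26),  # bits 0-5: primary opcode, unassigned in the draft
    ("rt", 5, 21),  # bits 6-10
    ("ra", 5, 16),  # bits 11-15
    ("fc", 5, 11),  # bits 16-20: function code
    ("aq", 1, 10),  # bit 21 (assumed to be aq)
    ("rl", 1, 9),   # bit 22 (assumed to be rl)
    ("ew", 2, 7),   # bits 23-24
    ("xo", 6, 1),   # bits 25-30: extended opcode, unassigned in the draft
]


def encode_lat(**fields):
    """Pack the named fields into one instruction word; bit 31 stays 0."""
    word = 0
    for name, width, shift in LAT_FIELDS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def lat_mnemonic(aq, rl):
    """Build the mnemonic from the two ordering bits."""
    return "lat" + ("aq" if aq else "") + ("rl" if rl else "")
```

Calling `lat_mnemonic(1, 1)` gives `lataqrl`, the sequentially consistent form described above, and `encode_lat` rejects any field value that does not fit its width.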