add notes on 2024-01-23 meeting. terminated due to harassment

[libreriscv.git] / openpower / atomics.mdwn
diff --git a/openpower/atomics.mdwn b/openpower/atomics.mdwn
index 53e4fb8a82d511e801a223e9a351d53f7b282b4b..bf1a055e0152d5e791cd94da9dc7a88950592036 100644
--- a/openpower/atomics.mdwn
+++ b/openpower/atomics.mdwn
@@ -1,18 +1,37 @@
  # Draft proposal for improved atomic operations for the Power ISA
  
-<https://bugs.libre-soc.org/show_bug.cgi?id=236>
+**NOTE THIS PROPOSAL IS NOT BEING SUBMITTED DUE TO
+DISCOVERY DURING INVESTIGATION THAT ATOMICS ARE DESIGNED
+FOR MASSIVE DISTRIBUTED CLUSTERS. SIGNIFICANT ADDITIONAL
+RESEARCH IS REQUIRED SO THIS PROPOSAL IS PUT ON HOLD
+UNTIL BUDGET IS AVAILABLE**
  
-## Motivation
+Links:
+
+* <https://bugs.libre-soc.org/show_bug.cgi?id=236>
+* [OpenCAPI spec](http://opencapi.org/wp-content/uploads/2017/02/OpenCAPI-TL.WGsec_.V3p1.2017Jan27.pdf) p47-49 for AMO section
+* [RISC-V A](https://github.com/riscv/riscv-isa-manual/blob/master/src/a.tex)
+* [[atomics/discussion]]
+* <http://www.rdrop.com/~paulmck/scalability/paper/N2745r.2011.03.04a.html>
+
+TODO:
+
+* investigate Power ISA 3.1 p1077 eh hint
+
+
+# Motivation
  
  Power ISA currently has some issues with its atomic operations support,
-which are exacerbated by 3D Data structure processing needing
-of the order of 10^5 or greater SMP atomic locks per second.
+which are exacerbated by 3D Data structure processing in 3D
+Shader Binaries needing
+of the order of 10^5 or greater atomic locks per second per SMP Core.
  
-### Power ISA's current atomic operations are inefficient
+## Power ISA's current atomic operations are inefficient
  
  Implementations have a hard time recognizing existing atomic operations
  via macro-op fusion because they would often have to detect and fuse a
-large number of instructions, including branches.
+large number of instructions, including branches. This is contrary
+to the RISC paradigm.
  
  There is also the issue that PowerISA's memory fences are unnecessarily
  strong, particularly `isync` which is used for a lot of `acquire` and
@@ -78,8 +97,9 @@ skip:
  ```
  
  Having single atomic operations is useful for implementations that want to send atomic operations to a shared cache since they can be more efficient to execute there, rather than having to move a whole cache block. Relying exclusively on 
+TODO
  
-### Power ISA doesn't align well with C++11 atomics
+## Power ISA doesn't align well with C++11 atomics
  
  [P0668R5: Revising the C++ memory model](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html):
  
@@ -93,7 +113,7 @@ Having single atomic operations is useful for implementations that want to send
  > http://plv.mpi-sws.org/scfix/paper.pdf (which this discussion relies
  > on heavily).
  
-### Power ISA's Atomic-Memory-Operations have issues
+## Power ISA's Atomic-Memory-Operations have issues
  
  PowerISA v3.1 Book II section 4.5: Atomic Memory Operations
  
@@ -103,43 +123,76 @@ unnecessary restrictions:
  
  it has only 32-bit and 64-bit atomic operations.
  
-the operations it has that I was going to propose:
-fetch_add
-fetch_xor
-fetch_or
-fetch_and
-fetch_umax
-fetch_smax
-fetch_umin
-fetch_smin
-exchange
-
-as well as a few I wasn't going to propose (they seem less useful to me):
-compare-and-swap-not-equal
-fetch-and-increment-bounded
-fetch-and-increment-equal
-fetch-and-decrement-bounded
-store-twin
-
-The spec also basically says that the atomic memory operations are only
-intended for when you want to do atomic operations on memory, but don't
-want that memory to be loaded into your L1 cache.
-
-imho that restriction is specifically *not* wanted, because there are
-plenty of cases where atomic operations should happen in your L1 cache.
-
-I'd guess that part of why those atomic operations weren't included in
-gcc or clang as the default implementation of atomic operations (when
-the appropriate ISA feature is enabled) is because of that restriction.
-
-imho the cpu should be able to (but not required to) predict whether to
-send an atomic operation to L2-cache/L3-cache/etc./memory or to execute
-it directly in the L1 cache. The prediction could be based on how often
-that cache block was accessed from different cpus, e.g. by having a
-small saturating counter and a last-accessing-cpu field, where it would
-count how many times the same cpu accessed it in a row, sending it to the
-L1 cache if that's more than some limit, otherwise doing the operation
-in the L2/L3/etc.-cache if the limit wasn't reached or a different cpu
-tried to access it.
-
-# TODO: add list of proposed instructions
+see [[discussion]] for proposed operations and thoughts TODO
+remove this sentence
+
+
+# DRAFT atomic instructions
+
+These two instructions, `lat` and `stat`, are identical
+to `lwat/ldat` and `stwat/stdat` except that they add guaranteed
+acquire and release ordering semantics as well as 8-bit and
+16-bit memory widths.
+
+AT-Form (TODO)
+
+* lat. RT,RA,FC,aq,rl,ew
+* stat. RS,RA,FC,aq,rl,ew
+
+**DRAFT** EXT031 and XO, these are near to the existing
+atomic memory operations
+
+|0.5|6.10|11.15|16.20|21|22|23.24|25.30 |31|name| Form      |
+|-- | -- | --- | --- |--|--|---- |------|--|----|-----------|
+|31 | RT | RA  | FC  |aq|rl|ew   |000101|Rc|lat | TODO-Form |
+|31 | RS | RA  | FC  |aq|rl|ew   |100101|/ |stat| TODO-Form |
+
+* `ew` specifies the memory operation width: 0/1/2/3 selects 8/16/32/64 bits
+* If the `aq` bit is set,
+  then no later atomic memory operations can be observed
+  to take place before the AMO in this or other cores.
+  (A global Write-after-Read Memory Hazard is created)
+* If the `rl` bit is set, then other cores will not observe the AMO before 
+  memory accesses preceding the AMO.
+  (A global Read-after-Write Memory Hazard is created)
+* Setting both the `aq` and the `rl` bit makes the sequence
+  sequentially consistent, meaning that
+  it cannot be reordered with respect to earlier or later atomic
+  memory operations. (Both a RaW and WaR are simultaneously created)
+* `FC` is identical to the Function tables used in Power ISA v3 for `lwat`
+  and `stwat`
+
+Read functions (Power ISA v3.1 Book II section 4.5.1, p1071):
+
+|opcode| regs           | memory                 | description                 |
+|------|----------------|------------------------|-----------------------------|
+|00000 | RT, RT+1       | mem(EA,s)              | Fetch and Add               |
+|00001 | RT, RT+1       | mem(EA,s)              | Fetch and XOR               |
+|00010 | RT, RT+1       | mem(EA,s)              | Fetch and OR                |
+|00011 | RT, RT+1       | mem(EA,s)              | Fetch and AND               |
+|00100 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Unsigned  |
+|00101 | RT, RT+1       | mem(EA,s)              | Fetch and Maximum Signed    |
+|00110 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Unsigned  |
+|00111 | RT, RT+1       | mem(EA,s)              | Fetch and Minimum Signed    |
+|01000 | RT, RT+1       | mem(EA,s)              | Swap                        |
+|10000 | RT, RT+1, RT+2 | mem(EA,s)              | Compare and Swap Not Equal  |
+|11000 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Bounded |
+|11001 | RT             | mem(EA,s) mem(EA+s, s) | Fetch and Increment Equal   |
+|11100 | RT             | mem(EA-s,s) mem(EA, s) | Fetch and Decrement Bounded |
+
+Store functions:
+
+|opcode| regs | memory    | description                 |
+|------|------|-----------|-----------------------------|
+|00000 | RS   | mem(EA,s) | Store Add                   |
+|00001 | RS   | mem(EA,s) | Store XOR                   |
+|00010 | RS   | mem(EA,s) | Store OR                    |
+|00011 | RS   | mem(EA,s) | Store AND                   |
+|00100 | RS   | mem(EA,s) | Store Maximum Unsigned      |
+|00101 | RS   | mem(EA,s) | Store Maximum Signed        |
+|00110 | RS   | mem(EA,s) | Store Minimum Unsigned      |
+|00111 | RS   | mem(EA,s) | Store Minimum Signed        |
+|11000 | RS   | mem(EA,s) | Store Twin                  |
+
+These functions are also recognised as being part of the
+OpenCAPI Specification.
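
As a rough illustration of the DRAFT field layout in the diff above, the following Python sketch packs the fields of `lat` and `stat` into a 32-bit word using Power ISA MSB0 bit numbering (bit 0 is the most significant bit). It is not part of the commit. The names `FIELDS`, `encode`, `decode`, `encode_lat` and `encode_stat` are hypothetical, and the sketch assumes that bits 21 and 22 carry the `aq` and `rl` flags described in the bullet list, that `ew` values 0/1/2/3 mean 8/16/32/64 bits, and that the draft XO values stay as listed in the table.

```python
# Hypothetical encoder for the DRAFT lat/stat layout; not part of the commit.
# Each entry is (field name, first bit, last bit) in MSB0 numbering,
# where bit 0 is the most significant bit of the 32-bit instruction.
FIELDS = [
    ("PO", 0, 5),     # primary opcode, 31 for both instructions
    ("RT", 6, 10),    # RT for lat, RS for stat
    ("RA", 11, 15),
    ("FC", 16, 20),   # function code, as in the lwat/stwat tables
    ("aq", 21, 21),   # ASSUMPTION: bit 21 is the acquire flag
    ("rl", 22, 22),   # ASSUMPTION: bit 22 is the release flag
    ("ew", 23, 24),   # element width selector
    ("XO", 25, 30),
    ("Rc", 31, 31),
]

XO_LAT = 0b000101    # draft value from the table
XO_STAT = 0b100101   # draft value from the table
EW_WIDTH_BITS = {0: 8, 1: 16, 2: 32, 3: 64}


def encode(values):
    """Pack a dict of field values into a 32-bit instruction word."""
    word = 0
    for name, first, last in FIELDS:
        width = last - first + 1
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        # MSB0 bit `last` sits (31 - last) positions above the LSB.
        word |= value << (31 - last)
    return word


def decode(word):
    """Split a 32-bit instruction word back into its named fields."""
    return {
        name: (word >> (31 - last)) & ((1 << (last - first + 1)) - 1)
        for name, first, last in FIELDS
    }


def encode_lat(rt, ra, fc, aq, rl, ew, rc=0):
    return encode({"PO": 31, "RT": rt, "RA": ra, "FC": fc,
                   "aq": aq, "rl": rl, "ew": ew, "XO": XO_LAT, "Rc": rc})


def encode_stat(rs, ra, fc, aq, rl, ew):
    # Bit 31 is shown as "/" (reserved) for stat, so it is left at zero.
    return encode({"PO": 31, "RT": rs, "RA": ra, "FC": fc,
                   "aq": aq, "rl": rl, "ew": ew, "XO": XO_STAT, "Rc": 0})


if __name__ == "__main__":
    # Fetch and Add (FC=00000), 64-bit, with both ordering flags set.
    word = encode_lat(rt=3, ra=4, fc=0b00000, aq=1, rl=1, ew=3)
    print(f"{word:#010x}", decode(word))
```

Keeping the layout in a single `FIELDS` table means the range check, the encoder and the decoder all follow the same bit positions, so a change to the draft encoding only has to be made in one place.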