add ls001.po9 RFC

[libreriscv.git] / openpower / sv / ldst.mdwn
diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn

index 1623f513fd16cfc4c9f05df97ec0767f1a43b450..d486556ae33354953f02aa6de5ef599897027dd7 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -48,13 +48,14 @@ a number of different modes:
  * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
  
  *Despite being constructed from Scalar LD/ST none of these Modes exist
-or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
+or make sense in any Scalar ISA. They **only** exist in Vector ISAs
+and are a critical part of its value*.
  
  Also included in SVP64 LD/ST is both signed and unsigned Saturation,
  as well as Element-width overrides and Twin-Predication.
  
-Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
-LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
+Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
+LD/ST Immediate Defined Words *and* LD/ST Indexed Defined Words.
  LD/ST-Indexed should not be conflated with Indexed REMAP mode:
  clarification is provided below.
  
@@ -71,7 +72,6 @@ alternative table definition for [[sv/svp64]] `RM.MODE`.  The following
  modes make sense:
  
  * saturation
-* predicate-result would be useful but is lower priority than Data-Dependent Fail-First
  * simple (no augmentation)
  * fail-first (where Vector Indexed is banned)
  * Signed Effective Address computation (Vector Indexed only)
@@ -123,8 +123,7 @@ The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
  | 0 | 0 | 0   |  zz els | simple mode                |
  | 0 | 0 | 1   | PI  LF  | post-increment and Fault-First  |
  | 1 | 0 |   N | zz  els |  sat mode: N=0/1 u/s       |
-|VLi| 1 | inv | CR-bit  | Rc=1: ffirst CR sel        |
-|VLi| 1 | inv | els RC1 |  Rc=0: ffirst z/nonz       |
+|VLi| 1 | inv | CR-bit  | ffirst CR sel             |
  
  The `els` bit is only relevant when `RA.isvec` is clear: this indicates
  whether stride is unit or element:
@@ -178,8 +177,7 @@ but are the same `RM.MODE` bits (19:23 of `RM`):
  | 0 | 1 |  2  |  3   4  |  description               |
  |---|---| --- |---------|--------------------------- |
  |els| 0 | SEA |  dz  sz | simple mode        |
-|VLi| 1 | inv | CR-bit  | Rc=1: ffirst CR sel        |
-|VLi| 1 | inv | els RC1 |  Rc=0: ffirst z/nonz       |
+|VLi| 1 | inv | CR-bit  | ffirst CR sel        |
  
  Vector Indexed Strided Mode is qualified as follows:
  
@@ -389,7 +387,7 @@ may be found in [[sv/remap]]:
  Thus it can be seen that the use of Indexed REMAP saves copying
  and manual reordering of the Vector of RB offsets.
  
-## LD/ST ffirst
+## LD/ST ffirst (Fault-First)
  
  LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
  is not active) as an ordinary one, with all behaviour with respect to
@@ -447,10 +445,30 @@ FFirst where Vertical-First Mode is fully deterministic, not speculative.
  ## Data-Dependent Fail-First (not Fail/Fault-First)
  
  Not to be confused with Fail/Fault First, Data-Fail-First performs an
-additional check on the data into a Condition Register Field and if a test
-on the CR Field fails then VL is truncated and further looping terminates.
+additional check on the data, and if the test
+fails then VL is truncated and further looping terminates.
  This is precisely the same as Arithmetic Data-Dependent Fail-First,
-the only difference being that the result comes from the LD/ST.
+the only difference being that the result comes from the LD/ST
+rather than from an Arithmetic operation.
+
+There is also a crucial difference between Arithmetic and LD/ST
+Data-Dependent Fail-First: except for Store-Conditional, a 4-bit Condition
+Register Field is created for testing purposes
+*but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
+The CR Field is not stored because Load/Store, particularly
+the Update instructions, is already expensive in register terms,
+and adding an extra Vector write would be too costly in hardware.
+
+*Programmer's note: Programmers
+may use Data-Dependent Load with a test to truncate VL, and may then
+follow up with a `sv.cmpi` or other operation. The important aspect is
+that the Vector Load truncated on finding a NULL pointer, for example.*
+
+*Programmer's note: Load-with-Update may be used to update
+the register used in Effective Address computation of the
+next element.  This may be used to perform single-linked-list
+walking, where Data-Dependent Fail-First terminates and
+truncates the Vector at the first NULL.*
  
  In the case of Store operations there is a quirk when VLi (VL inclusive
  is "Valid") is clear. Bear in mind the criteria is that the truncated
@@ -458,8 +476,7 @@ Vector of results, when VLi is clear, must all pass the "test", but when
  VLi is set the *current failed test* is permitted to be included.  Thus,
  the actual update (store) to Memory is **not permitted to take place**
  should the test fail. Therefore, on testing the value to be stored,
-and after updating the corresponding CR Field Element, when VLi=0 and
-finding that the test fails the Memory store must **not** occur.
+when VLi=0 and finding that the test fails the Memory store must **not** occur.
  
  Additionally, when VLi=0 and a test fails then RA does **not** receive a
  copy of the Effective Address.  Hardware implementations with Out-of-Order
@@ -477,20 +494,26 @@ nodes.  If VLi=1 it will load the NULL pointer into the Vector of results.
  If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
  one Element earlier.
  
+*Programmer's Note: by setting the RC1 qualifier as well as
+VLi=1 it is possible to establish a Predicate Mask such that the first
+zero in the predicate will be the NULL pointer.*
+
  ```
     RT=1 # vec - deliberately overlaps by one with RA
     RA=0 # vec - first one is valid, contains ptr
     imm = 8 # offset_of(ptr->next)
     for i in range(VL):
+       # this part is the Scalar Defined Word (standard scalar ld operation)
         EA = GPR(RA+i) + imm          # ptr + offset(next)
         data = MEM(EA, 8)             # 64-bit address of ptr->next
         GPR(RT+i) = data              # happens to be read on next loop!
-       # was a normal ld up to this point. now the Data-Fail-First
-       CR.field(i) = conditions(data)
-       if CR.field(i).EQ == testbit: # check if zero
-           if VLI then   VL = i+1                   # update VL, inclusive
-           else          VL = i                     # update VL
-           break                     # stop looping
+       # was a normal vector-ld up to this point. now the Data-Fail-First
+       cr_test = conditions(data)
+       if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
+       if cr_test.EQ == testbit:             # check if zero
+           if VLI then   VL = i+1            # update VL, inclusive
+           else          VL = i              # update VL, exclusive current
+           break                             # stop looping
  ```
  
  **Data-Dependent Fault-First on Store-Conditional (Rc=1)**
@@ -498,16 +521,27 @@ one Element earlier.
  There are very few instructions that allow Rc=1 for Load/Store:
  one of those is the `stdcx.` and other Atomic Store-Conditional
  instructions.  With Simple-V being a loop around Scalar instructions
-strictly obeying Scalar Program Order a Fail-First loop on an
-Atomic Store-Conditional will always fail the second and all other
-Store-Conditional instructions in Horizontal-First Mode because
+strictly obeying Scalar Program Order a Horizontal-First Fail-First loop
+on an Atomic Store-Conditional will always fail the second and all other
+Store-Conditional instructions because
  Load-Reservation and Store-Conditional are required to be executed
  in pairs.
  
  By contrast, in Vertical-First Mode it is in fact possible to issue
  the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
-useful.  Care should be taken however when VL is truncated in Vertical-First
-Mode.
+useful.
+
+Programmer's note: Care should be taken when VL is truncated in
+Vertical-First Mode.
+
+**Future potential**
+
+Although Rc=1 on LD/ST is a rare occurrence at present, future versions
+of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions. With
+the SVP64 Vectorisation Prefixing being a RISC-paradigm that
+is fully independent of the Scalar Suffix Defined Words, prohibiting
+the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
+operations is not strategically sound.
  
  ## LOAD/STORE Elwidths <a name="elwidth"></a>
  
@@ -682,6 +716,15 @@ established through `svstep`, are also an easy way to perform regular
  Structure Packing, at the vec2/vec3/vec4 granularity level.  Beyond that,
  REMAP will need to be used.
  
+**Parallel Reduction REMAP**
+
+No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
+is completely separate from the RISC-paradigm Scalar Defined Words.  Although
+obscure, there does exist the outside possibility that Parallel Reduction
+Schedules on LD/ST would find a use in Computer Science.
+Readers are invited to contact the authors of this document if one is ever
+found.
+
  --------
  
  [[!tag standards]]
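
Note on the linked-list pseudocode in the Data-Dependent Fail-First hunk: the loop can be modelled in a few lines of Python to check how VLi affects the truncated VL. This is only a sketch under stated assumptions and not part of the patch: the name `ddff_ld_walk`, the dictionary-based memory, the node addresses and the "unmapped reads return 0" behaviour are all hypothetical, and the test is fixed to "data is zero" rather than a general CR Field test.

```python
def ddff_ld_walk(gpr, mem, vl, vli, ra=0, rt=1, imm=8):
    """Toy model of the pseudocode loop; returns the (possibly truncated) VL."""
    for i in range(vl):
        ea = gpr[ra + i] + imm      # ptr + offset_of(ptr->next)
        data = mem.get(ea, 0)       # assumption: unmapped addresses read as 0
        gpr[rt + i] = data          # RT overlaps RA by one: read on next loop
        if data == 0:               # Data-Dependent test: NULL pointer found
            return i + 1 if vli else i   # inclusive vs exclusive truncation
    return vl


# Three nodes at hypothetical addresses: 0x100 -> 0x200 -> 0x300 -> NULL.
# Only the ptr->next field (offset 8) of each node is modelled.
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}


def run(vli):
    gpr = [0] * 16
    gpr[0] = 0x100                  # first element of RA holds the list head
    return ddff_ld_walk(gpr, mem, vl=8, vli=vli), gpr
```

With VLi set, the element holding the NULL pointer is counted in the resulting VL; with VLi clear, VL stops one element earlier. As in the pseudocode, the register write happens before the test in both cases.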