replace "Defined word" with "Defined Word-instruction"

[libreriscv.git] / openpower / sv / ldst.mdwn
diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn

index bf38b6632eaa896c0a06c0b96297a4f093bf41bb..04f6de58232abc5f9a39676f5612296c59eca8a5 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -1,218 +1,252 @@
-[[!tag standards]]
-
  # SV Load and Store
  
+This section describes how Standard Load/Store Defined Word-instructions are exploited as
+Element-level Load/Stores and augmented to create direct equivalents of
+Vector Load/Store instructions.  
+
+<!-- hide -->
  Links:
  
  * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
+* <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
  * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
  * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
  * [[ldst/discussion]]
  
-# Rationale
+## Rationale
  
  All Vector ISAs dating back fifty years have extensive and comprehensive
  Load and Store operations that go far beyond the capabilities of Scalar
-RISC or CISC processors, yet at their heart on an individual element
+RISC and most CISC processors, yet at their heart on an individual element
  basis may be found to be no different from RISC Scalar equivalents.
  
-The resource savings from Vector LD/ST are significant and stem from
-the fact that one single instruction can trigger a dozen (or in some
-microarchitectures such as Cray or NEC SX Aurora) hundreds of element-level Memory accesses.
+The resource savings from Vector LD/ST are significant and stem
+from the fact that one single instruction can trigger a dozen (or in
+some microarchitectures such as Cray or NEC SX Aurora) hundreds of
+element-level Memory accesses.
  
  Additionally, and simply: if the Arithmetic side of an ISA supports
  Vector Operations, then in order to keep the ALUs 100% occupied the
  Memory infrastructure (and the ISA itself) correspondingly needs Vector
  Memory Operations as well.
  
-Vectorised Load and Store also presents an extra dimension (literally)
-which creates scenarios unique to Vector applications, that a Scalar
-(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
-add the modes typically found in *all* Scalable Vector ISAs,
-without changing the behaviour of the underlying Base
-(Scalar) v3.0B operations in any way.
+Vectorized Load and Store also presents an extra dimension (literally)
+which creates scenarios unique to Vector applications, that a Scalar (and
+even a SIMD) ISA simply never encounters: not even the complex Addressing
+Modes of the 68000 or S/360 resemble Vector Load/Store.
+SVP64 endeavours to add the
+modes typically found in *all* Scalable Vector ISAs, without changing the
+behaviour of the underlying Base (Scalar) v3.0B operations in any way.
+(The sole apparent exception is Post-Increment Mode on LD/ST-update
+instructions)
+<!-- show -->
  
-# Modes overview
+## Modes overview
  
-Vectorisation of Load and Store requires creation, from scalar operations,
+Vectorization of Load and Store requires creation, from scalar operations,
  a number of different modes:
  
  * **fixed aka "unit" stride** - contiguous sequence with no gaps
  * **element strided** - sequential but regularly offset, with gaps
  * **vector indexed** - vector of base addresses and vector of offsets
-* **Speculative fail-first** - where it makes sense to do so
+* **Speculative Fault-first** - where it makes sense to do so
+* **Data-Dependent Fail-First** - Conditional truncation of Vector Length
  * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
  
-*Despite being constructed from Scalar LD/ST none of these Modes
-exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
+*Despite being constructed from Scalar LD/ST none of these Modes exist
+or make sense in any Scalar ISA. They **only** exist in Vector ISAs
+and are a critical part of their value*.
  
-Also included in SVP64 LD/ST is both signed and unsigned Saturation,
-as well as Element-width overrides and Twin-Predication.
+Also included in SVP64 LD/ST are Element-width overrides and Twin-Predication.
  
-Note also that Indexed [[sv/remap]] mode may be applied to both
-v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
-LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
-is provided below.
+Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
+LD/ST Immediate Defined Word-instructions *and* LD/ST Indexed Defined Word-instructions.
+LD/ST-Indexed should not be conflated with Indexed REMAP mode:
+clarification is provided below.
  
  **Determining the LD/ST Modes**
  
  A minor complication (caused by the retro-fitting of modern Vector
  features to a Scalar ISA) is that certain features do not exactly make
-sense or are considered a security risk.  Fail-first on Vector Indexed
-would allow attackers to probe large numbers of pages from userspace, where
-strided fail-first (by creating contiguous sequential LDs) does not.
+sense or are considered a security risk.  Fault-first on Vector Indexed
+would allow attackers to probe large numbers of pages from userspace,
+where strided Fault-first (by creating contiguous sequential LDs likely
+to be in the same Page) does not.
  
-In addition, reduce mode makes no sense.
-Realistically we need
-an alternative table meaning for [[sv/svp64]] mode.  The following modes make sense:
+In addition, reduce mode makes no sense.  Realistically we need an
+alternative table definition for [[sv/svp64]] `RM.MODE`.  The following
+modes make sense:
  
-* saturation
-* predicate-result (mostly for cache-inhibited LD/ST)
-* normal
-* fail-first (where Vector Indexed is banned)
-* Signed Effective Address computation (Vector Indexed only)
-* Pack/Unpack (on LD/ST immediate operations only)
+* simple (no augmentation)
+* Fault-first (where Vector Indexed is banned)
+* Data-dependent Fail-First (extremely useful for Linked-List pointer-chasing)
+* Signed Effective Address computation (Vector Indexed only, on RB)
  
  More than that however it is necessary to fit the usual Vector ISA
-capabilities onto both Power ISA LD/ST with immediate and to
-LD/ST Indexed. They present subtly different Mode tables, which, due
-to lack of space, have the following quirks:
+capabilities onto both Power ISA LD/ST with immediate and to LD/ST
+Indexed. They present subtly different Mode tables, which, due to lack
+of space, have the following quirks:
  
  * LD/ST Immediate has no individual control over src/dest zeroing,
    whereas LD/ST Indexed does.
-* LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
-* LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
+* LD/ST Immediate has saturation but LD/ST Indexed does not.
  
-# Format and fields
+## Format and fields
  
  Fields used in tables below:
  
-* **sz / dz**  if predication is enabled will put zeros into the dest (or as src in the case of twin pred) when the predicate bit is zero.  otherwise the element is ignored or skipped, depending on context.
  * **zz**: both sz and dz are set equal to this flag.
-* **inv CR bit** just as in branches (BO) these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **N** sets signed/unsigned saturation.
+  If predication is enabled, zeros are put into the dest
+  (or used as src in the case of twin pred) when the predicate bit is zero.
+  Otherwise the element is ignored or skipped, depending on context.
+* **inv CR bit** just as in branches (BO) these bits allow testing of
+  a CR bit and whether it is set (inv=0) or unset (inv=1)
  * **RC1** as if Rc=1, stores CRs *but not the result*
  * **SEA** - Signed Effective Address, if enabled performs sign-extension on
    registers that have been reduced due to elwidth overrides
+* **PI** - post-increment mode (applies to LD/ST with update only).
+  The Effective Address utilised is always just RA, i.e. the computation of
+  EA is stored in RA **after** it is actually used.
+* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
+  may be truncated to (at least) one element, and VL altered to indicate such.
+* **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
+  in the Truncated Vector.
+* **els** - Element-strided Mode: the element index (after REMAP)
+  is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
+
+When VLi=0 on Store Operations the Memory update does **not** take place
+on the element that failed.  EA does **not** update into RA on Load/Store
+with Update instructions either.
  
  **LD/ST immediate**
  
-The table for [[sv/svp64]] for `immed(RA)` is:
+The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
+(bits 19:23 of `RM`) is:
  
-| 0-1 |  2  |  3   4  |  description               |
-| --- | --- |---------|--------------------------- |
-| 00  | 0   |  zz els | normal mode                |
-| 00  | 1   |  zz els | Structured Pack/Unpack     |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
-| 01  | inv | els RC1 |  Rc=0: ffirst z/nonz       |
-| 10  |   N | zz  els |  sat mode: N=0/1 u/s       |
-| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel  |
-| 11  | inv | els RC1 |  Rc=0: pred-result z/nonz  |
+| 0 | 1 |  2  |  3   4  |  description               |
+|---|---| --- |---------|--------------------------- |
+|els| 0 | PI  |  zz LF  | post-increment and Fault-First  |
+|VLi| 1 | inv | CR-bit  | Data-Dependent  ffirst CR sel   |
  
  The `els` bit is only relevant when `RA.isvec` is clear: this indicates
  whether stride is unit or element:
  
+```
      if RA.isvec:
          svctx.ldstmode = indexed
      elif els == 0:
          svctx.ldstmode = unitstride
      elif immediate != 0:
          svctx.ldstmode = elementstride
-
-An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
-in effect the multiplication of the immediate-offset by zero results
-in reading from the exact same memory location, *even with a Vector
-register*. (Normally this type of behaviour is reserved for the
-mapreduce modes)
-
-For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
-just the once and be copied, rather than hitting the Data Cache
-multiple times with the same memory read at the same location.
-The benefit of Cache-inhibited LD-splats is that it allows
-for memory-mapped peripherals to have multiple
-data values read in quick succession and stored in sequentially
-numbered registers (but, see Note below).
-
-For non-cache-inhibited ST from a vector source onto a scalar
-destination: with the Vector
-loop effectively creating multiple memory writes to the same location,
-we can deduce that the last of these will be the "successful" one. Thus,
-implementations are free and clear to optimise out the overwriting STs,
-leaving just the last one as the "winner".  Bear in mind that predicate
-masks will skip some elements (in source non-zeroing mode).
-Cache-inhibited ST operations on the other hand **MUST** write out
-a Vector source multiple successive times to the exact same Scalar
-destination. Just like Cache-inhibited LDs, multiple values may be
-written out in quick succession to a memory-mapped peripheral from
-sequentially-numbered registers.
+```
+
+An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
+the multiplication of the immediate-offset by zero results in reading from
+the exact same memory location, *even with a Vector register*. (Normally
+this type of behaviour is reserved for the mapreduce modes)
+
+For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
+the once and be copied, rather than hitting the Data Cache multiple
+times with the same memory read at the same location.  The benefit of
+Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
+to have multiple data values read in quick succession and stored in
+sequentially numbered registers (but, see Note below).
+
+For non-cache-inhibited ST from a vector source onto a scalar destination:
+with the Vector loop effectively creating multiple memory writes to
+the same location, we can deduce that the last of these will be the
+"successful" one. Thus, implementations are free and clear to optimise
+out the overwriting STs, leaving just the last one as the "winner".
+Bear in mind that predicate masks will skip some elements (in source
+non-zeroing mode).  Cache-inhibited ST operations on the other hand
+**MUST** write out a Vector source multiple successive times to the exact
+same Scalar destination. Just like Cache-inhibited LDs, multiple values
+may be written out in quick succession to a memory-mapped peripheral
+from sequentially-numbered registers.
  
  Note that any memory location may be Cache-inhibited
-(Power ISA v.1, Book III, 1.6.1, p1033)
+(Power ISA v3.1, Book III, 1.6.1, p1033)
+
+*Programmer's Note: an immediate also with a Scalar source as a "VSPLAT"
+mode is simply not possible: there are not enough Mode bits. One single
+Scalar Load operation may be used instead, followed by any arithmetic
+operation (including a simple mv) in "Splat" mode.*
  
  **LD/ST Indexed**
  
-The modes for `RA+RB` indexed version are slightly different:
+The modes for `RA+RB` indexed version are slightly different
+but are the same `RM.MODE` bits (19:23 of `RM`):
  
-| 0-1 |  2  |  3   4  |  description              |
-| --- | --- |---------|-------------------------- |
-| 00  | SEA |  dz  sz | normal mode        |
-| 01  | SEA | dz sz   | Strided (scalar only source)   |
-| 10  |   N | dz   sz |  sat mode: N=0/1 u/s |
-| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
-| 11  | inv | zz  RC1 |  Rc=0: pred-result z/nonz |
+| 0 | 1 |  2  |  3   4  |  description               |
+|---|---| --- |---------|--------------------------- |
+|els| 0 | PI  |  zz SEA | post-increment and Fault-First        |
+|VLi| 1 | inv | CR-bit  | Data-Dependent ffirst CR sel        |
  
  Vector Indexed Strided Mode is qualified as follows:
  
-    if mode = 0b01 and !RA.isvec and !RB.isvec:
+```
+    if els and !RA.isvec and !RB.isvec:
          svctx.ldstmode = elementstride
-
-A summary of the effect of Vectorisation of src or dest:
- 
-     imm(RA)  RT.v   RA.v   no stride allowed
-     imm(RA)  RT.s   RA.v   no stride allowed
-     imm(RA)  RT.v   RA.s   stride-select allowed
-     imm(RA)  RT.s   RA.s   not vectorised
-     RA,RB    RT.v  {RA|RB}.v Standard Indexed
-     RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
-     RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
-     RA,RB    RT.s  {RA&RB}.s not vectorised (scalar identity)
-
-Signed Effective Address computation is only relevant for
-Vector Indexed Mode, when elwidth overrides are applied.
-The source override applies to RB, and before adding to
-RA in order to calculate the Effective Address, if SEA is
-set RB is sign-extended from elwidth bits to the full 64
-bits.  For other Modes (ffirst, saturate),
-all EA computation with elwidth overrides is unsigned.
-
-Note that cache-inhibited LD/ST  when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially.  Even with scalar src a
-Cache-inhibited LD will read the same memory location *multiple times*, storing the result in successive Vector destination registers.  This because the cache-inhibit instructions are typically used to read and write memory-mapped peripherals.
-If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
-cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
-copying the one *scalar* value into multiple register destinations.
-
-Note also that cache-inhibited VSPLAT with Predicate-result is possible.
+```
+
+A summary of the effect of Vectorization of src or dest:
+
+```
+    imm(RA)  RT.v   RA.v   no stride allowed
+    imm(RA)  RT.s   RA.v   no stride allowed
+    imm(RA)  RT.v   RA.s   stride-select allowed
+    imm(RA)  RT.s   RA.s   not vectorized
+    RA,RB    RT.v  {RA|RB}.v Standard Indexed
+    RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
+    RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
+    RA,RB    RT.s  {RA&RB}.s not vectorized (scalar identity)
+```
+
+Signed Effective Address computation is only relevant for Vector Indexed
+Mode, when elwidth overrides are applied.  The source override applies to
+RB, and before adding to RA in order to calculate the Effective Address,
+if SEA is set then RB is sign-extended from elwidth bits to the full 64 bits.
+For other Modes (ffirst), all EA computation with elwidth
+overrides is unsigned.  RA is *never* altered (not truncated)
+by element-width overrides.
+
+Note that cache-inhibited LD/ST  when VSPLAT is activated will perform
+**multiple** LD/ST operations, sequentially.  Even with scalar src
+a Cache-inhibited LD will read the same memory location *multiple
+times*, storing the result in successive Vector destination registers.
+This is because the cache-inhibit instructions are typically used to read
+and write memory-mapped peripherals.  If a genuine cache-inhibited
+LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
+be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
+value into multiple register destinations.
+
+Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
  This allows for example to issue a massive batch of memory-mapped
  peripheral reads, stopping at the first NULL-terminated character and
-truncating VL to that point. No branch is needed to issue that large burst
-of LDs, which may be valuable in Embedded scenarios.
+truncating VL to that point. No branch is needed to issue that large
+burst of LDs, which may be valuable in Embedded scenarios.
  
-# Vectorisation of Scalar Power ISA v3.0B
+## Vectorization of Scalar Power ISA v3.0B
  
-Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]] and
-[[isa/fixedstore]] pseudocode to be of the form:
+Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
+and [[isa/fixedstore]] pseudocode to be of the form:
  
+```
      lbux RT, RA, RB
      EA <- (RA) + (RB)
      RT <- MEM(EA)
+```
  
  and for immediate variants:
  
+```
      lb RT,D(RA)
      EA <- RA + EXTS(D)
      RT <- MEM(EA)
+```
  
  Thus in the first example, the source registers may each be independently
  marked as scalar or vector, and likewise the destination; in the second
@@ -220,8 +254,12 @@ example only the one source and one dest may be marked as scalar or
  vector.
  
  Thus we can see that Vector Indexed may be covered, and, as demonstrated
-with the pseudocode below, the immediate can be used to give unit stride or element stride.  With there being no way to tell which from the Power v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.
+with the pseudocode below, the immediate can be used to give unit
+stride or element stride.  With there being no way to tell which from
+the Power v3.0B Scalar opcode alone, the choice is provided instead by
+the SV Context.
  
+```
      # LD not VLD!  format - ldop RT, immed(RA)
      # op_width: lb=1, lh=2, lw=4, ld=8
      op_load(RT, RA, op_width, immed, svctx, RAupdate):
@@ -232,7 +270,11 @@ with the pseudocode below, the immediate can be used to give unit stride or elem
          if (RA.isvec) while (!(ps & 1<<i)) i++;
          if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
          if (RT.isvec) while (!(pd & 1<<j)) j++;
-        if svctx.ldstmode == elementstride:
+        if postinc:
+            offs = 0; # added afterwards
+            if RA.isvec: srcbase = ireg[RA+i]
+            else         srcbase = ireg[RA]
+        elif svctx.ldstmode == elementstride:
            # element stride mode
            srcbase = ireg[RA]
            offs = i * immed              # j*immed for a ST
@@ -252,18 +294,22 @@ with the pseudocode below, the immediate can be used to give unit stride or elem
  
          # compute EA
          EA = srcbase + offs
-        # update RA?
-        if RAupdate: ireg[RAupdate+u] = EA;
          # load from memory
          ireg[RT+j] <= MEM[EA];
+        # check post-increment of EA
+        if postinc: EA = srcbase + immed;
+        # update RA?
+        if RAupdate: ireg[RAupdate+u] = EA;
          if (!RT.isvec)
              break # destination scalar, end now
          if (RA.isvec) i++;
          if (RAupdate.isvec) u++;
          if (RT.isvec) j++;
+```
  
  Indexed LD is:
- 
+
+```
      # format: ldop RT, RA, RB
      function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
        ps = get_pred_val(FALSE, RA); # predication on src
@@ -286,18 +332,27 @@ Indexed LD is:
          if (RAupdate.isvec) u++;
          if (RB.isvec) k++;
          if (RT.isvec) j++;
+```
+
+Note that Element-Strided uses the Destination Step because with both
+sources being Scalar as a prerequisite condition of activation of
+Element-Stride Mode, the source step (being Scalar) would never advance.
  
-Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source.  This because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.
+Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
+mode (`ldux`) to be effectively a *completely different* register from
+RA-as-a-source.  This is because there is room in svp64 to extend RA-as-src
+as well as RA-as-dest, both independently as scalar or vector *and*
+independently extending their range.
  
-*Programmer's note: being able to set RA-as-a-source
- as separate from RA-as-a-destination as Scalar is **extremely valuable**
- once it is remembered that Simple-V element operations must
- be in Program Order, especially in loops, for saving on
- multiple address computations. Care does have
- to be taken however that RA-as-src is not overwritten by
- RA-as-dest especially in element-strided Mode.*
+*Programmer's note: being able to set RA-as-a-source as separate from
+RA-as-a-destination as Scalar is **extremely valuable** once it is
+remembered that Simple-V element operations must be in Program Order,
+especially in loops, for saving on multiple address computations. Care
+does have to be taken however that RA-as-src is not overwritten by
+RA-as-dest unless intentionally desired, especially in element-strided
+Mode.*
  
-# LD/ST Indexed vs Indexed REMAP
+## LD/ST Indexed vs Indexed REMAP
  
  Unfortunately the word "Indexed" is used twice in completely different
  contexts, potentially causing confusion.
@@ -309,23 +364,23 @@ contexts, potentially causing confusion.
    Mode that can be applied to *any* instruction **including those
    named LD/ST Indexed**.
  
-Whilst it may be costly in terms of register reads to allow REMAP
-Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as
-`sv.ld *RT,RA,*RB`, or even misleadingly
-labelled  as redundant, firstly the strict
-application of the RISC Paradigm that Simple-V follows makes it awkward
-to consider *preventing* the application of Indexed REMAP to such
-operations, and secondly they are not actually the same at all.
+Whilst it may be costly in terms of register reads to allow REMAP Indexed
+Mode to be applied to any Vectorized LD/ST Indexed operation such as
+`sv.ld *RT,RA,*RB`, or even misleadingly labelled as redundant, firstly
+the strict application of the RISC Paradigm that Simple-V follows makes
+it awkward to consider *preventing* the application of Indexed REMAP to
+such operations, and secondly they are not actually the same at all.
  
  Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
  effectively performs an *in-place* re-ordering of the offsets, RB.
  To achieve the same effect without Indexed REMAP would require taking
  a *copy* of the Vector of offsets starting at RB, manually explicitly
-reordering them, and finally using the copy of re-ordered offsets in
-a non-REMAP'ed `sv.ld`.  Using non-strided LD as an example,
-pseudocode showing what actually occurs,
-where the pseudocode for `indexed_remap` may be found in [[sv/remap]]:
+reordering them, and finally using the copy of re-ordered offsets in a
+non-REMAP'ed `sv.ld`.  Using non-strided LD as an example, pseudocode
+showing what actually occurs, where the pseudocode for `indexed_remap`
+may be found in [[sv/remap]]:
  
+```
      # sv.ld *RT,RA,*RB with Index REMAP applied to RB
      for i in 0..VL-1:
          if remap.indexed:
@@ -334,73 +389,188 @@ where the pseudocode for `indexed_remap` may be found in [[sv/remap]]:
              rb_idx = i # use the index as-is
          EA = GPR(RA) + GPR(RB+rb_idx)
          GPR(RT+i) = MEM(EA, 8)
+```
  
  Thus it can be seen that the use of Indexed REMAP saves copying
  and manual reordering of the Vector of RB offsets.
  
-# LD/ST ffirst
+## LD/ST ffirst (Fault-First)
  
  LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
-is not active) as an
-ordinary one, with all behaviour with respect to Interrupts Exceptions
-Page Faults Memory Management being identical in every regard to Scalar
-v3.0 Power ISA LD/ST. However for elements 1
-and above, if an exception would occur, then VL is **truncated** to the
-previous element: the exception is **not** then raised because the
-LD/ST that would otherwise have caused an exception is *required* to be cancelled. Additionally an implementor may choose to truncate VL for
-any arbitrary reason *except for the very first*.
-
-ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting speculative feedback on which pages would fail.  Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST.  See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
-
+is not active and predication is not applied)
+as an ordinary one, with all behaviour with respect to
+Interrupts Exceptions Page Faults Memory Management being identical
+in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
+1 and above, if an exception would occur, then VL is **truncated**
+to the previous element: the exception is **not** then raised because
+the LD/ST that would otherwise have caused an exception is *required*
+to be cancelled. Additionally an implementor may choose to truncate VL
+for any arbitrary reason *except for the very first*.
+
+ffirst LD/ST to multiple pages via a Vectorized Index base is
+considered a security risk due to the abuse of probing multiple
+pages in rapid succession and getting speculative feedback on which
+pages would fail.  Therefore Vector Indexed LD/ST is prohibited
+entirely, and the Mode bit instead used for element-strided LD/ST.
+<!-- hide -->
+See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
+<!-- show -->
+
+```
      for(i = 0; i < VL; i++)
          reg[rt + i] = mem[reg[ra] + i * reg[rb]];
-
-High security implementations where any kind of speculative probing
-of memory pages is considered a risk should take advantage of the fact that
-implementations may truncate VL at any point, without requiring software
-to be rewritten and made non-portable. Such implementations may choose
-to *always* set VL=1 which will have the effect of terminating any
-speculative probing (and also adversely affect performance), but will
-at least not require applications to be rewritten.
-
-Low-performance simpler hardware implementations may also
-choose (always) to also set VL=1 as the bare minimum compliant implementation of
-LD/ST Fail-First. It is however critically important to remember that
-the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
-**MUST** raise exceptions exactly like an ordinary LD/ST.
-
-For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins the following ffirst LD/ST operations on an aligned boundary
-such as the beginning of a cache line, or beginning of a Virtual Memory
-page. Likewise, to reduce workloads or balance resources.
-
-Vertical-First Mode is slightly strange in that only one element
-at a time is ever executed anyway.  Given that programmers may
-legitimately choose to alter srcstep and dststep in non-sequential
-order as part of explicit loops, it is neither possible nor
-safe to make speculative assumptions about future LD/STs.
-Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
-This is very different from Arithmetic (Data-dependent) FFirst
-where Vertical-First Mode is fully deterministic, not speculative.
-
-# LD/ST Pack/Unpack Mode
-
-As described in [[sv/normal]], 
-Structured Pack/Unpack is similar to VSX `vpack` and `vunpack` except
-generalised not only to a Schedule to be applied to any operation but
-also extended to vec2/3/4.
-
-Just as in [[sv/normal]] operations,
-setting this mode changes the meaning of bits 4-5 in `RM` from being
-`ELWIDTH` to a pair of Pack/Unpack bits.  Thus it is not possible
-to separately override source and destination elwidths at the same
-time as use Pack/Unpack: the `SRC_ELWIDTH` bits (6-7) must be used as
-both source and destination elwidth.
-
-Pack/Unpack only applies to LD/ST-immediate operations.  
-See [[sv/svp64/appendix]] for details on how Pack/Unpack
-is implemented.
-
-# LOAD/STORE Elwidths <a name="elwidth"></a>
+```
+
+High security implementations where any kind of speculative probing of
+memory pages is considered a risk should take advantage of the fact
+that implementations may truncate VL at any point, without requiring
+software to be rewritten and made non-portable. Such implementations may
+choose to *always* set VL=1 which will have the effect of terminating
+any speculative probing (and also adversely affect performance), but
+will at least not require applications to be rewritten.
+
+Low-performance simpler hardware implementations may also choose (always)
+to set VL=1 as the bare minimum compliant implementation of LD/ST
+Fail-First. It is however critically important to remember that the first
+element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.  **MUST**
+raise exceptions exactly like an ordinary LD/ST.
+
+For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
+for any implementation-specific reason. For example: it is perfectly
+reasonable for implementations to alter VL when ffirst LD or ST operations
+are initiated on a nonaligned boundary, such that within a loop the
+subsequent iteration of that loop begins the following ffirst LD/ST
+operations on an aligned boundary such as the beginning of a cache line,
+or beginning of a Virtual Memory page. Likewise, to reduce workloads or
+balance resources.
+
+When Predication is used, the "first" element is considered to be the first
+non-predicated element rather than specifically `srcstep=0`.
+
+Vertical-First Mode is slightly strange in that only one element at a time
+is ever executed anyway.  Given that programmers may legitimately choose
+to alter srcstep and dststep in non-sequential order as part of explicit
+loops, it is neither possible nor safe to make speculative assumptions
+about future LD/STs.  Therefore, Fail-First LD/ST in Vertical-First is
+`UNDEFINED`.  This is very different from Arithmetic (Data-dependent)
+FFirst where Vertical-First Mode is fully deterministic, not speculative.
+
+## Data-Dependent Fail-First (not Fail/Fault-First)
+
+Not to be confused with Fail/Fault First, Data-Fail-First performs an
+additional check on the data, and if the test
+fails then VL is truncated and further looping terminates.
+This is precisely the same as Arithmetic Data-Dependent Fail-First,
+the only difference being that the result comes from the LD/ST
+rather than from an Arithmetic operation.
+
+There is also a crucial difference between Arithmetic and LD/ST
+Data-Dependent Fail-First: except for Store-Conditional, a 4-bit Condition
+Register Field is created for testing purposes
+*but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
+A CR Field is not stored because Load/Store, particularly
+the Update instructions, is already expensive in register terms,
+and adding an extra Vector write would be too costly in hardware.
+
+*Programmer's note: Programmers
+may use Data-Dependent Load with a test to truncate VL, and may then
+follow up with a `sv.cmpi` or other operation. The important aspect is
+that the Vector Load truncated on finding a NULL pointer, for example.*
+
+*Programmer's note: Load-with-Update may be used to update
+the register used in Effective Address computation of the
+next element.  This may be used to perform single-linked-list
+walking, where Data-Dependent Fail-First terminates and
+truncates the Vector at the first NULL.*
+
+**Load/Store Data-Dependent Fail-First, VLi=0**
+
+In the case of Store operations there is a quirk when VLi (VL inclusive
+is "Valid") is clear. Bear in mind the criterion is that the truncated
+Vector of results, when VLi is clear, must all pass the "test", but when
+VLi is set the *current failed test* is permitted to be included.  Thus,
+the actual update (store) to Memory is **not permitted to take place**
+should the test fail.
+
+Additionally in any Load/Store with Update instruction,
+when VLi=0 and a test fails then RA does **not** receive a
+copy of the Effective Address.  Hardware implementations with Out-of-Order
+Micro-Architectures should use speculative Shadow-Hold and Cancellation
+(or other Transactional Rollback mechanism) when the test fails.
+
+* **Load, VLi=0**: perform the Memory Load, do not put the result into the regfile yet (or EA into RA). Test the Loaded data: if fail do not store the Load in the register file (or EA into RA). Otherwise proceed with updating regfiles. VL is truncated to "only elements that passed the test"
+* **Store, VLi=0**: even before the Store takes place, perform the test on the data to *be* stored.  If fail do not proceed with the Store at all. VL is truncated to "only elements that passed the test"
+
+**Load/Store Data-Dependent Fail-First, VLi=1**
+
+By contrast if VLi=1 and the test fails, the Store may proceed *and then*
+looping terminates.  In this way, when Inclusive, the Vector of Truncated results
+contains the first-failed data (including RA on Updates).
+
+* **Load, VLi=1**: perform the Memory Load, complete it in full (including EA into RA). Test the Loaded data: if fail then VL is truncated to "elements tested".
+* **Store, VLi=1**: same as Load. Perform the Store in full and after-the-fact carry out the test of the original data requested to be stored. If fail then VL is truncated to "elements tested".
+
+Below is an example of loading the starting addresses of Linked-List
+nodes.  If VLi=1 it will load the NULL pointer into the Vector of results.
+If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
+one Element earlier (only loading non-NULL data into registers).
+
+*Programmer's Note: by also setting the RC1 qualifier as well as setting
+VLi=1 it is possible to establish a Predicate Mask such that the first
+zero in the predicate will be the NULL pointer*
+
+```
+   RT=1 # vec - deliberately overlaps by one with RA
+   RA=0 # vec - first one is valid, contains ptr
+   imm = 8 # offset_of(ptr->next)
+   for i in range(VL):
+       # this part is the Scalar Defined Word-instruction (standard scalar ld operation)
+       EA = GPR(RA+i) + imm          # ptr + offset(next)
+       data = MEM(EA, 8)             # 64-bit address of ptr->next
+       # was a normal vector-ld up to this point. now the Data-Fail-First
+       cr_test = conditions(data)
+       if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
+       action_load = True
+       stop = False
+       if cr_test.EQ == testbit:             # check if zero
+           if VLI:
+              VL = i+1            # update VL, inclusive
+           else:
+              VL = i              # update VL, exclusive current
+              action_load = False # current load excluded
+           stop = True            # stop looping
+       if action_load:
+          GPR(RT+i) = data        # happens to be read on next loop!
+       if stop: break
+```
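The pseudocode above can be exercised with a small executable model. This is only a sketch under stated assumptions: memory is modelled as a Python dict, the Data-Fail-First test is fixed to "loaded value is zero", and the function name `ddff_linked_list_load` is hypothetical. Unwritten destination elements are left as `None` so that the VLi=0 and VLi=1 outcomes can be told apart.

```python
def ddff_linked_list_load(mem, head, vl, vli, imm=8):
    """Model of a Data-Dependent Fail-First load walking a linked list.

    mem  : dict of address -> 64-bit value (hypothetical memory model)
    head : address of the first node, held in the RA element 0
    vli  : True for inclusive truncation (VLi=1), False for VLi=0
    Returns (regs, new_vl): regs[0] is RA, regs[1:] is the RT vector.
    """
    # RT=1 deliberately overlaps RA=0 by one element
    regs = [head] + [None] * vl
    for i in range(vl):
        ea = regs[i] + imm          # ptr + offset_of(next)
        data = mem[ea]              # address of ptr->next
        if data == 0:               # assumed test: NULL pointer found
            if vli:
                regs[1 + i] = data  # inclusive: NULL is written
                return regs, i + 1
            return regs, i          # exclusive: NULL is not written
        regs[1 + i] = data          # read as RA+i+1 on the next loop
    return regs, vl


# three nodes at 0x100 -> 0x200 -> 0x300 -> NULL, next pointer at offset 8
example_mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}
```

For the three-node list in `example_mem` with VL=8, this model truncates VL to 2 when VLi=0 and to 3 when VLi=1, the latter including the NULL pointer in the result Vector.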
+
+**Data-Dependent Fail-First on Store-Conditional (Rc=1)**
+
+There are very few instructions that allow Rc=1 for Load/Store:
+one of those is the `stdcx.` and other Atomic Store-Conditional
+instructions.  With Simple-V being a loop around Scalar instructions
+strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
+on an Atomic Store-Conditional will always fail the second and all other
+Store-Conditional instructions because
+Load-Reservation and Store-Conditional are required to be executed
+in pairs.
+
+By contrast, in Vertical-First Mode it is in fact possible to issue
+the pairs, and consequently allowing Vectorized Data-Dependent Fail-First is
+useful.
+
+Programmer's note: Care should be taken when VL is truncated in
+Vertical-First Mode.
+
+**Future potential**
+
+Although Rc=1 on LD/ST is a rare occurrence at present, future versions
+of Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
+with the SVP64 Vectorization Prefixing being a RISC-paradigm that
+is itself fully-independent of the Scalar Suffix Defined Word-instructions, prohibiting
+the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
+operations is not strategically sound.
+
+## LOAD/STORE Elwidths <a name="elwidth"></a>
  
  Loads and Stores are almost unique in that the Power Scalar ISA
  provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
@@ -411,36 +581,30 @@ others like it provide an explicit operation width.  There are therefore
  * src element width override (8/16/32/default)
  * destination element width override (8/16/32/default)
  
-Some care is therefore needed to express and make clear the transformations, 
+Some care is therefore needed to express and make clear the transformations,
  which are expressly in this order:
  
  * Calculate the Effective Address from RA at full width
-  but (on Indexed Load) allow srcwidth overrides on RB 
+  but (on Indexed Load) allow srcwidth overrides on RB
  * Load at the operation width (lb/lh/lw/ld) as usual
  * byte-reversal as usual
-* Non-saturated mode:
-   - zero-extension or truncation from operation width to dest elwidth
-   - place result in destination at dest elwidth
-* Saturated mode:
-   - Sign-extension or truncation from operation width to dest width
-   - signed/unsigned saturation down to dest elwidth
+* zero-extension or truncation from operation width to dest elwidth
+* place result in destination at dest elwidth
  
  In order to respect Power v3.0B Scalar behaviour the memory side
  is treated effectively as completely separate and distinct from SV
  augmentation.  This is primarily down to quirks surrounding LE/BE and
  byte-reversal.
  
-It is rather unfortunately possible to request an elwidth override
-on the memory side which
-does not mesh with the overridden operation width: these result in
-`UNDEFINED`
-behaviour.  The reason is that the effect of attempting a 64-bit `sv.ld`
-operation with a source elwidth override of 8/16/32 would result in
-overlapping memory requests, particularly on unit and element strided
-operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
-the memory operation width. Examples include `sv.lw/sw=16/els` which
-requests (overlapping) 4-byte memory reads offset from
-each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
+It is rather unfortunately possible to request an elwidth override on
+the memory side which does not mesh with the overridden operation width:
+these result in `UNDEFINED` behaviour.  The reason is that the effect
+of attempting a 64-bit `sv.ld` operation with a source elwidth override
+of 8/16/32 would result in overlapping memory requests, particularly
+on unit and element strided operations.  Thus it is `UNDEFINED` when
+the elwidth is smaller than the memory operation width. Examples include
+`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
+from each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
  where the dest elwidth override is less than the operation width.
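The overlap described above can be made concrete by listing the byte ranges each element would touch. The sketch below assumes an element-strided access whose stride equals the source elwidth in bytes, as in the `sv.lw/sw=16/els` example; the helper names `element_byte_ranges` and `has_overlap` are hypothetical.

```python
def element_byte_ranges(base, op_bytes, stride_bytes, vl):
    """Half-open (start, end) byte range read by each element."""
    return [(base + i * stride_bytes, base + i * stride_bytes + op_bytes)
            for i in range(vl)]


def has_overlap(ranges):
    """True if any range starts before its predecessor ends.

    Assumes ranges are in ascending address order.
    """
    return any(nxt_start < cur_end
               for (_, cur_end), (nxt_start, _) in zip(ranges, ranges[1:]))
```

A 4-byte operation stepped at 2-byte intervals gives ranges such as (0, 4), (2, 6), (4, 8), which overlap, whereas stepping at 4-byte intervals does not.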
  
  Note the following regarding the pseudocode to follow:
@@ -456,14 +620,15 @@ Note the following regarding the pseudocode to follow:
  * `svctx` specifies the SV Context and includes VL as well as
    source and destination elwidth overrides.
  
-Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
-both Immediate and Indexed LD/ST,
-does not have element-width overriding applied to it.
+Below is the pseudocode for Unit-Strided LD (which includes Vector
+capability). Observe in particular that RA, as the base address in both
+Immediate and Indexed LD/ST, does not have element-width overriding
+applied to it.
  
-Note that predication, predication-zeroing,
-and other modes except saturation have all been removed,
-for clarity and simplicity:
+Note that predication, predication-zeroing, and other modes
+have all been removed, for clarity and simplicity:
  
+```
      # LD not VLD!
      # this covers unit stride mode and a type of vector offset
      function op_ld(RT, RA, op_width, imm_offs, svctx)
@@ -476,37 +641,34 @@ for clarity and simplicity:
              # unit / element stride mode, compute 64 bit address
              srcbase = get_polymorphed_reg(RA, 64, 0)
              # adjust for unit/el-stride
-            srcbase += ....
+            srcbase += .... uses op_width here
  
          # read the underlying memory
          memread <= MEM(srcbase + imm_offs, op_width)
  
-        # check saturation.
-        if svpctx.saturation_mode:
-            # ... saturation adjustment...
-            memread = clamp(memread, op_width, svctx.dest_elwidth)
-        else:
-            # truncate/extend to over-ridden dest width.
-            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
+        # truncate/extend to over-ridden dest width.
+        memread = adjust_wid(memread, op_width, svctx.elwidth)
  
          # takes care of inserting memory-read (now correctly byteswapped)
          # into regfile underlying LE-defined order, into the right place
-        # within the NEON-like register, respecting destination element
-        # bitwidth, and the element index (j)
-        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
+        # using Element-Packing starting at register RT, respecting destination
+        # element bitwidth, and the element index (j)
+        set_polymorphed_reg(RT, svctx.elwidth, j, memread)
  
          # increments both src and dest element indices (no predication here)
          i++;
          j++;
+```
  
-Note above that the source elwidth is *not used at all* in LD-immediate.
-*(For Pack/Unpack Mode which shares the same source elwidth bits this
-is no great loss)*.
+Note above that the source elwidth is *not used at all* in LD-immediate: RA
+never has elwidth overrides, leaving the elwidth free for truncation/extension
+of the result.
  
  For LD/Indexed, the key is that in the calculation of the Effective Address,
  RA has no elwidth override but RB does.  Pseudocode below is simplified
-for clarity: predication and all modes except saturation are removed:
+for clarity: predication and all modes are removed:
  
+```
      # LD not VLD! ld*rx if brev else ld*
      function op_ld(RT, RA, RB, op_width, svctx, brev)
        for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
@@ -531,54 +693,68 @@ for clarity: predication and all modes except saturation are removed:
          if (bytereverse):
              memread = byteswap(memread, op_width)
  
-        if svpctx.saturation_mode:
-            # ... saturation adjustment...
-            memread = clamp(memread, op_width, svctx.dest_elwidth)
-        else:
-            # truncate/extend to over-ridden dest width.
-            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
+        # truncate/extend to over-ridden dest width.
+        dest_width = op_width if RT.isvec else 64
+        memread = adjust_wid(memread, op_width, dest_width)
  
          # takes care of inserting memory-read (now correctly byteswapped)
          # into regfile underlying LE-defined order, into the right place
          # within the NEON-like register, respecting destination element
          # bitwidth, and the element index (j)
-        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
+        set_polymorphed_reg(RT, dest_width, j, memread)
  
          # increments both src and dest element indices (no predication here)
          i++;
          j++;
-
-# Remapped LD/ST
-
-In the [[sv/remap]] page the concept of "Remapping" is described.
-Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
-a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
-elements worth of LDs or STs.  The usual interest in such re-mapping
-is for example in separating out 24-bit RGB channel data into separate
-contiguous registers.  NEON covers this as shown in the diagram below:
-
-<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >
-
-Remap easily covers this capability, and with dest
-elwidth overrides and saturation may do so with built-in conversion that
-would normally require additional width-extension, sign-extension and
-min/max Vectorised instructions as post-processing stages.
+```
+
+*Programmer's note: with no destination elwidth override the destination
+width must be implicitly ascertained.  The assumption is that if the destination
+is a Scalar then the entire 64-bit register must be written, thus the width is
+extended to 64-bit.  If however the destination is a Vector then it is deemed
+appropriate to use the LD/ST width and to perform contiguous register element
+packing at that width.  The justification for doing so is that if further
+sign-extension or saturation is required after a LD, these may be performed by a
+follow-up instruction that uses a source elwidth override matching the exact width
+of the LD operation.  Correspondingly for a ST a destination elwidth override
+on a prior instruction may match the exact width of the ST instruction.*
+
+## Remapped LD/ST
+
+In the [[sv/remap]] page the concept of "Remapping" is described.  Whilst
+it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
+arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
+of LDs or STs.  The usual interest in such re-mapping is for example in
+separating out 24-bit RGB channel data into separate contiguous registers.
+NEON covers this as shown in the diagram below:
+
+![Load/Store remap](/openpower/sv/load-store.svg)
+
+REMAP easily covers this capability, and with dest elwidth overrides
+and saturation may do so with built-in conversion that would normally
+require additional width-extension, sign-extension and min/max Vectorized
+instructions as post-processing stages.
  
  Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
  because the generic abstracted concept of "Remapping", when applied to
  LD/ST, will give that same capability, with far more flexibility.
  
-Also LD/ST with immediate has a Pack/Unpack option similar to VSX
-'vpack' and 'vunpack', as well as the VSX Pixel instructions. Enabling
-this mode on SubVectors is straightforward and does not involve
-the setup cost of REMAP. Unlike REMAP, Pack/Unpack on LD/ST does not have
-Saturation (or Fail-first) at the same time.
+It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
+established through `svstep`, are also an easy way to perform regular
+Structure Packing, at the vec2/vec3/vec4 granularity level.  Beyond that,
+REMAP will need to be used.
  
-*Programmer's note: a decision on what is best if combining Saturation
-with Pack/Unpack is required will depend on resources.  REMAP will
-require less registers but is more costly to set up. On the other
-hand LDST Pack/Unpack followed by Saturated MV or arithmetic requires
-intermediary registers at full width prior to reduced saturated width.
-A balanced decision is therefore needed*.
+**Parallel Reduction REMAP**
  
+No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
+is completely separate from the RISC-paradigm Scalar Defined Word-instructions.  Although
+obscure, there does exist the outside possibility that Parallel Reduction
+Schedules on LD/ST would find a use in Computer Science.
+Readers are invited to contact the authors of this document if one is ever
+found.
+
+--------
+
+[[!tag standards]]
  
+\newpage{}