(no commit message)

[libreriscv.git] / openpower / sv / ldst.mdwn
diff --git a/openpower/sv/ldst.mdwn b/openpower/sv/ldst.mdwn
index fd3fbed4193d0ef849edd12630dc8e08df35372f..8fe06719dedd990033c54e22a62ca40a7129ca65 100644
--- a/openpower/sv/ldst.mdwn
+++ b/openpower/sv/ldst.mdwn
@@ -9,6 +9,31 @@ Links:
  * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
  * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
  * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
+* [[simple_v_extension/specification/ld.x]]
+
+# Rationale
+
+All Vector ISAs dating back fifty years have extensive and comprehensive
+Load and Store operations that go far beyond the capabilities of Scalar
+RISC or CISC processors, yet at their heart on an individual element
+basis may be found to be no different from RISC Scalar equivalents.
+
+The resource savings from Vector LD/ST are significant and stem from
+the fact that one single instruction can trigger a dozen (or, in some
+microarchitectures such as Cray or NEC SX Aurora, hundreds of) element-level Memory accesses.
+
+Additionally, and simply: if the Arithmetic side of an ISA supports
+Vector Operations, then in order to keep the ALUs 100% occupied the
+Memory infrastructure (and the ISA itself) correspondingly needs Vector
+Memory Operations as well.
+
+Vectorised Load and Store also presents an extra dimension (literally)
+which creates scenarios unique to Vector applications, that a Scalar
+(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
+add such modes without changing the behaviour of the underlying Base
+(Scalar) v3.0B operations.
+
+# Modes overview
  
  Vectorisation of Load and Store requires creation, from scalar operations,
  a number of different modes:
@@ -16,9 +41,14 @@ a number of different modes:
  * fixed stride (contiguous sequence with no gaps) aka "unit" stride
  * element strided (sequential but regularly offset, with gaps)
  * vector indexed (vector of base addresses and vector of offsets)
-* fail-first on the same (where it makes sense to do so)
+* Speculative fail-first (where it makes sense to do so)
  * Structure Packing (covered in SV by [[sv/remap]]).
  
+Also included in SVP64 LD/ST are both signed and unsigned Saturation,
+as well as Element-width overrides and Twin-Predication.
+
+# Vectorisation of Scalar Power ISA v3.0B
+
  OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
  [[isa/fixedstore]] pseudocode to be of the form:
  
@@ -50,13 +80,13 @@ with the pseudocode below, the immediate can be used to give unit stride or elem
          if (RA.isvec) while (!(ps & 1<<i)) i++;
          if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
          if (RT.isvec) while (!(pd & 1<<j)) j++;
-        if svctx.ldstmode == bitreversed: # for FFT/DCT
-          # FFT/DCT bitreversed mode
+        if svctx.ldstmode == shifted: # for FFT/DCT
+          # FFT/DCT shifted mode
            if (RA.isvec)
              srcbase = ireg[RA+i]
            else
              srcbase = ireg[RA]
-          offs = (bitrev(i, VL) * immed) << RC
+          offs = (i * immed) << RC
          elif svctx.ldstmode == elementstride:
            # element stride mode
            srcbase = ireg[RA]
@@ -88,7 +118,8 @@ with the pseudocode below, the immediate can be used to give unit stride or elem
          if (RT.isvec) j++;
  
      # reverses the bitorder up to "width" bits
-    def bitrev(val, width):
+    def bitrev(val, VL):
+      width = log2(VL)
        result = 0
        for _ in range(width):
          result = (result << 1) | (val & 1)
@@ -107,13 +138,17 @@ Indexed LD is:
          if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
          if (RB.isvec) while (!(ps & 1<<k)) k++;
          if (RT.isvec) while (!(pd & 1<<j)) j++;
-        EA = ireg[RA+i] + ireg[RB+k] # indexed address
+        if svctx.ldstmode == elementstride:
+            EA = ireg[RA] + ireg[RB]*j   # register-strided
+        else
+            EA = ireg[RA+i] + ireg[RB+k] # indexed address
          if RAupdate: ireg[RAupdate+u] = EA
          ireg[RT+j] <= MEM[EA];
          if (!RT.isvec)
              break # destination scalar, end immediately
-        if (!RA.isvec && !RB.isvec)
-            break # scalar-scalar
+        if svctx.ldstmode != elementstride:
+            if (!RA.isvec && !RB.isvec)
+                break # scalar-scalar
          if (RA.isvec) i++;
          if (RAupdate.isvec) u++;
          if (RB.isvec) k++;
@@ -126,7 +161,7 @@ Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux
  A minor complication (caused by the retro-fitting of modern Vector
  features to a Scalar ISA) is that certain features do not exactly make
  sense or are considered a security risk.  Fail-first on Vector Indexed
-allows attackers to probe large numbers of pages from userspace, where
+would allow attackers to probe large numbers of pages from userspace, where
  strided fail-first (by creating contiguous sequential LDs) does not.
  
  In addition, reduce mode makes no sense, and for LD/ST with immediates
@@ -136,17 +171,24 @@ an alternative table meaning for [[sv/svp64]] mode.  The following modes make se
  * saturation
  * predicate-result (mostly for cache-inhibited LD/ST)
  * normal
-* fail-first, where a vector source on RA or RB is banned
+* fail-first (where Vector Indexed is banned)
+* Signed Effective Address computation (Vector Indexed only)
+
+Also, given that FFT, DCT and other related algorithms
+are of such high importance in so many areas of Computer
+Science, a special "shift" mode has been added which
+allows part of the immediate to be used instead as RC, a register
+which shifts the immediate `DS << GPR(RC)`.
  
  The table for [[sv/svp64]] for `immed(RA)` is:
  
  | 0-1 |  2  |  3   4  |  description               |
  | --- | --- |---------|--------------------------- |
-| 00  | 0   |  dz els | normal mode                |
-| 00  | 1   |  dz rsv | bitreverse mode (FFT, DCT) |
+| 00  | 0   |  zz els | normal mode                |
+| 00  | 1   |  zz shf | shift mode                 |
  | 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
  | 01  | inv | els RC1 |  Rc=0: ffirst z/nonz       |
-| 10  |   N | dz  els |  sat mode: N=0/1 u/s       |
+| 10  |   N | zz  els |  sat mode: N=0/1 u/s       |
  | 11  | inv | CR-bit  |  Rc=1: pred-result CR sel  |
  | 11  | inv | els RC1 |  Rc=0: pred-result z/nonz  |
  
@@ -169,13 +211,20 @@ in reading from the exact same memory location.
  For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
  just the once and be copied, rather than hitting the Data Cache
  multiple times with the same memory read at the same location.
+Cache-inhibited Loads, by contrast, perform the read once per element,
+which allows memory-mapped peripherals to have multiple data values
+read in quick succession and stored in sequentially numbered registers.
  
-For ST from a vector source onto a scalar destination: with the Vector
+For non-cache-inhibited ST from a vector source onto a scalar
+destination: with the Vector
  loop effectively creating multiple memory writes to the same location,
  we can deduce that the last of these will be the "successful" one. Thus,
  implementations are free and clear to optimise out the overwriting STs,
  leaving just the last one as the "winner".  Bear in mind that predicate
  masks will skip some elements (in source non-zeroing mode).
+Cache-inhibited ST operations on the other hand **MUST** write out
+a Vector source multiple successive times to the exact same Scalar
+destination.
  
  Note that there are no immediate versions of cache-inhibited LD/ST.
  
@@ -183,13 +232,16 @@ The modes for `RA+RB` indexed version are slightly different:
  
  | 0-1 |  2  |  3   4  |  description              |
  | --- | --- |---------|-------------------------- |
-| 00  |   0 |  dz  sz | normal mode                      |
-| 00  |   1 |  rsvd   | reserved                     |
-| 01  | inv | CR-bit  | Rc=1: ffirst CR sel              |
-| 01  | inv | dz  RC1 |  Rc=0: ffirst z/nonz |
+| 00  | SEA |  dz  sz | normal mode        |
+| 01  | SEA | dz sz  | Strided (scalar only source)   |
  | 10  |   N | dz   sz |  sat mode: N=0/1 u/s |
  | 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
-| 11  | inv | dz  RC1 |  Rc=0: pred-result z/nonz |
+| 11  | inv | zz  RC1 |  Rc=0: pred-result z/nonz |
+
+Vector Indexed Strided Mode is qualified as follows:
+
+    if mode = 0b01 and !RA.isvec and !RB.isvec:
+        svctx.ldstmode = elementstride
  
  A summary of the effect of Vectorisation of src or dest:
   
@@ -197,16 +249,64 @@ A summary of the effect of Vectorisation of src or dest:
       imm(RA)  RT.s   RA.v   no stride allowed
       imm(RA)  RT.v   RA.s   stride-select allowed
       imm(RA)  RT.s   RA.s   not vectorised
-     RA,RB    RT.v  RA/RB.v ffirst banned
-     RA,RB    RT.s  RA/RB.v ffirst banned
-     RA,RB    RT.v  RA/RB.s VSPLAT possible
-     RA,RB    RT.s  RA/RB.s not vectorised
+     RA,RB    RT.v  {RA|RB}.v UNDEFINED
+     RA,RB    RT.s  {RA|RB}.v UNDEFINED
+     RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
+     RA,RB    RT.s  {RA&RB}.s not vectorised
+
+Signed Effective Address computation is only relevant for
+Vector Indexed Mode, when elwidth overrides are applied.
+The source override applies to RB: if SEA is set, RB is
+sign-extended from elwidth bits to the full 64 bits before
+being added to RA to calculate the Effective Address.
+For other Modes (ffirst, saturate), all EA computation
+with elwidth overrides is unsigned.
  
  Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially.  `ldcix` even with scalar src will read the same memory location *multiple times*, storing the result in successive Vector destination registers.  This because the cache-inhibit instructions are used to read and write memory-mapped peripherals.
  If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
  cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.
  
-# LOAD/STORE Elwidths <a name="ldst"></a>
+## LD/ST ffirst
+
+LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
+ordinary one.  Exceptions occur "as normal".  However for elements 1
+and above, if an exception would occur, then VL is **truncated** to the
+previous element: the exception is **not** then raised because the
+LD/ST was effectively speculative.
+
+ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting feedback on which pages would fail.  Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST.  See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
+
+    for(i = 0; i < VL; i++)
+        reg[rt + i] = mem[reg[ra] + i * reg[rb]];
+
+High security implementations where any kind of speculative probing
+of memory pages is considered a risk should take advantage of the fact that
+implementations may truncate VL at any point, without requiring software
+to be rewritten and made non-portable. Such implementations may choose
+to *always* set VL=1 which will have the effect of terminating any
+speculative probing (and also adversely affect performance), but will
+at least not require applications to be rewritten.
+
+Low-performance simpler hardware implementations may
+also choose to always set VL=1 as the bare minimum compliant implementation of
+LD/ST Fail-First. It is however critically important to remember that
+the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
+**MUST** raise exceptions exactly like an ordinary LD/ST.
+
+For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins subsequent ffirst LD/ST operations on an aligned boundary
+such as the beginning of a cache line, or beginning of a Virtual Memory
+page. Likewise, VL may be truncated to reduce workloads or balance resources.
+
+Vertical-First Mode is slightly strange in that only one element
+at a time is ever executed anyway.  Given that programmers may
+legitimately choose to alter srcstep and dststep in non-sequential
+order as part of explicit loops, it is neither possible nor
+safe to make speculative assumptions about future LD/STs.
+Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
+This is very different from Arithmetic (Data-dependent) FFirst
+where Vertical-First Mode is deterministic, not speculative.
+
+# LOAD/STORE Elwidths <a name="elwidth"></a>
  
  Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
  provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
@@ -234,6 +334,17 @@ is treated effectively as completely separate and distinct from SV
  augmentation.  This is primarily down to quirks surrounding LE/BE and
  byte-reversal in OpenPOWER.
  
+It is unfortunately possible to request an elwidth override on the memory side which
+does not mesh with the operation width: such requests result in `UNDEFINED`
+behaviour.  The reason is that the effect of attempting a 64-bit `sv.ld`
+operation with a source elwidth override of 8/16/32 would result in
+overlapping memory requests, particularly on unit and element strided
+operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
+the memory operation width. Examples include `sv.lw/sw=16/els` which
+requests (overlapping) 4-byte memory reads offset from
+each other at 2-byte intervals.  Store is likewise `UNDEFINED`
+where the dest elwidth override is less than the operation width.
+
  Note the following regarding the pseudocode to follow:
  
  * `scalar identity behaviour` SV Context parameter conditions turn this
@@ -277,7 +388,6 @@ and other modes have all been removed, for clarity and simplicity:
          if (bytereverse):
              memread = byteswap(memread, op_width)
  
-
          # check saturation.
          if svpctx.saturation_mode:
              ... saturation adjustment...
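
The address-generation changes in this diff (the immediate "shift" mode, the register-strided form of Indexed LD, and Signed Effective Address computation) can be sketched in Python as below. This is a simplified model under stated assumptions, not the specification's pseudocode: predication, `RAupdate`, fail-first and the actual memory access are all omitted, registers are modelled as plain Python lists (a one-element list standing for a scalar), and the helper names `sign_extend`, `shifted_offsets` and `indexed_eas` are hypothetical.

```python
MASK64 = (1 << 64) - 1  # Effective Addresses are modelled as 64-bit values


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def shifted_offsets(immed, rc_shift, vl):
    """Immediate 'shift' mode from the diff: offs = (i * immed) << RC."""
    return [(i * immed) << rc_shift for i in range(vl)]


def indexed_eas(ra, rb, vl, elwidth=64, sea=False, strided=False):
    """Effective Addresses for Indexed LD (assumed model, no predication).

    ra, rb: lists of register values; a one-element list is a scalar.
    strided: models svctx.ldstmode == elementstride (scalar RA and RB).
    sea: sign-extend the RB element from elwidth bits before adding.
    """
    eas = []
    for j in range(vl):
        if strided:
            # register-strided: EA = ireg[RA] + ireg[RB]*j
            ea = ra[0] + rb[0] * j
        else:
            base = ra[j] if len(ra) > 1 else ra[0]
            idx = rb[j] if len(rb) > 1 else rb[0]
            idx &= (1 << elwidth) - 1        # source elwidth override on RB
            if sea:
                idx = sign_extend(idx, elwidth)
            ea = base + idx
        eas.append(ea & MASK64)
        if not strided and len(ra) == 1 and len(rb) == 1:
            break  # scalar-scalar: end after the first element
    return eas
```

With an 8-bit source elwidth, an RB element of `0xFF` is treated as -1 when `sea=True` and as 255 otherwise, which is the distinction the SEA bit is described as making.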