From 50e59482e5baf44d8793f056b316a3a66b19887b Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 26 Mar 2023 12:38:30 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls008.mdwn | 133 ++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/openpower/sv/rfc/ls008.mdwn b/openpower/sv/rfc/ls008.mdwn
index b0c93da8b..9ff016512 100644
--- a/openpower/sv/rfc/ls008.mdwn
+++ b/openpower/sv/rfc/ls008.mdwn
@@ -219,6 +219,139 @@ Additional pseudo-op for obtaining VL without modifying it (or any state):
     getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
     getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0
 
+Note that whilst it is possible to set both MVL and VL from the same
+immediate, it is not possible to set them to different immediates in
+the same instruction.  Doing so would require two instructions.
+
+**Selecting sources for VL**
+
+There is considerable opcode pressure. Consequently, the source from which
+VL is set (always limited by MVL) is selected as follows:
+
+| condition           | effect         |
+| - | - |
+| `vs=1, RA=0, RT!=0` | VL,RT set to MIN(MVL, CTR)  |
+| `vs=1, RA=0, RT=0`  | VL set to MIN(MVL, SVi+1)  |
+| `vs=1, RA!=0, RT=0` | VL set to MIN(MVL, RA)  |
+| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA)  |
+
+The reasoning here is that the opportunity to set RT equal to the
+immediate `SVi+1` is sacrificed in favour of setting from CTR.
+
+# Unusual Rc=1 behaviour
+
+Normally, the return result from an instruction is in `RT`. With
+it being possible for `RT=0` to mean that `CTR` mode is to be read,
+some different semantics are needed.
+
+CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
+overflow may occur: `VL`, if set either from an immediate or from `CTR`,
+may not exceed `MAXVL`, and if the requested value does, `CR0.SO` must be set.
+
+Additionally, in reality it is **`VL`** being set. Therefore, rather
+than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, CR0.GE
+is set if `VL` is non-zero.
+
+# Vertical First Mode
+
+Vertical First is effectively like an implicit single bit predicate
+applied to every SVP64 instruction.  **ONLY** one element in each
+SVP64 Vector instruction is executed; srcstep and dststep do **not**
+increment, and the Program Counter progresses **immediately** to
+the next instruction just as it would for any standard scalar v3.0B
+instruction.
+
+An explicit mode of setvl is then used to move srcstep and
+dststep on to the next element, still respecting predicate
+masks.
+
+In other words, where normal SVP64 Vectorisation acts "horizontally"
+by looping first through 0 to VL-1 and only then moving the PC
+to the next instruction, Vertical-First moves the PC onwards
+(vertically) through multiple instructions **with the same
+srcstep and dststep**, then an explicit instruction is used to
+advance srcstep/dststep. An outer loop is expected to be
+used (branch instruction) which completes a series of
+Vector operations.
+
+`svfstep` mode is enabled when vf=1, vs=0 and ms=0.
+When Rc=1 it is possible to determine when any level of
+loops reaches an end condition, or if VL has been reached. The immediate can
+be reinterpreted as indicating which SVSTATE (0-3)
+should be tested and placed into CR0 (when Rc=1).
+
+When RT is not zero, an internal stepping index may also be returned,
+either the REMAP index or srcstep or dststep. This table is identical
+to that of [[sv/svstep]]:
+
+* `SVi=1`: also include inner, middle and outer
+  loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
+* `SVi=2`: test SVSTATE1 (and return conditions)
+* `SVi=3`: test SVSTATE2 (and return conditions)
+* `SVi=4`: test SVSTATE3 (and return conditions)
+* `SVi=5`: `SVSTATE.srcstep` is returned.
+* `SVi=6`: `SVSTATE.dststep` is returned.
+
+Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
+
+*Programmers should be aware that VL, srcstep and dststep are global in nature.
+Nested looping with different schedules is perfectly possible, as is
+calling of functions; however, SVSTATE (and any associated SVSHAPEs) should be stored on the stack.*
+
+**SUBVL**
+
+Sub-vector elements are not to be considered "Vertical". The vec2/3/4
+is to be considered as if it were a "single element".  Caveats exist for
+[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
+due to the order in which VL and SUBVL loops are applied being
+swapped (outer-inner becomes inner-outer).
+
+# Examples
+
+## Core concept loop
+
+```
+loop:
+    setvl a3, a0, MVL=8    #  update a3 with vl
+                           # (# of elements this iteration)
+                           # set MVL to 8
+    # do vector operations at up to 8 length (MVL=8)
+    # ...
+    sub a0, a0, a3   # Decrement count by vl
+    bnez a0, loop    # Any more?
+```
+
+## Loop using Rc=1
+
+    my_fn:
+      li r3, 1000
+      b test
+    loop:
+      sub r3, r3, r4
+      ...
+    test:
+      setvli. r4, r3, MVL=64
+      bne cr0, loop
+    end:
+      blr
+
+## Load/Store-Multi (selective)
+
+Up to 64 FPRs will be loaded, here.  `r3` has one bit set
+for each FP register required to be loaded.  The block of memory
+from which the registers are loaded is contiguous (no gaps):
+any FP register which has a corresponding zero bit in `r3`
+is *unaltered*.  In essence this is a selective LD-multi with
+"Scatter" capability.
+
+    setvli r0, MVL=64, VL=64
+    sv.lfd/dm=r3 *fp0, 0(r30) # selective load 64 FP registers
+
+Up to 64 FPRs will be saved, here.  Again, `r3` has one bit set for each FP register required to be saved.
+
+    setvli r0, MVL=64, VL=64
+    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers
+
 -------------
 
 \newpage{}
-- 
2.30.2