(no commit message)

diff --git a/openpower/sv/setvl.mdwn b/openpower/sv/setvl.mdwn

index 2158d8dabf170520502bd628c428e16177c04e95..630fa711564d59d5037d4fffe9fa6ef03ec4a7bc 100644
--- a/openpower/sv/setvl.mdwn
+++ b/openpower/sv/setvl.mdwn
@@ -1,129 +1,111 @@
-[[!tag standards]]
-
-# DRAFT setvl/setvli
+# setvl: Set Vector Length
  
+<!-- hide -->
  See links:
  
  * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=535>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=587>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=914> TODO: setvl should not set SO
  * <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
+* <https://bugs.libre-soc.org/show_bug.cgi?id=927> bug - RT>=32
  * <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
  * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
  * [[sv/svstep]]
  * pseudocode [[openpower/isa/simplev]]
+<!-- show -->
  
-Use of setvl results in changes to the SVSTATE SPR. see [[sv/sprs]]
-
-# Behaviour and Rationale
-
-SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
-just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
-regfiles: there is no separate Vector register numbering.  Therefore, also
-unlike RVV, SV does not have hard-coded "Lanes": microarchitects
-may use *ordinary* in-order, out-of-order, or superscalar designs
-as the basis for SV. By contrast, the relevant parameter
-in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
-anywhere from 1 to tens of thousands of Lanes in supercomputers.
-
-SV is more like how MMX used to sit on top of the x86 FP regfile.
-Therefore when Vector operations are performed, the question has to
-be asked, "well, how much of the regfile do you want to allocate to
-this operation?" because if it is too small an amount performance may
-be affected, and if too large then other registers would overlap and
-cause data  corruption, or even if allocated correctly would require
-spill to memory.
-
-The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
-is set from an immediate, so that the compiler may decide, statically, a
-guaranteed resource allocation according to the needs of the application.
-
-While RVV's MAXVL was a hw limit, SV's MVL is simply a loop
-optimization. It does not carry side-effects for the arch, though for
-a specific cpu it may affect hw unit usage.
-
-Other than being able to set MVL, SV's VL (Vector Length) works just like
-RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
-set VL to an arbitrary explicit value.  Within the limit of MVL, VL
-**MUST** be set to the requested value. Given that RVV only works on Vector Loops,
-this is fine and part of its value and design.  However, SV sits on top
-of the standard register files.  When MVL=VL=2, a Vector Add on `r3`
-will perform two Scalar Adds: one on `r3` and one on `r4`.
-
-Thus there is the opportunity to set VL to an explicit value (within the
-limits of MVL) with the reasonable expectation that if two operations
-are requested (by setting VL=2) then two operations are guaranteed.
-This avoids the need for a loop (with not-insignificant use of the
-regfiles for counters), simply two instructions:
-
-    setvli r0, MVL=64, VL=64
-    ld r0.v, 0(r30) # load exactly 64 registers from memory
+Add the following section to the Simple-V Chapter
  
-Page Faults etc. aside this is *guaranteed* 100% without fail to perform
-64 unit-strided LDs starting from the address pointed to by r30 and put
-the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
-Predication could even be used to only load relevant registers from
-the stack.  This *only works if VL is set to the requested value* rather
-than, as in RVV, allowing the hardware to set VL to an arbitrary value.
+## setvl
  
-Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`.
-In combination with SVP64 [[sv/branches]] this can save one instruction
-inside critical inner loops. Note: to avoid having an extra opcode
-bit in `setvl`,
-to select CTR is slightly convoluted.
+SVL-Form
  
-# Format
+| 0-5|6-10|11-15|16-22 | 23 24 25 | 26-30 |31|   FORM   |
+| -- | -- | --- | ---- |----------| ----- |--|----------|
+|PO  | RT | RA  | SVi  | ms vs vf | XO    |Rc| SVL-Form |
  
-*(Allocation of opcode TBD pending OPF ISA WG approval)*,
-using EXT22 temporarily and fitting into the
-[[sv/bitmanip]] space
+* setvl RT,RA,SVi,vf,vs,ms (Rc=0)
+* setvl. RT,RA,SVi,vf,vs,ms (Rc=1)
  
-Form: SVL-Form (see [[isatables/fields.text]])
+Pseudo-code:
  
-| 0.5|6.10|11.15|16..22| 23...25    | 26.30 |31|  name   |
-| -- | -- | --- | ---- |----------- | ----- |--| ------- |
-|OPCD| RT | RA  | SVi  |   ms vs vf | 11011 |Rc| setvl   |
-
-Instruction format:
+```
+    overflow <- 0b0    # sets CR.SO if set and if Rc=1
+    VLimm <- SVi + 1
+    # set or get MVL
+    if ms = 1 then MVL <- VLimm[0:6]
+    else           MVL <- SVSTATE[0:6]
+    # set or get VL
+    if vs = 0                then VL <- SVSTATE[7:13]
+    else if _RA != 0         then
+        if (RA) >u 0b1111111 then
+            VL <- 0b1111111
+            overflow <- 0b1
+        else                      VL <- (RA)[57:63]
+    else if _RT = 0          then VL <- VLimm[0:6]
+    else if CTR >u 0b1111111 then
+        VL <- 0b1111111
+        overflow <- 0b1
+    else                          VL <- CTR[57:63]
+    # limit VL to within MVL
+    if VL >u MVL then
+        overflow <- 0b1
+        VL <- MVL
+    SVSTATE[0:6] <- MVL
+    SVSTATE[7:13] <- VL
+    if _RT != 0 then
+       GPR(_RT) <- [0]*57 || VL
+    # MAXVL is a static "state-reset" opportunity so VF is only set then.
+    if ms = 1 then
+         SVSTATE[63] <- vf   # set Vertical-First mode
+         SVSTATE[62] <- 0b0  # clear persist bit
+```
  
-    setvl RT,RA,SVi,vf,vs,ms
-    setvl. RT,RA,SVi,vf,vs,ms
+Special Registers Altered:
  
-Note that the immediate (`SVi`) spans 7 bits (16 to 22)
+```
+    CR0                     (if Rc=1)
+    SVSTATE
+```
  
+* `SVi` - bits 16-22 - an immediate operand for setting MVL and/or VL
  * `ms` - bit 23 - allows for setting of MVL
  * `vs` - bit 24 - allows for setting of VL
  * `vf` - bit 25 - sets "Vertical First Mode".
  
-Note that in immediate setting mode VL and MVL start from **one**
-i.e. that an immediate value of zero will result in VL/MVL being set to 1.
-0b111111 results in VL/MVL being set to 64. This is because setting
-VL/MVL to 1 results in "scalar identity" behaviour, where setting VL/MVL
-to 0 would result in all Vector operations becoming `nop`.  If this is
-truly desired (nop behaviour) then setting VL and MVL to zero is to be
-done via the [[SVSTATE SPR|sv/sprs]]
+Note that in immediate setting mode VL and MVL start from **one**, and
+that this is compensated for in the assembly notation: an immediate
+value of 1 in assembler notation actually places the value 0b0000000 in
+the `SVi` field bits. On execution the `setvl` instruction adds one to
+the decoded `SVi` field bits, resulting in VL/MVL being set to 1. In future
+this will allow VL to be set to values ranging from 1 to 128 with only 7 bits
+instead of 8.  Setting VL/MVL to 0 would result in all Vector operations
+becoming `nop`.  If this is truly desired (nop behaviour) then setting
+VL and MVL to zero is to be done via the [[SVSTATE SPR|sv/sprs]].
  
  Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise
  
-    setvli VL=8    : setvl r5, r0, VL=8
-    setmvli MVL=8  : setvl r0, r0, MVL=8
+```
+    setvli   VL=8   : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
+    setvli.  VL=8   : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
+    setmvli  MVL=8  : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
+    setmvli. MVL=8  : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1
+```
  
  Additional pseudo-op for obtaining VL without modifying it (or any state):
  
-    getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0
-
-For Vertical-First mode, a pseudo-op for explicit incrementing
-of srcstep and dststep:
-
-    svfstep.        : setvl. 0, 0, vf=1, vs=0, ms=0
-
-This pseudocode op is different from [[sv/svstep]] which is used to
-perform detailed enquiries about internal state.
+```
+    getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
+    getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0
+```
  
  Note that whilst it is possible to set both MVL and VL from the same
  immediate, it is not possible to set them to different immediates in
  the same instruction.  Doing so would require two instructions.
  
+Use of setvl results in changes to the SVSTATE SPR. See [[sv/sprs]].
+
  **Selecting sources for VL**
  
  There is considerable opcode pressure, consequently to set MVL and VL
@@ -139,156 +121,56 @@ from different sources is as follows:
  The reasoning here is that the opportunity to set RT equal to the
  immediate `SVi+1` is sacrificed in favour of setting from CTR.
  
-# Vertical First Mode
-
-Vertical First is effectively like an implicit single bit predicate
-applied to every SVP64 instruction.  **ONLY** one element in each
-SVP64 Vector instruction is executed; srcstep and dststep do **not**
-increment, and the Program Counter progresses **immediately** to
-the next instruction just as it would for any standard scalar v3.0B
-instruction.
-
-An explicit mode of setvl is called which can move srcstep and
-dststep on to the next element, still respecting predicate
-masks.  
-
-In other words, where normal SVP64 Vectorisation acts "horizontally"
-by looping first through 0 to VL-1 and only then moving the PC
-to the next instruction, Vertical-First moves the PC onwards
-(vertically) through multiple instructions **with the same
-srcstep and dststep**, then an explict instruction used to
-advance srcstep/dststep. An outer loop is expected to be
-used (branch instruction) which completes a series of
-Vector operations.
-
-```svfstep``` mode is enabled when vf=1, vs=0 and ms=0. 
-When Rc=1 it is possible to determine when any level of
-loops reach an end condition, or if VL has been reached. The immediate can
-be reinterpreted as indicating which SVSTATE (0-3)
-should be tested and placed into CR0.
-
-* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
-  to 1 when either srcstep or dststep reach VL
-* setvl immediate = 2: also include inner middle and outer
-  loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
-* setvl immediate = 3: test SVSTATE1
-* setvl immediate = 4: test SVSTATE2
-* setvl immediate = 5: test SVSTATE3
-
-Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
-
-*Programmers should be aware that VL, srcstep and dststep are global in nature.
-Nested looping with different schedules is perfectly possible, as is
-calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*
+**Unusual Rc=1 behaviour**
+
+Normally, the return result from an instruction is in `RT`. With it
+being possible for `RT=0` to mean that `CTR` mode is to be read, some
+different semantics are needed.
+
+CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
+overflow may occur: `VL`, if set either from an immediate or from `CTR`,
+may not exceed `MAXVL`, and if it does, `CR0.SO` must be set.
+
+In reality it is **`VL`** being set. Therefore, rather than `CR0`
+testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, and CR0.GT is set if `VL`
+is non-zero.
  
  **SUBVL**
  
  Sub-vector elements are not to be considered "Vertical". The vec2/3/4
  is to be considered as if the "single element".  Caveats exist for
-[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled.
-
-# Pseudocode
-
-    // instruction fields:
-    rd = get_rt_field();         // bits 6..10
-    ra = get_ra_field();         // bits 11..15
-    vf = get_vf_field();         // bit 23
-    vs = get_vs_field();         // bit 24
-    ms = get_ms_field();         // bit 25
-    Rc = get_Rc_field();         // bit 31
-
-    if vf and not vs and not ms {
-        // increment src/dest step mode
-        // NOTE! this is in no way complete! predication is not included
-        // and neither is SUBVL mode
-        srcstep = SPR[SV].srcstep
-        dststep = SPR[SV].dststep
-        VL = SPR[SV].VL
-        srcstep++
-        dststep++
-        rollover = (srcstep == VL or dststep == VL)
-        if rollover:
-            // Reset srcstep, dststep, and also exit "Vertical First" mode
-            srcstep = 0
-            dststep = 0
-            MSR[6] = 0
-        SPR[SV].srcstep = srcstep
-        SPR[SV].dststep = dststep
-
-        // write CR? helps for doing Vertical loops, detects end
-        // of Vector Elements
-        if Rc = 1 {
-            // update CR to indicate that srcstep/dststep "rolled over"
-            CR0.eq = rollover
-        }
-    } else {
-        // add one. MVL/VL=1..64 not 0..63
-        vlimmed = get_immed_field()+1; //  16..22
-
-        // set VL (or not).
-        // 4 options: from SPR, from immed, from ra, from CTR
-        if vs {
-           // VL to be sourced from fields/regs
-           if ra != 0 {
-               VL = GPR[ra]
-           } else {
-               VL = vlimmed
-           }
-        } else {
-           // VL not to change (except if MVL is reduced)
-           // read from SPRs
-           VL = SPR[SV_VL]
-        }
-
-        // set MVL (or not).
-        // 2 options: from SPR, from immed
-        if ms {
-           MVL = vlimmed
-        } else {
-           // MVL not to change, read from SPRs
-           MVL = SPR[SV_MVL]
-        }
-
-        // calculate (limit) VL
-        VL = min(VL, MVL)
-
-        // store VL, MVL
-        SVSTATE.VL = VL
-        SVSTATE.MVL = MVL
-
-        // write rd
-        if rt != 0 {
-            // rt is not zero
-            regs[rt] = VL;
-        }
-        // write CR?
-        if Rc = 1 {
-            // update CR from VL (not rt)
-            CR0.eq = (VL == 0)
-            ...
-            ...
-        }
-        // write Vertical-First mode
-        SVSTATE.vf = vf
-    }
-
-# Examples
-
-## Core concept loop
+[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled, due
+to the order in which VL and SUBVL loops are applied being swapped
+(outer-inner becomes inner-outer).
+
+## Examples
+
+### Core concept loop
+
+This example illustrates the Cray-style Loop concept. However, where most
+Cray-style Vector architectures have a Max Vector Length hard-coded into the
+architecture, Simple-V allows MVL to be set, but only as a static immediate,
+so that compilers may embed the register resource allocation statically at
+compile-time.
  
  ```
  loop:
      setvl a3, a0, MVL=8    #  update a3 with vl
                             # (# of elements this iteration)
-                           # set MVL to 8
+                           # set MVL to 8 and
+                           # set a3=VL=MIN(a0,MVL)
      # do vector operations at up to 8 length (MVL=8)
      # ...
-    sub a0, a0, a3   # Decrement count by vl
+    sub. a0, a0, a3   # Decrement count by vl, set CR0.eq
      bnez a0, loop    # Any more?
  ```
  
-## Loop using Rc=1
+### Loop using Rc=1
+
+In this example, the `setvl.` instruction enables Rc=1, which
+sets CR0.eq when VL becomes zero. Testing of `r4` (cmpi) is thus redundant,
+saving one instruction.
  
+```
      my_fn:
        li r3, 1000
        b test
@@ -300,4 +182,32 @@ loop:
        bne cr0, loop
      end:
        blr
+```
+
+### Load/Store-Multi (selective)
+
+Up to 64 FPRs will be loaded, here.  `r3` has one bit set for each
+FP register required to be loaded.  The block of memory from which the
+registers are loaded is contiguous (no gaps): any FP register which has
+a corresponding zero bit in `r3` is *unaltered*.  In essence this is a
+selective LD-multi with "Scatter" (`VCOMPRESS`) capability.
+
+```
+    setvli r0, MVL=64, VL=64
+    sv.fld/dm=r3 *r0, 0(r30) # selective load 64 FP registers
+```
+
+Up to 64 FPRs will be saved, here.  Again, `r3` specifies which
+registers are set in a `VEXPAND` fashion.
+
+```
+    setvli r0, MVL=64, VL=64
+    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers
+```
+
+[[!tag standards]]
+
+------
+
+\newpage{}
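
The `setvl` pseudo-code added by this diff can be modelled in Python as a rough sketch, which may help when checking corner cases such as the CTR source or the clamp to MVL. This is not the project's simulator: the `SVState` class, its field names, the `setvl` signature and the `strip_mine` helper are all hypothetical names chosen for illustration. It also assumes that `VLimm[0:6]` means "the 7-bit value of `SVi + 1`", so the model simply masks to 7 bits; CR0 is not modelled, only the `overflow` flag is returned.

```python
from dataclasses import dataclass

MASK7 = 0b1111111  # VL and MVL are 7-bit fields in the pseudo-code


@dataclass
class SVState:
    """Hypothetical container for the SVSTATE fields that setvl touches."""
    maxvl: int = 0    # SVSTATE[0:6]
    vl: int = 0       # SVSTATE[7:13]
    persist: int = 0  # SVSTATE[62]
    vfirst: int = 0   # SVSTATE[63], Vertical-First mode


def setvl(state, gpr, ctr, rt, ra, svi, vf, vs, ms):
    """Update `state` and `gpr` in place; return the overflow flag."""
    overflow = False
    vlimm = (svi + 1) & MASK7  # assumption: VLimm is kept to 7 bits

    # set or get MVL
    mvl = vlimm if ms else state.maxvl

    # set or get VL: from SVSTATE, (RA), the immediate, or CTR
    if not vs:
        vl = state.vl
    elif ra != 0:
        if gpr[ra] > MASK7:
            vl, overflow = MASK7, True
        else:
            vl = gpr[ra]
    elif rt == 0:
        vl = vlimm
    elif ctr > MASK7:
        vl, overflow = MASK7, True
    else:
        vl = ctr

    # limit VL to within MVL
    if vl > mvl:
        overflow = True
        vl = mvl

    state.maxvl, state.vl = mvl, vl
    if rt != 0:
        gpr[rt] = vl

    # MAXVL is a static "state-reset" opportunity, so VF is only set then
    if ms:
        state.vfirst = vf
        state.persist = 0
    return overflow


def strip_mine(count, mvl):
    """Mimic the 'core concept loop': return the VL used on each pass."""
    state, gpr, chunks = SVState(), [0] * 32, []
    gpr[4] = count  # stands in for a0 (elements remaining)
    while gpr[4] != 0:
        # setvl a3, a0, MVL=mvl: assembler immediate mvl is encoded as SVi=mvl-1
        setvl(state, gpr, ctr=0, rt=3, ra=4, svi=mvl - 1, vf=0, vs=1, ms=1)
        chunks.append(gpr[3])
        gpr[4] -= gpr[3]  # sub a0, a0, a3
    return chunks


if __name__ == "__main__":
    print(strip_mine(20, 8))
```

With 20 elements and MVL=8 the model runs three passes of 8, 8 and 4 elements, mirroring the "Core concept loop" example. The `getvl` pseudo-op corresponds to calling the model with `vs=0, ms=0`, which leaves `state` unchanged and only copies VL into `gpr[rt]`.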