X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsetvl.mdwn;h=12f7cb52e0d33387658989e9409e797fef65ce0a;hb=0a873ae35b9633a01c0090a30d919cc8c979cdfc;hp=873c50673473800d35852c7640bd5fc64bd5595b;hpb=aebb69ecf9e881f0e376431d1e2d007a8db44dfd;p=libreriscv.git

diff --git a/openpower/sv/setvl.mdwn b/openpower/sv/setvl.mdwn
index 873c50673..12f7cb52e 100644
--- a/openpower/sv/setvl.mdwn
+++ b/openpower/sv/setvl.mdwn
@@ -1,43 +1,333 @@
+[[!tag standards]]
+
 # OpenPOWER SV setvl/setvli
 
 See links:
 
 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
 * <https://bugs.libre-soc.org/show_bug.cgi?id=535>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
+* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
+* old page [[simple_v_extension/specification/sv.setvl]]
+* [[sv/svstep]]
+
+Use of setvl results in changes to the MVL, VL and STATE SPRs. see [[sv/sprs]]â§
+
+# Behaviour and Rationale
+
+SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
+just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
+regfiles: there is no separate Vector register numbering.  Therefore, also
+unlike RVV, SV does not have hard-coded "Lanes": microarchitects
+may use *ordinary* in-order, out-of-order, or superscalar designs
+as the basis for SV. By contrast, the relevant parameter
+in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
+anywhere from 1 to tens of thousands of Lanes in supercomputers.
+
+SV is more like how MMX used to sit on top of the x86 FP regfile.
+Therefore when Vector operations are performed, the question has to
+be asked, "well, how much of the regfile do you want to allocate to
+this operation?" because if it is too small an amount performance may
+be affected, and if too large then other registers would overlap and
+cause data  corruption, or even if allocated correctly would require
+spill to memory.
+
+The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
+is set from an immediate, so that the compiler may decide, statically, a
+guaranteed resource allocation according to the needs of the application.
+
+While RVV's MAXVL was a hw limit, SV's MVL is simply a loop
+optimization. It does not carry side-effects for the arch, though for
+a specific cpu it may affect hw unit usage.
+
+Other than being able to set MVL, SV's VL (Vector Length) works just like
+RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
+set VL to an arbitrary explicit value.  Within the limit of MVL, VL
+**MUST** be set to the requested value. Given that RVV only works on Vector Loops,
+this is fine and part of its value and design.  However, SV sits on top
+of the standard register files.  When MVL=VL=2, a Vector Add on `r3`
+will perform two Scalar Adds: one on `r3` and one on `r4`.
+
+Thus there is the opportunity to set VL to an explicit value (within the
+limits of MVL) with the reasonable expectation that if two operations
+are requested (by setting VL=2) then two operations are guaranteed.
+This avoids the need for a loop (with not-insignificant use of the
+regfiles for counters), simply two instructions:
+
+    setvli r0, MVL=64, VL=64
+    ld r0.v, 0(r30) # load exactly 64 registers from memory
+
+Page Faults etc. aside this is *guaranteed* 100% without fail to perform
+64 unit-strided LDs starting from the address pointed to by r30 and put
+the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
+Predication could even be used to only load relevant registers from
+the stack.  This *only works if VL is set to the requested value* rather
+than, as in RVV, allowing the hardware to set VL to an arbitrary value
+(caveat being, limited to not exceed MVL)
+
+Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`.
+In combination with SVP64 [[sv/branches]] this can save one instruction
+inside critical inner loops.
 
 # Format
 
-| 0..5 |6....10|11....16|17...20|21.22|23.24|25.26|27...30|31|  name   |
-|------|-------|--------|-------|-----|-----|--|--|-------|--|---------|
-| 19   | RT    | RA     |       | XO[0:4]        XO[5:10] |/ | XL-Form |
-| 19   | (RT|0)| (RA|0) |vlimmed      |//   |vs|ms| NNNNN |/ | setvl/i |
+*(Allocation of opcode TBD pending OPF ISA WG approval)*,
+using EXT22 temporarily and fitting into the
+[[sv/bitmanip]] space
+
+Form: SVL-Form (see [[isatables/fields.text]])
+
+| 0.5|6.10|11.15|16..21| 22...25    | 26.30 |31|  name   |
+| -- | -- | --- | ---- |----------- | ----- |--| ------- |
+|OPCD| RT | RA  | SVi  |cv ms vs vf | 11110 |Rc| setvl   |
+
+Instruction format:
+
+    setvl RT,RA,SVi,vf,vs,ms
+    setvl. RT,RA,SVi,vf,vs,ms
+
+Note that the immediate (`SVi`) spans 7 bits (16 to 22)
+
+* `cv` - bit 22 - reads CTR instead of RA 
+* `ms` - bit 23 - allows for setting of MVL.
+* `vs` - bit 24 - allows for setting of VL.
+* `vf` - bit 25 - sets "Vertical First Mode".
+
+Note that in immediate setting mode VL and MVL start from **one**
+i.e. that an immediate value of zero will result in VL/MVL being set to 1.
+0b111111 results in VL/MVL being set to 64. This is because setting
+VL/MVL to 1 results in "scalar identity" behaviour, where setting VL/MVL
+to 0 would result in all Vector operations becoming `nop`.  If this is
+truly desired (nop behaviour) then setting VL and MVL to zero is to be
+done via the [[SVSTATE SPR|sv/sprs]]
+
+Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise
+
+    setvli VL=8    : setvl r5, r0, VL=8
+    setmvli MVL=8  : setvl r0, r0, MVL=8
+
+Additional pseudo-op for obtaining VL without modifying it:
+
+    getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0
+
+For Vertical-First mode, a pseudo-op for explicit incrementing
+of srcstep and dststep:
+
+    svstep.        : setvl. 0, 0, vf=1, vs=0, ms=0
+
+Note that whilst it is possible to set both MVL and VL from the same
+immediate, it is not possible to set them to different immediates in
+the same instruction.  That would require two instructions.
+
+# Vertical First Mode
+
+Vertical First is effectively like an implicit single bit predicate
+applied to every SVP64 instruction.  **ONLY** one element in each
+SVP64 Vector instruction is executed; srcstep and dststep do **not**
+increment, and the Program Counter progresses **immediately* to
+the next instruction just as it would for any standard scalar v3.0B
+instruction.
+
+An explicit mode of setvl is called which can move srcstep and
+dststep on to the next element, still respecting predicate
+masks.  
+
+In other words, where normal SVP64 Vectorisation acts "horizontally"
+by looping first through 0 to VL-1 and only then moving the PC
+to the next instruction, Vertical-First moves the PC onwards
+(vertically) through multiple instructions **with the same
+srcstep and dststep**, then an explict instruction used to
+advance srcstep/dststep, and an outer loop is expected to be
+used (branch instruction) which completes a series of
+Vector operations.
+
+```svstep``` mode is enabled when vf=1, vs=0 and ms=0. 
+When Rc=1 it is possible to determine when any level of
+loops reach an end condition, or if VL has been reached. The immediate can
+be reinterpreted as indicating which SVSTATE (0-3)
+should be tested and placed into CR0.
+
+* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
+  to 1 when either srcstep or dststep reach VL
+* setvl immediate = 2: also include inner middle and outer
+  loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
+* setvl immediate = 3: test SVSTATE1
+* setvl immediate = 4: test SVSTATE2
+* setvl immediate = 5: test SVSTATE3
+
+Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
+
+*Programmers should be aware that VL, srcstep and dststep are global in nature.
+Nested looping with different schedules is perfectly possible, as is
+calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*
 
 # Pseudocode
 
     // instruction fields:
-    rd = get_rt_field();
-    ra = get_ra_field();
-    vlimmed = get_immed_field();
+    rd = get_rt_field();         // bits 6..10
+    ra = get_ra_field();         // bits 11..15
+    vc = get_vc_field();         // bit 22
+    vf = get_vf_field();         // bit 23
+    vs = get_vs_field();         // bit 24
+    ms = get_ms_field();         // bit 25
+    Rc = get_Rc_field();         // bit 31
 
-    if vs {
-       VL = vlimmed
-    } else {
-       VL = SPR[SV_VL]
-    }
-    if ms {
-       MVL = vlimmed
+    if vf and not vs and not ms {
+        // increment src/dest step mode
+        // NOTE! this is in no way complete! predication is not included
+        // and neither is SUB-VL mode
+        srcstep = SPR[SV].srcstep
+        dststep = SPR[SV].dststep
+        VL = SPR[SV].VL
+        srcstep++
+        dststep++
+        rollover = (srcstep == VL or dststep == VL)
+        if rollover:
+            // Reset srcstep, dststep, and also exit "Vertical First" mode
+            srcstep = 0
+            dststep = 0
+            MSR[6] = 0
+        SPR[SV].srcstep = srcstep
+        SPR[SV].dststep = dststep
+
+        // write CR? helps for doing Vertical loops, detects end
+        // of Vector Elements
+        if Rc {
+            // update CR to indicate that srcstep/dststep "rolled over"
+            CR0.eq = rollover
+        }
     } else {
-       MVL = SPR[SV_MVL]
-    }
-    // calculate VL
-    VL = min(VL, MVL)
+        // add one. MVL/VL=1..64 not 0..63
+        vlimmed = get_immed_field()+1; //  16..22
+
+        // set VL (or not).
+        // 4 options: from SPR, from immed, from ra, from CTR
+        if vs {
+           // VL to be sourced from fields/regs
+           if vc {
+               VL = CTR
+           } else if ra != 0 {
+               VL = GPR[ra]
+           } else {
+               VL = vlimmed
+           }
+        } else {
+           // VL not to change (except if MVL is reduced)
+           // read from SPRs
+           VL = SPR[SV_VL]
+        }
+
+        // set MVL (or not).
+        // 2 options: from SPR, from immed
+        if ms {
+           MVL = vlimmed
+        } else {
+           // MVL not to change, read from SPRs
+           MVL = SPR[SV_MVL]
+        }
+
+        // calculate (limit) VL
+        VL = min(VL, MVL)
 
-    // store VL, MVL
-    SPR[SV_VL] = VL
-    SPR[SV_MVL] = MVL
+        // store VL, MVL
+        SVSTATE.VL = VL
+        SVSTATE.MVL = MVL
 
-    // write rd
-    if rd != 0 {
-        // rd is not x0
-        regs[rd] = VL;
+        // write rd
+        if rt != 0 {
+            // rt is not zero
+            regs[rt] = VL;
+        }
+        // write CR?
+        if Rc {
+            // update CR from VL (not rt)
+            CR0.eq = (VL == 0)
+            ...
+            ...
+        }
+        // write Vertical-First mode
+        SVSTATE.vf = vf
     }
+
+# Examples
+
+## Core concept loop
+
+```
+loop:
+    setvl a3, a0, MVL=8    #  update a3 with vl
+                           # (# of elements this iteration)
+                           # set MVL to 8
+    # do vector operations at up to 8 length (MVL=8)
+    # ...
+    sub a0, a0, a3   # Decrement count by vl
+    bnez a0, loop    # Any more?
+```
+
+## Loop using Rc=1
+
+    my_fn:
+      li r3, 1000
+      b test
+    loop:
+      sub r3, r3, r4
+      ...
+    test:
+      setvli. r4, r3, MVL=64
+      bne cr0, loop
+    end:
+      blr
+
+## setmvlhi double loop
+
+Two elements per inner loop are executed per instruction. This assumes
+that underlying hardware, when `setmvlhi` requests a parallelism hint of 2
+actually sets a parallelism hint of 2.
+
+This example, in c, would be:
+
+```
+long *r4;
+for (i=0; i < CTR; i++)
+{
+    r4[i+2] += r4[i]
+}
+```
+
+where, clearly, it is not possible to do more
+than 2 elements in parallel at a time: attempting
+to do so would result in data corruption. The compiler
+may be able to determine memory aliases and inform
+hardware at runtime of the maximum safe parallelism
+limit.
+
+Whilst this example could be simplified to simply set VL=2,
+or exploit the fact that overlapping adds have well-defined
+behaviour, this has not been done, here, for illustrative purposes
+in order to demonstrate setmvhli and Vertical-First Mode.
+
+Note, crucially, how r4, r32 and r20 are **NOT** incremented
+inside the inner loop.  The MAXVL reservation is still 8,
+i.e. as srcstep and dststep advance (by 2 elements at a time)
+registers r20-r27 will be used for the first LD, and
+registers r32-39 for the second LD.  `r4+srcstep*8` will be used
+as the elstrided offset for LDs.
+
+```
+   setmvlhi  8, 2 # MVL=8, VFHint=2
+loop:
+    setvl  r1, CTR, vf=1 # VL=r1=MAX(MVL, CTR), VF=1
+    mulli  r1, r1, 8     # multiply by int width
+loopinner:
+    sv.ld r20.v, r4(0) # load VLhint elements (max 2)
+    addi r2, r4, 16    # 2 elements ahead
+    sv.ld r32.v, r2(0) # load VLhint elements (max 2)
+    sv.add r32.v, r20.v, r32.v # x[i+2] += x[i]
+    sv.st r32.v, r2(0) # store VLhint elements
+    svstep.            # srcstep += VLhint
+    bnz loopinner      # repeat until srcstep=VL
+    # now done VL elements, move to next batch
+    add r4, r4, r1     # move r4 pointer forward
+    sv.bnz/ctr loop    # decrement CTR by VL
+```