+<!-- hide -->
+# Links
+* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
+<!-- show -->
# svstep: Vertical-First Stepping and status reporting
SVL-Form
-* svstep RT,SVi,vf (Rc=0)
-* svstep. RT,SVi,vf (Rc=1)
+* svstep RT,RA,SVi,vf (Rc=0)
+* svstep. RT,RA,SVi,vf (Rc=1)
| 0-5|6-10|11-15|16-22| 23-25 | 26-30 |31| Form |
|----|----|-----|------|----------|-------|--|--------- |
-|PO | RT | / | SVi | / / vf | XO |Rc| SVL-Form |
+|PO | RT | RA | SVi | / / vf | XO |Rc| SVL-Form |
Pseudo-code:
**Description**
svstep may be used to enquire about the REMAP Schedule and it may be
-used to alter Vectorisation State. When `vf=1` then stepping occurs.
+used to alter Vectorization State. When `vf=1` then stepping occurs.
When `vf=0` the enquiry is performed without altering internal state.
If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.
through to SVi=4 selects SVSHAPE3.
* When `SVi` is 5, `SVSTATE.srcstep` is returned.
* When `SVi` is 6, `SVSTATE.dststep` is returned.
+* When `SVi` is 7, `SVSTATE.ssubstep` is returned.
+* When `SVi` is 8, `SVSTATE.dsubstep` is returned.
* When `SVi` is 0b1100 pack/unpack in SVSTATE is cleared
* When `SVi` is 0b1101 pack in SVSTATE is set, unpack is cleared
* When `SVi` is 0b1110 unpack in SVSTATE is set, pack is cleared
* Horizontal-First Mode can be used to return all indices,
i.e. walks through all possible states.
-**Vectorisation of svstep itself**
+**Vectorization of svstep itself**
As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as
`sv.svstep`. This will work perfectly well in Horizontal-First
-as it will in Vertical-First Mode.
+as it will in Vertical-First Mode although there are caveats for
+the Deterministic use of looping with Sub-Vectors in Vertical-First mode.
Example: to obtain the full set of possible computed element
-indices use `sv.svstep RT.v,SVI,1` which will store all computed element
+indices use `sv.svstep *RT,SVi,1` which will store all computed element
indices, starting from RT. If Rc=1 then a co-result Vector of CR Fields
will also be returned, comprising the "loop end-points" of each of the inner
loops when either Matrix Mode or DCT/FFT is set. In other words,
SO bit set on the very last element, when all loops reach their maximum
extent.
-*Programmer's note (1): VL in some situations, particularly larger Matrices,
-may exceed 64,
-meaning that `sv.svshape` returning a considerable number of values. Under
-such circumstances `sv.svshape/ew=8` is recommended.*
+*Programmer's note: VL in some situations, particularly larger
+Matrices (5x7x3 will set MAXVL=105), will cause `sv.svstep` to return a
+considerable number of values. Under such circumstances `sv.svstep/ew=8`
+is recommended.*
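
The arithmetic behind this note can be checked with a short sketch. It assumes (this is not stated in the text above) that an `ew=8` override packs each returned index into 8 bits of a 64-bit register; `regs_needed` is a hypothetical helper used only for illustration.

```python
import math

def regs_needed(num_elements, elwidth_bits, reg_bits=64):
    """Registers needed to hold num_elements values of elwidth_bits each.

    Assumption: elements are packed back-to-back into reg_bits-wide registers.
    """
    return math.ceil(num_elements * elwidth_bits / reg_bits)

maxvl = 5 * 7 * 3                            # Matrix dimensions from the note
largest_index = maxvl - 1                    # indices run from 0 to MAXVL-1
fits_in_8_bits = largest_index < (1 << 8)    # can every index be held in 8 bits?

print(maxvl, fits_in_8_bits)
print(regs_needed(maxvl, 64), regs_needed(maxvl, 8))
```

Under that packing assumption, 105 indices at the default width occupy 105 registers, whereas at 8 bits they occupy 14.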
-*Programmer's note (2): having conveniently obtained a pre-computed
-Schedule with `sv.svstep`,
-it may then be used as the input to Indexed REMAP Mode
-to achieve the exact same Schedule. It is evident however that
+*Programmer's note: having conveniently obtained a pre-computed Schedule
+with `sv.svstep`, it may then be used as the input to Indexed REMAP
+Mode to achieve the exact same Schedule. It is evident however that
before use some of the Indices may be arbitrarily altered as desired.
`sv.svstep` helps the programmer avoid having to manually recreate
-Indices for certain
-types of common Loop patterns, and in its simplest form, without REMAP
-(SVi=5 or SVi=6),
-is equivalent to the `iota` instruction found in other Vector ISAs*
+Indices for certain types of common Loop patterns. In its simplest form,
+without REMAP (SVi=5 or SVi=6), it is equivalent to the `iota` instruction
+found in other Vector ISAs.*
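
As a rough illustration of the note above, the following sketch models the no-REMAP case as a plain `iota` and then reuses the (altered) result as an index vector. `iota` and `indexed_gather` are hypothetical names, and treating Indexed REMAP as a simple gather is an assumption made only for this example.

```python
def iota(vl):
    """Model of the values sv.svstep would store with SVi=5 or 6 and no REMAP."""
    return list(range(vl))

def indexed_gather(data, indices):
    """Assumed, simplified model of Indexed REMAP: element i reads data[indices[i]]."""
    return [data[i] for i in indices]

schedule = iota(4)                                    # pre-computed Schedule
schedule[1], schedule[2] = schedule[2], schedule[1]   # arbitrarily alter two Indices
data = [10, 20, 30, 40]
result = indexed_gather(data, schedule)
print(schedule, result)
```
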
**Vertical First Mode**
Vertical First is effectively like an implicit single bit predicate
-applied to every SVP64 instruction. **ONLY** one element in each
-SVP64 Vector instruction is executed; srcstep and dststep do **not**
-increment, and the Program Counter progresses **immediately** to
-the next instruction just as it would for any standard scalar v3.0B
-instruction.
-
-A mode of srcstep (SVi=0) is called which can move srcstep and
-dststep on to the next element, still respecting predicate
-masks.
-
-In other words, where normal SVP64 Vectorisation acts "horizontally"
-by looping first through 0 to VL-1 and only then moving the PC
-to the next instruction, Vertical-First moves the PC onwards
-(vertically) through multiple instructions **with the same
-srcstep and dststep**, then an explict instruction used to
-advance srcstep/dststep. An outer loop is expected to be
-used (branch instruction) which completes a series of
-Vector operations.
-
-Testing any end condition of any loop of any REMAP state allows branches to be
-used to create loops.
-
-Programmer's note: when Predicate Non-Zeroing is used this indicates to
+applied to every SVP64 instruction. **ONLY** one element in each SVP64
+Vector instruction is executed; srcstep and dststep do **not** increment
+automatically on completion of one instruction, and the Program Counter
+progresses **immediately** to the next instruction just as it would for
+any standard scalar v3.0B instruction.
+
+A mode of svstep (SVi=0) is called which can move srcstep and dststep
+on to the next element, still respecting predicate masks.
+
+In other words, where normal SVP64 Vectorization acts "horizontally"
+by looping first through 0 to VL-1 and only then moving the PC to the
+next instruction, Vertical-First moves the PC onwards (vertically)
+through multiple instructions **with the same srcstep and dststep**,
+then an explicit instruction is used to advance srcstep/dststep. An outer
+loop is expected to be used (branch instruction) which completes a series
+of Vector operations.
+
+Testing any end condition of any loop of any REMAP state allows branches
+to be used to create loops.
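
The difference in execution order can be sketched as a trace of `(instruction, step)` pairs. This is a simplified model, not the hardware behaviour: it assumes no predication, no REMAP, `VL >= 1`, and that srcstep and dststep are equal. The function names are illustrative only.

```python
def horizontal_first(instructions, vl):
    """Each instruction loops over all elements before the PC moves on."""
    return [(insn, step) for insn in instructions for step in range(vl)]

def vertical_first(instructions, vl):
    """The PC walks through every instruction with the same step value,
    then an explicit svstep advances the step and a branch loops back."""
    trace = []
    step = 0
    while True:
        for insn in instructions:
            trace.append((insn, step))   # only ONE element per instruction
        if step == vl - 1:               # svstep reports the loop end
            break
        step += 1                        # svstep advances to the next element
    return trace

print(horizontal_first(["add", "mul"], 2))
print(vertical_first(["add", "mul"], 2))
```
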
+
+*Programmer's note: when Predicate Non-Zeroing is used this indicates to
the underlying hardware that any masked-out element must be skipped.
-*This includes in Vertical-First Mode*, and programmers should be keenly
-aware that srcstep or dststep or both *may* jump by more than one as
-a result, because the actual request under these circumstances was to execute
-on the first available next *non-masked-out* element.
+*This includes in Vertical-First Mode*, and programmers should be
+keenly aware that srcstep or dststep or both *may* jump by more than
+one as a result, because the actual request under these circumstances
+was to execute on the first available next *non-masked-out* element.
+It should be evident that it is the `sv.svstep` instruction that must
+be Predicated in order for the **entire** loop to use the Predicate
+correctly, and it is strongly recommended for all instructions within
+the same Vertical-First Loop to utilize the exact same Predicate Mask(s).*
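
A minimal sketch of how a step may jump by more than one under a Predicate Mask follows. It assumes a single mask shared by source and destination, and that the very first element is also skipped if masked out; both are assumptions for illustration, and `next_step` and `vertical_first_steps` are hypothetical names.

```python
def next_step(step, vl, mask):
    """Return the next non-masked-out element index after `step`, or None at the end."""
    step += 1
    while step < vl:
        if (mask >> step) & 1:   # bit set: element is active
            return step
        step += 1                # bit clear: element is skipped
    return None

def vertical_first_steps(vl, mask):
    """List the element indices a predicated Vertical-First loop would visit."""
    visited = []
    step = next_step(-1, vl, mask)       # locate the first active element
    while step is not None:
        visited.append(step)
        step = next_step(step, vl, mask) # the predicated step may jump by more than one
    return visited

print(vertical_first_steps(8, 0b10010110))
```

With the mask `0b10010110` only elements 1, 2, 4 and 7 are visited, so the step advances by 1, 2 and then 3.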
+
+Programmers should be aware that VL, srcstep and dststep and the SUBVL
+substeps are global in nature. Nested looping with different schedules
+is perfectly possible, as is calling of functions, however SVSTATE
+(and any associated SVSHAPEs if REMAP is being used) should obviously
+be stored on the stack in order to achieve this benefit not normally
+found in Vector ISAs.
+
+**Use of svstep with Vertical-First sub-vectors**
+
+Incrementing and iterating through the subvector state ssubstep and
+dsubstep is possible with `sv.svstep/vecN` where, as expected, N may be
+2/3/4. However it is necessary to use the exact same Sub-Vector qualifier
+on any Prefixed instructions within any given Vertical-First loop:
+`vec2/3/4` is **not** automatically applied to all instructions, it must
+be explicitly applied on a per-instruction basis. Not specifying a
+Sub-vector qualifier at all is also valid, but it is critically important
+to note that operations will then be repeated. For example, if
+`sv.svstep/vec2` is used but `vec2` is not applied to `sv.addi`, then each
+Vector element operation of the `sv.addi` is repeated twice. The reason
+is that whilst svstep will be iterating through both the SUBVL and VL
+loops, the addi instruction only uses `srcstep` and `dststep` (not
+ssubstep or dsubstep). Illustrated below:
+
+```
+ def offset():
+     for step in range(VL):
+         for substep in range(SUBVL=2):
+             yield step, substep
+ for i, j in offset():
+     vec2_offs = i * SUBVL + j # calculate vec2 offset
+     addi RT+i, RA+i, 1 # but sv.addi is not vec2!
+     muli/vec2 RT+vec2_offs, RA+vec2_offs, 2 # this is
+```
+
+Actual assembler would be:
+
+```
+ loop:
+ setvl VF=1, CTRmode
+ sv.addi *RT, *RA, 1 # no vec2
+ sv.muli/vec2 *RT, *RA, 2 # vec2
+ sv.svstep/vec2 # must match the muli
+ sv.bc CTRmode, loop # subtracts VL from CTR
+```
+
+This illustrates the correct but seemingly-anomalous behaviour: `sv.svstep/vec2`
+is being requested to update `SVSTATE` to follow a vec2 loop construct. The anomalous
+`sv.addi` is not prohibited as it may in fact be desirable to execute operations twice,
+or to re-load data that was overwritten, and many other possibilities.
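
The repetition can be made concrete with a small runnable trace. This is only a sketch of the offsets involved, assuming no predication and no REMAP; `vertical_first_vec2` is a hypothetical name.

```python
def vertical_first_vec2(vl, subvl=2):
    """Trace the element offsets touched by each instruction in the loop."""
    addi_offsets = []
    muli_offsets = []
    for step in range(vl):
        for substep in range(subvl):
            # sv.addi has no vec2 qualifier: it ignores substep entirely
            addi_offsets.append(step)
            # sv.muli/vec2 uses both the step and the substep
            muli_offsets.append(step * subvl + substep)
    return addi_offsets, muli_offsets

addi, muli = vertical_first_vec2(vl=3)
print(addi)
print(muli)
```

For `vl=3` the unqualified instruction visits offsets 0, 0, 1, 1, 2, 2 while the `vec2` instruction visits 0 through 5 once each.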
+
+-------------
+
+\newpage{}
+
+# Appendix
+
+**src_iterate**
+
+Note that `srcstep` and `ssubstep` are not the absolute final Element
+(and Sub-Element) offsets. `srcstep` still has to go through individual
+`REMAP` translation before becoming a per-operand (RA, RB, RC, RT, RS)
+Element-level Source offset.
+
+Note also critically that `PACK` mode simply inverts the outer/inner
+loops, making SUBVL the outer loop and VL the inner.
+
+```
+ # source-stepping iterator
+ subvl = SVSTATE.subvl
+ vl = SVSTATE.vl
+ pack = SVSTATE.pack
+ unpack = SVSTATE.unpack
+ ssubstep = SVSTATE.ssubstep
+ end_ssub = ssubstep == subvl
+ end_src = SVSTATE.srcstep == vl-1
+ # first source step.
+ srcstep = SVSTATE.srcstep
+ # used below:
+ # sz - from RM.MODE, source-zeroing
+ # srcmask - from RM.MODE, the source predicate
+ if pack:
+     # pack advances subvl in *outer* loop
+     while True:
+         assert srcstep <= vl-1
+         end_src = srcstep == vl-1
+         if end_src:
+             if end_ssub:
+                 loopend = True
+             else:
+                 SVSTATE.ssubstep += 1
+             srcstep = 0 # reset
+             break
+         else:
+             srcstep += 1 # advance srcstep
+             if not sz:
+                 break
+             if ((1 << srcstep) & srcmask) != 0:
+                 break
+ else:
+     # advance subvl in *inner* loop
+     if end_ssub:
+         while True:
+             assert srcstep <= vl-1
+             end_src = srcstep == vl-1
+             if end_src: # end-point
+                 loopend = True
+                 srcstep = 0
+                 break
+             else:
+                 srcstep += 1
+             if not sz:
+                 break
+             if ((1 << srcstep) & srcmask) != 0:
+                 break
+             else:
+                 log(" sskip", bin(srcmask), bin(1 << srcstep))
+         SVSTATE.ssubstep = 0b00 # reset
+     else:
+         # advance ssubstep
+         SVSTATE.ssubstep += 1
+
+ SVSTATE.srcstep = srcstep
+```
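
The effect of `PACK` on iteration order can be sketched with a small generator. This is a simplified model and not the pseudo-code above: predication and zeroing are ignored, and `subvl` here is the plain sub-element count (1 to 4), whereas the pseudo-code compares `ssubstep` directly against `SVSTATE.subvl`. `source_order` is a hypothetical name.

```python
def source_order(vl, subvl, pack=False):
    """Yield (srcstep, ssubstep) pairs in the order they would be visited."""
    if pack:
        # PACK: SUBVL is the outer loop, VL the inner
        for ssubstep in range(subvl):
            for srcstep in range(vl):
                yield srcstep, ssubstep
    else:
        # default: VL is the outer loop, SUBVL the inner
        for srcstep in range(vl):
            for ssubstep in range(subvl):
                yield srcstep, ssubstep

print(list(source_order(2, 2)))
print(list(source_order(2, 2, pack=True)))
```
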
+
+-------------
+
+\newpage{}
+
+**dest_iterate**
+
+Note that `dststep` and `dsubstep` are not the absolute final Element
+(and Sub-Element) offsets. `dststep` still has to go through individual
+`REMAP` translation before becoming a per-operand (RT, RS/EA) destination
+Element-level offset, and `dsubstep` may also go through `(f)mv.swizzle`
+reordering.
+
+Note also critically that `UNPACK` mode simply inverts the outer/inner
+loops, making SUBVL the outer loop and VL the inner.
+
+```
+ # dest step iterator
+ vl = SVSTATE.vl
+ subvl = SVSTATE.subvl
+ unpack = SVSTATE.unpack
+ dsubstep = SVSTATE.dsubstep
+ end_dsub = dsubstep == subvl
+ dststep = SVSTATE.dststep
+ end_dst = dststep == vl-1
+ # used below:
+ # dz - from RM.MODE, destination-zeroing
+ # dstmask - from RM.MODE, the destination predicate
+ if unpack:
+     # unpack advances subvl in *outer* loop
+     while True:
+         assert dststep <= vl-1
+         end_dst = dststep == vl-1
+         if end_dst:
+             if end_dsub:
+                 loopend = True
+             else:
+                 SVSTATE.dsubstep += 1
+             dststep = 0 # reset
+             break
+         else:
+             dststep += 1 # advance dststep
+             if not dz:
+                 break
+             if ((1 << dststep) & dstmask) != 0:
+                 break
+ else:
+     # advance subvl in *inner* loop
+     if end_dsub:
+         while True:
+             assert dststep <= vl-1
+             end_dst = dststep == vl-1
+             if end_dst: # end-point
+                 loopend = True
+                 dststep = 0
+                 break
+             else:
+                 dststep += 1
+             if not dz:
+                 break
+             if ((1 << dststep) & dstmask) != 0:
+                 break
+         SVSTATE.dsubstep = 0b00 # reset
+     else:
+         # advance dsubstep
+         SVSTATE.dsubstep += 1
+
+ SVSTATE.dststep = dststep
+```
+
+-------------
+
+\newpage{}
+
+**SVSTATE_NEXT**
-*Programmers should be aware that VL, srcstep and dststep are global in nature.
-Nested looping with different schedules is perfectly possible, as is
-calling of functions, however SVSTATE (and any associated SVSTATE) should
-obviously be stored on the stack in order to achieve this benefit*
+```
+ if SVi = 1 then return REMAP SVSHAPE0 current offset
+ if SVi = 2 then return REMAP SVSHAPE1 current offset
+ if SVi = 3 then return REMAP SVSHAPE2 current offset
+ if SVi = 4 then return REMAP SVSHAPE3 current offset
+ if SVi = 5 then return SVSTATE.srcstep # VL source step
+ if SVi = 6 then return SVSTATE.dststep # VL dest step
+ if SVi = 7 then return SVSTATE.ssubstep # SUBVL source step
+ if SVi = 8 then return SVSTATE.dsubstep # SUBVL dest step
+
+ # SVi=0, explicit iteration requested
+ src_iterate();
+ dst_iterate();
+ return 0
+```
+
+**at_loopend**
+
+Both Vertical-First and Horizontal-First may use this algorithm to
+determine if the "end-of-looping" (end of Sub-Program-Counter) has
+been reached. Horizontal-First Mode will immediately move to the
+next instruction, where `svstep.` will set `CR0.EQ` to 1.
+
+```
+ # tells if this is the last possible element.
+ subvl = SVSTATE.subvl
+ vl = SVSTATE.vl
+ end_ssub = SVSTATE.ssubstep == subvl
+ end_dsub = SVSTATE.dsubstep == subvl
+ if SVSTATE.srcstep == vl-1 and end_ssub:
+ return True
+ if SVSTATE.dststep == vl-1 and end_dsub:
+ return True
+ return False
+```
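
A runnable rendering of the check above is sketched here, using a plain dataclass as a stand-in for `SVSTATE`. The `SVState` class is hypothetical, and the comment on `subvl` records an assumption drawn from the way the pseudo-code compares `ssubstep == subvl`: that the field holds the last valid substep index rather than the sub-element count.

```python
from dataclasses import dataclass

@dataclass
class SVState:
    """Hypothetical stand-in for the SVSTATE fields used by at_loopend."""
    vl: int
    subvl: int          # assumed: last valid substep index (count minus one)
    srcstep: int = 0
    dststep: int = 0
    ssubstep: int = 0
    dsubstep: int = 0

def at_loopend(state):
    """Tell whether this is the last possible element."""
    end_ssub = state.ssubstep == state.subvl
    end_dsub = state.dsubstep == state.subvl
    if state.srcstep == state.vl - 1 and end_ssub:
        return True
    if state.dststep == state.vl - 1 and end_dsub:
        return True
    return False

last = SVState(vl=4, subvl=1, srcstep=3, ssubstep=1)
not_last = SVState(vl=4, subvl=1, srcstep=3, ssubstep=0)
print(at_loopend(last), at_loopend(not_last))
```
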
[[!tag standards]]
\newpage{}
+