As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as
`sv.svstep`. This will work perfectly well in Horizontal-First
-as it will in Vertical-First Mode.
+as it will in Vertical-First Mode, although there are caveats for
+the Deterministic use of looping with Sub-Vectors in Vertical-First Mode.
Example: to obtain the full set of possible computed element
-indices use `sv.svstep RT.v,SVI,1` which will store all computed element
+indices use `sv.svstep *RT,SVi,1` which will store all computed element
indices, starting from RT. If Rc=1 then a co-result Vector of CR Fields
will also be returned, comprising the "loop end-points" of each of the inner
loops when either Matrix Mode or DCT/FFT is set. In other words,
*Programmer's note: VL in some situations, particularly larger
Matrices (5x7x3 will set MAXVL=105), will cause `sv.svstep` to return a
considerable number of values. Under such circumstances `sv.svstep/ew=8`
-is recommended, followed likewise by setting elwidth=8 on `svindex`*
+is recommended.*
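
As a rough illustration of why the narrower element width helps, the sketch below counts the registers that a 5x7x3 Schedule would occupy. It assumes 64-bit GPRs and that elements are packed contiguously at the overridden element width; `regs_needed` is a hypothetical helper written for this note, not part of any toolchain.

```python
import math

def regs_needed(dims, elwidth_bits, reg_bits=64):
    """Return (element count, registers occupied) for a Schedule.

    Assumption: elements are packed back-to-back into reg_bits-wide
    registers at the given element width.
    """
    n = math.prod(dims)                  # 5 * 7 * 3 = 105 elements
    per_reg = reg_bits // elwidth_bits   # elements that fit in one register
    return n, math.ceil(n / per_reg)

print(regs_needed((5, 7, 3), 64))  # (105, 105): one element per register
print(regs_needed((5, 7, 3), 8))   # (105, 14): eight elements per register
```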
*Programmer's note: having conveniently obtained a pre-computed Schedule
with `sv.svstep`, it may then be used as the input to Indexed REMAP
be stored on the stack in order to achieve this benefit not normally
found in Vector ISAs.
+**Use of svstep with Vertical-First sub-vectors**
+
+Incrementing and iterating through the Sub-Vector state `ssubstep` and `dsubstep` is
+possible with `sv.svstep/vecN`, where as expected N may be 2, 3 or 4. However it is necessary
+to use the exact same Sub-Vector qualifier on any Prefixed
+instructions within any given Vertical-First loop: `vec2/3/4` is **not**
+automatically applied to all instructions, and must be explicitly applied on
+a per-instruction basis. Not specifying a Sub-Vector
+qualifier at all
+is also valid, but it is critically important to note that
+the operations of that instruction will then be repeated. For example if `sv.svstep/vec2`
+is used but `vec2` is not applied to `sv.addi`, then each `sv.addi` Vector element operation is
+repeated twice. The reason is that whilst `svstep` will be
+iterating through both the SUBVL and VL loops, the `addi` instruction
+only uses `srcstep` and `dststep` (not `ssubstep` or `dsubstep`). Illustrated below:
+
+```
+    def offset():
+        for step in range(VL):
+            for substep in range(SUBVL):  # SUBVL=2
+                yield step, substep
+    for i, j in offset():
+        vec2_offs = i * SUBVL + j  # calculate vec2 offset
+        addi RT+i, RA+i, 1  # but sv.addi is not vec2!
+        muli/vec2 RT+vec2_offs, RA+vec2_offs, 2  # this is vec2
+```
+
+Actual assembler would be:
+
+```
+ loop:
+ setvl VF=1, CTRmode
+ sv.addi *RT, *RA, 1 # no vec2
+ sv.muli/vec2 *RT, *RA, 2 # vec2
+ sv.svstep/vec2 # must match the muli
+ sv.bc CTRmode, loop # subtracts VL from CTR
+```
+
+This illustrates the correct but seemingly-anomalous behaviour: `sv.svstep/vec2`
+is being requested to update `SVSTATE` to follow a vec2 loop construct. The anomalous
+`sv.addi` is not prohibited as it may in fact be desirable to execute operations twice,
+or to re-load data that was overwritten, and many other possibilities.
+
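The repetition described above can be checked with a small runnable model. This is only a sketch of the stepping order given in the pseudocode (inner SUBVL loop, outer VL loop), not a simulation of real hardware. The names `vf_schedule` and `trace` are hypothetical, and the values `vl=3`, `subvl=2` are chosen purely for illustration. An instruction without `/vec2` sees only `step`, so it visits each element offset twice, while the `/vec2` instruction visits every sub-element exactly once.

```python
def vf_schedule(vl, subvl):
    """Yield (step, substep) in the order sv.svstep/vecN is described as following."""
    for step in range(vl):
        for substep in range(subvl):
            yield step, substep

def trace(vl=3, subvl=2):
    plain = []  # offsets used by an instruction WITHOUT the vec2 qualifier
    vec2 = []   # offsets used by an instruction WITH the vec2 qualifier
    for step, substep in vf_schedule(vl, subvl):
        plain.append(step)                   # ignores substep entirely
        vec2.append(step * subvl + substep)  # full sub-vector offset
    return plain, vec2

plain, vec2 = trace()
print(plain)  # [0, 0, 1, 1, 2, 2]
print(vec2)   # [0, 1, 2, 3, 4, 5]
```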
-------------
\newpage{}