# svstep: Vertical-First Stepping and status reporting

SVL-Form

* svstep RT,SVi,vf (Rc=0)
* svstep. RT,SVi,vf (Rc=1)

| 0-5 | 6-10 | 11-15 | 16-22 | 23-25    | 26-30 | 31 | Form     |
|-----|------|-------|-------|----------|-------|----|----------|
| PO  | RT   | /     | SVi   | / / vf   | XO    | Rc | SVL-Form |

Pseudo-code:

```
if SVi[3:4] = 0b11 then
    # store pack and unpack in SVSTATE
    SVSTATE[53] <- SVi[5]
    SVSTATE[54] <- SVi[6]
    RT <- [0]*62 || SVSTATE[53:54]
else
    # Vertical-First explicit stepping.
    step <- SVSTATE_NEXT(SVi, vf)
    RT <- [0]*57 || step
```

Special Registers Altered:

    CR0 (if Rc=1)

**Description**

svstep may be used to enquire about the REMAP Schedule and it may be used to alter Vectorization State. When `vf=1` then stepping occurs. When `vf=0` the enquiry is performed without altering internal state. If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.

The following Modes exist:

* `SVi=0`: appropriately step srcstep, dststep, subsrcstep and subdststep to the next element, taking pack and unpack into consideration.
* When `SVi` is 1-4 the REMAP Schedule for a given SVSHAPE may be returned in `RT`. SVi=1 selects SVSHAPE0 current state, through to SVi=4 selects SVSHAPE3.
* When `SVi` is 5, `SVSTATE.srcstep` is returned.
* When `SVi` is 6, `SVSTATE.dststep` is returned.
* When `SVi` is 7, `SVSTATE.ssubstep` is returned.
* When `SVi` is 8, `SVSTATE.dsubstep` is returned.
* When `SVi` is 0b1100 pack/unpack in SVSTATE is cleared
* When `SVi` is 0b1101 pack in SVSTATE is set, unpack is cleared
* When `SVi` is 0b1110 unpack in SVSTATE is set, pack is cleared
* When `SVi` is 0b1111 pack/unpack in SVSTATE are set

As this is a Single-Predicated (1P) instruction, predication may be applied to skip (or zero) elements.

* Vertical-First Mode will return the requested index (and move to the next state if `vf=1`)
* Horizontal-First Mode can be used to return all indices, i.e. walks through all possible states.
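
The `SVi` modes listed above can be summarised as a small lookup. The following is a minimal Python sketch, not part of the specification: the function name `describe_svi` and the returned strings are purely illustrative, and the treatment of the two low bits of `SVi` as pack and unpack is inferred only from the 0b1100 to 0b1111 entries in the list above.

```python
def describe_svi(svi: int) -> str:
    """Return a short description of the svstep mode selected by SVi.

    Hypothetical helper for illustration only; it mirrors the list of
    Modes in the Description and does not model any register state.
    """
    if svi == 0:
        return "step srcstep/dststep/ssubstep/dsubstep to the next element"
    if 1 <= svi <= 4:
        # SVi=1 selects SVSHAPE0, through to SVi=4 selecting SVSHAPE3
        return f"return REMAP Schedule for SVSHAPE{svi - 1}"
    state_fields = {5: "srcstep", 6: "dststep", 7: "ssubstep", 8: "dsubstep"}
    if svi in state_fields:
        return f"return SVSTATE.{state_fields[svi]}"
    if 0b1100 <= svi <= 0b1111:
        # assumption: bit 0 carries pack, bit 1 carries unpack,
        # as implied by the 0b1101 and 0b1110 entries
        pack = svi & 0b01
        unpack = (svi & 0b10) >> 1
        return f"set pack={pack}, unpack={unpack}"
    return "not described in this section"


for value in (0, 3, 7, 0b1101, 0b1110):
    print(f"SVi={value:#06b}: {describe_svi(value)}")
```

Values of `SVi` outside the listed Modes are deliberately reported as undescribed rather than guessed at.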

**Vectorization of svstep itself**

As a 32-bit instruction, `svstep` may itself be Vector-Prefixed, as `sv.svstep`. This works as well in Horizontal-First as it does in Vertical-First Mode, although there are caveats for the Deterministic use of looping with Sub-Vectors in Vertical-First mode.

Example: to obtain the full set of possible computed element indices use `sv.svstep *RT,SVi,1` which will store all computed element indices, starting from RT. If Rc=1 then a co-result Vector of CR Fields will also be returned, comprising the "loop end-points" of each of the inner loops when either Matrix Mode or DCT/FFT is set. For example, when the `xdim` inner loop reaches its end, such that the next iteration begins again at zero, the CR Field `EQ` bit will be set. With a maximum of three loops within both Matrix and DCT/FFT Modes, the CR Field's EQ bit will be set at the end of the first inner loop, the LT bit for the second, the GT bit for the outermost loop and the SO bit set on the very last element, when all loops reach their maximum extent.

*Programmer's note: VL in some situations, particularly larger Matrices (5x7x3 will set MAXVL=105), will cause `sv.svstep` to return a considerable number of values. Under such circumstances `sv.svstep/ew=8` is recommended.*

*Programmer's note: having conveniently obtained a pre-computed Schedule with `sv.svstep`, it may then be used as the input to Indexed REMAP Mode to achieve the exact same Schedule. Before use, however, some of the Indices may be arbitrarily altered as desired. `sv.svstep` helps the programmer avoid having to manually recreate Indices for certain types of common Loop patterns. In its simplest form, without REMAP (SVi=5 or SVi=6), `sv.svstep` is equivalent to the `iota` instruction found in other Vector ISAs.*

**Vertical First Mode**

Vertical First is effectively like an implicit single bit predicate applied to every SVP64 instruction.
**ONLY** one element in each SVP64 Vector instruction is executed; srcstep and dststep do **not** increment automatically on completion of one instruction, and the Program Counter progresses **immediately** to the next instruction just as it would for any standard scalar v3.0B instruction.

A mode of svstep (SVi=0) is provided which moves srcstep and dststep on to the next element, still respecting predicate masks.

In other words, where normal SVP64 Vectorization acts "horizontally" by looping first through 0 to VL-1 and only then moving the PC to the next instruction, Vertical-First moves the PC onwards (vertically) through multiple instructions **with the same srcstep and dststep**, after which an explicit instruction is used to advance srcstep/dststep. An outer loop is expected to be used (branch instruction) which completes a series of Vector operations.

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmer's note: when Predicate Non-Zeroing is used this indicates to the underlying hardware that any masked-out element must be skipped. **This includes Vertical-First Mode**, and programmers should be keenly aware that srcstep or dststep or both **may** jump by more than one as a result, because the actual request under these circumstances was to execute on the first available next **non-masked-out** element. It should be evident that it is the `sv.svstep` instruction that must be Predicated in order for the **entire** loop to use the Predicate correctly, and it is strongly recommended for all instructions within the same Vertical-First Loop to utilise the exact same Predicate Mask(s).*

Programmers should be aware that VL, srcstep and dststep and the SUBVL substeps are global in nature.
Nested looping with different schedules is perfectly possible, as is calling of functions; however SVSTATE (and any associated SVSHAPEs if REMAP is being used) should be saved on the stack in order to achieve this benefit, one not normally found in Vector ISAs.

**Use of svstep with Vertical-First sub-vectors**

Incrementing and iteration through subvector state ssubstep and dsubstep is possible with `sv.svstep/vecN` where as expected N may be 2/3/4. However it is necessary to use the exact same Sub-Vector qualifier on any Prefixed instructions within any given Vertical-First loop: `vec2/3/4` is **not** automatically applied to all instructions, it must be explicitly applied on a per-instruction basis. Omitting the Sub-vector qualifier altogether is also valid, but it is critically important to note that operations will then be repeated. For example, if `sv.svstep/vec2` is used but `sv.addi` is not given the `vec2` qualifier, then each Vector element operation of the `sv.addi` is repeated twice. The reason is that whilst svstep will be iterating through both the SUBVL and VL loops, the addi instruction only uses `srcstep` and `dststep` (not ssubstep or dsubstep). Illustrated below:

```
def offset():
    for step in range(VL):
        for substep in range(SUBVL=2):
            yield step, substep
for i, j in offset():
    vec2_offs = i * SUBVL + j # calculate vec2 offset
    addi RT+i, RA+i, 1        # but sv.addi is not vec2!
    muli/vec2 RT+vec2_offs, RA+vec2_offs, 2 # this is
```

Actual assembler would be:

```
loop:
    setvl VF=1, CTRmode
    sv.addi *RT, *RA, 1      # no vec2
    sv.muli/vec2 *RT, *RA, 2 # vec2
    sv.svstep/vec2           # must match the muli
    sv.bc CTRmode, loop      # subtracts VL from CTR
```

This illustrates the correct but seemingly-anomalous behaviour: `sv.svstep/vec2` is being requested to update `SVSTATE` to follow a vec2 loop construct. The anomalous `sv.addi` is not prohibited as it may in fact be desirable to execute operations twice, or to re-load data that was overwritten, and many other possibilities.
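
The repetition described above can be made concrete with a small runnable model. This is only a sketch in Python under stated assumptions: the function name `vf_vec2_trace` is hypothetical, pack/unpack are assumed clear (SUBVL is the inner loop), no predication is applied, and only the element offsets touched by each instruction are recorded, not any register contents.

```python
def vf_vec2_trace(vl: int, subvl: int = 2):
    """Record the element offset used by each instruction on every pass
    of a Vertical-First loop stepped by sv.svstep/vec2 (illustrative)."""
    trace = []
    for step in range(vl):              # VL loop: srcstep/dststep
        for substep in range(subvl):    # SUBVL loop: ssubstep/dsubstep
            vec2_offs = step * subvl + substep
            # sv.addi has no vec2 qualifier: it only sees step
            trace.append(("addi", step))
            # sv.muli/vec2 sees both step and substep
            trace.append(("muli/vec2", vec2_offs))
    return trace


trace = vf_vec2_trace(vl=2)
addi_offsets = [offs for name, offs in trace if name == "addi"]
muli_offsets = [offs for name, offs in trace if name == "muli/vec2"]
print(addi_offsets)  # [0, 0, 1, 1]: each element visited twice
print(muli_offsets)  # [0, 1, 2, 3]: each sub-element visited once
```

With VL=2 and SUBVL=2 the loop body runs four times, so the unqualified `addi` touches each of its two elements twice whilst the `vec2` operation touches four distinct sub-elements once each.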

-------------

\newpage{}

# Appendix

**src_iterate**

Note that `srcstep` and `ssubstep` are not the absolute final Element (and Sub-Element) offsets. `srcstep` still has to go through individual `REMAP` translation before becoming a per-operand (RA, RB, RC, RT, RS) Element-level Source offset.

Note also critically that `PACK` mode simply inverts the outer/inner loop order, making SUBVL the outer loop and VL the inner.

```
# source-stepping iterator
subvl = SVSTATE.subvl
vl = SVSTATE.vl
pack = SVSTATE.pack
unpack = SVSTATE.unpack
ssubstep = SVSTATE.ssubstep
end_ssub = ssubstep == subvl
end_src = SVSTATE.srcstep == vl-1
# first source step.
srcstep = SVSTATE.srcstep
# used below:
# sz - from RM.MODE, source-zeroing
# srcmask - from RM.MODE, the source predicate
if pack:
    # pack advances subvl in *outer* loop
    while True:
        assert srcstep <= vl-1
        end_src = srcstep == vl-1
        if end_src:
            if end_ssub:
                loopend = True
            else:
                SVSTATE.ssubstep += 1
            srcstep = 0  # reset
            break
        else:
            srcstep += 1  # advance srcstep
            if not sz:
                break
            if ((1 << srcstep) & srcmask) != 0:
                break
else:
    # advance subvl in *inner* loop
    if end_ssub:
        while True:
            assert srcstep <= vl-1
            end_src = srcstep == vl-1
            if end_src: # end-point
                loopend = True
                srcstep = 0
                break
            else:
                srcstep += 1
            if not sz:
                break
            if ((1 << srcstep) & srcmask) != 0:
                break
            else:
                log("      sskip", bin(srcmask), bin(1 << srcstep))
        SVSTATE.ssubstep = 0b00 # reset
    else:
        # advance ssubstep
        SVSTATE.ssubstep += 1
SVSTATE.srcstep = srcstep
```

-------------

\newpage{}

**dest_iterate**

Note that `dststep` and `dsubstep` are not the absolute final Element (and Sub-Element) offsets. `dststep` still has to go through individual `REMAP` translation before becoming a per-operand (RT, RS/EA) destination Element-level offset, and `dsubstep` may also go through `(f)mv.swizzle` reordering.

Note also critically that `UNPACK` mode simply inverts the outer/inner loop order, making SUBVL the outer loop and VL the inner.

```
# dest step iterator
vl = SVSTATE.vl
subvl = SVSTATE.subvl
unpack = SVSTATE.unpack
dsubstep = SVSTATE.dsubstep
end_dsub = dsubstep == subvl
dststep = SVSTATE.dststep
end_dst = dststep == vl-1
# used below:
# dz - from RM.MODE, destination-zeroing
# dstmask - from RM.MODE, the destination predicate
if unpack:
    # unpack advances subvl in *outer* loop
    while True:
        assert dststep <= vl-1
        end_dst = dststep == vl-1
        if end_dst:
            if end_dsub:
                loopend = True
            else:
                SVSTATE.dsubstep += 1
            dststep = 0  # reset
            break
        else:
            dststep += 1  # advance dststep
            if not dz:
                break
            if ((1 << dststep) & dstmask) != 0:
                break
else:
    # advance subvl in *inner* loop
    if end_dsub:
        while True:
            assert dststep <= vl-1
            end_dst = dststep == vl-1
            if end_dst: # end-point
                loopend = True
                dststep = 0
                break
            else:
                dststep += 1
            if not dz:
                break
            if ((1 << dststep) & dstmask) != 0:
                break
        SVSTATE.dsubstep = 0b00 # reset
    else:
        # advance dsubstep
        SVSTATE.dsubstep += 1
SVSTATE.dststep = dststep
```

-------------

\newpage{}

**SVSTATE_NEXT**

```
if SVi = 1 then return REMAP SVSHAPE0 current offset
if SVi = 2 then return REMAP SVSHAPE1 current offset
if SVi = 3 then return REMAP SVSHAPE2 current offset
if SVi = 4 then return REMAP SVSHAPE3 current offset
if SVi = 5 then return SVSTATE.srcstep  # VL source step
if SVi = 6 then return SVSTATE.dststep  # VL dest step
if SVi = 7 then return SVSTATE.ssubstep # SUBVL source step
if SVi = 8 then return SVSTATE.dsubstep # SUBVL dest step
# SVi=0, explicit iteration requested
src_iterate();
dst_iterate();
return 0
```

**at_loopend**

Both Vertical-First and Horizontal-First may use this algorithm to determine if the "end-of-looping" (end of Sub-Program-Counter) has been reached. Horizontal-First Mode will immediately move to the next instruction, where `svstep.` will set `CR0.EQ` to 1.

```
# tells if this is the last possible element.
subvl = SVSTATE.subvl
vl = SVSTATE.vl
end_ssub = SVSTATE.ssubstep == subvl
end_dsub = SVSTATE.dsubstep == subvl
if SVSTATE.srcstep == vl-1 and end_ssub:
    return True
if SVSTATE.dststep == vl-1 and end_dsub:
    return True
return False
```

[[!tag standards]]

-------------

\newpage{}