X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsvstep.mdwn;h=73505acc0d921555b865cb4685bdd062b101127b;hb=659e698716659496567b766f2b47131459e0da6c;hp=e537b7545e2dd27acc1cf1eb2b7f782b337b5fe7;hpb=7c383a571f8fdd8c555b676814ee6aceada642d9;p=libreriscv.git

diff --git a/openpower/sv/svstep.mdwn b/openpower/sv/svstep.mdwn
index e537b7545..73505acc0 100644
--- a/openpower/sv/svstep.mdwn
+++ b/openpower/sv/svstep.mdwn
@@ -1,13 +1,17 @@
+
+# Links
+*
+
 # svstep: Vertical-First Stepping and status reporting
 
 SVL-Form
 
-* svstep RT,SVi,vf (Rc=0)
-* svstep. RT,SVi,vf (Rc=1)
+* svstep RT,RA,SVi,vf (Rc=0)
+* svstep. RT,RA,SVi,vf (Rc=1)
 
 | 0-5|6-10|11.15|16..22| 23-25 | 26-30 |31| Form |
 |----|----|-----|------|----------|-------|--|--------- |
-|PO | RT | / | SVi | / / vf | XO |Rc| SVL-Form |
+|PO | RT | RA | SVi | / / vf | XO |Rc| SVL-Form |
 
 Pseudo-code:
 
@@ -30,7 +34,7 @@ Special Registers Altered:
 **Description**
 
 svstep may be used to enquire about the REMAP Schedule and it may be
-used to alter Vectorisation State. When `vf=1` then stepping occurs.
+used to alter Vectorization State. When `vf=1` then stepping occurs.
 When `vf=0` the enquiry is performed without altering internal state.
 If `SVi=0, Rc=0, vf=0` the instruction is a `nop`.
 
@@ -43,6 +47,8 @@ The following Modes exist:
 through to SVi=4 selects SVSHAPE3.
 * When `SVi` is 5, `SVSTATE.srcstep` is returned.
 * When `SVi` is 6, `SVSTATE.dststep` is returned.
+* When `SVi` is 7, `SVSTATE.ssubstep` is returned.
+* When `SVi` is 8, `SVSTATE.dsubstep` is returned.
 * When `SVi` is 0b1100 pack/unpack in SVSTATE is cleared
 * When `SVi` is 0b1101 pack in SVSTATE is set, unpack is cleared
 * When `SVi` is 0b1110 unpack in SVSTATE is set, pack is cleared
@@ -56,14 +62,15 @@ to skip (or zero) elements.
 * Horizontal-First Mode can be used to return all indices,
 i.e. walks through all possible states.
 
-**Vectorisation of svstep itself**
+**Vectorization of svstep itself**
 
 As a 32-bit instruction, `svstep` may be itself be Vector-Prefixed, as
 `sv.svstep`. This will work perfectly well in Horizontal-First
-as it will in Vertical-First Mode.
+as it will in Vertical-First Mode although there are caveats for
+the Deterministic use of looping with Sub-Vectors in Vertical-First mode.
 
 Example: to obtain the full set of possible computed element
-indices use `sv.svstep RT.v,SVI,1` which will store all computed element
+indices use `sv.svstep *RT,SVi,1` which will store all computed element
 indices, starting from RT. If Rc=1 then a co-result Vector of CR Fields
 will also be returned, comprising the "loop end-points" of each of the inner
 loops when either Matrix Mode or DCT/FFT is set. In other words,
@@ -75,58 +82,288 @@ the LE bit for the second, the GT bit for the outermost loop and
 the SO bit set on the very last element, when all loops reach their
 maximum extent.
 
-*Programmer's note (1): VL in some situations, particularly larger Matrices,
-may exceed 64,
-meaning that `sv.svshape` returning a considerable number of values. Under
-such circumstances `sv.svshape/ew=8` is recommended.*
+*Programmer's note: VL in some situations, particularly larger
+Matrices (5x7x3 will set MAXVL=105), will cause `sv.svstep` to return a
+considerable number of values. Under such circumstances `sv.svstep/ew=8`
+is recommended.*
 
-*Programmer's note (2): having conveniently obtained a pre-computed
-Schedule with `sv.svstep`,
-it may then be used as the input to Indexed REMAP Mode
-to achieve the exact same Schedule. It is evident however that
+*Programmer's note: having conveniently obtained a pre-computed Schedule
+with `sv.svstep`, it may then be used as the input to Indexed REMAP
+Mode to achieve the exact same Schedule. It is evident however that
 before use some of the Indices may be arbitrarily altered as desired.
 `sv.svstep` helps the programmer avoid having to manually recreate
-Indices for certain
-types of common Loop patterns, and in its simplest form, without REMAP
-(SVi=5 or SVi=6),
-is equivalent to the `iota` instruction found in other Vector ISAs*
+Indices for certain types of common Loop patterns. In its simplest form,
+without REMAP (SVi=5 or SVi=6), it is equivalent to the `iota` instruction
+found in other Vector ISAs*
 
 **Vertical First Mode**
 
 Vertical First is effectively like an implicit single bit predicate
-applied to every SVP64 instruction. **ONLY** one element in each
-SVP64 Vector instruction is executed; srcstep and dststep do **not**
-increment, and the Program Counter progresses **immediately** to
-the next instruction just as it would for any standard scalar v3.0B
-instruction.
-
-A mode of srcstep (SVi=0) is called which can move srcstep and
-dststep on to the next element, still respecting predicate
-masks.
-
-In other words, where normal SVP64 Vectorisation acts "horizontally"
-by looping first through 0 to VL-1 and only then moving the PC
-to the next instruction, Vertical-First moves the PC onwards
-(vertically) through multiple instructions **with the same
-srcstep and dststep**, then an explict instruction used to
-advance srcstep/dststep. An outer loop is expected to be
-used (branch instruction) which completes a series of
-Vector operations.
-
-Testing any end condition of any loop of any REMAP state allows branches to be
-used to create loops.
-
-Programmer's note: when Predicate Non-Zeroing is used this indicates to
+applied to every SVP64 instruction. **ONLY** one element in each SVP64
+Vector instruction is executed; srcstep and dststep do **not** increment
+automatically on completion of one instruction, and the Program Counter
+progresses **immediately** to the next instruction just as it would for
+any standard scalar v3.0B instruction.
+
+A mode of srcstep (SVi=0) is called which can move srcstep and dststep
+on to the next element, still respecting predicate masks.
+
+In other words, where normal SVP64 Vectorization acts "horizontally"
+by looping first through 0 to VL-1 and only then moving the PC to the
+next instruction, Vertical-First moves the PC onwards (vertically)
+through multiple instructions **with the same srcstep and dststep**,
+then an explicit instruction is used to advance srcstep/dststep. An outer
+loop is expected to be used (branch instruction) which completes a series
+of Vector operations.
+
+Testing any end condition of any loop of any REMAP state allows branches
+to be used to create loops.
+
+*Programmer's note: when Predicate Non-Zeroing is used this indicates to
 the underlying hardware that any masked-out element must be skipped.
-*This includes in Vertical-First Mode*, and programmers should be keenly
-aware that srcstep or dststep or both *may* jump by more than one as
-a result, because the actual request under these circumstances was to execute
-on the first available next *non-masked-out* element.
+*This includes in Vertical-First Mode*, and programmers should be
+keenly aware that srcstep or dststep or both *may* jump by more than
+one as a result, because the actual request under these circumstances
+was to execute on the first available next *non-masked-out* element.
+It should be evident that it is the `sv.svstep` instruction that must
+be Predicated in order for the **entire** loop to use the Predicate
+correctly, and it is strongly recommended for all instructions within
+the same Vertical-First Loop to utilise the exact same Predicate Mask(s).*
+
+Programmers should be aware that VL, srcstep and dststep and the SUBVL
+substeps are global in nature. Nested looping with different schedules
+is perfectly possible, as is calling of functions, however SVSTATE
+(and any associated SVSHAPEs if REMAP is being used) should obviously
+be stored on the stack in order to achieve this benefit not normally
+found in Vector ISAs.
+
+**Use of svstep with Vertical-First sub-vectors**
+
+Incrementing and iteration through subvector state ssubstep and dsubstep is
+possible with `sv.svstep/vecN` where as expected N may be 2/3/4. However it is necessary
+to use the exact same Sub-Vector qualifier on any Prefixed
+instructions, within any given Vertical-First loop: `vec2/3/4` is **not**
+automatically applied to all instructions, it must be explicitly applied on
+a per-instruction basis. Also valid
+is not specifying a Sub-vector
+qualifier at all, but it is critically important to note that
+operations will be repeated. For example if `vec2`
+is not used on `sv.addi` then each Vector element operation is
+repeated twice. The reason is that whilst svstep will be
+iterating through both the SUBVL and VL loops, the addi instruction
+only uses `srcstep` and `dststep` (not ssubstep or dsubstep). Illustrated below:
+
+```
+    def offset():
+        for step in range(VL):
+            for substep in range(SUBVL=2):
+                yield step, substep
+    for i, j in offset():
+        vec2_offs = i * SUBVL + j # calculate vec2 offset
+        addi RT+i, RA+i, 1 # but sv.addi is not vec2!
+        muli/vec2 RT+vec2_offs, RA+vec2_offs, 2 # this is
+```
+
+Actual assembler would be:
+
+```
+    loop:
+        setvl VF=1, CTRmode
+        sv.addi *RT, *RA, 1 # no vec2
+        sv.muli/vec2 *RT, *RA, 2 # vec2
+        sv.svstep/vec2 # must match the muli
+        sv.bc CTRmode, loop # subtracts VL from CTR
+```
+
+This illustrates the correct but seemingly-anomalous behaviour: `sv.svstep/vec2`
+is being requested to update `SVSTATE` to follow a vec2 loop construct. The anomalous
+`sv.addi` is not prohibited as it may in fact be desirable to execute operations twice,
+or to re-load data that was overwritten, and many other possibilities.
+
+-------------
+
+\newpage{}
+
+# Appendix
+
+**src_iterate**
+
+Note that `srcstep` and `ssubstep` are not the absolute final Element
+(and Sub-Element) offsets. `srcstep` still has to go through individual
+`REMAP` translation before becoming a per-operand (RA, RB, RC, RT, RS)
+Element-level Source offset.
+
+Note also critically that `PACK` mode simply inverts the outer/inner
+loops making SUBVL the outer loop and VL the inner.
+
+```
+    # source-stepping iterator
+    subvl = SVSTATE.subvl
+    vl = SVSTATE.vl
+    pack = SVSTATE.pack
+    unpack = SVSTATE.unpack
+    ssubstep = SVSTATE.ssubstep
+    end_ssub = ssubstep == subvl
+    end_src = SVSTATE.srcstep == vl-1
+    # first source step.
+    srcstep = SVSTATE.srcstep
+    # used below:
+    # sz - from RM.MODE, source-zeroing
+    # srcmask - from RM.MODE, the source predicate
+    if pack:
+        # pack advances subvl in *outer* loop
+        while True:
+            assert srcstep <= vl-1
+            end_src = srcstep == vl-1
+            if end_src:
+                if end_ssub:
+                    loopend = True
+                else:
+                    SVSTATE.ssubstep += 1
+                srcstep = 0  # reset
+                break
+            else:
+                srcstep += 1  # advance srcstep
+                if not sz:
+                    break
+                if ((1 << srcstep) & srcmask) != 0:
+                    break
+    else:
+        # advance subvl in *inner* loop
+        if end_ssub:
+            while True:
+                assert srcstep <= vl-1
+                end_src = srcstep == vl-1
+                if end_src:  # end-point
+                    loopend = True
+                    srcstep = 0
+                    break
+                else:
+                    srcstep += 1
+                if not sz:
+                    break
+                if ((1 << srcstep) & srcmask) != 0:
+                    break
+                else:
+                    log(" sskip", bin(srcmask), bin(1 << srcstep))
+            SVSTATE.ssubstep = 0b00  # reset
+        else:
+            # advance ssubstep
+            SVSTATE.ssubstep += 1
+
+    SVSTATE.srcstep = srcstep
+```
+
+-------------
+
+\newpage{}
+
+**dest_iterate**
+
+Note that `dststep` and `dsubstep` are not the absolute final Element
+(and Sub-Element) offsets. `dststep` still has to go through individual
+`REMAP` translation before becoming a per-operand (RT, RS/EA) destination
+Element-level offset, and `dsubstep` may also go through `(f)mv.swizzle`
+reordering.
+
+Note also critically that `UNPACK` mode simply inverts the outer/inner
+loops making SUBVL the outer loop and VL the inner.
+
+```
+    # dest step iterator
+    vl = SVSTATE.vl
+    subvl = SVSTATE.subvl
+    unpack = SVSTATE.unpack
+    dsubstep = SVSTATE.dsubstep
+    end_dsub = dsubstep == subvl
+    dststep = SVSTATE.dststep
+    end_dst = dststep == vl-1
+    # used below:
+    # dz - from RM.MODE, destination-zeroing
+    # dstmask - from RM.MODE, the destination predicate
+    if unpack:
+        # unpack advances subvl in *outer* loop
+        while True:
+            assert dststep <= vl-1
+            end_dst = dststep == vl-1
+            if end_dst:
+                if end_dsub:
+                    loopend = True
+                else:
+                    SVSTATE.dsubstep += 1
+                dststep = 0  # reset
+                break
+            else:
+                dststep += 1  # advance dststep
+                if not dz:
+                    break
+                if ((1 << dststep) & dstmask) != 0:
+                    break
+    else:
+        # advance subvl in *inner* loop
+        if end_dsub:
+            while True:
+                assert dststep <= vl-1
+                end_dst = dststep == vl-1
+                if end_dst:  # end-point
+                    loopend = True
+                    dststep = 0
+                    break
+                else:
+                    dststep += 1
+                if not dz:
+                    break
+                if ((1 << dststep) & dstmask) != 0:
+                    break
+            SVSTATE.dsubstep = 0b00  # reset
+        else:
+            # advance dsubstep
+            SVSTATE.dsubstep += 1
+
+    SVSTATE.dststep = dststep
+```
+
+-------------
+
+\newpage{}
+
+**SVSTATE_NEXT**
 
-*Programmers should be aware that VL, srcstep and dststep are global in nature.
-Nested looping with different schedules is perfectly possible, as is
-calling of functions, however SVSTATE (and any associated SVSTATE) should
-obviously be stored on the stack in order to achieve this benefit*
+```
+    if SVi = 1 then return REMAP SVSHAPE0 current offset
+    if SVi = 2 then return REMAP SVSHAPE1 current offset
+    if SVi = 3 then return REMAP SVSHAPE2 current offset
+    if SVi = 4 then return REMAP SVSHAPE3 current offset
+    if SVi = 5 then return SVSTATE.srcstep # VL source step
+    if SVi = 6 then return SVSTATE.dststep # VL dest step
+    if SVi = 7 then return SVSTATE.ssubstep # SUBVL source step
+    if SVi = 8 then return SVSTATE.dsubstep # SUBVL dest step
+
+    # SVi=0, explicit iteration requested
+    src_iterate();
+    dst_iterate();
+    return 0
+```
+
+**at_loopend**
+
+Both Vertical-First and Horizontal-First may use this algorithm to
+determine if the "end-of-looping" (end of Sub-Program-Counter) has
+been reached. Horizontal-First Mode will immediately move to the
+next instruction, where `svstep.` will set `CR0.EQ` to 1.
+
+```
+    # tells if this is the last possible element.
+    subvl = SVSTATE.subvl
+    vl = SVSTATE.vl
+    end_ssub = SVSTATE.ssubstep == subvl
+    end_dsub = SVSTATE.dsubstep == subvl
+    if SVSTATE.srcstep == vl-1 and end_ssub:
+        return True
+    if SVSTATE.dststep == vl-1 and end_dsub:
+        return True
+    return False
+```
 
 [[!tag standards]]
 
@@ -134,3 +371,4 @@ obviously be stored on the stack in order to achieve this benefit*
 
 \newpage{}
 
+
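
The sub-vector behaviour introduced by this patch (an `sv.addi` without a `vec2` qualifier inside an `sv.svstep/vec2` Vertical-First loop repeats each element operation) can be checked with a small Python model. This is only a sketch under stated assumptions, not the reference simulator: the function name `vf_vec2_trace` and the trace tuple format are hypothetical, predication, REMAP and pack/unpack are ignored, and `subvl` is taken here as the plain element count (2 for vec2) rather than the encoded SVSTATE field.

```python
def vf_vec2_trace(vl, subvl=2):
    """Hypothetical model of one Vertical-First loop body.

    svstep/vec2 walks VL in the outer loop and SUBVL in the inner
    loop (no pack/unpack, no predicate masks, no REMAP).
    Returns a list of (mnemonic, element_offset) tuples.
    """
    trace = []
    for step in range(vl):              # models srcstep/dststep
        for substep in range(subvl):    # models ssubstep/dsubstep
            vec2_offs = step * subvl + substep
            # sv.addi carries no vec2 qualifier: it only sees `step`,
            # so the same element offset is issued `subvl` times.
            trace.append(("addi", step))
            # sv.muli/vec2 sees both step and substep.
            trace.append(("muli/vec2", vec2_offs))
    return trace


trace = vf_vec2_trace(vl=2)
addi_offsets = [offs for op, offs in trace if op == "addi"]
muli_offsets = [offs for op, offs in trace if op == "muli/vec2"]
print(addi_offsets)  # [0, 0, 1, 1]
print(muli_offsets)  # [0, 1, 2, 3]
```

With `vl=2` the un-qualified `addi` touches offsets 0 and 1 twice each, while the `vec2`-qualified `muli` touches offsets 0 to 3 exactly once, which matches the repetition described in the patch text.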