From 50e59482e5baf44d8793f056b316a3a66b19887b Mon Sep 17 00:00:00 2001
From: lkcl
Date: Sun, 26 Mar 2023 12:38:30 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls008.mdwn | 133 ++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/openpower/sv/rfc/ls008.mdwn b/openpower/sv/rfc/ls008.mdwn
index b0c93da8b..9ff016512 100644
--- a/openpower/sv/rfc/ls008.mdwn
+++ b/openpower/sv/rfc/ls008.mdwn
@@ -219,6 +219,139 @@ Additional pseudo-op for obtaining VL without modifying it (or any state):
 
     getvl r5 : setvl r5, r0, vf=0, vs=0, ms=0
     getvl. r5 : setvl. r5, r0, vf=0, vs=0, ms=0
+Note that whilst it is possible to set both MVL and VL from the same
+immediate, it is not possible to set them to different immediates in
+the same instruction. Doing so requires two instructions.
+
+**Selecting sources for VL**
+
+Due to considerable opcode pressure, the source from which VL is set
+is selected by whether `RA` and `RT` are zero, as follows:
+
+| condition | effect |
+| - | - |
+| `vs=1, RA=0, RT!=0` | VL, RT set to MIN(MVL, CTR) |
+| `vs=1, RA=0, RT=0` | VL set to MIN(MVL, SVi+1) |
+| `vs=1, RA!=0, RT=0` | VL set to MIN(MVL, RA) |
+| `vs=1, RA!=0, RT!=0` | VL, RT set to MIN(MVL, RA) |
+
+The reasoning here is that the opportunity to set RT equal to the
+immediate `SVi+1` is sacrificed in favour of setting from CTR.
+
+# Unusual Rc=1 behaviour
+
+Normally, the return result from an instruction is in `RT`. Here,
+however, `RT` also acts as a mode selector (with `RA=0`, `RT!=0`
+selects `CTR` as the source), and when `RT=0` no result is written
+to `RT` at all, so some different semantics are needed.
+
+CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
+overflow may occur: `VL`, whether set from an immediate or from `CTR`,
+may not exceed `MAXVL`, and if the requested value would exceed it,
+`CR0.SO` must be set.
+
+Additionally, it is really **`VL`** that is being set. Therefore, rather
+than `CR0` testing `RT` when `Rc=1`, `CR0.EQ` is set if `VL=0`, and
+`CR0.GE` is set if `VL` is non-zero.
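+
+The VL source selection table and the `Rc=1` rules can be summarised
+by the following minimal Python sketch. It is illustrative only, not
+the normative pseudocode: `SetvlResult`, `select_vl` and `cr0_for` are
+hypothetical names, `MVL` is assumed to be the `MAXVL` limit referred
+to above, and the CR0 field names simply follow the text.
+
+```python
+# Hypothetical model of setvl with vs=1 (not the normative pseudocode):
+# which source VL is taken from, and how CR0 is set when Rc=1.
+from dataclasses import dataclass
+
+
+@dataclass
+class SetvlResult:
+    vl: int          # new value of VL
+    write_rt: bool   # True when RT!=0, i.e. RT also receives the new VL
+    source: str      # "CTR", "SVi+1" or "RA"
+    requested: int   # value asked for, before clamping to MVL
+
+
+def select_vl(mvl, ra, rt, svi, gpr, ctr):
+    """ra/rt are register numbers; gpr holds the GPR contents."""
+    if ra == 0:
+        # RA=0 cannot name a source register, so RT selects instead:
+        # RT!=0 reads CTR, RT=0 uses the immediate SVi+1.
+        source, requested = ("CTR", ctr) if rt != 0 else ("SVi+1", svi + 1)
+    else:
+        source, requested = "RA", gpr[ra]
+    return SetvlResult(min(mvl, requested), rt != 0, source, requested)
+
+
+def cr0_for(res):
+    """Rc=1: CR0 describes VL rather than RT (field names as in the text)."""
+    return {
+        "EQ": res.vl == 0,
+        "GE": res.vl != 0,
+        # SO flags an immediate or CTR request that had to be clamped
+        "SO": res.source in ("CTR", "SVi+1") and res.requested > res.vl,
+    }
+```
+
+For example, with `MVL=8` and `CTR=100`, `select_vl(8, 0, 5, 0, gpr, 100)`
+selects `CTR`, sets VL (and RT) to 8, and `cr0_for` reports `SO` set
+because the request had to be clamped.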
+
+# Vertical First Mode
+
+Vertical First is effectively like an implicit single-bit predicate
+applied to every SVP64 instruction. **ONLY** one element in each
+SVP64 Vector instruction is executed; srcstep and dststep do **not**
+increment, and the Program Counter progresses **immediately** to
+the next instruction just as it would for any standard scalar v3.0B
+instruction.
+
+An explicit mode of setvl is used to move srcstep and dststep on
+to the next element, still respecting predicate masks.
+
+In other words, where normal SVP64 Vectorisation acts "horizontally"
+by looping first through 0 to VL-1 and only then moving the PC
+to the next instruction, Vertical-First moves the PC onwards
+(vertically) through multiple instructions **with the same
+srcstep and dststep**, after which an explicit instruction is used
+to advance srcstep/dststep. An outer loop (a branch instruction) is
+expected to be used to complete a series of Vector operations.
+
+`svfstep` mode is enabled when vf=1, vs=0 and ms=0.
+When Rc=1 it is possible to determine when any level of
+loop reaches an end condition, or whether VL has been reached. The
+immediate can be reinterpreted as indicating which SVSTATE (0-3)
+should be tested and placed into CR0 (when Rc=1).
+
+When RT is not zero, an internal stepping index may also be returned:
+either the REMAP index, srcstep or dststep. This table is identical
+to that of [[sv/svstep]]:
+
+* `SVi=1`: also include inner, middle and outer
+  loop end conditions from SVSTATE0 into CR.EQ, CR.LE and CR.GT
+* `SVi=2`: test SVSTATE1 (and return conditions)
+* `SVi=3`: test SVSTATE2 (and return conditions)
+* `SVi=4`: test SVSTATE3 (and return conditions)
+* `SVi=5`: `SVSTATE.srcstep` is returned.
+* `SVi=6`: `SVSTATE.dststep` is returned.
+
+Testing any end condition of any loop of any REMAP state allows
+branches to be used to create loops.
+
+*Programmers should be aware that VL, srcstep and dststep are global in nature.
+Nested looping with different schedules is perfectly possible, as is
+calling functions; however, SVSTATE (and any associated REMAP state)
+should be saved on the stack.*
+
+**SUBVL**
+
+Sub-vector elements are not considered "Vertical": a vec2/3/4 is
+treated as if it were a single element. Caveats exist for
+[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
+because the order in which the VL and SUBVL loops are applied is
+swapped (outer-inner becomes inner-outer).
+
+# Examples
+
+## Core concept loop
+
+```
+loop:
+    setvl a3, a0, MVL=8    # update a3 with vl
+                           # (# of elements this iteration)
+                           # set MVL to 8
+    # do vector operations at up to 8 length (MVL=8)
+    # ...
+    sub a0, a0, a3         # Decrement count by vl
+    bnez a0, loop          # Any more?
+```
+
+## Loop using Rc=1
+
+    my_fn:
+      li r3, 1000
+      b test
+    loop:
+      sub r3, r3, r4
+      ...
+    test:
+      setvli. r4, r3, MVL=64
+      bne cr0, loop
+    end:
+      blr
+
+## Load/Store-Multi (selective)
+
+Up to 64 FPRs are loaded here. `r3` has one bit set for each FP
+register required to be loaded. The block of memory from which the
+registers are loaded is contiguous (no gaps): any FP register which
+has a corresponding zero bit in `r3` is *unaltered*. In essence this
+is a selective LD-multi with "Scatter" capability.
+
+    setvli r0, MVL=64, VL=64
+    sv.fld/dm=r3 *fp0, 0(r30) # selective load 64 FP registers
+
+Up to 64 FPRs are saved here. Again, `r3` has one bit set for each
+FP register required to be stored.
+
+    setvli r0, MVL=64, VL=64
+    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers
+
 -------------
 
 \newpage{}
-- 
2.30.2