X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fsetvl.mdwn;h=12f7cb52e0d33387658989e9409e797fef65ce0a;hb=0a873ae35b9633a01c0090a30d919cc8c979cdfc;hp=873c50673473800d35852c7640bd5fc64bd5595b;hpb=aebb69ecf9e881f0e376431d1e2d007a8db44dfd;p=libreriscv.git diff --git a/openpower/sv/setvl.mdwn b/openpower/sv/setvl.mdwn index 873c50673..12f7cb52e 100644 --- a/openpower/sv/setvl.mdwn +++ b/openpower/sv/setvl.mdwn @@ -1,43 +1,333 @@ +[[!tag standards]] + # OpenPOWER SV setvl/setvli See links: * * +* +* TODO +* +* old page [[simple_v_extension/specification/sv.setvl]] +* [[sv/svstep]] + +Use of setvl results in changes to the MVL, VL and STATE SPRs. see [[sv/sprs]]♧ + +# Behaviour and Rationale + +SV's Vector Engine is based on Cray-style Variable-length Vectorisation, +just like RVV. However unlike RVV, SV sits on top of the standard Scalar +regfiles: there is no separate Vector register numbering. Therefore, also +unlike RVV, SV does not have hard-coded "Lanes": microarchitects +may use *ordinary* in-order, out-of-order, or superscalar designs +as the basis for SV. By contrast, the relevant parameter +in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems, +anywhere from 1 to tens of thousands of Lanes in supercomputers. + +SV is more like how MMX used to sit on top of the x86 FP regfile. +Therefore when Vector operations are performed, the question has to +be asked, "well, how much of the regfile do you want to allocate to +this operation?" because if it is too small an amount performance may +be affected, and if too large then other registers would overlap and +cause data corruption, or even if allocated correctly would require +spill to memory. + +The answer effectively needs to be parameterised. Hence: MAXVL (MVL) +is set from an immediate, so that the compiler may decide, statically, a +guaranteed resource allocation according to the needs of the application. + +While RVV's MAXVL was a hw limit, SV's MVL is simply a loop +optimization. It does not carry side-effects for the arch, though for +a specific cpu it may affect hw unit usage. + +Other than being able to set MVL, SV's VL (Vector Length) works just like +RVV's VL, with one minor twist. RVV permits the `setvl` instruction to +set VL to an arbitrary explicit value. Within the limit of MVL, VL +**MUST** be set to the requested value. Given that RVV only works on Vector Loops, +this is fine and part of its value and design. However, SV sits on top +of the standard register files. When MVL=VL=2, a Vector Add on `r3` +will perform two Scalar Adds: one on `r3` and one on `r4`. + +Thus there is the opportunity to set VL to an explicit value (within the +limits of MVL) with the reasonable expectation that if two operations +are requested (by setting VL=2) then two operations are guaranteed. +This avoids the need for a loop (with not-insignificant use of the +regfiles for counters), simply two instructions: + + setvli r0, MVL=64, VL=64 + ld r0.v, 0(r30) # load exactly 64 registers from memory + +Page Faults etc. aside this is *guaranteed* 100% without fail to perform +64 unit-strided LDs starting from the address pointed to by r30 and put +the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin +Predication could even be used to only load relevant registers from +the stack. This *only works if VL is set to the requested value* rather +than, as in RVV, allowing the hardware to set VL to an arbitrary value +(caveat being, limited to not exceed MVL) + +Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`. +In combination with SVP64 [[sv/branches]] this can save one instruction +inside critical inner loops. # Format -| 0..5 |6....10|11....16|17...20|21.22|23.24|25.26|27...30|31| name | -|------|-------|--------|-------|-----|-----|--|--|-------|--|---------| -| 19 | RT | RA | | XO[0:4] XO[5:10] |/ | XL-Form | -| 19 | (RT|0)| (RA|0) |vlimmed |// |vs|ms| NNNNN |/ | setvl/i | +*(Allocation of opcode TBD pending OPF ISA WG approval)*, +using EXT22 temporarily and fitting into the +[[sv/bitmanip]] space + +Form: SVL-Form (see [[isatables/fields.text]]) + +| 0.5|6.10|11.15|16..21| 22...25 | 26.30 |31| name | +| -- | -- | --- | ---- |----------- | ----- |--| ------- | +|OPCD| RT | RA | SVi |cv ms vs vf | 11110 |Rc| setvl | + +Instruction format: + + setvl RT,RA,SVi,vf,vs,ms + setvl. RT,RA,SVi,vf,vs,ms + +Note that the immediate (`SVi`) spans 7 bits (16 to 22) + +* `cv` - bit 22 - reads CTR instead of RA +* `ms` - bit 23 - allows for setting of MVL. +* `vs` - bit 24 - allows for setting of VL. +* `vf` - bit 25 - sets "Vertical First Mode". + +Note that in immediate setting mode VL and MVL start from **one** +i.e. that an immediate value of zero will result in VL/MVL being set to 1. +0b111111 results in VL/MVL being set to 64. This is because setting +VL/MVL to 1 results in "scalar identity" behaviour, where setting VL/MVL +to 0 would result in all Vector operations becoming `nop`. If this is +truly desired (nop behaviour) then setting VL and MVL to zero is to be +done via the [[SVSTATE SPR|sv/sprs]] + +Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise + + setvli VL=8 : setvl r5, r0, VL=8 + setmvli MVL=8 : setvl r0, r0, MVL=8 + +Additional pseudo-op for obtaining VL without modifying it: + + getvl r5 : setvl r5, r0, vf=0, vs=0, ms=0 + +For Vertical-First mode, a pseudo-op for explicit incrementing +of srcstep and dststep: + + svstep. : setvl. 0, 0, vf=1, vs=0, ms=0 + +Note that whilst it is possible to set both MVL and VL from the same +immediate, it is not possible to set them to different immediates in +the same instruction. That would require two instructions. + +# Vertical First Mode + +Vertical First is effectively like an implicit single bit predicate +applied to every SVP64 instruction. **ONLY** one element in each +SVP64 Vector instruction is executed; srcstep and dststep do **not** +increment, and the Program Counter progresses **immediately* to +the next instruction just as it would for any standard scalar v3.0B +instruction. + +An explicit mode of setvl is called which can move srcstep and +dststep on to the next element, still respecting predicate +masks. + +In other words, where normal SVP64 Vectorisation acts "horizontally" +by looping first through 0 to VL-1 and only then moving the PC +to the next instruction, Vertical-First moves the PC onwards +(vertically) through multiple instructions **with the same +srcstep and dststep**, then an explict instruction used to +advance srcstep/dststep, and an outer loop is expected to be +used (branch instruction) which completes a series of +Vector operations. + +```svstep``` mode is enabled when vf=1, vs=0 and ms=0. +When Rc=1 it is possible to determine when any level of +loops reach an end condition, or if VL has been reached. The immediate can +be reinterpreted as indicating which SVSTATE (0-3) +should be tested and placed into CR0. + +* setvl immediate = 1: only VL testing is enabled. CR0.SO is set + to 1 when either srcstep or dststep reach VL +* setvl immediate = 2: also include inner middle and outer + loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT +* setvl immediate = 3: test SVSTATE1 +* setvl immediate = 4: test SVSTATE2 +* setvl immediate = 5: test SVSTATE3 + +Testing any end condition of any loop of any REMAP state allows branches to be used to create loops. + +*Programmers should be aware that VL, srcstep and dststep are global in nature. +Nested looping with different schedules is perfectly possible, as is +calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.* # Pseudocode // instruction fields: - rd = get_rt_field(); - ra = get_ra_field(); - vlimmed = get_immed_field(); + rd = get_rt_field(); // bits 6..10 + ra = get_ra_field(); // bits 11..15 + vc = get_vc_field(); // bit 22 + vf = get_vf_field(); // bit 23 + vs = get_vs_field(); // bit 24 + ms = get_ms_field(); // bit 25 + Rc = get_Rc_field(); // bit 31 - if vs { - VL = vlimmed - } else { - VL = SPR[SV_VL] - } - if ms { - MVL = vlimmed + if vf and not vs and not ms { + // increment src/dest step mode + // NOTE! this is in no way complete! predication is not included + // and neither is SUB-VL mode + srcstep = SPR[SV].srcstep + dststep = SPR[SV].dststep + VL = SPR[SV].VL + srcstep++ + dststep++ + rollover = (srcstep == VL or dststep == VL) + if rollover: + // Reset srcstep, dststep, and also exit "Vertical First" mode + srcstep = 0 + dststep = 0 + MSR[6] = 0 + SPR[SV].srcstep = srcstep + SPR[SV].dststep = dststep + + // write CR? helps for doing Vertical loops, detects end + // of Vector Elements + if Rc { + // update CR to indicate that srcstep/dststep "rolled over" + CR0.eq = rollover + } } else { - MVL = SPR[SV_MVL] - } - // calculate VL - VL = min(VL, MVL) + // add one. MVL/VL=1..64 not 0..63 + vlimmed = get_immed_field()+1; // 16..22 + + // set VL (or not). + // 4 options: from SPR, from immed, from ra, from CTR + if vs { + // VL to be sourced from fields/regs + if vc { + VL = CTR + } else if ra != 0 { + VL = GPR[ra] + } else { + VL = vlimmed + } + } else { + // VL not to change (except if MVL is reduced) + // read from SPRs + VL = SPR[SV_VL] + } + + // set MVL (or not). + // 2 options: from SPR, from immed + if ms { + MVL = vlimmed + } else { + // MVL not to change, read from SPRs + MVL = SPR[SV_MVL] + } + + // calculate (limit) VL + VL = min(VL, MVL) - // store VL, MVL - SPR[SV_VL] = VL - SPR[SV_MVL] = MVL + // store VL, MVL + SVSTATE.VL = VL + SVSTATE.MVL = MVL - // write rd - if rd != 0 { - // rd is not x0 - regs[rd] = VL; + // write rd + if rt != 0 { + // rt is not zero + regs[rt] = VL; + } + // write CR? + if Rc { + // update CR from VL (not rt) + CR0.eq = (VL == 0) + ... + ... + } + // write Vertical-First mode + SVSTATE.vf = vf } + +# Examples + +## Core concept loop + +``` +loop: + setvl a3, a0, MVL=8 # update a3 with vl + # (# of elements this iteration) + # set MVL to 8 + # do vector operations at up to 8 length (MVL=8) + # ... + sub a0, a0, a3 # Decrement count by vl + bnez a0, loop # Any more? +``` + +## Loop using Rc=1 + + my_fn: + li r3, 1000 + b test + loop: + sub r3, r3, r4 + ... + test: + setvli. r4, r3, MVL=64 + bne cr0, loop + end: + blr + +## setmvlhi double loop + +Two elements per inner loop are executed per instruction. This assumes +that underlying hardware, when `setmvlhi` requests a parallelism hint of 2 +actually sets a parallelism hint of 2. + +This example, in c, would be: + +``` +long *r4; +for (i=0; i < CTR; i++) +{ + r4[i+2] += r4[i] +} +``` + +where, clearly, it is not possible to do more +than 2 elements in parallel at a time: attempting +to do so would result in data corruption. The compiler +may be able to determine memory aliases and inform +hardware at runtime of the maximum safe parallelism +limit. + +Whilst this example could be simplified to simply set VL=2, +or exploit the fact that overlapping adds have well-defined +behaviour, this has not been done, here, for illustrative purposes +in order to demonstrate setmvhli and Vertical-First Mode. + +Note, crucially, how r4, r32 and r20 are **NOT** incremented +inside the inner loop. The MAXVL reservation is still 8, +i.e. as srcstep and dststep advance (by 2 elements at a time) +registers r20-r27 will be used for the first LD, and +registers r32-39 for the second LD. `r4+srcstep*8` will be used +as the elstrided offset for LDs. + +``` + setmvlhi 8, 2 # MVL=8, VFHint=2 +loop: + setvl r1, CTR, vf=1 # VL=r1=MAX(MVL, CTR), VF=1 + mulli r1, r1, 8 # multiply by int width +loopinner: + sv.ld r20.v, r4(0) # load VLhint elements (max 2) + addi r2, r4, 16 # 2 elements ahead + sv.ld r32.v, r2(0) # load VLhint elements (max 2) + sv.add r32.v, r20.v, r32.v # x[i+2] += x[i] + sv.st r32.v, r2(0) # store VLhint elements + svstep. # srcstep += VLhint + bnz loopinner # repeat until srcstep=VL + # now done VL elements, move to next batch + add r4, r4, r1 # move r4 pointer forward + sv.bnz/ctr loop # decrement CTR by VL +```