openpower/sv/setvl.mdwn

[[!tag standards]]

# OpenPOWER SV setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* old page [[simple_v_extension/specification/sv.setvl]]
* [[sv/svstep]]

Use of setvl results in changes to the MVL, VL and STATE SPRs. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering.  Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" because if the amount is too small, performance may
be affected, and if too large then other registers would overlap and
cause data corruption, or even if allocated correctly would require
spill to memory.

The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though
for a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by the hardware.  In SV, within the
limit of MVL, VL **MUST** be set to the requested value. Given that RVV
only works on Vector Loops, its approach is fine and part of its value and
design.  However, SV sits on top of the standard register files.  When
MVL=VL=2, a Vector Add on `r3` will perform two Scalar Adds: one on `r3`
and one on `r4`.
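
The point about a Vector Add decomposing into Scalar Adds on consecutive registers can be illustrated with a minimal Python sketch. The `sv_add` helper and the flat `regs` list are hypothetical names introduced only for illustration, and the sketch assumes all three operands are marked as vectors (it ignores predication and element-width overrides).

```python
def sv_add(regs, rt, ra, rb, vl):
    """Model a vectorised add as VL consecutive scalar adds on one regfile."""
    for i in range(vl):
        # element i simply uses the next scalar register along
        regs[rt + i] = regs[ra + i] + regs[rb + i]

regs = [0] * 128              # hypothetical flat scalar register file
regs[3], regs[4] = 10, 20     # "vector" starting at r3
regs[8], regs[9] = 1, 2       # "vector" starting at r8

sv_add(regs, rt=3, ra=3, rb=8, vl=2)   # MVL=VL=2: touches r3 and r4 only
print(regs[3], regs[4])
```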

Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with its not-insignificant use of the
regfiles for counters). Two instructions suffice:

    setvli r0, MVL=64, VL=64
    ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is *guaranteed* 100% without fail to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack.  This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(the caveat being that it is limited to not exceed MVL).
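
The "LOAD-MULTI" behaviour can be sketched in Python as follows. This is an illustrative model only: `sv_ld`, the dictionary-based `mem` and the 8-byte element width are assumptions made for the sketch, and page faults and predication are not modelled.

```python
def sv_ld(regs, mem, rt, base_addr, vl, elwidth=8):
    """Model a unit-strided vector load: exactly VL scalar loads."""
    for i in range(vl):
        regs[rt + i] = mem[base_addr + i * elwidth]

# hypothetical memory image: 64 doublewords starting at 0x1000
mem = {0x1000 + 8 * i: i * i for i in range(64)}
regs = [None] * 128

# setvli r0, MVL=64, VL=64 ; ld r0.v, 0(r30) with r30 = 0x1000
sv_ld(regs, mem, rt=0, base_addr=0x1000, vl=64)
loaded = sum(1 for r in regs if r is not None)
print(loaded)
```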

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space.

Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16.21| 22.25       | 26.30 |31|  name   |
| -- | -- | --- | --- | ----------- | ----- |--| ------- |
|OPCD| RT | RA  | SVi | cv ms vs vf | 11110 |Rc| setvl   |

Instruction format:

    setvl RT,RA,SVi,vf,vs,ms
    setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (`SVi`) spans 6 bits (16 to 21).

* `cv` - bit 22 - reads CTR instead of RA
* `ms` - bit 23 - allows for setting of MVL.
* `vs` - bit 24 - allows for setting of VL.
* `vf` - bit 25 - sets "Vertical First Mode".
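
As a rough illustration of the field layout in the table above, the following Python sketch packs the fields into a 32-bit word using Power ISA MSB0 bit numbering (bit 0 is the most significant). `pack_setvl` is a hypothetical helper, and the default primary opcode of 22 is an assumption based only on the temporary EXT22 allocation mentioned above; the final opcode is TBD.

```python
def pack_setvl(rt, ra, svi, cv, ms, vs, vf, rc, opcd=22):
    """Pack SVL-Form fields into a 32-bit word (MSB0 bit n -> shift 31-n)."""
    assert 0 <= rt < 32 and 0 <= ra < 32 and 0 <= svi < 64
    assert all(bit in (0, 1) for bit in (cv, ms, vs, vf, rc))
    word = (opcd << 26) | (rt << 21) | (ra << 16) | (svi << 10)   # bits 0..21
    word |= (cv << 9) | (ms << 8) | (vs << 7) | (vf << 6)         # bits 22..25
    word |= (0b11110 << 1) | rc                                   # bits 26..31
    return word

word = pack_setvl(rt=5, ra=0, svi=7, cv=0, ms=1, vs=1, vf=0, rc=0)
print(hex(word))
```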

Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1,
and 0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
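
The off-by-one immediate encoding described above can be sketched in Python. The function names are hypothetical, and the 6-bit width is taken from the 0b111111 example.

```python
def encode_vl_immediate(vl):
    """Map a requested VL/MVL of 1..64 onto the 6-bit immediate 0..63."""
    if not 1 <= vl <= 64:
        # zero is not encodable here: it has to be set via the SVSTATE SPR
        raise ValueError("immediate form can only express 1..64")
    return vl - 1

def decode_vl_immediate(svi):
    """Recover VL/MVL from the 6-bit immediate by adding one."""
    return (svi & 0b111111) + 1

print(encode_vl_immediate(8), decode_vl_immediate(0b111111))
```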

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli VL=8    : setvl r5, r0, VL=8
    setmvli MVL=8  : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

    getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  That would require two instructions.

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

An explicit mode of setvl is then called to move srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**. An explicit instruction is then used to
advance srcstep/dststep, and an outer loop (branch instruction)
is expected to be used to complete a series of
Vector operations.
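
The difference in execution order can be sketched in Python by recording which (instruction, element step) pairs execute, and in what order. This is a simplified model: the function names are hypothetical, predication is ignored, and srcstep and dststep are assumed to advance together.

```python
def horizontal_first(program, vl):
    """Normal SVP64: each instruction loops over all elements before the PC moves."""
    trace = []
    for insn in program:
        for step in range(vl):
            trace.append((insn, step))
    return trace

def vertical_first(program, vl):
    """Vertical-First: the PC walks the instructions at a fixed step,
    then an explicit svstep advances the step and a branch loops back."""
    trace = []
    for step in range(vl):          # outer loop: svstep. followed by a branch
        for insn in program:        # one element per instruction
            trace.append((insn, step))
    return trace

program = ["ld", "add", "st"]
print(horizontal_first(program, 2))
print(vertical_first(program, 2))
```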

`svstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loop reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0.

* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
  to 1 when either srcstep or dststep reaches VL
* setvl immediate = 2: also include inner, middle and outer
  loop end conditions from SVSTATE0 into CR.EQ CR.LT CR.GT
* setvl immediate = 3: test SVSTATE1
* setvl immediate = 4: test SVSTATE2
* setvl immediate = 5: test SVSTATE3

Testing any end condition of any loop of any REMAP state allows branches
to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*

# Pseudocode

    // instruction fields:
    rt = get_rt_field();         // bits 6..10
    ra = get_ra_field();         // bits 11..15
    cv = get_cv_field();         // bit 22
    ms = get_ms_field();         // bit 23
    vs = get_vs_field();         // bit 24
    vf = get_vf_field();         // bit 25
    Rc = get_Rc_field();         // bit 31

    if vf and not vs and not ms {
        // increment src/dest step mode
        // NOTE! this is in no way complete! predication is not included
        // and neither is SUB-VL mode
        srcstep = SPR[SV].srcstep
        dststep = SPR[SV].dststep
        VL = SPR[SV].VL
        srcstep++
        dststep++
        rollover = (srcstep == VL or dststep == VL)
        if rollover {
            // Reset srcstep, dststep, and also exit "Vertical First" mode
            srcstep = 0
            dststep = 0
            MSR[6] = 0
        }
        SPR[SV].srcstep = srcstep
        SPR[SV].dststep = dststep

        // write CR? helps for doing Vertical loops, detects end
        // of Vector Elements
        if Rc {
            // update CR to indicate that srcstep/dststep "rolled over"
            CR0.eq = rollover
        }
    } else {
        // add one. MVL/VL=1..64 not 0..63
        vlimmed = get_immed_field()+1; // bits 16..21

        // set VL (or not).
        // 4 options: from SPR, from immed, from ra, from CTR
        if vs {
           // VL to be sourced from fields/regs
           if cv {
               VL = CTR
           } else if ra != 0 {
               VL = GPR[ra]
           } else {
               VL = vlimmed
           }
        } else {
           // VL not to change (except if MVL is reduced)
           // read from SPRs
           VL = SPR[SV_VL]
        }

        // set MVL (or not).
        // 2 options: from SPR, from immed
        if ms {
           MVL = vlimmed
        } else {
           // MVL not to change, read from SPRs
           MVL = SPR[SV_MVL]
        }

        // calculate (limit) VL
        VL = min(VL, MVL)

        // store VL, MVL
        SVSTATE.VL = VL
        SVSTATE.MVL = MVL

        // write rt
        if rt != 0 {
            // rt is not zero
            GPR[rt] = VL;
        }
        // write CR?
        if Rc {
            // update CR from VL (not rt)
            CR0.eq = (VL == 0)
            ...
            ...
        }
        // write Vertical-First mode
        SVSTATE.vf = vf
    }
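
The non-svstep path of the pseudocode can be turned into a small executable Python model, useful for checking corner cases such as VL being clamped to MVL. This is a sketch under stated assumptions: `SVState` and the `setvl` function signature are hypothetical, only CR0.eq is modelled, and the svstep combination (vf=1, vs=0, ms=0) is deliberately rejected rather than implemented.

```python
from dataclasses import dataclass

@dataclass
class SVState:
    vl: int = 0
    mvl: int = 0
    vf: int = 0

def setvl(state, gpr, ctr, rt, ra, svi, cv, ms, vs, vf):
    """Model of the VL/MVL-setting path. Returns (VL, CR0.eq)."""
    if vf and not vs and not ms:
        raise NotImplementedError("svstep mode is not modelled in this sketch")
    vlimmed = svi + 1                 # immediates encode 1..64 as 0..63
    if vs:
        if cv:
            vl = ctr                  # VL sourced from CTR
        elif ra != 0:
            vl = gpr[ra]              # VL sourced from a register
        else:
            vl = vlimmed              # VL sourced from the immediate
    else:
        vl = state.vl                 # VL unchanged unless MVL shrinks below it
    mvl = vlimmed if ms else state.mvl
    vl = min(vl, mvl)                 # VL may never exceed MVL
    state.vl, state.mvl, state.vf = vl, mvl, vf
    if rt != 0:
        gpr[rt] = vl
    return vl, vl == 0

state = SVState()
gpr = [0] * 32
gpr[3] = 1000
# models: setvl r4, r3, MVL=64 with vs=1, so VL = min(1000, 64)
result = setvl(state, gpr, ctr=0, rt=4, ra=3, svi=63, cv=0, ms=1, vs=1, vf=0)
print(result, gpr[4])
```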

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```
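
The strip-mining pattern of the loop above can be modelled in Python by recording the VL chosen on each iteration. `strip_mine` is a hypothetical helper; like the assembler loop, it always executes the body at least once, even when the count is zero.

```python
def strip_mine(count, mvl=8):
    """Return the VL used on each iteration of the core concept loop."""
    vls = []
    remaining = count
    while True:
        vl = min(remaining, mvl)   # setvl a3, a0, MVL=8
        vls.append(vl)             # ... vector operations on vl elements ...
        remaining -= vl            # sub a0, a0, a3
        if remaining == 0:         # bnez a0, loop
            break
    return vls

print(strip_mine(20))
```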

## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4
      ...
    test:
      setvli. r4, r3, MVL=64
      bne cr0, loop
    end:
      blr

## setmvlhi double loop

Two elements per inner loop are executed per instruction. This assumes
that the underlying hardware, when `setmvlhi` requests a parallelism hint
of 2, actually sets a parallelism hint of 2.

This example, in C, would be:

```
long *r4;
for (long i = 0; i < CTR; i++)
{
    r4[i+2] += r4[i];
}
```

where, clearly, it is not possible to do more
than 2 elements in parallel at a time: attempting
to do so would result in data corruption. The compiler
may be able to determine memory aliases and inform
hardware at runtime of the maximum safe parallelism
limit.
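
The limit of 2 can be checked with a small Python model that compares the sequential loop against a "batched" version in which all reads of a batch happen before any of its writes. The `sequential` and `batched` helpers are hypothetical, and the batched model is only an approximation of parallel element execution.

```python
def sequential(x, n):
    """Reference behaviour of the C loop: x[i+2] += x[i], one element at a time."""
    x = list(x)
    for i in range(n):
        x[i + 2] += x[i]
    return x

def batched(x, n, width):
    """Process `width` elements at once: every read in a batch precedes its writes."""
    x = list(x)
    for start in range(0, n, width):
        idx = range(start, min(start + width, n))
        src = [x[i] for i in idx]          # parallel loads of x[i]
        dst = [x[i + 2] for i in idx]      # parallel loads of x[i+2]
        for i, s, d in zip(idx, src, dst):
            x[i + 2] = d + s               # parallel stores
    return x

data = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(sequential(data, 8))
print(batched(data, 8, 2))   # matches the sequential result
print(batched(data, 8, 4))   # differs: elements 2 and 3 of each batch read stale data
```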

Whilst this example could be simplified to simply set VL=2,
or exploit the fact that overlapping adds have well-defined
behaviour, this has not been done here, for illustrative purposes,
in order to demonstrate setmvlhi and Vertical-First Mode.

Note, crucially, how r4, r32 and r20 are **NOT** incremented
inside the inner loop.  The MAXVL reservation is still 8,
i.e. as srcstep and dststep advance (by 2 elements at a time)
registers r20-r27 will be used for the first LD, and
registers r32-r39 for the second LD.  `r4+srcstep*8` will be used
as the elstrided offset for LDs.

```
    setmvlhi  8, 2 # MVL=8, VFHint=2
loop:
    setvl  r1, CTR, vf=1 # VL=r1=MIN(MVL, CTR), VF=1
    mulli  r1, r1, 8     # multiply by int width
loopinner:
    sv.ld r20.v, r4(0) # load VLhint elements (max 2)
    addi r2, r4, 16    # 2 elements ahead
    sv.ld r32.v, r2(0) # load VLhint elements (max 2)
    sv.add r32.v, r20.v, r32.v # x[i+2] += x[i]
    sv.st r32.v, r2(0) # store VLhint elements
    svstep.            # srcstep += VLhint
    bnz loopinner      # repeat until srcstep=VL
    # now done VL elements, move to next batch
    add r4, r4, r1     # move r4 pointer forward
    sv.bnz/ctr loop    # decrement CTR by VL
```
 332     sv.bnz/ctr loop    # decrement CTR by VL
 333 ```