[[!tag standards]]

# OpenPOWER SV setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* old page [[simple_v_extension/specification/sv.setvl]]

Use of setvl results in changes to the MVL, VL and STATE SPRs. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering.  Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" because if the amount is too small, performance may
be affected, and if too large then other registers would overlap and
cause data corruption, or even if allocated correctly would require
spill to memory.

The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though for
a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by hardware.  In SV, within the limit
of MVL, VL **MUST** be set to the requested value. Given that RVV only
works on Vector Loops, its approach is fine and part of its value and
design.  However, SV sits on top
of the standard register files.  When MVL=VL=2, a Vector Add on `r3`
will perform two Scalar Adds: one on `r3` and one on `r4`.
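
As a rough illustration of the point above, the following Python sketch models a Vectorised add as a plain loop over consecutively numbered entries of a single scalar register file. The names `sv_add` and `gpr` and the regfile size of 128 are assumptions made for this sketch only; predication, element width and REMAP are all ignored.

```python
def sv_add(gpr, rt, ra, rb, vl):
    """Hypothetical model: one Vector add becomes `vl` scalar adds
    on consecutively numbered scalar registers."""
    for i in range(vl):
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i]

gpr = [0] * 128            # assumed regfile size, for illustration only
gpr[3], gpr[4] = 10, 20    # first source vector lives in r3, r4
gpr[5], gpr[6] = 1, 2      # second source vector lives in r5, r6

# "Vector Add on r3" with VL=2: one scalar add on r3, one on r4
sv_add(gpr, rt=3, ra=3, rb=5, vl=2)
print(gpr[3], gpr[4])
```

With VL=1 the loop body runs exactly once, which is the "scalar identity" case.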

Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters), requiring simply two instructions:

    setvli r0, MVL=64, VL=64
    ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside this is *guaranteed* 100% without fail to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack.  This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(the caveat being that it is limited to not exceed MVL).

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space

Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16..21| 22...25    | 26.30 |31|  name   |
| -- | -- | --- | ---- |----------- | ----- |--| ------- |
|OPCD| RT | RA  | SVi  |cv ms vs vf | 11110 |Rc| setvl   |

Note that the immediate (`SVi`) spans 6 bits (16 to 21).

* `cv` - bit 22 - reads CTR instead of RA
* `ms` - bit 23 - allows for setting of MVL.
* `vs` - bit 24 - allows for setting of VL.
* `vf` - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1,
and 0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]]
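
The off-by-one immediate encoding described above can be sketched in a few lines of Python. The function names `decode_vl_immediate` and `encode_vl_immediate` are hypothetical helpers invented for this illustration; they are not part of any existing tool.

```python
def decode_vl_immediate(svi):
    """Map the raw immediate field onto the VL/MVL value it requests.
    An encoded value of 0 means 1, and 0b111111 means 64."""
    if not 0 <= svi <= 0b111111:
        raise ValueError("immediate out of range")
    return svi + 1

def encode_vl_immediate(value):
    """Inverse mapping: VL/MVL of 1..64 to the raw immediate field.
    Zero cannot be encoded; it has to be set via the SPR instead."""
    if not 1 <= value <= 64:
        raise ValueError("VL/MVL must be in the range 1..64")
    return value - 1

print(decode_vl_immediate(0), decode_vl_immediate(0b111111))
print(encode_vl_immediate(8))
```

The design choice is that the otherwise wasted "zero" encoding is traded for one extra usable value at the top of the range.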

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli VL=8    : setvl r5, r0, VL=8
    setmvli MVL=8  : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

    getvl r5       : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  That would require two instructions.

# setmvlhi

In Vertical-First Mode the minimum expectation is that one scalar
element may be executed by each instruction. There are however
circumstances where it may be possible to execute more than one
element per instruction (srcstep elements 0-3 for example)
but leaving it up to hardware to
determine a "safe minimum amount" where memory corruption does
not occur may not be practical (or is simply very costly).

Therefore, setmvlhi may specify, as determined by the compiler,
exactly what that quantity is.  Unlike VL, which is an amount
that, when requested, **must** be executed, VFhint may be set
by the hardware to an amount that the hardware is capable of.
In other words: setmvlhi requests a hint size, but hardware chooses
the actual hint.

The reason for this cooperative negotiation between hardware and
software is that each side holds information the other lacks.
The compiler may have information about
memory hazards that must be avoided which hardware cannot
know about, whereas the hardware knows the maximum batch size
it can execute in parallel but the compiler is unaware of
the variance in that batch size on different implementations.
Thus, hardware sets VFhint to the minimum of the requested
amount and the hardware limit. Simple implementations always
set VFhint to 1.

Critical to note are two things:

1. Hardware must not set VFhint to an amount that
exceeds either MVL or the requested amount, and must set
VFhint to at least 1 element.
2. svstep will increment srcstep and dststep by VFhint,
therefore when hardware says it can perform N element
operations, hardware **MUST** perform N operations
for every single instruction.
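
The negotiation rules above can be summarised in a short Python sketch. The function `negotiate_vfhint` and its parameter names are hypothetical, and the sketch assumes the requested amount is at least 1.

```python
def negotiate_vfhint(requested, hw_limit, mvl):
    """Hypothetical model of how hardware might choose VFhint.

    requested: the amount asked for by setmvlhi (assumed >= 1)
    hw_limit:  the largest batch this implementation can execute
    mvl:       the current MVL
    """
    hint = min(requested, hw_limit, mvl)  # never exceed request or MVL
    return max(hint, 1)                   # always at least one element

print(negotiate_vfhint(requested=2, hw_limit=4, mvl=8))   # capable hardware
print(negotiate_vfhint(requested=4, hw_limit=1, mvl=8))   # simple hardware
```

Whatever value is returned is then a commitment: svstep advances by exactly that many elements, so every instruction must process that many.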

Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16..21|22 | 23...25  | 26.30 |31|  name    |
| -- | -- | --- | ---- |---| -------- | ----- |--| -------- |
|OPCD| RT | MVL | SVi  |MVL| ms vs vf | 10110 |Rc| setmvlhi |

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

An explicit mode of setvl is then used to move srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, then an explicit instruction is used to
advance srcstep/dststep, and an outer loop is expected to be
used (branch instruction) which completes a series of
Vector operations.
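
The difference in execution order can be made concrete with a small Python sketch that lists (instruction, element step) pairs in the order they would be executed. The function names are hypothetical, a step of one element per pass is assumed, and predication is ignored.

```python
def horizontal_order(instructions, vl):
    """Normal SVP64: each instruction loops over all elements
    before the PC moves on to the next instruction."""
    return [(insn, step) for insn in instructions for step in range(vl)]

def vertical_first_order(instructions, vl):
    """Vertical-First: every instruction runs on the same element,
    then an explicit step (svstep) advances to the next element
    and an outer branch repeats the instruction sequence."""
    return [(insn, step) for step in range(vl) for insn in instructions]

insns = ["ld", "add", "st"]
print(horizontal_order(insns, 2))
print(vertical_first_order(insns, 2))
```

Both orders visit the same set of pairs; only the nesting of the two loops is swapped.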

`svstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0.

* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
  to 1 when either srcstep or dststep reaches VL
* setvl immediate = 2: also include inner, middle and outer
  loop end conditions from SVSTATE0 into CR.EQ, CR.LT and CR.GT
* setvl immediate = 3: test SVSTATE1
* setvl immediate = 4: test SVSTATE2
* setvl immediate = 5: test SVSTATE3

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*

# Pseudocode

    // instruction fields:
    rt = get_rt_field();         // bits 6..10
    ra = get_ra_field();         // bits 11..15
    cv = get_cv_field();         // bit 22
    ms = get_ms_field();         // bit 23
    vs = get_vs_field();         // bit 24
    vf = get_vf_field();         // bit 25
    Rc = get_Rc_field();         // bit 31

    if vf and not vs and not ms {
        // increment src/dest step mode
        // NOTE! this is in no way complete! predication is not included
        // and neither is SUB-VL mode
        srcstep = SPR[SV].srcstep
        dststep = SPR[SV].dststep
        VL = SPR[SV].VL
        srcstep++
        dststep++
        rollover = (srcstep == VL or dststep == VL)
        if rollover {
            // Reset srcstep, dststep, and also exit "Vertical First" mode
            srcstep = 0
            dststep = 0
            MSR[6] = 0
        }
        SPR[SV].srcstep = srcstep
        SPR[SV].dststep = dststep

        // write CR? helps for doing Vertical loops, detects end
        // of Vector Elements
        if Rc {
            // update CR to indicate that srcstep/dststep "rolled over"
            CR0.eq = rollover
        }
    } else {
        // add one. MVL/VL=1..64 not 0..63
        vlimmed = get_immed_field()+1; // bits 16..21

        // set VL (or not).
        // 4 options: from SPR, from immed, from ra, from CTR
        if vs {
           // VL to be sourced from fields/regs
           if cv {
               VL = CTR
           } else if ra != 0 {
               VL = GPR[ra]
           } else {
               VL = vlimmed
           }
        } else {
           // VL not to change (except if MVL is reduced)
           // read from SPRs
           VL = SPR[SV_VL]
        }

        // set MVL (or not).
        // 2 options: from SPR, from immed
        if ms {
           MVL = vlimmed
        } else {
           // MVL not to change, read from SPRs
           MVL = SPR[SV_MVL]
        }

        // calculate (limit) VL
        VL = min(VL, MVL)

        // store VL, MVL
        SVSTATE.VL = VL
        SVSTATE.MVL = MVL

        // write rt
        if rt != 0 {
            // rt is not zero
            GPR[rt] = VL;
        }
        // write CR?
        if Rc {
            // update CR from VL (not rt)
            CR0.eq = (VL == 0)
            ...
            ...
        }
        // write Vertical-First mode
        SVSTATE.vf = vf
    }
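
For readers who prefer something executable, the non-svstep branch of the pseudocode above can be transliterated into Python roughly as follows. This is only a sketch: the `state` dictionary stands in for the SVSTATE SPR, the function signature is invented for the illustration, and Rc/CR0 handling is left out.

```python
def setvl(state, gpr, ctr, rt, ra, svi, cv, ms, vs, vf):
    """Hypothetical model of the non-svstep path of the pseudocode."""
    vlimmed = svi + 1                 # immediate encodes 1..64
    if vs:                            # VL sourced from CTR, RA or immediate
        if cv:
            vl = ctr
        elif ra != 0:
            vl = gpr[ra]
        else:
            vl = vlimmed
    else:                             # VL unchanged unless MVL shrinks
        vl = state["VL"]
    mvl = vlimmed if ms else state["MVL"]
    vl = min(vl, mvl)                 # VL may never exceed MVL
    state.update(VL=vl, MVL=mvl, vf=vf)
    if rt != 0:                       # RT=0 means "do not write a result"
        gpr[rt] = vl
    return vl

state = {"VL": 0, "MVL": 0, "vf": 0}
gpr = [0] * 32
gpr[3] = 1000
# modelled on "setvli. r4, r3, MVL=64": VL from r3, MVL from the immediate
first = setvl(state, gpr, ctr=0, rt=4, ra=3, svi=63, cv=0, ms=1, vs=1, vf=0)
# modelled on "getvl r5": read back VL without changing it
second = setvl(state, gpr, ctr=0, rt=5, ra=0, svi=0, cv=0, ms=0, vs=0, vf=0)
print(first, second, gpr[4], gpr[5])
```

The request for 1000 elements is clamped to MVL, and the subsequent read-back returns the clamped value unchanged.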

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```
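
The same strip-mining pattern, written as a Python sketch for clarity. The function `strip_mined_double` is hypothetical, and doubling each element merely stands in for "do vector operations"; `remaining` plays the role of `a0` and `vl` the role of `a3`.

```python
def strip_mined_double(data, mvl=8):
    """Process `data` in batches of at most `mvl` elements,
    returning the VL chosen on each iteration."""
    remaining = len(data)              # a0: elements still to process
    base = 0
    batches = []
    while remaining != 0:
        vl = min(remaining, mvl)       # setvl a3, a0, MVL=8
        for i in range(vl):            # the "vector operation"
            data[base + i] *= 2
        batches.append(vl)
        base += vl
        remaining -= vl                # sub a0, a0, a3
    return batches

values = list(range(20))
print(strip_mined_double(values))
```

Twenty elements with MVL=8 are handled as two full batches followed by one short batch, with no separate tail-handling code.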

## Loop using Rc=1

    my_fn:
      li r3, 1000
      b test
    loop:
      sub r3, r3, r4
      ...
    test:
      setvli. r4, r3, MVL=64
      bne cr0, loop
    end:
      blr

## setmvlhi double loop

Two elements per inner loop are executed per instruction. This assumes
that underlying hardware, when `setmvlhi` requests a parallelism hint of 2,
actually sets a parallelism hint of 2.

This example, in C, would be:

```
long *r4;
for (long i = 0; i < CTR; i++)
{
    r4[i+2] += r4[i];
}
```

where, clearly, it is not possible to do more
than 2 elements in parallel at a time: attempting
to do so would result in data corruption. The compiler
may be able to determine memory aliases and inform
hardware at runtime of the maximum safe parallelism
limit.
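
The hazard can be demonstrated with a Python sketch that compares the sequential loop against a batched version in which each batch loads all of its inputs before storing any result, roughly as parallel hardware would. The functions `reference` and `batched` are hypothetical models written for this illustration.

```python
def reference(x, n):
    """Sequential semantics of the C loop: x[i+2] += x[i]."""
    x = list(x)
    for i in range(n):
        x[i + 2] += x[i]
    return x

def batched(x, n, batch):
    """Process `batch` elements at a time, loading every input of a
    batch before storing any of its results."""
    x = list(x)
    for start in range(0, n, batch):
        width = min(batch, n - start)
        src = [x[start + k] for k in range(width)]
        dst = [x[start + k + 2] for k in range(width)]
        for k in range(width):
            x[start + k + 2] = dst[k] + src[k]
    return x

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = 8
print(batched(data, n, 2) == reference(data, n))   # batch of 2 is safe
print(batched(data, n, 4) == reference(data, n))   # batch of 4 is not
```

With a batch of 4, element `start+2` is read as an input before the same batch has updated it, so the result diverges from the sequential loop.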

Whilst this example could be simplified to simply set VL=2,
or exploit the fact that overlapping adds have well-defined
behaviour, this has not been done here, for illustrative purposes,
in order to demonstrate setmvlhi and Vertical-First Mode.

Note, crucially, how r4, r32 and r20 are **NOT** incremented
inside the inner loop.  The MAXVL reservation is still 8,
i.e. as srcstep and dststep advance (by 2 elements at a time)
registers r20-r27 will be used for the first LD, and
registers r32-r39 for the second LD.  `r4+srcstep*8` will be used
as the elstrided offset for LDs.

```
    setmvlhi  8, 2       # MVL=8, VFhint=2
loop:
    setvl  r1, CTR, vf=1 # VL=r1=MIN(MVL, CTR), VF=1
    mulli  r1, r1, 8     # multiply by int width
loopinner:
    sv.ld r20.v, 0(r4) # load VFhint elements (max 2)
    addi r2, r4, 16    # 2 elements ahead
    sv.ld r32.v, 0(r2) # load VFhint elements (max 2)
    sv.add r32.v, r20.v, r32.v # x[i+2] += x[i]
    sv.st r32.v, 0(r2) # store VFhint elements
    svstep.            # srcstep += VFhint
    bnz loopinner      # repeat until srcstep=VL
    # now done VL elements, move to next batch
    add r4, r4, r1     # move r4 pointer forward
    sv.bnz/ctr loop    # decrement CTR by VL
```