openpower/sv/setvl.mdwn

[[!tag standards]]

# DRAFT setvl/setvli

See links:

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://bugs.libre-soc.org/show_bug.cgi?id=927> bug - RT>=32
* <https://bugs.libre-soc.org/show_bug.cgi?id=862> VF Predication
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* [[sv/svstep]]
* pseudocode [[openpower/isa/simplev]]

Use of setvl results in changes to the SVSTATE SPR. See [[sv/sprs]].

# Behaviour and Rationale

SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV.  However unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering.  Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL" and this is architecturally hard-coded into RVV systems,
anywhere from 1 to tens of thousands of Lanes in supercomputers.

SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" If the amount is too small, performance may
be affected; if it is too large, other registers would overlap and
cause data corruption or, even if allocated correctly, would require
spill to memory.

The answer effectively needs to be parameterised.  Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.

While RVV's MAXVL is a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though for
a specific CPU it may affect hardware unit usage.

Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist.  RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by the hardware.  In SV, within the
limit of MVL, VL **MUST** be set to the requested value. Given that RVV
only works on Vector Loops, RVV's latitude is fine and part of its value
and design.  However, SV sits on top
of the standard register files.  When MVL=VL=2, a Vector Add on `r3`
will perform two Scalar Adds: one on `r3` and one on `r4`.

Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters); just two instructions are needed:

    setvli r0, MVL=64, VL=64
    sv.ld *r0, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is *guaranteed* 100% without fail to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63.  Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack.  This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(due to variances in implementation choices).

Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops. A caveat: to avoid having an extra opcode
bit in `setvl`, selection of CTR mode is slightly convoluted.

# Format

*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
[[sv/bitmanip]] space.

Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16.22 | 23.25      | 26.30 |31|  name   |
| -- | -- | --- | ---- |----------- | ----- |--| ------- |
|OPCD| RT | RA  | SVi  |   ms vs vf | 11011 |Rc| setvl   |

Instruction format:

    setvl RT,RA,SVi,vf,vs,ms
    setvl. RT,RA,SVi,vf,vs,ms

Note that the immediate (`SVi`) spans 7 bits (16 to 22).

* `ms` - bit 23 - allows for setting of MVL
* `vs` - bit 24 - allows for setting of VL
* `vf` - bit 25 - sets "Vertical First Mode".

Note that in immediate setting mode VL and MVL start from **one**
but that this is compensated for in the assembly notation:
an immediate value of 1 in assembler notation
actually places the value 0b0000000 in the `SVi` field bits.
On execution the `setvl` instruction adds one to the decoded
`SVi` field bits, resulting in
VL/MVL being set to 1. This allows VL to be set to values
ranging from 1 to 128 with only 7 bits instead of 8.
Setting VL/MVL
to 0 would result in all Vector operations becoming `nop`.  If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
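
The off-by-one immediate encoding can be sketched in Python as follows.
This is an illustrative model only: the helper names `encode_svi` and
`decode_svi` are hypothetical, and only the 7-bit field width and the
add-one-on-execution rule are taken from the text above.

```python
SVI_BITS = 7  # width of the SVi field (bits 16 to 22)


def encode_svi(value):
    """Assembler side: map a VL/MVL immediate (1..128) onto the SVi field bits."""
    if not 1 <= value <= (1 << SVI_BITS):
        raise ValueError("immediate VL/MVL must be in the range 1..128")
    return value - 1


def decode_svi(field):
    """Execution side: setvl adds one to the decoded SVi field bits."""
    if not 0 <= field < (1 << SVI_BITS):
        raise ValueError("SVi field is only 7 bits wide")
    return field + 1
```

In this model `encode_svi(1)` yields the all-zero field and `decode_svi(0b1111111)`
yields 128, whilst a request for zero is rejected, matching the note that
VL=0 has to be set through the SVSTATE SPR instead.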

Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli   VL=8   : setvl  r0, r0, VL=8, vf=0, vs=1, ms=0
    setvli.  VL=8   : setvl. r0, r0, VL=8, vf=0, vs=1, ms=0
    setmvli  MVL=8  : setvl  r0, r0, MVL=8, vf=0, vs=0, ms=1
    setmvli. MVL=8  : setvl. r0, r0, MVL=8, vf=0, vs=0, ms=1

Additional pseudo-op for obtaining VL without modifying it (or any state):

    getvl  r5      : setvl  r5, r0, vf=0, vs=0, ms=0
    getvl. r5      : setvl. r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svfstep         : setvl  0, 0, vf=1, vs=0, ms=0
    svfstep.        : setvl. 0, 0, vf=1, vs=0, ms=0

This pseudo-op is different from [[sv/svstep]] which is used to
perform detailed enquiries about internal state.

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction.  Doing so would require two instructions.

**Selecting sources for VL**

There is considerable opcode pressure. Consequently, the source from
which VL is set is selected by `RA` and `RT` as follows:

| condition            | effect                      |
| -                    | -                           |
| `vs=1, RA=0, RT!=0`  | VL,RT set to MIN(MVL, CTR)  |
| `vs=1, RA=0, RT=0`   | VL set to MIN(MVL, SVi+1)   |
| `vs=1, RA!=0, RT=0`  | VL set to MIN(MVL, RA)      |
| `vs=1, RA!=0, RT!=0` | VL,RT set to MIN(MVL, RA)   |

The reasoning here is that the opportunity to set RT equal to the
immediate `SVi+1` is sacrificed in favour of setting from CTR.
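
The four rows of the table above can be modelled as a small Python
function. This is a sketch, not the specification pseudocode: the name
`select_vl`, its argument layout and the use of a dict for the register
file are assumptions made for illustration, and MVL is taken as already
known (the `ms` bit is not modelled).

```python
def select_vl(ra, rt, svi, mvl, ctr, regs):
    """Model of VL source selection when vs=1.

    ra, rt : register *numbers* from the instruction (0 selects the special cases)
    svi    : raw 7-bit SVi field (the assembler immediate minus one)
    mvl    : current MVL
    ctr    : current value of CTR
    regs   : hypothetical register file, a dict of register number -> value
    Returns (new VL, register number written with VL, or None if none is written).
    """
    if ra == 0 and rt != 0:
        return min(mvl, ctr), rt            # CTR mode: VL and RT both set
    if ra == 0:
        return min(mvl, svi + 1), None      # immediate mode: only VL set
    vl = min(mvl, regs[ra])                 # register mode: VL from RA
    return vl, (rt if rt != 0 else None)
```

For example, `select_vl(0, 5, 0, 64, 1000, {})` models CTR mode with
CTR=1000 and MVL=64: VL is clamped to 64 and the same value is written
to register 5.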

# Unusual Rc=1 behaviour

Normally, the return result from an instruction is in `RT`. With
`RT=0` selecting modes in which no result register is written
(and `RT!=0` with `RA=0` selecting `CTR` mode),
some different semantics are needed.

CR Field 0, when `Rc=1`, may be set even if `RT=0`. The reason is that
overflow may occur: `VL`, if set either from an immediate or from `CTR`,
may not exceed `MAXVL`, and if the requested value does, `CR0.SO` must be set.

Additionally, in reality it is **`VL`** being set. Therefore, rather
than `CR0` testing `RT` when `Rc=1`, CR0.EQ is set if `VL=0`, CR0.GE
is set if `VL` is non-zero.
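
A minimal Python sketch of the `Rc=1` rules, as worded in this section,
is given below. The function name `setvl_cr0` is hypothetical, and the
flag names `EQ`, `GE` and `SO` are simply those used in the text above;
how they map onto the bits of CR Field 0 is not modelled.

```python
def setvl_cr0(requested, mvl):
    """Return (VL, flags) for a setvl. that sets VL (vs=1).

    requested : the value VL is asked to take (from an immediate, RA or CTR)
    mvl       : current MAXVL
    """
    vl = min(requested, mvl)       # VL may not exceed MAXVL
    flags = {
        "EQ": vl == 0,             # VL ended up as zero
        "GE": vl != 0,             # VL ended up non-zero
        "SO": requested > mvl,     # the request overflowed MAXVL
    }
    return vl, flags
```

With `requested=1000` and `mvl=64` the model returns VL=64 with `SO`
set; with `requested=0` it returns VL=0 with `EQ` set, which is the
condition a following branch would test.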

# Vertical First Mode

Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction.  **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
instruction.

An explicit mode of setvl is then used to move srcstep and
dststep on to the next element, still respecting predicate
masks.

In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, then an explicit instruction is used to
advance srcstep/dststep. An outer loop is expected to be
used (branch instruction) which completes a series of
Vector operations.
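
The difference in execution order can be illustrated with a short
Python model. It is a deliberately simplified sketch: predicate masks
and REMAP are ignored, srcstep and dststep are assumed to advance
together as a single `step`, and the function names are hypothetical.

```python
def horizontal_first(instructions, vl):
    """Normal SVP64: each instruction covers elements 0..VL-1 before the PC moves on."""
    return [(insn, step) for insn in instructions for step in range(vl)]


def vertical_first(instructions, vl):
    """Vertical-First: each instruction executes one element, then stepping is explicit."""
    order = []
    for step in range(vl):            # outer loop, closed by a branch instruction
        for insn in instructions:     # PC moves on after a single element
            order.append((insn, step))
        # the explicit stepping instruction would advance srcstep/dststep here
    return order
```

For two instructions `A` and `B` with VL=2, the first function gives the
order A0, A1, B0, B1 and the second gives A0, B0, A1, B1.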

`svfstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loop reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0 (when Rc=1).

When RT is not zero, an internal stepping index may also be returned,
either the REMAP index or srcstep or dststep. This list is identical
to that of [[sv/svstep]]:

* `SVi=1`: also include inner middle and outer
  loop end conditions from SVSTATE0 into CR.EQ CR.LE CR.GT
* `SVi=2`: test SVSTATE1 (and return conditions)
* `SVi=3`: test SVSTATE2 (and return conditions)
* `SVi=4`: test SVSTATE3 (and return conditions)
* `SVi=5`: `SVSTATE.srcstep` is returned.
* `SVi=6`: `SVSTATE.dststep` is returned.

Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.

*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*

**SUBVL**

Sub-vector elements are not to be considered "Vertical". The vec2/3/4
is to be considered as if it were the "single element".  Caveats exist for
[[sv/mv.swizzle]] and [[sv/mv.vec]] when Pack/Unpack is enabled,
due to the order in which VL and SUBVL loops are applied being
swapped (outer-inner becomes inner-outer).

# Examples

## Core concept loop

```
loop:
    setvl a3, a0, MVL=8    #  update a3 with vl
                           # (# of elements this iteration)
                           # set MVL to 8
    # do vector operations at up to 8 length (MVL=8)
    # ...
    sub a0, a0, a3   # Decrement count by vl
    bnez a0, loop    # Any more?
```
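
The control flow of this loop can be traced with a small Python model.
It is a sketch under stated assumptions: the function name `strip_mine`
is hypothetical, and `setvl` is modelled purely as `MIN(count, MVL)`
with no other state.

```python
def strip_mine(count, mvl=8):
    """Return the VL used on each pass of the core concept loop."""
    passes = []
    while True:
        vl = min(count, mvl)    # setvl a3, a0, MVL=8
        passes.append(vl)       # ... vector operations on vl elements ...
        count -= vl             # sub a0, a0, a3
        if count == 0:          # bnez a0, loop
            break
    return passes
```

For 20 elements with MVL=8 the model reports passes of 8, 8 and 4
elements. As in the assembler version, the test is at the bottom of the
loop, so a count of zero still makes one pass with VL=0.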

## Loop using Rc=1

    my_fn:
      li r3, 1000              # total number of elements
      b test
    loop:
      sub r3, r3, r4           # subtract the elements covered by VL
      ...
    test:
      setvl. r4, r3, MVL=64    # VL (and r4) = MIN(r3, 64); CR0.EQ set if VL=0
      bne cr0, loop            # keep looping whilst VL is non-zero
    end:
      blr

## Load/Store-Multi (selective)

Up to 64 FPRs will be loaded, here.  `r3` has one bit set
for each FP register required to be loaded.  The block of memory
from which the registers are loaded is contiguous (no gaps):
any FP register which has a corresponding zero bit in `r3`
is *unaltered*.  In essence this is a selective LD-multi with
"Scatter" capability.

    setvli r0, MVL=64, VL=64
    sv.lfd/dm=r3 *fp0, 0(r30) # selective load 64 FP registers
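
The effect of the destination mask can be modelled in Python as below.
This is an illustrative sketch only: the function name `selective_load`
is hypothetical, memory is modelled as a plain list of values, and bit
`i` of the mask (counting from the least significant bit) is assumed to
correspond to FP register `i`.

```python
def selective_load(fprs, memory, mask, vl=64):
    """Load contiguous memory values into the FP registers selected by mask."""
    result = list(fprs)
    src = 0                              # next contiguous memory element
    for reg in range(vl):
        if (mask >> reg) & 1:            # assumed: mask bit `reg` selects register `reg`
            result[reg] = memory[src]    # masked-in register receives the next value
            src += 1                     # memory is consumed without gaps
        # masked-out registers are left unaltered
    return result
```

With a mask of `0b1011` and memory holding three values, registers 0, 1
and 3 receive those values in order and register 2 keeps its previous
contents.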
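
Up to 64 FPRs will be saved, here.  Again, `r3` has one bit set
for each FP register required to be saved.

    setvli r0, MVL=64, VL=64
    sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers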
 263     setvli r0, MVL=64, VL=64
 264     sv.stfd/sm=r3 *fp0, 0(r30) # selective store 64 FP registers