# OpenPOWER SV setvl/setvli
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-November/001366.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=535>
* <https://bugs.libre-soc.org/show_bug.cgi?id=587>
* <https://bugs.libre-soc.org/show_bug.cgi?id=568> TODO
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions>
* old page [[simple_v_extension/specification/sv.setvl]]
Use of setvl results in changes to the MVL, VL and STATE SPRs. See [[sv/sprs]].
# Behaviour and Rationale
SV's Vector Engine is based on Cray-style Variable-length Vectorisation,
just like RVV. However, unlike RVV, SV sits on top of the standard Scalar
regfiles: there is no separate Vector register numbering. Therefore, also
unlike RVV, SV does not have hard-coded "Lanes": microarchitects
may use *ordinary* in-order, out-of-order, or superscalar designs
as the basis for SV. By contrast, the relevant parameter
in RVV is "MAXVL", which is architecturally hard-coded into RVV systems
at anywhere from 1 to tens of thousands of Lanes in supercomputers.
SV is more like how MMX used to sit on top of the x86 FP regfile.
Therefore, when Vector operations are performed, the question has to
be asked, "well, how much of the regfile do you want to allocate to
this operation?" If the amount is too small, performance may
be affected; if it is too large, other registers would overlap and
cause data corruption, or even if allocated correctly would require
The answer effectively needs to be parameterised. Hence: MAXVL (MVL)
is set from an immediate, so that the compiler may decide, statically, a
guaranteed resource allocation according to the needs of the application.
While RVV's MAXVL was a hardware limit, SV's MVL is simply a loop
optimization. It does not carry side-effects for the architecture, though for
a specific CPU it may affect hardware unit usage.
Other than being able to set MVL, SV's VL (Vector Length) works just like
RVV's VL, with one minor twist. RVV permits the `setvl` instruction to
set VL to an arbitrary value chosen by the hardware. Given that RVV only
works on Vector Loops, this is fine and part of its value and design.
In SV, by contrast, within the limit of MVL, VL **MUST** be set to the
requested value, because SV sits on top of the standard register files:
when MVL=VL=2, a Vector Add on `r3` will perform two Scalar Adds, one on
`r3` and one on `r4`.
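The effect of a Vector Add over the scalar regfile can be sketched in Python. This is an illustrative model only: the register file is a plain list, the operand registers `r8`/`r9` are assumptions chosen for the example, and predication and 64-bit wrap-around are ignored.

```python
def vector_add(regs, rt, ra, rb, vl):
    """Model one Vectorised add as `vl` scalar adds on consecutive registers."""
    for i in range(vl):
        # element i uses registers rt+i, ra+i and rb+i of the scalar regfile
        regs[rt + i] = regs[ra + i] + regs[rb + i]


regs = [0] * 32                 # hypothetical 32-entry scalar register file
regs[3], regs[4] = 10, 20       # first operand vector lives in r3, r4
regs[8], regs[9] = 1, 2         # second operand vector lives in r8, r9 (assumed)
vector_add(regs, rt=3, ra=3, rb=8, vl=2)   # VL=2: one add on r3, one on r4
```

With VL=2 exactly two registers (`r3` and `r4`) are written, which is why an oversized VL would silently overwrite neighbouring registers.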
Thus there is the opportunity to set VL to an explicit value (within the
limits of MVL) with the reasonable expectation that if two operations
are requested (by setting VL=2) then two operations are guaranteed.
This avoids the need for a loop (with not-insignificant use of the
regfiles for counters). Two instructions suffice:

    setvli r0, MVL=64, VL=64
    ld r0.v, 0(r30) # load exactly 64 registers from memory

Page Faults etc. aside, this is *guaranteed* to perform
64 unit-strided LDs starting from the address pointed to by r30 and put
the contents into r0 through r63. Thus it becomes a "LOAD-MULTI". Twin
Predication could even be used to only load relevant registers from
the stack. This *only works if VL is set to the requested value* rather
than, as in RVV, allowing the hardware to set VL to an arbitrary value
(the caveat being that it may not exceed MVL).
Also available is the option to set VL from CTR (`VL = MIN(CTR, MVL)`).
In combination with SVP64 [[sv/branches]] this can save one instruction
inside critical inner loops.
*(Allocation of opcode TBD pending OPF ISA WG approval)*,
using EXT22 temporarily and fitting into the
Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16..21| 22...25    | 26.30 |31| name    |
| -- | -- | ---   | ----   |----------- | ----- |--| ------- |
|OPCD| RT | RA    | SVi    |cv ms vs vf | 11110 |Rc| setvl   |

Note that the immediate (`SVi`) spans 6 bits (16 to 21).
* `cv` - bit 22 - reads CTR instead of RA.
* `ms` - bit 23 - allows for setting of MVL.
* `vs` - bit 24 - allows for setting of VL.
* `vf` - bit 25 - sets "Vertical First Mode".
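As a rough illustration of the field layout tabulated above, the following Python sketch packs and unpacks a 32-bit word using MSB0 bit numbering (bit 0 is the most significant bit). The helper names `pack` and `unpack`, the label `XO` for bits 26..30, and the use of 22 as the major opcode (taken from the provisional EXT22 allocation) are assumptions made for the example, not part of the specification.

```python
# name: (first_bit, last_bit), MSB0 numbering as in the table
FIELDS = {
    "OPCD": (0, 5), "RT": (6, 10), "RA": (11, 15), "SVi": (16, 21),
    "cv": (22, 22), "ms": (23, 23), "vs": (24, 24), "vf": (25, 25),
    "XO": (26, 30), "Rc": (31, 31),
}


def pack(**values):
    """Build a 32-bit instruction word from named field values."""
    word = 0
    for name, value in values.items():
        first, last = FIELDS[name]
        width = last - first + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << (31 - last)   # MSB0: the last bit of a field sits at 31-last
    return word


def unpack(word, name):
    """Extract one named field from a 32-bit instruction word."""
    first, last = FIELDS[name]
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


# Hypothetical encoding of "setvl r5, r0, VL=8": immediate field 7, vs=1
word = pack(OPCD=22, RT=5, RA=0, SVi=7, vs=1, XO=0b11110)
```

Unset fields default to zero, so `ms`, `cv`, `vf` and `Rc` are all clear in the example word.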
Note that in immediate setting mode VL and MVL start from **one**,
i.e. an immediate value of zero will result in VL/MVL being set to 1,
and 0b111111 results in VL/MVL being set to 64. This is because setting
VL/MVL to 1 results in "scalar identity" behaviour, whereas setting VL/MVL
to 0 would result in all Vector operations becoming `nop`. If this is
truly desired (nop behaviour) then setting VL and MVL to zero is to be
done via the [[SVSTATE SPR|sv/sprs]].
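The off-by-one immediate encoding can be expressed as a pair of small helpers. This is a minimal sketch: the function names are hypothetical, and the 1..64 range is taken from the 0b111111 example in the text.

```python
def decode_vl_immediate(svi):
    """Immediate field value -> VL/MVL (a field of 0 means a length of 1)."""
    return svi + 1


def encode_vl_immediate(length):
    """VL/MVL -> immediate field value; zero cannot be encoded this way."""
    if not 1 <= length <= 64:
        raise ValueError("immediate mode only encodes lengths 1..64")
    return length - 1
```

For example, `decode_vl_immediate(0)` gives 1 and `decode_vl_immediate(0b111111)` gives 64, while a length of zero is rejected because it has to be written through the SPR instead.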
Note that setmvli is a pseudo-op, based on RA/RT=0, and setvli likewise:

    setvli VL=8 : setvl r5, r0, VL=8
    setmvli MVL=8 : setvl r0, r0, MVL=8

Additional pseudo-op for obtaining VL without modifying it:

    getvl r5 : setvl r5, r0, vf=0, vs=0, ms=0

For Vertical-First mode, a pseudo-op for explicit incrementing
of srcstep and dststep:

    svstep. : setvl. 0, 0, vf=1, vs=0, ms=0

Note that whilst it is possible to set both MVL and VL from the same
immediate, it is not possible to set them to different immediates in
the same instruction. That would require two instructions.
In Vertical-First Mode the minimum expectation is that one scalar
element may be executed by each instruction. There are, however,
circumstances where it may be possible to execute more than one
element per instruction (srcstep elements 0-3 for example),
but leaving it up to hardware to
determine a "safe minimum amount" where memory corruption does
not occur may not be practical (or is simply very costly).
Therefore, setmvlhi may specify, as determined by the compiler,
exactly what that quantity is. Unlike VL, which is an amount
that, when requested, **must** be executed, VFhint may be set
by the hardware to an amount that the hardware is capable of.
In other words: setmvlhi requests a hint size, but hardware chooses
The reason for this cooperative negotiation between hardware and
software is that each side knows something the other does not.
The compiler may have information about
memory hazards that must be avoided which hardware cannot
know about, whereas the hardware knows the maximum batch size
it can execute in parallel but the compiler is unaware of
the variance in that batch size on different implementations.
Thus, hardware sets VFhint to the minimum of the requested
amount and the hardware limit. Simple implementations always
Critical to note are two things:

1. VFhint must not be set by hardware to an amount that
   exceeds either MVL or the requested amount, and hardware must set
   VFhint to at least 1 element.
2. svstep will increment srcstep and dststep by VFhint,
   therefore when hardware says it can perform N element
   operations, hardware **MUST** perform N operations
   for every single instruction.

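The negotiation rule and the resulting stepping can be sketched as follows. This is an illustrative model only: `negotiate_vfhint`, `svstep_schedule` and the `hw_limit` parameter are hypothetical names, all inputs are assumed to be at least 1, and the case where VL is not a multiple of VFhint is deliberately left out.

```python
def negotiate_vfhint(requested, mvl, hw_limit):
    """Pick VFhint: never above the request, MVL or the hardware batch limit."""
    if min(requested, mvl, hw_limit) < 1:
        raise ValueError("requested, mvl and hw_limit are assumed to be >= 1")
    return min(requested, mvl, hw_limit)


def svstep_schedule(vl, vfhint):
    """List the srcstep value seen by each pass when svstep adds VFhint."""
    steps = []
    srcstep = 0
    while srcstep < vl:
        steps.append(srcstep)
        srcstep += vfhint      # svstep: advance by the negotiated batch size
    return steps
```

With a request of 2, MVL=8 and a hardware limit of 4, `negotiate_vfhint(2, 8, 4)` yields 2, and `svstep_schedule(8, 2)` yields `[0, 2, 4, 6]`: four passes of two elements each.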
Form: SVL-Form (see [[isatables/fields.text]])

| 0.5|6.10|11.15|16..21|22 | 23...25  | 26.30 |31| name     |
| -- | -- | ---   | ----   |---| -------- | ----- |--| -------- |
|OPCD| RT | MVL   | SVi    |MVL| ms vs vf | 10110 |Rc| setmvlhi |

# Vertical First Mode
Vertical First is effectively like an implicit single bit predicate
applied to every SVP64 instruction. **ONLY** one element in each
SVP64 Vector instruction is executed; srcstep and dststep do **not**
increment, and the Program Counter progresses **immediately** to
the next instruction just as it would for any standard scalar v3.0B
An explicit mode of setvl is called which can move srcstep and
dststep on to the next element, still respecting predicate
In other words, where normal SVP64 Vectorisation acts "horizontally"
by looping first through 0 to VL-1 and only then moving the PC
to the next instruction, Vertical-First moves the PC onwards
(vertically) through multiple instructions **with the same
srcstep and dststep**, then an explicit instruction is used to
advance srcstep/dststep, and an outer loop is expected to be
used (branch instruction) which completes a series of
`svstep` mode is enabled when vf=1, vs=0 and ms=0.
When Rc=1 it is possible to determine when any level of
loops reaches an end condition, or if VL has been reached. The immediate can
be reinterpreted as indicating which SVSTATE (0-3)
should be tested and placed into CR0.
* setvl immediate = 1: only VL testing is enabled. CR0.SO is set
  to 1 when either srcstep or dststep reaches VL
* setvl immediate = 2: also include inner, middle and outer
  loop end conditions from SVSTATE0 into CR.EQ CR.LT CR.GT
* setvl immediate = 3: test SVSTATE1
* setvl immediate = 4: test SVSTATE2
* setvl immediate = 5: test SVSTATE3
Testing any end condition of any loop of any REMAP state allows branches to be used to create loops.
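The difference between normal ("horizontal") SVP64 execution and Vertical-First execution is purely one of ordering, which the following Python sketch makes explicit. It is a simplified model under stated assumptions: the function names are hypothetical, one element is executed per step, and predication and REMAP are ignored.

```python
def horizontal_first(instructions, vl):
    """Each instruction loops over elements 0..VL-1 before the PC moves on."""
    return [(insn, step) for insn in instructions for step in range(vl)]


def vertical_first(instructions, vl):
    """The PC walks through all instructions at one step, then svstep advances."""
    order = []
    step = 0
    while step < vl:                   # outer loop, closed by a branch
        for insn in instructions:      # every instruction sees the same step
            order.append((insn, step))
        step += 1                      # svstep: advance srcstep/dststep
    return order


h_order = horizontal_first(["ld", "add"], 2)
v_order = vertical_first(["ld", "add"], 2)
```

Both orderings execute the same set of (instruction, element) pairs; only the sequence differs, which is why the end-of-loop test has to be made explicitly in Vertical-First mode.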
*Programmers should be aware that VL, srcstep and dststep are global in nature.
Nested looping with different schedules is perfectly possible, as is
calling of functions, however SVSTATE (and any associated SVSTATE) should be stored on the stack.*

    // instruction fields:
    rd = get_rt_field(); // bits 6..10
    ra = get_ra_field(); // bits 11..15
    vc = get_vc_field(); // bit 22
    vf = get_vf_field(); // bit 25
    vs = get_vs_field(); // bit 24
    ms = get_ms_field(); // bit 23
    Rc = get_Rc_field(); // bit 31
    if vf and not vs and not ms {
    // increment src/dest step mode
    // NOTE! this is in no way complete! predication is not included
    // and neither is SUB-VL mode
    srcstep = SPR[SV].srcstep
    dststep = SPR[SV].dststep
    rollover = (srcstep == VL or dststep == VL)
    // Reset srcstep, dststep, and also exit "Vertical First" mode
    SPR[SV].srcstep = srcstep
    SPR[SV].dststep = dststep
    // write CR? helps for doing Vertical loops, detects end
    // of Vector Elements
    // update CR to indicate that srcstep/dststep "rolled over"
    // add one. MVL/VL=1..64 not 0..63
    vlimmed = get_immed_field()+1; // 16..21
    // 4 options: from SPR, from immed, from ra, from CTR
    // VL to be sourced from fields/regs
    // VL not to change (except if MVL is reduced)
    // 2 options: from SPR, from immed
    // MVL not to change, read from SPRs
    // calculate (limit) VL
    // update CR from VL (not rt)
    // write Vertical-First mode

    setvl a3, a0, MVL=8 # update a3 with vl
                        # (# of elements this iteration)
    # do vector operations at up to 8 length (MVL=8)
    sub a0, a0, a3 # Decrement count by vl
    bnez a0, loop # Any more?

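The strip-mining pattern in the loop above can be modelled in Python as follows. This is a sketch rather than a definitive description of the instruction: `strip_mine` is a hypothetical helper, and unlike the assembly loop (which always runs its body once) it returns an empty list for a zero count.

```python
def strip_mine(total, mvl):
    """Return the VL used on each iteration when processing `total` elements."""
    remaining = total
    chunks = []
    while remaining != 0:
        vl = min(remaining, mvl)   # setvl a3, a0, MVL=8
        chunks.append(vl)          # vector operations on `vl` elements go here
        remaining -= vl            # sub a0, a0, a3
    return chunks


chunks = strip_mine(20, 8)
```

For 20 elements with MVL=8 the loop runs three times, with VL set to 8, 8 and then 4 for the final partial batch.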

    setvli. r4, r3, MVL=64

## setmvlhi double loop

    setmvlhi 8, 2 # MVL=8, VFhint=2
    setvl r5, r3 # VL=r5=MIN(MVL, r3)
    sv.ld r20.v, r4(0) # load VFhint elements (max 2)
    sv.addi r20.v, r20.v, 55 # add 55 to 2 elements
    sv.st r20.v, r4(0) # store VFhint elements
    svstep. # srcstep += VFhint
    bnz loopinner # repeat until srcstep=VL
    # now done VL elements, move to next batch
    add r4, r4, r5 # move r4 pointer forward
    sub. r3, r3, r5 # decrement total count by VL