(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 Links:
6
7 * TODO merge old standards page [[simple_v_extension/vector_ops/]]
8 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
9 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
10 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
12 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
13 out of scope for this document [[openpower/sv/3d_vector_ops]]
14 * [[simple_v_extension/specification/bitmanip]] previous version,
15 contains pseudocode for sof, sif, sbf
16
17
18 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
19
20 Notes:
21
22 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
23 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
24 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
25
26 # Vector
27
28 Both of these instructions may be synthesised from SVP64 Vector
29 instructions. conflictd is an O(N^2) instruction based on
30 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
31 with the appropriate predication
32
33 ## conflictd
34
35 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
36
37 input = [100, 100, 3, 100, 5, 100, 100, 3]
38 conflict result = [
39 0b00000000, // Note: first element always zero
40 0b00000001, // 100 is present on #0
41 0b00000000,
42 0b00000011, // 100 is present on #0 and #1
43 0b00000000,
44 0b00001011, // 100 is present on #0, #1, #3
45 0b00011011, // .. and #4
46 0b00000100 // 3 is present on #2
47 ]
48
49 Pseudocode:
50
51 for i in range(VL):
52 for j in range(1, i):
53 if src1[i] == src2[j]:
54 result[j] |= 1<<i
55
56 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
57 using `mfcrweird` and `cmpi`. first triangular schedule on src1,
58 secpnd on src2.
59
60 Idea 2: implement using outer loop on varying setvl Horizontal-First
61 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
62 result.
63
64 li r3, 1
65 li result, 0
66 for i in range(target):
67 setvl target
68 sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
69 sv.cmpi src2.v, t0 # compare src2 vector to scalar
70 sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
71 srr t1, t1, i # shift up by i before ORing
72 or result, result, t1
73 srr r3, r3, 1 # shift r3 predicate up by one
74
75 See [[sv/cr_int_predication]] for full details on the crweird instructions:
76 the primary important aspect here is that a Vector of CR Field's EQ bits is
77 transferred into a single GPR. The secondary important aspect is that VL
78 is being adjusted in each loop, testing successively more of the input
79 vector against a given scalar, each time.
80
81 To investigate:
82
83 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
84 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
85
86 ## iota
87
88 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
89
90 When masked, only the bits not masked out are included in the count process.
91
92 viota RT/v, RA, RB
93
94 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
95
96 Example
97
98 7 6 5 4 3 2 1 0 Element number
99
100 1 0 0 1 0 0 0 1 v2 contents
101 viota.m v4, v2 # Unmasked
102 2 2 2 1 1 1 1 0 v4 result
103
104 1 1 1 0 1 0 1 1 v0 contents
105 1 0 0 1 0 0 0 1 v2 contents
106 2 3 4 5 6 7 8 9 v4 contents
107 viota.m v4, v2, v0.t # Masked
108 1 1 1 5 1 7 1 0 v4 results
109
110 def iota(RT, RA, RB):
111 mask = RB ? iregs[RB] : 0b111111...1
112 val = RA ? iregs[RA] : 0b111111...1
113 for i in range(VL):
114 if RA.scalar:
115 testmask = (1<<i)-1 # only count below
116 to_test = val & testmask & mask
117 iregs[RT+i] = popcount(to_test)
118
119 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
120 where bit 2 is inv, bits 0:1 select the bit of the CR.
121
122 def test_CR_bit(CR, BO):
123 return CR[BO[0:1]] == BO[2]
124
125 def iotacr(RT, BA, BO):
126 mask = get_src_predicate()
127 count = 0
128 for i in range(VL):
129 if mask & (1<<i) == 0:
130 count = 0 # reset back to zero
131 continue
132 iregs[RT+i] = count
133 if test_CR_bit(CR[i+BA], BO):
134 count += 1
135
136 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
137
138 scalar variant which can be Vectorised to give iotacr:
139
140 def crtaddi(RT, RA, BA, BO, D):
141 if test_CR_bit(BA, BO):
142 RT = RA + EXTS(D)
143 else:
144 RT = RA
145
146 a Vector for-loop with zero-ing on dest will give the
147 mask-out effect of resetting the count back to zero.
148 However close examination shows that the above may actually
149 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
150
151 # Scalar
152
153 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
154
155 ## sbfm
156
157 sbfm RT, RA, RB!=0
158
159 Example
160
161 7 6 5 4 3 2 1 0 Bit index
162
163 1 0 0 1 0 1 0 0 v3 contents
164 vmsbf.m v2, v3
165 0 0 0 0 0 0 1 1 v2 contents
166
167 1 0 0 1 0 1 0 1 v3 contents
168 vmsbf.m v2, v3
169 0 0 0 0 0 0 0 0 v2
170
171 0 0 0 0 0 0 0 0 v3 contents
172 vmsbf.m v2, v3
173 1 1 1 1 1 1 1 1 v2
174
175 1 1 0 0 0 0 1 1 RB vcontents
176 1 0 0 1 0 1 0 0 v3 contents
177 vmsbf.m v2, v3, v0.t
178 0 1 x x x x 1 1 v2 contents
179
180 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
181
182 pseudocode:
183
184 def sbf(rd, rs1, rs2):
185 rd = 0
186 # start setting if no predicate or if 1st predicate bit set
187 setting_mode = rs2 == x0 or (regs[rs2] & 1)
188 while i < XLEN:
189 bit = 1<<i
190 if rs2 != x0 and (regs[rs2] & bit):
191 # reset searching
192 setting_mode = False
193 if setting_mode:
194 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
195 setting_mode = False
196 else:
197 regs[rd] |= bit
198 else if rs2 != x0: # searching mode
199 if (regs[rs2] & bit):
200 setting_mode = True # back into "setting" mode
201 i += 1
202
203 ## sifm
204
205 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
206
207 sifm RT, RA, RB!=0
208
209 # Example
210
211 7 6 5 4 3 2 1 0 Bit number
212
213 1 0 0 1 0 1 0 0 v3 contents
214 vmsif.m v2, v3
215 0 0 0 0 0 1 1 1 v2 contents
216
217 1 0 0 1 0 1 0 1 v3 contents
218 vmsif.m v2, v3
219 0 0 0 0 0 0 0 1 v2
220
221 1 1 0 0 0 0 1 1 RB vcontents
222 1 0 0 1 0 1 0 0 v3 contents
223 vmsif.m v2, v3, v0.t
224 1 1 x x x x 1 1 v2 contents
225
226 Pseudo-code:
227
228 def sif(rd, rs1, rs2):
229 rd = 0
230 setting_mode = rs2 == x0 or (regs[rs2] & 1)
231
232 while i < XLEN:
233 bit = 1<<i
234
235 # only reenable when predicate in use, and bit valid
236 if !setting_mode && rs2 != x0:
237 if (regs[rs2] & bit):
238 # back into "setting" mode
239 setting_mode = True
240
241 # skipping mode
242 if !setting_mode:
243 # skip any more 1s
244 if regs[rs1] & bit == 1:
245 i += 1
246 continue
247
248 # setting mode, search for 1
249 regs[rd] |= bit # always set during search
250 if regs[rs1] & bit: # found a bit in rs1:
251 setting_mode = False
252 # next loop starts skipping
253
254 i += 1
255
256
257 ## vmsof
258
259 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
260
261 sofm RT, RA, RB
262
263 Example
264
265 7 6 5 4 3 2 1 0 Bit number
266
267 1 0 0 1 0 1 0 0 v3 contents
268 vmsof.m v2, v3
269 0 0 0 0 0 1 0 0 v2 contents
270
271 1 0 0 1 0 1 0 1 v3 contents
272 vmsof.m v2, v3
273 0 0 0 0 0 0 0 1 v2
274
275 1 1 0 0 0 0 1 1 RB vcontents
276 1 1 0 1 0 1 0 0 v3 contents
277 vmsof.m v2, v3, v0.t
278 0 1 x x x x 0 0 v2 content
279
280 Pseudo-code:
281
282 def sof(rd, rs1, rs2):
283 rd = 0
284 setting_mode = rs2 == x0 or (regs[rs2] & 1)
285
286 while i < XLEN:
287 bit = 1<<i
288
289 # only reenable when predicate in use, and bit valid
290 if !setting_mode && rs2 != x0:
291 if (regs[rs2] & bit):
292 # back into "setting" mode
293 setting_mode = True
294
295 # skipping mode
296 if !setting_mode:
297 # skip any more 1s
298 if regs[rs1] & bit == 1:
299 i += 1
300 continue
301
302 # setting mode, search for 1
303 if regs[rs1] & bit: # found a bit in rs1:
304 regs[rd] |= bit # only set when search succeeds
305 setting_mode = False
306 # next loop starts skipping
307
308 i += 1
309
310 # Carry-lookahead
311
312 used not just for carry lookahead, also a special type of predication mask operation.
313
314 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
315 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
316 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
317 * <https://i.stack.imgur.com/QSLKY.png>
318 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
319 `((P|G)+G)^P`
320 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
321
322 From QLSKY.png:
323
324 ```
325 x0 = nand(CIn, P0)
326 C0 = nand(x0, ~G0)
327
328 x1 = nand(CIn, P0, P1)
329 y1 = nand(G0, P1)
330 C1 = nand(x1, y1, ~G1)
331
332 x2 = nand(CIn, P0, P1, P2)
333 y2 = nand(G0, P1, P2)
334 z2 = nand(G1, P2)
335 C1 = nand(x2, y2, z2, ~G2)
336
337 # Gen*
338 x3 = nand(G0, P1, P2, P3)
339 y3 = nand(G1, P2, P3)
340 z3 = nand(G2, P3)
341 G* = nand(x3, y3, z3, ~G3)
342 ```
343
344 ```
345 P = (A | B) & Ci
346 G = (A & B)
347 ```
348
349 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
350
351 ```
352 At each id, compute C[id] = A[id]+B[id]+0
353 Get G[id] = C[id] > radix -1
354 Get P[id] = C[id] == radix-1
355 Join all P[id] together, likewise G[id]
356 Compute newC = ((P|G)+G)^P
357 result[id] = (C[id] + newC[id]) % radix
358 ```
359
360 two versions: scalar int version and CR based version.
361
362 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
363
364 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
365
366 if zero (no propagation) then CR0.eq is zero
367
368 CR based version, TODO.