(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 * TODO merge old standards page [[simple_v_extension/vector_ops/]]
6 * bugreport <https://bugs.libre-soc.org/show_bug.cgi?id=142>
7
8 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
9
10 Notes:
11
12 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
13 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
14 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
15
16 Links:
17
18 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
19 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
20 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
21 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
22 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
23 out of scope for this document
24 * [[simple_v_extension/specification/bitmanip]] previous version,
25 contains pseudocode for sof, sif, sbf
26
27 # Vector
28
29 Both of these instructions may be synthesised from SVP64 Vector
30 instructions. conflictd is an O(N^2) instruction based on
31 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
32 with the appropriate predication
33
34 ## conflictd
35
36 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
37
38 input = [100, 100, 3, 100, 5, 100, 100, 3]
39 conflict result = [
40 0b00000000, // Note: first element always zero
41 0b00000001, // 100 is present on #0
42 0b00000000,
43 0b00000011, // 100 is present on #0 and #1
44 0b00000000,
45 0b00001011, // 100 is present on #0, #1, #3
46 0b00011011, // .. and #4
47 0b00000100 // 3 is present on #2
48 ]
49
50 Pseudocode:
51
52 for i in range(VL):
53 for j in range(1, i):
54 if src1[i] == src2[j]:
55 result[j] |= 1<<i
56
57 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
58 using `mfcrweird` and `cmpi`. first triangular schedule on src1,
59 secpnd on src2.
60
61 Idea 2: implement using outer loop on varying setvl Horizontal-First
62 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
63 result.
64
65 li r3, 1
66 li result, 0
67 for i in range(target):
68 setvl target
69 sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
70 sv.cmpi src2.v, t0 # compare src2 vector to scalar
71 sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
72 srr t1, t1, i # shift up by i before ORing
73 or result, result, t1
74 srr r3, r3, 1 # shift r3 predicate up by one
75
76 See [[sv/cr_int_predication]] for full details on the crweird instructions:
77 the primary important aspect here is that a Vector of CR Field's EQ bits is
78 transferred into a single GPR. The secondary important aspect is that VL
79 is being adjusted in each loop, testing successively more of the input
80 vector against a given scalar, each time.
81
82 To investigate:
83
84 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
85 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
86
87 ## iota
88
89 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
90
91 When masked, only the bits not masked out are included in the count process.
92
93 viota RT/v, RA, RB
94
95 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
96
97 Example
98
99 7 6 5 4 3 2 1 0 Element number
100
101 1 0 0 1 0 0 0 1 v2 contents
102 viota.m v4, v2 # Unmasked
103 2 2 2 1 1 1 1 0 v4 result
104
105 1 1 1 0 1 0 1 1 v0 contents
106 1 0 0 1 0 0 0 1 v2 contents
107 2 3 4 5 6 7 8 9 v4 contents
108 viota.m v4, v2, v0.t # Masked
109 1 1 1 5 1 7 1 0 v4 results
110
111 def iota(RT, RA, RB):
112 mask = RB ? iregs[RB] : 0b111111...1
113 val = RA ? iregs[RA] : 0b111111...1
114 for i in range(VL):
115 if RA.scalar:
116 testmask = (1<<i)-1 # only count below
117 to_test = val & testmask & mask
118 iregs[RT+i] = popcount(to_test)
119
120 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
121 where bit 2 is inv, bits 0:1 select the bit of the CR.
122
123 def test_CR_bit(CR, BO):
124 return CR[BO[0:1]] == BO[2]
125
126 def iotacr(RT, BA, BO):
127 mask = get_src_predicate()
128 count = 0
129 for i in range(VL):
130 if mask & (1<<i) == 0:
131 count = 0 # reset back to zero
132 continue
133 iregs[RT+i] = count
134 if test_CR_bit(CR[i+BA], BO):
135 count += 1
136
137 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
138
139 scalar variant which can be Vectorised to give iotacr:
140
141 def crtaddi(RT, RA, BA, BO, D):
142 if test_CR_bit(BA, BO):
143 RT = RA + EXTS(D)
144 else:
145 RT = RA
146
147 a Vector for-loop with zero-ing on dest will give the
148 mask-out effect of resetting the count back to zero.
149 However close examination shows that the above may actually
150 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
151
152 # Scalar
153
154 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
155
156 ## sbfm
157
158 sbfm RT, RA, RB!=0
159
160 Example
161
162 7 6 5 4 3 2 1 0 Bit index
163
164 1 0 0 1 0 1 0 0 v3 contents
165 vmsbf.m v2, v3
166 0 0 0 0 0 0 1 1 v2 contents
167
168 1 0 0 1 0 1 0 1 v3 contents
169 vmsbf.m v2, v3
170 0 0 0 0 0 0 0 0 v2
171
172 0 0 0 0 0 0 0 0 v3 contents
173 vmsbf.m v2, v3
174 1 1 1 1 1 1 1 1 v2
175
176 1 1 0 0 0 0 1 1 RB vcontents
177 1 0 0 1 0 1 0 0 v3 contents
178 vmsbf.m v2, v3, v0.t
179 0 1 x x x x 1 1 v2 contents
180
181 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
182
183 pseudocode:
184
185 def sbf(rd, rs1, rs2):
186 rd = 0
187 # start setting if no predicate or if 1st predicate bit set
188 setting_mode = rs2 == x0 or (regs[rs2] & 1)
189 while i < XLEN:
190 bit = 1<<i
191 if rs2 != x0 and (regs[rs2] & bit):
192 # reset searching
193 setting_mode = False
194 if setting_mode:
195 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
196 setting_mode = False
197 else:
198 regs[rd] |= bit
199 else if rs2 != x0: # searching mode
200 if (regs[rs2] & bit):
201 setting_mode = True # back into "setting" mode
202 i += 1
203
204 ## sifm
205
206 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
207
208 sifm RT, RA, RB!=0
209
210 # Example
211
212 7 6 5 4 3 2 1 0 Bit number
213
214 1 0 0 1 0 1 0 0 v3 contents
215 vmsif.m v2, v3
216 0 0 0 0 0 1 1 1 v2 contents
217
218 1 0 0 1 0 1 0 1 v3 contents
219 vmsif.m v2, v3
220 0 0 0 0 0 0 0 1 v2
221
222 1 1 0 0 0 0 1 1 RB vcontents
223 1 0 0 1 0 1 0 0 v3 contents
224 vmsif.m v2, v3, v0.t
225 1 1 x x x x 1 1 v2 contents
226
227 Pseudo-code:
228
229 def sif(rd, rs1, rs2):
230 rd = 0
231 setting_mode = rs2 == x0 or (regs[rs2] & 1)
232
233 while i < XLEN:
234 bit = 1<<i
235
236 # only reenable when predicate in use, and bit valid
237 if !setting_mode && rs2 != x0:
238 if (regs[rs2] & bit):
239 # back into "setting" mode
240 setting_mode = True
241
242 # skipping mode
243 if !setting_mode:
244 # skip any more 1s
245 if regs[rs1] & bit == 1:
246 i += 1
247 continue
248
249 # setting mode, search for 1
250 regs[rd] |= bit # always set during search
251 if regs[rs1] & bit: # found a bit in rs1:
252 setting_mode = False
253 # next loop starts skipping
254
255 i += 1
256
257
258 ## vmsof
259
260 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
261
262 sofm RT, RA, RB
263
264 Example
265
266 7 6 5 4 3 2 1 0 Bit number
267
268 1 0 0 1 0 1 0 0 v3 contents
269 vmsof.m v2, v3
270 0 0 0 0 0 1 0 0 v2 contents
271
272 1 0 0 1 0 1 0 1 v3 contents
273 vmsof.m v2, v3
274 0 0 0 0 0 0 0 1 v2
275
276 1 1 0 0 0 0 1 1 RB vcontents
277 1 1 0 1 0 1 0 0 v3 contents
278 vmsof.m v2, v3, v0.t
279 0 1 x x x x 0 0 v2 content
280
281 Pseudo-code:
282
283 def sof(rd, rs1, rs2):
284 rd = 0
285 setting_mode = rs2 == x0 or (regs[rs2] & 1)
286
287 while i < XLEN:
288 bit = 1<<i
289
290 # only reenable when predicate in use, and bit valid
291 if !setting_mode && rs2 != x0:
292 if (regs[rs2] & bit):
293 # back into "setting" mode
294 setting_mode = True
295
296 # skipping mode
297 if !setting_mode:
298 # skip any more 1s
299 if regs[rs1] & bit == 1:
300 i += 1
301 continue
302
303 # setting mode, search for 1
304 if regs[rs1] & bit: # found a bit in rs1:
305 regs[rd] |= bit # only set when search succeeds
306 setting_mode = False
307 # next loop starts skipping
308
309 i += 1
310
311 # Carry-lookahead
312
313 used not just for carry lookahead, also a special type of predication mask operation.
314
315 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
316 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
317 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
318 * <https://i.stack.imgur.com/QSLKY.png>
319 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
320 `((P|G)+G)^P`
321 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
322
323 From QLSKY.png:
324
325 ```
326 x0 = nand(CIn, P0)
327 C0 = nand(x0, ~G0)
328
329 x1 = nand(CIn, P0, P1)
330 y1 = nand(G0, P1)
331 C1 = nand(x1, y1, ~G1)
332
333 x2 = nand(CIn, P0, P1, P2)
334 y2 = nand(G0, P1, P2)
335 z2 = nand(G1, P2)
336 C1 = nand(x2, y2, z2, ~G2)
337
338 # Gen*
339 x3 = nand(G0, P1, P2, P3)
340 y3 = nand(G1, P2, P3)
341 z3 = nand(G2, P3)
342 G* = nand(x3, y3, z3, ~G3)
343 ```
344
345 ```
346 P = (A | B) & Ci
347 G = (A & B)
348 ```
349
350 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
351
352 ```
353 At each id, compute C[id] = A[id]+B[id]+0
354 Get G[id] = C[id] > radix -1
355 Get P[id] = C[id] == radix-1
356 Join all P[id] together, likewise G[id]
357 Compute newC = ((P|G)+G)^P
358 result[id] = (C[id] + newC[id]) % radix
359 ```
360
361 two versions: scalar int version and CR based version.
362
363 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
364
365 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
366
367 if zero (no propagation) then CR0.eq is zero
368
369 CR based version, TODO.