(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 Links:
6
7 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
8 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
9 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
12 out of scope for this document [[openpower/sv/3d_vector_ops]]
13 * [[simple_v_extension/specification/bitmanip]] previous version,
14 contains pseudocode for sof, sif, sbf
15
16
17 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
18
19 Notes:
20
21 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
22 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
23 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
24
25 # Vector
26
27 Both of these instructions may be synthesised from SVP64 Vector
28 instructions. conflictd is an O(N^2) instruction based on
29 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
30 with the appropriate predication
31
32 ## conflictd
33
34 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
35
36 input = [100, 100, 3, 100, 5, 100, 100, 3]
37 conflict result = [
38 0b00000000, // Note: first element always zero
39 0b00000001, // 100 is present on #0
40 0b00000000,
41 0b00000011, // 100 is present on #0 and #1
42 0b00000000,
43 0b00001011, // 100 is present on #0, #1, #3
44 0b00011011, // .. and #4
45 0b00000100 // 3 is present on #2
46 ]
47
48 Pseudocode:
49
50 for i in range(VL):
51 for j in range(1, i):
52 if src1[i] == src2[j]:
53 result[j] |= 1<<i
54
55 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
56 using `mfcrweird` and `cmpi`. first triangular schedule on src1,
57 secpnd on src2.
58
59 Idea 2: implement using outer loop on varying setvl Horizontal-First
60 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
61 result.
62
63 li r3, 1
64 li result, 0
65 for i in range(target):
66 setvl target
67 sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
68 sv.cmpi src2.v, t0 # compare src2 vector to scalar
69 sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
70 srr t1, t1, i # shift up by i before ORing
71 or result, result, t1
72 srr r3, r3, 1 # shift r3 predicate up by one
73
74 See [[sv/cr_int_predication]] for full details on the crweird instructions:
75 the primary important aspect here is that a Vector of CR Field's EQ bits is
76 transferred into a single GPR. The secondary important aspect is that VL
77 is being adjusted in each loop, testing successively more of the input
78 vector against a given scalar, each time.
79
80 To investigate:
81
82 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
83 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
84
85 ## iota
86
87 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
88
89 When masked, only the bits not masked out are included in the count process.
90
91 viota RT/v, RA, RB
92
93 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
94
95 Example
96
97 7 6 5 4 3 2 1 0 Element number
98
99 1 0 0 1 0 0 0 1 v2 contents
100 viota.m v4, v2 # Unmasked
101 2 2 2 1 1 1 1 0 v4 result
102
103 1 1 1 0 1 0 1 1 v0 contents
104 1 0 0 1 0 0 0 1 v2 contents
105 2 3 4 5 6 7 8 9 v4 contents
106 viota.m v4, v2, v0.t # Masked
107 1 1 1 5 1 7 1 0 v4 results
108
109 def iota(RT, RA, RB):
110 mask = RB ? iregs[RB] : 0b111111...1
111 val = RA ? iregs[RA] : 0b111111...1
112 for i in range(VL):
113 if RA.scalar:
114 testmask = (1<<i)-1 # only count below
115 to_test = val & testmask & mask
116 iregs[RT+i] = popcount(to_test)
117
118 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
119 where bit 2 is inv, bits 0:1 select the bit of the CR.
120
121 def test_CR_bit(CR, BO):
122 return CR[BO[0:1]] == BO[2]
123
124 def iotacr(RT, BA, BO):
125 mask = get_src_predicate()
126 count = 0
127 for i in range(VL):
128 if mask & (1<<i) == 0:
129 count = 0 # reset back to zero
130 continue
131 iregs[RT+i] = count
132 if test_CR_bit(CR[i+BA], BO):
133 count += 1
134
135 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
136
137 scalar variant which can be Vectorised to give iotacr:
138
139 def crtaddi(RT, RA, BA, BO, D):
140 if test_CR_bit(BA, BO):
141 RT = RA + EXTS(D)
142 else:
143 RT = RA
144
145 a Vector for-loop with zero-ing on dest will give the
146 mask-out effect of resetting the count back to zero.
147 However close examination shows that the above may actually
148 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
149
150 # Scalar
151
152 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
153
154 ## sbfm
155
156 sbfm RT, RA, RB!=0
157
158 Example
159
160 7 6 5 4 3 2 1 0 Bit index
161
162 1 0 0 1 0 1 0 0 v3 contents
163 vmsbf.m v2, v3
164 0 0 0 0 0 0 1 1 v2 contents
165
166 1 0 0 1 0 1 0 1 v3 contents
167 vmsbf.m v2, v3
168 0 0 0 0 0 0 0 0 v2
169
170 0 0 0 0 0 0 0 0 v3 contents
171 vmsbf.m v2, v3
172 1 1 1 1 1 1 1 1 v2
173
174 1 1 0 0 0 0 1 1 RB vcontents
175 1 0 0 1 0 1 0 0 v3 contents
176 vmsbf.m v2, v3, v0.t
177 0 1 x x x x 1 1 v2 contents
178
179 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
180
181 pseudocode:
182
183 def sbf(rd, rs1, rs2):
184 rd = 0
185 # start setting if no predicate or if 1st predicate bit set
186 setting_mode = rs2 == x0 or (regs[rs2] & 1)
187 while i < XLEN:
188 bit = 1<<i
189 if rs2 != x0 and (regs[rs2] & bit):
190 # reset searching
191 setting_mode = False
192 if setting_mode:
193 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
194 setting_mode = False
195 else:
196 regs[rd] |= bit
197 else if rs2 != x0: # searching mode
198 if (regs[rs2] & bit):
199 setting_mode = True # back into "setting" mode
200 i += 1
201
202 ## sifm
203
204 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
205
206 sifm RT, RA, RB!=0
207
208 # Example
209
210 7 6 5 4 3 2 1 0 Bit number
211
212 1 0 0 1 0 1 0 0 v3 contents
213 vmsif.m v2, v3
214 0 0 0 0 0 1 1 1 v2 contents
215
216 1 0 0 1 0 1 0 1 v3 contents
217 vmsif.m v2, v3
218 0 0 0 0 0 0 0 1 v2
219
220 1 1 0 0 0 0 1 1 RB vcontents
221 1 0 0 1 0 1 0 0 v3 contents
222 vmsif.m v2, v3, v0.t
223 1 1 x x x x 1 1 v2 contents
224
225 Pseudo-code:
226
227 def sif(rd, rs1, rs2):
228 rd = 0
229 setting_mode = rs2 == x0 or (regs[rs2] & 1)
230
231 while i < XLEN:
232 bit = 1<<i
233
234 # only reenable when predicate in use, and bit valid
235 if !setting_mode && rs2 != x0:
236 if (regs[rs2] & bit):
237 # back into "setting" mode
238 setting_mode = True
239
240 # skipping mode
241 if !setting_mode:
242 # skip any more 1s
243 if regs[rs1] & bit == 1:
244 i += 1
245 continue
246
247 # setting mode, search for 1
248 regs[rd] |= bit # always set during search
249 if regs[rs1] & bit: # found a bit in rs1:
250 setting_mode = False
251 # next loop starts skipping
252
253 i += 1
254
255
256 ## vmsof
257
258 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
259
260 sofm RT, RA, RB
261
262 Example
263
264 7 6 5 4 3 2 1 0 Bit number
265
266 1 0 0 1 0 1 0 0 v3 contents
267 vmsof.m v2, v3
268 0 0 0 0 0 1 0 0 v2 contents
269
270 1 0 0 1 0 1 0 1 v3 contents
271 vmsof.m v2, v3
272 0 0 0 0 0 0 0 1 v2
273
274 1 1 0 0 0 0 1 1 RB vcontents
275 1 1 0 1 0 1 0 0 v3 contents
276 vmsof.m v2, v3, v0.t
277 0 1 x x x x 0 0 v2 content
278
279 Pseudo-code:
280
281 def sof(rd, rs1, rs2):
282 rd = 0
283 setting_mode = rs2 == x0 or (regs[rs2] & 1)
284
285 while i < XLEN:
286 bit = 1<<i
287
288 # only reenable when predicate in use, and bit valid
289 if !setting_mode && rs2 != x0:
290 if (regs[rs2] & bit):
291 # back into "setting" mode
292 setting_mode = True
293
294 # skipping mode
295 if !setting_mode:
296 # skip any more 1s
297 if regs[rs1] & bit == 1:
298 i += 1
299 continue
300
301 # setting mode, search for 1
302 if regs[rs1] & bit: # found a bit in rs1:
303 regs[rd] |= bit # only set when search succeeds
304 setting_mode = False
305 # next loop starts skipping
306
307 i += 1
308
309 # Carry-lookahead
310
311 used not just for carry lookahead, also a special type of predication mask operation.
312
313 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
314 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
315 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
316 * <https://i.stack.imgur.com/QSLKY.png>
317 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
318 `((P|G)+G)^P`
319 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
320
321 From QLSKY.png:
322
323 ```
324 x0 = nand(CIn, P0)
325 C0 = nand(x0, ~G0)
326
327 x1 = nand(CIn, P0, P1)
328 y1 = nand(G0, P1)
329 C1 = nand(x1, y1, ~G1)
330
331 x2 = nand(CIn, P0, P1, P2)
332 y2 = nand(G0, P1, P2)
333 z2 = nand(G1, P2)
334 C1 = nand(x2, y2, z2, ~G2)
335
336 # Gen*
337 x3 = nand(G0, P1, P2, P3)
338 y3 = nand(G1, P2, P3)
339 z3 = nand(G2, P3)
340 G* = nand(x3, y3, z3, ~G3)
341 ```
342
343 ```
344 P = (A | B) & Ci
345 G = (A & B)
346 ```
347
348 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
349
350 ```
351 At each id, compute C[id] = A[id]+B[id]+0
352 Get G[id] = C[id] > radix -1
353 Get P[id] = C[id] == radix-1
354 Join all P[id] together, likewise G[id]
355 Compute newC = ((P|G)+G)^P
356 result[id] = (C[id] + newC[id]) % radix
357 ```
358
359 two versions: scalar int version and CR based version.
360
361 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
362
363 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
364
365 if zero (no propagation) then CR0.eq is zero
366
367 CR based version, TODO.