(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 Links:
6
7 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
8 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
9 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
11 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
12 out of scope for this document [[openpower/sv/3d_vector_ops]]
13 * [[simple_v_extension/specification/bitmanip]] previous version,
14 contains pseudocode for sof, sif, sbf
15
16 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
17
18 Notes:
19
20 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
21 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
22 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
23
24 # Vector
25
26 Both of these instructions may be synthesised from SVP64 Vector
27 instructions. conflictd is an O(N^2) instruction based on
28 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
29 with the appropriate predication
30
31 ## conflictd
32
33 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
34
35 input = [100, 100, 3, 100, 5, 100, 100, 3]
36 conflict result = [
37 0b00000000, // Note: first element always zero
38 0b00000001, // 100 is present on #0
39 0b00000000,
40 0b00000011, // 100 is present on #0 and #1
41 0b00000000,
42 0b00001011, // 100 is present on #0, #1, #3
43 0b00011011, // .. and #4
44 0b00000100 // 3 is present on #2
45 ]
46
47 Pseudocode:
48
49 for i in range(VL):
50 for j in range(1, i):
51 if src1[i] == src2[j]:
52 result[j] |= 1<<i
53
54 Idea 1: implement this as a Triangular Schedule, Vertical-First Mode,
55 using `mfcrweird` and `cmpi`. first triangular schedule on src1,
56 secpnd on src2.
57
58 Idea 2: implement using outer loop on varying setvl Horizontal-First
59 with `1<<r3` predicate mask for src2 as scalar, creates CR field vector, transfer into INT with mfcrweird then OR into the
60 result.
61
62 li r3, 1
63 li result, 0
64 for i in range(target):
65 setvl target
66 sv.addi/sm=1<<r3 t0, src1.v, 0 # copy src1[i]
67 sv.cmpi src2.v, t0 # compare src2 vector to scalar
68 sv.mfcrweird t1, cr0.v, eq # copy CR eq result bits to t1
69 srr t1, t1, i # shift up by i before ORing
70 or result, result, t1
71 srr r3, r3, 1 # shift r3 predicate up by one
72
73 See [[sv/cr_int_predication]] for full details on the crweird instructions:
74 the primary important aspect here is that a Vector of CR Field's EQ bits is
75 transferred into a single GPR. The secondary important aspect is that VL
76 is being adjusted in each loop, testing successively more of the input
77 vector against a given scalar, each time.
78
79 To investigate:
80
81 * <https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups>
82 * <https://stackoverflow.com/questions/39913707/how-do-the-conflict-detection-instructions-make-it-easier-to-vectorize-loops>
83
84 ## iota
85
86 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
87
88 When masked, only the bits not masked out are included in the count process.
89
90 viota RT/v, RA, RB
91
92 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
93
94 Example
95
96 7 6 5 4 3 2 1 0 Element number
97
98 1 0 0 1 0 0 0 1 v2 contents
99 viota.m v4, v2 # Unmasked
100 2 2 2 1 1 1 1 0 v4 result
101
102 1 1 1 0 1 0 1 1 v0 contents
103 1 0 0 1 0 0 0 1 v2 contents
104 2 3 4 5 6 7 8 9 v4 contents
105 viota.m v4, v2, v0.t # Masked
106 1 1 1 5 1 7 1 0 v4 results
107
108 def iota(RT, RA, RB):
109 mask = RB ? iregs[RB] : 0b111111...1
110 val = RA ? iregs[RA] : 0b111111...1
111 for i in range(VL):
112 if RA.scalar:
113 testmask = (1<<i)-1 # only count below
114 to_test = val & testmask & mask
115 iregs[RT+i] = popcount(to_test)
116
117 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
118 where bit 2 is inv, bits 0:1 select the bit of the CR.
119
120 def test_CR_bit(CR, BO):
121 return CR[BO[0:1]] == BO[2]
122
123 def iotacr(RT, BA, BO):
124 mask = get_src_predicate()
125 count = 0
126 for i in range(VL):
127 if mask & (1<<i) == 0:
128 count = 0 # reset back to zero
129 continue
130 iregs[RT+i] = count
131 if test_CR_bit(CR[i+BA], BO):
132 count += 1
133
134 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
135
136 scalar variant which can be Vectorised to give iotacr:
137
138 def crtaddi(RT, RA, BA, BO, D):
139 if test_CR_bit(BA, BO):
140 RT = RA + EXTS(D)
141 else:
142 RT = RA
143
144 a Vector for-loop with zero-ing on dest will give the
145 mask-out effect of resetting the count back to zero.
146 However close examination shows that the above may actually
147 be `sv.addi/mr/sm=EQ/dz r0.v, r0.v, 1`
148
149 # Scalar
150
151 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
152
153 ## sbfm
154
155 sbfm RT, RA, RB!=0
156
157 Example
158
159 7 6 5 4 3 2 1 0 Bit index
160
161 1 0 0 1 0 1 0 0 v3 contents
162 vmsbf.m v2, v3
163 0 0 0 0 0 0 1 1 v2 contents
164
165 1 0 0 1 0 1 0 1 v3 contents
166 vmsbf.m v2, v3
167 0 0 0 0 0 0 0 0 v2
168
169 0 0 0 0 0 0 0 0 v3 contents
170 vmsbf.m v2, v3
171 1 1 1 1 1 1 1 1 v2
172
173 1 1 0 0 0 0 1 1 RB vcontents
174 1 0 0 1 0 1 0 0 v3 contents
175 vmsbf.m v2, v3, v0.t
176 0 1 x x x x 1 1 v2 contents
177
178 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
179
180 Executable demo:
181
182 ```
183 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
184 ```
185
186 ## sifm
187
188 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
189
190 sifm RT, RA, RB!=0
191
192 # Example
193
194 7 6 5 4 3 2 1 0 Bit number
195
196 1 0 0 1 0 1 0 0 v3 contents
197 vmsif.m v2, v3
198 0 0 0 0 0 1 1 1 v2 contents
199
200 1 0 0 1 0 1 0 1 v3 contents
201 vmsif.m v2, v3
202 0 0 0 0 0 0 0 1 v2
203
204 1 1 0 0 0 0 1 1 RB vcontents
205 1 0 0 1 0 1 0 0 v3 contents
206 vmsif.m v2, v3, v0.t
207 1 1 x x x x 1 1 v2 contents
208
209 Executable demo:
210
211 ```
212 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
213 ```
214
215 ## vmsof
216
217 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
218
219 sofm RT, RA, RB
220
221 Example
222
223 7 6 5 4 3 2 1 0 Bit number
224
225 1 0 0 1 0 1 0 0 v3 contents
226 vmsof.m v2, v3
227 0 0 0 0 0 1 0 0 v2 contents
228
229 1 0 0 1 0 1 0 1 v3 contents
230 vmsof.m v2, v3
231 0 0 0 0 0 0 0 1 v2
232
233 1 1 0 0 0 0 1 1 RB vcontents
234 1 1 0 1 0 1 0 0 v3 contents
235 vmsof.m v2, v3, v0.t
236 0 1 x x x x 0 0 v2 content
237
238 Executable demo:
239
240 ```
241 [[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
242 ```
243
244 # Carry-lookahead
245
246 used not just for carry lookahead, also a special type of predication mask operation.
247
248 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
249 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
250 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
251 * <https://i.stack.imgur.com/QSLKY.png>
252 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
253 `((P|G)+G)^P`
254 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
255
256 From QLSKY.png:
257
258 ```
259 x0 = nand(CIn, P0)
260 C0 = nand(x0, ~G0)
261
262 x1 = nand(CIn, P0, P1)
263 y1 = nand(G0, P1)
264 C1 = nand(x1, y1, ~G1)
265
266 x2 = nand(CIn, P0, P1, P2)
267 y2 = nand(G0, P1, P2)
268 z2 = nand(G1, P2)
269 C1 = nand(x2, y2, z2, ~G2)
270
271 # Gen*
272 x3 = nand(G0, P1, P2, P3)
273 y3 = nand(G1, P2, P3)
274 z3 = nand(G2, P3)
275 G* = nand(x3, y3, z3, ~G3)
276 ```
277
278 ```
279 P = (A | B) & Ci
280 G = (A & B)
281 ```
282
283 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
284
285 ```
286 At each id, compute C[id] = A[id]+B[id]+0
287 Get G[id] = C[id] > radix -1
288 Get P[id] = C[id] == radix-1
289 Join all P[id] together, likewise G[id]
290 Compute newC = ((P|G)+G)^P
291 result[id] = (C[id] + newC[id]) % radix
292 ```
293
294 two versions: scalar int version and CR based version.
295
296 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
297
298 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
299
300 if zero (no propagation) then CR0.eq is zero
301
302 CR based version, TODO.