(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 TODO merge old standards page [[simple_v_extension/vector_ops/]]
6
7 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
8
9 Notes:
10
11 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
12 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
13 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
14
15 Links:
16
17 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
18 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
19 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
20 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
21 out of scope for this document
22 * [[simple_v_extension/specification/bitmanip]] previous version,
23 contains pseudocode for sof, sif, sbf
24
25 # Vector
26
27 ## conflictd
28
29 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
30
31 input = [100, 100, 3, 100, 5, 100, 100, 3]
32 conflict result = [
33 0b00000000, // Note: first element always zero
34 0b00000001, // 100 is present on #0
35 0b00000000,
36 0b00000011, // 100 is present on #0 and #1
37 0b00000000,
38 0b00001011, // 100 is present on #0, #1, #3
39 0b00011011, // .. and #4
40 0b00000100 // 3 is present on #2
41 ]
42
43 Pseudocode:
44
45 for i in range(VL):
46 for j in range(1, i):
47 if src1[i] == src2[j]:
48 result[j] |= 1<<i
49
50 ## iota
51
52 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
53
54 When masked, only the bits not masked out are included in the count process.
55
56 viota RT/v, RA, RB
57
58 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
59
60 Example
61
62 7 6 5 4 3 2 1 0 Element number
63
64 1 0 0 1 0 0 0 1 v2 contents
65 viota.m v4, v2 # Unmasked
66 2 2 2 1 1 1 1 0 v4 result
67
68 1 1 1 0 1 0 1 1 v0 contents
69 1 0 0 1 0 0 0 1 v2 contents
70 2 3 4 5 6 7 8 9 v4 contents
71 viota.m v4, v2, v0.t # Masked
72 1 1 1 5 1 7 1 0 v4 results
73
74 def iota(RT, RA, RB):
75 mask = RB ? iregs[RB] : 0b111111...1
76 val = RA ? iregs[RA] : 0b111111...1
77 for i in range(VL):
78 if RA.scalar:
79 testmask = (1<<i)-1 # only count below
80 to_test = val & testmask & mask
81 iregs[RT+i] = popcount(to_test)
82
83 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
84 where bit 2 is inv, bits 0:1 select the bit of the CR.
85
86 def test_CR_bit(CR, BO):
87 return CR[BO[0:1]] == BO[2]
88
89 def iotacr(RT, BA, BO):
90 mask = get_src_predicate()
91 count = 0
92 for i in range(VL):
93 if mask & (1<<i) == 0: continue
94 iregs[RT+i] = count
95 if test_CR_bit(CR[i+BA], BO):
96 count += 1
97
98 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
99
100 # Scalar
101
102 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
103
104 ## sbfm
105
106 sbfm RT, RA, RB!=0
107
108 Example
109
110 7 6 5 4 3 2 1 0 Bit index
111
112 1 0 0 1 0 1 0 0 v3 contents
113 vmsbf.m v2, v3
114 0 0 0 0 0 0 1 1 v2 contents
115
116 1 0 0 1 0 1 0 1 v3 contents
117 vmsbf.m v2, v3
118 0 0 0 0 0 0 0 0 v2
119
120 0 0 0 0 0 0 0 0 v3 contents
121 vmsbf.m v2, v3
122 1 1 1 1 1 1 1 1 v2
123
124 1 1 0 0 0 0 1 1 RB vcontents
125 1 0 0 1 0 1 0 0 v3 contents
126 vmsbf.m v2, v3, v0.t
127 0 1 x x x x 1 1 v2 contents
128
129 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
130
131 pseudocode:
132
133 def sbf(rd, rs1, rs2):
134 rd = 0
135 # start setting if no predicate or if 1st predicate bit set
136 setting_mode = rs2 == x0 or (regs[rs2] & 1)
137 while i < XLEN:
138 bit = 1<<i
139 if rs2 != x0 and (regs[rs2] & bit):
140 # reset searching
141 setting_mode = False
142 if setting_mode:
143 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
144 setting_mode = False
145 else:
146 regs[rd] |= bit
147 else if rs2 != x0: # searching mode
148 if (regs[rs2] & bit):
149 setting_mode = True # back into "setting" mode
150 i += 1
151
152 ## sifm
153
154 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
155
156 sifm RT, RA, RB!=0
157
158 # Example
159
160 7 6 5 4 3 2 1 0 Bit number
161
162 1 0 0 1 0 1 0 0 v3 contents
163 vmsif.m v2, v3
164 0 0 0 0 0 1 1 1 v2 contents
165
166 1 0 0 1 0 1 0 1 v3 contents
167 vmsif.m v2, v3
168 0 0 0 0 0 0 0 1 v2
169
170 1 1 0 0 0 0 1 1 RB vcontents
171 1 0 0 1 0 1 0 0 v3 contents
172 vmsif.m v2, v3, v0.t
173 1 1 x x x x 1 1 v2 contents
174
175 Pseudo-code:
176
177 def sif(rd, rs1, rs2):
178 rd = 0
179 setting_mode = rs2 == x0 or (regs[rs2] & 1)
180
181 while i < XLEN:
182 bit = 1<<i
183
184 # only reenable when predicate in use, and bit valid
185 if !setting_mode && rs2 != x0:
186 if (regs[rs2] & bit):
187 # back into "setting" mode
188 setting_mode = True
189
190 # skipping mode
191 if !setting_mode:
192 # skip any more 1s
193 if regs[rs1] & bit == 1:
194 i += 1
195 continue
196
197 # setting mode, search for 1
198 regs[rd] |= bit # always set during search
199 if regs[rs1] & bit: # found a bit in rs1:
200 setting_mode = False
201 # next loop starts skipping
202
203 i += 1
204
205
206 ## vmsof
207
208 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
209
210 sofm RT, RA, RB
211
212 Example
213
214 7 6 5 4 3 2 1 0 Bit number
215
216 1 0 0 1 0 1 0 0 v3 contents
217 vmsof.m v2, v3
218 0 0 0 0 0 1 0 0 v2 contents
219
220 1 0 0 1 0 1 0 1 v3 contents
221 vmsof.m v2, v3
222 0 0 0 0 0 0 0 1 v2
223
224 1 1 0 0 0 0 1 1 RB vcontents
225 1 1 0 1 0 1 0 0 v3 contents
226 vmsof.m v2, v3, v0.t
227 0 1 x x x x 0 0 v2 content
228
229 Pseudo-code:
230
231 def sof(rd, rs1, rs2):
232 rd = 0
233 setting_mode = rs2 == x0 or (regs[rs2] & 1)
234
235 while i < XLEN:
236 bit = 1<<i
237
238 # only reenable when predicate in use, and bit valid
239 if !setting_mode && rs2 != x0:
240 if (regs[rs2] & bit):
241 # back into "setting" mode
242 setting_mode = True
243
244 # skipping mode
245 if !setting_mode:
246 # skip any more 1s
247 if regs[rs1] & bit == 1:
248 i += 1
249 continue
250
251 # setting mode, search for 1
252 if regs[rs1] & bit: # found a bit in rs1:
253 regs[rd] |= bit # only set when search succeeds
254 setting_mode = False
255 # next loop starts skipping
256
257 i += 1
258
259 # Carry-lookahead
260
261 used not just for carry lookahead, also a special type of predication mask operation.
262
263 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
264 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
265 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
266 * <https://i.stack.imgur.com/QSLKY.png>
267 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
268 `((P|G)+G)^P`
269 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
270
271 From QLSKY.png:
272
273 ```
274 x0 = nand(CIn, P0)
275 C0 = nand(x0, ~G0)
276
277 x1 = nand(CIn, P0, P1)
278 y1 = nand(G0, P1)
279 C1 = nand(x1, y1, ~G1)
280
281 x2 = nand(CIn, P0, P1, P2)
282 y2 = nand(G0, P1, P2)
283 z2 = nand(G1, P2)
284 C1 = nand(x2, y2, z2, ~G2)
285
286 # Gen*
287 x3 = nand(G0, P1, P2, P3)
288 y3 = nand(G1, P2, P3)
289 z3 = nand(G2, P3)
290 G* = nand(x3, y3, z3, ~G3)
291 ```
292
293 ```
294 P = (A | B) & Ci
295 G = (A & B)
296 ```
297
298 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
299
300 ```
301 At each id, compute C[id] = A[id]+B[id]+0
302 Get G[id] = C[id] > radix -1
303 Get P[id] = C[id] == radix-1
304 Join all P[id] together, likewise G[id]
305 Compute newC = ((P|G)+G)^P
306 result[id] = (C[id] + newC[id]) % radix
307 ```
308
309 two versions: scalar int version and CR based version.
310
311 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
312
313 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
314
315 if zero (no propagation) then CR0.eq is zero
316
317 CR based version, TODO.