(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 [[!tag standards]]
2
3 # SV Vector Operations.
4
5 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
6
7 Notes:
8
9 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
10 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
11 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
12
13 Links:
14
15 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
16 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
17 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
18 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
19 out of scope for this document
20 * [[simple_v_extension/specification/bitmanip]] previous version,
21 contains pseudocode for sof, sif, sbf
22
23 # Vector
24
25 ## conflictd
26
27 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
28
29 input = [100, 100, 3, 100, 5, 100, 100, 3]
30 conflict result = [
31 0b00000000, // Note: first element always zero
32 0b00000001, // 100 is present on #0
33 0b00000000,
34 0b00000011, // 100 is present on #0 and #1
35 0b00000000,
36 0b00001011, // 100 is present on #0, #1, #3
37 0b00011011, // .. and #4
38 0b00000100 // 3 is present on #2
39 ]
40
41 Pseudocode:
42
43 for i in range(VL):
44 for j in range(1, i):
45 if src1[i] == src2[j]:
46 result[j] |= 1<<i
47
48 ## iota
49
50 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
51
52 When masked, only the bits not masked out are included in the count process.
53
54 viota RT/v, RA, RB
55
56 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
57
58 Example
59
60 7 6 5 4 3 2 1 0 Element number
61
62 1 0 0 1 0 0 0 1 v2 contents
63 viota.m v4, v2 # Unmasked
64 2 2 2 1 1 1 1 0 v4 result
65
66 1 1 1 0 1 0 1 1 v0 contents
67 1 0 0 1 0 0 0 1 v2 contents
68 2 3 4 5 6 7 8 9 v4 contents
69 viota.m v4, v2, v0.t # Masked
70 1 1 1 5 1 7 1 0 v4 results
71
72 def iota(RT, RA, RB):
73 mask = RB ? iregs[RB] : 0b111111...1
74 val = RA ? iregs[RA] : 0b111111...1
75 for i in range(VL):
76 if RA.scalar:
77 testmask = (1<<i)-1 # only count below
78 to_test = val & testmask & mask
79 iregs[RT+i] = popcount(to_test)
80
81 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
82 where bit 2 is inv, bits 0:1 select the bit of the CR.
83
84 def test_CR_bit(CR, BO):
85 return CR[BO[0:1]] == BO[2]
86
87 def iotacr(RT, BA, BO):
88 mask = get_src_predicate()
89 count = 0
90 for i in range(VL):
91 if mask & (1<<i) == 0: continue
92 iregs[RT+i] = count
93 if test_CR_bit(CR[i+BA], BO):
94 count += 1
95
96 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
97
98 # Scalar
99
100 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
101
102 ## sbfm
103
104 sbfm RT, RA, RB!=0
105
106 Example
107
108 7 6 5 4 3 2 1 0 Bit index
109
110 1 0 0 1 0 1 0 0 v3 contents
111 vmsbf.m v2, v3
112 0 0 0 0 0 0 1 1 v2 contents
113
114 1 0 0 1 0 1 0 1 v3 contents
115 vmsbf.m v2, v3
116 0 0 0 0 0 0 0 0 v2
117
118 0 0 0 0 0 0 0 0 v3 contents
119 vmsbf.m v2, v3
120 1 1 1 1 1 1 1 1 v2
121
122 1 1 0 0 0 0 1 1 RB vcontents
123 1 0 0 1 0 1 0 0 v3 contents
124 vmsbf.m v2, v3, v0.t
125 0 1 x x x x 1 1 v2 contents
126
127 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
128
129 pseudocode:
130
131 def sbf(rd, rs1, rs2):
132 rd = 0
133 # start setting if no predicate or if 1st predicate bit set
134 setting_mode = rs2 == x0 or (regs[rs2] & 1)
135 while i < XLEN:
136 bit = 1<<i
137 if rs2 != x0 and (regs[rs2] & bit):
138 # reset searching
139 setting_mode = False
140 if setting_mode:
141 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
142 setting_mode = False
143 else:
144 regs[rd] |= bit
145 else if rs2 != x0: # searching mode
146 if (regs[rs2] & bit):
147 setting_mode = True # back into "setting" mode
148 i += 1
149
150 ## sifm
151
152 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
153
154 sifm RT, RA, RB!=0
155
156 # Example
157
158 7 6 5 4 3 2 1 0 Bit number
159
160 1 0 0 1 0 1 0 0 v3 contents
161 vmsif.m v2, v3
162 0 0 0 0 0 1 1 1 v2 contents
163
164 1 0 0 1 0 1 0 1 v3 contents
165 vmsif.m v2, v3
166 0 0 0 0 0 0 0 1 v2
167
168 1 1 0 0 0 0 1 1 RB vcontents
169 1 0 0 1 0 1 0 0 v3 contents
170 vmsif.m v2, v3, v0.t
171 1 1 x x x x 1 1 v2 contents
172
173 Pseudo-code:
174
175 def sif(rd, rs1, rs2):
176 rd = 0
177 setting_mode = rs2 == x0 or (regs[rs2] & 1)
178
179 while i < XLEN:
180 bit = 1<<i
181
182 # only reenable when predicate in use, and bit valid
183 if !setting_mode && rs2 != x0:
184 if (regs[rs2] & bit):
185 # back into "setting" mode
186 setting_mode = True
187
188 # skipping mode
189 if !setting_mode:
190 # skip any more 1s
191 if regs[rs1] & bit == 1:
192 i += 1
193 continue
194
195 # setting mode, search for 1
196 regs[rd] |= bit # always set during search
197 if regs[rs1] & bit: # found a bit in rs1:
198 setting_mode = False
199 # next loop starts skipping
200
201 i += 1
202
203
204 ## vmsof
205
206 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
207
208 sofm RT, RA, RB
209
210 Example
211
212 7 6 5 4 3 2 1 0 Bit number
213
214 1 0 0 1 0 1 0 0 v3 contents
215 vmsof.m v2, v3
216 0 0 0 0 0 1 0 0 v2 contents
217
218 1 0 0 1 0 1 0 1 v3 contents
219 vmsof.m v2, v3
220 0 0 0 0 0 0 0 1 v2
221
222 1 1 0 0 0 0 1 1 RB vcontents
223 1 1 0 1 0 1 0 0 v3 contents
224 vmsof.m v2, v3, v0.t
225 0 1 x x x x 0 0 v2 content
226
227 Pseudo-code:
228
229 def sof(rd, rs1, rs2):
230 rd = 0
231 setting_mode = rs2 == x0 or (regs[rs2] & 1)
232
233 while i < XLEN:
234 bit = 1<<i
235
236 # only reenable when predicate in use, and bit valid
237 if !setting_mode && rs2 != x0:
238 if (regs[rs2] & bit):
239 # back into "setting" mode
240 setting_mode = True
241
242 # skipping mode
243 if !setting_mode:
244 # skip any more 1s
245 if regs[rs1] & bit == 1:
246 i += 1
247 continue
248
249 # setting mode, search for 1
250 if regs[rs1] & bit: # found a bit in rs1:
251 regs[rd] |= bit # only set when search succeeds
252 setting_mode = False
253 # next loop starts skipping
254
255 i += 1
256
257 # Carry-lookahead
258
259 used not just for carry lookahead, also a special type of predication mask operation.
260
261 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
262 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
263 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
264 * <https://i.stack.imgur.com/QSLKY.png>
265 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
266 `((P|G)+G)^P`
267 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
268
269 P = (A | B) & Ci
270 G = (A & B)
271
272 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
273
274
275
276 two versions: scalar int version and CR based version.
277
278 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
279
280 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
281
282 if zero (no propagation) then CR0.eq is zero
283
284 CR based version, TODO.