(no commit message)
[libreriscv.git] / openpower / sv / vector_ops.mdwn
1 # SV Vector Operations.
2
3 The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
4
5 Notes:
6
7 * Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
8 * Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
9 * Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
10
11 Links:
12
13 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
14 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
15 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
16 * <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
17 out of scope for this document
18 * [[simple_v_extension/specification/bitmanip]] previous version,
19 contains pseudocode for sof, sif, sbf
20
21 # Vector
22
23 ## conflictd
24
25 This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
26
27 input = [100, 100, 3, 100, 5, 100, 100, 3]
28 conflict result = [
29 0b00000000, // Note: first element always zero
30 0b00000001, // 100 is present on #0
31 0b00000000,
32 0b00000011, // 100 is present on #0 and #1
33 0b00000000,
34 0b00001011, // 100 is present on #0, #1, #3
35 0b00011011, // .. and #4
36 0b00000100 // 3 is present on #2
37 ]
38
39 Pseudocode:
40
41 for i in range(VL):
42 for j in range(1, i):
43 if src1[i] == src2[j]:
44 result[j] |= 1<<i
45
46 ## iota
47
48 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
49
50 When masked, only the bits not masked out are included in the count process.
51
52 viota RT/v, RA, RB
53
54 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
55
56 Example
57
58 7 6 5 4 3 2 1 0 Element number
59
60 1 0 0 1 0 0 0 1 v2 contents
61 viota.m v4, v2 # Unmasked
62 2 2 2 1 1 1 1 0 v4 result
63
64 1 1 1 0 1 0 1 1 v0 contents
65 1 0 0 1 0 0 0 1 v2 contents
66 2 3 4 5 6 7 8 9 v4 contents
67 viota.m v4, v2, v0.t # Masked
68 1 1 1 5 1 7 1 0 v4 results
69
70 def iota(RT, RA, RB):
71 mask = RB ? iregs[RB] : 0b111111...1
72 val = RA ? iregs[RA] : 0b111111...1
73 for i in range(VL):
74 if RA.scalar:
75 testmask = (1<<i)-1 # only count below
76 to_test = val & testmask & mask
77 iregs[RT+i] = popcount(to_test)
78
79 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
80 where bit 2 is inv, bits 0:1 select the bit of the CR.
81
82 def test_CR_bit(CR, BO):
83 return CR[BO[0:1]] == BO[2]
84
85 def iotacr(RT, BA, BO):
86 mask = get_src_predicate()
87 count = 0
88 for i in range(VL):
89 if mask & (1<<i) == 0: continue
90 iregs[RT+i] = count
91 if test_CR_bit(CR[i+BA], BO):
92 count += 1
93
94 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
95
96 # Scalar
97
98 These may all be viewed as suitable for fitting into a scalar bitmanip extension.
99
100 ## sbfm
101
102 sbfm RT, RA, RB!=0
103
104 Example
105
106 7 6 5 4 3 2 1 0 Bit index
107
108 1 0 0 1 0 1 0 0 v3 contents
109 vmsbf.m v2, v3
110 0 0 0 0 0 0 1 1 v2 contents
111
112 1 0 0 1 0 1 0 1 v3 contents
113 vmsbf.m v2, v3
114 0 0 0 0 0 0 0 0 v2
115
116 0 0 0 0 0 0 0 0 v3 contents
117 vmsbf.m v2, v3
118 1 1 1 1 1 1 1 1 v2
119
120 1 1 0 0 0 0 1 1 RB vcontents
121 1 0 0 1 0 1 0 0 v3 contents
122 vmsbf.m v2, v3, v0.t
123 0 1 x x x x 1 1 v2 contents
124
125 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
126
127 pseudocode:
128
129 def sbf(rd, rs1, rs2):
130 rd = 0
131 # start setting if no predicate or if 1st predicate bit set
132 setting_mode = rs2 == x0 or (regs[rs2] & 1)
133 while i < XLEN:
134 bit = 1<<i
135 if rs2 != x0 and (regs[rs2] & bit):
136 # reset searching
137 setting_mode = False
138 if setting_mode:
139 if regs[rs1] & bit: # found a bit in rs1: stop setting rd
140 setting_mode = False
141 else:
142 regs[rd] |= bit
143 else if rs2 != x0: # searching mode
144 if (regs[rs2] & bit):
145 setting_mode = True # back into "setting" mode
146 i += 1
147
148 ## sifm
149
150 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
151
152 sifm RT, RA, RB!=0
153
154 # Example
155
156 7 6 5 4 3 2 1 0 Bit number
157
158 1 0 0 1 0 1 0 0 v3 contents
159 vmsif.m v2, v3
160 0 0 0 0 0 1 1 1 v2 contents
161
162 1 0 0 1 0 1 0 1 v3 contents
163 vmsif.m v2, v3
164 0 0 0 0 0 0 0 1 v2
165
166 1 1 0 0 0 0 1 1 RB vcontents
167 1 0 0 1 0 1 0 0 v3 contents
168 vmsif.m v2, v3, v0.t
169 1 1 x x x x 1 1 v2 contents
170
171 Pseudo-code:
172
173 def sif(rd, rs1, rs2):
174 rd = 0
175 setting_mode = rs2 == x0 or (regs[rs2] & 1)
176
177 while i < XLEN:
178 bit = 1<<i
179
180 # only reenable when predicate in use, and bit valid
181 if !setting_mode && rs2 != x0:
182 if (regs[rs2] & bit):
183 # back into "setting" mode
184 setting_mode = True
185
186 # skipping mode
187 if !setting_mode:
188 # skip any more 1s
189 if regs[rs1] & bit == 1:
190 i += 1
191 continue
192
193 # setting mode, search for 1
194 regs[rd] |= bit # always set during search
195 if regs[rs1] & bit: # found a bit in rs1:
196 setting_mode = False
197 # next loop starts skipping
198
199 i += 1
200
201
202 ## vmsof
203
204 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
205
206 sofm RT, RA, RB
207
208 Example
209
210 7 6 5 4 3 2 1 0 Bit number
211
212 1 0 0 1 0 1 0 0 v3 contents
213 vmsof.m v2, v3
214 0 0 0 0 0 1 0 0 v2 contents
215
216 1 0 0 1 0 1 0 1 v3 contents
217 vmsof.m v2, v3
218 0 0 0 0 0 0 0 1 v2
219
220 1 1 0 0 0 0 1 1 RB vcontents
221 1 1 0 1 0 1 0 0 v3 contents
222 vmsof.m v2, v3, v0.t
223 0 1 x x x x 0 0 v2 content
224
225 Pseudo-code:
226
227 def sof(rd, rs1, rs2):
228 rd = 0
229 setting_mode = rs2 == x0 or (regs[rs2] & 1)
230
231 while i < XLEN:
232 bit = 1<<i
233
234 # only reenable when predicate in use, and bit valid
235 if !setting_mode && rs2 != x0:
236 if (regs[rs2] & bit):
237 # back into "setting" mode
238 setting_mode = True
239
240 # skipping mode
241 if !setting_mode:
242 # skip any more 1s
243 if regs[rs1] & bit == 1:
244 i += 1
245 continue
246
247 # setting mode, search for 1
248 if regs[rs1] & bit: # found a bit in rs1:
249 regs[rd] |= bit # only set when search succeeds
250 setting_mode = False
251 # next loop starts skipping
252
253 i += 1
254
255 # Carry-lookahead
256
257 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
258 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
259 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
260 * <https://i.stack.imgur.com/QSLKY.png>
261 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
262 `((P|G)+G)^P`