bug 1244: add pospopcount cookbook link
[libreriscv.git] / openpower / sv / vector_ops / discussion.mdwn
1 [[!tag standards]]
2
3 # sof/sif/sbfm etc.
4
5 These all it turns out can be done as bitmanip of the form
6 x/~x &/|/^ (x / -x / x+1 / x-1) so are being superceded.
7 needs some work though
8
9 ```
10 if _RB = 0 then mask <- [1] * XLEN else mask = (RB)
11 a1 <- (RA) & mask
12 if mode[1] then a1 <- ¬ra
13 mode2 <- mode[2:3]
14 if mode2 = 0 then a2 <- (¬ra)+1
15 if mode2 = 1 then a2 <- ra-1
16 if mode2 = 2 then a2 <- ra+1
17 if mode2 = 3 then a2 <- ¬(ra+1)
18 a1 <- a1 & mask
19 a2 <- a2 & mask
20 # select operator
21 mode3 <- mode[3:4]
22 if mode3 = 0 then result <- a1 | a2
23 if mode3 = 1 then result <- a1 & a2
24 if mode3 = 2 then result <- a1 ^ a2
25 if mode3 = 3 then result <- UNDEFINED
26 result <- result & mask
27 # optionally restore masked-out bits
28 if L = 1 then
29 result <- result | (RA & ¬mask)
30 RT <- result
31
32 SBF = 0b01010 # set before first
33 SOF = 0b01001 # set only first
34 SIF = 0b10000 # set including first 10011 also works no idea why yet
35 ```
36
37
38 ## OBSOLETE sbfm
39
40 sbfm RT, RA, RB!=0
41
42 Example
43
44 7 6 5 4 3 2 1 0 Bit index
45
46 1 0 0 1 0 1 0 0 v3 contents
47 vmsbf.m v2, v3
48 0 0 0 0 0 0 1 1 v2 contents
49
50 1 0 0 1 0 1 0 1 v3 contents
51 vmsbf.m v2, v3
52 0 0 0 0 0 0 0 0 v2
53
54 0 0 0 0 0 0 0 0 v3 contents
55 vmsbf.m v2, v3
56 1 1 1 1 1 1 1 1 v2
57
58 1 1 0 0 0 0 1 1 RB vcontents
59 1 0 0 1 0 1 0 0 v3 contents
60 vmsbf.m v2, v3, v0.t
61 0 1 x x x x 1 1 v2 contents
62
63 The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
64
65 Executable pseudocode demo:
66
67 ```
68 [[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
69 ```
70
71 ## OBSOLETE sifm
72
73 The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
74
75 sifm RT, RA, RB!=0
76
77 # Example
78
79 7 6 5 4 3 2 1 0 Bit number
80
81 1 0 0 1 0 1 0 0 v3 contents
82 vmsif.m v2, v3
83 0 0 0 0 0 1 1 1 v2 contents
84
85 1 0 0 1 0 1 0 1 v3 contents
86 vmsif.m v2, v3
87 0 0 0 0 0 0 0 1 v2
88
89 1 1 0 0 0 0 1 1 RB vcontents
90 1 0 0 1 0 1 0 0 v3 contents
91 vmsif.m v2, v3, v0.t
92 1 1 x x x x 1 1 v2 contents
93
94 Executable pseudocode demo:
95
96 ```
97 [[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
98 ```
99
100 ## OBSOLETE vmsof
101
102 The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
103
104 sofm RT, RA, RB
105
106 Example
107
108 7 6 5 4 3 2 1 0 Bit number
109
110 1 0 0 1 0 1 0 0 v3 contents
111 vmsof.m v2, v3
112 0 0 0 0 0 1 0 0 v2 contents
113
114 1 0 0 1 0 1 0 1 v3 contents
115 vmsof.m v2, v3
116 0 0 0 0 0 0 0 1 v2
117
118 1 1 0 0 0 0 1 1 RB vcontents
119 1 1 0 1 0 1 0 0 v3 contents
120 vmsof.m v2, v3, v0.t
121 0 1 x x x x 0 0 v2 content
122
123 Executable pseudocode demo:
124
125 ```
126 [[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
127 ```
128
129
130 # SV Vector Operations not added
131
132 Links:
133
134 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
135 * <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
136 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
137 * <https://bugs.libre-soc.org/show_bug.cgi?id=213>
138
139 Both of these instructions may be synthesised from SVP64 Vector
140 instructions. conflictd is an O(N^2) instruction based on
141 `sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
142 with the appropriate predication
143
144 # conflictd
145
146 moved to [[sv/cookbook/conflictd]]
147
148
149
150 # iota
151
152 Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
153
154 When masked, only the bits not masked out are included in the count process.
155
156 viota RT/v, RA, RB
157
158 Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
159
160 Example
161
162 7 6 5 4 3 2 1 0 Element number
163
164 1 0 0 1 0 0 0 1 v2 contents
165 viota.m v4, v2 # Unmasked
166 2 2 2 1 1 1 1 0 v4 result
167
168 1 1 1 0 1 0 1 1 v0 contents
169 1 0 0 1 0 0 0 1 v2 contents
170 2 3 4 5 6 7 8 9 v4 contents
171 viota.m v4, v2, v0.t # Masked
172 1 1 1 5 1 7 1 0 v4 results
173
174 def iota(RT, RA, RB):
175 mask = RB ? iregs[RB] : 0b111111...1
176 val = RA ? iregs[RA] : 0b111111...1
177 for i in range(VL):
178 if RA.scalar:
179 testmask = (1<<i)-1 # only count below
180 to_test = val & testmask & mask
181 iregs[RT+i] = popcount(to_test)
182
183 a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
184 where bit 2 is inv, bits 0:1 select the bit of the CR.
185
186 def test_CR_bit(CR, BO):
187 return CR[BO[0:1]] == BO[2]
188
189 def iotacr(RT, BA, BO):
190 mask = get_src_predicate()
191 count = 0
192 for i in range(VL):
193 if mask & (1<<i) == 0:
194 count = 0 # reset back to zero
195 continue
196 iregs[RT+i] = count
197 if test_CR_bit(CR[i+BA], BO):
198 count += 1
199
200 the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
201
202 scalar variant which can be Vectorized to give iotacr:
203
204 def crtaddi(RT, RA, BA, BO, D):
205 if test_CR_bit(BA, BO):
206 RT = RA + EXTS(D)
207 else:
208 RT = RA
209
210 a Vector for-loop with zero-ing on dest will give the
211 mask-out effect of resetting the count back to zero.
212 However close examination shows that the above may actually
213 be `sv.addi/mr/sm=EQ/sz *r1, *r0, 1` TODO check this
214
215 this looks promising:
216
217 sv.add *v4+1, *v4, *v2
218
219 where v2 is guaranteed to contain 0 or 1, the dependency chain
220 accumulates v2 via the overlap between src1 and dest vectors.
221
222 # cprop
223
224 * <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
225 * <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
226 * <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
227 * <https://i.stack.imgur.com/QSLKY.png>
228 * <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
229 `((P|G)+G)^P`
230 * <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
231
232 From QLSKY.png:
233
234 ```
235 x0 = nand(CIn, P0)
236 C0 = nand(x0, ~G0)
237
238 x1 = nand(CIn, P0, P1)
239 y1 = nand(G0, P1)
240 C1 = nand(x1, y1, ~G1)
241
242 x2 = nand(CIn, P0, P1, P2)
243 y2 = nand(G0, P1, P2)
244 z2 = nand(G1, P2)
245 C1 = nand(x2, y2, z2, ~G2)
246
247 # Gen*
248 x3 = nand(G0, P1, P2, P3)
249 y3 = nand(G1, P2, P3)
250 z3 = nand(G2, P3)
251 G* = nand(x3, y3, z3, ~G3)
252 ```
253
254 ```
255 P = (A | B) & Ci
256 G = (A & B)
257 ```
258
259 Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
260
261 ```
262 At each id, compute C[id] = A[id]+B[id]+0
263 Get G[id] = C[id] > radix -1
264 Get P[id] = C[id] == radix-1
265 Join all P[id] together, likewise G[id]
266 Compute newC = ((P|G)+G)^P
267 result[id] = (C[id] + newC[id]) % radix
268 ```
269
270 two versions: scalar int version and CR based version.
271
272 scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
273
274 vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
275
276 if zero (no propagation) then CR0.eq is zero
277
278 CR based version, TODO.