It turns out that all of these can be done as bitmanip operations of the form
`x/~x &/|/^ (x / -x / x+1 / x-1)`, so they are being superseded.

    if _RB = 0 then mask <- [1] * XLEN else mask <- (RB)
    if mode[1] then a1 <- ¬ra
    if mode2 = 0 then a2 <- (¬ra)+1
    if mode2 = 1 then a2 <- ra-1
    if mode2 = 2 then a2 <- ra+1
    if mode2 = 3 then a2 <- ¬(ra+1)
    if mode3 = 0 then result <- a1 | a2
    if mode3 = 1 then result <- a1 & a2
    if mode3 = 2 then result <- a1 ^ a2
    if mode3 = 3 then result <- UNDEFINED
    result <- result & mask
    # optionally restore masked-out bits
    result <- result | (RA & ¬mask)

    SBF = 0b01010 # set before first
    SOF = 0b01001 # set only first
    SIF = 0b10000 # set including first (0b10011 also works, reason not yet known)
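
The bitmanip form `x/~x &/|/^ (x / -x / x+1 / x-1)` can be illustrated in Python. This is a minimal sketch rather than the instruction definition: the function names and the `width` parameter are assumptions made for the example, and no mask (RB) is applied.

```python
def sbf(x, width=8):
    """Set-before-first: ones strictly below the lowest set bit of x."""
    full = (1 << width) - 1
    return ~x & (x - 1) & full

def sif(x, width=8):
    """Set-including-first: ones up to and including the lowest set bit."""
    full = (1 << width) - 1
    return (x ^ (x - 1)) & full

def sof(x, width=8):
    """Set-only-first: only the lowest set bit of x (zero if x is zero)."""
    full = (1 << width) - 1
    return x & -x & full

print(format(sbf(0b10010100), "08b"))  # 00000011
print(format(sif(0b10010100), "08b"))  # 00000111
print(format(sof(0b10010100), "08b"))  # 00000100
```

With an all-zero input, `sbf` and `sif` return all ones and `sof` returns zero, because `x - 1` wraps to all ones within `width` bits.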

    7 6 5 4 3 2 1 0   Bit index

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 0 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    0 0 0 0 0 0 0 0   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
    0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]

The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 1 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
    1 1 x x x x 1 1   v2 contents

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]

The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 1 0 0   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 1 0 1 0 1 0 0   v3 contents
    0 1 x x x x 0 0   v2 contents

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
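
The masked examples for set-before-first, set-including-first and set-only-first can be reproduced with an element-by-element model. The sketch below is illustrative only and is not the content of the inlined `sbf.py`, `sif.py` or `sof.py`: the function name `mask_set`, its `kind` argument, and the choice to leave masked-out destination bits unchanged (the `x` entries in the tables) are assumptions.

```python
def mask_set(kind, src, mask, dest, width=8):
    """Element-wise model of sbf / sif / sof under a predicate mask.

    kind is "sbf", "sif" or "sof"; src, mask and dest are integers whose
    bit i is element i. Masked-out elements keep their old dest value.
    """
    result = dest
    found = False  # becomes True once the first active set bit is seen
    for i in range(width):
        bit = 1 << i
        if not mask & bit:
            continue  # inactive element: leave dest untouched
        is_first = not found and bool(src & bit)
        if kind == "sbf":
            value = not found and not is_first
        elif kind == "sif":
            value = not found
        elif kind == "sof":
            value = is_first
        else:
            raise ValueError(kind)
        result = (result | bit) if value else (result & ~bit)
        found = found or is_first
    return result

# Masked set-before-first example: mask 11000011, source 10010100
print(format(mask_set("sbf", 0b10010100, 0b11000011, 0), "08b"))  # 01000011
```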

# SV Vector Operations not added

* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>

Both of these instructions may be synthesised from SVP64 Vector
instructions. conflictd is an O(N^2) instruction based on
`sv.cmpi`, and iota is an O(N) instruction based on `sv.addi`
with the appropriate predication.

Moved to [[sv/cookbook/conflictd]].

Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results: successive iterations include more and more bits of the bitstream being tested.

When masked, only the bits not masked out are included in the count process.

Note that RA=0 indicates a test against all 1s, so the instruction generates the vector sequence [0, 1, 2, ... VL-1]. This is equivalent to RVV vid.m, which here is a pseudo-op (RA=0).

    7 6 5 4 3 2 1 0   Element number

    1 0 0 1 0 0 0 1   v2 contents
                      viota.m v4, v2 # Unmasked
    2 2 2 1 1 1 1 0   v4 result

    1 1 1 0 1 0 1 1   v0 contents
    1 0 0 1 0 0 0 1   v2 contents
    2 3 4 5 6 7 8 9   v4 contents
                      viota.m v4, v2, v0.t # Masked
    1 1 1 5 1 7 1 0   v4 results

    def iota(RT, RA, RB):
        mask = RB ? iregs[RB] : 0b111111...1
        val = RA ? iregs[RA] : 0b111111...1
        for i in range(VL):
            testmask = (1<<i)-1 # only count below
            to_test = val & testmask & mask
            iregs[RT+i] = popcount(to_test)
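
A runnable Python sketch of the iota behaviour follows, checked against the two viota.m examples. It is a model, not the instruction definition: register numbers are replaced by plain values, the parameter names (`src`, `mask`, `dest`, `vl`) are invented for the example, and leaving masked-out destination elements unchanged is an assumption taken from the RVV masked example.

```python
def iota(src, mask=None, dest=None, vl=8):
    """Return a list where element i counts active set bits of src below i."""
    if mask is None:
        mask = (1 << vl) - 1      # no predicate: all elements active
    result = list(dest) if dest is not None else [0] * vl
    for i in range(vl):
        if not (mask >> i) & 1:
            continue              # assumption: masked-out elements unchanged
        testmask = (1 << i) - 1   # only count bits below element i
        to_test = src & testmask & mask
        result[i] = bin(to_test).count("1")
    return result

# Element 0 is listed first here, whereas the tables list element 7 first.
print(iota(0b10010001))   # [0, 1, 1, 1, 1, 2, 2, 2]
print(iota(0b11111111))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

The second call corresponds to the RA=0 case, where testing against all 1s produces the sequence 0 to VL-1.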

A Vector CR-based version of the same is also possible, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2],
where bit 2 is inv and bits 0:1 select the bit of the CR.

    def test_CR_bit(CR, BO):
        return CR[BO[0:1]] == BO[2]

    def iotacr(RT, BA, BO):
        mask = get_src_predicate()
            if mask & (1<<i) == 0:
                count = 0 # reset back to zero
            if test_CR_bit(CR[i+BA], BO):

For the variant of iotacr that would be vidcr, it is not appropriate to have BA=0, and it is pointless to have it anyway: the integer version covers it, by not reading the int regfile at all.

Scalar variant, which can be Vectorized to give iotacr:

    def crtaddi(RT, RA, BA, BO, D):
        if test_CR_bit(BA, BO):

A Vector for-loop with zeroing on dest will give the
mask-out effect of resetting the count back to zero.
However, close examination shows that the above may actually
be `sv.addi/mr/sm=EQ/sz *r1, *r0, 1` (TODO: check this).

This looks promising:

    sv.add *v4+1, *v4, *v2

where v2 is guaranteed to contain 0 or 1. The dependency chain
accumulates v2 via the overlap between the src1 and dest vectors.
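
The accumulation through overlapping vectors can be modelled as a sequential element loop. This is a sketch under stated assumptions: the register file is a plain Python list, elements execute strictly in order, and the register numbers chosen for `v2` and `v4` are arbitrary.

```python
def sv_add_overlap(regs, v4, v2, vl):
    """Model `sv.add *v4+1, *v4, *v2` as a sequential element loop."""
    for i in range(vl):
        # dest element i is src1 element i+1, so each result feeds the next add
        regs[v4 + 1 + i] = regs[v4 + i] + regs[v2 + i]
    return regs

regs = [0] * 32
V2, V4 = 8, 16                                # hypothetical register numbers
regs[V2:V2 + 8] = [1, 0, 0, 0, 1, 0, 0, 1]    # element 0 first
sv_add_overlap(regs, V4, V2, 8)
print(regs[V4:V4 + 8])   # [0, 1, 1, 1, 1, 2, 2, 2]
```

With the first element of v4 initialised to zero, the eight elements starting at v4 hold the same values as the unmasked viota.m example, and the element after them holds the total count.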

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

    x1 = nand(CIn, P0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    G* = nand(x3, y3, z3, ~G3)
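
The NAND form can be checked against a simple ripple-carry reference. In the sketch below the terms that do not appear in the equations above (the C0 equation and the `y1`, `z2` and `z3` terms) are filled in by analogy with the terms that do appear, which is an assumption; the function names are invented for the example.

```python
def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1

def inv(bit):
    """Single-bit NOT."""
    return 1 - bit

def lookahead(cin, p, g):
    """Return (C0, C1, C2, Gstar) for four propagate/generate bits."""
    p0, p1, p2, p3 = p
    g0, g1, g2, g3 = g
    c0 = nand(nand(cin, p0), inv(g0))
    c1 = nand(nand(cin, p0, p1), nand(g0, p1), inv(g1))
    c2 = nand(nand(cin, p0, p1, p2), nand(g0, p1, p2), nand(g1, p2), inv(g2))
    gstar = nand(nand(g0, p1, p2, p3), nand(g1, p2, p3), nand(g2, p3), inv(g3))
    return c0, c1, c2, gstar

def ripple(cin, p, g):
    """Reference: carry out of each bit, computed one bit at a time."""
    carries, c = [], cin
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries

print(lookahead(1, [1, 1, 0, 0], [0, 0, 0, 0]))  # (1, 1, 0, 0)
```

Each two-level NAND expression is the usual sum-of-products carry equation, for example C1 = G1 | (G0 & P1) | (CIn & P0 & P1), and G* is the group generate (the carry out of the block when CIn is zero).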

The Stack Overflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here). The result of the algorithm is the new carry-in, which already includes ripple, with one bit of carry per element.

    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix-1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
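
The steps above can be sketched in Python as follows. This is a minimal model, assuming little-endian limb lists of equal length and a carry-in of zero; the function name `bigint_add` and the returned carry-out (bit N of `newC` for N limbs) are choices made for the example.

```python
def bigint_add(a_limbs, b_limbs, radix):
    """Add two equal-length little-endian limb lists; return (limbs, carry_out)."""
    sums = [a + b for a, b in zip(a_limbs, b_limbs)]   # C[id], carry-in of 0
    G = 0   # bit id set when limb id generates a carry
    P = 0   # bit id set when limb id would propagate an incoming carry
    for i, s in enumerate(sums):
        if s > radix - 1:
            G |= 1 << i
        if s == radix - 1:
            P |= 1 << i
    new_c = ((P | G) + G) ^ P      # bit id is the carry into limb id
    limbs = [(s + ((new_c >> i) & 1)) % radix for i, s in enumerate(sums)]
    carry_out = (new_c >> len(sums)) & 1
    return limbs, carry_out

# 0x00FFFF + 0x000001 = 0x010000 with 8-bit limbs
print(bigint_add([255, 255, 0], [1, 0, 0], 256))  # ([0, 0, 1], 0)
```

The formula works because P and G are never both set for the same limb, so `(P|G)+G` equals `P + 2*G`: each generate bit is added one position up, and the integer addition ripples it through any run of propagate bits.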

Two versions: a scalar int version and a CR-based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR-based version: TODO.