1 [[!tag standards]]
2
3 [[!toc levels=1]]
4
5 # Implementation Log
6
7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
10
11
12 # bitmanipulation
13
14 **DRAFT STATUS**
15
16 pseudocode: [[openpower/isa/bitmanip]]
17
18 this extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
19 Also included are DSP/Multimedia operations suitable for
20 Audio/Video. Vectorisation and SIMD are removed: these are straight scalar (element) operations making them suitable for embedded applications.
21 Vectorisation Context is provided by [[openpower/sv]].
22
23 When combined with SV, scalar variants of bitmanip operations found in VSX are added so that the Packed SIMD aspects of VSX may be retired as "legacy"
24 in the far future (10 to 20 years). Also, VSX is hundreds of opcodes, requires 128 bit pathways, and is wholly unsuited to low power or embedded scenarios.
25
26 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD". It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
27
28 general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science. however for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added. The expectation is that these operations would all be covered by the same pipeline.
29
30 note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], [[sv/int_fp_mv]] and
31 the [[sv/av_opcodes]] as well as [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
32
33 Useful resource:
34
35 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
36 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
37
38 # summary
39
40 two major opcodes are needed
41
42 ternlog has its own major opcode
43
44 | 29.30 |31| name | Form |
45 | ------ |--| --------- | ---- |
46 | 0 0 |Rc| ternlogi | TLI-Form |
47 | 0 1 | | crternlogi | TLI-Form |
48 | 1 iv | | grevlogi | TLI-Form |
49
50 2nd major opcode for other bitmanip: minor opcode allocation
51
52 | 28.30 |31| name |
53 | ------ |--| --------- |
54 | -00 |0 | xpermi |
55 | -00 |1 | binary lut |
56 | -01 |0 | grevlog |
57 | -01 |1 | grevlogw |
58 | 010 |Rc| bitmask |
59 | 011 | | SVP64 |
60 | 110 |Rc| 1/2-op |
61 | 111 | | bmrevi |
62
63
64 1-op and variants
65
66 | dest | src1 | subop | op |
67 | ---- | ---- | ----- | -------- |
68 | RT | RA | .. | bmatflip |
69
70 2-op and variants
71
72 | dest | src1 | src2 | subop | op |
73 | ---- | ---- | ---- | ----- | -------- |
| RT   | RA   | RB   | or    | bmator   |
| RT   | RA   | RB   | xor   | bmatxor  |
76 | RT | RA | RB | | grev |
77 | RT | RA | RB | | clmul\* |
78 | RT | RA | RB | | gorc |
79 | RT | RA | RB | shuf | shuffle |
80 | RT | RA | RB | unshuf| shuffle |
81 | RT | RA | RB | width | xperm |
82 | RT | RA | RB | type | av minmax |
83 | RT | RA | RB | | av abs avgadd |
84 | RT | RA | RB | type | vmask ops |
85 | RT | RA | RB | type | abs accumulate (overwrite) |
86
87 3 ops
88
89 * grevlog[w]
90 * GF mul-add
91 * bitmask-reverse
92
93 TODO: convert all instructions to use RT and not RS
94
95 | 0.5|6.10|11.15|16.20 |21..25 | 26....30 |31| name | Form |
96 | -- | -- | --- | --- | ----- | -------- |--| ------ | -------- |
97 | NN | RT | RA |itype/| im0-4 | im5-7 00 |0 | xpermi | TLI-Form |
98 | NN | RT | RA | RB | RC | nh 00 00 |1 | binlut | VA-Form |
99 | NN | RT | RA | RB | /BFA/ | 0 01 00 |1 | bincrflut | VA-Form |
100 | NN | | | | | 1 01 00 |1 | rsvd | |
101 | NN | | | | | - 10 00 |1 | rsvd | |
102 | NN | | | | | 0 11 00 |1 | svshape | SVM-Form |
103 | NN | | | | | 1 11 00 |1 | svremap | SVRM-Form |
104 | NN | RT | RA | RB | im0-4 | im5-7 01 |0 | grevlog | TLI-Form |
105 | NN | RT | RA | RB | im0-4 | im5-7 01 |1 | grevlogw | TLI-Form |
106 | NN | RT | RA | RB | RC | mode 010 |Rc| bitmask\* | VA2-Form |
107 | NN |FRT | d1 | d0 | d0 | 00 011 |d2| fmvis | DX-Form |
108 | NN | | | | | 01 011 | | rsvd | |
109 | NN | | | | | 10 011 |Rc| svstep | SVL-Form |
110 | NN | | | | | 11 011 |Rc| setvl | SVL-Form |
111 | NN | | | | | ---- 110 | | 1/2 ops | other table |
112 | NN | RT | RA | RB | sh0-4 | sh5 1 111 |Rc| bmrevi | MDS-Form |
113
ops (note that the av avg and abs operations, as well as the vector scalar mask
operations, are included here: see [[sv/vector_ops]] and
the [[sv/av_opcodes]])
117
118 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name | Form |
119 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- | ------- |
120 | NN | RS | me | sh | SH | ME 0 | nn00 110 |Rc| bmopsi | BM-Form |
121 | NN | RS | RA | sh | SH | 0 1 | nn00 110 |Rc| bmopsi | XB-Form |
122 | NN | | | | | 1 1 | --00 110 |Rc| rsvd | |
123 | NN | RT | RA | RB | 1 | 00 | 0001 110 |Rc| cldiv | X-Form |
124 | NN | RT | RA | RB | 1 | 01 | 0001 110 |Rc| clmod | X-Form |
125 | NN | RT | RA | | 1 | 10 | 0001 110 |Rc| clmulh | X-Form |
126 | NN | RT | RA | RB | 1 | 11 | 0001 110 |Rc| clmul | X-Form |
127 | NN | RT | RA | RB | 0 | 00 | 0001 110 |Rc| vec sbfm | X-Form |
128 | NN | RT | RA | RB | 0 | 01 | 0001 110 |Rc| vec sofm | X-Form |
129 | NN | RT | RA | RB | 0 | 10 | 0001 110 |Rc| vec sifm | X-Form |
130 | NN | RT | RA | RB | 0 | 11 | 0001 110 |Rc| vec cprop | X-Form |
131 | NN | | | | | -0 | 0101 110 |Rc| crfbinlog | {TODO} |
132 | NN | | | | | -1 | 0101 110 |Rc| rsvd | |
133 | NN | RT | RA | RB | 0 | itype | 1001 110 |Rc| av minmax | X-Form |
134 | NN | RT | RA | RB | 1 | 00 | 1001 110 |Rc| av abss | X-Form |
135 | NN | RT | RA | RB | 1 | 01 | 1001 110 |Rc| av absu | X-Form |
136 | NN | RT | RA | RB | 1 | 10 | 1001 110 |Rc| av avgadd | X-Form |
137 | NN | RT | RA | RB | 1 | 11 | 1001 110 |Rc| grevlutr | X-Form |
138 | NN | RT | RA | RB | 0 | itype | 1101 110 |Rc| shadd | X-Form |
139 | NN | RT | RA | RB | 1 | itype | 1101 110 |Rc| shadduw | X-Form |
140 | NN | RT | RA | RB | 0 | 00 | 0010 110 |Rc| gorc | X-Form |
141 | NN | RS | RA | sh | SH | 00 | 1010 110 |Rc| gorci | XB-Form |
142 | NN | RT | RA | RB | 0 | 00 | 0110 110 |Rc| gorcw | X-Form |
143 | NN | RS | RA | SH | 0 | 00 | 1110 110 |Rc| gorcwi | X-Form |
144 | NN | RT | RA | RB | 1 | 00 | 1110 110 |Rc| rsvd | |
145 | NN | RT | RA | RB | 0 | 01 | 0010 110 |Rc| grev | X-Form |
146 | NN | RT | RA | RB | 1 | 01 | 0010 110 |Rc| clmulr | X-Form |
147 | NN | RS | RA | sh | SH | 01 | 1010 110 |Rc| grevi | XB-Form |
148 | NN | RT | RA | RB | 0 | 01 | 0110 110 |Rc| grevw | X-Form |
149 | NN | RS | RA | SH | 0 | 01 | 1110 110 |Rc| grevwi | X-Form |
150 | NN | RT | RA | RB | 1 | 01 | 1110 110 |Rc| rsvd | |
151 | NN | RS | RA | RB | 0 | 10 | 0010 110 |Rc| bmator | X-Form |
152 | NN | RS | RA | RB | 0 | 10 | 0110 110 |Rc| bmatand | X-Form |
153 | NN | RS | RA | RB | 0 | 10 | 1010 110 |Rc| bmatxor | X-Form |
154 | NN | RS | RA | RB | 0 | 10 | 1110 110 |Rc| bmatflip | X-Form |
155 | NN | RT | RA | RB | 1 | 10 | 0010 110 |Rc| xpermn | X-Form |
156 | NN | RT | RA | RB | 1 | 10 | 0110 110 |Rc| xpermb | X-Form |
157 | NN | RT | RA | RB | 1 | 10 | 1010 110 |Rc| xpermh | X-Form |
158 | NN | RT | RA | RB | 1 | 10 | 1110 110 |Rc| xpermw | X-Form |
159 | NN | RT | RA | RB | 0 | 11 | 1110 110 |Rc| abssa | X-Form |
160 | NN | RT | RA | RB | 1 | 11 | 1110 110 |Rc| absua | X-Form |
161 | NN | | | | | | --11 110 |Rc| rsvd | |
162
163 # binary and ternary bitops
164
Similar to FPGA LUTs: for two (binary) or three (ternary) inputs take
bits from each input, concatenate them and
perform a lookup into a table, using an 8-bit immediate (for the ternary
instructions) or a table held in another register (4-bit
for the binary instructions). The binary lookup instructions have CR Field
lookup variants due to CR Fields being 4 bit.
170
171 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
172
173 ## ternlogi
174
175 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
176 | -- | -- | --- | --- | ----- | --- |--|
177 | NN | RT | RA | RB | im0-7 | 00 |Rc|
178
179 lut3(imm, a, b, c):
180 idx = c << 2 | b << 1 | a
181 return imm[idx] # idx by LSB0 order
182
183 for i in range(64):
184 RT[i] = lut3(imm, RB[i], RA[i], RT[i])
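
The 8-bit immediate is simply the truth table of the desired ternary
function. As an illustration (a sketch only; the helper below is
hypothetical and not part of the specification), an immediate may be
derived from any three-input boolean function using the same LSB0
index ordering as `lut3` above:

```python
# hypothetical helper: build a ternlogi immediate from a boolean function,
# indexing the truth table with idx = c<<2 | b<<1 | a (LSB0 order)
def ternlog_imm(f):
    imm = 0
    for c in (0, 1):
        for b in (0, 1):
            for a in (0, 1):
                imm |= f(a, b, c) << ((c << 2) | (b << 1) | a)
    return imm

print(bin(ternlog_imm(lambda a, b, c: a ^ b ^ c)))                    # 3-input XOR: 0b10010110
print(bin(ternlog_imm(lambda a, b, c: (a & b) | (b & c) | (c & a))))  # majority:    0b11101000
```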
185
186 ## binlut
187
188 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
189 lookup table is 4 bits wide not 8 bits, and secondly the lookup
190 table comes from a register not an immediate.
191
192 | 0.5|6.10|11.15|16.20| 21..25|26..31 | Form |
193 | -- | -- | --- | --- | ----- |--------|---------|
194 | NN | RT | RA | RB | RC |nh 00001| VA-Form |
195 | NN | RT | RA | RB | /BFA/ |0 01001| VA-Form |
196
197 For binlut, the 4-bit LUT may be selected from either the high nibble
198 or the low nibble of the first byte of RC:
199
200 lut2(imm, a, b):
201 idx = b << 1 | a
202 return imm[idx] # idx by LSB0 order
203
204 imm = (RC>>(nh*4))&0b1111
205 for i in range(64):
206 RT[i] = lut2(imm, RB[i], RA[i])
207
208 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
209
210 for i in range(64):
211 RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
212
213 When Vectorised with SVP64, as usual both source and destination may be
214 Vector or Scalar.
215
216 *Programmer's note: a dynamic ternary lookup may be synthesised from
217 a pair of `binlut` instructions followed by a `ternlogi` to select which
218 to merge. Use `nh` to select which nibble to use as the lookup table
219 from the RC source register (`nh=1` nibble high), i.e. keeping
220 an 8-bit LUT3 in RC, the first `binlut` instruction may set nh=0 and
221 the second nh=1.*
222
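A sketch of that synthesis (illustrative Python, not actual mnemonics):
the low and high nibbles of the 8-bit table in RC each act as one
`binlut` lookup, and the third input then selects between the two
partial results, exactly as the final `ternlogi` merge would:

```python
# illustrative sketch of the programmer's note above: a dynamic LUT3
# built from two LUT2 lookups (binlut with nh=0 and nh=1) merged by a
# final select on the third input
def lut2(table4, a, b):
    return (table4 >> ((b << 1) | a)) & 1

def dynamic_lut3(table8, a, b, c):
    lo = lut2(table8 & 0xF, a, b)   # binlut, nh=0 (low nibble of RC)
    hi = lut2(table8 >> 4, a, b)    # binlut, nh=1 (high nibble of RC)
    return hi if c else lo          # merged by the final select
```
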
223 ## crternlogi
224
crternlogi is the CR Field variant: the sources and destination are CR Fields
rather than Integer registers, with a 4-bit mask selecting which bits of the
destination CR Field are written.
226
227 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
228 | -- | -- | --- | --- | --- |-----|----- | -----|--|
229 | NN | BT | BA | BB | BC |m0-2 | imm | 01 |m3|
230
    mask = [m0, m1, m2, m3]
    for i in range(4):
        a, b, c = CRs[BA][i], CRs[BB][i], CRs[BC][i]
        if mask[i]: CRs[BT][i] = lut3(imm, a, b, c)
235
236 This instruction is remarkably similar to the existing crops, `crand` etc.
237 which have been noted to be a 4-bit (binary) LUT. In effect `crternlogi`
238 is the ternary LUT version of crops, having an 8-bit LUT.
239
240 ## crbinlog
241
242 With ternary (LUT3) dynamic instructions being very costly,
243 and CR Fields being only 4 bit, a binary (LUT2) variant is better
244
245 | 0.5|6.8 | 9.11|12.14|15.17|18.22|23...30 |31|
246 | -- | -- | --- | --- | --- |-----| --------|--|
247 | NN | BT | BA | BB | BC |m0-m2|00101110 |m3|
248
    mask = [m0, m1, m2, m3]
    for i in range(4):
        a, b = CRs[BA][i], CRs[BB][i]
        if mask[i]: CRs[BT][i] = lut2(CRs[BC], a, b)
253
254 When SVP64 Vectorised any of the 4 operands may be Scalar or
255 Vector, including `BC` meaning that multiple different dynamic
256 lookups may be performed with a single instruction.
257
258 *Programmer's note: just as with binlut and ternlogi, a pair
259 of crbinlog instructions followed by a merging crternlogi may
260 be deployed to synthesise dynamic ternary (LUT3) CR Field
261 manipulation*
262
263 # int ops
264
## min/max
266
267 required for the [[sv/av_opcodes]]
268
signed and unsigned min/max for integer. this is sort-of partly synthesiseable in [[sv/svp64]] with pred-result, as long as the dest reg is one of the sources, but not for both signed and unsigned at once. when the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).
270
271 signed/unsigned min/max gives more flexibility.
272
273 ```
274 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
275 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
276 }
277 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
278 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
279 }
280 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
281 { return rs1 < rs2 ? rs1 : rs2;
282 }
283 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
284 { return rs1 > rs2 ? rs1 : rs2;
285 }
286 ```
287
288 ## average
289
290 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
291 but not scalar
292
293 ```
uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
    return (rs1 + rs2 + 1) >> 1;
}
297 ```
298
299 ## abs
300
301 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
302 but not scalar
303
304 ```
uint_xlen_t intabs(uint_xlen_t rs1, uint_xlen_t rs2) {
    return (rs1 > rs2) ? (rs1 - rs2) : (rs2 - rs1);
}
308 ```
309
310 ## abs-accumulate
311
312 required for the [[sv/av_opcodes]], these are needed for motion estimation.
313 both are overwrite on RS.
314
315 ```
uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
    return rs + ((ra > rb) ? (ra - rb) : (rb - ra));
}
uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
    return rs + ((ra > rb) ? (ra - rb) : (rb - ra));
}
322 ```
323
324 For SVP64, the twin Elwidths allows e.g. a 16 bit accumulator for 8 bit
325 differences. Form is `RM-1P-3S1D` where RS-as-source has a separate
326 SVP64 designation from RS-as-dest. This gives a limited range of
327 non-overwrite capability.
328
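For orientation, the scalar loop below (illustrative Python; the names
are not part of the specification) is the classic motion-estimation
sum-of-absolute-differences kernel that a Vectorised abs-accumulate
targets:

```python
# illustration only: each loop iteration is one abs-accumulate,
# acc = acc + |a[i] - b[i]|
def sum_abs_diff(a, b):
    acc = 0
    for x, y in zip(a, b):
        acc += x - y if x > y else y - x
    return acc

assert sum_abs_diff([1, 9, 4], [5, 2, 4]) == 11
```
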
329 # shift-and-add
330
Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
Adding more LD/ST modes is too complex, so a compromise is to add shift-and-add,
which replaces a pair of explicit instructions in hot loops.
334
335 ```
336 uint_xlen_t shadd(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
337 return (rs1 << (sh+1)) + rs2;
338 }
339
340 uint_xlen_t shadduw(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
341 uint_xlen_t rs1z = rs1 & 0xFFFFFFFF;
342 return (rs1z << (sh+1)) + rs2;
343 }
344 ```
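
As a usage sketch (illustrative values only): indexing an array of
8-byte elements requires `base + (i << 3)`, which is a single `shadd`
with `sh=2`:

```python
# illustration only: the address arithmetic that one shadd replaces
def shadd(rs1, rs2, sh):
    return (rs1 << (sh + 1)) + rs2

base, i = 0x10000, 5
assert shadd(i, base, 2) == base + i * 8  # sh=2 -> shift by 3 -> scale by 8
```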
345
346 # bitmask set
347
348 based on RV bitmanip singlebit set, instruction format similar to shift
349 [[isa/fixedshift]]. bmext is actually covered already (shift-with-mask rldicl but only immediate version).
350 however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.
351
the bmext (RB) version is not the same as rldicl, because bmext is a right shift by RC whereas rldicl is a left rotate. for the immediate version this does not matter, so a bmexti is not required.
for bmrev, however, there is no direct equivalent, and consequently a bmrevi is required.
354
355 bmset (register for mask amount) is particularly useful for creating
356 predicate masks where the length is a dynamic runtime quantity.
357 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
358
359 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name |
360 | -- | -- | --- | --- | --- | ------- |--| ----- |
361 | NN | RS | RA | RB | RC | mode 010 |Rc| bm\* |
362
363 Immediate-variant is an overwrite form:
364
365 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
366 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
367 | NN | RS | RB | sh | SH | itype | 1000 110 |Rc| bm\*i |
368
369 ```
370 def MASK(x, y):
371 if x < y:
372 x = x+1
373 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
374 mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
375 elif x == y:
376 return 1 << x
377 else:
378 x = x+1
379 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
380 mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
381 return mask_a ^ mask_b
382
383
384 uint_xlen_t bmset(RS, RB, sh)
385 {
386 int shamt = RB & (XLEN - 1);
387 mask = (2<<sh)-1;
388 return RS | (mask << shamt);
389 }
390
391 uint_xlen_t bmclr(RS, RB, sh)
392 {
393 int shamt = RB & (XLEN - 1);
394 mask = (2<<sh)-1;
395 return RS & ~(mask << shamt);
396 }
397
398 uint_xlen_t bminv(RS, RB, sh)
399 {
400 int shamt = RB & (XLEN - 1);
401 mask = (2<<sh)-1;
402 return RS ^ (mask << shamt);
403 }
404
405 uint_xlen_t bmext(RS, RB, sh)
406 {
407 int shamt = RB & (XLEN - 1);
408 mask = (2<<sh)-1;
409 return mask & (RS >> shamt);
410 }
411 ```
412
413 bitmask extract with reverse. can be done by bit-order-inverting all of RB and getting bits of RB from the opposite end.
414
415 when RA is zero, no shift occurs. this makes bmextrev useful for
416 simply reversing all bits of a register.
417
418 ```
419 msb = ra[5:0];
420 rev[0:msb] = rb[msb:0];
421 rt = ZE(rev[msb:0]);
422
423 uint_xlen_t bmextrev(RA, RB, sh)
424 {
425 int shamt = XLEN-1;
426 if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
427 shamt = (XLEN-1)-shamt; # shift other end
428 bra = bitreverse(RB) # swap LSB-MSB
429 mask = (2<<sh)-1;
430 return mask & (bra >> shamt);
431 }
432 ```
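
A quick check (illustrative Python, not part of the specification) of
the note above that `RA=0` combined with a full mask reduces the
operation to a plain bit-reversal of RB:

```python
# illustration: with RA=0 and sh=63 (all-ones mask) bmextrev is a
# straight 64-bit bit-reversal
def bitreverse(x, width=64):
    return int(format(x, f'0{width}b')[::-1], 2)

def bmextrev(ra, rb, sh, xlen=64):
    shamt = (ra & (xlen - 1)) if ra != 0 else xlen - 1
    shamt = (xlen - 1) - shamt            # shift from the other end
    mask = (2 << sh) - 1
    return mask & (bitreverse(rb, xlen) >> shamt)

assert bmextrev(0, 0x1, 63) == 1 << 63
```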
433
434 | 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name |
435 | -- | -- | --- | --- | --- | ------- |--| ------ |
436 | NN | RT | RA | RB | sh | 1111 |Rc| bmrevi |
437
438
439 # grevlut
440
441 ([3x lower latency alternative](grev_gorc_design/) which is
442 not equivalent and has limited constant-generation capability)
443
generalised reverse, combined with a pair of LUT2s, with a constant
`0b0101...0101` as the initial value when RA=0, and with an option to
invert (including when RA=0, giving a constant `0b1010...1010` as the
initial value), provides a wide range of instructions
and a means to set hundreds of regular 64-bit patterns with one
single 32-bit instruction.
450
451 the two LUT2s are applied left-half (when not swapping)
452 and right-half (when swapping) so as to allow a wider
453 range of options.
454
455 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
456
457 * A value of `0b11001010` for the immediate provides
458 the functionality of a standard "grev".
459 * `0b11101110` provides gorc
460
461 grevlut should be arranged so as to produce the constants
462 needed to put into bext (bitextract) so as in turn to
463 be able to emulate x86 pmovmask instructions <https://www.felixcloutier.com/x86/pmovmskb>.
464 This only requires 2 instructions (grevlut, bext).
465
466 Note that if the mask is required to be placed
467 directly into CR Fields (for use as CR Predicate
masks rather than an integer mask) then sv.cmpi or sv.ori
469 may be used instead, bearing in mind that sv.ori
470 is a 64-bit instruction, and `VL` must have been
471 set to the required length:
472
473 sv.ori./elwid=8 r10.v, r10.v, 0
474
475 The following settings provide the required mask constants:
476
477 | RA=0 | RB | imm | iv | result |
478 | ------- | ------- | ---------- | -- | ---------- |
479 | 0x555.. | 0b10 | 0b01101100 | 0 | 0x111111... |
480 | 0x555.. | 0b110 | 0b01101100 | 0 | 0x010101... |
481 | 0x555.. | 0b1110 | 0b01101100 | 0 | 0x00010001... |
482 | 0x555.. | 0b10 | 0b11000110 | 1 | 0x88888... |
483 | 0x555.. | 0b110 | 0b11000110 | 1 | 0x808080... |
484 | 0x555.. | 0b1110 | 0b11000110 | 1 | 0x80008000... |
485
486 Better diagram showing the correct ordering of shamt (RB). A LUT2
487 is applied to all locations marked in red using the first 4
488 bits of the immediate, and a separate LUT2 applied to all
489 locations in green using the upper 4 bits of the immediate.
490
491 <img src="/openpower/sv/grevlut.png" width=700 />
492
493 demo code [[openpower/sv/grevlut.py]]
494
495 ```
496 lut2(imm, a, b):
497 idx = b << 1 | a
498 return imm[idx] # idx by LSB0 order
499
dorow(imm8, step_i, chunk_size, is32b):
501 for j in 0 to 31 if is32b else 63:
502 if (j&chunk_size) == 0
503 imm = imm8[0..3]
504 else
505 imm = imm8[4..7]
506 step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
507 return step_o
508
509 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
510 {
511 uint64_t x = 0x5555_5555_5555_5555;
512 if (RA != 0) x = GPR(RA);
513 if (iv) x = ~x;
514 int shamt = RB & 31 if is32b else 63
515 for i in 0 to (6-is32b)
516 step = 1<<i
517 if (shamt & step) x = dorow(imm, x, step, is32b)
518 return x;
519 }
520 ```
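
A compact runnable rendering of the pseudocode above (Python,
illustration only), used here to reproduce the first row of the
constants table:

```python
# re-derive the first constants-table row: RA=0, RB=0b10,
# imm=0b01101100, iv=0 -> 0x1111_1111_1111_1111
def lut2(imm4, a, b):
    return (imm4 >> ((b << 1) | a)) & 1

def dorow(imm8, x, chunk, width=64):
    out = 0
    for j in range(width):
        imm4 = imm8 & 0xF if (j & chunk) == 0 else imm8 >> 4
        a = (x >> j) & 1
        b = (x >> (j ^ chunk)) & 1
        out |= lut2(imm4, a, b) << j
    return out

def grevlut(ra, rb, imm8, iv=0):
    # ra is the *value* of GPR(RA), with 0 selecting the 0b0101... constant
    x = 0x5555_5555_5555_5555 if ra == 0 else ra
    if iv:
        x = ~x & (1 << 64) - 1
    for i in range(6):
        if rb & (1 << i):
            x = dorow(imm8, x, 1 << i)
    return x

assert grevlut(0, 0b10, 0b01101100) == 0x1111_1111_1111_1111
```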
521
522 A variant may specify different LUT-pairs per row,
523 using one byte of RB for each. If it is desired that
524 a particular row-crossover shall not be applied it is
525 a simple matter to set the appropriate LUT-pair in RB
526 to effect an identity transform for that row (`0b11001010`).
527
528 ```
529 uint64_t grevlutr(uint64_t RA, uint64_t RB, bool iv, bool is32b)
530 {
531 uint64_t x = 0x5555_5555_5555_5555;
532 if (RA != 0) x = GPR(RA);
533 if (iv) x = ~x;
534 for i in 0 to (6-is32b)
535 step = 1<<i
536 imm = (RB>>(i*8))&0xff
537 x = dorow(imm, x, step, is32b)
538 return x;
539 }
540
541 ```
542
543 | 0.5|6.10|11.15|16.20 |21..28 | 29.30|31| name |
544 | -- | -- | --- | --- | ----- | -----|--| ------ |
545 | NN | RT | RA | s0-4 | im0-7 | 1 iv |s5| grevlogi |
546 | NN | RT | RA | RB | im0-7 | 01 |0 | grevlog | |
547 | NN | RT | RA | RB | im0-7 | 01 |1 | grevlogw | |
548
549 # grev
550
superseded by grevlut
552
553 based on RV bitmanip, this is also known as a butterfly network. however
554 where a butterfly network allows setting of every crossbar setting in
555 every row and every column, generalised-reverse (grev) only allows
556 a per-row decision: every entry in the same row must either switch or
557 not-switch.
558
559 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
560
561 ```
562 uint64_t grev64(uint64_t RA, uint64_t RB)
563 {
564 uint64_t x = RA;
565 int shamt = RB & 63;
566 if (shamt & 1) x = ((x & 0x5555555555555555LL) << 1) |
567 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
568 if (shamt & 2) x = ((x & 0x3333333333333333LL) << 2) |
569 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
570 if (shamt & 4) x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
571 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
572 if (shamt & 8) x = ((x & 0x00FF00FF00FF00FFLL) << 8) |
573 ((x & 0xFF00FF00FF00FF00LL) >> 8);
574 if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
575 ((x & 0xFFFF0000FFFF0000LL) >> 16);
576 if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
577 ((x & 0xFFFFFFFF00000000LL) >> 32);
578 return x;
579 }
580
581 ```
582
583 # gorc
584
based on RV bitmanip, gorc is superseded by grevlut
586
587 ```
588 uint32_t gorc32(uint32_t RA, uint32_t RB)
589 {
590 uint32_t x = RA;
591 int shamt = RB & 31;
592 if (shamt & 1) x |= ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
593 if (shamt & 2) x |= ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
594 if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
595 if (shamt & 8) x |= ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
596 if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
597 return x;
598 }
599 uint64_t gorc64(uint64_t RA, uint64_t RB)
600 {
601 uint64_t x = RA;
602 int shamt = RB & 63;
603 if (shamt & 1) x |= ((x & 0x5555555555555555LL) << 1) |
604 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
605 if (shamt & 2) x |= ((x & 0x3333333333333333LL) << 2) |
606 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
607 if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
608 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
609 if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL) << 8) |
610 ((x & 0xFF00FF00FF00FF00LL) >> 8);
611 if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
612 ((x & 0xFFFF0000FFFF0000LL) >> 16);
613 if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
614 ((x & 0xFFFFFFFF00000000LL) >> 32);
615 return x;
616 }
617
618 ```
619
620 # xperm
621
622 based on RV bitmanip.
623
624 RA contains a vector of indices to select parts of RB to be
625 copied to RT. The immediate-variant allows up to an 8 bit
pattern (repeated) to be targeted at different parts of RT.

xperm shares some similarity with one of the uses of bmator,
in that xperm indices use binary addressing whereas bmator
may be considered to use unary addressing.
631
632 ```
633 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
634 {
635 uint_xlen_t r = 0;
636 uint_xlen_t sz = 1LL << sz_log2;
637 uint_xlen_t mask = (1LL << sz) - 1;
638 uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
639 for (int i = 0; i < XLEN; i += sz) {
640 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
641 if (pos < XLEN)
642 r |= ((RB >> pos) & mask) << i;
643 }
644 return r;
645 }
646 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
647 {
648 uint_xlen_t r = 0;
649 uint_xlen_t sz = 1LL << sz_log2;
650 uint_xlen_t mask = (1LL << sz) - 1;
651 for (int i = 0; i < XLEN; i += sz) {
652 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
653 if (pos < XLEN)
654 r |= ((RB >> pos) & mask) << i;
655 }
656 return r;
657 }
658 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
659 { return xperm(RA, RB, 2); }
660 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
661 { return xperm(RA, RB, 3); }
662 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
663 { return xperm(RA, RB, 4); }
664 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
665 { return xperm(RA, RB, 5); }
666 ```
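
A worked example (Python, illustration only) of byte-sized xperm
(`xperm_b`) used as a byte shuffle, here reversing the bytes of RB:

```python
# index vector 0x00_01_02_03_04_05_06_07 selects RB's bytes in reverse order
def xperm(ra, rb, sz_log2, xlen=64):
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    r = 0
    for i in range(0, xlen, sz):
        pos = ((ra >> i) & mask) << sz_log2
        if pos < xlen:
            r |= ((rb >> pos) & mask) << i
    return r

assert xperm(0x0001020304050607, 0x0123456789ABCDEF, 3) == 0xEFCDAB8967452301
```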
667
# bitmatrix

These operations treat a 64-bit register as an 8x8 bit matrix (one byte per row): bmatflip performs a transpose, and bmatxor/bmator/bmatand perform a matrix "multiply" in which the per-element combine is AND and the row-column reduction is XOR, OR or AND respectively.

670 ```
671 uint64_t bmatflip(uint64_t RA)
672 {
673 uint64_t x = RA;
674 x = shfl64(x, 31);
675 x = shfl64(x, 31);
676 x = shfl64(x, 31);
677 return x;
678 }
679 uint64_t bmatxor(uint64_t RA, uint64_t RB)
680 {
681 // transpose of RB
682 uint64_t RBt = bmatflip(RB);
683 uint8_t u[8]; // rows of RA
684 uint8_t v[8]; // cols of RB
685 for (int i = 0; i < 8; i++) {
686 u[i] = RA >> (i*8);
687 v[i] = RBt >> (i*8);
688 }
689 uint64_t x = 0;
690 for (int i = 0; i < 64; i++) {
691 if (pcnt(u[i / 8] & v[i % 8]) & 1)
692 x |= 1LL << i;
693 }
694 return x;
695 }
696 uint64_t bmator(uint64_t RA, uint64_t RB)
697 {
698 // transpose of RB
699 uint64_t RBt = bmatflip(RB);
700 uint8_t u[8]; // rows of RA
701 uint8_t v[8]; // cols of RB
702 for (int i = 0; i < 8; i++) {
703 u[i] = RA >> (i*8);
704 v[i] = RBt >> (i*8);
705 }
706 uint64_t x = 0;
707 for (int i = 0; i < 64; i++) {
708 if ((u[i / 8] & v[i % 8]) != 0)
709 x |= 1LL << i;
710 }
711 return x;
712 }
713 uint64_t bmatand(uint64_t RA, uint64_t RB)
714 {
715 // transpose of RB
716 uint64_t RBt = bmatflip(RB);
717 uint8_t u[8]; // rows of RA
718 uint8_t v[8]; // cols of RB
719 for (int i = 0; i < 8; i++) {
720 u[i] = RA >> (i*8);
721 v[i] = RBt >> (i*8);
722 }
723 uint64_t x = 0;
724 for (int i = 0; i < 64; i++) {
725 if ((u[i / 8] & v[i % 8]) == 0xff)
726 x |= 1LL << i;
727 }
728 return x;
729 }
730 ```
731
732 # Introduction to Carry-less and GF arithmetic
733
734 * obligatory xkcd <https://xkcd.com/2595/>
735
736 There are three completely separate types of Galois-Field-based arithmetic
737 that we implement which are not well explained even in introductory
738 literature. A slightly oversimplified explanation is followed by more
739 accurate descriptions:
740
741 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
742 but is accidentally referred to as GF(2) - see below as to why.
743 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
744 Galois Fields
745 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
746 (2^N) and a second "reducing" polynomial (similar to a prime number), these
747 are said to be GF(2^N) arithmetic.
748
749 further detailed and more precise explanations are provided below
750
751 * **Polynomials with coefficients in `GF(2)`**
752 (aka. Carry-less arithmetic -- the `cl*` instructions).
753 This isn't actually a Galois Field, but its coefficients are. This is
754 basically binary integer addition, subtraction, and multiplication like
755 usual, except that carries aren't propagated at all, effectively turning
756 both addition and subtraction into the bitwise xor operation. Division and
757 remainder are defined to match how addition and multiplication works.
758 * **Galois Fields with a prime size**
759 (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
760 This is basically just the integers mod `p`.
761 * **Galois Fields with a power-of-a-prime size**
762 (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
763 integer `n > 0`).
764 We only implement these for `p == 2`, called Binary Galois Fields
765 (`GF(2^n)` -- the `gfb*` instructions).
766 For any prime `p`, `GF(p^n)` is implemented as polynomials with
767 coefficients in `GF(p)` and degree `< n`, where the polynomials are the
remainders of dividing by a specifically chosen polynomial in `GF(p)` called
769 the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
Polynomial must be an irreducible polynomial (like primes, but for
771 polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
772 and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
773 affect `GF(p^n)`'s mathematical shape, all that changes is the specific
774 polynomials used to implement `GF(p^n)`.
775
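A tiny numeric illustration of the difference between the three
(Python, orientation only; the reducing polynomial `0b1011`, i.e.
x^3+x+1, and the prime 7 are arbitrary examples):

```python
# carry-less multiply: polynomial multiply over GF(2), no carries
def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# reduce a polynomial modulo a reducing polynomial (here x^3 + x + 1)
def gf2n_reduce(x, red_poly=0b1011):
    while x.bit_length() >= red_poly.bit_length():
        x ^= red_poly << (x.bit_length() - red_poly.bit_length())
    return x

print(3 * 5)                     # 15: ordinary integer multiply
print(clmul(3, 5))               # 15: happens to match, as no carries occur
print((3 * 5) % 7)               # 1:  GF(p) with p=7
print(gf2n_reduce(clmul(3, 5)))  # 4:  GF(2^3) with red_poly = x^3 + x + 1
```
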
776 Many implementations and much of the literature do not make a clear
777 distinction between these three categories, which makes it confusing
778 to understand what their purpose and value is.
779
780 * carry-less multiply is extremely common and is used for the ubiquitous
781 CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
782 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
783 has significant uses throughout cryptography
784 * GF(p) is the basis again of a significant quantity of algorithms
785 (TODO, list them, jacob knows what they are), even though the
786 modulo is limited to be below 64-bit (size of a scalar int)
787
788 # Instructions for Carry-less Operations
789
790 aka. Polynomials with coefficients in `GF(2)`
791
792 Carry-less addition/subtraction is simply XOR, so a `cladd`
793 instruction is not provided since the `xor[i]` instruction can be used instead.
794
795 These are operations on polynomials with coefficients in `GF(2)`, with the
796 polynomial's coefficients packed into integers with the following algorithm:
797
798 ```python
799 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
800 ```
801
802 ## Carry-less Multiply Instructions
803
804 based on RV bitmanip
805 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
806 <https://www.felixcloutier.com/x86/pclmulqdq> and
807 <https://en.m.wikipedia.org/wiki/Carry-less_product>
808
809 They are worth adding as their own non-overwrite operations
810 (in the same pipeline).
811
812 ### `clmul` Carry-less Multiply
813
814 ```python
815 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
816 ```
817
818 ### `clmulh` Carry-less Multiply High
819
820 ```python
821 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
822 ```
823
824 ### `clmulr` Carry-less Multiply (Reversed)
825
826 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
827 bit-reversed inputs.
828
829 ```python
830 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
831 ```
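
The equivalence stated above can be checked directly; a sketch
(Python, illustration only, width-parameterised):

```python
# clmulr(a, b) == bit-reverse(clmul(bit-reverse(a), bit-reverse(b))),
# with all reversals and the final truncation being w bits wide
def brev(x, w):
    return int(format(x, f'0{w}b')[::-1], 2)

def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def clmulr(a, b, w=64):
    return brev(clmul(brev(a, w), brev(b, w)) & ((1 << w) - 1), w)

assert clmulr(0b1001, 0b1001, 4) == 0b1000
```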
832
833 ## `clmadd` Carry-less Multiply-Add
834
835 ```
836 clmadd RT, RA, RB, RC
837 ```
838
839 ```
840 (RT) = clmul((RA), (RB)) ^ (RC)
841 ```
842
843 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
844
845 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
846 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
847 to avoid the need for a temp register. RS is written to as well as RT.
848
Note: Polynomials over GF(2) form a Ring rather than a Field. Because the
definition of the Inverse Discrete Fourier Transform involves calculating a
multiplicative inverse, which may not exist in every Ring, the
Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
TODO: check with someone who knows for sure if that's correct.)
855
856 ```
857 cltmadd RT, RA, RB, RC
858 ```
859
860 TODO: add link to explanation for where `RS` comes from.
861
862 ```
863 a = (RA)
864 c = (RC)
865 # read all inputs before writing to any outputs in case
866 # an input overlaps with an output register.
867 (RT) = clmul(a, (RB)) ^ c
868 (RS) = a ^ c
869 ```
870
871 ## `cldivrem` Carry-less Division and Remainder
872
873 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
874 for other instructions.
875
876 ```python
877 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
878 ```
879
880 ## `cldiv` Carry-less Division
881
882 ```
883 cldiv RT, RA, RB
884 ```
885
886 ```
887 n = (RA)
888 d = (RB)
889 q, r = cldivrem(n, d, width=XLEN)
890 (RT) = q
891 ```
892
893 ## `clrem` Carry-less Remainder
894
895 ```
896 clrem RT, RA, RB
897 ```
898
899 ```
900 n = (RA)
901 d = (RB)
902 q, r = cldivrem(n, d, width=XLEN)
903 (RT) = r
904 ```
905
906 # Instructions for Binary Galois Fields `GF(2^m)`
907
908 see:
909
910 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
911 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
912 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
913
914 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
915 instruction is not provided since the `xor[i]` instruction can be used instead.
916
917 ## `GFBREDPOLY` SPR -- Reducing Polynomial
918
919 In order to save registers and to make operations orthogonal with standard
920 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
921 This also allows hardware to pre-compute useful parameters (such as the
922 degree, or look-up tables) based on the reducing polynomial, and store them
923 alongside the SPR in hidden registers, only recomputing them whenever the SPR
924 is written to, rather than having to recompute those values for every
925 instruction.
926
Because Galois Fields require the reducing polynomial to be an irreducible
polynomial, any reducing polynomial of `degree > 1` must have the LSB set,
since otherwise it would be divisible by the polynomial `x`, making it
reducible and making whatever we're working on no longer a Field.
Therefore, we can reuse the LSB to indicate `degree == XLEN`.
932
933 ```python
934 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
935 ```
936
937 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
938
939 unless this is an immediate op, `mtspr` is completely sufficient.
940
941 ```python
942 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
943 ```
944
945 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
946
947 ```
948 gfbmul RT, RA, RB
949 ```
950
951 ```python
952 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
953 ```
954
955 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
956
957 ```
958 gfbmadd RT, RA, RB, RC
959 ```
960
961 ```python
962 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
963 ```
964
965 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
966
967 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
968 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
969 for a temp register. RS is written to as well as RT.
970
971 ```
972 gfbtmadd RT, RA, RB, RC
973 ```
974
975 TODO: add link to explanation for where `RS` comes from.
976
977 ```
978 a = (RA)
979 c = (RC)
980 # read all inputs before writing to any outputs in case
981 # an input overlaps with an output register.
982 (RT) = gfbmadd(a, (RB), c)
983 # use gfbmadd again since it reduces the result
984 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
985 ```
986
987 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
988
989 ```
990 gfbinv RT, RA
991 ```
992
993 ```python
994 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
995 ```
996
997 # Instructions for Prime Galois Fields `GF(p)`
998
999 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
1000
1001 ## `gfpadd` Prime Galois Field `GF(p)` Addition
1002
1003 ```
1004 gfpadd RT, RA, RB
1005 ```
1006
1007 ```python
1008 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
1009 ```
1010
1011 the addition happens on infinite-precision integers
1012
1013 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
1014
1015 ```
1016 gfpsub RT, RA, RB
1017 ```
1018
1019 ```python
1020 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
1021 ```
1022
1023 the subtraction happens on infinite-precision integers
1024
1025 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
1026
1027 ```
1028 gfpmul RT, RA, RB
1029 ```
1030
1031 ```python
1032 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
1033 ```
1034
1035 the multiplication happens on infinite-precision integers
1036
1037 ## `gfpinv` Prime Galois Field `GF(p)` Invert
1038
1039 ```
1040 gfpinv RT, RA
1041 ```
1042
1043 Some potential hardware implementations are found in:
1044 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
1045
1046 ```python
1047 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
1048 ```
1049
1050 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
1051
1052 ```
1053 gfpmadd RT, RA, RB, RC
1054 ```
1055
1056 ```python
1057 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
1058 ```
1059
1060 the multiplication and addition happens on infinite-precision integers
1061
1062 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
1063
1064 ```
1065 gfpmsub RT, RA, RB, RC
1066 ```
1067
1068 ```python
1069 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
1070 ```
1071
1072 the multiplication and subtraction happens on infinite-precision integers
1073
1074 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
1075
1076 ```
1077 gfpmsubr RT, RA, RB, RC
1078 ```
1079
1080 ```python
1081 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
1082 ```
1083
1084 the multiplication and subtraction happens on infinite-precision integers
1085
1086 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
1087
1088 Used in combination with SV FFT REMAP to perform
1089 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
1090 to avoid the need for a temp register. RS is written
1091 to as well as RT.
1092
1093 ```
1094 gfpmaddsubr RT, RA, RB, RC
1095 ```
1096
1097 TODO: add link to explanation for where `RS` comes from.
1098
1099 ```
1100 factor1 = (RA)
1101 factor2 = (RB)
1102 term = (RC)
1103 # read all inputs before writing to any outputs in case
1104 # an input overlaps with an output register.
1105 (RT) = gfpmadd(factor1, factor2, term)
1106 (RS) = gfpmsubr(factor1, factor2, term)
1107 ```
1108
1109 # Already in POWER ISA or subsumed
1110
Lists operations that are either included as part of
other bitmanip operations, or are already in
the Power ISA.
1114
1115 ## cmix
1116
1117 based on RV bitmanip, covered by ternlog bitops
1118
1119 ```
1120 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
1121 return (RA & RB) | (RC & ~RB);
1122 }
1123 ```
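
For illustration: cmix is a per-bit multiplexer, so under the `lut3`
bit-ordering used earlier (idx = c<<2 | b<<1 | a, with a=RB, b=RA and
c=RT pre-loaded with the cmix RC operand) the equivalent ternlogi
immediate works out to 0b11011000. A quick sketch checking that claim
(Python, not part of the spec):

```python
def lut3(imm, a, b, c):
    return (imm >> ((c << 2) | (b << 1) | a)) & 1

def cmix_bit(ra, rb, rc):
    return (ra & rb) | (rc & ~rb & 1)

assert all(lut3(0b11011000, rb, ra, rc) == cmix_bit(ra, rb, rc)
           for ra in (0, 1) for rb in (0, 1) for rc in (0, 1))
```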
1124
1125 ## count leading/trailing zeros with mask
1126
1127 in v3.1 p105
1128
1129 ```
count ← 0
do i = 0 to 63
    if ((RB)_i = 1) then do
        if ((RS)_i = 1) then break
        count ← count + 1
    end
RA ← EXTZ64(count)
1134 ```
1135
1136 ## bit deposit
1137
pdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
1139
    do while (m < 64)
        if VSR[VRB+32].dword[i].bit[63-m] = 1 then do
            result = VSR[VRA+32].dword[i].bit[63-k]
            VSR[VRT+32].dword[i].bit[63-m] = result
            k = k + 1
        end
        m = m + 1
1146
1147 ```
1148
1149 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1150 {
1151 uint_xlen_t r = 0;
1152 for (int i = 0, j = 0; i < XLEN; i++)
1153 if ((RB >> i) & 1) {
1154 if ((RA >> j) & 1)
1155 r |= uint_xlen_t(1) << i;
1156 j++;
1157 }
1158 return r;
1159 }
1160
1161 ```
1162
1163 ## bit extract
1164
1165 other way round: identical to RV bext: pextd, found in v3.1 p196
1166
1167 ```
1168 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1169 {
1170 uint_xlen_t r = 0;
1171 for (int i = 0, j = 0; i < XLEN; i++)
1172 if ((RB >> i) & 1) {
1173 if ((RA >> i) & 1)
1174 r |= uint_xlen_t(1) << j;
1175 j++;
1176 }
1177 return r;
1178 }
1179 ```
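
A small worked example (Python, illustration only) of the gather
behaviour:

```python
# gather the bits of RA selected by mask RB into the low-order bits
def bext(ra, rb, xlen=64):
    r, j = 0, 0
    for i in range(xlen):
        if (rb >> i) & 1:
            r |= ((ra >> i) & 1) << j
            j += 1
    return r

assert bext(0b10110010, 0b11110000) == 0b1011
```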
1180
1181 ## centrifuge
1182
1183 found in v3.1 p106 so not to be added here
1184
1185 ```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if ((RB)_i = 0) then do
        result_(ptr0) = (RS)_i
        ptr0 = ptr0 + 1
    end
    if ((RB)_(63-i) = 1) then do
        result_(63-ptr1) = (RS)_(63-i)
        ptr1 = ptr1 + 1
    end
RA = result
1198 ```
1199
1200 ## bit to byte permute
1201
similar to the matrix permute in RV bitmanip, which has XOR and OR variants,
these perform a transpose (bmatflip).
TODO: this looks like VSX; check whether a scalar variant
is in v3.0/1 already
1206
1207 do j = 0 to 7
1208 do k = 0 to 7
1209 b = VSR[VRB+32].dword[i].byte[k].bit[j]
1210 VSR[VRT+32].dword[i].byte[j].bit[k] = b
1211
1212 # Appendix
1213
1214 see [[bitmanip/appendix]]
1215