a2b58daa6fdd0d3127ce3bdff2f927dd6a314962
[libreriscv.git] / openpower / sv / bitmanip.mdwn
1 [[!tag standards]]
2
3 # Implementation Log
4
5 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
6 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
7 * remove Rc=1 from ternlog due to conflicts in encoding as well
8 as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
10
11 # bitmanipulation
12
13 **DRAFT STATUS**
14
15 pseudocode: <https://libre-soc.org/openpower/isa/bitmanip/>
16
17 this extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX. Vectorisation and SIMD are removed: these are straight scalar (element) operations making them suitable for embedded applications.
18 Vectorisation Context is provided by [[openpower/sv]].
19
20 When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years). Also, VSX is hundreds of opcodes, requires 128 bit pathways, and is wholly unsuited to low power or embedded scenarios.
21
22 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD". It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
23
24 general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science. however for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added. The expectation is that these operations would all be covered by the same pipeline.
25
26 note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], and
27 the [[sv/av_opcodes]] as well as [[sv/setvl]]
28
29 Useful resource:
30
31 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
32 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
33
34 # summary
35
36 two major opcodes are needed
37
38 ternlog has its own major opcode
39
40 | 29.30 |31| name |
41 | ------ |--| --------- |
42 | 00 |Rc| ternlogi |
43 | 01 |0 | ternlog |
44 | 01 |1 | ternlogv |
45 | 10 |0 | crternlog |
46
47 2nd major opcode for other bitmanip: minor opcode allocation
48
49 | 28.30 |31| name |
50 | ------ |--| --------- |
51 | -00 |0 | |
52 | -00 |1 | grevlog |
53 | -01 | | grevlogi |
54 | 010 |Rc| bitmask |
55 | 011 | | gf/cl madd* |
56 | 110 |Rc| 1/2-op |
57 | 111 | | |
58
59
60 1-op and variants
61
62 | dest | src1 | subop | op |
63 | ---- | ---- | ----- | -------- |
64 | RT | RA | .. | bmatflip |
65
66 2-op and variants
67
68 | dest | src1 | src2 | subop | op |
69 | ---- | ---- | ---- | ----- | -------- |
70 | RT | RA | RB | or | bmatflip |
71 | RT | RA | RB | xor | bmatflip |
72 | RT | RA | RB | | grev |
73 | RT | RA | RB | | clmul* |
74 | RT | RA | RB | | gorc |
75 | RT | RA | RB | shuf | shuffle |
76 | RT | RA | RB | unshuf| shuffle |
77 | RT | RA | RB | width | xperm |
78 | RT | RA | RB | type | minmax |
79 | RT | RA | RB | | av abs avgadd |
80 | RT | RA | RB | type | vmask ops |
81 | RT | RA | RB | | |
82
83 3 ops
84
85 * grevlog
86 * GF mul-add
87 * bitmask-reverse
88
89 TODO: convert all instructions to use RT and not RS
90
91 | 0.5|6.10|11.15|16.20 |21..25 | 26....30 |31| name |
92 | -- | -- | --- | --- | ----- | -------- |--| ------ |
93 | NN | RT | RA | RB | | 00 |0 | rsvd |
94 | NN | RT | RA | RB | im0-4 | im5-7 00 |1 | grevlog |
95 | NN | RT | RA | s0-4 | im0-4 | im5-7 01 |s5| grevlogi |
96 | NN | RT | RA | RB | sh0-4 | sh5 1 010 |Rc| bmrevi |
97 | NN | RS | RA | RB | RC | 00 011 |0 | gfbmadd |
98 | NN | RS | RA | RB | RC | 00 011 |1 | gfbmaddsub |
99 | NN | RS | RA | RB | RC | 01 011 |0 | clmadd |
100 | NN | RS | RA | RB | RC | 01 011 |1 | clmaddsub |
101 | NN | RS | RA | RB | RC | 10 011 |0 | gfpmadd |
102 | NN | RS | RA | RB | RC | 10 011 |1 | gfpmaddsub |
103 | NN | RS | RA | RB | RC | 11 011 | | rsvd |
104
105 ops (note that av avg and abs as well as vec scalar mask
106 are included here)
107
108 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
109 double check that instructions didn't need 3 inputs.
110
111 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
112 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
113 | NN | RA | RB | | 0 | | 0000 110 |Rc| rsvd |
114 | NN | RA | RB | RC | 1 | itype | 0000 110 |Rc| xperm |
115 | NN | RA | RB | RC | 0 | itype | 0100 110 |Rc| minmax |
116 | NN | RA | RB | RC | 1 | 00 | 0100 110 |Rc| av avgadd |
117 | NN | RA | RB | RC | 1 | 01 | 0100 110 |Rc| av abs |
118 | NN | RA | RB | | 1 | 10 | 0100 110 |Rc| rsvd |
119 | NN | RA | RB | | 1 | 11 | 0100 110 |Rc| rsvd |
120 | NN | RA | RB | sh | SH | itype | 1000 110 |Rc| bmopsi |
121 | NN | RA | RB | | | | 1100 110 |Rc| rsvd |
122 | NN | RT | RA | RB | 1 | 00 | 0001 110 |Rc| cldiv |
123 | NN | RT | RA | RB | 1 | 01 | 0001 110 |Rc| clmod |
124 | NN | RT | RA | RB | 1 | 10 | 0001 110 |Rc| |
125 | NN | RT | RA | RB | 1 | 11 | 0001 110 |Rc| clinv |
126 | NN | RA | RB | RC | 0 | 00 | 0001 110 |Rc| vec sbfm |
127 | NN | RA | RB | RC | 0 | 01 | 0001 110 |Rc| vec sofm |
128 | NN | RA | RB | RC | 0 | 10 | 0001 110 |Rc| vec sifm |
129 | NN | RA | RB | RC | 0 | 11 | 0001 110 |Rc| vec cprop |
130 | NN | RA | RB | | 0 | | 0101 110 |Rc| rsvd |
131 | NN | RA | RB | RC | 0 | 00 | 0010 110 |Rc| gorc |
132 | NN | RA | RB | sh | SH | 00 | 1010 110 |Rc| gorci |
133 | NN | RA | RB | RC | 0 | 00 | 0110 110 |Rc| gorcw |
134 | NN | RA | RB | sh | 0 | 00 | 1110 110 |Rc| gorcwi |
135 | NN | RA | RB | RC | 1 | 00 | 1110 110 |Rc| bmator |
136 | NN | RA | RB | RC | 0 | 01 | 0010 110 |Rc| grev |
137 | NN | RA | RB | RC | 1 | 01 | 0010 110 |Rc| clmul |
138 | NN | RA | RB | sh | SH | 01 | 1010 110 |Rc| grevi |
139 | NN | RA | RB | RC | 0 | 01 | 0110 110 |Rc| grevw |
140 | NN | RA | RB | sh | 0 | 01 | 1110 110 |Rc| grevwi |
141 | NN | RA | RB | RC | 1 | 01 | 1110 110 |Rc| bmatxor |
142 | NN | RA | RB | RC | 0 | 10 | 0010 110 |Rc| shfl |
143 | NN | RA | RB | sh | SH | 10 | 1010 110 |Rc| shfli |
144 | NN | RA | RB | RC | 0 | 10 | 0110 110 |Rc| shflw |
145 | NN | RA | RB | RC | | 10 | 1110 110 |Rc| rsvd |
146 | NN | RA | RB | RC | 0 | 11 | 1110 110 |Rc| clmulr |
147 | NN | RA | RB | RC | 1 | 11 | 1110 110 |Rc| clmulh |
148 | NN | | | | | | --11 110 |Rc| setvl |
149
150 # ternlog bitops
151
152 Similar to FPGA LUTs: for every bit perform a lookup into a table using an 8bit immediate, or in another register.
153
154 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
155
156 ## ternlogi
157
158 TODO: if/when we get more encoding space, add Rc=1 option back to ternlogi, for consistency with OpenPower base logical instructions (and./xor./or./etc.). <https://bugs.libre-soc.org/show_bug.cgi?id=745#c56>
159
160 | 0.5|6.10|11.15|16.20| 21..25| 26..30 |31|
161 | -- | -- | --- | --- | ----- | -------- |--|
162 | NN | RT | RA | RB | im0-4 | im5-7 00 |Rc|
163
164 lut3(imm, a, b, c):
165 idx = c << 2 | b << 1 | a
166 return imm[idx] # idx by LSB0 order
167
168 for i in range(64):
169 RT[i] = lut3(imm, RB[i], RA[i], RT[i])
170
171 bits 21..22 may be used to specify a mode, such as treating the whole integer zero/nonzero and putting 1/0 in the result, rather than bitwise test.
172
173 ## ternlog
174
175 a 5 operand variant which becomes more along the lines of an FPGA,
176 this is very expensive: 4 in and 1 out
177
178 | 0.5|6.10|11.15|16.20|21.25| 26...30 |31|
179 | -- | -- | --- | --- | --- | -------- |--|
180 | NN | RT | RA | RB | RC | mode 01 |1 |
181
182 for i in range(64):
183 j = (i//8)*8 # 0,8,16,24,..,56
184 lookup = RC[j:j+8]
185 RT[i] = lut3(lookup, RT[i], RA[i], RB[i])
186
187 mode (3 bit) may be used to do inversion of ordering, similar to carryless mul,
188 3 modes.
189
190 ## ternlogv
191
192 also, another possible variant involving swizzle-like selection
193 and masking, this only requires 2 64 bit registers (RA, RT) and
194 only up to 16 LUT3s
195
196 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
197 | -- | -- | --- | ----- | ---- | ----- |--|
198 | NN | RT | RA | idx0-3| mask | sz 01 |0 |
199
200 SZ = (1+sz) * 8 # 8 or 16
201 raoff = MIN(XLEN, idx0 * SZ)
202 rboff = MIN(XLEN, idx1 * SZ)
203 rcoff = MIN(XLEN, idx2 * SZ)
204 imoff = MIN(XLEN, idx3 * SZ)
205 imm = RA[imoff:imoff+SZ]
206 for i in range(MIN(XLEN, SZ)):
207 ra = RA[raoff+i]
208 rb = RA[rboff+i]
209 rc = RA[rcoff+i]
210 res = lut3(imm, ra, rb, rc)
211 for j in range(MIN(XLEN//8, 4)):
212 if mask[j]: RT[i+j*SZ] = res
213
214 ## ternlogcr
215
216 another mode selection would be CRs not Ints.
217
218 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
219 | -- | -- | --- | --- | --- |-----|----- | -----|--|
220 | NN | BT | BA | BB | BC |m0-3 | imm | 10 |m4|
221
222 mask = m0-3,m4
223 for i in range(4):
224 if not mask[i]: continue
225 crregs[BT][i] = lut3(imm,
226 crregs[BA][i],
227 crregs[BB][i],
228 crregs[BC][i])
229
230
231 # int min/max
232
233 signed and unsigned min/max for integer. this is sort-of partly synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not both signed and unsigned. when the dest is also one of the sources and the mv fails due to the CR bittest failing this will only overwrite the dest where the src is greater (or less).
234
235 signed/unsigned min/max gives more flexibility.
236
237 ```
238 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
239 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
240 }
241 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
242 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
243 }
244 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
245 { return rs1 < rs2 ? rs1 : rs2;
246 }
247 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
248 { return rs1 > rs2 ? rs1 : rs2;
249 }
250 ```
251
252
253 ## cmix
254
255 based on RV bitmanip, covered by ternlog bitops
256
257 ```
258 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
259 return (RA & RB) | (RC & ~RB);
260 }
261 ```
262
263
264 # bitmask set
265
266 based on RV bitmanip singlebit set, instruction format similar to shift
267 [[isa/fixedshift]]. bmext is actually covered already (shift-with-mask rldicl but only immediate version).
268 however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.
269
270 bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, where rldicl is a left rotate. for the immediate version this does not matter, so a bmexti is not required.
271 bmrev however there is no direct equivalent and consequently a bmrevi is required.
272
273 bmset (register for mask amount) is particularly useful for creating
274 predicate masks where the length is a dynamic runtime quantity.
275 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
276
277 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name |
278 | -- | -- | --- | --- | --- | ------- |--| ----- |
279 | NN | RT | RA | RB | RC | mode 010 |Rc| bm* |
280
281
282 ```
283 uint_xlen_t bmset(RA, RB, sh)
284 {
285 int shamt = RB & (XLEN - 1);
286 mask = (2<<sh)-1;
287 return RA | (mask << shamt);
288 }
289
290 uint_xlen_t bmclr(RA, RB, sh)
291 {
292 int shamt = RB & (XLEN - 1);
293 mask = (2<<sh)-1;
294 return RA & ~(mask << shamt);
295 }
296
297 uint_xlen_t bminv(RA, RB, sh)
298 {
299 int shamt = RB & (XLEN - 1);
300 mask = (2<<sh)-1;
301 return RA ^ (mask << shamt);
302 }
303
304 uint_xlen_t bmext(RA, RB, sh)
305 {
306 int shamt = RB & (XLEN - 1);
307 mask = (2<<sh)-1;
308 return mask & (RA >> shamt);
309 }
310 ```
311
312 bitmask extract with reverse. can be done by bitinverting all of RA and getting bits of RA from the opposite end.
313
314 ```
315 msb = rb[5:0];
316 rev[0:msb] = ra[msb:0];
317 rt = ZE(rev[msb:0]);
318
319 uint_xlen_t bmextrev(RA, RB, sh)
320 {
321 int shamt = (RB & (XLEN - 1));
322 shamt = (XLEN-1)-shamt; # shift other end
323 bra = bitreverse(RA) # swap LSB-MSB
324 mask = (2<<sh)-1;
325 return mask & (bra >> shamt);
326 }
327 ```
328
329 | 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name |
330 | -- | -- | --- | --- | --- | ------- |--| ------ |
331 | NN | RT | RA | RB | sh | 1 011 |Rc| bmrevi |
332
333
334 # grevlut
335
336 generalised reverse combined with a pair of LUT2s and allowing
337 zero when RA=0 provides a wide range of instructions
338 and a means to set regular 64 bit patterns in one
339 32 bit instruction.
340
341 the two LUT2s are applied left-half (when not swapping)
342 and right-half (when swapping) so as to allow a wider
343 range of options
344
345 grevlut should be arranged so as to produce the constants
346 needed to put into bext (bitextract) so as in turn to
347 be able to emulate x86 pmovmask instructions <https://www.felixcloutier.com/x86/pmovmskb>
348
349 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
350
351 ```
352 lut2(imm, a, b):
353 idx = b << 1 | a
354 return imm[idx] # idx by LSB0 order
355
356 dorow(imm8, step_i, chunk_size):
357 for j in 0 to 63:
358 if (j&chunk_size) == 0
359 imm = imm8[0..3]
360 else
361 imm = imm8[4..7]
362 step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
363 return step_o
364
365 uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 imm)
366 {
367 uint64_t x = RA;
368 int shamt = RB & 63;
369 for i in 0 to 6
370 step = 1<<i
371 if (shamt & step) x = dorow(imm, x, step)
372 return x;
373 }
374
375 ```
376
377 # grev
378
379 based on RV bitmanip, this is also known as a butterfly network. however
380 where a butterfly network allows setting of every crossbar setting in
381 every row and every column, generalised-reverse (grev) only allows
382 a per-row decision: every entry in the same row must either switch or
383 not-switch.
384
385 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
386
387 ```
388 uint64_t grev64(uint64_t RA, uint64_t RB)
389 {
390 uint64_t x = RA;
391 int shamt = RB & 63;
392 if (shamt & 1) x = ((x & 0x5555555555555555LL) << 1) |
393 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
394 if (shamt & 2) x = ((x & 0x3333333333333333LL) << 2) |
395 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
396 if (shamt & 4) x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
397 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
398 if (shamt & 8) x = ((x & 0x00FF00FF00FF00FFLL) << 8) |
399 ((x & 0xFF00FF00FF00FF00LL) >> 8);
400 if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
401 ((x & 0xFFFF0000FFFF0000LL) >> 16);
402 if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
403 ((x & 0xFFFFFFFF00000000LL) >> 32);
404 return x;
405 }
406
407 ```
408
409 # shuffle / unshuffle
410
411 based on RV bitmanip
412
413 ```
414 uint32_t shfl32(uint32_t RA, uint32_t RB)
415 {
416 uint32_t x = RA;
417 int shamt = RB & 15;
418 if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
419 if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
420 if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
421 if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
422 return x;
423 }
424 uint32_t unshfl32(uint32_t RA, uint32_t RB)
425 {
426 uint32_t x = RA;
427 int shamt = RB & 15;
428 if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
429 if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
430 if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
431 if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
432 return x;
433 }
434
435 uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
436 {
437 uint64_t x = src & ~(maskL | maskR);
438 x |= ((src << N) & maskL) | ((src >> N) & maskR);
439 return x;
440 }
441 uint64_t shfl64(uint64_t RA, uint64_t RB)
442 {
443 uint64_t x = RA;
444 int shamt = RB & 31;
445 if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
446 0x00000000ffff0000LL, 16);
447 if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
448 0x0000ff000000ff00LL, 8);
449 if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
450 0x00f000f000f000f0LL, 4);
451 if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
452 0x0c0c0c0c0c0c0c0cLL, 2);
453 if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
454 0x2222222222222222LL, 1);
455 return x;
456 }
457 uint64_t unshfl64(uint64_t RA, uint64_t RB)
458 {
459 uint64_t x = RA;
460 int shamt = RB & 31;
461 if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
462 0x2222222222222222LL, 1);
463 if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
464 0x0c0c0c0c0c0c0c0cLL, 2);
465 if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
466 0x00f000f000f000f0LL, 4);
467 if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
468 0x0000ff000000ff00LL, 8);
469 if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
470 0x00000000ffff0000LL, 16);
471 return x;
472 }
473 ```
474
475 # xperm
476
477 based on RV bitmanip
478
479 ```
480 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
481 {
482 uint_xlen_t r = 0;
483 uint_xlen_t sz = 1LL << sz_log2;
484 uint_xlen_t mask = (1LL << sz) - 1;
485 for (int i = 0; i < XLEN; i += sz) {
486 uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
487 if (pos < XLEN)
488 r |= ((RA >> pos) & mask) << i;
489 }
490 return r;
491 }
492 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
493 { return xperm(RA, RB, 2); }
494 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
495 { return xperm(RA, RB, 3); }
496 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
497 { return xperm(RA, RB, 4); }
498 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
499 { return xperm(RA, RB, 5); }
500 ```
501
502 # gorc
503
504 based on RV bitmanip
505
506 ```
507 uint32_t gorc32(uint32_t RA, uint32_t RB)
508 {
509 uint32_t x = RA;
510 int shamt = RB & 31;
511 if (shamt & 1) x |= ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
512 if (shamt & 2) x |= ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
513 if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
514 if (shamt & 8) x |= ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
515 if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
516 return x;
517 }
518 uint64_t gorc64(uint64_t RA, uint64_t RB)
519 {
520 uint64_t x = RA;
521 int shamt = RB & 63;
522 if (shamt & 1) x |= ((x & 0x5555555555555555LL) << 1) |
523 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
524 if (shamt & 2) x |= ((x & 0x3333333333333333LL) << 2) |
525 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
526 if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
527 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
528 if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL) << 8) |
529 ((x & 0xFF00FF00FF00FF00LL) >> 8);
530 if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
531 ((x & 0xFFFF0000FFFF0000LL) >> 16);
532 if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
533 ((x & 0xFFFFFFFF00000000LL) >> 32);
534 return x;
535 }
536
537 ```
538
539 # Instructions for Carry-less Operations aka. Polynomials with coefficients in `GF(2)`
540
541 Carry-less addition/subtraction is simply XOR, so a `cladd`
542 instruction is not provided since the `xor[i]` instruction can be used instead.
543
544 These are operations on polynomials with coefficients in `GF(2)`, with the
545 polynomial's coefficients packed into integers with the following algorithm:
546
547 ```python
548 def pack_poly(poly):
549 """`poly` is a list where `poly[i]` is the coefficient for `x ** i`"""
550 retval = 0
551 for i, v in enumerate(poly):
552 retval |= v << i
553 return retval
554
555 def unpack_poly(v):
556 """returns a list `poly`, where `poly[i]` is the coefficient for `x ** i`.
557 """
558 poly = []
559 while v != 0:
560 poly.append(v & 1)
561 v >>= 1
562 return poly
563 ```
564
565 ## Carry-less Multiply Instructions
566
567 based on RV bitmanip
568 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
569 <https://www.felixcloutier.com/x86/pclmulqdq> and
570 <https://en.m.wikipedia.org/wiki/Carry-less_product>
571
572 They are worth adding as their own non-overwrite operations
573 (in the same pipeline).
574
575 ### `clmul` Carry-less Multiply
576
577 ```c
578 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
579 {
580 uint_xlen_t x = 0;
581 for (int i = 0; i < XLEN; i++)
582 if ((RB >> i) & 1)
583 x ^= RA << i;
584 return x;
585 }
586 ```
587
588 ### `clmulh` Carry-less Multiply High
589
590 ```c
591 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
592 {
593 uint_xlen_t x = 0;
594 for (int i = 1; i < XLEN; i++)
595 if ((RB >> i) & 1)
596 x ^= RA >> (XLEN-i);
597 return x;
598 }
599 ```
600
601 ### `clmulr` Carry-less Multiply (Reversed)
602
603 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
604 bit-reversed inputs.
605
606 ```c
607 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
608 {
609 uint_xlen_t x = 0;
610 for (int i = 0; i < XLEN; i++)
611 if ((RB >> i) & 1)
612 x ^= RA >> (XLEN-i-1);
613 return x;
614 }
615 ```
616
617 ## `clmadd` Carry-less Multiply-Add
618
619 ```
620 clmadd RT, RA, RB, RC
621 ```
622
623 ```
624 (RT) = clmul((RA), (RB)) ^ (RC)
625 ```
626
627 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
628
629 ```
630 cltmadd RT, RA, RB, RC
631 ```
632
633 TODO: add link to explanation for where `RS` comes from.
634
635 ```
636 temp = clmul((RA), (RB)) ^ (RC)
637 (RT) = temp
638 (RS) = temp
639 ```
640
641 ## `cldiv` Carry-less Division
642
643 ```
644 cldiv RT, RA, RB
645 ```
646
647 TODO: decide what happens on division by zero
648
649 ```
650 (RT) = cldiv((RA), (RB))
651 ```
652
653 ## `clrem` Carry-less Remainder
654
655 ```
656 clrem RT, RA, RB
657 ```
658
659 TODO: decide what happens on division by zero
660
661 ```
662 (RT) = clrem((RA), (RB))
663 ```
664
665 # Instructions for Binary Galois Fields `GF(2^m)`
666
667 see:
668
669 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
670 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
671 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
672
673 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
674 instruction is not provided since the `xor[i]` instruction can be used instead.
675
676 ## `GFBREDPOLY` SPR -- Reducing Polynomial
677
678 In order to save registers and to make operations orthogonal with standard
679 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
680 This also allows hardware to pre-compute useful parameters (such as the
681 degree, or look-up tables) based on the reducing polynomial, and store them
682 alongside the SPR in hidden registers, only recomputing them whenever the SPR
683 is written to, rather than having to recompute those values for every
684 instruction.
685
686 Because Galois Fields require the reducing polynomial to be an irreducible
687 polynomial, that guarantees that any polynomial of `degree > 1` must have
688 the LSB set, since otherwise it would be divisible by the polynomial `x`,
689 making it reducible, making whatever we're working on no longer a Field.
690 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
691
692 ```python
693 def decode_reducing_polynomial(GFBREDPOLY, XLEN):
694 """returns the decoded coefficient list in LSB to MSB order,
695 len(retval) == degree + 1"""
696 v = GFBREDPOLY & ((1 << XLEN) - 1) # mask to XLEN bits
697 if v == 0 or v == 2: # GF(2)
698 return [0, 1] # degree = 1, poly = x
699 if v & 1:
700 degree = floor_log2(v)
701 else:
702 # all reducing polynomials of degree > 1 must have the LSB set,
703 # because they must be irreducible polynomials (meaning they
704 # can't be factored), if the LSB was clear, then they would
705 # have `x` as a factor. Therefore, we can reuse the LSB clear
706 # to instead mean the polynomial has degree XLEN.
707 degree = XLEN
708 v |= 1 << XLEN
709 v |= 1 # LSB must be set
710 return [(v >> i) & 1 for i in range(1 + degree)]
711 ```
712
713 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
714
715 unless this is an immediate op, `mtspr` is completely sufficient.
716
717 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
718
719 ```
720 gfbmul RT, RA, RB
721 ```
722
723 ```
724 (RT) = gfbmul((RA), (RB))
725 ```
726
727 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
728
729 ```
730 gfbmadd RT, RA, RB, RC
731 ```
732
733 ```
734 (RT) = gfbadd(gfbmul((RA), (RB)), (RC))
735 ```
736
737 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
738
739 ```
740 gfbtmadd RT, RA, RB, RC
741 ```
742
743 TODO: add link to explanation for where `RS` comes from.
744
745 ```
746 temp = gfbadd(gfbmul((RA), (RB)), (RC))
747 (RT) = temp
748 (RS) = temp
749 ```
750
751 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
752
753 ```
754 gfbinv RT, RA
755 ```
756
757 ```
758 (RT) = gfbinv((RA))
759 ```
760
761 # Instructions for Prime Galois Fields `GF(p)`
762
763 ## Helper algorithms
764
765 ```python
766 def int_to_gfp(int_value, prime):
767 return int_value % prime # follows Python remainder semantics
768 ```
769
770 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
771
772 ## `gfpadd` Prime Galois Field `GF(p)` Addition
773
774 ```
775 gfpadd RT, RA, RB
776 ```
777
778 ```
779 (RT) = int_to_gfp((RA) + (RB), GFPRIME)
780 ```
781
782 the addition happens on infinite-precision integers
783
784 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
785
786 ```
787 gfpsub RT, RA, RB
788 ```
789
790 ```
791 (RT) = int_to_gfp((RA) - (RB), GFPRIME)
792 ```
793
794 the subtraction happens on infinite-precision integers
795
796 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
797
798 ```
799 gfpmul RT, RA, RB
800 ```
801
802 ```
803 (RT) = int_to_gfp((RA) * (RB), GFPRIME)
804 ```
805
806 the multiplication happens on infinite-precision integers
807
808 ## `gfpinv` Prime Galois Field `GF(p)` Invert
809
810 ```
811 gfpinv RT, RA
812 ```
813
814 Some potential hardware implementations are found in:
815 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
816
817 ```
818 (RT) = gfpinv((RA), GFPRIME)
819 ```
820
821 the multiplication happens on infinite-precision integers
822
823 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
824
825 ```
826 gfpmadd RT, RA, RB, RC
827 ```
828
829 ```
830 (RT) = int_to_gfp((RA) * (RB) + (RC), GFPRIME)
831 ```
832
833 the multiplication and addition happens on infinite-precision integers
834
835 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
836
837 ```
838 gfpmsub RT, RA, RB, RC
839 ```
840
841 ```
842 (RT) = int_to_gfp((RA) * (RB) - (RC), GFPRIME)
843 ```
844
845 the multiplication and subtraction happens on infinite-precision integers
846
847 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
848
849 ```
850 gfpmsubr RT, RA, RB, RC
851 ```
852
853 ```
854 (RT) = int_to_gfp((RC) - (RA) * (RB), GFPRIME)
855 ```
856
857 the multiplication and subtraction happens on infinite-precision integers
858
859 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
860
861 ```
862 gfpmaddsubr RT, RA, RB, RC
863 ```
864
865 TODO: add link to explanation for where `RS` comes from.
866
867 ```
868 product = (RA) * (RB)
869 term = (RC)
870 (RT) = int_to_gfp(product + term, GFPRIME)
871 (RS) = int_to_gfp(term - product, GFPRIME)
872 ```
873
874 the multiplication, addition, and subtraction happens on infinite-precision integers
875
876 ## Twin Butterfly (Cooley-Tukey) Mul-add-sub
877
878 used in combination with SV FFT REMAP to perform
879 a full NTT in-place. possible by having 3-in 2-out,
880 to avoid the need for a temp register. RS is written
881 to as well as RT.
882
883 gffmadd RT,RA,RC,RB (Rc=0)
884 gffmadd. RT,RA,RC,RB (Rc=1)
885
886 Pseudo-code:
887
888 RT <- GFADD(GFMUL(RA, RC), RB)
889 RS <- GFADD(GFMUL(RA, RC), RB)
890
891
892 ## Multiply
893
894 with the modulo and degree being in an SPR, multiply can be made
895 equivalent in form to a standard integer multiply
896
897 RS = GFMUL(RA, RB)
898
899 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31|
900 | -- | -- | --- | --- | --- | ------ |--|
901 | NN | RT | RA | RB |11000| 01110 |Rc|
902
903
904
905 ```
906 from functools import reduce
907
908 def gf_degree(a) :
909 res = 0
910 a >>= 1
911 while (a != 0) :
912 a >>= 1;
913 res += 1;
914 return res
915
916 # constants used in the multGF2 function
917 mask1 = mask2 = polyred = None
918
919 def setGF2(irPoly):
920 """Define parameters of binary finite field GF(2^m)/g(x)
921 - irPoly: coefficients of irreducible polynomial g(x)
922 """
923 # degree: extension degree of binary field
924 degree = gf_degree(irPoly)
925
926 def i2P(sInt):
927 """Convert an integer into a polynomial"""
928 return [(sInt >> i) & 1
929 for i in reversed(range(sInt.bit_length()))]
930
931 global mask1, mask2, polyred
932 mask1 = mask2 = 1 << degree
933 mask2 -= 1
934 polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])
935
936 def multGF2(p1, p2):
937 """Multiply two polynomials in GF(2^m)/g(x)"""
938 p = 0
939 while p2:
940 # standard long-multiplication: check LSB and add
941 if p2 & 1:
942 p ^= p1
943 p1 <<= 1
944 # standard modulo: check MSB and add polynomial
945 if p1 & mask1:
946 p1 ^= polyred
947 p2 >>= 1
948 return p & mask2
949
950 if __name__ == "__main__":
951
952 # Define binary field GF(2^3)/x^3 + x + 1
953 setGF2(0b1011) # degree 3
954
955 # Evaluate the product (x^2 + x + 1)(x^2 + 1)
956 print("{:02x}".format(multGF2(0b111, 0b101)))
957
958 # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
959 # (used in the Advanced Encryption Standard-AES)
960 setGF2(0b100011011) # degree 8
961
962 # Evaluate the product (x^7)(x^7 + x + 1)
963 print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
964 ```
965
966 ## GF(2^M) Inverse
967
968 ```
969 # https://bugs.libre-soc.org/show_bug.cgi?id=782#c33
970 # https://ftp.libre-soc.org/ARITH18_Kobayashi.pdf
971 def gf_invert(a) :
972
973 s = getGF2() # get the full polynomial (including the MSB)
974 r = a
975 v = 0
976 u = 1
977 j = 0
978
979 for i in range(1, 2*degree+1):
980 # could use count-trailing-1s here to skip ahead
981 if r & mask1: # test MSB of r
982 if s & mask1: # test MSB of s
983 s ^= r
984 v ^= u
985 s <<= 1 # shift left 1
986 if j == 0:
987 r, s = s, r # swap r,s
988 u, v = v<<1, u # shift v and swap
989 j = 1
990 else:
991 u >>= 1 # shift right 1
992 j -= 1
993 else:
994 r <<= 1 # shift left 1
995 u <<= 1 # shift left 1
996 j += 1
997
998 return u
999 ```
1000
1001 # GF2 (Carryless)
1002
1003 ## GF2 (carryless) div and mod
1004
1005 ```
1006 def gf_degree(a) :
1007 res = 0
1008 a >>= 1
1009 while (a != 0) :
1010 a >>= 1;
1011 res += 1;
1012 return res
1013
1014 def FullDivision(self, f, v):
1015 """
1016 Takes two arguments, f, v
1017 fDegree and vDegree are the degrees of the field elements
1018 f and v represented as a polynomials.
1019 This method returns the field elements a and b such that
1020
1021 f(x) = a(x) * v(x) + b(x).
1022
1023 That is, a is the divisor and b is the remainder, or in
1024 other words a is like floor(f/v) and b is like f modulo v.
1025 """
1026
1027 fDegree, vDegree = gf_degree(f), gf_degree(v)
1028 res, rem = 0, f
1029 for i in reversed(range(vDegree, fDegree+1)):
1030 if ((rem >> i) & 1): # check bit
1031 res ^= (1 << (i - vDegree))
1032 rem ^= ( v << (i - vDegree))
1033 return (res, rem)
1034 ```
1035
1036 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
1037 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
1038 | NN | RT | RA | RB | 1 | 00 | 0001 110 |Rc| cldiv |
1039 | NN | RT | RA | RB | 1 | 01 | 0001 110 |Rc| clmod |
1040
1041 ## GF2 carryless mul
1042
1043 based on RV bitmanip
1044 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
1045 <https://www.felixcloutier.com/x86/pclmulqdq> and
1046 <https://en.m.wikipedia.org/wiki/Carry-less_product>
1047
1048 these are GF2 operations with the modulo set to 2^degree.
1049 they are worth adding as their own non-overwrite operations
1050 (in the same pipeline).
1051
1052 ```
1053 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)  // carry-less multiply: low XLEN bits of the GF(2) polynomial product
1054 {
1055     uint_xlen_t x = 0;
1056     for (int i = 0; i < XLEN; i++)
1057         if ((RB >> i) & 1)   // for each set bit i of RB ...
1058             x ^= RA << i;    // ... XOR in a shifted copy of RA (addition without carries)
1059     return x;
1060 }
1061 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)  // carry-less multiply, high half: bits XLEN..2*XLEN-1 of the product
1062 {
1063     uint_xlen_t x = 0;
1064     for (int i = 1; i < XLEN; i++)   // i starts at 1: bit 0 of RB contributes nothing to the high half (and a shift by XLEN would be UB in C)
1065         if ((RB >> i) & 1)
1066             x ^= RA >> (XLEN-i);     // the i-th partial product spills RA's top i bits into the high word
1067     return x;
1068 }
1069 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)  // "reversed" carry-less multiply: bits XLEN-1..2*XLEN-2 of the product
1070 {
1071     uint_xlen_t x = 0;
1072     for (int i = 0; i < XLEN; i++)
1073         if ((RB >> i) & 1)
1074             x ^= RA >> (XLEN-i-1);   // one bit lower than clmulh; per RV bitmanip this form is useful for CRC computation
1075     return x;
1076 }
1077 ```
1078 ## carryless Twin Butterfly (Cooley-Tukey) Mul-add-sub
1079
1080 used in combination with SV FFT REMAP to perform
1081 a full NTT in-place. possible by having 3-in 2-out,
1082 to avoid the need for a temp register. RS is written
1083 to as well as RT.
1084
1085 clfmadd RT,RA,RC,RB (Rc=0)
1086 clfmadd. RT,RA,RC,RB (Rc=1)
1087
1088 Pseudo-code:
1089
1090 RT <- CLMUL(RA, RC) ^ RB
1091 RS <- CLMUL(RA, RC) ^ RB
1092
1093
1094 # bitmatrix
1095
1096 ```
1097 uint64_t bmatflip(uint64_t RA)   // transpose RA viewed as an 8x8 bit matrix (see callers below, which use it to get columns)
1098 {
1099     uint64_t x = RA;
1100     x = shfl64(x, 31);   // three generalized shuffles perform the transpose;
1101     x = shfl64(x, 31);   // shfl64 is the RV bitmanip shuffle primitive,
1102     x = shfl64(x, 31);   // defined elsewhere (not in this excerpt)
1103     return x;
1104 }
1105 uint64_t bmatxor(uint64_t RA, uint64_t RB)   // 8x8 bit-matrix multiply over GF(2): AND for products, XOR (parity) for sums
1106 {
1107     // transpose of RB
1108     uint64_t RBt = bmatflip(RB);
1109     uint8_t u[8]; // rows of RA
1110     uint8_t v[8]; // cols of RB
1111     for (int i = 0; i < 8; i++) {
1112         u[i] = RA >> (i*8);
1113         v[i] = RBt >> (i*8);
1114     }
1115     uint64_t x = 0;
1116     for (int i = 0; i < 64; i++) {
1117         if (pcnt(u[i / 8] & v[i % 8]) & 1)   // pcnt = population count; result bit i = parity of (row i/8 · column i%8)
1118             x |= 1LL << i;
1119     }
1120     return x;
1121 }
1122 uint64_t bmator(uint64_t RA, uint64_t RB)   // 8x8 bit-matrix multiply, boolean semiring: AND for products, OR for sums
1123 {
1124     // transpose of RB
1125     uint64_t RBt = bmatflip(RB);
1126     uint8_t u[8]; // rows of RA
1127     uint8_t v[8]; // cols of RB
1128     for (int i = 0; i < 8; i++) {
1129         u[i] = RA >> (i*8);
1130         v[i] = RBt >> (i*8);
1131     }
1132     uint64_t x = 0;
1133     for (int i = 0; i < 64; i++) {
1134         if ((u[i / 8] & v[i % 8]) != 0)   // result bit i set iff row i/8 of RA overlaps column i%8 of RB anywhere
1135             x |= 1LL << i;
1136     }
1137     return x;
1138 }
1139
1140 ```
1141
1142 # Already in POWER ISA
1143
1144 ## count leading/trailing zeros with mask
1145
1146 in v3.1 p105
1147
1148 ```
1149 count = 0
1150 do i = 0 to 63 if((RB)i=1) then do
1151 if((RS)i=1) then break end end count ← count + 1
1152 RA ← EXTZ64(count)
1153 ```
1154
1155 ## bit deposit
1156
1157 vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
1158
1159 do while(m < 64)
1160 if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1161 result = VSR[VRA+32].dword[i].bit[63-k]
1162 VSR[VRT+32].dword[i].bit[63-m] = result
1163 k = k + 1
1164 m = m + 1
1165
1166 ```
1167
1168 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1169 {
1170 uint_xlen_t r = 0;
1171 for (int i = 0, j = 0; i < XLEN; i++)
1172 if ((RB >> i) & 1) {
1173 if ((RA >> j) & 1)
1174 r |= uint_xlen_t(1) << i;
1175 j++;
1176 }
1177 return r;
1178 }
1179
1180 ```
1181
1182 # bit extract
1183
1184 other way round: identical to RV bext, found in v3.1 p196
1185
1186 ```
1187 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1188 {
1189 uint_xlen_t r = 0;
1190 for (int i = 0, j = 0; i < XLEN; i++)
1191 if ((RB >> i) & 1) {
1192 if ((RA >> i) & 1)
1193 r |= uint_xlen_t(1) << j;
1194 j++;
1195 }
1196 return r;
1197 }
1198 ```
1199
1200 # centrifuge
1201
1202 found in v3.1 p106 so not to be added here
1203
1204 ```
1205 ptr0 = 0
1206 ptr1 = 0
1207 do i = 0 to 63
1208     if((RB)i=0) then do
1209         result(ptr0) = (RS)i
1210         ptr0 = ptr0 + 1
1211     end
1212     if((RB)63-i=1) then do
1213         result(63-ptr1) = (RS)63-i
1214         ptr1 = ptr1 + 1
1215     end
1216 RA = result
1217 ```
1218
1219 # bit to byte permute
1220
1221 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1222 these perform a transpose.
1223
1224 do j = 0 to 7
1225 do k = 0 to 7
1226 b = VSR[VRB+32].dword[i].byte[k].bit[j]
1227 VSR[VRT+32].dword[i].byte[j].bit[k] = b
1228