   1 [[!tag standards]]
   2
   3 # Implementation Log
   4
   5 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   6 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   7 * remove Rc=1 from ternlog due to conflicts in encoding as well
   8   as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
   9
  10 # bitmanipulation
  11
  12 **DRAFT STATUS**
  13
This extension amalgamates bit-manipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].

When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years).  Also, VSX comprises hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low-power or embedded scenarios.

ternlogv is experimental and is the only operation that may be considered "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.

General-purpose Galois Field operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  However, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.

Note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]] and [[sv/av_opcodes]], as well as [[sv/setvl]].
  25
Useful resources:
  27
  28 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  29 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  30
  31 # summary
  32
  33 minor opcode allocation
  34
  35     |  28.30 |31| name      |
  36     | ------ |--| --------- |
  37     |   00   |0 | ternlogi  |
  38     |  000   |1 | ternlog   |
  39     |  100   |1 | grevlog   |
  40     |  010   |Rc| bitmask   |
  41     |  011   |Rc| gf*       |
  42     |  101   |1 | ternlogv  |
  43     |  101   |0 | ternlogcr |
  44     |  110   |Rc| 1/2-op    |
  45     |  111   |Rc| 3-op      |
  46
  47 1-op and variants
  48
  49 | dest | src1 | subop | op       |
  50 | ---- | ---- | ----- | -------- |
  51 | RT   | RA   | ..    | bmatflip |
  52
  53 2-op and variants
  54
  55 | dest | src1 | src2 | subop | op       |
  56 | ---- | ---- | ---- | ----- | -------- |
  57 | RT   | RA   | RB   | or    | bmatflip |
  58 | RT   | RA   | RB   | xor   | bmatflip |
  59 | RT   | RA   | RB   |       | grev  |
  60 | RT   | RA   | RB   |       | clmul*  |
  61 | RT   | RA   | RB   |       | gorc |
  62 | RT   | RA   | RB   | shuf  | shuffle |
  63 | RT   | RA   | RB   | unshuf| shuffle |
  64 | RT   | RA   | RB   | width | xperm  |
  65 | RT   | RA   | RB   | type | minmax |
  66 | RT   | RA   | RB   |      | av abs avgadd  |
  67 | RT   | RA   | RB   | type | vmask ops |
  68 | RT   | RA   | RB   |  |  |
  69
  70 3 ops
  71
  72 * bitmask set/extract
  73 * ternlog bitops
  74 * GF
  75
  76 TODO: convert all instructions to use RT and not RS
  77
  78 | 0.5|6.10|11.15|16.20 |21..25   | 26....30 |31| name |
  79 | -- | -- | --- | ---  | -----   | -------- |--| ------ |
  80 | NN | RT | RA  | RB   | RC      | mode 000 |1 | ternlog |
  81 | NN | RT | RA  | RB   | im0-4   | im5-7 00 |0 | ternlogi |
  82 | NN | RT | RA  | RB   | / im0-3 | 00   100 |1 | grevlog |
  83 | NN | RT | RA  | s0-5 | s6 im0-3| 01   100 |1 | grevlogi |
  84 | NN | RT | RA  |      |         | 1-   100 |1 | rsvd |
  85 | NN | RS | RA  | RB   | RC      | 00  011  |Rc| gfmul |
  86 | NN | RS | RA  | RB   | RC      | 01  011  |Rc| gfadd |
  87 | NN | RT | RA  | RB   | deg     | 10  011  |Rc| gfinv |
  88 | NN | RS | RA  | RB   | deg     | 11  011  |Rc| gfmuli |
  89 | NN | RS | RA  | RB   | deg     | 11  111  |Rc| gfaddi |
  90
  91 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31| name |
  92 | -- | -- | --- | ----- | ---- | ----- |--| ------ |
  93 | NN | RT | RA  | imm   | mask | 101   |1 | ternlogv |
  94
  95 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31| name |
  96 | -- | -- | --- | --- |- |-----|----- | -----|--| -------|
  97 | NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 | ternlogcr |
  98
  99 ops (note that av avg and abs as well as vec scalar mask
 100 are included here)
 101
 102 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 103 double check that instructions didn't need 3 inputs.
 104
 105 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 106 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 107 | NN | RA | RB  |     | 0  |       | 0000 110 |Rc| rsvd   |
 108 | NN | RA | RB  | RC  | 1  | itype | 0000 110 |Rc| xperm |
 109 | NN | RA | RB  | RC  | 0  | itype | 0100 110 |Rc| minmax |
 110 | NN | RA | RB  | RC  | 1  |   00  | 0100 110 |Rc| av avgadd |
 111 | NN | RA | RB  | RC  | 1  |   01  | 0100 110 |Rc| av abs |
 112 | NN | RA | RB  |     | 1  |   10  | 0100 110 |Rc| rsvd |
 113 | NN | RA | RB  |     | 1  |   11  | 0100 110 |Rc| rsvd |
 114 | NN | RA | RB  | sh  | SH | itype | 1000 110 |Rc| bmopsi |
 115 | NN | RA | RB  |     |    |       | 1100 110 |Rc| rsvd |
 116 | NN | RA | RB  |     | 1  |       | 0001 110 |Rc| rsvd |
 117 | NN | RA | RB  | RC  | 0  |   00  | 0001 110 |Rc| vec sbfm |
 118 | NN | RA | RB  | RC  | 0  |   01  | 0001 110 |Rc| vec sofm |
 119 | NN | RA | RB  | RC  | 0  |   10  | 0001 110 |Rc| vec sifm |
 120 | NN | RA | RB  | RC  | 0  |   11  | 0001 110 |Rc| vec cprop |
 121 | NN | RA | RB  |     | 0  |       | 0101 110 |Rc| rsvd |
 122 | NN | RA | RB  | RC  | 0  | 00    | 0010 110 |Rc| gorc |
 123 | NN | RA | RB  | sh  | SH | 00    | 1010 110 |Rc| gorci |
 124 | NN | RA | RB  | RC  | 0  | 00    | 0110 110 |Rc| gorcw |
 125 | NN | RA | RB  | sh  | 0  | 00    | 1110 110 |Rc| gorcwi |
 126 | NN | RA | RB  | RC  | 1  | 00    | 1110 110 |Rc| bmator  |
 127 | NN | RA | RB  | RC  | 0  | 01    | 0010 110 |Rc| grev |
 128 | NN | RA | RB  | RC  | 1  | 01    | 0010 110 |Rc| clmul |
 129 | NN | RA | RB  | sh  | SH | 01    | 1010 110 |Rc| grevi |
 130 | NN | RA | RB  | RC  | 0  | 01    | 0110 110 |Rc| grevw |
 131 | NN | RA | RB  | sh  | 0  | 01    | 1110 110 |Rc| grevwi |
 132 | NN | RA | RB  | RC  | 1  | 01    | 1110 110 |Rc| bmatxor   |
 133 | NN | RA | RB  | RC  | 0  | 10    | 0010 110 |Rc| shfl |
 134 | NN | RA | RB  | sh  | SH | 10    | 1010 110 |Rc| shfli |
 135 | NN | RA | RB  | RC  | 0  | 10    | 0110 110 |Rc| shflw |
 136 | NN | RA | RB  | RC  |    | 10    | 1110 110 |Rc| rsvd   |
 137 | NN | RA | RB  | RC  | 0  | 11    | 1110 110 |Rc| clmulr  |
 138 | NN | RA | RB  | RC  | 1  | 11    | 1110 110 |Rc| clmulh  |
 139 | NN |    |     |     |    |       | --11 110 |Rc| setvl  |
 140
 141 # bit to byte permute
 142
 143 similar to matrix permute in RV bitmanip, which has XOR and OR variants
 144
 145     do j = 0 to 7
 146       do k = 0 to 7
 147          b = VSR[VRB+32].dword[i].byte[k].bit[j]
 148          VSR[VRT+32].dword[i].byte[j].bit[k] = b
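
The double loop above amounts to an 8x8 bit-matrix transpose of one doubleword. A minimal Python sketch of that reading follows; the function name `bit_byte_transpose` is hypothetical, and plain LSB0 indexing is assumed (the transpose is symmetric, so the choice of numbering should not change the result).

```python
def bit_byte_transpose(x: int) -> int:
    """Transpose the 8x8 bit matrix held in a 64-bit integer.

    Bit j of byte k moves to bit k of byte j, as in the loop above.
    """
    result = 0
    for j in range(8):
        for k in range(8):
            b = (x >> (k * 8 + j)) & 1   # source: byte k, bit j
            result |= b << (j * 8 + k)   # destination: byte j, bit k
    return result


# A byte of all ones becomes bit 0 of every byte.
print(hex(bit_byte_transpose(0xFF)))  # 0x101010101010101
```

Applying the function twice returns the original value, which is a quick sanity check for any hardware implementation.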
 149
 150 # int min/max
 151
Signed and unsigned min/max for integers.  This is sort-of partly synthesisable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not for both signed and unsigned.  When the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).
 153
 154 signed/unsigned min/max gives more flexibility.
 155
 156 ```
 157 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 158 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 159 }
 160 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 161 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 162 }
 163 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 164 { return rs1 < rs2 ? rs1 : rs2;
 165 }
 166 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 167 { return rs1 > rs2 ? rs1 : rs2;
 168 }
 169 ```
 170
 171
 172 # ternlog bitops
 173
Similar to FPGA LUTs: for every bit, perform a lookup into a table held either in an 8-bit immediate or in another register.
 175
 176 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 177
 178 ## ternlogi
 179
 180 | 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
 181 | -- | -- | --- | --- | ----- | -------- |--|
 182 | NN | RT | RA  | RB  | im0-4 | im5-7 00 |0 |
 183
 184     lut3(imm, a, b, c):
 185         idx = c << 2 | b << 1 | a
 186         return imm[idx] # idx by LSB0 order
 187
 188     for i in range(64):
 189         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
 190
 191 bits 21..22 may be used to specify a mode, such as treating the whole integer zero/nonzero and putting 1/0 in the result, rather than bitwise test.
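
To make the lookup concrete, here is a small Python sketch of the ternlogi pseudocode above, operating on 64-bit integers. The helper names and the example immediates are illustrative choices, not part of the specification; the operand order (RB as `a`, RA as `b`, RT as `c`) follows the pseudocode.

```python
XLEN = 64

def lut3(imm: int, a: int, b: int, c: int) -> int:
    """Look up one bit of the 8-bit truth table imm (LSB0 order)."""
    idx = c << 2 | b << 1 | a
    return (imm >> idx) & 1

def ternlogi(rt: int, ra: int, rb: int, imm: int) -> int:
    """Bitwise 3-input LUT: the new RT is computed from old RT, RA and RB."""
    result = 0
    for i in range(XLEN):
        bit = lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1)
        result |= bit << i
    return result

# Some well-known truth tables expressed as immediates.
AND3, OR3, XOR3, MAJ3 = 0x80, 0xFE, 0x96, 0xE8

a, b, c = 0x0F0F, 0x3333, 0x5555
print(hex(ternlogi(a, b, c, XOR3)))  # same as hex(a ^ b ^ c)
```

Because the table index is built from one bit of each operand, any three-input boolean function can be selected by choosing the immediate accordingly.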
 192
 193 ## ternlog
 194
 195 a 4 operand variant which becomes more along the lines of an FPGA:
 196
 197 | 0.5|6.10|11.15|16.20|21.25| 26...30  |31|
 198 | -- | -- | --- | --- | --- | -------- |--|
| NN | RT | RA  | RB  | RC  | mode 000 |1 |
 200
 201     for i in range(64):
 202         idx = RT[i] << 2 | RA[i] << 1 | RB[i]
 203         RT[i] = (RC & (1<<idx)) != 0
 204
The 2-bit mode field may be used to select an inversion of the ordering, similar to carryless mul, which has 3 modes.
 207
 208 ## ternlogv
 209
 210 also, another possible variant involving swizzle and vec4:
 211
 212 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
 213 | -- | -- | --- | ----- | ---- | ----- |--|
 214 | NN | RT | RA  | imm   | mask | 101   |1 |
 215
 216     for i in range(8):
 217         idx = RA.x[i] << 2 | RA.y[i] << 1 | RA.z[i]
 218         res = (imm & (1<<idx)) != 0
 219         for j in range(3):
 220              if mask[j]: RT[i+j*8] = res
 221
 222 ## ternlogcr
 223
 224 another mode selection would be CRs not Ints.
 225
 226 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31|
 227 | -- | -- | --- | --- |- |-----|----- | -----|--|
 228 | NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 |
 229
 230     for i in range(4):
        if not mask[i]: continue
 232         idx = crregs[BA][i] << 2 |
 233               crregs[BB][i] << 1 |
 234               crregs[BC][i]
 235         crregs[BA][i] = (imm & (1<<idx)) != 0
 236
 237 ## cmix
 238
 239 based on RV bitmanip, covered by ternlog bitops
 240
 241 ```
 242 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 243     return (RA & RB) | (RC & ~RB);
 244 }
 245 ```
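
As an illustration of "covered by ternlog bitops", the following Python sketch derives the 8-bit immediate that would make ternlogi behave as cmix. It assumes the ternlogi index order given earlier (`idx = RT << 2 | RA << 1 | RB`) and assumes that cmix's third operand (RC) is supplied in RT, which ternlogi then overwrites; `imm8_for` is a hypothetical helper.

```python
def cmix(ra: int, rb: int, rc: int) -> int:
    return (ra & rb) | (rc & ~rb)

def imm8_for(func) -> int:
    """Build a ternlogi truth table for func(ra, rb, rt) on single bits."""
    imm = 0
    for idx in range(8):
        rb, ra, rt = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        if func(ra, rb, rt) & 1:
            imm |= 1 << idx
    return imm

def ternlogi(rt: int, ra: int, rb: int, imm: int, xlen: int = 64) -> int:
    """Reference bitwise LUT using the same index order as imm8_for."""
    out = 0
    for i in range(xlen):
        idx = ((rt >> i) & 1) << 2 | ((ra >> i) & 1) << 1 | ((rb >> i) & 1)
        out |= ((imm >> idx) & 1) << i
    return out

CMIX_IMM = imm8_for(cmix)
print(hex(CMIX_IMM))  # 0xd8
```

With a different operand assignment the immediate would differ, but the derivation method stays the same.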
 246
 247
 248 # bitmask set
 249
Based on RV bitmanip singlebit set, with an instruction format similar to shift
[[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl, but only the immediate version).
However bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.

The bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, whereas rldicl is a left rotate.  For the immediate version this does not matter, so a bmexti is not required.
bmrev, however, has no direct equivalent and consequently a bmrevi is required.

bmset (register for mask amount) is particularly useful for creating
predicate masks where the length is a dynamic runtime quantity.
bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.  Note that the pseudocode below computes the mask as `(2<<sh)-1`, which yields a run of sh+1 ones.
 260
 261 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 262 | -- | -- | --- | --- | --- | ------- |--| ----- |
 263 | NN | RT | RA  | RB  | RC  | mode 010 |Rc| bm*   |
 264 | NN | RT | RA  | RB  | RC  | 0 1  111 |Rc| bmrev |
 265
 266
 267 ```
uint_xlen_t bmset(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1; // sh+1 low bits set
    return RA | (mask << shamt);
}

uint_xlen_t bmclr(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return RA & ~(mask << shamt);
}

uint_xlen_t bminv(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return RA ^ (mask << shamt);
}

uint_xlen_t bmext(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return mask & (RA >> shamt);
}
 295 ```
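
The four operations can be exercised with a short Python model. This is only a sketch of the C pseudocode above with XLEN fixed at 64; the `_field` helper is a hypothetical name, and results are truncated to XLEN bits as a register would be.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def _field(sh: int) -> int:
    """Run of sh+1 ones, as produced by (2 << sh) - 1."""
    return ((2 << sh) - 1) & MASK

def bmset(ra: int, rb: int, sh: int) -> int:
    return (ra | (_field(sh) << (rb & (XLEN - 1)))) & MASK

def bmclr(ra: int, rb: int, sh: int) -> int:
    return ra & ~(_field(sh) << (rb & (XLEN - 1))) & MASK

def bminv(ra: int, rb: int, sh: int) -> int:
    return (ra ^ (_field(sh) << (rb & (XLEN - 1)))) & MASK

def bmext(ra: int, rb: int, sh: int) -> int:
    return _field(sh) & (ra >> (rb & (XLEN - 1)))

# Predicate-mask style use: a run of ones starting at bit 0.
print(bin(bmset(0, 0, 4)))        # 0b11111
print(hex(bmext(0xABCD, 4, 7)))   # 0xbc
```

The first call shows the sh+1 behaviour of the mask expression: a shift amount of 4 gives five set bits.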
 296
Bitmask extract with reverse.  This can be done by bit-reversing all of RA and extracting the bits from the opposite end.
 298
 299 ```
 300 msb = rb[5:0];
 301 rev[0:msb] = ra[msb:0];
 302 rt = ZE(rev[msb:0]);
 303
uint_xlen_t bmextrev(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = (RB & (XLEN - 1));
    shamt = (XLEN-1)-shamt;            // shift other end
    uint_xlen_t bra = bitreverse(RA);  // swap LSB-MSB
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return mask & (bra >> shamt);
}
 312 ```
 313
 314 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 315 | -- | -- | --- | --- | --- | ------- |--| ------ |
 316 | NN | RT | RA  | RB  | sh  | 0   111 |Rc| bmrevi |
 317
 318
 319
 320 # grev
 321
Based on RV bitmanip, this is also known as a butterfly network.  However,
where a butterfly network allows every crossbar to be set independently in
every row and every column, generalised-reverse (grev) only allows
a per-row decision: every entry in the same row must either switch or
not-switch.
 327
 328 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 329
 330 ```
 331 uint64_t grev64(uint64_t RA, uint64_t RB)
 332 {
 333     uint64_t x = RA;
 334     int shamt = RB & 63;
 335     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 336                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 337     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 338                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 339     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 340                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 341     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 342                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 343     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 344                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 345     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 346                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 347     return x;
 348 }
 349
 350 ```
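
A compact Python rendering of grev64 may help show which shift amounts give familiar operations. The mask table simply restates the constants from the C code; the specific examples (full bit reversal with 63, byte swap with 56, per-byte bit reversal with 7) follow from which stages are enabled.

```python
MASK64 = (1 << 64) - 1

# Low-half masks for stage widths 1, 2, 4, 8, 16, 32.
LOW_MASKS = [
    0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F,
    0x00FF00FF00FF00FF, 0x0000FFFF0000FFFF, 0x00000000FFFFFFFF,
]

def grev64(ra: int, rb: int) -> int:
    x = ra & MASK64
    shamt = rb & 63
    for k, lo in enumerate(LOW_MASKS):
        s = 1 << k
        if shamt & s:
            # Swap the two halves of every 2*s-bit group.
            x = ((x & lo) << s) | ((x & (lo << s)) >> s)
    return x

print(hex(grev64(0x0102030405060708, 56)))  # 0x807060504030201 (byte swap)
print(hex(grev64(0x01, 7)))                 # 0x80 (reverse bits within a byte)
```

Since every stage is its own inverse and the stages commute, applying grev64 twice with the same shift amount returns the input.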
 351
 352 # shuffle / unshuffle
 353
 354 based on RV bitmanip
 355
 356 ```
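uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
    uint32_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
uint32_t shfl32(uint32_t RA, uint32_t RB)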
 358 {
 359     uint32_t x = RA;
 360     int shamt = RB & 15;
 361     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 362     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 363     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 364     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 365     return x;
 366 }
 367 uint32_t unshfl32(uint32_t RA, uint32_t RB)
 368 {
 369     uint32_t x = RA;
 370     int shamt = RB & 15;
 371     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 372     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 373     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 374     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 375     return x;
 376 }
 377
 378 uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
 379 {
 380     uint64_t x = src & ~(maskL | maskR);
 381     x |= ((src << N) & maskL) | ((src >> N) & maskR);
 382     return x;
 383 }
 384 uint64_t shfl64(uint64_t RA, uint64_t RB)
 385 {
 386     uint64_t x = RA;
 387     int shamt = RB & 31;
 388     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 389                                            0x00000000ffff0000LL, 16);
 390     if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 391                                            0x0000ff000000ff00LL, 8);
 392     if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 393                                            0x00f000f000f000f0LL, 4);
 394     if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
 395                                            0x0c0c0c0c0c0c0c0cLL, 2);
 396     if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
 397                                            0x2222222222222222LL, 1);
 398     return x;
 399 }
 400 uint64_t unshfl64(uint64_t RA, uint64_t RB)
 401 {
 402     uint64_t x = RA;
 403     int shamt = RB & 31;
 404     if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
 405                                            0x2222222222222222LL, 1);
 406     if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
 407                                            0x0c0c0c0c0c0c0c0cLL, 2);
 408     if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 409                                            0x00f000f000f000f0LL, 4);
 410     if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 411                                            0x0000ff000000ff00LL, 8);
 412     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 413                                            0x00000000ffff0000LL, 16);
 414     return x;
 415 }
 416 ```
 417
 418 # xperm
 419
 420 based on RV bitmanip
 421
 422 ```
 423 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 424 {
 425     uint_xlen_t r = 0;
 426     uint_xlen_t sz = 1LL << sz_log2;
 427     uint_xlen_t mask = (1LL << sz) - 1;
 428     for (int i = 0; i < XLEN; i += sz) {
 429         uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
 430         if (pos < XLEN)
 431             r |= ((RA >> pos) & mask) << i;
 432     }
 433     return r;
 434 }
 435 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 436 {  return xperm(RA, RB, 2); }
 437 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 438 {  return xperm(RA, RB, 3); }
 439 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 440 {  return xperm(RA, RB, 4); }
 441 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 442 {  return xperm(RA, RB, 5); }
 443 ```
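
A direct Python port of xperm shows how RB acts as a vector of lane indices into RA. This is a sketch with XLEN assumed to be 64; the example index vectors are illustrative only.

```python
def xperm(ra: int, rb: int, sz_log2: int, xlen: int = 64) -> int:
    r = 0
    sz = 1 << sz_log2            # lane width in bits
    mask = (1 << sz) - 1
    for i in range(0, xlen, sz):
        # Each lane of RB holds the index of the RA lane to fetch.
        pos = ((rb >> i) & mask) << sz_log2
        if pos < xlen:           # out-of-range indices produce zero
            r |= ((ra >> pos) & mask) << i
    return r

def xperm_b(ra: int, rb: int) -> int:
    return xperm(ra, rb, 3)

# Byte reversal: lane 0 fetches byte 7, lane 1 fetches byte 6, and so on.
print(hex(xperm_b(0x1122334455667788, 0x0001020304050607)))  # 0x8877665544332211
# Indices of 0xFF are out of range, so only lane 0 (index 0) survives.
print(hex(xperm_b(0x1122334455667788, 0xFFFFFFFFFFFFFF00)))  # 0x88
```

The out-of-range behaviour makes xperm usable as a small table lookup where unused lanes are cleared.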
 444
 445 # gorc
 446
 447 based on RV bitmanip
 448
 449 ```
 450 uint32_t gorc32(uint32_t RA, uint32_t RB)
 451 {
 452     uint32_t x = RA;
 453     int shamt = RB & 31;
 454     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 455     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 456     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 457     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 458     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 459     return x;
 460 }
 461 uint64_t gorc64(uint64_t RA, uint64_t RB)
 462 {
 463     uint64_t x = RA;
 464     int shamt = RB & 63;
 465     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 466                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 467     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 468                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 469     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 470                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 471     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 472                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 473     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 474                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 475     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 476                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 477     return x;
 478 }
 479
 480 ```
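
gorc differs from grev only in OR-ing each swapped stage into the running value. The Python sketch below restates gorc64 with a mask table and shows one commonly cited use, where a shift amount of 7 turns every non-zero byte into 0xFF; the table layout is a presentational choice, not part of the source.

```python
MASK64 = (1 << 64) - 1

# Low-half masks for stage widths 1, 2, 4, 8, 16, 32.
LOW_MASKS = [
    0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F,
    0x00FF00FF00FF00FF, 0x0000FFFF0000FFFF, 0x00000000FFFFFFFF,
]

def gorc64(ra: int, rb: int) -> int:
    x = ra & MASK64
    shamt = rb & 63
    for k, lo in enumerate(LOW_MASKS):
        s = 1 << k
        if shamt & s:
            # OR each half of every 2*s-bit group into the other half.
            x |= ((x & lo) << s) | ((x & (lo << s)) >> s)
    return x

# Stages 1, 2 and 4 smear any set bit across its whole byte.
print(hex(gorc64(0x0001000000800000, 7)))  # 0xff000000ff0000
```

With all six stages enabled (shift amount 63) any non-zero input becomes all ones.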
 481
 482 # Galois Field
 483
 484 see <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 485
 486 ## Multiply
 487
 488 this requires 3 parameters and a "degree"
 489
 490     RT = GFMUL(RA, RB, gfdegree, modulo=RC)
 491
 492 realistically with the degree also needing to be an immediate it should be brought down to an overwrite version:
 493
 494     RS = GFMUL(RS, RA, gfdegree, modulo=RC)
 495     RS = GFMUL(RS, RA, gfdegree=RB, modulo=RC)
 496
 497 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31|
 498 | -- | -- | --- | --- | --- | ------- |--|
 499 | NN | RS | RA  | deg | RC  | 00  011 |Rc|
 500 | NN | RS | RA  | RB  | RC  | 11  011 |Rc|
 501
 502 where the SimpleV variant may override RS-as-src differently from RS-as-dest
 503
 504
 505
 506 ```
 507 from functools import reduce
 508
 509 # constants used in the multGF2 function
 510 mask1 = mask2 = polyred = None
 511
 512 def setGF2(degree, irPoly):
 513     """Define parameters of binary finite field GF(2^m)/g(x)
 514        - degree: extension degree of binary field
 515        - irPoly: coefficients of irreducible polynomial g(x)
 516     """
 517     def i2P(sInt):
 518         """Convert an integer into a polynomial"""
 519         return [(sInt >> i) & 1
 520                 for i in reversed(range(sInt.bit_length()))]
 521
 522     global mask1, mask2, polyred
 523     mask1 = mask2 = 1 << degree
 524     mask2 -= 1
 525     polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])
 526
 527 def multGF2(p1, p2):
 528     """Multiply two polynomials in GF(2^m)/g(x)"""
 529     p = 0
 530     while p2:
 531         if p2 & 1:
 532             p ^= p1
 533         p1 <<= 1
 534         if p1 & mask1:
 535             p1 ^= polyred
 536         p2 >>= 1
 537     return p & mask2
 538
 539 if __name__ == "__main__":
 540
 541     # Define binary field GF(2^3)/x^3 + x + 1
 542     setGF2(3, 0b1011)
 543
 544     # Evaluate the product (x^2 + x + 1)(x^2 + 1)
 545     print("{:02x}".format(multGF2(0b111, 0b101)))
 546
 547     # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
 548     # (used in the Advanced Encryption Standard-AES)
 549     setGF2(8, 0b100011011)
 550
 551     # Evaluate the product (x^7)(x^7 + x + 1)
 552     print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
```

## GF add
 555
 556     RS = GFADDI(RS, RA|0, gfdegree, modulo=RC)
 557     RS = GFADD(RS, RA|0, gfdegree=RB, modulo=RC)
 558
 559 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 560 | -- | -- | --- | --- | --- | ------- |--| ----- |
 561 | NN | RS | RA  | deg | RC  | 0 1  011 |Rc| gfaddi |
 562 | NN | RS | RA  | RB  | RC  | 1 1  111 |Rc| gfadd |
 563
 564 GFMOD is a pseudo-op where RA=0
 565
 566 ## gf invert
 567
 568 ```
def gf_degree(a):
    res = 0
    a >>= 1
    while a != 0:
        a >>= 1
        res += 1
    return res

def gf_invert(a, mod=0x1B):
    v = mod
    g1 = 1
    g2 = 0
    j = gf_degree(a) - 8

    while a != 1:
        if j < 0:
            a, v = v, a
            g1, g2 = g2, g1
            j = -j

        a ^= v << j
        g1 ^= g2 << j

        a %= 256   # Emulating 8-bit overflow
        g1 %= 256  # Emulating 8-bit overflow

        j = gf_degree(a) - gf_degree(v)

    return g1
 598 ```
 599
 600 ## carryless mul
 601
based on RV bitmanip,
see <https://en.wikipedia.org/wiki/CLMUL_instruction_set>
 604
 605 these are GF2 operations with the modulo set to 2^degree.
 606 they are worth adding as their own non-overwrite operations
 607 (in the same pipeline).
 608
 609 ```
 610 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
 611 {
 612     uint_xlen_t x = 0;
 613     for (int i = 0; i < XLEN; i++)
 614         if ((RB >> i) & 1)
 615             x ^= RA << i;
 616     return x;
 617 }
 618 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
 619 {
 620     uint_xlen_t x = 0;
 621     for (int i = 1; i < XLEN; i++)
 622         if ((RB >> i) & 1)
 623             x ^= RA >> (XLEN-i);
 624     return x;
 625 }
 626 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
 627 {
 628     uint_xlen_t x = 0;
 629     for (int i = 0; i < XLEN; i++)
 630         if ((RB >> i) & 1)
 631             x ^= RA >> (XLEN-i-1);
 632     return x;
 633 }
 634 ```
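
The three variants are different windows onto the same double-width carry-less product. The Python sketch below ports the C loops with XLEN assumed to be 64 and checks that relationship; `clmul_full` is a hypothetical helper introduced only for the comparison.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def clmul(ra: int, rb: int) -> int:
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra << i
    return x & MASK              # low XLEN bits of the product

def clmulh(ra: int, rb: int) -> int:
    x = 0
    for i in range(1, XLEN):
        if (rb >> i) & 1:
            x ^= ra >> (XLEN - i)
    return x                     # bits 2*XLEN-1 .. XLEN of the product

def clmulr(ra: int, rb: int) -> int:
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra >> (XLEN - i - 1)
    return x                     # bits 2*XLEN-2 .. XLEN-1 of the product

def clmul_full(ra: int, rb: int) -> int:
    """Unbounded carry-less product, used only as a reference."""
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra << i
    return x

print(hex(clmul(3, 3)))  # 0x5, since (x + 1)^2 = x^2 + 1 over GF(2)
```

For any operands, `(clmulh << XLEN) | clmul` reconstructs the full product, and `clmulr` is that product shifted right by XLEN-1.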
 635
 636 # bitmatrix
 637
 638 ```
 639 uint64_t bmatflip(uint64_t RA)
 640 {
 641     uint64_t x = RA;
 642     x = shfl64(x, 31);
 643     x = shfl64(x, 31);
 644     x = shfl64(x, 31);
 645     return x;
 646 }
 647 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 648 {
 649     // transpose of RB
 650     uint64_t RBt = bmatflip(RB);
 651     uint8_t u[8]; // rows of RA
 652     uint8_t v[8]; // cols of RB
 653     for (int i = 0; i < 8; i++) {
 654         u[i] = RA >> (i*8);
 655         v[i] = RBt >> (i*8);
 656     }
 657     uint64_t x = 0;
 658     for (int i = 0; i < 64; i++) {
 659         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 660             x |= 1LL << i;
 661     }
 662     return x;
 663 }
 664 uint64_t bmator(uint64_t RA, uint64_t RB)
 665 {
 666     // transpose of RB
 667     uint64_t RBt = bmatflip(RB);
 668     uint8_t u[8]; // rows of RA
 669     uint8_t v[8]; // cols of RB
 670     for (int i = 0; i < 8; i++) {
 671         u[i] = RA >> (i*8);
 672         v[i] = RBt >> (i*8);
 673     }
 674     uint64_t x = 0;
 675     for (int i = 0; i < 64; i++) {
 676         if ((u[i / 8] & v[i % 8]) != 0)
 677             x |= 1LL << i;
 678     }
 679     return x;
 680 }
 681
 682 ```
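
bmatflip relies on three full shuffles being equivalent to an 8x8 bit-matrix transpose: each full shuffle rotates the 6-bit bit index left by one place, so three of them swap the row and column fields. The Python sketch below ports shfl64 from the shuffle section and compares it against a naive transpose; `transpose_naive` is a hypothetical reference function.

```python
# (maskL, maskR, N) for each shuffle stage, as in shfl64.
STAGES = [
    (0x0000FFFF00000000, 0x00000000FFFF0000, 16),
    (0x00FF000000FF0000, 0x0000FF000000FF00, 8),
    (0x0F000F000F000F00, 0x00F000F000F000F0, 4),
    (0x3030303030303030, 0x0C0C0C0C0C0C0C0C, 2),
    (0x4444444444444444, 0x2222222222222222, 1),
]

def shfl64(x: int, shamt: int) -> int:
    for mask_l, mask_r, n in STAGES:
        if shamt & n:
            kept = x & ~(mask_l | mask_r)
            x = kept | ((x << n) & mask_l) | ((x >> n) & mask_r)
    return x

def bmatflip(x: int) -> int:
    for _ in range(3):
        x = shfl64(x, 31)
    return x

def transpose_naive(x: int) -> int:
    """Move bit (row r, column c) to (row c, column r)."""
    out = 0
    for r in range(8):
        for c in range(8):
            out |= ((x >> (r * 8 + c)) & 1) << (c * 8 + r)
    return out

print(hex(bmatflip(0xFF)))  # 0x101010101010101
```

The example turns one full row into one full column, which is what bmatxor and bmator depend on when they transpose RB.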
 683
 684 # Already in POWER ISA
 685
 686 ## count leading/trailing zeros with mask
 687
 688 in v3.1 p105
 689
 690 ```
 691 count = 0
 692 do i = 0 to 63 if((RB)i=1) then do
 693 if((RS)i=1) then break end end count ← count + 1
 694 RA ← EXTZ64(count)
 695 ```
 696
 697 ##  bit deposit
 698
vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
 700
 701     do while(m < 64)
 702        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
 703           result = VSR[VRA+32].dword[i].bit[63-k]
 704           VSR[VRT+32].dword[i].bit[63-m] = result
 705           k = k + 1
 706        m = m + 1
 707
 708 ```
 709
 710 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
 711 {
 712     uint_xlen_t r = 0;
 713     for (int i = 0, j = 0; i < XLEN; i++)
 714         if ((RB >> i) & 1) {
 715             if ((RA >> j) & 1)
 716                 r |= uint_xlen_t(1) << i;
 717             j++;
 718         }
 719     return r;
 720 }
 721
 722 ```
 723
## bit extract
 725
 726 other way round: identical to RV bext, found in v3.1 p196
 727
 728 ```
 729 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
 730 {
 731     uint_xlen_t r = 0;
 732     for (int i = 0, j = 0; i < XLEN; i++)
 733         if ((RB >> i) & 1) {
 734             if ((RA >> i) & 1)
 735                 r |= uint_xlen_t(1) << j;
 736             j++;
 737         }
 738     return r;
 739 }
 740 ```
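
Deposit and extract are inverses over the bits selected by the mask. A small Python sketch of the two C functions (with XLEN assumed to be 64) makes the round trip visible; the example mask is arbitrary.

```python
def bdep(ra: int, rb: int, xlen: int = 64) -> int:
    """Scatter the low bits of ra to the positions where rb has ones."""
    r, j = 0, 0
    for i in range(xlen):
        if (rb >> i) & 1:
            if (ra >> j) & 1:
                r |= 1 << i
            j += 1
    return r

def bext(ra: int, rb: int, xlen: int = 64) -> int:
    """Gather the bits of ra selected by rb into the low bits."""
    r, j = 0, 0
    for i in range(xlen):
        if (rb >> i) & 1:
            if (ra >> i) & 1:
                r |= 1 << j
            j += 1
    return r

mask = 0b11010
print(bin(bdep(0b101, mask)))    # 0b10010
print(bin(bext(0b10010, mask)))  # 0b101
```

Extracting after depositing with the same mask returns the original value, limited to as many low bits as the mask has ones.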
 741
## centrifuge
 743
 744 found in v3.1 p106 so not to be added here
 745
 746 ```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)[i]=0) then do
        result[ptr0] = (RS)[i]
        ptr0 = ptr0 + 1
    end
    if((RB)[63-i]=1) then do
        result[63-ptr1] = (RS)[63-i]
        ptr1 = ptr1 + 1
    end
RA = result
 759 ```
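
The centrifuge pseudocode can be modelled in Python as a sketch under two assumptions: bit indices are MSB0 (bit 0 is the most significant), as is usual in Power ISA pseudocode, and each pointer advances only when its condition matches. The function name `cfuged` and the `bit` helper are illustrative.

```python
def cfuged(rs: int, rb: int, xlen: int = 64) -> int:
    """Bits of rs under a 0 mask bit pack to the left, under a 1 to the right."""
    def bit(v: int, i: int) -> int:
        # MSB0 bit i of v
        return (v >> (xlen - 1 - i)) & 1

    result = 0
    ptr0 = ptr1 = 0
    for i in range(xlen):
        if bit(rb, i) == 0:
            # result[ptr0] in MSB0 numbering
            result |= bit(rs, i) << (xlen - 1 - ptr0)
            ptr0 += 1
        if bit(rb, xlen - 1 - i) == 1:
            # result[63 - ptr1] in MSB0 is LSB0 position ptr1
            result |= bit(rs, xlen - 1 - i) << ptr1
            ptr1 += 1
    return result

# Mask selects the upper nibble of the low byte; those bits move to the right.
print(hex(cfuged(0xF0, 0xF0)))  # 0xf
```

The relative order of bits within each group is preserved, which is what distinguishes this from a plain sort of the bits.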
 760