openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 # Implementation Log
   4
   5 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   6 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   7 * remove Rc=1 from ternlog due to conflicts in encoding as well
   8   as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
  11 # bitmanipulation
  12
  13 **DRAFT STATUS**
  14
This extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].
  17
When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years).  Also, VSX comprises hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low power or embedded scenarios.
  19
  20 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
  21
General-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  However, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.
  23
Note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], as well as those in [[sv/av_opcodes]] and [[sv/setvl]].
  26
  27 Useful resource:
  28
  29 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  30 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  31
  32 # summary
  33
  34 minor opcode allocation
  35
  36     |  28.30 |31| name      |
  37     | ------ |--| --------- |
  38     |  -00   |0 | ternlogi  |
  39     |  -00   |1 | grevlog   |
  40     |  -01   |  | grevlogi  |
  41     |  010   |Rc| bitmask   |
  42     |  011   |Rc| gfmadd*   |
  43     |  110   |Rc| 1/2-op    |
  44     |  111   |1 | ternlogv  |
  45     |  111   |0 | ternlogcr |
  46
  47 1-op and variants
  48
  49 | dest | src1 | subop | op       |
  50 | ---- | ---- | ----- | -------- |
  51 | RT   | RA   | ..    | bmatflip |
  52
  53 2-op and variants
  54
  55 | dest | src1 | src2 | subop | op       |
  56 | ---- | ---- | ---- | ----- | -------- |
  57 | RT   | RA   | RB   | or    | bmatflip |
  58 | RT   | RA   | RB   | xor   | bmatflip |
  59 | RT   | RA   | RB   |       | grev  |
  60 | RT   | RA   | RB   |       | clmul*  |
  61 | RT   | RA   | RB   |       | gorc |
  62 | RT   | RA   | RB   | shuf  | shuffle |
  63 | RT   | RA   | RB   | unshuf| shuffle |
  64 | RT   | RA   | RB   | width | xperm  |
  65 | RT   | RA   | RB   | type | minmax |
  66 | RT   | RA   | RB   |      | av abs avgadd  |
  67 | RT   | RA   | RB   | type | vmask ops |
  68 | RT   | RA   | RB   |      |       |
  69
  70 3 ops
  71
  72 * grevlog
  73 * ternlog bitops
  74 * GF mul-add
  75 * bitmask-reverse
  76
  77 TODO: convert all instructions to use RT and not RS
  78
  79 | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name |
  80 | -- | -- | --- | ---  | -----   | --------  |--| ------ |
  81 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |0 | ternlogi |
  82 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |1 | grevlog |
  83 | NN | RT | RA  | s0-4 | im0-4   | im5-7  01 |s5| grevlogi |
  84 | NN | RS | RA  | RB   | RC      | 00    011 |Rc| gfmadd |
  85 | NN | RS | RA  | RB   | RC      | 10    011 |Rc| gfmaddsub |
  86 | NN | RT | RA  | RB   | sh0-4   | sh5 1 011 |Rc| bmrevi |
  87
  88 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31| name |
  89 | -- | -- | --- | ----- | ---- | ----- |--| ------ |
  90 | NN | RT | RA  | imm   | mask | 111   |1 | ternlogv |
  91
  92 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31| name |
  93 | -- | -- | --- | --- |- |-----|----- | -----|--| -------|
  94 | NN | BA | BB  | BC  |0 |imm  | mask | 111  |0 | ternlogcr |
  95
  96 ops (note that av avg and abs as well as vec scalar mask
  97 are included here)
  98
  99 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 100 double check that instructions didn't need 3 inputs.
 101
 102 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 103 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 104 | NN | RA | RB  |     | 0  |       | 0000 110 |Rc| rsvd   |
 105 | NN | RA | RB  | RC  | 1  | itype | 0000 110 |Rc| xperm |
 106 | NN | RA | RB  | RC  | 0  | itype | 0100 110 |Rc| minmax |
 107 | NN | RA | RB  | RC  | 1  |   00  | 0100 110 |Rc| av avgadd |
 108 | NN | RA | RB  | RC  | 1  |   01  | 0100 110 |Rc| av abs |
 109 | NN | RA | RB  |     | 1  |   10  | 0100 110 |Rc| rsvd |
 110 | NN | RA | RB  |     | 1  |   11  | 0100 110 |Rc| rsvd |
 111 | NN | RA | RB  | sh  | SH | itype | 1000 110 |Rc| bmopsi |
 112 | NN | RA | RB  |     |    |       | 1100 110 |Rc| rsvd |
 113 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| gfdiv |
 114 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| gfmod |
 115 | NN | RT | RA  | RB  | 1  |  10   | 0001 110 |Rc| gfmul |
 116 | NN | RA | RB  |     | 1  |  11   | 0001 110 |Rc| rsvd |
 117 | NN | RA | RB  | RC  | 0  |   00  | 0001 110 |Rc| vec sbfm |
 118 | NN | RA | RB  | RC  | 0  |   01  | 0001 110 |Rc| vec sofm |
 119 | NN | RA | RB  | RC  | 0  |   10  | 0001 110 |Rc| vec sifm |
 120 | NN | RA | RB  | RC  | 0  |   11  | 0001 110 |Rc| vec cprop |
 121 | NN | RA | RB  |     | 0  |       | 0101 110 |Rc| rsvd |
 122 | NN | RA | RB  | RC  | 0  | 00    | 0010 110 |Rc| gorc |
 123 | NN | RA | RB  | sh  | SH | 00    | 1010 110 |Rc| gorci |
 124 | NN | RA | RB  | RC  | 0  | 00    | 0110 110 |Rc| gorcw |
 125 | NN | RA | RB  | sh  | 0  | 00    | 1110 110 |Rc| gorcwi |
 126 | NN | RA | RB  | RC  | 1  | 00    | 1110 110 |Rc| bmator  |
 127 | NN | RA | RB  | RC  | 0  | 01    | 0010 110 |Rc| grev |
 128 | NN | RA | RB  | RC  | 1  | 01    | 0010 110 |Rc| clmul |
 129 | NN | RA | RB  | sh  | SH | 01    | 1010 110 |Rc| grevi |
 130 | NN | RA | RB  | RC  | 0  | 01    | 0110 110 |Rc| grevw |
 131 | NN | RA | RB  | sh  | 0  | 01    | 1110 110 |Rc| grevwi |
 132 | NN | RA | RB  | RC  | 1  | 01    | 1110 110 |Rc| bmatxor   |
 133 | NN | RA | RB  | RC  | 0  | 10    | 0010 110 |Rc| shfl |
 134 | NN | RA | RB  | sh  | SH | 10    | 1010 110 |Rc| shfli |
 135 | NN | RA | RB  | RC  | 0  | 10    | 0110 110 |Rc| shflw |
 136 | NN | RA | RB  | RC  |    | 10    | 1110 110 |Rc| rsvd    |
 137 | NN | RA | RB  | RC  | 0  | 11    | 1110 110 |Rc| clmulr  |
 138 | NN | RA | RB  | RC  | 1  | 11    | 1110 110 |Rc| clmulh  |
 139 | NN |    |     |     |    |       | --11 110 |Rc| setvl  |
 140
 141 # bit to byte permute
 142
 143 similar to matrix permute in RV bitmanip, which has XOR and OR variants
 144
 145     do j = 0 to 7
 146       do k = 0 to 7
 147          b = VSR[VRB+32].dword[i].byte[k].bit[j]
 148          VSR[VRT+32].dword[i].byte[j].bit[k] = b
 149
 150 # int min/max
 151
Signed and unsigned min/max for integers.  This is sort-of partly synthesisable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not for both signed and unsigned.  When the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).
 153
 154 signed/unsigned min/max gives more flexibility.
 155
 156 ```
 157 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 158 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 159 }
 160 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 161 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 162 }
 163 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 164 { return rs1 < rs2 ? rs1 : rs2;
 165 }
 166 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 167 { return rs1 > rs2 ? rs1 : rs2;
 168 }
 169 ```
 170
 171
 172 # ternlog bitops
 173
Similar to FPGA LUTs: for every bit, perform a lookup into a table held either in an 8-bit immediate or in another register.
 175
 176 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 177
 178 ## ternlogi
 179
 180 | 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
 181 | -- | -- | --- | --- | ----- | -------- |--|
 182 | NN | RT | RA  | RB  | im0-4 | im5-7 00 |0 |
 183
 184     lut3(imm, a, b, c):
 185         idx = c << 2 | b << 1 | a
 186         return imm[idx] # idx by LSB0 order
 187
 188     for i in range(64):
 189         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
 190
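The ternlogi pseudo-code above can be exercised directly as a bit-by-bit Python model. This is a minimal sketch, assuming a 64-bit register width; the helper name `ternlogi` and the constants `XLEN` and `MASK` are illustrative and not part of the specification. It checks three well-known truth tables (3-input XOR, AND and OR) against ordinary integer operators.

```python
XLEN = 64                      # register width assumed by this sketch
MASK = (1 << XLEN) - 1

def lut3(imm, a, b, c):
    """Look up one bit of the 8-bit truth table imm (LSB0 indexing)."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(rt, ra, rb, imm):
    """Bit-by-bit model of ternlogi; returns the new value of RT."""
    result = 0
    for i in range(XLEN):
        bit = lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1)
        result |= bit << i
    return result

rt, ra, rb = 0x0F0F0F0F0F0F0F0F, 0x3333333333333333, 0x5555555555555555
xor3 = ternlogi(rt, ra, rb, 0x96)   # table bits 1,2,4,7: odd parity
and3 = ternlogi(rt, ra, rb, 0x80)   # table bit 7 only: all inputs set
or3 = ternlogi(rt, ra, rb, 0xFE)    # every table bit except bit 0
```
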
 191 bits 21..22 may be used to specify a mode, such as treating the whole integer zero/nonzero and putting 1/0 in the result, rather than bitwise test.
 192
 193 ## ternlog
 194
 195 a 4 operand variant which becomes more along the lines of an FPGA:
 196
 197 | 0.5|6.10|11.15|16.20|21.25| 26...30  |31|
 198 | -- | -- | --- | --- | --- | -------- |--|
 199 | NN | RT | RA  | RB  | RC  | mode 100 |1 |
 200
 201     for i in range(64):
 202         idx = RT[i] << 2 | RA[i] << 1 | RB[i]
 203         RT[i] = (RC & (1<<idx)) != 0
 204
 205 mode (2 bit) may be used to do inversion of ordering, similar to carryless mul,
 206 3 modes.
 207
 208 ## ternlogv
 209
 210 also, another possible variant involving swizzle and vec4:
 211
 212 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
 213 | -- | -- | --- | ----- | ---- | ----- |--|
 214 | NN | RT | RA  | imm   | mask | 111   |1 |
 215
 216     for i in range(8):
 217         idx = RA.x[i] << 2 | RA.y[i] << 1 | RA.z[i]
 218         res = (imm & (1<<idx)) != 0
 219         for j in range(3):
 220              if mask[j]: RT[i+j*8] = res
 221
 222 ## ternlogcr
 223
 224 another mode selection would be CRs not Ints.
 225
 226 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31|
 227 | -- | -- | --- | --- |- |-----|----- | -----|--|
 228 | NN | BA | BB  | BC  |0 |imm  | mask | 111  |0 |
 229
    for i in range(4):
        if not mask[i]: continue
        idx = crregs[BA][i] << 2 |
              crregs[BB][i] << 1 |
              crregs[BC][i]
        crregs[BA][i] = (imm & (1<<idx)) != 0
 236
 237 ## cmix
 238
 239 based on RV bitmanip, covered by ternlog bitops
 240
 241 ```
 242 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 243     return (RA & RB) | (RC & ~RB);
 244 }
 245 ```
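As an illustration of "covered by ternlog bitops", the immediate that makes ternlogi behave as cmix can be derived by enumerating the truth table. This is a sketch under stated assumptions: the operand mapping follows the ternlogi pseudo-code (`a`=RB, `b`=RA, `c`=RT), RT stands in for cmix's RC because ternlogi overwrites its third input, and the names `derive_imm`, `cmix_bit` and `CMIX_IMM` are hypothetical.

```python
XLEN = 64                      # register width assumed by this sketch
MASK = (1 << XLEN) - 1

def cmix(ra, rb, rc):
    """Reference cmix: take RA bits where RB is 1, RC bits where RB is 0."""
    return ((ra & rb) | (rc & ~rb)) & MASK

def cmix_bit(ra, rb, rc):
    """Single-bit version of cmix, used to build the truth table."""
    return (ra & rb) | (rc & (rb ^ 1))

def derive_imm(func):
    """Build the 8-bit immediate for a 3-input function func(ra, rb, rt)."""
    imm = 0
    for idx in range(8):
        a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1  # a=RB, b=RA, c=RT
        imm |= func(b, a, c) << idx
    return imm

def ternlogi(rt, ra, rb, imm):
    """Bit-by-bit model of ternlogi; returns the new value of RT."""
    result = 0
    for i in range(XLEN):
        idx = (((rt >> i) & 1) << 2) | (((ra >> i) & 1) << 1) | ((rb >> i) & 1)
        result |= ((imm >> idx) & 1) << i
    return result

CMIX_IMM = derive_imm(cmix_bit)
ra, rb, rc = 0x1122334455667788, 0x00FF00FF00FF00FF, 0xAABBCCDDEEFF0011
```

With this mapping the derived immediate is 0xD8, and `ternlogi(rc, ra, rb, CMIX_IMM)` gives the same result as `cmix(ra, rb, rc)`.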
 246
 247
 248 # bitmask set
 249
 250 based on RV bitmanip singlebit set, instruction format similar to shift
 251 [[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl but only immediate version).
 252 however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.
 253
The bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, where rldicl is a left rotate.  For the immediate version this does not matter, so a bmexti is not required.
For bmrev, however, there is no direct equivalent, and consequently a bmrevi is required.
 256
 257 bmset (register for mask amount) is particularly useful for creating
 258 predicate masks where the length is a dynamic runtime quantity.
 259 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 260
 261 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 262 | -- | -- | --- | --- | --- | ------- |--| ----- |
 263 | NN | RT | RA  | RB  | RC  | mode 010 |Rc| bm*   |
 264
 265
 266 ```
uint_xlen_t bmset(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    // cast so that the shift is done at full register width
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA | (mask << shamt);
}

uint_xlen_t bmclr(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA & ~(mask << shamt);
}

uint_xlen_t bminv(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA ^ (mask << shamt);
}

uint_xlen_t bmext(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return mask & (RA >> shamt);
}
 294 ```
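The four bitmask operations above can be modelled in a few lines of Python to experiment with predicate-mask creation. This is a sketch only, assuming XLEN=64 and explicit truncation to the register width; the helper `_run` is a hypothetical name. Note that the expression `(2<<sh)-1`, taken as written from the C above, yields a run of `sh+1` ones.

```python
XLEN = 64                      # register width assumed by this sketch
MASK = (1 << XLEN) - 1

def _run(sh):
    """Run of ones exactly as in the C above: (2 << sh) - 1."""
    return ((2 << sh) - 1) & MASK

def bmset(ra, rb, sh):
    shamt = rb & (XLEN - 1)
    return (ra | (_run(sh) << shamt)) & MASK

def bmclr(ra, rb, sh):
    shamt = rb & (XLEN - 1)
    return ra & ~(_run(sh) << shamt) & MASK

def bminv(ra, rb, sh):
    shamt = rb & (XLEN - 1)
    return (ra ^ (_run(sh) << shamt)) & MASK

def bmext(ra, rb, sh):
    shamt = rb & (XLEN - 1)
    return _run(sh) & (ra >> shamt)

# a predicate mask built from nothing but a run-length operand
predicate = bmset(0, 0, 3)
```

Here `bmset(0, 0, 3)` produces the low four bits set, with no dependency on any other register contents.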
 295
Bitmask extract with reverse.  This can be done by bit-reversing all of RA and taking bits of RA from the opposite end.
 297
 298 ```
 299 msb = rb[5:0];
 300 rev[0:msb] = ra[msb:0];
 301 rt = ZE(rev[msb:0]);
 302
uint_xlen_t bmextrev(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    shamt = (XLEN - 1) - shamt;                     // shift from the other end
    uint_xlen_t bra = bitreverse(RA);               // swap LSB-MSB
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;  // full-width shift
    return mask & (bra >> shamt);
}
 311 ```
 312
 313 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 314 | -- | -- | --- | --- | --- | ------- |--| ------ |
 315 | NN | RT | RA  | RB  | sh  | 1   011 |Rc| bmrevi |
 316
 317
 318 # grevlut
 319
 320 generalised reverse combined with a pair of LUT2s and allowing
 321 zero when RA=0 provides a wide range of instructions
 322 and a means to set regular 64 bit patterns in one
 323 32 bit instruction.
 324
 325 the two LUT2s are applied left-half (when not swapping)
 326 and right-half (when swapping) so as to allow a wider
 327 range of options
 328
 329 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 330
 331 ```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm8, step_i, chunk_size):
    for j in 0 to 63:
        if (j&chunk_size) == 0
           imm = imm8[0..3]
        else
           imm = imm8[4..7]
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 imm)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    for i in 0 to 5   # shamt is 6 bits wide: steps 1,2,4,8,16,32
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}
 354
 355 ```
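The grevlut pseudo-code above can be turned into a runnable Python model to explore which immediates reproduce existing operations. This is a sketch under assumptions: `imm8[0..3]` is taken to be the low nibble and `imm8[4..7]` the high nibble (the bit ordering is not stated above), and only the six steps reachable from a 6-bit shift amount are iterated. With both nibbles set to 0b1100 each LUT2 returns the partner bit, which gives grev; with 0b1110 it returns the OR of the pair, which gives gorc.

```python
def lut2(imm4, a, b):
    """Look up one bit of a 4-bit truth table (LSB0 indexing)."""
    idx = (b << 1) | a
    return (imm4 >> idx) & 1

def dorow(imm8, step_i, chunk_size):
    """Apply one butterfly row: each bit j is combined with bit j ^ chunk_size."""
    step_o = 0
    for j in range(64):
        # assumption: low nibble for the non-swapping half, high nibble otherwise
        imm4 = (imm8 & 0xF) if (j & chunk_size) == 0 else ((imm8 >> 4) & 0xF)
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm4, a, b) << j
    return step_o

def grevlut64(ra, rb, imm8):
    x = ra
    shamt = rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            x = dorow(imm8, x, step)
    return x

bit_reversed = grevlut64(1, 63, 0xCC)                  # behaves as grev
or_combined = grevlut64(0x0000010000800000, 7, 0xEE)   # behaves as gorc
```
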
 356
 357 # grev
 358
 359 based on RV bitmanip, this is also known as a butterfly network. however
 360 where a butterfly network allows setting of every crossbar setting in
 361 every row and every column, generalised-reverse (grev) only allows
 362 a per-row decision: every entry in the same row must either switch or
 363 not-switch.
 364
 365 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 366
 367 ```
 368 uint64_t grev64(uint64_t RA, uint64_t RB)
 369 {
 370     uint64_t x = RA;
 371     int shamt = RB & 63;
 372     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 373                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 374     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 375                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 376     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 377                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 378     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 379                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 380     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 381                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 382     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 383                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 384     return x;
 385 }
 386
 387 ```
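A table-driven Python equivalent of grev64 makes the per-row "switch or not-switch" decision explicit. This is an illustrative sketch; the names `_STAGES` and `MASK64` are hypothetical. It shows that a shift amount of 63 gives a full bit reversal, 56 gives a byte swap, and 7 reverses the bits within each byte.

```python
MASK64 = (1 << 64) - 1

# (step, mask selecting the lower element of each swapped pair)
_STAGES = [
    (1, 0x5555555555555555),
    (2, 0x3333333333333333),
    (4, 0x0F0F0F0F0F0F0F0F),
    (8, 0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def grev64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for step, low in _STAGES:
        if shamt & step:              # this whole row switches
            high = low << step        # complementary mask
            x = ((x & low) << step) | ((x & high) >> step)
    return x

value = 0x0102030405060708
byte_swapped = grev64(value, 56)      # rows 8, 16 and 32 enabled
```

Because every row is its own inverse and the rows commute, applying grev twice with the same shift amount returns the original value.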
 388
 389 # shuffle / unshuffle
 390
 391 based on RV bitmanip
 392
 393 ```
 394 uint32_t shfl32(uint32_t RA, uint32_t RB)
 395 {
 396     uint32_t x = RA;
 397     int shamt = RB & 15;
 398     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 399     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 400     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 401     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 402     return x;
 403 }
 404 uint32_t unshfl32(uint32_t RA, uint32_t RB)
 405 {
 406     uint32_t x = RA;
 407     int shamt = RB & 15;
 408     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 409     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 410     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 411     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 412     return x;
 413 }
 414
 415 uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
 416 {
 417     uint64_t x = src & ~(maskL | maskR);
 418     x |= ((src << N) & maskL) | ((src >> N) & maskR);
 419     return x;
 420 }
 421 uint64_t shfl64(uint64_t RA, uint64_t RB)
 422 {
 423     uint64_t x = RA;
 424     int shamt = RB & 31;
 425     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 426                                            0x00000000ffff0000LL, 16);
 427     if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 428                                            0x0000ff000000ff00LL, 8);
 429     if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 430                                            0x00f000f000f000f0LL, 4);
 431     if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
 432                                            0x0c0c0c0c0c0c0c0cLL, 2);
 433     if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
 434                                            0x2222222222222222LL, 1);
 435     return x;
 436 }
 437 uint64_t unshfl64(uint64_t RA, uint64_t RB)
 438 {
 439     uint64_t x = RA;
 440     int shamt = RB & 31;
 441     if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
 442                                            0x2222222222222222LL, 1);
 443     if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
 444                                            0x0c0c0c0c0c0c0c0cLL, 2);
 445     if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 446                                            0x00f000f000f000f0LL, 4);
 447     if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 448                                            0x0000ff000000ff00LL, 8);
 449     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 450                                            0x00000000ffff0000LL, 16);
 451     return x;
 452 }
 453 ```
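The 32-bit functions above call `shuffle32_stage`, which is not listed; it is assumed here to be the 32-bit analogue of `shuffle64_stage`. The following Python sketch uses a single width-agnostic helper (the names `shuffle_stage` and `_STAGES32` are hypothetical) and reuses the 32-bit masks exactly as given above.

```python
def shuffle_stage(src, mask_l, mask_r, n):
    """Swap the two bit-groups selected by mask_l and mask_r, n bits apart."""
    x = src & ~(mask_l | mask_r)
    return x | ((src << n) & mask_l) | ((src >> n) & mask_r)

# (maskL, maskR, N) in the order used by shfl32
_STAGES32 = [
    (0x00ff0000, 0x0000ff00, 8),
    (0x0f000f00, 0x00f000f0, 4),
    (0x30303030, 0x0c0c0c0c, 2),
    (0x44444444, 0x22222222, 1),
]

def shfl32(ra, rb):
    x = ra & 0xFFFFFFFF
    shamt = rb & 15
    for mask_l, mask_r, n in _STAGES32:
        if shamt & n:
            x = shuffle_stage(x, mask_l, mask_r, n)
    return x

def unshfl32(ra, rb):
    x = ra & 0xFFFFFFFF
    shamt = rb & 15
    for mask_l, mask_r, n in reversed(_STAGES32):   # stages in opposite order
        if shamt & n:
            x = shuffle_stage(x, mask_l, mask_r, n)
    return x

zipped = shfl32(0x0000FFFF, 15)   # interleave low half into even bit positions
```

With all four stages enabled, shfl32 interleaves the two 16-bit halves, and unshfl32 with the same shift amount undoes it.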
 454
 455 # xperm
 456
 457 based on RV bitmanip
 458
 459 ```
 460 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 461 {
 462     uint_xlen_t r = 0;
 463     uint_xlen_t sz = 1LL << sz_log2;
 464     uint_xlen_t mask = (1LL << sz) - 1;
 465     for (int i = 0; i < XLEN; i += sz) {
 466         uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
 467         if (pos < XLEN)
 468             r |= ((RA >> pos) & mask) << i;
 469     }
 470     return r;
 471 }
 472 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 473 {  return xperm(RA, RB, 2); }
 474 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 475 {  return xperm(RA, RB, 3); }
 476 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 477 {  return xperm(RA, RB, 4); }
 478 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 479 {  return xperm(RA, RB, 5); }
 480 ```
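A direct Python transliteration of xperm shows how RB acts as a vector of element indices into RA. This is a minimal sketch assuming XLEN=64; the constant name `XLEN` follows the C above and the example values are arbitrary.

```python
XLEN = 64                      # register width assumed by this sketch

def xperm(ra, rb, sz_log2):
    """Each sz-bit element of rb selects which sz-bit element of ra to copy."""
    r = 0
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    for i in range(0, XLEN, sz):
        pos = ((rb >> i) & mask) << sz_log2
        if pos < XLEN:                    # out-of-range indices yield zero
            r |= ((ra >> pos) & mask) << i
    return r

source = 0x1122334455667788
reversed_bytes = xperm(source, 0x0001020304050607, 3)   # xperm_b
broadcast = xperm(source, 0xFF, 3)   # byte 0 out of range, others pick byte 0
```
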
 481
 482 # gorc
 483
 484 based on RV bitmanip
 485
 486 ```
 487 uint32_t gorc32(uint32_t RA, uint32_t RB)
 488 {
 489     uint32_t x = RA;
 490     int shamt = RB & 31;
 491     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 492     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 493     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 494     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 495     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 496     return x;
 497 }
 498 uint64_t gorc64(uint64_t RA, uint64_t RB)
 499 {
 500     uint64_t x = RA;
 501     int shamt = RB & 63;
 502     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 503                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 504     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 505                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 506     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 507                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 508     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 509                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 510     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 511                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 512     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 513                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 514     return x;
 515 }
 516
 517 ```
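The same stage structure as grev applies to gorc, except that each row ORs the swapped value into the original instead of replacing it. The following Python sketch (hypothetical names `_STAGES` and `MASK64`) shows the common use of a shift amount of 7, which turns every non-zero byte into 0xFF.

```python
MASK64 = (1 << 64) - 1

# (step, mask selecting the lower element of each pair)
_STAGES = [
    (1, 0x5555555555555555),
    (2, 0x3333333333333333),
    (4, 0x0F0F0F0F0F0F0F0F),
    (8, 0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def gorc64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for step, low in _STAGES:
        if shamt & step:
            high = low << step
            # OR-combine: keep the original bits and add the swapped ones
            x |= ((x & low) << step) | ((x & high) >> step)
    return x

nonzero_bytes = gorc64(0x0000010000800000, 7)
```
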
 518
 519 # Galois Field 2^M
 520
 521 see:
 522
 523 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 524 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 525 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 526
 527 ## SPRs to set modulo and degree
 528
 529 to save registers and make operations orthogonal with standard
 530 arithmetic the modulo is to be set in an SPR
 531
## Twin Butterfly (Cooley-Tukey) Mul-add-sub
 533
 534 used in combination with SV FFT REMAP to perform
 535 a full NTT in-place
 536
 537     gffmadd  RT,RA,RC,RB (Rc=0)
 538     gffmadd. RT,RA,RC,RB (Rc=1)
 539
 540 Pseudo-code:
 541
 542     RT <- GFMULADD(RA, RC, RB)
 543     RS <- GFMULADD(RA, RC, RB)
 544
 545
 546 ## Multiply
 547
With the modulo and degree being in an SPR, multiply can take the same
two-operand form as standard integer add
 550
 551     RS = GFMUL(RA, RB)
 552
 553 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31|
 554 | -- | -- | --- | --- | --- | ------ |--|
 555 | NN | RT | RA  | RB  |11000|  01110 |Rc|
 556
 557
 558
 559 ```
 560 from functools import reduce
 561
 562 def gf_degree(a) :
 563   res = 0
 564   a >>= 1
 565   while (a != 0) :
 566     a >>= 1;
 567     res += 1;
 568   return res
 569
 570 # constants used in the multGF2 function
 571 mask1 = mask2 = polyred = None
 572
 573 def setGF2(irPoly):
 574     """Define parameters of binary finite field GF(2^m)/g(x)
 575        - irPoly: coefficients of irreducible polynomial g(x)
 576     """
 577     # degree: extension degree of binary field
 578     degree = gf_degree(irPoly)
 579
 580     def i2P(sInt):
 581         """Convert an integer into a polynomial"""
 582         return [(sInt >> i) & 1
 583                 for i in reversed(range(sInt.bit_length()))]
 584
 585     global mask1, mask2, polyred
 586     mask1 = mask2 = 1 << degree
 587     mask2 -= 1
 588     polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])
 589
 590 def multGF2(p1, p2):
 591     """Multiply two polynomials in GF(2^m)/g(x)"""
 592     p = 0
 593     while p2:
 594         # standard long-multiplication: check LSB and add
 595         if p2 & 1:
 596             p ^= p1
 597         p1 <<= 1
 598         # standard modulo: check MSB and add polynomial
 599         if p1 & mask1:
 600             p1 ^= polyred
 601         p2 >>= 1
 602     return p & mask2
 603
 604 if __name__ == "__main__":
 605
 606     # Define binary field GF(2^3)/x^3 + x + 1
 607     setGF2(0b1011) # degree 3
 608
 609     # Evaluate the product (x^2 + x + 1)(x^2 + 1)
 610     print("{:02x}".format(multGF2(0b111, 0b101)))
 611
 612     # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
 613     # (used in the Advanced Encryption Standard-AES)
 614     setGF2(0b100011011) # degree 8
 615
 616     # Evaluate the product (x^7)(x^7 + x + 1)
 617     print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
 618 ```
 619 ## GF div and mod
 620
 621 ```
def gf_degree(a):
    """Degree of the polynomial a (index of its highest set bit)"""
    res = 0
    a >>= 1
    while a != 0:
        a >>= 1
        res += 1
    return res

def FullDivision(f, v):
    """
    Takes two arguments, f and v.
    fDegree and vDegree are the degrees of the field elements
    f and v represented as polynomials.
    This function returns the field elements a and b such that

        f(x) = a(x) * v(x) + b(x).

    That is, a is the quotient and b is the remainder, or in
    other words a is like floor(f/v) and b is like f modulo v.
    """

    fDegree, vDegree = gf_degree(f), gf_degree(v)
    res, rem = 0, f
    i = fDegree
    mask = 1 << i
    while (i >= vDegree):
        if (mask & rem): # check MSB
            res ^= (1 << (i - vDegree))
            rem ^= (v << (i - vDegree))
        i -= 1
        mask >>= 1
    return (res, rem)
 652             mask >>= 1
 653         return (res, rem)
 654 ```
 655
 656 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 657 | -- | -- | --- | --- | --- | ------- |--| ----- |
 658 | NN | RS | RA  | deg | RC  | 0 1  011 |Rc| gfaddi |
 659 | NN | RS | RA  | RB  | RC  | 1 1  111 |Rc| gfadd |
 660
 661 GFMOD is a pseudo-op where RA=0
 662
 663 ## carryless mul
 664
 665 based on RV bitmanip
 666 see https://en.wikipedia.org/wiki/CLMUL_instruction_set
 667
 668 these are GF2 operations with the modulo set to 2^degree.
 669 they are worth adding as their own non-overwrite operations
 670 (in the same pipeline).
 671
 672 ```
 673 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
 674 {
 675     uint_xlen_t x = 0;
 676     for (int i = 0; i < XLEN; i++)
 677         if ((RB >> i) & 1)
 678             x ^= RA << i;
 679     return x;
 680 }
 681 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
 682 {
 683     uint_xlen_t x = 0;
 684     for (int i = 1; i < XLEN; i++)
 685         if ((RB >> i) & 1)
 686             x ^= RA >> (XLEN-i);
 687     return x;
 688 }
 689 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
 690 {
 691     uint_xlen_t x = 0;
 692     for (int i = 0; i < XLEN; i++)
 693         if ((RB >> i) & 1)
 694             x ^= RA >> (XLEN-i-1);
 695     return x;
 696 }
 697 ```
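How the three carry-less multiply variants relate to the full double-width product can be checked in Python, where integers are unbounded. This is a sketch assuming XLEN=64; `clmul_wide` is a hypothetical reference helper that is not part of the instruction set, and the operand values are arbitrary.

```python
XLEN = 64                      # register width assumed by this sketch
MASK = (1 << XLEN) - 1

def clmul(ra, rb):
    """Low XLEN bits of the carry-less product."""
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra << i
    return x & MASK

def clmulh(ra, rb):
    """High XLEN bits of the carry-less product."""
    x = 0
    for i in range(1, XLEN):
        if (rb >> i) & 1:
            x ^= ra >> (XLEN - i)
    return x

def clmulr(ra, rb):
    """Bits XLEN-1 up to 2*XLEN-2 of the carry-less product."""
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra >> (XLEN - i - 1)
    return x

def clmul_wide(ra, rb):
    """Reference only: untruncated product using Python big integers."""
    x = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            x ^= ra << i
    return x

a, b = 0x8000000000000003, 0xC000000000000005
wide = clmul_wide(a, b)
```

clmulh and clmul together form the complete product, while clmulr is the same product shifted right by XLEN-1.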
 698
 699 # bitmatrix
 700
 701 ```
 702 uint64_t bmatflip(uint64_t RA)
 703 {
 704     uint64_t x = RA;
 705     x = shfl64(x, 31);
 706     x = shfl64(x, 31);
 707     x = shfl64(x, 31);
 708     return x;
 709 }
 710 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 711 {
 712     // transpose of RB
 713     uint64_t RBt = bmatflip(RB);
 714     uint8_t u[8]; // rows of RA
 715     uint8_t v[8]; // cols of RB
 716     for (int i = 0; i < 8; i++) {
 717         u[i] = RA >> (i*8);
 718         v[i] = RBt >> (i*8);
 719     }
 720     uint64_t x = 0;
 721     for (int i = 0; i < 64; i++) {
 722         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 723             x |= 1LL << i;
 724     }
 725     return x;
 726 }
 727 uint64_t bmator(uint64_t RA, uint64_t RB)
 728 {
 729     // transpose of RB
 730     uint64_t RBt = bmatflip(RB);
 731     uint8_t u[8]; // rows of RA
 732     uint8_t v[8]; // cols of RB
 733     for (int i = 0; i < 8; i++) {
 734         u[i] = RA >> (i*8);
 735         v[i] = RBt >> (i*8);
 736     }
 737     uint64_t x = 0;
 738     for (int i = 0; i < 64; i++) {
 739         if ((u[i / 8] & v[i % 8]) != 0)
 740             x |= 1LL << i;
 741     }
 742     return x;
 743 }
 744
 745 ```
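The bit-matrix operations above can be modelled in Python by treating a 64-bit value as an 8x8 matrix with one row per byte. This is a sketch under assumptions: `bmatflip` is written here as a direct transpose instead of three applications of `shfl64(x, 31)`, `pcnt` is taken to be a population count, and the names `_bmat` and `IDENTITY` are hypothetical.

```python
def bmatflip(x):
    """Transpose an 8x8 bit matrix stored one row per byte."""
    r = 0
    for row in range(8):
        for col in range(8):
            bit = (x >> (row * 8 + col)) & 1
            r |= bit << (col * 8 + row)
    return r

def _bmat(ra, rb, combine):
    """Shared loop: combine each row of ra with each column of rb."""
    rbt = bmatflip(rb)
    u = [(ra >> (i * 8)) & 0xFF for i in range(8)]    # rows of RA
    v = [(rbt >> (i * 8)) & 0xFF for i in range(8)]   # cols of RB
    x = 0
    for i in range(64):
        if combine(u[i // 8] & v[i % 8]):
            x |= 1 << i
    return x

def bmatxor(ra, rb):
    """Matrix multiply over GF(2): parity of the ANDed row and column."""
    return _bmat(ra, rb, lambda t: bin(t).count("1") & 1)

def bmator(ra, rb):
    """Boolean matrix multiply: OR of the ANDed row and column."""
    return _bmat(ra, rb, lambda t: t != 0)

IDENTITY = 0x8040201008040201   # row r has only bit r set
sample = 0x0123456789ABCDEF
```

Multiplying by the identity matrix leaves the other operand unchanged for both variants, which is a convenient sanity check.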
 746
 747 # Already in POWER ISA
 748
 749 ## count leading/trailing zeros with mask
 750
 751 in v3.1 p105
 752
 753 ```
count ← 0
do i = 0 to 63
    if((RB)[i]=1) then do
        if((RS)[i]=1) then break
        count ← count + 1
    end
end
RA ← EXTZ64(count)
 758 ```
 759
 760 ##  bit deposit
 761
vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
 763
 764     do while(m < 64)
 765        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
 766           result = VSR[VRA+32].dword[i].bit[63-k]
 767           VSR[VRT+32].dword[i].bit[63-m] = result
 768           k = k + 1
 769        m = m + 1
 770
 771 ```
 772
 773 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
 774 {
 775     uint_xlen_t r = 0;
 776     for (int i = 0, j = 0; i < XLEN; i++)
 777         if ((RB >> i) & 1) {
 778             if ((RA >> j) & 1)
 779                 r |= uint_xlen_t(1) << i;
 780             j++;
 781         }
 782     return r;
 783 }
 784
 785 ```
 786
## bit extract
 788
 789 other way round: identical to RV bext, found in v3.1 p196
 790
 791 ```
 792 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
 793 {
 794     uint_xlen_t r = 0;
 795     for (int i = 0, j = 0; i < XLEN; i++)
 796         if ((RB >> i) & 1) {
 797             if ((RA >> i) & 1)
 798                 r |= uint_xlen_t(1) << j;
 799             j++;
 800         }
 801     return r;
 802 }
 803 ```
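Bit deposit and bit extract are near-inverses of each other, which is easy to confirm with Python transliterations of the two C functions. This is a minimal sketch assuming XLEN=64; the example operands are arbitrary.

```python
XLEN = 64                      # register width assumed by this sketch

def bdep(ra, rb):
    """Scatter the low bits of ra into the positions where rb has a 1."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> j) & 1:
                r |= 1 << i
            j += 1
    return r

def bext(ra, rb):
    """Gather the bits of ra selected by rb into the low end of the result."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> i) & 1:
                r |= 1 << j
            j += 1
    return r

value, select = 0x123456789ABCDEF0, 0x0F0F00FF00F0F0F0
gathered = bext(value, select)
restored = bdep(gathered, select)   # only the selected bits come back
```
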
 804
## centrifuge
 806
 807 found in v3.1 p106 so not to be added here
 808
 809 ```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)[i]=0) then do
       result[ptr0] = (RS)[i]
       ptr0 = ptr0 + 1
    end
    if((RB)[63-i]=1) then do
       result[63-ptr1] = (RS)[63-i]
       ptr1 = ptr1 + 1
    end
end
RA = result
 822 ```
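The centrifuge operation can be modelled in Python as follows. This is a sketch under assumptions: bit indices are MSB0 as in the Power ISA (bit 0 is the most significant), each pointer advances only when a bit is actually placed, and the function name `cfuged` and helper `bit_msb0` are illustrative. Bits of RS under a 0 in the mask are packed to the left, bits under a 1 are packed to the right, each group keeping its original order.

```python
def bit_msb0(x, i):
    """Bit i of a 64-bit value using MSB0 numbering (bit 0 is the MSB)."""
    return (x >> (63 - i)) & 1

def cfuged(rs, rb):
    result = 0
    ptr0 = 0
    ptr1 = 0
    for i in range(64):
        if bit_msb0(rb, i) == 0:
            # mask-0 bits fill from the left (MSB0 position ptr0)
            result |= bit_msb0(rs, i) << (63 - ptr0)
            ptr0 += 1
        if bit_msb0(rb, 63 - i) == 1:
            # mask-1 bits fill from the right (MSB0 position 63 - ptr1)
            result |= bit_msb0(rs, 63 - i) << ptr1
            ptr1 += 1
    return result

separated = cfuged(0xF0F0F0F0F0F0F0F0, 0xF0F0F0F0F0F0F0F0)
```
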
 823