openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 # Implementation Log
   4
   5 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   6 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   7 * remove Rc=1 from ternlog due to conflicts in encoding as well
   8   as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
# bitmanipulation

**DRAFT STATUS**

This extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].

When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years).  In addition, VSX comprises hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low-power or embedded scenarios.

ternlogv is experimental and is the only operation that may be considered "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.

General-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  However, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.

Note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]] and [[sv/av_opcodes]], as well as [[sv/setvl]].

Useful resources:

* <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
* <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  31
  32 # summary
  33
  34 minor opcode allocation
  35
  36     |  28.30 |31| name      |
  37     | ------ |--| --------- |
  38     |  -00   |0 | ternlogi  |
  39     |  -00   |1 | grevlog   |
  40     |  -01   |  | grevlogi  |
  41     |  010   |Rc| bitmask   |
  42     |  011   |Rc| gfmadd*   |
  43     |  110   |Rc| 1/2-op    |
  44     |  111   |1 | ternlogv  |
  45     |  111   |0 | ternlogcr |
  46
  47 1-op and variants
  48
  49 | dest | src1 | subop | op       |
  50 | ---- | ---- | ----- | -------- |
  51 | RT   | RA   | ..    | bmatflip |
  52
  53 2-op and variants
  54
  55 | dest | src1 | src2 | subop | op       |
  56 | ---- | ---- | ---- | ----- | -------- |
  57 | RT   | RA   | RB   | or    | bmatflip |
  58 | RT   | RA   | RB   | xor   | bmatflip |
  59 | RT   | RA   | RB   |       | grev  |
  60 | RT   | RA   | RB   |       | clmul*  |
  61 | RT   | RA   | RB   |       | gorc |
  62 | RT   | RA   | RB   | shuf  | shuffle |
  63 | RT   | RA   | RB   | unshuf| shuffle |
  64 | RT   | RA   | RB   | width | xperm  |
  65 | RT   | RA   | RB   | type | minmax |
  66 | RT   | RA   | RB   |      | av abs avgadd  |
  67 | RT   | RA   | RB   | type | vmask ops |
  68 | RT   | RA   | RB   |      |       |
  69
  70 3 ops
  71
  72 * grevlog
  73 * ternlog bitops
  74 * GF mul-add
  75 * bitmask-reverse
  76
  77 TODO: convert all instructions to use RT and not RS
  78
  79 | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name |
  80 | -- | -- | --- | ---  | -----   | --------  |--| ------ |
  81 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |0 | ternlogi |
  82 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |1 | grevlog |
  83 | NN | RT | RA  | s0-4 | im0-4   | im5-7  01 |s5| grevlogi |
  84 | NN | RS | RA  | RB   | RC      | 00    011 |Rc| gfmadd |
  85 | NN | RS | RA  | RB   | RC      | 10    011 |Rc| gfmaddsub |
  86 | NN | RT | RA  | RB   | sh0-4   | sh5 1 011 |Rc| bmrevi |
  87
  88 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31| name |
  89 | -- | -- | --- | ----- | ---- | ----- |--| ------ |
  90 | NN | RT | RA  | imm   | mask | 111   |1 | ternlogv |
  91
  92 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31| name |
  93 | -- | -- | --- | --- |- |-----|----- | -----|--| -------|
  94 | NN | BA | BB  | BC  |0 |imm  | mask | 111  |0 | ternlogcr |
  95
  96 ops (note that av avg and abs as well as vec scalar mask
  97 are included here)
  98
  99 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 100 double check that instructions didn't need 3 inputs.
 101
 102 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 103 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 104 | NN | RA | RB  |     | 0  |       | 0000 110 |Rc| rsvd   |
 105 | NN | RA | RB  | RC  | 1  | itype | 0000 110 |Rc| xperm |
 106 | NN | RA | RB  | RC  | 0  | itype | 0100 110 |Rc| minmax |
 107 | NN | RA | RB  | RC  | 1  |   00  | 0100 110 |Rc| av avgadd |
 108 | NN | RA | RB  | RC  | 1  |   01  | 0100 110 |Rc| av abs |
 109 | NN | RA | RB  |     | 1  |   10  | 0100 110 |Rc| rsvd |
 110 | NN | RA | RB  |     | 1  |   11  | 0100 110 |Rc| rsvd |
 111 | NN | RA | RB  | sh  | SH | itype | 1000 110 |Rc| bmopsi |
 112 | NN | RA | RB  |     |    |       | 1100 110 |Rc| rsvd |
 113 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| gfdiv |
 114 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| gfmod |
 115 | NN | RT | RA  | RB  | 1  |  10   | 0001 110 |Rc| gfmul |
 116 | NN | RA | RB  |     | 1  |  11   | 0001 110 |Rc| rsvd |
 117 | NN | RA | RB  | RC  | 0  |   00  | 0001 110 |Rc| vec sbfm |
 118 | NN | RA | RB  | RC  | 0  |   01  | 0001 110 |Rc| vec sofm |
 119 | NN | RA | RB  | RC  | 0  |   10  | 0001 110 |Rc| vec sifm |
 120 | NN | RA | RB  | RC  | 0  |   11  | 0001 110 |Rc| vec cprop |
 121 | NN | RA | RB  |     | 0  |       | 0101 110 |Rc| rsvd |
 122 | NN | RA | RB  | RC  | 0  | 00    | 0010 110 |Rc| gorc |
 123 | NN | RA | RB  | sh  | SH | 00    | 1010 110 |Rc| gorci |
 124 | NN | RA | RB  | RC  | 0  | 00    | 0110 110 |Rc| gorcw |
 125 | NN | RA | RB  | sh  | 0  | 00    | 1110 110 |Rc| gorcwi |
 126 | NN | RA | RB  | RC  | 1  | 00    | 1110 110 |Rc| bmator  |
 127 | NN | RA | RB  | RC  | 0  | 01    | 0010 110 |Rc| grev |
 128 | NN | RA | RB  | RC  | 1  | 01    | 0010 110 |Rc| clmul |
 129 | NN | RA | RB  | sh  | SH | 01    | 1010 110 |Rc| grevi |
 130 | NN | RA | RB  | RC  | 0  | 01    | 0110 110 |Rc| grevw |
 131 | NN | RA | RB  | sh  | 0  | 01    | 1110 110 |Rc| grevwi |
 132 | NN | RA | RB  | RC  | 1  | 01    | 1110 110 |Rc| bmatxor   |
 133 | NN | RA | RB  | RC  | 0  | 10    | 0010 110 |Rc| shfl |
 134 | NN | RA | RB  | sh  | SH | 10    | 1010 110 |Rc| shfli |
 135 | NN | RA | RB  | RC  | 0  | 10    | 0110 110 |Rc| shflw |
 136 | NN | RA | RB  | RC  |    | 10    | 1110 110 |Rc| rsvd    |
 137 | NN | RA | RB  | RC  | 0  | 11    | 1110 110 |Rc| clmulr  |
 138 | NN | RA | RB  | RC  | 1  | 11    | 1110 110 |Rc| clmulh  |
 139 | NN |    |     |     |    |       | --11 110 |Rc| setvl  |
 140
 141 # bit to byte permute
 142
 143 similar to matrix permute in RV bitmanip, which has XOR and OR variants
 144
 145     do j = 0 to 7
 146       do k = 0 to 7
 147          b = VSR[VRB+32].dword[i].byte[k].bit[j]
 148          VSR[VRT+32].dword[i].byte[j].bit[k] = b
 149
# int min/max

Signed and unsigned min/max for integers.  This is partly synthesisable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not for both signed and unsigned.  When the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).

Separate signed and unsigned min/max instructions give more flexibility.

```
uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
{ return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
{ return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
{ return rs1 < rs2 ? rs1 : rs2;
}
uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
{ return rs1 > rs2 ? rs1 : rs2;
}
```
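As a minimal sketch of why both signed and unsigned variants are useful, the following Python models the four functions on XLEN-bit values. XLEN=64, the helper `to_signed` and the names `smin`/`smax`/`umin`/`umax` are assumptions made for illustration only and are not part of this specification.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def to_signed(x):
    """Reinterpret an XLEN-bit unsigned value as two's complement."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x

def smin(a, b):                # models min: signed comparison
    return a if to_signed(a) < to_signed(b) else b

def smax(a, b):                # models max: signed comparison
    return a if to_signed(a) > to_signed(b) else b

def umin(a, b):                # models minu: unsigned comparison
    return a if (a & MASK) < (b & MASK) else b

def umax(a, b):                # models maxu: unsigned comparison
    return a if (a & MASK) > (b & MASK) else b
```

A register with all bits set is -1 when treated as signed but the largest possible value when treated as unsigned, so `smin` and `umin` select different operands for the same inputs.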
 170
 171
# ternlog bitops

Similar to FPGA LUTs: for every bit, perform a lookup into a table held in an 8-bit immediate or in another register.

Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 177
 178 ## ternlogi
 179
 180 | 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
 181 | -- | -- | --- | --- | ----- | -------- |--|
 182 | NN | RT | RA  | RB  | im0-4 | im5-7 00 |0 |
 183
 184     lut3(imm, a, b, c):
 185         idx = c << 2 | b << 1 | a
 186         return imm[idx] # idx by LSB0 order
 187
 188     for i in range(64):
 189         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
 190
Bits 21..22 may be used to specify a mode, such as treating the whole integer as zero/nonzero and putting 1/0 in the result, rather than performing a bitwise test.
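The per-bit lookup can be sketched in Python as follows. This is only an illustrative model of the pseudocode: registers are plain integers, `xlen` is an assumed parameter, and the new value of RT is returned rather than written back.

```python
def lut3(imm, a, b, c):
    """Look up one bit of the 8-bit immediate, indexed in LSB0 order."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(imm, RT, RA, RB, xlen=64):
    """Return the new RT: each result bit is lut3 of the RB, RA and RT bits."""
    result = 0
    for i in range(xlen):
        bit = lut3(imm, (RB >> i) & 1, (RA >> i) & 1, (RT >> i) & 1)
        result |= bit << i
    return result
```

Choosing the immediate selects the boolean function: 0x96 gives a three-way XOR, 0x80 a three-way AND, 0xFE a three-way OR and 0xE8 a bitwise majority vote.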
 192
 193 ## ternlog
 194
 195 a 4 operand variant which becomes more along the lines of an FPGA:
 196
 197 | 0.5|6.10|11.15|16.20|21.25| 26...30  |31|
 198 | -- | -- | --- | --- | --- | -------- |--|
 199 | NN | RT | RA  | RB  | RC  | mode 100 |1 |
 200
 201     for i in range(64):
 202         idx = RT[i] << 2 | RA[i] << 1 | RB[i]
 203         RT[i] = (RC & (1<<idx)) != 0
 204
mode (2 bits) may be used to select inversion of ordering, giving 3 modes,
similar to carryless mul.
 207
 208 ## ternlogv
 209
 210 also, another possible variant involving swizzle and vec4:
 211
 212 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
 213 | -- | -- | --- | ----- | ---- | ----- |--|
 214 | NN | RT | RA  | imm   | mask | 101   |1 |
 215
 216     for i in range(8):
 217         idx = RA.x[i] << 2 | RA.y[i] << 1 | RA.z[i]
 218         res = (imm & (1<<idx)) != 0
 219         for j in range(3):
 220              if mask[j]: RT[i+j*8] = res
 221
 222 ## ternlogcr
 223
 224 another mode selection would be CRs not Ints.
 225
 226 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31|
 227 | -- | -- | --- | --- |- |-----|----- | -----|--|
 228 | NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 |
 229
    for i in range(4):
        if not mask[i]: continue
        idx = crregs[BA][i] << 2 |
              crregs[BB][i] << 1 |
              crregs[BC][i]
        crregs[BA][i] = (imm & (1<<idx)) != 0
 236
 237 ## cmix
 238
 239 based on RV bitmanip, covered by ternlog bitops
 240
 241 ```
 242 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 243     return (RA & RB) | (RC & ~RB);
 244 }
 245 ```
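To illustrate the claim that cmix is covered by the ternlog bitops, the sketch below derives the 8-bit immediate that reproduces cmix. The operand-to-index assignment (RB on index bit 0, RA on bit 1, RC on bit 2) and the helper name `ternlog` are assumptions chosen for the example; a different assignment would simply give a different immediate.

```python
MASK = (1 << 64) - 1

def cmix(RA, RB, RC):
    """Reference behaviour: take RA where RB is 1, RC where RB is 0."""
    return (RA & RB) | (RC & ~RB)

def ternlog(imm, x0, x1, x2, xlen=64):
    """Bitwise 3-input LUT: x0 is index bit 0, x1 bit 1, x2 bit 2."""
    out = 0
    for i in range(xlen):
        idx = ((x2 >> i) & 1) << 2 | ((x1 >> i) & 1) << 1 | ((x0 >> i) & 1)
        out |= ((imm >> idx) & 1) << i
    return out

# Build the immediate by evaluating cmix on every single-bit combination.
CMIX_IMM = 0
for idx in range(8):
    rb, ra, rc = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
    CMIX_IMM |= (cmix(ra, rb, rc) & 1) << idx
```

With this assignment the derived immediate is 0xD8, so a single ternlog with that constant behaves as cmix.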
 246
 247
# bitmask set

Based on RV bitmanip single-bit set, with an instruction format similar to shift
[[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl, but only the immediate version).
However bitmask-invert is not, and set/clr are not covered either, although they can use the same Shift ALU.

The bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, whereas rldicl is a left rotate.  For the immediate version this does not matter, so a bmexti is not required.
For bmrev, however, there is no direct equivalent and consequently a bmrevi is required.

bmset (register for mask amount) is particularly useful for creating
predicate masks where the length is a dynamic runtime quantity.
bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 260
 261 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 262 | -- | -- | --- | --- | --- | ------- |--| ----- |
 263 | NN | RT | RA  | RB  | RC  | mode 010 |Rc| bm*   |
 264
 265
```
uint_xlen_t bmset(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return RA | (mask << shamt);
}

uint_xlen_t bmclr(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return RA & ~(mask << shamt);
}

uint_xlen_t bminv(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return RA ^ (mask << shamt);
}

uint_xlen_t bmext(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return mask & (RA >> shamt);
}
```
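The four operations can be tried out with the Python sketch below, which mirrors the reference code above. XLEN=64 and the helper `bm_mask` are assumptions for illustration. Note that, as the reference code is written, `(2<<sh)-1` yields a run of `sh+1` ones.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def bm_mask(sh):
    """Run of sh+1 ones, as produced by (2<<sh)-1 in the reference code."""
    return ((2 << sh) - 1) & MASK

def bmset(RA, RB, sh):
    return (RA | (bm_mask(sh) << (RB & (XLEN - 1)))) & MASK

def bmclr(RA, RB, sh):
    return RA & ~(bm_mask(sh) << (RB & (XLEN - 1))) & MASK

def bminv(RA, RB, sh):
    return (RA ^ (bm_mask(sh) << (RB & (XLEN - 1)))) & MASK

def bmext(RA, RB, sh):
    return bm_mask(sh) & (RA >> (RB & (XLEN - 1)))
```

For example, `bmset(0, 0, 3)` gives a four-bit predicate mask starting at bit 0 without reading any initialised register, and `bmext(0xABCD, 4, 7)` extracts the byte 0xBC.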
 295
bitmask extract with reverse.  This can be done by bit-reversing all of RA and getting bits of RA from the opposite end.

```
msb = rb[5:0];
rev[0:msb] = ra[msb:0];
rt = ZE(rev[msb:0]);

uint_xlen_t bmextrev(uint_xlen_t RA, uint_xlen_t RB, int sh)
{
    int shamt = (RB & (XLEN - 1));
    shamt = (XLEN-1)-shamt;               // shift other end
    uint_xlen_t bra = bitreverse(RA);     // swap LSB-MSB
    uint_xlen_t mask = ((uint_xlen_t)2<<sh)-1;
    return mask & (bra >> shamt);
}
```
 312
 313 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 314 | -- | -- | --- | --- | --- | ------- |--| ------ |
 315 | NN | RT | RA  | RB  | sh  | 1   011 |Rc| bmrevi |
 316
 317
# grevlut

Generalised reverse combined with a LUT2, and allowing
zero when RA=0, provides a wide range of instructions
and a means to set regular 64 bit patterns in one
32 bit instruction.

```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm, step_i, chunk_size):
    for j in 0 to 63:
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 lut2)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    int imm = lut2 & 0b1111;
    for i in 0 to 5
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}

```
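A runnable Python rendering of the pseudocode is sketched below, with registers modelled as plain integers. It is an illustration only. Under this model the immediate 0b1100 makes each bit copy its partner (plain generalised reverse), 0b1110 ORs each bit with its partner (generalised OR-combine), and 0b1010 leaves the input unchanged.

```python
MASK64 = (1 << 64) - 1

def lut2(imm, a, b):
    """Look up one bit of the 4-bit immediate, indexed in LSB0 order."""
    idx = (b << 1) | a
    return (imm >> idx) & 1

def dorow(imm, step_i, chunk_size):
    """Combine every bit j with its partner at position j ^ chunk_size."""
    step_o = 0
    for j in range(64):
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o

def grevlut64(RA, RB, imm):
    x = RA & MASK64
    shamt = RB & 63
    for i in range(6):          # one row per bit of the 6-bit shift amount
        step = 1 << i
        if shamt & step:
            x = dorow(imm & 0b1111, x, step)
    return x
```

With `imm=0b1100`, a shift amount of 63 reverses all 64 bits and 56 reverses the byte order.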
 347
# grev

Based on RV bitmanip, this is also known as a butterfly network. However,
where a butterfly network allows setting of every crossbar setting in
every row and every column, generalised-reverse (grev) only allows
a per-row decision: every entry in the same row must either switch or
not-switch.
 355
 356 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 357
 358 ```
 359 uint64_t grev64(uint64_t RA, uint64_t RB)
 360 {
 361     uint64_t x = RA;
 362     int shamt = RB & 63;
 363     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 364                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 365     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 366                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 367     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 368                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 369     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 370                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 371     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 372                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 373     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 374                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 375     return x;
 376 }
 377
 378 ```
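The per-row decision can be seen in the following Python sketch of the same function, where each set bit of the shift amount enables one row of swaps. The table `STAGES` is a name introduced here for illustration.

```python
MASK64 = (1 << 64) - 1

# (distance, mask selecting the lower element of each swapped pair)
STAGES = [
    (1,  0x5555555555555555),
    (2,  0x3333333333333333),
    (4,  0x0F0F0F0F0F0F0F0F),
    (8,  0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def grev64(RA, RB):
    x = RA & MASK64
    shamt = RB & 63
    for step, lo in STAGES:
        if shamt & step:
            # swap adjacent groups of `step` bits across the whole row
            x = ((x & lo) << step) | ((x >> step) & lo)
    return x
```

Useful special cases follow directly: a shift amount of 63 is a full bit reversal, 56 is a byte swap, and 7 reverses the bits within each byte. Every grev is its own inverse.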
 379
 380 # shuffle / unshuffle
 381
 382 based on RV bitmanip
 383
```
uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
    uint32_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
uint32_t shfl32(uint32_t RA, uint32_t RB)
{
    uint32_t x = RA;
    int shamt = RB & 15;
    if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    return x;
}
uint32_t unshfl32(uint32_t RA, uint32_t RB)
{
    uint32_t x = RA;
    int shamt = RB & 15;
    if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    return x;
}

uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
{
    uint64_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
uint64_t shfl64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 31;
    if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
                                           0x00000000ffff0000LL, 16);
    if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
                                           0x0000ff000000ff00LL, 8);
    if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
                                           0x00f000f000f000f0LL, 4);
    if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
                                           0x0c0c0c0c0c0c0c0cLL, 2);
    if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
                                           0x2222222222222222LL, 1);
    return x;
}
uint64_t unshfl64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 31;
    if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
                                           0x2222222222222222LL, 1);
    if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
                                           0x0c0c0c0c0c0c0c0cLL, 2);
    if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
                                           0x00f000f000f000f0LL, 4);
    if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
                                           0x0000ff000000ff00LL, 8);
    if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
                                           0x00000000ffff0000LL, 16);
    return x;
}
```
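The 64-bit variants can be explored with the Python sketch below, which follows the same stage masks. The names `STAGES` and `stage` are introduced here for illustration. Each stage swaps the two middle quarters of every block, so applying the stages in the opposite order undoes the shuffle.

```python
MASK64 = (1 << 64) - 1

# (maskL, maskR, N), widest stage first, as used by shfl64
STAGES = [
    (0x0000ffff00000000, 0x00000000ffff0000, 16),
    (0x00ff000000ff0000, 0x0000ff000000ff00, 8),
    (0x0f000f000f000f00, 0x00f000f000f000f0, 4),
    (0x3030303030303030, 0x0c0c0c0c0c0c0c0c, 2),
    (0x4444444444444444, 0x2222222222222222, 1),
]

def stage(src, maskL, maskR, n):
    x = src & ~(maskL | maskR) & MASK64
    return x | ((src << n) & maskL) | ((src >> n) & maskR)

def shfl64(RA, RB):
    x = RA & MASK64
    for maskL, maskR, n in STAGES:
        if RB & n:
            x = stage(x, maskL, maskR, n)
    return x

def unshfl64(RA, RB):
    x = RA & MASK64
    for maskL, maskR, n in reversed(STAGES):
        if RB & n:
            x = stage(x, maskL, maskR, n)
    return x
```

With all five stages enabled (a control value of 31) the result is a perfect interleave of the lower and upper 32-bit halves: the lower half lands on the even bit positions and the upper half on the odd ones.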
 445
 446 # xperm
 447
 448 based on RV bitmanip
 449
 450 ```
 451 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 452 {
 453     uint_xlen_t r = 0;
 454     uint_xlen_t sz = 1LL << sz_log2;
 455     uint_xlen_t mask = (1LL << sz) - 1;
 456     for (int i = 0; i < XLEN; i += sz) {
 457         uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
 458         if (pos < XLEN)
 459             r |= ((RA >> pos) & mask) << i;
 460     }
 461     return r;
 462 }
 463 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 464 {  return xperm(RA, RB, 2); }
 465 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 466 {  return xperm(RA, RB, 3); }
 467 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 468 {  return xperm(RA, RB, 4); }
 469 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 470 {  return xperm(RA, RB, 5); }
 471 ```
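As a small illustration, xperm can be read as a table lookup: each element of RB is an index selecting an element of RA, and out-of-range indices produce zero. The Python sketch below follows the reference code, with `xlen` as an assumed parameter.

```python
def xperm(RA, RB, sz_log2, xlen=64):
    sz = 1 << sz_log2                 # element width in bits
    mask = (1 << sz) - 1
    r = 0
    for i in range(0, xlen, sz):
        # element of RB at bit offset i, scaled to a bit position in RA
        pos = ((RB >> i) & mask) << sz_log2
        if pos < xlen:                # out-of-range indices leave zero
            r |= ((RA >> pos) & mask) << i
    return r
```

For instance, with byte-sized elements (`sz_log2=3`) an index vector of 7,6,...,0 reverses the bytes of RA, and with nibble-sized elements an identity table in RA simply returns RB.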
 472
 473 # gorc
 474
 475 based on RV bitmanip
 476
 477 ```
 478 uint32_t gorc32(uint32_t RA, uint32_t RB)
 479 {
 480     uint32_t x = RA;
 481     int shamt = RB & 31;
 482     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 483     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 484     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 485     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 486     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 487     return x;
 488 }
 489 uint64_t gorc64(uint64_t RA, uint64_t RB)
 490 {
 491     uint64_t x = RA;
 492     int shamt = RB & 63;
 493     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 494                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 495     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 496                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 497     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 498                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 499     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 500                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 501     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 502                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 503     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 504                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 505     return x;
 506 }
 507
 508 ```
 509
 510 # Galois Field 2^M
 511
 512 see:
 513
 514 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 515 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 516 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 517
 518 ## SPRs to set modulo and degree
 519
 520 to save registers and make operations orthogonal with standard
 521 arithmetic the modulo is to be set in an SPR
 522
## Twin Butterfly (Cooley-Tukey) Mul-add-sub

used in combination with SV FFT REMAP to perform
a full NTT in-place
 527
 528     gffmadd  RT,RA,RC,RB (Rc=0)
 529     gffmadd. RT,RA,RC,RB (Rc=1)
 530
 531 Pseudo-code:
 532
 533     RT <- GFMULADD(RA, RC, RB)
 534     RS <- GFMULADD(RA, RC, RB)
 535
 536
## Multiply

with the modulo and degree being in an SPR, multiply can take a form
identical to that of standard integer add
 541
 542     RS = GFMUL(RA, RB)
 543
 544 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31|
 545 | -- | -- | --- | --- | --- | ------ |--|
 546 | NN | RT | RA  | RB  |11000|  01110 |Rc|
 547
 548
 549
 550 ```
 551 from functools import reduce
 552
 553 def gf_degree(a) :
 554   res = 0
 555   a >>= 1
 556   while (a != 0) :
 557     a >>= 1;
 558     res += 1;
 559   return res
 560
 561 # constants used in the multGF2 function
 562 mask1 = mask2 = polyred = None
 563
 564 def setGF2(irPoly):
 565     """Define parameters of binary finite field GF(2^m)/g(x)
 566        - irPoly: coefficients of irreducible polynomial g(x)
 567     """
 568     # degree: extension degree of binary field
 569     degree = gf_degree(irPoly)
 570
 571     def i2P(sInt):
 572         """Convert an integer into a polynomial"""
 573         return [(sInt >> i) & 1
 574                 for i in reversed(range(sInt.bit_length()))]
 575
 576     global mask1, mask2, polyred
 577     mask1 = mask2 = 1 << degree
 578     mask2 -= 1
 579     polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])
 580
 581 def multGF2(p1, p2):
 582     """Multiply two polynomials in GF(2^m)/g(x)"""
 583     p = 0
 584     while p2:
 585         # standard long-multiplication: check LSB and add
 586         if p2 & 1:
 587             p ^= p1
 588         p1 <<= 1
 589         # standard modulo: check MSB and add polynomial
 590         if p1 & mask1:
 591             p1 ^= polyred
 592         p2 >>= 1
 593     return p & mask2
 594
 595 if __name__ == "__main__":
 596
 597     # Define binary field GF(2^3)/x^3 + x + 1
 598     setGF2(0b1011) # degree 3
 599
 600     # Evaluate the product (x^2 + x + 1)(x^2 + 1)
 601     print("{:02x}".format(multGF2(0b111, 0b101)))
 602
 603     # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
 604     # (used in the Advanced Encryption Standard-AES)
 605     setGF2(0b100011011) # degree 8
 606
 607     # Evaluate the product (x^7)(x^7 + x + 1)
 608     print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
 609 ```
## GF div and mod

```
def gf_degree(a):
    res = 0
    a >>= 1
    while a != 0:
        a >>= 1
        res += 1
    return res

def FullDivision(f, v):
    """
    Takes two arguments, f, v
    fDegree and vDegree are the degrees of the field elements
    f and v represented as polynomials.
    This method returns the field elements a and b such that

        f(x) = a(x) * v(x) + b(x).

    That is, a is the quotient and b is the remainder, or in
    other words a is like floor(f/v) and b is like f modulo v.
    """
    fDegree, vDegree = gf_degree(f), gf_degree(v)
    res, rem = 0, f
    i = fDegree
    mask = 1 << i
    while i >= vDegree:
        if mask & rem:  # check MSB
            res ^= 1 << (i - vDegree)
            rem ^= v << (i - vDegree)
        i -= 1
        mask >>= 1
    return (res, rem)
```
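The identity f(x) = a(x) * v(x) + b(x) can be checked with a short self-contained Python sketch. The names `gf_divmod` and `clmul` are introduced here for illustration; `clmul` is an ordinary carry-less (polynomial) multiply, and addition of polynomials over GF(2) is XOR.

```python
def gf_degree(a):
    """Degree of the polynomial a, with degree 0 for both 0 and 1."""
    return max(a.bit_length() - 1, 0)

def clmul(a, b):
    """Carry-less multiply of two polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf_divmod(f, v):
    """Polynomial long division over GF(2): returns (quotient, remainder)."""
    if v == 0:
        raise ZeroDivisionError("polynomial division by zero")
    res, rem = 0, f
    vdeg = gf_degree(v)
    for i in range(gf_degree(f), vdeg - 1, -1):
        if rem & (1 << i):            # current top bit still set
            res ^= 1 << (i - vdeg)
            rem ^= v << (i - vdeg)
    return res, rem
```

As an example, dividing x^8 by the AES polynomial 0x11B gives quotient 1 and remainder 0x1B, which is the familiar reduction constant.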
 646
 647 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 648 | -- | -- | --- | --- | --- | ------- |--| ----- |
 649 | NN | RS | RA  | deg | RC  | 0 1  011 |Rc| gfaddi |
 650 | NN | RS | RA  | RB  | RC  | 1 1  111 |Rc| gfadd |
 651
 652 GFMOD is a pseudo-op where RA=0
 653
## carryless mul

based on RV bitmanip,
see <https://en.wikipedia.org/wiki/CLMUL_instruction_set>

These are GF2 operations with the modulo set to 2^degree.
They are worth adding as their own non-overwrite operations
(in the same pipeline).
 662
 663 ```
 664 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
 665 {
 666     uint_xlen_t x = 0;
 667     for (int i = 0; i < XLEN; i++)
 668         if ((RB >> i) & 1)
 669             x ^= RA << i;
 670     return x;
 671 }
 672 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
 673 {
 674     uint_xlen_t x = 0;
 675     for (int i = 1; i < XLEN; i++)
 676         if ((RB >> i) & 1)
 677             x ^= RA >> (XLEN-i);
 678     return x;
 679 }
 680 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
 681 {
 682     uint_xlen_t x = 0;
 683     for (int i = 0; i < XLEN; i++)
 684         if ((RB >> i) & 1)
 685             x ^= RA >> (XLEN-i-1);
 686     return x;
 687 }
 688 ```
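One way to view the three variants, sketched here in Python under the assumption XLEN=64, is as different windows onto a single double-width carry-less product. The helper `clmul_full` is introduced for illustration only.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def clmul_full(a, b):
    """Full carry-less product of two XLEN-bit values (up to 2*XLEN-1 bits)."""
    x = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            x ^= a << i
    return x

def clmul(a, b):               # low XLEN bits of the product
    return clmul_full(a, b) & MASK

def clmulh(a, b):              # bits XLEN and above
    return clmul_full(a, b) >> XLEN

def clmulr(a, b):              # bits XLEN-1 to 2*XLEN-2
    return (clmul_full(a, b) >> (XLEN - 1)) & MASK
```

For example, `clmul(0b11, 0b11)` is 0b101, since (x+1)^2 = x^2+1 over GF(2): the middle terms cancel because no carries propagate.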
 689
 690 # bitmatrix
 691
 692 ```
 693 uint64_t bmatflip(uint64_t RA)
 694 {
 695     uint64_t x = RA;
 696     x = shfl64(x, 31);
 697     x = shfl64(x, 31);
 698     x = shfl64(x, 31);
 699     return x;
 700 }
 701 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 702 {
 703     // transpose of RB
 704     uint64_t RBt = bmatflip(RB);
 705     uint8_t u[8]; // rows of RA
 706     uint8_t v[8]; // cols of RB
 707     for (int i = 0; i < 8; i++) {
 708         u[i] = RA >> (i*8);
 709         v[i] = RBt >> (i*8);
 710     }
 711     uint64_t x = 0;
 712     for (int i = 0; i < 64; i++) {
 713         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 714             x |= 1LL << i;
 715     }
 716     return x;
 717 }
 718 uint64_t bmator(uint64_t RA, uint64_t RB)
 719 {
 720     // transpose of RB
 721     uint64_t RBt = bmatflip(RB);
 722     uint8_t u[8]; // rows of RA
 723     uint8_t v[8]; // cols of RB
 724     for (int i = 0; i < 8; i++) {
 725         u[i] = RA >> (i*8);
 726         v[i] = RBt >> (i*8);
 727     }
 728     uint64_t x = 0;
 729     for (int i = 0; i < 64; i++) {
 730         if ((u[i / 8] & v[i % 8]) != 0)
 731             x |= 1LL << i;
 732     }
 733     return x;
 734 }
 735
 736 ```
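Treating a 64-bit register as an 8x8 bit matrix (row r in byte r, column c in bit c of that byte), bmatxor is a matrix product over GF(2) and bmator is the same product with OR in place of XOR. The Python sketch below illustrates this. It uses a direct transpose (`transpose8x8`, a name introduced here) in place of the three-fold shuffle, and a single hypothetical `bmat` function with a flag rather than two separate functions.

```python
def transpose8x8(x):
    """Move the bit at (row r, column c) to (row c, column r)."""
    out = 0
    for r in range(8):
        for c in range(8):
            out |= ((x >> (8 * r + c)) & 1) << (8 * c + r)
    return out

def bmat(RA, RB, xor=True):
    """8x8 bit-matrix product: XOR-accumulate (bmatxor) or OR-accumulate (bmator)."""
    RBt = transpose8x8(RB)
    out = 0
    for i in range(64):
        u = (RA >> (8 * (i // 8))) & 0xFF    # row i/8 of RA
        v = (RBt >> (8 * (i % 8))) & 0xFF    # column i%8 of RB
        hits = bin(u & v).count("1")
        bit = (hits & 1) if xor else int(hits != 0)
        out |= bit << i
    return out

IDENTITY = 0x8040201008040201   # ones on the diagonal
```

Multiplying by `IDENTITY` on either side returns the other operand unchanged. The two variants differ when several terms contribute: an all-ones matrix multiplied by itself gives zero with XOR (eight ones cancel) but all ones with OR.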
 737
# Already in POWER ISA

## count leading/trailing zeros with mask

in v3.1 p105

```
count = 0
do i = 0 to 63
    if((RB)i=1) then do
        if((RS)i=1) then break
        count ← count + 1
    end
end
RA ← EXTZ64(count)
```
 750
##  bit deposit

vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
 754
 755     do while(m < 64)
 756        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
 757           result = VSR[VRA+32].dword[i].bit[63-k]
 758           VSR[VRT+32].dword[i].bit[63-m] = result
 759           k = k + 1
 760        m = m + 1
 761
 762 ```
 763
 764 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
 765 {
 766     uint_xlen_t r = 0;
 767     for (int i = 0, j = 0; i < XLEN; i++)
 768         if ((RB >> i) & 1) {
 769             if ((RA >> j) & 1)
 770                 r |= uint_xlen_t(1) << i;
 771             j++;
 772         }
 773     return r;
 774 }
 775
 776 ```
 777
## bit extract

other way round: identical to RV bext, found in v3.1 p196
 781
 782 ```
 783 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
 784 {
 785     uint_xlen_t r = 0;
 786     for (int i = 0, j = 0; i < XLEN; i++)
 787         if ((RB >> i) & 1) {
 788             if ((RA >> i) & 1)
 789                 r |= uint_xlen_t(1) << j;
 790             j++;
 791         }
 792     return r;
 793 }
 794 ```
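The relationship between deposit and extract can be seen in the following Python sketch of the two reference functions, with `xlen` as an assumed parameter. Deposit scatters the low bits of RA to the positions where the mask RB is set; extract gathers them back.

```python
def bdep(RA, RB, xlen=64):
    """Scatter the low bits of RA to the set-bit positions of mask RB."""
    r, j = 0, 0
    for i in range(xlen):
        if (RB >> i) & 1:
            if (RA >> j) & 1:
                r |= 1 << i
            j += 1
    return r

def bext(RA, RB, xlen=64):
    """Gather the bits of RA at the set-bit positions of mask RB."""
    r, j = 0, 0
    for i in range(xlen):
        if (RB >> i) & 1:
            if (RA >> i) & 1:
                r |= 1 << j
            j += 1
    return r
```

Extracting with the same mask recovers as many low bits of the original value as the mask has set bits.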
 795
## centrifuge

found in v3.1 p106 so not to be added here

```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)i=0) then do
       result[ptr0] = (RS)i
       ptr0 = ptr0 + 1
    end
    if((RB)63-i==1) then do
        result[63-ptr1] = (RS)63-i
        ptr1 = ptr1 + 1
    end
end
RA = result
```
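The effect of centrifuge is easier to see with a small Python sketch: bits of RS whose mask bit is 0 are packed towards the most significant end, bits whose mask bit is 1 towards the least significant end, and the relative order within each group is preserved. The function name `cfuged` and the `width` parameter are chosen here for illustration, and a narrow width is used to keep the example readable.

```python
def cfuged(RS, RB, width=64):
    """Centrifuge: mask-0 bits of RS go left (MSB end), mask-1 bits go right."""
    left, right = [], []
    for i in range(width):                   # i counts from the MSB (MSB0 order)
        bit = (RS >> (width - 1 - i)) & 1
        if (RB >> (width - 1 - i)) & 1:
            right.append(bit)
        else:
            left.append(bit)
    result = 0
    for bit in left + right:                 # reassemble, MSB first
        result = (result << 1) | bit
    return result
```

With an 8-bit width, `cfuged(0b10110100, 0b11001100, width=8)` separates the four unmasked bits (1,1,0,0) from the four masked bits (1,0,0,1), giving 0b11001001.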
 814