[[!tag standards]]

# Implementation Log

* ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
* grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
* remove Rc=1 from ternlog due to conflicts in encoding as well as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
* GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>

# bitmanipulation

**DRAFT STATUS**

this extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX. Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].

When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years). Also, VSX is hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low-power or embedded scenarios.

ternlogv is experimental and is the only operation that may be considered a "Packed SIMD". It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to achieve similar objectives.

general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science. however, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added. The expectation is that these operations would all be covered by the same pipeline.

note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], and
the [[sv/av_opcodes]] as well as [[sv/setvl]].

Useful resources:

* <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
* <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>

# summary

minor opcode allocation

| 28.30 | 31 | name      |
| ----- | -- | --------- |
| 00    | 0  | ternlogi  |
| 000   | 1  | ternlog   |
| 100   | 1  | grevlog   |
| 010   | Rc | bitmask   |
| 011   | Rc | gf*       |
| 101   | 1  | ternlogv  |
| 101   | 0  | ternlogcr |
| 110   | Rc | 1/2-op    |
| 111   | Rc | 3-op      |

1-op and variants

| dest | src1 | subop | op       |
| ---- | ---- | ----- | -------- |
| RT   | RA   | ..    | bmatflip |

2-op and variants

| dest | src1 | src2 | subop  | op            |
| ---- | ---- | ---- | ------ | ------------- |
| RT   | RA   | RB   | or     | bmatflip      |
| RT   | RA   | RB   | xor    | bmatflip      |
| RT   | RA   | RB   |        | grev          |
| RT   | RA   | RB   |        | clmul*        |
| RT   | RA   | RB   |        | gorc          |
| RT   | RA   | RB   | shuf   | shuffle       |
| RT   | RA   | RB   | unshuf | shuffle       |
| RT   | RA   | RB   | width  | xperm         |
| RT   | RA   | RB   | type   | minmax        |
| RT   | RA   | RB   |        | av abs avgadd |
| RT   | RA   | RB   | type   | vmask ops     |
| RT   | RA   | RB   |        |               |

3 ops

* bitmask set/extract
* ternlog bitops
* GF

TODO: convert all instructions to use RT and not RS

| 0.5|6.10|11.15|16.20 |21..25   | 26....30 |31| name      |
| -- | -- | --- | ---- | ------- | -------- |--| ------    |
| NN | RT | RA  | RB   | RC      | mode 000 |1 | ternlog   |
| NN | RT | RA  | RB   | im0-4   | im5-7 00 |0 | ternlogi  |
| NN | RT | RA  | RB   | / im0-3 | 00 100   |1 | grevlog   |
| NN | RT | RA  | s0-5 | s6 im0-3| 01 100   |1 | grevlogi  |
| NN | RT | RA  |      |         | 1- 100   |1 | rsvd      |
| NN | RS | RA  | RB   | RC      | 00 011   |Rc| gfmul     |
| NN | RS | RA  | RB   | RC      | 01 011   |Rc| gfmaddsub |
| NN | RT | RA  | RB   |         | 10 011   |Rc| rsvd      |
| NN | RS | RA  | RB   |         | 11 011   |Rc| rsvd      |
| NN | RS | RA  | RB   |         | 11 111   |Rc| rsvd      |

| 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31| name     |
| -- | -- | --- | ----- | ---- | ----- |--| ------   |
| NN | RT | RA  | imm   | mask | 101   |1 | ternlogv |

| 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31| name      |
| -- | -- | --- | --- |- |-----|----- | -----|--| -------   |
| NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 | ternlogcr |

ops (note that av avg and abs as well as vec scalar mask
are included here)

TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
double check that instructions didn't need 3 inputs.

| 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name      |
| -- | -- | --- | --- | -- | ----- | -------- |--| ----      |
| NN | RA | RB  |     | 0  |       | 0000 110 |Rc| rsvd      |
| NN | RA | RB  | RC  | 1  | itype | 0000 110 |Rc| xperm     |
| NN | RA | RB  | RC  | 0  | itype | 0100 110 |Rc| minmax    |
| NN | RA | RB  | RC  | 1  | 00    | 0100 110 |Rc| av avgadd |
| NN | RA | RB  | RC  | 1  | 01    | 0100 110 |Rc| av abs    |
| NN | RA | RB  |     | 1  | 10    | 0100 110 |Rc| rsvd      |
| NN | RA | RB  |     | 1  | 11    | 0100 110 |Rc| rsvd      |
| NN | RA | RB  | sh  | SH | itype | 1000 110 |Rc| bmopsi    |
| NN | RA | RB  |     |    |       | 1100 110 |Rc| rsvd      |
| NN | RT | RA  | RB  | 1  | 00    | 0001 110 |Rc| gfdiv     |
| NN | RT | RA  | RB  | 1  | 01    | 0001 110 |Rc| gfmod     |
| NN | RT | RA  |     | 1  | 10    | 0001 110 |Rc| rsvd      |
| NN | RA | RB  |     | 1  | 11    | 0001 110 |Rc| rsvd      |
| NN | RA | RB  | RC  | 0  | 00    | 0001 110 |Rc| vec sbfm  |
| NN | RA | RB  | RC  | 0  | 01    | 0001 110 |Rc| vec sofm  |
| NN | RA | RB  | RC  | 0  | 10    | 0001 110 |Rc| vec sifm  |
| NN | RA | RB  | RC  | 0  | 11    | 0001 110 |Rc| vec cprop |
| NN | RA | RB  |     | 0  |       | 0101 110 |Rc| rsvd      |
| NN | RA | RB  | RC  | 0  | 00    | 0010 110 |Rc| gorc      |
| NN | RA | RB  | sh  | SH | 00    | 1010 110 |Rc| gorci     |
| NN | RA | RB  | RC  | 0  | 00    | 0110 110 |Rc| gorcw     |
| NN | RA | RB  | sh  | 0  | 00    | 1110 110 |Rc| gorcwi    |
| NN | RA | RB  | RC  | 1  | 00    | 1110 110 |Rc| bmator    |
| NN | RA | RB  | RC  | 0  | 01    | 0010 110 |Rc| grev      |
| NN | RA | RB  | RC  | 1  | 01    | 0010 110 |Rc| clmul     |
| NN | RA | RB  | sh  | SH | 01    | 1010 110 |Rc| grevi     |
| NN | RA | RB  | RC  | 0  | 01    | 0110 110 |Rc| grevw     |
| NN | RA | RB  | sh  | 0  | 01    | 1110 110 |Rc| grevwi    |
| NN | RA | RB  | RC  | 1  | 01    | 1110 110 |Rc| bmatxor   |
| NN | RA | RB  | RC  | 0  | 10    | 0010 110 |Rc| shfl      |
| NN | RA | RB  | sh  | SH | 10    | 1010 110 |Rc| shfli     |
| NN | RA | RB  | RC  | 0  | 10    | 0110 110 |Rc| shflw     |
| NN | RA | RB  | RC  |    | 10    | 1110 110 |Rc| rsvd      |
| NN | RA | RB  | RC  | 0  | 11    | 1110 110 |Rc| clmulr    |
| NN | RA | RB  | RC  | 1  | 11    | 1110 110 |Rc| clmulh    |
| NN |    |     |     |    |       | --11 110 |Rc| setvl     |

# bit to byte permute

similar to matrix permute in RV bitmanip, which has XOR and OR variants

    do j = 0 to 7
        do k = 0 to 7
            b = VSR[VRB+32].dword[i].byte[k].bit[j]
            VSR[VRT+32].dword[i].byte[j].bit[k] = b
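
A minimal Python sketch of the per-doubleword 8x8 bit-matrix transpose described by the pseudocode above (the bit-numbering convention within the doubleword is an assumption for illustration only):

```
# model of one 8x8 bit-matrix transpose within a 64-bit doubleword;
# bit (k*8 + j) is taken as byte k, bit j (LSB0 numbering - an assumption)
def bit_to_byte_permute(dword):
    result = 0
    for j in range(8):
        for k in range(8):
            b = (dword >> (k * 8 + j)) & 1   # byte k, bit j of the source
            result |= b << (j * 8 + k)       # byte j, bit k of the result
    return result
```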

# int min/max

signed and unsigned min/max for integer. this is sort-of partly synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not both signed and unsigned. when the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).

signed/unsigned min/max gives more flexibility.

```
uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
{ return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
{ return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
{ return rs1 < rs2 ? rs1 : rs2;
}
uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
{ return rs1 > rs2 ? rs1 : rs2;
}
```

# ternlog bitops

Similar to FPGA LUTs: for every bit, perform a lookup into a table using an 8-bit immediate, or a value in another register.

Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.

## ternlogi

| 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
| -- | -- | --- | --- | ----- | -------- |--|
| NN | RT | RA  | RB  | im0-4 | im5-7 00 |0 |

    lut3(imm, a, b, c):
        idx = c << 2 | b << 1 | a
        return imm[idx] # idx by LSB0 order

    for i in range(64):
        RT[i] = lut3(imm, RB[i], RA[i], RT[i])

bits 21..22 may be used to specify a mode, such as treating the whole integer as zero/nonzero and putting 1/0 in the result, rather than testing bitwise.
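
As a worked example (illustrative only, not part of the specification): the immediate 0xE8 programs the 3-input majority function. A small Python model of the pseudocode above:

```
# model of the lut3/ternlogi pseudocode above; imm=0xE8 is the 3-input
# majority function (result is 1 when two or more of the inputs are 1)
def lut3(imm, a, b, c):
    idx = c << 2 | b << 1 | a
    return (imm >> idx) & 1              # idx in LSB0 order

def ternlogi(imm, rt, ra, rb, width=64):
    for i in range(width):
        bit = lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1)
        rt = (rt & ~(1 << i)) | (bit << i)
    return rt

# bit-for-bit majority vote of the three source registers
assert ternlogi(0xE8, 0b1010, 0b1100, 0b0110, width=4) == 0b1110
```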

## ternlog

a 4-operand variant which becomes more along the lines of an FPGA:

| 0.5|6.10|11.15|16.20|21.25| 26...30  |31|
| -- | -- | --- | --- | --- | -------- |--|
| NN | RT | RA  | RB  | RC  | mode 100 |1 |

    for i in range(64):
        idx = RT[i] << 2 | RA[i] << 1 | RB[i]
        RT[i] = (RC & (1<<idx)) != 0

mode (2 bits) may be used to do inversion of ordering, similar to carryless mul
(3 modes).

## ternlogv

also, another possible variant involving swizzle and vec4:

| 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
| -- | -- | --- | ----- | ---- | ----- |--|
| NN | RT | RA  | imm   | mask | 101   |1 |

    for i in range(8):
        idx = RA.x[i] << 2 | RA.y[i] << 1 | RA.z[i]
        res = (imm & (1<<idx)) != 0
        for j in range(3):
            if mask[j]: RT[i+j*8] = res

## ternlogcr

another mode selection would be CRs not Ints.

| 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31|
| -- | -- | --- | --- |- |-----|----- | -----|--|
| NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 |

    for i in range(4):
        if not mask[i]: continue
        idx = crregs[BA][i] << 2 |
              crregs[BB][i] << 1 |
              crregs[BC][i]
        crregs[BA][i] = (imm & (1<<idx)) != 0

## cmix

based on RV bitmanip, covered by ternlog bitops

```
uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
    return (RA & RB) | (RC & ~RB);
}
```
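
A quick sanity check (illustrative only) that cmix really is covered by a single ternlog: assuming the same LSB0 index convention as the lut3 pseudocode above, the truth table of `(a&b)|(c&~b)` packs to the immediate 0xB8:

```
# bitwise ternary-LUT model (same idx = c<<2 | b<<1 | a convention as above)
def ternlog_bitwise(imm, a, b, c, width=64):
    r = 0
    for i in range(width):
        idx = ((c >> i) & 1) << 2 | ((b >> i) & 1) << 1 | ((a >> i) & 1)
        r |= ((imm >> idx) & 1) << i
    return r

def cmix(ra, rb, rc, width=64):
    mask = (1 << width) - 1
    return (ra & rb) | (rc & ~rb & mask)

ra, rb, rc = 0x1234, 0xF0F0, 0xFFFF
assert cmix(ra, rb, rc, 16) == ternlog_bitwise(0xB8, ra, rb, rc, 16)
```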

# bitmask set

based on RV bitmanip single-bit set, instruction format similar to shift
[[isa/fixedshift]]. bmext is actually covered already (shift-with-mask rldicl but only the immediate version).
however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.

bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, where rldicl is a left rotate. for the immediate version this does not matter, so a bmexti is not required.
bmrev however has no direct equivalent and consequently a bmrevi is required.

bmset (register for mask amount) is particularly useful for creating
predicate masks where the length is a dynamic runtime quantity.
bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.

| 0.5|6.10|11.15|16.20|21.25| 26..30   |31| name  |
| -- | -- | --- | --- | --- | -------- |--| ----- |
| NN | RT | RA  | RB  | RC  | mode 010 |Rc| bm*   |
| NN | RT | RA  | RB  | RC  | 0 1 111  |Rc| bmrev |


```
uint_xlen_t bmset(RA, RB, sh)
{
    int shamt = RB & (XLEN - 1);
    mask = (2<<sh)-1;
    return RA | (mask << shamt);
}

uint_xlen_t bmclr(RA, RB, sh)
{
    int shamt = RB & (XLEN - 1);
    mask = (2<<sh)-1;
    return RA & ~(mask << shamt);
}

uint_xlen_t bminv(RA, RB, sh)
{
    int shamt = RB & (XLEN - 1);
    mask = (2<<sh)-1;
    return RA ^ (mask << shamt);
}

uint_xlen_t bmext(RA, RB, sh)
{
    int shamt = RB & (XLEN - 1);
    mask = (2<<sh)-1;
    return mask & (RA >> shamt);
}
```

bitmask extract with reverse. can be done by bit-reversing RA and extracting the bits from the opposite end.

```
msb = rb[5:0];
rev[0:msb] = ra[msb:0];
rt = ZE(rev[msb:0]);

uint_xlen_t bmextrev(RA, RB, sh)
{
    int shamt = (RB & (XLEN - 1));
    shamt = (XLEN-1)-shamt;    # shift other end
    bra = bitreverse(RA)       # swap LSB-MSB
    mask = (2<<sh)-1;
    return mask & (bra >> shamt);
}
```

| 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name   |
| -- | -- | --- | --- | --- | ------ |--| ------ |
| NN | RT | RA  | RB  | sh  | 0 111  |Rc| bmrevi |

# grevlut

generalised reverse, combined with a LUT2 and allowing
zero when RA=0, provides a wide range of instructions
and a means to set regular 64-bit patterns in one
32-bit instruction.

```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm, step_i, chunksize):
    for j in 0 to 63:
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunksize])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8_t lut2)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    int imm = lut2 & 0b1111;
    for i in 0 to 6
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}
```
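
An illustrative (non-normative) Python model: programming the LUT2 with "select b" (imm=0b1100) makes every stage copy in the partner bit, which collapses grevlut to plain grev, while "select a" (imm=0b1010) is the identity:

```
# model of grevlut64 above; imm indexes the LUT2 as idx = b<<1 | a
def grevlut64(ra, rb, imm):
    x, shamt = ra, rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            y = 0
            for j in range(64):
                a = (x >> j) & 1
                b = (x >> (j ^ step)) & 1
                y |= ((imm >> (b << 1 | a)) & 1) << j
            x = y
    return x

# imm=0b1100 ("output = b"): behaves exactly like grev (byte-swap at shamt=56)
assert grevlut64(0x0123456789ABCDEF, 56, 0b1100) == 0xEFCDAB8967452301
# imm=0b1010 ("output = a"): identity, whatever the shift amount
assert grevlut64(0x0123456789ABCDEF, 63, 0b1010) == 0x0123456789ABCDEF
```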

# grev

based on RV bitmanip, this is also known as a butterfly network. however
where a butterfly network allows setting of every crossbar setting in
every row and every column, generalised-reverse (grev) only allows
a per-row decision: every entry in the same row must either switch or
not-switch.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />

```
uint64_t grev64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    if (shamt & 1)  x = ((x & 0x5555555555555555LL) << 1)  |
                        ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
    if (shamt & 2)  x = ((x & 0x3333333333333333LL) << 2)  |
                        ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
    if (shamt & 4)  x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4)  |
                        ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
    if (shamt & 8)  x = ((x & 0x00FF00FF00FF00FFLL) << 8)  |
                        ((x & 0xFF00FF00FF00FF00LL) >> 8);
    if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
                        ((x & 0xFFFF0000FFFF0000LL) >> 16);
    if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
                        ((x & 0xFFFFFFFF00000000LL) >> 32);
    return x;
}
```
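
Usage sketch (an illustrative Python model of the C above): shamt=56 swaps whole bytes (endian reversal), shamt=63 reverses all 64 bits, and shamt=7 reverses the bits within each byte:

```
def grev64(ra, rb):
    x, shamt = ra & 0xFFFFFFFFFFFFFFFF, rb & 63
    for step, m in [(1, 0x5555555555555555), (2, 0x3333333333333333),
                    (4, 0x0F0F0F0F0F0F0F0F), (8, 0x00FF00FF00FF00FF),
                    (16, 0x0000FFFF0000FFFF), (32, 0x00000000FFFFFFFF)]:
        if shamt & step:
            x = ((x & m) << step) | ((x >> step) & m)
    return x

assert grev64(0x00000000000000FF, 56) == 0xFF00000000000000   # byte swap
assert grev64(1, 63) == 1 << 63                                # full bit reverse
assert grev64(0x01, 7) == 0x80                                 # per-byte bit reverse
```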

# shuffle / unshuffle

based on RV bitmanip

```
// 32-bit stage helper (mirrors shuffle64_stage below)
uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
    uint32_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
uint32_t shfl32(uint32_t RA, uint32_t RB)
{
    uint32_t x = RA;
    int shamt = RB & 15;
    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    return x;
}
uint32_t unshfl32(uint32_t RA, uint32_t RB)
{
    uint32_t x = RA;
    int shamt = RB & 15;
    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    return x;
}

uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
{
    uint64_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
uint64_t shfl64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 31;
    if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
                                           0x00000000ffff0000LL, 16);
    if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
                                           0x0000ff000000ff00LL, 8);
    if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
                                           0x00f000f000f000f0LL, 4);
    if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
                                           0x0c0c0c0c0c0c0c0cLL, 2);
    if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
                                           0x2222222222222222LL, 1);
    return x;
}
uint64_t unshfl64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 31;
    if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
                                           0x2222222222222222LL, 1);
    if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
                                           0x0c0c0c0c0c0c0c0cLL, 2);
    if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
                                           0x00f000f000f000f0LL, 4);
    if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
                                           0x0000ff000000ff00LL, 8);
    if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
                                           0x00000000ffff0000LL, 16);
    return x;
}
```
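
As a usage illustration (not part of the spec): the full 32-bit shuffle (shamt=15) is the classic "zip"/bit-interleave pattern. A minimal Python model:

```
def shuffle32_stage(src, maskL, maskR, n):
    x = src & ~(maskL | maskR)
    x |= ((src << n) & maskL) | ((src >> n) & maskR)
    return x & 0xFFFFFFFF

def shfl32(ra, rb):
    x, shamt = ra & 0xFFFFFFFF, rb & 15
    if shamt & 8: x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8)
    if shamt & 4: x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4)
    if shamt & 2: x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2)
    if shamt & 1: x = shuffle32_stage(x, 0x44444444, 0x22222222, 1)
    return x

assert shfl32(0x0000FFFF, 15) == 0x55555555   # low half -> even bit positions
assert shfl32(0xFFFF0000, 15) == 0xAAAAAAAA   # high half -> odd bit positions
```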

# xperm

based on RV bitmanip

```
uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
{
    uint_xlen_t r = 0;
    uint_xlen_t sz = 1LL << sz_log2;
    uint_xlen_t mask = (1LL << sz) - 1;
    for (int i = 0; i < XLEN; i += sz) {
        uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
        if (pos < XLEN)
            r |= ((RA >> pos) & mask) << i;
    }
    return r;
}
uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
{  return xperm(RA, RB, 2); }
uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
{  return xperm(RA, RB, 3); }
uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
{  return xperm(RA, RB, 4); }
uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
{  return xperm(RA, RB, 5); }
```
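
Usage sketch (an illustrative Python model): with sz_log2=3 (xperm_b), RB supplies one index per byte and the result is a byte-wise table lookup / permutation of RA:

```
def xperm(ra, rb, sz_log2, xlen=64):
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    r = 0
    for i in range(0, xlen, sz):
        pos = ((rb >> i) & mask) << sz_log2
        if pos < xlen:
            r |= ((ra >> pos) & mask) << i
    return r

ra = 0x1122334455667788
assert xperm(ra, 0x0706050403020100, 3) == ra                    # identity
assert xperm(ra, 0x0001020304050607, 3) == 0x8877665544332211    # byte reverse
```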

# gorc

based on RV bitmanip

```
uint32_t gorc32(uint32_t RA, uint32_t RB)
{
    uint32_t x = RA;
    int shamt = RB & 31;
    if (shamt & 1)  x |= ((x & 0x55555555) << 1)  | ((x & 0xAAAAAAAA) >> 1);
    if (shamt & 2)  x |= ((x & 0x33333333) << 2)  | ((x & 0xCCCCCCCC) >> 2);
    if (shamt & 4)  x |= ((x & 0x0F0F0F0F) << 4)  | ((x & 0xF0F0F0F0) >> 4);
    if (shamt & 8)  x |= ((x & 0x00FF00FF) << 8)  | ((x & 0xFF00FF00) >> 8);
    if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}
uint64_t gorc64(uint64_t RA, uint64_t RB)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    if (shamt & 1)  x |= ((x & 0x5555555555555555LL) << 1)  |
                         ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
    if (shamt & 2)  x |= ((x & 0x3333333333333333LL) << 2)  |
                         ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
    if (shamt & 4)  x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4)  |
                         ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
    if (shamt & 8)  x |= ((x & 0x00FF00FF00FF00FFLL) << 8)  |
                         ((x & 0xFF00FF00FF00FF00LL) >> 8);
    if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
    if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
                         ((x & 0xFFFFFFFF00000000LL) >> 32);
    return x;
}
```
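
Usage sketch (an illustrative Python model): with shamt=7 every bit is ORed across its whole byte, so each nonzero byte becomes 0xFF, the classic zero-byte-detect idiom:

```
def gorc64(ra, rb):
    x, shamt = ra & 0xFFFFFFFFFFFFFFFF, rb & 63
    for step, m in [(1, 0x5555555555555555), (2, 0x3333333333333333),
                    (4, 0x0F0F0F0F0F0F0F0F), (8, 0x00FF00FF00FF00FF),
                    (16, 0x0000FFFF0000FFFF), (32, 0x00000000FFFFFFFF)]:
        if shamt & step:
            x |= ((x & m) << step) | ((x >> step) & m)
    return x

# bytes that contain any set bit become 0xFF, zero bytes stay 0x00
assert gorc64(0x0000120000340000, 7) == 0x0000FF0000FF0000
```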

# Galois Field 2^M

see:

* <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
* <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
* <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>

## SPRs to set modulo and degree

to save registers and make operations orthogonal with standard
arithmetic, the modulo is to be set in an SPR

## Twin Butterfly (Cooley-Tukey) Mul-add-sub

used in combination with SV FFT REMAP to perform
a full NTT in-place

    gffmadd  RT,RA,RC,RB (Rc=0)
    gffmadd. RT,RA,RC,RB (Rc=1)

Pseudo-code:

    RT <- GFMULADD(RA, RC, RB)
    RS <- GFMULADD(RA, RC, RB)


## Multiply

this requires 3 parameters and a "degree"

    RT = GFMUL(RA, RB, gfdegree, modulo=RC)

realistically with the degree also needing to be an immediate it should be brought down to an overwrite version:

    RS = GFMUL(RS, RA, gfdegree, modulo=RC)
    RS = GFMUL(RS, RA, gfdegree=RB, modulo=RC)

| 0.5|6.10|11.15|16.20|21.25| 26..30 |31|
| -- | -- | --- | --- | --- | ------ |--|
| NN | RS | RA  | deg | RC  | 00 011 |Rc|
| NN | RS | RA  | RB  | RC  | 11 011 |Rc|

where the SimpleV variant may override RS-as-src differently from RS-as-dest

```
from functools import reduce

def gf_degree(a) :
    res = 0
    a >>= 1
    while (a != 0) :
        a >>= 1;
        res += 1;
    return res

# constants used in the multGF2 function
mask1 = mask2 = polyred = None

def setGF2(irPoly):
    """Define parameters of binary finite field GF(2^m)/g(x)
       - irPoly: coefficients of irreducible polynomial g(x)
    """
    # degree: extension degree of binary field
    degree = gf_degree(irPoly)

    def i2P(sInt):
        """Convert an integer into a polynomial"""
        return [(sInt >> i) & 1
                for i in reversed(range(sInt.bit_length()))]

    global mask1, mask2, polyred
    mask1 = mask2 = 1 << degree
    mask2 -= 1
    polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])

def multGF2(p1, p2):
    """Multiply two polynomials in GF(2^m)/g(x)"""
    p = 0
    while p2:
        if p2 & 1:
            p ^= p1
        p1 <<= 1
        if p1 & mask1:
            p1 ^= polyred
        p2 >>= 1
    return p & mask2

if __name__ == "__main__":

    # Define binary field GF(2^3)/x^3 + x + 1
    setGF2(0b1011)  # degree 3

    # Evaluate the product (x^2 + x + 1)(x^2 + 1)
    print("{:02x}".format(multGF2(0b111, 0b101)))

    # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
    # (used in the Advanced Encryption Standard-AES)
    setGF2(0b100011011)  # degree 8

    # Evaluate the product (x^7)(x^7 + x + 1)
    print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
```

## GF div and mod

```
def gf_degree(a) :
    res = 0
    a >>= 1
    while (a != 0) :
        a >>= 1;
        res += 1;
    return res

def FullDivision(f, v):
    """
    Takes two arguments, f, v
    fDegree and vDegree are the degrees of the field elements
    f and v represented as polynomials.
    This method returns the field elements a and b such that

        f(x) = a(x) * v(x) + b(x).

    That is, a is the divisor and b is the remainder, or in
    other words a is like floor(f/v) and b is like f modulo v.
    """

    fDegree, vDegree = gf_degree(f), gf_degree(v)
    res, rem = 0, f
    i = fDegree
    mask = 1 << i
    while (i >= vDegree):
        if (mask & rem):
            res ^= (1 << (i - vDegree))
            rem ^= (v << (i - vDegree))
        i -= 1
        mask >>= 1
    return (res, rem)
```

| 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name   |
| -- | -- | --- | --- | --- | ------- |--| ------ |
| NN | RS | RA  | deg | RC  | 0 1 011 |Rc| gfaddi |
| NN | RS | RA  | RB  | RC  | 1 1 111 |Rc| gfadd  |

GFMOD is a pseudo-op where RA=0

## gf invert

note: this function is incorrect / incomplete as-is; see
<https://stackoverflow.com/questions/45442396/>

```
def gf_degree(a) :
    res = 0
    a >>= 1
    while (a != 0) :
        a >>= 1;
        res += 1;
    return res

def gf_invert(a, mod=0x1B) :

    mod_degree = gf_degree(mod)
    v = mod
    g1 = 1
    g2 = 0
    j = gf_degree(a) - mod_degree

    while (a != 1) :
        if (j < 0) :
            a, v = v, a
            g1, g2 = g2, g1
            j = -j

        a ^= v << j
        g1 ^= g2 << j

        a %= (1 << mod_degree)   # Emulating 8-bit overflow
        g1 %= (1 << mod_degree)  # Emulating 8-bit overflow

        j = gf_degree(a) - gf_degree(v)

    return g1
```

## carryless mul

based on RV bitmanip
see <https://en.wikipedia.org/wiki/CLMUL_instruction_set>

these are GF2 operations with the modulo set to 2^degree.
they are worth adding as their own non-overwrite operations
(in the same pipeline).

```
uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((RB >> i) & 1)
            x ^= RA << i;
    return x;
}
uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
{
    uint_xlen_t x = 0;
    for (int i = 1; i < XLEN; i++)
        if ((RB >> i) & 1)
            x ^= RA >> (XLEN-i);
    return x;
}
uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((RB >> i) & 1)
            x ^= RA >> (XLEN-i-1);
    return x;
}
```
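
An illustrative link back to the GF arithmetic above (not part of the spec): carryless multiplication is polynomial multiplication over GF(2) with no reduction step:

```
# carryless (polynomial) multiply over GF(2), mirroring the C clmul above
def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (x^3 + x + 1) * (x^2 + 1) = x^5 + x^2 + x + 1
assert clmul(0b1011, 0b101) == 0b100111
```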

# bitmatrix

```
uint64_t bmatflip(uint64_t RA)
{
    uint64_t x = RA;
    x = shfl64(x, 31);
    x = shfl64(x, 31);
    x = shfl64(x, 31);
    return x;
}
uint64_t bmatxor(uint64_t RA, uint64_t RB)
{
    // transpose of RB
    uint64_t RBt = bmatflip(RB);
    uint8_t u[8]; // rows of RA
    uint8_t v[8]; // cols of RB
    for (int i = 0; i < 8; i++) {
        u[i] = RA >> (i*8);
        v[i] = RBt >> (i*8);
    }
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        if (pcnt(u[i / 8] & v[i % 8]) & 1)
            x |= 1LL << i;
    }
    return x;
}
uint64_t bmator(uint64_t RA, uint64_t RB)
{
    // transpose of RB
    uint64_t RBt = bmatflip(RB);
    uint8_t u[8]; // rows of RA
    uint8_t v[8]; // cols of RB
    for (int i = 0; i < 8; i++) {
        u[i] = RA >> (i*8);
        v[i] = RBt >> (i*8);
    }
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        if ((u[i / 8] & v[i % 8]) != 0)
            x |= 1LL << i;
    }
    return x;
}
```
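
Usage sketch (an illustrative Python model, not normative): bmatxor is an 8x8 matrix multiply over GF(2), so multiplying by the identity matrix (a 1 on the diagonal of each byte/row) returns RA unchanged:

```
def bmatxor(ra, rb):
    rows = [(ra >> (i * 8)) & 0xFF for i in range(8)]     # rows of RA
    cols = [0] * 8                                        # columns of RB
    for i in range(8):
        for j in range(8):
            cols[j] |= ((rb >> (i * 8 + j)) & 1) << i
    x = 0
    for i in range(64):
        if bin(rows[i // 8] & cols[i % 8]).count("1") & 1:  # GF(2) dot product
            x |= 1 << i
    return x

IDENTITY = sum(1 << (i * 8 + i) for i in range(8))        # 0x8040201008040201
assert bmatxor(0x123456789ABCDEF0, IDENTITY) == 0x123456789ABCDEF0
```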

# Already in POWER ISA

## count leading/trailing zeros with mask

in v3.1 p105

```
count ← 0
do i = 0 to 63
    if ((RB)i = 1) then do
        if ((RS)i = 1) then break
        count ← count + 1
    end
RA ← EXTZ64(count)
```

## bit deposit

vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106

    do while(m < 64)
        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
            result = VSR[VRA+32].dword[i].bit[63-k]
            VSR[VRT+32].dword[i].bit[63-m] = result
            k = k + 1
        m = m + 1

```
uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
{
    uint_xlen_t r = 0;
    for (int i = 0, j = 0; i < XLEN; i++)
        if ((RB >> i) & 1) {
            if ((RA >> j) & 1)
                r |= uint_xlen_t(1) << i;
            j++;
        }
    return r;
}
```

## bit extract

other way round: identical to RV bext, found in v3.1 p196

```
uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
{
    uint_xlen_t r = 0;
    for (int i = 0, j = 0; i < XLEN; i++)
        if ((RB >> i) & 1) {
            if ((RA >> i) & 1)
                r |= uint_xlen_t(1) << j;
            j++;
        }
    return r;
}
```

## centrifuge

found in v3.1 p106 so not to be added here

```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if ((RB)i = 0) then do
        result(ptr0) = (RS)i
        ptr0 = ptr0 + 1
    end
    if ((RB)63-i = 1) then do
        result(63-ptr1) = (RS)63-i
        ptr1 = ptr1 + 1
    end
RA = result
```