openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 # Implementation Log
   4
   5 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   6 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   7 * remove Rc=1 from ternlog due to conflicts in encoding as well
   8   as saving space <https://bugs.libre-soc.org/show_bug.cgi?id=753#c5>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
  11 # bitmanipulation
  12
  13 **DRAFT STATUS**
  14
This extension amalgamates bit-manipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].

When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years).  VSX also comprises hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low-power or embedded scenarios.

ternlogv is experimental and is the only operation that may be considered "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun".  As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.

General-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  However, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.

Note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]] and [[sv/av_opcodes]], as well as [[sv/setvl]].

Useful resources:
  28
  29 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  30 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  31
  32 # summary
  33
  34 minor opcode allocation
  35
  36     |  28.30 |31| name      |
  37     | ------ |--| --------- |
  38     |   00   |0 | ternlogi  |
  39     |  000   |1 | ternlog   |
  40     |  100   |1 | grevlog   |
  41     |  010   |Rc| bitmask   |
  42     |  011   |Rc| gf*       |
  43     |  101   |1 | ternlogv  |
  44     |  101   |0 | ternlogcr |
  45     |  110   |Rc| 1/2-op    |
  46     |  111   |Rc| 3-op      |
  47
  48 1-op and variants
  49
  50 | dest | src1 | subop | op       |
  51 | ---- | ---- | ----- | -------- |
  52 | RT   | RA   | ..    | bmatflip |
  53
  54 2-op and variants
  55
  56 | dest | src1 | src2 | subop | op       |
  57 | ---- | ---- | ---- | ----- | -------- |
  58 | RT   | RA   | RB   | or    | bmatflip |
  59 | RT   | RA   | RB   | xor   | bmatflip |
  60 | RT   | RA   | RB   |       | grev  |
  61 | RT   | RA   | RB   |       | clmul*  |
  62 | RT   | RA   | RB   |       | gorc |
  63 | RT   | RA   | RB   | shuf  | shuffle |
  64 | RT   | RA   | RB   | unshuf| shuffle |
  65 | RT   | RA   | RB   | width | xperm  |
  66 | RT   | RA   | RB   | type | minmax |
  67 | RT   | RA   | RB   |      | av abs avgadd  |
  68 | RT   | RA   | RB   | type | vmask ops |
  69 | RT   | RA   | RB   |  |  |
  70
  71 3 ops
  72
  73 * bitmask set/extract
  74 * ternlog bitops
  75 * GF
  76
  77 TODO: convert all instructions to use RT and not RS
  78
  79 | 0.5|6.10|11.15|16.20 |21..25   | 26....30 |31| name |
  80 | -- | -- | --- | ---  | -----   | -------- |--| ------ |
  81 | NN | RT | RA  | RB   | RC      | mode 000 |1 | ternlog |
  82 | NN | RT | RA  | RB   | im0-4   | im5-7 00 |0 | ternlogi |
  83 | NN | RT | RA  | RB   | / im0-3 | 00   100 |1 | grevlog |
  84 | NN | RT | RA  | s0-5 | s6 im0-3| 01   100 |1 | grevlogi |
  85 | NN | RT | RA  |      |         | 1-   100 |1 | rsvd |
  86 | NN | RS | RA  | RB   | RC      | 00  011  |Rc| gfmul |
  87 | NN | RS | RA  | RB   | RC      | 01  011  |Rc| gfmaddsub |
  88 | NN | RT | RA  | RB   |         | 10  011  |Rc| rsvd  |
  89 | NN | RS | RA  | RB   |         | 11  011  |Rc| rsvd  |
  90 | NN | RS | RA  | RB   |         | 11  111  |Rc| rsvd   |
  91
  92 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31| name |
  93 | -- | -- | --- | ----- | ---- | ----- |--| ------ |
  94 | NN | RT | RA  | imm   | mask | 101   |1 | ternlogv |
  95
  96 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31| name |
  97 | -- | -- | --- | --- |- |-----|----- | -----|--| -------|
  98 | NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 | ternlogcr |
  99
 100 ops (note that av avg and abs as well as vec scalar mask
 101 are included here)
 102
 103 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 104 double check that instructions didn't need 3 inputs.
 105
 106 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 107 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 108 | NN | RA | RB  |     | 0  |       | 0000 110 |Rc| rsvd   |
 109 | NN | RA | RB  | RC  | 1  | itype | 0000 110 |Rc| xperm |
 110 | NN | RA | RB  | RC  | 0  | itype | 0100 110 |Rc| minmax |
 111 | NN | RA | RB  | RC  | 1  |   00  | 0100 110 |Rc| av avgadd |
 112 | NN | RA | RB  | RC  | 1  |   01  | 0100 110 |Rc| av abs |
 113 | NN | RA | RB  |     | 1  |   10  | 0100 110 |Rc| rsvd |
 114 | NN | RA | RB  |     | 1  |   11  | 0100 110 |Rc| rsvd |
 115 | NN | RA | RB  | sh  | SH | itype | 1000 110 |Rc| bmopsi |
 116 | NN | RA | RB  |     |    |       | 1100 110 |Rc| rsvd |
 117 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| gfdiv |
 118 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| gfmod |
 119 | NN | RT | RA  |     | 1  |  10   | 0001 110 |Rc| rsvd |
 120 | NN | RA | RB  |     | 1  |  11   | 0001 110 |Rc| rsvd |
 121 | NN | RA | RB  | RC  | 0  |   00  | 0001 110 |Rc| vec sbfm |
 122 | NN | RA | RB  | RC  | 0  |   01  | 0001 110 |Rc| vec sofm |
 123 | NN | RA | RB  | RC  | 0  |   10  | 0001 110 |Rc| vec sifm |
 124 | NN | RA | RB  | RC  | 0  |   11  | 0001 110 |Rc| vec cprop |
 125 | NN | RA | RB  |     | 0  |       | 0101 110 |Rc| rsvd |
 126 | NN | RA | RB  | RC  | 0  | 00    | 0010 110 |Rc| gorc |
 127 | NN | RA | RB  | sh  | SH | 00    | 1010 110 |Rc| gorci |
 128 | NN | RA | RB  | RC  | 0  | 00    | 0110 110 |Rc| gorcw |
 129 | NN | RA | RB  | sh  | 0  | 00    | 1110 110 |Rc| gorcwi |
 130 | NN | RA | RB  | RC  | 1  | 00    | 1110 110 |Rc| bmator  |
 131 | NN | RA | RB  | RC  | 0  | 01    | 0010 110 |Rc| grev |
 132 | NN | RA | RB  | RC  | 1  | 01    | 0010 110 |Rc| clmul |
 133 | NN | RA | RB  | sh  | SH | 01    | 1010 110 |Rc| grevi |
 134 | NN | RA | RB  | RC  | 0  | 01    | 0110 110 |Rc| grevw |
 135 | NN | RA | RB  | sh  | 0  | 01    | 1110 110 |Rc| grevwi |
 136 | NN | RA | RB  | RC  | 1  | 01    | 1110 110 |Rc| bmatxor   |
 137 | NN | RA | RB  | RC  | 0  | 10    | 0010 110 |Rc| shfl |
 138 | NN | RA | RB  | sh  | SH | 10    | 1010 110 |Rc| shfli |
 139 | NN | RA | RB  | RC  | 0  | 10    | 0110 110 |Rc| shflw |
 140 | NN | RA | RB  | RC  |    | 10    | 1110 110 |Rc| rsvd    |
 141 | NN | RA | RB  | RC  | 0  | 11    | 1110 110 |Rc| clmulr  |
 142 | NN | RA | RB  | RC  | 1  | 11    | 1110 110 |Rc| clmulh  |
 143 | NN |    |     |     |    |       | --11 110 |Rc| setvl  |
 144
 145 # bit to byte permute
 146
 147 similar to matrix permute in RV bitmanip, which has XOR and OR variants
 148
 149     do j = 0 to 7
 150       do k = 0 to 7
 151          b = VSR[VRB+32].dword[i].byte[k].bit[j]
 152          VSR[VRT+32].dword[i].byte[j].bit[k] = b
 153
 154 # int min/max
 155
Signed and unsigned min/max for integers.  This is partly synthesisable in [[sv/svp64]] with pred-result, as long as the dest reg is one of the sources, but not for both signed and unsigned.  When the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).

Providing signed/unsigned min/max as instructions gives more flexibility.
 159
 160 ```
 161 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 162 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 163 }
 164 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 165 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 166 }
 167 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 168 { return rs1 < rs2 ? rs1 : rs2;
 169 }
 170 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 171 { return rs1 > rs2 ? rs1 : rs2;
 172 }
 173 ```
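
The four variants differ only in whether the operands are compared as two's-complement or as unsigned values. A minimal Python sketch of that distinction, assuming XLEN=64; the helper `to_signed` and the names `smin`/`smax` are illustrative only and are not part of the proposal:

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def to_signed(x):
    """Reinterpret an XLEN-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x

def smin(a, b):
    # signed compare, but the original bit pattern is returned
    return a if to_signed(a) < to_signed(b) else b

def smax(a, b):
    return a if to_signed(a) > to_signed(b) else b

def minu(a, b):
    return a if (a & MASK) < (b & MASK) else b

def maxu(a, b):
    return a if (a & MASK) > (b & MASK) else b

minus_one = MASK               # all-ones is -1 when read as signed
print(hex(smin(minus_one, 1)), hex(minu(minus_one, 1)))
```

The same pair of inputs selects a different operand depending on signedness, which is why both forms are listed.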
 174
 175
 176 # ternlog bitops
 177
Similar to FPGA LUTs: for every bit, perform a lookup into a table using an 8-bit immediate, or in another register.
 179
 180 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 181
 182 ## ternlogi
 183
 184 | 0.5|6.10|11.15|16.20| 21..25| 26..30   |31|
 185 | -- | -- | --- | --- | ----- | -------- |--|
 186 | NN | RT | RA  | RB  | im0-4 | im5-7 00 |0 |
 187
 188     lut3(imm, a, b, c):
 189         idx = c << 2 | b << 1 | a
 190         return imm[idx] # idx by LSB0 order
 191
 192     for i in range(64):
 193         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
 194
Bits 21..22 may be used to specify a mode, such as treating the whole integer as zero/nonzero and putting 1/0 in the result, rather than performing a bitwise test.
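
A small Python sketch of the bitwise (default) behaviour of ternlogi, following the `lut3` pseudocode above. The function signature and the `xlen` parameter are assumptions made for illustration; because the operation is purely bitwise, the result does not depend on whether register bits are numbered MSB0 or LSB0:

```python
def lut3(imm, a, b, c):
    """Look up one bit of the 8-bit immediate, index taken in LSB0 order."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(RT, RA, RB, imm, xlen=64):
    """Model of RT[i] = lut3(imm, RB[i], RA[i], RT[i]) for every bit i."""
    result = 0
    for i in range(xlen):
        a = (RB >> i) & 1
        b = (RA >> i) & 1
        c = (RT >> i) & 1
        result |= lut3(imm, a, b, c) << i
    return result

RT, RA, RB = 0xF0F0, 0xCCCC, 0xAAAA
print(hex(ternlogi(RT, RA, RB, 0x80)))   # only index 7 set: three-way AND
print(hex(ternlogi(RT, RA, RB, 0x96)))   # odd-parity indices set: three-way XOR
print(hex(ternlogi(RT, RA, RB, 0xFE)))   # every index but 0 set: three-way OR
```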
 196
 197 ## ternlog
 198
 199 a 4 operand variant which becomes more along the lines of an FPGA:
 200
| 0.5|6.10|11.15|16.20|21.25| 26...30  |31|
| -- | -- | --- | --- | --- | -------- |--|
| NN | RT | RA  | RB  | RC  | mode 000 |1 |
 204
 205     for i in range(64):
 206         idx = RT[i] << 2 | RA[i] << 1 | RB[i]
 207         RT[i] = (RC & (1<<idx)) != 0
 208
The 2-bit mode field may be used to invert the ordering, similar to carryless mul (3 modes).
 211
 212 ## ternlogv
 213
 214 also, another possible variant involving swizzle and vec4:
 215
 216 | 0.5|6.10|11.15| 16.23 |24.27 | 28.30 |31|
 217 | -- | -- | --- | ----- | ---- | ----- |--|
 218 | NN | RT | RA  | imm   | mask | 101   |1 |
 219
 220     for i in range(8):
 221         idx = RA.x[i] << 2 | RA.y[i] << 1 | RA.z[i]
 222         res = (imm & (1<<idx)) != 0
 223         for j in range(3):
 224              if mask[j]: RT[i+j*8] = res
 225
 226 ## ternlogcr
 227
Another mode selection would be to operate on CRs rather than Ints.
 229
 230 | 0.5|6.8 | 9.11|12.14|15|16.23|24.27 | 28.30|31|
 231 | -- | -- | --- | --- |- |-----|----- | -----|--|
 232 | NN | BA | BB  | BC  |0 |imm  | mask | 101  |0 |
 233
    for i in range(4):
        if not mask[i]: continue
        idx = (crregs[BA][i] << 2 |
               crregs[BB][i] << 1 |
               crregs[BC][i])
        crregs[BA][i] = (imm & (1<<idx)) != 0
 240
 241 ## cmix
 242
 243 based on RV bitmanip, covered by ternlog bitops
 244
 245 ```
 246 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 247     return (RA & RB) | (RC & ~RB);
 248 }
 249 ```
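
To illustrate how cmix is covered by the ternlog operations, the following Python sketch derives the 8-bit immediate by enumerating the truth table. The helper names (`imm_for`, `lut3_word`) and the mapping of cmix operands onto the LUT inputs (a=RA, b=RB as selector, c=RC) are assumptions made for this example; the actual register assignment depends on the ternlogi encoding:

```python
def imm_for(f):
    """Build the 8-bit LUT for a 3-input boolean function f(a, b, c),
    using the index convention idx = c<<2 | b<<1 | a."""
    imm = 0
    for idx in range(8):
        a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        imm |= (f(a, b, c) & 1) << idx
    return imm

def lut3_word(imm, A, B, C, xlen=64):
    """Apply the LUT independently to every bit position."""
    out = 0
    for i in range(xlen):
        idx = (((C >> i) & 1) << 2) | (((B >> i) & 1) << 1) | ((A >> i) & 1)
        out |= ((imm >> idx) & 1) << i
    return out

def cmix(RA, RB, RC, xlen=64):
    """Direct port of the C reference: RB selects between RA and RC."""
    return ((RA & RB) | (RC & ~RB)) & ((1 << xlen) - 1)

CMIX_IMM = imm_for(lambda a, b, c: (a & b) | (c & ~b))
print(hex(CMIX_IMM))
```

Any other three-input bitwise function can be turned into an immediate the same way.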
 250
 251
 252 # bitmask set
 253
Based on RV bitmanip single-bit set, with an instruction format similar to shift
[[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl, but only the immediate version).
However bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.

The bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, whereas rldicl is a left rotate.  For the immediate version this does not matter, so a bmexti is not required.
For bmrev, however, there is no direct equivalent and consequently a bmrevi is required.

bmset (register for mask amount) is particularly useful for creating
predicate masks where the length is a dynamic runtime quantity.
bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 264
 265 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 266 | -- | -- | --- | --- | --- | ------- |--| ----- |
 267 | NN | RT | RA  | RB  | RC  | mode 010 |Rc| bm*   |
 268 | NN | RT | RA  | RB  | RC  | 0 1  111 |Rc| bmrev |
 269
 270
```
uint_xlen_t bmset(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA | (mask << shamt);
}

uint_xlen_t bmclr(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA & ~(mask << shamt);
}

uint_xlen_t bminv(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return RA ^ (mask << shamt);
}

uint_xlen_t bmext(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t sh)
{
    int shamt = RB & (XLEN - 1);
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return mask & (RA >> shamt);
}
```
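
A Python sketch of the four operations above, useful for trying out mask values. It assumes XLEN=64 and truncates results to XLEN bits, since Python integers are unbounded. Note that, as the reference code is written, `(2<<sh)-1` yields a run of `sh+1` ones:

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def _field(RB, sh):
    """Return (shift amount, run-of-ones mask), as in the C reference."""
    shamt = RB & (XLEN - 1)
    mask = ((2 << sh) - 1) & MASK
    return shamt, mask

def bmset(RA, RB, sh):
    shamt, mask = _field(RB, sh)
    return (RA | (mask << shamt)) & MASK

def bmclr(RA, RB, sh):
    shamt, mask = _field(RB, sh)
    return RA & ~(mask << shamt) & MASK

def bminv(RA, RB, sh):
    shamt, mask = _field(RB, sh)
    return (RA ^ (mask << shamt)) & MASK

def bmext(RA, RB, sh):
    shamt, mask = _field(RB, sh)
    return mask & (RA >> shamt)

print(bin(bmset(0, 0, 3)))     # predicate-style run of ones from bit 0
print(hex(bmext(0xABCD, 4, 7)))
```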
 300
Bitmask extract with reverse.  This can be done by bit-reversing all of RA and getting bits of RA from the opposite end.

```
msb = rb[5:0];
rev[0:msb] = ra[msb:0];
rt = ZE(rev[msb:0]);

uint_xlen_t bmextrev(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t sh)
{
    int shamt = RB & (XLEN - 1);
    shamt = (XLEN - 1) - shamt;          // shift from the other end
    uint_xlen_t bra = bitreverse(RA);    // swap LSB-MSB
    uint_xlen_t mask = ((uint_xlen_t)2 << sh) - 1;
    return mask & (bra >> shamt);
}
```
 317
 318 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 319 | -- | -- | --- | --- | --- | ------- |--| ------ |
 320 | NN | RT | RA  | RB  | sh  | 0   111 |Rc| bmrevi |
 321
 322
 323 # grevlut
 324
Generalised reverse combined with a LUT2, and allowing
zero when RA=0, provides a wide range of instructions
and a means to set regular 64-bit patterns in one
32-bit instruction.

```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm, step_i, chunk_size):
    for j in 0 to 63:
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 lut2)
{
    uint64_t x = RA;
    int shamt = RB & 63;
    int imm = lut2 & 0b1111;
    for i in 0 to 5   # shamt has 6 bits: steps 1, 2, 4, 8, 16, 32
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}
```
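
A runnable Python rendering of the pseudocode above, as a sketch only: bits are indexed LSB0 inside a Python integer, and the "zero when RA=0" substitution mentioned in the text is not modelled. With these assumptions, two immediates reproduce familiar operations: `0b1100` (select the partner bit) behaves like grev, and `0b1110` (OR of the bit and its partner) behaves like gorc:

```python
MASK64 = (1 << 64) - 1

def lut2(imm, a, b):
    idx = (b << 1) | a
    return (imm >> idx) & 1        # index by LSB0 order

def dorow(imm, step_i, chunk_size):
    """One butterfly row: combine each bit with its partner at j ^ chunk_size."""
    step_o = 0
    for j in range(64):
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o

def grevlut64(RA, RB, imm):
    x = RA & MASK64
    shamt = RB & 63
    imm &= 0b1111
    for i in range(6):             # steps 1, 2, 4, 8, 16, 32
        step = 1 << i
        if shamt & step:
            x = dorow(imm, x, step)
    return x

print(hex(grevlut64(0x1, 63, 0b1100)))   # grev-like: full bit reversal
print(hex(grevlut64(0x1, 7, 0b1110)))    # gorc-like: OR-combine within a byte
```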
 352
 353 # grev
 354
Based on RV bitmanip, this is also known as a butterfly network.  However,
where a butterfly network allows setting of every crossbar setting in
every row and every column, generalised-reverse (grev) only allows
a per-row decision: every entry in the same row must either switch or
not-switch.
 360
 361 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 362
 363 ```
 364 uint64_t grev64(uint64_t RA, uint64_t RB)
 365 {
 366     uint64_t x = RA;
 367     int shamt = RB & 63;
 368     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 369                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 370     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 371                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 372     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 373                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 374     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 375                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 376     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 377                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 378     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 379                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 380     return x;
 381 }
 382
 383 ```
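
A Python sketch of the same six-stage network, showing how particular shift amounts give well-known operations. The table-driven loop is a restructuring for brevity and is intended to behave identically to the C reference above:

```python
MASK64 = (1 << 64) - 1

# (swap distance, mask selecting the lower element of each pair)
STAGES = [
    (1,  0x5555555555555555),
    (2,  0x3333333333333333),
    (4,  0x0F0F0F0F0F0F0F0F),
    (8,  0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def grev64(RA, RB):
    x = RA & MASK64
    shamt = RB & 63
    for step, lo in STAGES:
        if shamt & step:           # the whole row either swaps or does not
            hi = ~lo & MASK64
            x = ((x & lo) << step) | ((x & hi) >> step)
    return x

x = 0x0102030405060708
print(hex(grev64(x, 56)))   # stages 8+16+32: byte swap
print(hex(grev64(x, 63)))   # all stages: full bit reversal
print(hex(grev64(x, 7)))    # stages 1+2+4: reverse bits within each byte
```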
 384
 385 # shuffle / unshuffle
 386
 387 based on RV bitmanip
 388
 389 ```
uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
    uint32_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}
 390 uint32_t shfl32(uint32_t RA, uint32_t RB)
 391 {
 392     uint32_t x = RA;
 393     int shamt = RB & 15;
 394     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 395     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 396     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 397     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 398     return x;
 399 }
 400 uint32_t unshfl32(uint32_t RA, uint32_t RB)
 401 {
 402     uint32_t x = RA;
 403     int shamt = RB & 15;
 404     if (shamt & 1) x  = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
 405     if (shamt & 2) x  = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
 406     if (shamt & 4) x  = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
 407     if (shamt & 8) x  = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
 408     return x;
 409 }
 410
 411 uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
 412 {
 413     uint64_t x = src & ~(maskL | maskR);
 414     x |= ((src << N) & maskL) | ((src >> N) & maskR);
 415     return x;
 416 }
 417 uint64_t shfl64(uint64_t RA, uint64_t RB)
 418 {
 419     uint64_t x = RA;
 420     int shamt = RB & 31;
 421     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 422                                            0x00000000ffff0000LL, 16);
 423     if (shamt & 8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 424                                            0x0000ff000000ff00LL, 8);
 425     if (shamt & 4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 426                                            0x00f000f000f000f0LL, 4);
 427     if (shamt & 2) x = shuffle64_stage(x, 0x3030303030303030LL,
 428                                            0x0c0c0c0c0c0c0c0cLL, 2);
 429     if (shamt & 1) x = shuffle64_stage(x, 0x4444444444444444LL,
 430                                            0x2222222222222222LL, 1);
 431     return x;
 432 }
 433 uint64_t unshfl64(uint64_t RA, uint64_t RB)
 434 {
 435     uint64_t x = RA;
 436     int shamt = RB & 31;
 437     if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL,
 438                                            0x2222222222222222LL, 1);
 439     if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL,
 440                                            0x0c0c0c0c0c0c0c0cLL, 2);
 441     if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL,
 442                                            0x00f000f000f000f0LL, 4);
 443     if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL,
 444                                            0x0000ff000000ff00LL, 8);
 445     if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL,
 446                                            0x00000000ffff0000LL, 16);
 447     return x;
 448 }
 449 ```
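
A Python sketch of the 64-bit shuffle and unshuffle, written as a table-driven port of the C reference. It illustrates two properties: with all five stages enabled, shfl64 interleaves the lower and upper 32-bit halves (lower-half bits land on even positions), and unshfl64 with the same control value undoes shfl64:

```python
MASK64 = (1 << 64) - 1

# (distance N, maskL, maskR) in the order shfl64 applies them
STAGES = [
    (16, 0x0000ffff00000000, 0x00000000ffff0000),
    (8,  0x00ff000000ff0000, 0x0000ff000000ff00),
    (4,  0x0f000f000f000f00, 0x00f000f000f000f0),
    (2,  0x3030303030303030, 0x0c0c0c0c0c0c0c0c),
    (1,  0x4444444444444444, 0x2222222222222222),
]

def shuffle64_stage(src, maskL, maskR, n):
    x = src & ~(maskL | maskR) & MASK64
    x |= ((src << n) & maskL) | ((src >> n) & maskR)
    return x

def shfl64(RA, RB):
    x, shamt = RA & MASK64, RB & 31
    for n, maskL, maskR in STAGES:
        if shamt & n:
            x = shuffle64_stage(x, maskL, maskR, n)
    return x

def unshfl64(RA, RB):
    x, shamt = RA & MASK64, RB & 31
    for n, maskL, maskR in reversed(STAGES):   # same stages, opposite order
        if shamt & n:
            x = shuffle64_stage(x, maskL, maskR, n)
    return x

print(hex(shfl64(0x00000000FFFFFFFF, 31)))
```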
 450
 451 # xperm
 452
 453 based on RV bitmanip
 454
 455 ```
 456 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 457 {
 458     uint_xlen_t r = 0;
 459     uint_xlen_t sz = 1LL << sz_log2;
 460     uint_xlen_t mask = (1LL << sz) - 1;
 461     for (int i = 0; i < XLEN; i += sz) {
 462         uint_xlen_t pos = ((RB >> i) & mask) << sz_log2;
 463         if (pos < XLEN)
 464             r |= ((RA >> pos) & mask) << i;
 465     }
 466     return r;
 467 }
 468 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 469 {  return xperm(RA, RB, 2); }
 470 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 471 {  return xperm(RA, RB, 3); }
 472 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 473 {  return xperm(RA, RB, 4); }
 474 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 475 {  return xperm(RA, RB, 5); }
 476 ```
 477
 478 # gorc
 479
 480 based on RV bitmanip
 481
 482 ```
 483 uint32_t gorc32(uint32_t RA, uint32_t RB)
 484 {
 485     uint32_t x = RA;
 486     int shamt = RB & 31;
 487     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 488     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 489     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 490     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 491     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 492     return x;
 493 }
 494 uint64_t gorc64(uint64_t RA, uint64_t RB)
 495 {
 496     uint64_t x = RA;
 497     int shamt = RB & 63;
 498     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 499                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 500     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 501                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 502     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 503                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 504     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 505                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 506     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 507                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 508     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 509                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 510     return x;
 511 }
 512
 513 ```
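
A Python sketch of gorc64 in table-driven form. Unlike grev, each enabled stage ORs the swapped value into the running result, so set bits spread across the selected groups. With stages 1+2+4 enabled, every nonzero byte becomes 0xFF:

```python
MASK64 = (1 << 64) - 1

STAGES = [
    (1,  0x5555555555555555),
    (2,  0x3333333333333333),
    (4,  0x0F0F0F0F0F0F0F0F),
    (8,  0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def gorc64(RA, RB):
    x = RA & MASK64
    shamt = RB & 63
    for step, lo in STAGES:
        if shamt & step:
            hi = ~lo & MASK64
            x |= ((x & lo) << step) | ((x & hi) >> step)
    return x

print(hex(gorc64(0x0010000000000100, 7)))   # OR-combine within each byte
print(hex(gorc64(0x1, 63)))                 # any set bit fills the register
```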
 514
 515 # Galois Field 2^M
 516
See:

* <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
* <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 521
 522
 523 ## SPRs to set modulo and degree
 524
To save registers and make operations orthogonal with standard
arithmetic, the modulo is to be set in an SPR.
 527
## Twin Butterfly (Cooley-Tukey) Mul-add-sub

Used in combination with SV FFT REMAP to perform
a full NTT in-place.
 532
 533     gffmadd  RT,RA,RC,RB (Rc=0)
 534     gffmadd. RT,RA,RC,RB (Rc=1)
 535
 536 Pseudo-code:
 537
 538     RT <- GFMULADD(RA, RC, RB)
 539     RS <- GFMULADD(RA, RC, RB)
 540
 541
 542 ## Multiply
 543
This requires 3 parameters and a "degree":

    RT = GFMUL(RA, RB, gfdegree, modulo=RC)

Realistically, with the degree also needing to be an immediate, it should be brought down to an overwrite version:

    RS = GFMUL(RS, RA, gfdegree, modulo=RC)
    RS = GFMUL(RS, RA, gfdegree=RB, modulo=RC)
 552
 553 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31|
 554 | -- | -- | --- | --- | --- | ------- |--|
 555 | NN | RS | RA  | deg | RC  | 00  011 |Rc|
 556 | NN | RS | RA  | RB  | RC  | 11  011 |Rc|
 557
 558 where the SimpleV variant may override RS-as-src differently from RS-as-dest
 559
 560
 561
 562 ```
 563 from functools import reduce
 564
 565 # constants used in the multGF2 function
 566 mask1 = mask2 = polyred = None
 567
 568 def setGF2(degree, irPoly):
 569     """Define parameters of binary finite field GF(2^m)/g(x)
 570        - degree: extension degree of binary field
 571        - irPoly: coefficients of irreducible polynomial g(x)
 572     """
 573     def i2P(sInt):
 574         """Convert an integer into a polynomial"""
 575         return [(sInt >> i) & 1
 576                 for i in reversed(range(sInt.bit_length()))]
 577
 578     global mask1, mask2, polyred
 579     mask1 = mask2 = 1 << degree
 580     mask2 -= 1
 581     polyred = reduce(lambda x, y: (x << 1) + y, i2P(irPoly)[1:])
 582
 583 def multGF2(p1, p2):
 584     """Multiply two polynomials in GF(2^m)/g(x)"""
 585     p = 0
 586     while p2:
 587         if p2 & 1:
 588             p ^= p1
 589         p1 <<= 1
 590         if p1 & mask1:
 591             p1 ^= polyred
 592         p2 >>= 1
 593     return p & mask2
 594
 595 if __name__ == "__main__":
 596
 597     # Define binary field GF(2^3)/x^3 + x + 1
 598     setGF2(3, 0b1011)
 599
 600     # Evaluate the product (x^2 + x + 1)(x^2 + 1)
 601     print("{:02x}".format(multGF2(0b111, 0b101)))
 602
 603     # Define binary field GF(2^8)/x^8 + x^4 + x^3 + x + 1
 604     # (used in the Advanced Encryption Standard-AES)
 605     setGF2(8, 0b100011011)
 606
 607     # Evaluate the product (x^7)(x^7 + x + 1)
 608     print("{:02x}".format(multGF2(0b10000000, 0b10000011)))
 609 ```

## GF div and mod
 611
```
def FullDivision(self, f, v, fDegree, vDegree):
    """
    Takes four arguments, f, v, fDegree, and vDegree where
    fDegree and vDegree are the degrees of the field elements
    f and v represented as polynomials.
    This method returns the field elements a and b such that

        f(x) = a(x) * v(x) + b(x).

    That is, a is the quotient and b is the remainder, or in
    other words a is like floor(f/v) and b is like f modulo v.
    """
    res, rem = 0, f
    i = fDegree
    mask = 1 << i
    while i >= vDegree:
        if mask & rem:
            res ^= 1 << (i - vDegree)
            rem ^= v << (i - vDegree)
        i -= 1
        mask >>= 1
    return (res, rem)
```
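
A standalone Python sketch of the same long division, with the degrees derived from the operands rather than passed in. The names `gf2_divmod` and `clmul_full` are hypothetical helpers for this example. The carry-less multiply is included only to check the identity f = a*v + b, where addition is XOR:

```python
def gf2_divmod(f, v):
    """Return (quotient, remainder) of f / v as polynomials over GF(2)."""
    if v == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    f_deg = f.bit_length() - 1
    v_deg = v.bit_length() - 1
    quo, rem = 0, f
    for i in range(f_deg, v_deg - 1, -1):
        if rem & (1 << i):             # leading term still present
            quo ^= 1 << (i - v_deg)
            rem ^= v << (i - v_deg)    # subtract (XOR) the shifted divisor
    return quo, rem

def clmul_full(a, b):
    """Carry-less (polynomial) product, no reduction."""
    x = 0
    while b:
        if b & 1:
            x ^= a
        a <<= 1
        b >>= 1
    return x

# (x^3 + x + 1) divided by (x + 1)
q, r = gf2_divmod(0b1011, 0b11)
print(bin(q), bin(r))
```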
 637
 638 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 639 | -- | -- | --- | --- | --- | ------- |--| ----- |
 640 | NN | RS | RA  | deg | RC  | 0 1  011 |Rc| gfaddi |
 641 | NN | RS | RA  | RB  | RC  | 1 1  111 |Rc| gfadd |
 642
 643 GFMOD is a pseudo-op where RA=0
 644
 645 ## gf invert
 646
 647 ```
 648 def gf_degree(a) :
 649   res = 0
 650   a >>= 1
 651   while (a != 0) :
 652     a >>= 1;
 653     res += 1;
 654   return res
 655
 656 def gf_invert(a, mod=0x1B) :
 657   v = mod
 658   g1 = 1
 659   g2 = 0
 660   j = gf_degree(a) - 8
 661
 662   while (a != 1) :
 663     if (j < 0) :
 664       a, v = v, a
 665       g1, g2 = g2, g1
 666       j = -j
 667
 668     a ^= v << j
 669     g1 ^= g2 << j
 670
 671     a %= 256  # Emulating 8-bit overflow
 672     g1 %= 256 # Emulating 8-bit overflow
 673
 674     j = gf_degree(a) - gf_degree(v)
 675
 676   return g1
 677 ```
 678
 679 ## carryless mul
 680
Based on RV bitmanip; see <https://en.wikipedia.org/wiki/CLMUL_instruction_set>.

These are GF2 operations with the modulo set to 2^degree.
They are worth adding as their own non-overwrite operations
(in the same pipeline).
 687
 688 ```
 689 uint_xlen_t clmul(uint_xlen_t RA, uint_xlen_t RB)
 690 {
 691     uint_xlen_t x = 0;
 692     for (int i = 0; i < XLEN; i++)
 693         if ((RB >> i) & 1)
 694             x ^= RA << i;
 695     return x;
 696 }
 697 uint_xlen_t clmulh(uint_xlen_t RA, uint_xlen_t RB)
 698 {
 699     uint_xlen_t x = 0;
 700     for (int i = 1; i < XLEN; i++)
 701         if ((RB >> i) & 1)
 702             x ^= RA >> (XLEN-i);
 703     return x;
 704 }
 705 uint_xlen_t clmulr(uint_xlen_t RA, uint_xlen_t RB)
 706 {
 707     uint_xlen_t x = 0;
 708     for (int i = 0; i < XLEN; i++)
 709         if ((RB >> i) & 1)
 710             x ^= RA >> (XLEN-i-1);
 711     return x;
 712 }
 713 ```
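
A Python port of the three functions, assuming XLEN=64 and operands already reduced to XLEN bits. The extra helper `clmul_full` (not part of the proposal) computes the untruncated product so that the relationship between the variants can be seen: clmul is the low XLEN bits, clmulh the bits from XLEN upwards, and clmulr the XLEN bits starting at bit XLEN-1:

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def clmul(RA, RB):
    x = 0
    for i in range(XLEN):
        if (RB >> i) & 1:
            x ^= RA << i
    return x & MASK            # C version truncates implicitly

def clmulh(RA, RB):
    x = 0
    for i in range(1, XLEN):
        if (RB >> i) & 1:
            x ^= RA >> (XLEN - i)
    return x

def clmulr(RA, RB):
    x = 0
    for i in range(XLEN):
        if (RB >> i) & 1:
            x ^= RA >> (XLEN - i - 1)
    return x

def clmul_full(RA, RB):
    """Untruncated carry-less product (Python integers are unbounded)."""
    x = 0
    for i in range(XLEN):
        if (RB >> i) & 1:
            x ^= RA << i
    return x

print(bin(clmul(0b11, 0b11)))  # (x + 1)^2 = x^2 + 1 over GF(2)
```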
 714
 715 # bitmatrix
 716
 717 ```
 718 uint64_t bmatflip(uint64_t RA)
 719 {
 720     uint64_t x = RA;
 721     x = shfl64(x, 31);
 722     x = shfl64(x, 31);
 723     x = shfl64(x, 31);
 724     return x;
 725 }
 726 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 727 {
 728     // transpose of RB
 729     uint64_t RBt = bmatflip(RB);
 730     uint8_t u[8]; // rows of RA
 731     uint8_t v[8]; // cols of RB
 732     for (int i = 0; i < 8; i++) {
 733         u[i] = RA >> (i*8);
 734         v[i] = RBt >> (i*8);
 735     }
 736     uint64_t x = 0;
 737     for (int i = 0; i < 64; i++) {
 738         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 739             x |= 1LL << i;
 740     }
 741     return x;
 742 }
 743 uint64_t bmator(uint64_t RA, uint64_t RB)
 744 {
 745     // transpose of RB
 746     uint64_t RBt = bmatflip(RB);
 747     uint8_t u[8]; // rows of RA
 748     uint8_t v[8]; // cols of RB
 749     for (int i = 0; i < 8; i++) {
 750         u[i] = RA >> (i*8);
 751         v[i] = RBt >> (i*8);
 752     }
 753     uint64_t x = 0;
 754     for (int i = 0; i < 64; i++) {
 755         if ((u[i / 8] & v[i % 8]) != 0)
 756             x |= 1LL << i;
 757     }
 758     return x;
 759 }
 760
 761 ```
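
A Python sketch treating a 64-bit register as an 8x8 bit matrix, with bit (row r, column c) stored at index r*8 + c. Here `bmatflip` is written as a direct transposition, which is intended to be equivalent to the three-shuffle form above; `_rows` and `IDENTITY` are hypothetical helpers for the example:

```python
def bmatflip(x):
    """Transpose: bit (r, c) moves to (c, r)."""
    t = 0
    for r in range(8):
        for c in range(8):
            if (x >> (r * 8 + c)) & 1:
                t |= 1 << (c * 8 + r)
    return t

def _rows(x):
    return [(x >> (i * 8)) & 0xFF for i in range(8)]

def bmatxor(RA, RB):
    """Matrix product over GF(2): AND for multiply, XOR (parity) for add."""
    u, v = _rows(RA), _rows(bmatflip(RB))   # rows of RA, columns of RB
    x = 0
    for i in range(64):
        if bin(u[i // 8] & v[i % 8]).count("1") & 1:
            x |= 1 << i
    return x

def bmator(RA, RB):
    """Same structure, but OR replaces XOR as the accumulation."""
    u, v = _rows(RA), _rows(bmatflip(RB))
    x = 0
    for i in range(64):
        if u[i // 8] & v[i % 8]:
            x |= 1 << i
    return x

IDENTITY = 0x8040201008040201   # ones on the diagonal (bits 0, 9, 18, ... 63)
A = 0x0123456789ABCDEF
print(hex(bmatxor(A, IDENTITY)), hex(bmatflip(0xFF)))
```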
 762
 763 # Already in POWER ISA
 764
 765 ## count leading/trailing zeros with mask
 766
 767 in v3.1 p105
 768
 769 ```
 770 count = 0
 771 do i = 0 to 63 if((RB)i=1) then do
 772 if((RS)i=1) then break end end count ← count + 1
 773 RA ← EXTZ64(count)
 774 ```
 775
 776 ##  bit deposit
 777
vpdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
 779
 780     do while(m < 64)
 781        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
 782           result = VSR[VRA+32].dword[i].bit[63-k]
 783           VSR[VRT+32].dword[i].bit[63-m] = result
 784           k = k + 1
 785        m = m + 1
 786
 787 ```
 788
 789 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
 790 {
 791     uint_xlen_t r = 0;
 792     for (int i = 0, j = 0; i < XLEN; i++)
 793         if ((RB >> i) & 1) {
 794             if ((RA >> j) & 1)
 795                 r |= uint_xlen_t(1) << i;
 796             j++;
 797         }
 798     return r;
 799 }
 800
 801 ```
 802
## bit extract
 804
 805 other way round: identical to RV bext, found in v3.1 p196
 806
 807 ```
 808 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
 809 {
 810     uint_xlen_t r = 0;
 811     for (int i = 0, j = 0; i < XLEN; i++)
 812         if ((RB >> i) & 1) {
 813             if ((RA >> i) & 1)
 814                 r |= uint_xlen_t(1) << j;
 815             j++;
 816         }
 817     return r;
 818 }
 819 ```
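
A Python sketch of bdep and bext side by side, assuming XLEN=64. It shows that the two are inverse in the sense that extracting with the same mask recovers the low bits that were deposited:

```python
XLEN = 64                      # assumed register width

def bdep(RA, RB):
    """Scatter the low bits of RA to the positions where RB has a 1."""
    r, j = 0, 0
    for i in range(XLEN):
        if (RB >> i) & 1:
            if (RA >> j) & 1:
                r |= 1 << i
            j += 1
    return r

def bext(RA, RB):
    """Gather the bits of RA selected by RB into the low end of the result."""
    r, j = 0, 0
    for i in range(XLEN):
        if (RB >> i) & 1:
            if (RA >> i) & 1:
                r |= 1 << j
            j += 1
    return r

mask = 0b11010
print(bin(bdep(0b101, mask)))              # value spread over mask positions
print(bin(bext(bdep(0b101, mask), mask)))  # and gathered back
```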
 820
## centrifuge
 822
 823 found in v3.1 p106 so not to be added here
 824
```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)[i]=0) then do
        result[ptr0] = (RS)[i]
        ptr0 = ptr0 + 1
    end
    if((RB)[63-i]=1) then do
        result[63-ptr1] = (RS)[63-i]
        ptr1 = ptr1 + 1
    end
end
RA = result
```
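
A Python sketch of the centrifuge, under two stated assumptions: bit indices in the pseudocode are MSB0 (bit 0 is the most significant), and each pointer advances only when its own test succeeds. The function name `cfuged` and the helper `bit` are chosen for this example. Bits of RS whose mask bit is 0 are packed towards the most significant end, and bits whose mask bit is 1 towards the least significant end, each group keeping its original order:

```python
def bit(x, i):
    """Bit i of a 64-bit value in MSB0 numbering."""
    return (x >> (63 - i)) & 1

def cfuged(RS, RB):
    result = 0
    ptr0 = ptr1 = 0
    for i in range(64):
        if bit(RB, i) == 0:
            result |= bit(RS, i) << (63 - ptr0)   # result[ptr0] in MSB0
            ptr0 += 1
        if bit(RB, 63 - i) == 1:
            result |= bit(RS, 63 - i) << ptr1     # result[63-ptr1] in MSB0
            ptr1 += 1
    return result

# mask selects the upper nibble of the low byte; it ends up in the low nibble
print(hex(cfuged(0xA5, 0xF0)))
```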
 839