openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 [[!toc levels=1]]
   4
   5 # Implementation Log
   6
   7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
  11
  12 # bitmanipulation
  13
  14 **DRAFT STATUS**
  15
  16 pseudocode: [[openpower/isa/bitmanip]]
  17
  18 this extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.  Vectorisation and SIMD are removed: these are straight scalar (element) operations making them suitable for embedded applications.
  19 Vectorisation Context is provided by [[openpower/sv]].
  20
When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years).  Also, VSX comprises hundreds of opcodes, requires 128-bit pathways, and is wholly unsuited to low-power or embedded scenarios.
  22
  23 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
  24
  25 general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  however for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.
  26
note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]] and
[[sv/av_opcodes]], as well as [[sv/setvl]].
  29
Useful resources:
  31
  32 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  33 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  34
  35 # summary
  36
  37 two major opcodes are needed
  38
  39 ternlog has its own major opcode
  40
  41 |  29.30 |31| name      |
  42 | ------ |--| --------- |
  43 |   0  0   |Rc| ternlogi  |
  44 |   0  1   |sz| ternlogv  |
  45 |   1 iv   |  | grevlogi |
  46
  47 2nd major opcode for other bitmanip: minor opcode allocation
  48
  49 |  28.30 |31| name      |
  50 | ------ |--| --------- |
  51 |  -00   |0 | xpermi    |
  52 |  -00   |1 | grevlog   |
  53 |  -01   |  | crternlog  |
  54 |  010   |Rc| bitmask   |
  55 |  011   |  | gf/cl madd*  |
  56 |  110   |Rc| 1/2-op    |
  57 |  111   |  | bmrevi   |
  58
  59
  60 1-op and variants
  61
  62 | dest | src1 | subop | op       |
  63 | ---- | ---- | ----- | -------- |
  64 | RT   | RA   | ..    | bmatflip |
  65
  66 2-op and variants
  67
  68 | dest | src1 | src2 | subop | op       |
  69 | ---- | ---- | ---- | ----- | -------- |
  70 | RT   | RA   | RB   | or    | bmatflip |
  71 | RT   | RA   | RB   | xor   | bmatflip |
  72 | RT   | RA   | RB   |       | grev  |
  73 | RT   | RA   | RB   |       | clmul*  |
  74 | RT   | RA   | RB   |       | gorc |
  75 | RT   | RA   | RB   | shuf  | shuffle |
  76 | RT   | RA   | RB   | unshuf| shuffle |
  77 | RT   | RA   | RB   | width | xperm  |
  78 | RT   | RA   | RB   | type | av minmax |
  79 | RT   | RA   | RB   |      | av abs avgadd  |
  80 | RT   | RA   | RB   | type | vmask ops |
  81 | RT   | RA   | RB   |      |       |
  82
  83 3 ops
  84
  85 * grevlog
  86 * GF mul-add
  87 * bitmask-reverse
  88
  89 TODO: convert all instructions to use RT and not RS
  90
  91 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|name|
  92 | -- | -- | --- | --- | --- |-----|----- | -----|--|----|
  93 | NN | BT | BA  | BB  | BC  |m0-2 | imm  |  10  |m3|crternlog|
  94
  95 | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name |
  96 | -- | -- | --- | ---  | -----   | --------  |--| ------ |
  97 | NN | RT | RA  |itype/| im0-4   | im5-7  00 |0 | xpermi  |
  98 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |1 | grevlog |
  99 | NN |    |     |      |         | .....  01 |0 | crternlog |
 100 | NN | RT | RA  | RB   | RC      | mode  010 |Rc| bitmask* |
 101 | NN | RS | RA  | RB   | RC      | 00    011 |0 | gfbmadd |
 102 | NN | RS | RA  | RB   | RC      | 00    011 |1 | gfbmaddsub |
 103 | NN | RS | RA  | RB   | RC      | 01    011 |0 | clmadd |
 104 | NN | RS | RA  | RB   | RC      | 01    011 |1 | clmaddsub |
 105 | NN | RS | RA  | RB   | RC      | 10    011 |0 | gfpmadd |
 106 | NN | RS | RA  | RB   | RC      | 10    011 |1 | gfpmaddsub |
 107 | NN | RS | RA  | RB   | RC      | 11    011 |  | rsvd |
 108 | NN | RT | RA  | RB   | sh0-4   | sh5 1 111 |Rc| bmrevi |
 109
ops (note that av avg and abs, as well as vec scalar mask ops,
are included here: see [[sv/vector_ops]] and
[[sv/av_opcodes]])
 113
 114 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 115 double check that instructions didn't need 3 inputs.
 116
 117 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 118 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 119 | NN | RS | me  | sh  | SH | ME 0  | nn00 110 |Rc| bmopsi |
 120 | NN | RS | RB  | sh  | SH | 0   1 | nn00 110 |Rc| bmopsi |
 121 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| cldiv |
 122 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| clmod |
 123 | NN | RT | RA  | RB  | 1  |  10   | 0001 110 |Rc| clmul |
 124 | NN | RT | RB  | RB  | 1  |  11   | 0001 110 |Rc| clinv |
 125 | NN | RA | RB  | RC  | 0  |   00  | 0001 110 |Rc| vec sbfm |
 126 | NN | RA | RB  | RC  | 0  |   01  | 0001 110 |Rc| vec sofm |
 127 | NN | RA | RB  | RC  | 0  |   10  | 0001 110 |Rc| vec sifm |
 128 | NN | RA | RB  | RC  | 0  |   11  | 0001 110 |Rc| vec cprop |
 129 | NN | RT | RA  | RB  | 1  | itype | 0101 110 |Rc| xperm |
 130 | NN | RA | RB  | RC  | 0  | itype | 0101 110 |Rc| av minmax |
 131 | NN | RA | RB  | RC  | 1  |   00  | 0101 110 |Rc| av abss |
 132 | NN | RA | RB  | RC  | 1  |   01  | 0101 110 |Rc| av absu|
 133 | NN | RA | RB  |     | 1  |   10  | 0101 110 |Rc| av avgadd |
 134 | NN | RA | RB  | RC  | 1  |   11  | 0101 110 |Rc| grevw |
 135 | NN | RA | RB  |     |    |       | 1001 110 |Rc| rsvd |
 136 | NN | RA | RB  | RC  | 0  | 00    | 1101 110 |Rc| bmator  |
 137 | NN | RA | RB  | RC  | 1  | 00    | 1101 110 |Rc| bmatxor   |
 138 | NN | RA | RB  | sh  | 0  | 01    | 1101 110 |Rc| grevwi |
 139 | NN | RA | RB  | RC  | 1  | 01    | 1101 110 |Rc| grev |
 140 | NN | RA | RB  | sh  | SH | 10    | 1101 110 |Rc| grevi |
 141 | NN | RA | RB  | RC  | 0  | 11    | 1101 110 |Rc| clmulr  |
 142 | NN | RA | RB  | RC  | 1  | 11    | 1101 110 |Rc| clmulh  |
 143 | NN | RA | RB  | RC  |    |       | --10 110 |Rc| rsvd  |
 144 | NN |    |     |     |    |       | --11 110 |Rc| setvl  |
 145
 146 # ternlog bitops
 147
Similar to FPGA LUTs: for every bit, perform a lookup into a table held either in an 8-bit immediate or in another register.
 149
 150 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 151
 152 ## ternlogi
 153
 154 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
 155 | -- | -- | --- | --- | ----- | --- |--|
 156 | NN | RT | RA  | RB  | im0-7 |  00 |Rc|
 157
 158     lut3(imm, a, b, c):
 159         idx = c << 2 | b << 1 | a
 160         return imm[idx] # idx by LSB0 order
 161
 162     for i in range(64):
 163         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
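
The pseudocode above can be exercised with a minimal Python sketch. Registers are modelled as plain integers passed by value, and the `xlen` parameter is an assumption added for convenience (the pseudocode is fixed at 64 bits); this is an illustration, not the reference implementation.

```python
def lut3(imm, a, b, c):
    """Look up one result bit in an 8-bit table, index in LSB0 order."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1


def ternlogi(rt, ra, rb, imm, xlen=64):
    """Bitwise 3-input LUT: RT is both the third input and the result."""
    result = 0
    for i in range(xlen):
        bit = lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1)
        result |= bit << i
    return result


rt, ra, rb = 0x0F0F, 0x3333, 0x5555
# 0x80 sets only table entry 7 (all three inputs 1): a 3-input AND
and3 = ternlogi(rt, ra, rb, 0x80, xlen=16)
# 0x96 sets the entries with an odd number of 1 inputs: a 3-input XOR
xor3 = ternlogi(rt, ra, rb, 0x96, xlen=16)
# 0xFE clears only table entry 0 (all inputs 0): a 3-input OR
or3 = ternlogi(rt, ra, rb, 0xFE, xlen=16)
```

Any of the 256 possible three-input boolean functions is selected purely by the choice of immediate.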
 164
 165 ## ternlogv
 166
another possible variant involves swizzle-like selection
and masking. this requires only three 64-bit registers (RA, RS, RB) and
only 16 LUT3s.

Note however that unless XLEN matches sz, this instruction
is a Read-Modify-Write: RS must be read as a second operand
and all unmodified bits preserved.  SVP64 may provide a limited
alternative destination for RS, distinct from RS-as-source, but again
all unmodified bits must still be copied.
 176
 177 | 0.5|6.10|11.15|16.20|21.28 | 29.30 |31|
 178 | -- | -- | --- | --- | ---- | ----- |--|
 179 | NN | RS | RA  | RB  |idx0-3|  01   |sz|
 180
    SZ = (1+sz) * 8 # 8 or 16
    raoff = MIN(XLEN, idx0 * SZ)
    rboff = MIN(XLEN, idx1 * SZ)
    rcoff = MIN(XLEN, idx2 * SZ)
    rsoff = MIN(XLEN, idx3 * SZ)
    imm = RB[0:8]
    for i in range(MIN(XLEN, SZ)):
        ra = RA[raoff+i]
        rb = RA[rboff+i]
        rc = RA[rcoff+i]
        res = lut3(imm, ra, rb, rc)
        RS[rsoff+i] = res
 193
 194 ## ternlogcr
 195
another possible mode operates on CR fields rather than integer registers.
 197
 198 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
 199 | -- | -- | --- | --- | --- |-----|----- | -----|--|
 200 | NN | BT | BA  | BB  | BC  |m0-2 | imm  |  10  |m3|
 201
    mask = m0-2,m3
    for i in range(4):
        if not mask[i]: continue
        crregs[BT][i] = lut3(imm,
                             crregs[BA][i],
                             crregs[BB][i],
                             crregs[BC][i])
 209
 210
 211 # int min/max
 212
 213 required for
 214 the [[sv/av_opcodes]]
 215
signed and unsigned min/max for integers.  this is partly synthesisable in [[sv/svp64]] with pred-result, as long as the dest reg is one of the sources, but not for both signed and unsigned.  when the dest is also one of the sources and the mv fails due to the CR bit-test failing, the dest is only overwritten where the src is greater (or less).
 217
 218 signed/unsigned min/max gives more flexibility.
 219
 220 ```
 221 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 222 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 223 }
 224 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 225 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 226 }
 227 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 228 { return rs1 < rs2 ? rs1 : rs2;
 229 }
 230 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 231 { return rs1 > rs2 ? rs1 : rs2;
 232 }
 233 ```
 234
 235
 236 ## cmix
 237
 238 based on RV bitmanip, covered by ternlog bitops
 239
 240 ```
 241 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 242     return (RA & RB) | (RC & ~RB);
 243 }
 244 ```
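
To illustrate how cmix is covered by the ternlog bitops, the following sketch derives the 8-bit immediate by enumerating the truth table. It assumes the `ternlogi` operand ordering shown earlier (`a=RB`, `b=RA`, `c=RT`) with RC taken from RT (overwrite form); the helper `imm_for` and the word-wide minterm formulation of `ternlogi` are hypothetical names and choices made for this example only.

```python
def ternlogi(rt, ra, rb, imm, xlen=64):
    """Word-wide ternlogi: OR together the minterms selected by imm."""
    mask = (1 << xlen) - 1
    result = 0
    for idx in range(8):
        if (imm >> idx) & 1:
            term = mask
            term &= rb if idx & 1 else ~rb   # a = RB
            term &= ra if idx & 2 else ~ra   # b = RA
            term &= rt if idx & 4 else ~rt   # c = RT
            result |= term
    return result & mask


def cmix(ra, rb, rc):
    return (ra & rb) | (rc & ~rb)


def imm_for(func):
    """Build the LUT3 table for a single-bit function func(ra, rb, rt)."""
    imm = 0
    for idx in range(8):
        a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        if func(b, a, c) & 1:                # ra=b, rb=a, rt=c
            imm |= 1 << idx
    return imm


CMIX_IMM = imm_for(cmix)                     # evaluates to 0xD8
ra, rb, rc = 0x1234_5678_9ABC_DEF0, 0x00FF_00FF_F0F0_0F0F, 0xDEAD_BEEF_CAFE_F00D
via_ternlog = ternlogi(rc, ra, rb, CMIX_IMM)
```

The same `imm_for` approach gives the immediate for any other three-operand bitwise function.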
 245
 246
 247 # bitmask set
 248
based on RV bitmanip singlebit set, with an instruction format similar to shift
[[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl, but only in the immediate version).
however bitmask-invert is not covered, nor are set/clr, although they can use the same Shift ALU.

the bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, whereas rldicl is a left rotate.  for the immediate version this does not matter, so a bmexti is not required.
bmrev however has no direct equivalent, and consequently a bmrevi is required.
 255
 256 bmset (register for mask amount) is particularly useful for creating
 257 predicate masks where the length is a dynamic runtime quantity.
 258 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 259
 260 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 261 | -- | -- | --- | --- | --- | ------- |--| ----- |
 262 | NN | RS | RA  | RB  | RC  | mode 010 |Rc| bm*   |
 263
 264 Immediate-variant is an overwrite form:
 265
 266 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 267 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 268 | NN | RS | RB  | sh  | SH | itype | 1000 110 |Rc| bm*i |
 269
 270 ```
 271 def MASK(x, y):
 272      if x < y:
 273          x = x+1
 274          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 275          mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
 276      elif x == y:
 277          return 1 << x
 278      else:
 279          x = x+1
 280          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 281          mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
 282      return mask_a ^ mask_b
 283
 284
 285 uint_xlen_t bmset(RS, RB, sh)
 286 {
 287     int shamt = RB & (XLEN - 1);
 288     mask = (2<<sh)-1;
 289     return RS | (mask << shamt);
 290 }
 291
 292 uint_xlen_t bmclr(RS, RB, sh)
 293 {
 294     int shamt = RB & (XLEN - 1);
 295     mask = (2<<sh)-1;
 296     return RS & ~(mask << shamt);
 297 }
 298
 299 uint_xlen_t bminv(RS, RB, sh)
 300 {
 301     int shamt = RB & (XLEN - 1);
 302     mask = (2<<sh)-1;
 303     return RS ^ (mask << shamt);
 304 }
 305
 306 uint_xlen_t bmext(RS, RB, sh)
 307 {
 308     int shamt = RB & (XLEN - 1);
 309     mask = (2<<sh)-1;
 310     return mask & (RS >> shamt);
 311 }
 312 ```
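
The four C-style definitions above can be tried out with a small Python sketch. It follows the pseudocode as written (RB supplies the shift amount, `sh` the mask length); truncating results to `XLEN` bits and the helper name `_field` are assumptions of this example.

```python
XLEN = 64
MASK_XLEN = (1 << XLEN) - 1


def _field(rb, sh):
    """Return (shifted mask, shamt, unshifted mask) as in the pseudocode."""
    shamt = rb & (XLEN - 1)
    mask = (2 << sh) - 1          # a run of sh+1 ones
    return (mask << shamt) & MASK_XLEN, shamt, mask


def bmset(rs, rb, sh):
    field, _, _ = _field(rb, sh)
    return (rs | field) & MASK_XLEN


def bmclr(rs, rb, sh):
    field, _, _ = _field(rb, sh)
    return rs & ~field & MASK_XLEN


def bminv(rs, rb, sh):
    field, _, _ = _field(rb, sh)
    return (rs ^ field) & MASK_XLEN


def bmext(rs, rb, sh):
    _, shamt, mask = _field(rb, sh)
    return mask & (rs >> shamt)


ones = bmset(0, 4, 3)             # four ones starting at bit 4
byte = bmext(0xABCD, 4, 7)        # eight bits starting at bit 4
```

Note that `(2<<sh)-1` produces `sh+1` ones, so `sh=0` still selects a single bit.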
 313
bitmask extract with reverse.  this can be done by bit-order-inverting all of RB and then extracting bits from the opposite end.

when RA is zero, no shift occurs. this makes bmextrev useful for
simply reversing all bits of a register.
 318
```
msb = ra[5:0];
rev[0:msb] = rb[msb:0];
rt = ZE(rev[msb:0]);

uint_xlen_t bmextrev(RA, RB, sh)
{
    int shamt = XLEN-1;
    if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
    shamt = (XLEN-1)-shamt;  // shift other end
    bra = bitreverse(RB);    // swap LSB-MSB
    mask = (2<<sh)-1;
    return mask & (bra >> shamt);
}
```
 334
 335 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 336 | -- | -- | --- | --- | --- | ------- |--| ------ |
 337 | NN | RT | RA  | RB  | sh  | 1   011 |Rc| bmrevi |
 338
 339
 340 # grevlut
 341
generalised reverse is combined with a pair of LUT2s, with
a constant `0b0101...0101` substituted when RA=0 and an option to invert
(including when RA=0, giving a constant `0b1010...1010` as the
initial value).  this provides a wide range of instructions
and a means to set regular 64-bit patterns in one
32-bit instruction.

the two LUT2s are applied left-half (when not swapping)
and right-half (when swapping) so as to allow a wider
range of options.
 352
 353 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 354
 355 * A value of `0b11001010` for the immediate provides
 356 the functionality of a standard "grev".
 357 * `0b11101110` provides gorc
 358
grevlut should be arranged so as to produce the constants
needed to put into bext (bitextract) so as in turn to
be able to emulate the x86 pmovmskb instruction <https://www.felixcloutier.com/x86/pmovmskb>.
This only requires 2 instructions (grevlut, bext).

Note that if the mask is required to be placed
directly into CR Fields (for use as CR Predicate
masks rather than an integer mask) then sv.ori
may be used instead, bearing in mind that sv.ori
is a 64-bit instruction, and `VL` must have been
set to the required length:
 370
 371     sv.ori./elwid=8 r10.v, r10.v, 0
 372
 373 The following settings provide the required mask constants:
 374
 375 | RA       | RB      | imm        | iv | result        |
 376 | -------  | ------- | ---------- | -- | ----------    |
 377 | 0x555..  | 0b10    | 0b01101100 | 0  | 0x111111...   |
 378 | 0x555..  | 0b110   | 0b01101100 | 0  | 0x010101...   |
 379 | 0x555..  | 0b1110  | 0b01101100 | 0  | 0x00010001...   |
 380 | 0x555..  | 0b10    | 0b11000110 | 1  | 0x88888...   |
 381 | 0x555..  | 0b110   | 0b11000110 | 1  | 0x808080...   |
 382 | 0x555..  | 0b1110  | 0b11000110 | 1  | 0x80008000...   |
 383
The diagram below shows the correct ordering of shamt (RB).  A LUT2
is applied to all locations marked in red using the lower 4
bits of the immediate, and a separate LUT2 is applied to all
locations in green using the upper 4 bits of the immediate.
 388
 389 <img src="/openpower/sv/grevlut.png" width=700 />
 390
 391 demo code [[openpower/sv/grevlut.py]]
 392
```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm8, step_i, chunk_size):
    for j in 0 to 63:
        if (j&chunk_size) == 0
           imm = imm8[0..3]
        else
           imm = imm8[4..7]
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 imm, bool iv)
{
    uint64_t x = 0x5555_5555_5555_5555;
    if (RA != 0) x = GPR(RA);
    if (iv) x = ~x;
    int shamt = RB & 63;
    for i in 0 to 5
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}
```
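
A runnable Python rendering of the pseudocode above may help when checking the mask-constant table. This is a sketch only, not the linked demo code: passing `ra=None` to stand for the register field RA=0 is an assumption of this example, as is treating `imm8[0..3]` as the low nibble.

```python
MASK64 = (1 << 64) - 1


def lut2(imm, a, b):
    idx = (b << 1) | a
    return (imm >> idx) & 1          # idx by LSB0 order


def dorow(imm8, step_i, chunk_size):
    step_o = 0
    for j in range(64):
        # low nibble for the "lower" partner, high nibble for the "upper"
        imm = (imm8 & 0xF) if (j & chunk_size) == 0 else (imm8 >> 4) & 0xF
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o


def grevlut64(ra, rb, imm8, iv):
    x = 0x5555_5555_5555_5555 if ra is None else ra & MASK64
    if iv:
        x ^= MASK64
    shamt = rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            x = dorow(imm8, x, step)
    return x


# first and fourth rows of the mask-constant table
nibble_lsbs = grevlut64(None, 0b10, 0b01101100, False)
nibble_msbs = grevlut64(None, 0b10, 0b11000110, True)
```

With the pseudocode exactly as written, the immediate that makes both LUT2s select the partner bit (and so behaves as a plain per-row swap in this sketch) is `0b11001100`, while `0b11101110` ORs each bit with its partner. Treat that as an observation about this sketch rather than a statement about the diagrams.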
 420
 421 | 0.5|6.10|11.15|16.20 |21..25   | 26....30    |31| name |
 422 | -- | -- | --- | ---  | -----   | --------    |--| ------ |
 423 | NN | RT | RA  | s0-4 | im0-4   | im5-7  1 iv |s5| grevlogi |
 424 | NN | RT | RA  | RB   | im0-4   | im5-7  00   |1 | grevlog |
 425
 426
 427 # grev
 428
based on RV bitmanip. this is related to a butterfly network: however,
where a butterfly network allows independent setting of every crossbar in
every row and every column, generalised-reverse (grev) only allows
a per-row decision: every entry in the same row must either switch or
not-switch.
 434
 435 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 436
 437 ```
 438 uint64_t grev64(uint64_t RA, uint64_t RB)
 439 {
 440     uint64_t x = RA;
 441     int shamt = RB & 63;
 442     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 443                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 444     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 445                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 446     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 447                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 448     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 449                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 450     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 451                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 452     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 453                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 454     return x;
 455 }
 456
 457 ```
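
The six unrolled stages of `grev64` follow one pattern, which a loop-based Python sketch makes explicit. The helper `low_mask` is a hypothetical name introduced here; it regenerates the `0x5555...`, `0x3333...` constants rather than hard-coding them.

```python
MASK64 = (1 << 64) - 1


def low_mask(step):
    """Bits whose index has the `step` bit clear, e.g. 0x5555... for step 1."""
    return sum(1 << j for j in range(64) if (j & step) == 0)


def grev64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for k in range(6):
        step = 1 << k
        if shamt & step:
            lo = low_mask(step)
            # swap each group of `step` bits with its neighbouring group
            x = ((x & lo) << step) | ((x & ~lo & MASK64) >> step)
    return x


x = 0x0123_4567_89AB_CDEF
bit_reversed = grev64(x, 63)      # all six stages: full bit reversal
byte_swapped = grev64(x, 56)      # stages 8, 16, 32: byte-order swap
```

Because every stage is its own inverse and the stages commute, applying `grev64` twice with the same shift amount returns the original value.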
 458
 459 # xperm
 460
 461 based on RV bitmanip.
 462
RA contains a vector of indices to select parts of RB to be
copied to RT.  The immediate-variant allows up to an 8 bit
pattern (repeated) to be targeted at different parts of RT.
 466
 467 ```
 468 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
 469 {
 470     uint_xlen_t r = 0;
 471     uint_xlen_t sz = 1LL << sz_log2;
 472     uint_xlen_t mask = (1LL << sz) - 1;
 473     uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
 474     for (int i = 0; i < XLEN; i += sz) {
 475         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 476         if (pos < XLEN)
 477             r |= ((RB >> pos) & mask) << i;
 478     }
 479     return r;
 480 }
 481 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 482 {
 483     uint_xlen_t r = 0;
 484     uint_xlen_t sz = 1LL << sz_log2;
 485     uint_xlen_t mask = (1LL << sz) - 1;
 486     for (int i = 0; i < XLEN; i += sz) {
 487         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 488         if (pos < XLEN)
 489             r |= ((RB >> pos) & mask) << i;
 490     }
 491     return r;
 492 }
 493 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 494 {  return xperm(RA, RB, 2); }
 495 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 496 {  return xperm(RA, RB, 3); }
 497 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 498 {  return xperm(RA, RB, 4); }
 499 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 500 {  return xperm(RA, RB, 5); }
 501 ```
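
As an illustration of the index-vector behaviour, here is a direct Python transcription of `xperm` with two small uses. `XLEN = 64` is assumed, and the operand values are made up for the example.

```python
XLEN = 64


def xperm(ra, rb, sz_log2):
    """Each sz-bit element of RA indexes an sz-bit element of RB."""
    r = 0
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    for i in range(0, XLEN, sz):
        pos = ((ra >> i) & mask) << sz_log2
        if pos < XLEN:                # out-of-range indices yield zero
            r |= ((rb >> pos) & mask) << i
    return r


# byte-wise (sz_log2=3): indices 7..0 from the LSB end reverse the bytes of RB
reversed_bytes = xperm(0x0001_0203_0405_0607, 0x1122_3344_5566_7788, 3)
# nibble-wise (sz_log2=2): RB acts as a 16-entry table of 4-bit values
table = 0xFEDC_BA98_7654_3210         # identity table
looked_up = xperm(0x0123_4567_89AB_CDEF, table, 2)
```

The nibble form is effectively sixteen parallel 4-bit table lookups in one operation.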
 502
 503 # gorc
 504
 505 based on RV bitmanip
 506
 507 ```
 508 uint32_t gorc32(uint32_t RA, uint32_t RB)
 509 {
 510     uint32_t x = RA;
 511     int shamt = RB & 31;
 512     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 513     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 514     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 515     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 516     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 517     return x;
 518 }
 519 uint64_t gorc64(uint64_t RA, uint64_t RB)
 520 {
 521     uint64_t x = RA;
 522     int shamt = RB & 63;
 523     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 524                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 525     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 526                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 527     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 528                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 529     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 530                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 531     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 532                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 533     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 534                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 535     return x;
 536 }
 537
 538 ```
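
A loop-based Python sketch of `gorc64` shows one typical use: with a shift amount of 7, every bit within a byte is ORed together, so each non-zero byte becomes `0xFF`. The input value is made up for the example.

```python
MASK64 = (1 << 64) - 1


def gorc64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for k in range(6):
        step = 1 << k
        if shamt & step:
            lo = sum(1 << j for j in range(64) if (j & step) == 0)
            # same swap as grev, but ORed into x instead of replacing it
            x |= ((x & lo) << step) | ((x & ~lo & MASK64) >> step)
    return x


nonzero_bytes = gorc64(0x0010_0000_00FF_0100, 7)
```

Inverting the result then marks the zero bytes, which is useful in string-scanning loops.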
 539 # Introduction to Carry-less and GF arithmetic
 540
 541 * obligatory xkcd <https://xkcd.com/2595/>
 542
 543 There are three completely separate types of Galois-Field-based
 544 arithmetic that we implement which are not well explained even in introductory literature.  A slightly oversimplified explanation
 545 is followed by more accurate descriptions:
 546
 547 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
 548   but is accidentally referred to as GF(2) - see below as to why.
 549 * `GF(p)` modulo arithmetic with a Prime number, these are "proper" Galois Fields
 550 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
 551   (2^N) and a second "reducing" polynomial (similar to a prime number), these
 552   are said to be GF(2^N) arithmetic.
 553
 554 further detailed and more precise explanations are provided below
 555
* **Polynomials with coefficients in `GF(2)`**
  (aka. Carry-less arithmetic -- the `cl*` instructions).
  This isn't actually a Galois Field, but its coefficients are. This is
  basically binary integer addition, subtraction, and multiplication like
  usual, except that carries aren't propagated at all, effectively turning
  both addition and subtraction into the bitwise xor operation. Division and
  remainder are defined to match how addition and multiplication work.
* **Galois Fields with a prime size**
  (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
  This is basically just the integers mod `p`.
* **Galois Fields with a power-of-a-prime size**
  (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
  integer `n > 0`).
  We only implement these for `p == 2`, called Binary Galois Fields
  (`GF(2^n)` -- the `gfb*` instructions).
  For any prime `p`, `GF(p^n)` is implemented as polynomials with
  coefficients in `GF(p)` and degree `< n`, where the polynomials are the
  remainders of dividing by a specifically chosen polynomial in `GF(p)` called
  the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
  Polynomial must be an irreducible polynomial (like primes, but for
  polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
  and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
  affect `GF(p^n)`'s mathematical shape, all that changes is the specific
  polynomials used to implement `GF(p^n)`.
 580
 581 Many implementations and much of the literature do not make a clear
 582 distinction between these three categories, which makes it confusing
 583 to understand what their purpose and value is.
 584
 585 * carry-less multiply is extremely common and is used for the ubiquitous
 586   CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
 587 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
 588   has significant uses throughout cryptography
 589 * GF(p) is the basis again of a significant quantity of algorithms
 590   (TODO, list them, jacob knows what they are), even though the
 591   modulo is limited to be below 64-bit (size of a scalar int)
 592
 593 # Instructions for Carry-less Operations
 594
 595 aka. Polynomials with coefficients in `GF(2)`
 596
 597 Carry-less addition/subtraction is simply XOR, so a `cladd`
 598 instruction is not provided since the `xor[i]` instruction can be used instead.
 599
 600 These are operations on polynomials with coefficients in `GF(2)`, with the
 601 polynomial's coefficients packed into integers with the following algorithm:
 602
 603 ```python
 604 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
 605 ```
 606
 607 ## Carry-less Multiply Instructions
 608
 609 based on RV bitmanip
 610 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
 611 <https://www.felixcloutier.com/x86/pclmulqdq> and
 612 <https://en.m.wikipedia.org/wiki/Carry-less_product>
 613
 614 They are worth adding as their own non-overwrite operations
 615 (in the same pipeline).
 616
 617 ### `clmul` Carry-less Multiply
 618
 619 ```python
 620 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
 621 ```
 622
 623 ### `clmulh` Carry-less Multiply High
 624
 625 ```python
 626 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
 627 ```
 628
 629 ### `clmulr` Carry-less Multiply (Reversed)
 630
 631 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
 632 bit-reversed inputs.
 633
 634 ```python
 635 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
 636 ```
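
For readers without access to the inlined reference files, the three multiply variants can be sketched from a single full-width product. This is an illustrative sketch assuming `XLEN = 64`; the helper name `clmul_full` is hypothetical and the code is not the `gf_reference` implementation.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def clmul_full(a, b):
    """Carry-less product of two XLEN-bit values, up to 2*XLEN-1 bits."""
    a &= MASK
    b &= MASK
    x = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:
            x ^= a << i              # XOR instead of add: no carries
        i += 1
    return x


def clmul(a, b):
    return clmul_full(a, b) & MASK                   # low XLEN bits


def clmulh(a, b):
    return (clmul_full(a, b) >> XLEN) & MASK         # high XLEN bits


def clmulr(a, b):
    return (clmul_full(a, b) >> (XLEN - 1)) & MASK   # bits XLEN-1 .. 2*XLEN-2


square = clmul(0b11, 0b11)           # (x+1)*(x+1) = x^2 + 1
```

`clmulr` overlaps `clmulh` by all but one bit, which is what makes it equal to the bit-reversed `clmul` of bit-reversed inputs.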
 637
 638 ## `clmadd` Carry-less Multiply-Add
 639
 640 ```
 641 clmadd RT, RA, RB, RC
 642 ```
 643
 644 ```
 645 (RT) = clmul((RA), (RB)) ^ (RC)
 646 ```
 647
 648 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
 649
 650 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
 651 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
 652 to avoid the need for a temp register. RS is written to as well as RT.
 653
Note: Polynomials over GF(2) are a Ring rather than a Field. The
definition of the Inverse Discrete Fourier Transform involves calculating a
multiplicative inverse, which may not exist in every Ring, so the
Inverse Discrete Fourier Transform may not exist either. (AFAICT the number of inputs
to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
TODO: check with someone who knows for sure if that's correct.)
 660
 661 ```
 662 cltmadd RT, RA, RB, RC
 663 ```
 664
 665 TODO: add link to explanation for where `RS` comes from.
 666
 667 ```
 668 a = (RA)
 669 c = (RC)
 670 # read all inputs before writing to any outputs in case
 671 # an input overlaps with an output register.
 672 (RT) = clmul(a, (RB)) ^ c
 673 (RS) = a ^ c
 674 ```
 675
 676 ## `cldivrem` Carry-less Division and Remainder
 677
 678 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
 679 for other instructions.
 680
 681 ```python
 682 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
 683 ```
 684
 685 ## `cldiv` Carry-less Division
 686
 687 ```
 688 cldiv RT, RA, RB
 689 ```
 690
 691 ```
 692 n = (RA)
 693 d = (RB)
 694 q, r = cldivrem(n, d, width=XLEN)
 695 (RT) = q
 696 ```
 697
 698 ## `clrem` Carry-less Remainder
 699
 700 ```
 701 clrem RT, RA, RB
 702 ```
 703
 704 ```
 705 n = (RA)
 706 d = (RB)
 707 q, r = cldivrem(n, d, width=XLEN)
 708 (RT) = r
 709 ```
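
The division and remainder operations above amount to schoolbook polynomial long division with XOR in place of subtraction. The sketch below is a hypothetical stand-in for `cldivrem`, not the inlined reference: in particular, raising `ZeroDivisionError` for a zero divisor is an assumption of this example, since the behaviour for that case is not stated here.

```python
def cldivrem(n, d, width=64):
    """Return (q, r) with n == clmul(q, d) ^ r and degree(r) < degree(d)."""
    mask = (1 << width) - 1
    n &= mask
    d &= mask
    if d == 0:
        raise ZeroDivisionError("carry-less division by zero")  # assumption
    q = 0
    r = n
    d_deg = d.bit_length() - 1
    # cancel the leading term of r until its degree drops below d's
    while r.bit_length() - 1 >= d_deg:
        shift = r.bit_length() - 1 - d_deg
        q |= 1 << shift
        r ^= d << shift
    return q, r


def cldiv(n, d, width=64):
    return cldivrem(n, d, width)[0]


def clrem(n, d, width=64):
    return cldivrem(n, d, width)[1]


# (x^3 + x^2 + x) divided by (x + 1)
q, r = cldivrem(0b1110, 0b11)
```

Here `q` is `0b101` and `r` is `0b1`, since `(x^2 + 1)(x + 1) + 1 = x^3 + x^2 + x`.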
 710
 711 # Instructions for Binary Galois Fields `GF(2^m)`
 712
 713 see:
 714
 715 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 716 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 717 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 718
 719 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
 720 instruction is not provided since the `xor[i]` instruction can be used instead.
 721
 722 ## `GFBREDPOLY` SPR -- Reducing Polynomial
 723
 724 In order to save registers and to make operations orthogonal with standard
 725 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
 726 This also allows hardware to pre-compute useful parameters (such as the
 727 degree, or look-up tables) based on the reducing polynomial, and store them
 728 alongside the SPR in hidden registers, only recomputing them whenever the SPR
 729 is written to, rather than having to recompute those values for every
 730 instruction.
 731
Galois Fields require the reducing polynomial to be an irreducible
polynomial. Any irreducible polynomial of `degree > 1` must have
the LSB set, since otherwise it would be divisible by the polynomial `x`,
making it reducible, and whatever we're working on would no longer be a Field.
Therefore, we can reuse the LSB to indicate `degree == XLEN`.
 737
 738 ```python
 739 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
 740 ```
 741
 742 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
 743
 744 unless this is an immediate op, `mtspr` is completely sufficient.
 745
 746 ```python
 747 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
 748 ```
 749
 750 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
 751
 752 ```
 753 gfbmul RT, RA, RB
 754 ```
 755
 756 ```python
 757 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
 758 ```
 759
 760 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
 761
 762 ```
 763 gfbmadd RT, RA, RB, RC
 764 ```
 765
 766 ```python
 767 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
 768 ```
 769
 770 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
 771
 772 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
 773 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
 774 for a temp register. RS is written to as well as RT.
 775
 776 ```
 777 gfbtmadd RT, RA, RB, RC
 778 ```
 779
 780 TODO: add link to explanation for where `RS` comes from.
 781
 782 ```
 783 a = (RA)
 784 c = (RC)
 785 # read all inputs before writing to any outputs in case
 786 # an input overlaps with an output register.
 787 (RT) = gfbmadd(a, (RB), c)
 788 # use gfbmadd again since it reduces the result
 789 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
 790 ```
 791
 792 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
 793
 794 ```
 795 gfbinv RT, RA
 796 ```
 797
 798 ```python
 799 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
 800 ```
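
The `gfb*` operations can be illustrated with a small sketch that takes the reducing polynomial as an explicit argument. This is a simplification made for the example: `red_poly` here is a plain integer with bit `m` set for a degree-`m` polynomial, not the `GFBREDPOLY` SPR encoding, and returning 0 as the "inverse" of 0 is likewise an assumption.

```python
def gfbmul(a, b, red_poly):
    """Multiply in GF(2^m) by shift-and-xor, reducing as we go."""
    m = red_poly.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:             # degree reached m: subtract red_poly
            a ^= red_poly
    return result


def gfbmadd(a, b, c, red_poly):
    return gfbmul(a, b, red_poly) ^ c


def gfbinv(a, red_poly):
    """Inverse as a^(2^m - 2), by square-and-multiply."""
    m = red_poly.bit_length() - 1
    exponent = (1 << m) - 2
    result, base = 1, a
    while exponent:
        if exponent & 1:
            result = gfbmul(result, base, red_poly)
        base = gfbmul(base, base, red_poly)
        exponent >>= 1
    return result


AES_POLY = 0x11B                     # x^8 + x^4 + x^3 + x + 1 (Rijndael)
product = gfbmul(0x57, 0x83, AES_POLY)
```

With the Rijndael polynomial, `product` is `0xC1`, the worked multiplication example from the AES standard. Exponentiation is used for the inverse purely for brevity; hardware would more likely use an extended-Euclid style algorithm.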
 801
 802 # Instructions for Prime Galois Fields `GF(p)`
 803
 804 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
 805
 806 ## `gfpadd` Prime Galois Field `GF(p)` Addition
 807
 808 ```
 809 gfpadd RT, RA, RB
 810 ```
 811
 812 ```python
 813 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
 814 ```
 815
 816 the addition happens on infinite-precision integers
 817
 818 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
 819
 820 ```
 821 gfpsub RT, RA, RB
 822 ```
 823
 824 ```python
 825 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
 826 ```
 827
 828 the subtraction happens on infinite-precision integers
 829
 830 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
 831
 832 ```
 833 gfpmul RT, RA, RB
 834 ```
 835
 836 ```python
 837 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
 838 ```
 839
 840 the multiplication happens on infinite-precision integers
 841
 842 ## `gfpinv` Prime Galois Field `GF(p)` Invert
 843
 844 ```
 845 gfpinv RT, RA
 846 ```
 847
 848 Some potential hardware implementations are found in:
 849 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
 850
 851 ```python
 852 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
 853 ```
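
For experimenting in software, one simple way to obtain the inverse is
Fermat's little theorem. This is a sketch only, not the reference algorithm
and not one of the hardware algorithms from the paper above: `gfpinv_model`
is a hypothetical name, the modulus is assumed to be an odd prime passed as
an argument, and the value returned for a zero input (0 here) is an
assumption of this sketch.

```python
def gfpinv_model(a, p):
    # Fermat's little theorem: a^(p-1) == 1 (mod p) for a != 0,
    # so a^(p-2) is the multiplicative inverse of a modulo p.
    return pow(a, p - 2, p)
```

For example, with `p = 17` the inverse of 3 is 6, because `3 * 6 = 18`,
which is 1 modulo 17.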
 854
 855 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
 856
 857 ```
 858 gfpmadd RT, RA, RB, RC
 859 ```
 860
 861 ```python
 862 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
 863 ```
 864
the multiplication and addition happen on infinite-precision integers
 866
 867 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
 868
 869 ```
 870 gfpmsub RT, RA, RB, RC
 871 ```
 872
 873 ```python
 874 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
 875 ```
 876
the multiplication and subtraction happen on infinite-precision integers
 878
 879 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
 880
 881 ```
 882 gfpmsubr RT, RA, RB, RC
 883 ```
 884
 885 ```python
 886 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
 887 ```
 888
the multiplication and subtraction happen on infinite-precision integers
 890
 891 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
 892
Used in combination with SV FFT REMAP to perform
a full Number-Theoretic-Transform in-place. This is possible because the
instruction is 3-in 2-out, which avoids the need for a temporary register:
RS is written to as well as RT.
 897
 898 ```
 899 gfpmaddsubr RT, RA, RB, RC
 900 ```
 901
 902 TODO: add link to explanation for where `RS` comes from.
 903
 904 ```
 905 factor1 = (RA)
 906 factor2 = (RB)
 907 term = (RC)
 908 # read all inputs before writing to any outputs in case
 909 # an input overlaps with an output register.
 910 (RT) = gfpmadd(factor1, factor2, term)
 911 (RS) = gfpmsubr(factor1, factor2, term)
 912 ```
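
The pseudocode above is the usual transform "butterfly": one product is both
added to and subtracted from the same term. A minimal Python sketch follows.
The `*_model` names are hypothetical, the prime is passed as an argument
rather than read from `GFPRIME`, and "reversed" is assumed here to mean
`term - factor1 * factor2`; the reference code for `gfpmsubr` is the
authority on that.

```python
def gfpmadd_model(a, b, c, p):
    return (a * b + c) % p


def gfpmsubr_model(a, b, c, p):
    # assumption: reversed subtraction computes c - a * b
    return (c - a * b) % p


def gfpmaddsubr_model(ra, rb, rc, p):
    # read all inputs before producing either output
    factor1, factor2, term = ra, rb, rc
    rt = gfpmadd_model(factor1, factor2, term, p)
    rs = gfpmsubr_model(factor1, factor2, term, p)
    return rt, rs
```

Under that assumption the two outputs always sum to `2 * term` modulo `p`,
which is a quick sanity check on any implementation.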
 913
 914 # bitmatrix
 915
```
uint64_t bmatflip(uint64_t RA)
{
    // three full shuffles transpose the 8x8 bit matrix
    uint64_t x = RA;
    x = shfl64(x, 31);
    x = shfl64(x, 31);
    x = shfl64(x, 31);
    return x;
}
uint64_t bmatxor(uint64_t RA, uint64_t RB)
{
    // transpose of RB
    uint64_t RBt = bmatflip(RB);
    uint8_t u[8]; // rows of RA
    uint8_t v[8]; // cols of RB
    for (int i = 0; i < 8; i++) {
        u[i] = RA >> (i*8);
        v[i] = RBt >> (i*8);
    }
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        // odd parity of (row AND column) sets the result bit
        if (pcnt(u[i / 8] & v[i % 8]) & 1)
            x |= 1ULL << i; // unsigned: shifting 1LL by 63 is undefined
    }
    return x;
}
uint64_t bmator(uint64_t RA, uint64_t RB)
{
    // transpose of RB
    uint64_t RBt = bmatflip(RB);
    uint8_t u[8]; // rows of RA
    uint8_t v[8]; // cols of RB
    for (int i = 0; i < 8; i++) {
        u[i] = RA >> (i*8);
        v[i] = RBt >> (i*8);
    }
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        // any overlap between row and column sets the result bit
        if ((u[i / 8] & v[i % 8]) != 0)
            x |= 1ULL << i; // unsigned: shifting 1LL by 63 is undefined
    }
    return x;
}
```
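
For quick experiments, the routines above can be modelled in Python. This is
a sketch, not reference code: it assumes (as in the RISC-V bitmanip draft)
that `bmatflip` is an 8x8 bit-matrix transpose, and therefore writes it as a
direct transpose instead of three `shfl64` calls. `_rows` is a hypothetical
helper.

```python
def bmatflip(x):
    """Transpose a 64-bit value viewed as an 8x8 bit matrix."""
    result = 0
    for row in range(8):
        for col in range(8):
            if (x >> (row * 8 + col)) & 1:
                result |= 1 << (col * 8 + row)
    return result


def _rows(x):
    # byte i of x is row i of the matrix
    return [(x >> (i * 8)) & 0xFF for i in range(8)]


def bmatxor(ra, rb):
    u = _rows(ra)            # rows of RA
    v = _rows(bmatflip(rb))  # columns of RB
    x = 0
    for i in range(64):
        # odd parity of (row AND column): matrix product over GF(2)
        if bin(u[i // 8] & v[i % 8]).count("1") & 1:
            x |= 1 << i
    return x


def bmator(ra, rb):
    u = _rows(ra)
    v = _rows(bmatflip(rb))
    x = 0
    for i in range(64):
        if u[i // 8] & v[i % 8]:
            x |= 1 << i
    return x


IDENTITY = 0x8040201008040201  # bit 8*i + i set for i in 0..7
```

Multiplying by `IDENTITY` on either side with `bmatxor` leaves a matrix
unchanged, and `bmatflip` applied twice returns the original value.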
 961
 962 # Already in POWER ISA
 963
 964 ## count leading/trailing zeros with mask
 965
 966 in v3.1 p105
 967
```
count = 0
do i = 0 to 63
    if((RB)[i]=1) then do
        if((RS)[i]=1) then break
        count ← count + 1
    end
end
RA ← EXTZ64(count)
```
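
A small Python model may make the masked count easier to follow. It is a
sketch under two assumptions: bit `i` in the pseudocode uses Power ISA MSB0
numbering (bit 0 is the most significant bit), and the count is incremented
only for mask-selected bits that are zero, which is how the leading-zeros
variant is read here. `cntlzdm_model` is a hypothetical name.

```python
def cntlzdm_model(rs, rb):
    """Count leading zeros of rs, looking only at bits selected by rb."""
    count = 0
    for i in range(64):
        shift = 63 - i  # MSB0 bit i is LSB0 bit 63 - i
        if (rb >> shift) & 1:
            if (rs >> shift) & 1:
                break
            count += 1
    return count
```

With an all-ones mask this degenerates to an ordinary count-leading-zeros;
with mask `0xFF` only the low byte is examined.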
 974
 975 ## bit deposit
 976
pdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
 978
 979     do while(m < 64)
 980        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
 981           result = VSR[VRA+32].dword[i].bit[63-k]
 982           VSR[VRT+32].dword[i].bit[63-m] = result
 983           k = k + 1
 984        m = m + 1
 985
 986 ```
 987
 988 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
 989 {
 990     uint_xlen_t r = 0;
 991     for (int i = 0, j = 0; i < XLEN; i++)
 992         if ((RB >> i) & 1) {
 993             if ((RA >> j) & 1)
 994                 r |= uint_xlen_t(1) << i;
 995             j++;
 996         }
 997     return r;
 998 }
 999
1000 ```
1001
1002 ## bit extract
1003
the other way round from bit deposit: pextd, found in v3.1 p196, is identical to RV bext
1005
1006 ```
1007 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1008 {
1009     uint_xlen_t r = 0;
1010     for (int i = 0, j = 0; i < XLEN; i++)
1011         if ((RB >> i) & 1) {
1012             if ((RA >> i) & 1)
1013                 r |= uint_xlen_t(1) << j;
1014             j++;
1015         }
1016     return r;
1017 }
1018 ```
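
The relationship between deposit and extract is easiest to see by running
both. The following is a direct Python port of the two C routines (`bdep`
from the previous section and `bext` above), offered as a sketch with `XLEN`
fixed at 64 as an assumption:

```python
XLEN = 64


def bdep(ra, rb):
    """Scatter the low bits of ra to the positions where rb has a 1."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> j) & 1:
                r |= 1 << i
            j += 1
    return r


def bext(ra, rb):
    """Gather the bits of ra selected by rb into the low bits."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> i) & 1:
                r |= 1 << j
            j += 1
    return r
```

With mask `0b11010`, depositing `0b101` gives `0b10010`, and extracting with
the same mask recovers `0b101`.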
1019
1020 ## centrifuge
1021
1022 found in v3.1 p106 so not to be added here
1023
```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)[i]=0) then do
        result[ptr0] = (RS)[i]
        ptr0 = ptr0 + 1
    end
    if((RB)[63-i]==1) then do
        result[63-ptr1] = (RS)[63-i]
        ptr1 = ptr1 + 1
    end
end
RA = result
```
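
In words: bits of RS whose mask bit is 0 are packed towards the most
significant end, and bits whose mask bit is 1 are packed towards the least
significant end, each group keeping its original order. A Python sketch of
that reading, assuming MSB0 bit numbering and that each pointer advances only
when a bit is actually placed (`cfuged_model` is a hypothetical name):

```python
def cfuged_model(rs, rb):
    zeros = []  # RS bits whose mask bit is 0, in MSB-first order
    ones = []   # RS bits whose mask bit is 1, in MSB-first order
    for i in range(64):
        shift = 63 - i  # MSB0 bit i
        bit = (rs >> shift) & 1
        if (rb >> shift) & 1:
            ones.append(bit)
        else:
            zeros.append(bit)
    result = 0
    for bit in zeros + ones:
        result = (result << 1) | bit
    return result
```

For example, `cfuged_model(0b1010, 0b0110)` yields `0b1001`: the two
mask-selected bits (`0`, `1`) end up in the low two bit positions.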
1038
1039 ## bit to byte permute
1040
similar to matrix permute in RV bitmanip, which has XOR and OR variants:
these perform a transpose. TODO: this looks like a VSX instruction; check
whether a scalar variant already exists in v3.0/v3.1.
1044
1045     do j = 0 to 7
1046       do k = 0 to 7
1047          b = VSR[VRB+32].dword[i].byte[k].bit[j]
1048          VSR[VRT+32].dword[i].byte[j].bit[k] = b
1049
1050 # Appendix
1051
1052 see [[bitmanip/appendix]]
1053