openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 [[!toc levels=1]]
   4
   5 # Implementation Log
   6
   7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
  11
  12 # bitmanipulation
  13
  14 **DRAFT STATUS**
  15
  16 pseudocode: [[openpower/isa/bitmanip]]
  17
This extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
Also included are DSP/Multimedia operations suitable for
Audio/Video.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, making them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].
  22
  23 When combined with SV, scalar variants of bitmanip operations found in VSX are added so that the Packed SIMD aspects of VSX may be retired as "legacy"
  24 in the far future (10 to 20 years).  Also, VSX is hundreds of opcodes, requires 128 bit pathways, and is wholly unsuited to low power or embedded scenarios.
  25
  26 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
  27
  28 general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  however for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.
  29
  30 note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], and
  31 the [[sv/av_opcodes]] as well as [[sv/setvl]], [[sv/svstep]], [[sv/remap]]
  32
Useful resources:
  34
  35 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  36 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  37
  38 # summary
  39
  40 two major opcodes are needed
  41
  42 ternlog has its own major opcode
  43
|  29.30 |31| name      | Form |
| ------ |--| --------- | ---- |
|   0  0   |Rc| ternlogi  | TLI-Form |
|   0  1   |  | crternlogi | TLI-Form |
|   1 iv   |  | grevlogi | TLI-Form |
  49
  50 2nd major opcode for other bitmanip: minor opcode allocation
  51
  52 |  28.30 |31| name      |
  53 | ------ |--| --------- |
  54 |  -00   |0 | xpermi    |
  55 |  -00   |1 | rsvd      |
  56 |  -01   |0 | grevlog     |
  57 |  -01   |1 | grevlogw     |
  58 |  010   |Rc| bitmask   |
  59 |  011   |  | SVP64  |
  60 |  110   |Rc| 1/2-op    |
  61 |  111   |  | bmrevi   |
  62
  63
  64 1-op and variants
  65
  66 | dest | src1 | subop | op       |
  67 | ---- | ---- | ----- | -------- |
  68 | RT   | RA   | ..    | bmatflip |
  69
  70 2-op and variants
  71
  72 | dest | src1 | src2 | subop | op       |
  73 | ---- | ---- | ---- | ----- | -------- |
  74 | RT   | RA   | RB   | or    | bmatflip |
  75 | RT   | RA   | RB   | xor   | bmatflip |
  76 | RT   | RA   | RB   |       | grev  |
  77 | RT   | RA   | RB   |       | clmul\*  |
  78 | RT   | RA   | RB   |       | gorc |
  79 | RT   | RA   | RB   | shuf  | shuffle |
  80 | RT   | RA   | RB   | unshuf| shuffle |
  81 | RT   | RA   | RB   | width | xperm  |
  82 | RT   | RA   | RB   | type | av minmax |
  83 | RT   | RA   | RB   |      | av abs avgadd  |
  84 | RT   | RA   | RB   | type | vmask ops |
  85 | RT   | RA   | RB   | type | abs accumulate (overwrite)  |
  86
  87 3 ops
  88
  89 * grevlog[w]
  90 * GF mul-add
  91 * bitmask-reverse
  92
  93 TODO: convert all instructions to use RT and not RS
  94
  95 | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name     | Form    |
  96 | -- | -- | --- | ---  | -----   | --------  |--| ------   | -------- |
  97 | NN | RT | RA  |itype/| im0-4   | im5-7  00 |0 | xpermi  |           |
  98 | NN | RT | RA  | RB   | RC      | nh 00  00 |1 | binlut |           |
  99 | NN | RT | RA  | RB   | BFC//   | 0  01  00 |1 | bincrflut    |           |
 100 | NN |    |     |      |         | 1  01  00 |1 | rsvd    |           |
 101 | NN |    |     |      |         | -  10  00 |1 | rsvd    |           |
 102 | NN |    |     |      |         | 0  11  00 |1 | svshape    |           |
 103 | NN |    |     |      |         | 1  11  00 |1 | svstep   |           |
 104 | NN | RT | RA  | RB   | im0-4   | im5-7  01 |0 | grevlog |           |
 105 | NN | RT | RA  | RB   | im0-4   | im5-7  01 |1 | grevlogw |           |
 106 | NN | RT | RA  | RB   | RC      | mode  010 |Rc| bitmask\* |           |
 107 | NN |    |     |      |         | 0-    011 |  | rsvd    |           |
 108 | NN |    |     |      |         | 10    011 |Rc| svstep |           |
 109 | NN |    |     |      |         | 11    011 |Rc| setvl |           |
 110 | NN |    |     |      |         | ----  110 |  | 1/2 ops | other table |
 111 | NN | RT | RA  | RB   | sh0-4   | sh5 1 111 |Rc| bmrevi |           |
 112
ops (note that av avg and abs, as well as the vec scalar mask
operations from [[sv/vector_ops]] and
the [[sv/av_opcodes]], are included here)
 116
 117 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name      |  Form   |
 118 | -- | -- | --- | --- | -- | ----- | -------- |--| ----      | ------- |
 119 | NN | RS | me  | sh  | SH | ME 0  | nn00 110 |Rc| bmopsi    | {TODO}  |
 120 | NN | RS | RA  | sh  | SH | 0   1 | nn00 110 |Rc| bmopsi    | XB-Form |
 121 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| cldiv     | X-Form  |
 122 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| clmod     | X-Form  |
 123 | NN | RT | RA  |     | 1  |  10   | 0001 110 |Rc| clmulh    | X-Form  |
 124 | NN | RT | RA  | RB  | 1  |  11   | 0001 110 |Rc| clmul     | X-Form  |
 125 | NN | RT | RA  | RB  | 0  |   00  | 0001 110 |Rc| vec sbfm  | X-Form  |
 126 | NN | RT | RA  | RB  | 0  |   01  | 0001 110 |Rc| vec sofm  | X-Form  |
 127 | NN | RT | RA  | RB  | 0  |   10  | 0001 110 |Rc| vec sifm  | X-Form  |
 128 | NN | RT | RA  | RB  | 0  |   11  | 0001 110 |Rc| vec cprop | X-Form  |
 129 | NN |    |     |     |    |   -0  | 0101 110 |Rc| crfbinlog | {TODO}  |
 130 | NN |    |     |     |    |   -1  | 0101 110 |Rc| rsvd      |         |
 131 | NN | RT | RA  | RB  | 0  | itype | 1001 110 |Rc| av minmax | X-Form  |
 132 | NN | RT | RA  | RB  | 1  |   00  | 1001 110 |Rc| av abss   | X-Form  |
 133 | NN | RT | RA  | RB  | 1  |   01  | 1001 110 |Rc| av absu   | X-Form  |
 134 | NN | RT | RA  | RB  | 1  |   10  | 1001 110 |Rc| av avgadd | X-Form  |
 135 | NN |    |     |     | 1  |   11  | 1001 110 |Rc| rsvd      |         |
 136 | NN | RT | RA  | RB  | 0  | itype | 1101 110 |Rc| shadd     | {TODO}  |
 137 | NN | RT | RA  | RB  | 1  | itype | 1101 110 |Rc| shadduw   | {TODO}  |
 138 | NN | RT | RA  | RB  | 0  | 00    | 0010 110 |Rc| gorc      | X-Form  |
 139 | NN | RS | RA  | sh  | SH | 00    | 1010 110 |Rc| gorci     | XB-Form |
 140 | NN | RT | RA  | RB  | 0  | 00    | 0110 110 |Rc| gorcw     | X-Form  |
 141 | NN | RS | RA  | SH  | 0  | 00    | 1110 110 |Rc| gorcwi    | X-Form  |
 142 | NN | RT | RA  | RB  | 1  | 00    | 1110 110 |Rc| rsvd      |         |
 143 | NN | RT | RA  | RB  | 0  | 01    | 0010 110 |Rc| grev      | X-Form  |
 144 | NN | RT | RA  | RB  | 1  | 01    | 0010 110 |Rc| clmulr    | X-Form  |
 145 | NN | RS | RA  | sh  | SH | 01    | 1010 110 |Rc| grevi     | XB-Form |
 146 | NN | RT | RA  | RB  | 0  | 01    | 0110 110 |Rc| grevw     | X-Form  |
 147 | NN | RS | RA  | SH  | 0  | 01    | 1110 110 |Rc| grevwi    | X-Form  |
 148 | NN | RT | RA  | RB  | 1  | 01    | 1110 110 |Rc| rsvd      |         |
 149 | NN | RS | RA  | RB  | 0  | 10    | 0010 110 |Rc| bmator    | X-Form  |
 150 | NN | RS | RA  | RB  | 0  | 10    | 0110 110 |Rc| bmatand   | X-Form  |
 151 | NN | RS | RA  | RB  | 0  | 10    | 1010 110 |Rc| bmatxor   | X-Form  |
 152 | NN | RS | RA  | RB  | 0  | 10    | 1110 110 |Rc| bmatflip  | X-Form  |
 153 | NN | RT | RA  | RB  | 1  | 10    | 0010 110 |Rc| xpermn    | X-Form  |
 154 | NN | RT | RA  | RB  | 1  | 10    | 0110 110 |Rc| xpermb    | X-Form  |
 155 | NN | RT | RA  | RB  | 1  | 10    | 1010 110 |Rc| xpermh    | X-Form  |
 156 | NN | RT | RA  | RB  | 1  | 10    | 1110 110 |Rc| xpermw    | X-Form  |
 157 | NN | RT | RA  | RB  | 0  | 11    | 1110 110 |Rc| abssa     | X-Form  |
 158 | NN | RT | RA  | RB  | 1  | 11    | 1110 110 |Rc| absua     | X-Form  |
 159 | NN |    |     |     |    |       | --11 110 |Rc| rsvd      |         |
 160
 161 # binary and ternary bitops
 162
Similar to FPGA LUTs: for every bit, perform a lookup into a table using an 8-bit immediate (for the ternary instructions), or a 4-bit table held in another register
(for the binary instructions).  The binary lookup instructions have CR Field
lookup variants due to CR Fields being 4 bit.
 166
 167 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 168
 169 ## ternlogi
 170
 171 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
 172 | -- | -- | --- | --- | ----- | --- |--|
 173 | NN | RT | RA  | RB  | im0-7 |  00 |Rc|
 174
 175     lut3(imm, a, b, c):
 176         idx = c << 2 | b << 1 | a
 177         return imm[idx] # idx by LSB0 order
 178
 179     for i in range(64):
 180         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
 181
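As a minimal sketch of the pseudocode above (Python is used purely for illustration; the function names and test values are assumptions, not part of the spec), the per-bit lookup can be modelled on whole 64-bit integers. The two immediates shown are symmetric truth tables, so they do not depend on the operand order: `0x96` is a three-way XOR and `0xE8` a majority vote.

```python
def lut3(imm, a, b, c):
    """Look up one bit of the 8-bit table `imm`, indexed in LSB0 order."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(rt, ra, rb, imm):
    """Model of ternlogi: RT is both the third input and the destination."""
    result = 0
    for i in range(64):
        bit = lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1)
        result |= bit << i
    return result

x, y, z = 0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x0F0F0F0F0F0F0F0F
xor3 = ternlogi(z, x, y, 0x96)   # bits set at indices with odd parity
maj3 = ternlogi(z, x, y, 0xE8)   # bits set at indices with two or more ones
```
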
 182 ## binlut
 183
 184 Binary lookup is a dynamic LUT2 version of ternlogi. Firstly, the
 185 lookup table is 4 bits wide not 8 bits, and secondly the lookup
 186 table comes from a register not an immediate.
 187
 188 | 0.5|6.10|11.15|16.20| 21..25|26..30 |31|
 189 | -- | -- | --- | --- | ----- |------ |--|
 190 | NN | RT | RA  | RB  | RC    |nh 0000|1 |
 191 | NN | RT | RA  | RB  | BFA// |0  0100|1 |
 192
 193 For binlut:
 194
 195     lut2(imm, a, b):
 196         idx = b << 1 | a
 197         return imm[idx] # idx by LSB0 order
 198
 199     imm = (RC>>(nh*4))&0b1111
 200     for i in range(64):
 201         RT[i] = lut2(imm, RB[i], RA[i])
 202
 203 For bincrlut, `BFA` selects the 4-bit CR Field as the LUT2:
 204
 205     for i in range(64):
 206         RT[i] = lut2(CRs{BFA}, RB[i], RA[i])
 207
 208 *Programmer's note: a dynamic ternary lookup may be synthesised from
 209 a pair of `binlut` instructions followed by a `ternlogi` to select which
 210 to merge. Use `nh` to select which nibble to use as the lookup table
 211 from the RC source register (`nh=1` nibble high)*
 212
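The programmer's note above can be sketched as follows. This is an illustrative model only: `dynamic_lut3` is a hypothetical helper, and the choice of `0xCA` as the merging immediate is derived here from the `lut3` index order rather than taken from the spec. The low nibble of the table covers the case where the third input is 0, the high nibble the case where it is 1, and `ternlogi` then acts as a per-bit multiplexer.

```python
def lut2(imm, a, b):
    return (imm >> ((b << 1) | a)) & 1

def lut3(imm, a, b, c):
    return (imm >> ((c << 2) | (b << 1) | a)) & 1

def binlut(ra, rb, rc, nh):
    imm = (rc >> (nh * 4)) & 0b1111
    return sum(lut2(imm, (rb >> i) & 1, (ra >> i) & 1) << i
               for i in range(64))

def ternlogi(rt, ra, rb, imm):
    return sum(lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1) << i
               for i in range(64))

def dynamic_lut3(table, a, b, c):
    """Hypothetical helper: 8-bit `table` held in a register."""
    lo = binlut(b, a, table, nh=0)   # result if the c bit is 0
    hi = binlut(b, a, table, nh=1)   # result if the c bit is 1
    # 0xCA is the truth table of "RT ? RA : RB" (a per-bit mux);
    # note that RT (here c) is overwritten by the real instruction.
    return ternlogi(c, hi, lo, 0xCA)
```

The sequence costs three instructions and consumes the selector register, which is one reason a single-instruction dynamic LUT3 is described as costly rather than impossible.
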
 213 ## crternlogi
 214
 215 another mode selection would be CRs not Ints.
 216
 217 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
 218 | -- | -- | --- | --- | --- |-----|----- | -----|--|
 219 | NN | BT | BA  | BB  | BC  |m0-2 | imm  |  01  |m3|
 220
    mask = m0-2,m3
    for i in range(4):
        a,b,c = CRs[BA][i], CRs[BB][i], CRs[BC][i]
        if mask[i]: CRs[BT][i] = lut3(imm, a, b, c)
 225
 226 ## crbinlog
 227
 228 With ternary (LUT3) dynamic instructions being very costly,
 229 and CR Fields being only 4 bit, a binary (LUT2) variant is better
 230
 231 | 0.5|6.8 | 9.11|12.14|15.17|18.22|23...30  |31|
 232 | -- | -- | --- | --- | --- |-----| --------|--|
 233 | NN | BT | BA  | BB  | BC  |m0-m2|00101110 |m3|
 234
    mask = m0-2,m3
    for i in range(4):
        a,b = CRs[BA][i], CRs[BB][i]
        if mask[i]: CRs[BT][i] = lut2(CRs[BC], a, b)
 239
 240 *Programmer's note: just as with binlut and ternlogi, a pair
 241  of crbinlog instructions followed by a merging crternlogi may
 242  be deployed to synthesise dynamic ternary (LUT3) CR Field
 243  manipulation*
 244
 245 # int ops
 246
## min/max
 248
 249 required for the [[sv/av_opcodes]]
 250
signed and unsigned min/max for integer.  this is sort-of partly synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not both signed and unsigned.  when the dest is also one of the sources and the mv fails due to the CR bittest failing, this will only overwrite the dest where the src is greater (or less).
 252
 253 signed/unsigned min/max gives more flexibility.
 254
 255 ```
 256 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 257 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 258 }
 259 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 260 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 261 }
 262 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 263 { return rs1 < rs2 ? rs1 : rs2;
 264 }
 265 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 266 { return rs1 > rs2 ? rs1 : rs2;
 267 }
 268 ```
 269
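The difference between the signed and unsigned variants only shows up when the top bit is set. A small Python model (the names `smin`/`smax` and the fixed `XLEN = 64` are assumptions made to avoid shadowing Python builtins) illustrates this with the bit pattern of -1:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def to_signed(x):
    """Reinterpret an XLEN-bit register value as two's complement."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x

def smin(a, b):
    return a if to_signed(a) < to_signed(b) else b

def smax(a, b):
    return a if to_signed(a) > to_signed(b) else b

def minu(a, b):
    return a if a < b else b

def maxu(a, b):
    return a if a > b else b

minus_one = MASK                      # all ones: -1 signed, 2**64-1 unsigned
signed_pick = smin(minus_one, 1)      # selects the all-ones pattern
unsigned_pick = minu(minus_one, 1)    # selects 1
```
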
 270 ## average
 271
 272 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 273 but not scalar
 274
 275 ```
uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
     return (rs1 + rs2 + 1) >> 1;
}
 279 ```
 280
 281 ## abs
 282
 283 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 284 but not scalar
 285
 286 ```
uint_xlen_t intabs(uint_xlen_t rs1, uint_xlen_t rs2) {
     return (rs1 > rs2) ? (rs1-rs2) : (rs2-rs1);
}
 290 ```
 291
 292 ## abs-accumulate
 293
 294 required for the [[sv/av_opcodes]], these are needed for motion estimation.
 295 both are overwrite on RS.
 296
 297 ```
uint_xlen_t uintabsacc(uint_xlen_t rs, uint_xlen_t ra, uint_xlen_t rb) {
     return rs + ((ra > rb) ? (ra-rb) : (rb-ra));
}
uint_xlen_t intabsacc(uint_xlen_t rs, int_xlen_t ra, int_xlen_t rb) {
     return rs + ((ra > rb) ? (ra-rb) : (rb-ra));
}
 304 ```
 305
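For motion estimation the typical use is a sum of absolute differences (SAD) over two pixel blocks. The following Python sketch models the unsigned variant only; `sad` and the sample pixel values are illustrative assumptions, and the accumulator is simply truncated to `XLEN` bits.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def absdiff_u(ra, rb):
    """Unsigned absolute difference."""
    return (ra - rb) if ra > rb else (rb - ra)

def absacc_u(rs, ra, rb):
    """Unsigned abs-accumulate: RS is both source and destination."""
    return (rs + absdiff_u(ra, rb)) & MASK

def sad(block_a, block_b):
    """Hypothetical helper: one abs-accumulate per pixel pair."""
    acc = 0
    for pa, pb in zip(block_a, block_b):
        acc = absacc_u(acc, pa, pb)
    return acc

total = sad([10, 20, 30, 40], [12, 18, 30, 45])   # 2 + 2 + 0 + 5
```
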
For SVP64, the twin Elwidths allow e.g. a 16 bit accumulator for 8 bit
 307 differences.  Form is `RM-1P-3S1D` where RS-as-source has a separate
 308 SVP64 designation from RS-as-dest. This gives a limited range of
 309 non-overwrite capability.
 310
 311 # shift-and-add
 312
Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
Adding more LD/ST instructions is too complex, so a compromise is to add shift-and-add,
which replaces a pair of explicit instructions in hot-loops.
 316
 317 ```
 318 uint_xlen_t shadd(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 319     return (rs1 << (sh+1)) + rs2;
 320 }
 321
 322 uint_xlen_t shadduw(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 323     uint_xlen_t rs1z = rs1 & 0xFFFFFFFF;
 324     return (rs1z << (sh+1)) + rs2;
 325 }
 326 ```
 327
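A typical hot-loop use is computing the address of an array element. The sketch below is a Python model of the two functions above with results truncated to 64 bits; the base address and index are made-up values.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def shadd(rs1, rs2, sh):
    """Shift rs1 left by sh+1, then add rs2."""
    return ((rs1 << (sh + 1)) + rs2) & MASK

def shadduw(rs1, rs2, sh):
    """As shadd, but rs1 is first zero-extended from its low 32 bits."""
    return (((rs1 & 0xFFFFFFFF) << (sh + 1)) + rs2) & MASK

base, index = 0x10000, 5
addr = shadd(index, base, 2)   # 8-byte elements: shift amount is sh+1 == 3
```

With `sh=2` the index is scaled by 8, so a following plain LD can fetch a 64-bit element without a separate shift instruction.
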
 328 # cmix
 329
 330 based on RV bitmanip, covered by ternlog bitops
 331
 332 ```
 333 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 334     return (RA & RB) | (RC & ~RB);
 335 }
 336 ```
 337
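To illustrate "covered by ternlog", the 8-bit truth table for cmix can be derived by enumerating all eight input combinations. This is a sketch under the `lut3` index convention (`idx = c<<2 | b<<1 | a`); `lut3_imm` and `lut3_word` are hypothetical helpers, and the mapping of a, b, c onto actual ternlogi operands would still need checking against the instruction's operand order.

```python
MASK64 = (1 << 64) - 1

def cmix(ra, rb, rc):
    return (ra & rb) | (rc & ~rb)

def lut3_imm(fn):
    """Build the 8-bit table for a 1-bit function fn(a, b, c)."""
    imm = 0
    for idx in range(8):
        a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        imm |= (fn(a, b, c) & 1) << idx
    return imm

def lut3_word(imm, a, b, c, width=64):
    """Apply the table bitwise across `width`-bit operands."""
    out = 0
    for i in range(width):
        idx = (((c >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((a >> i) & 1)
        out |= ((imm >> idx) & 1) << i
    return out

CMIX_IMM = lut3_imm(cmix)   # evaluates to 0xB8 under this convention
```
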
 338
 339 # bitmask set
 340
 341 based on RV bitmanip singlebit set, instruction format similar to shift
 342 [[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl but only immediate version).
 343 however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.
 344
 345 bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, where rldicl is a left rotate.  for the immediate version this does not matter, so a bmexti is not required.
For bmrev, however, there is no direct equivalent, and consequently a bmrevi is required.
 347
 348 bmset (register for mask amount) is particularly useful for creating
 349 predicate masks where the length is a dynamic runtime quantity.
 350 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 351
 352 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 353 | -- | -- | --- | --- | --- | ------- |--| ----- |
 354 | NN | RS | RA  | RB  | RC  | mode 010 |Rc| bm\*   |
 355
 356 Immediate-variant is an overwrite form:
 357
 358 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 359 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 360 | NN | RS | RB  | sh  | SH | itype | 1000 110 |Rc| bm\*i |
 361
 362 ```
 363 def MASK(x, y):
 364      if x < y:
 365          x = x+1
 366          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 367          mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
 368      elif x == y:
 369          return 1 << x
 370      else:
 371          x = x+1
 372          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 373          mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
 374      return mask_a ^ mask_b
 375
 376
 377 uint_xlen_t bmset(RS, RB, sh)
 378 {
 379     int shamt = RB & (XLEN - 1);
 380     mask = (2<<sh)-1;
 381     return RS | (mask << shamt);
 382 }
 383
 384 uint_xlen_t bmclr(RS, RB, sh)
 385 {
 386     int shamt = RB & (XLEN - 1);
 387     mask = (2<<sh)-1;
 388     return RS & ~(mask << shamt);
 389 }
 390
 391 uint_xlen_t bminv(RS, RB, sh)
 392 {
 393     int shamt = RB & (XLEN - 1);
 394     mask = (2<<sh)-1;
 395     return RS ^ (mask << shamt);
 396 }
 397
 398 uint_xlen_t bmext(RS, RB, sh)
 399 {
 400     int shamt = RB & (XLEN - 1);
 401     mask = (2<<sh)-1;
 402     return mask & (RS >> shamt);
 403 }
 404 ```
 405
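The four immediate-form operations above differ only in how the shifted mask is combined with RS. A Python model (a sketch with `XLEN = 64` assumed; `_field` is a hypothetical helper) makes the `sh+1`-bit field width explicit:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def _field(rb, sh):
    """Return (mask of sh+1 ones, shift amount taken from RB)."""
    shamt = rb & (XLEN - 1)
    mask = (2 << sh) - 1
    return mask, shamt

def bmset(rs, rb, sh):
    mask, shamt = _field(rb, sh)
    return (rs | (mask << shamt)) & MASK

def bmclr(rs, rb, sh):
    mask, shamt = _field(rb, sh)
    return rs & ~(mask << shamt) & MASK

def bminv(rs, rb, sh):
    mask, shamt = _field(rb, sh)
    return (rs ^ (mask << shamt)) & MASK

def bmext(rs, rb, sh):
    mask, shamt = _field(rb, sh)
    return mask & (rs >> shamt)

pred = bmset(0, 4, 3)          # four ones starting at bit 4
byte = bmext(0xABCD, 4, 7)     # eight bits starting at bit 4
```
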
 406 bitmask extract with reverse.  can be done by bit-order-inverting all of RB and getting bits of RB from the opposite end.
 407
 408 when RA is zero, no shift occurs. this makes bmextrev useful for
 409 simply reversing all bits of a register.
 410
 411 ```
 412 msb = ra[5:0];
 413 rev[0:msb] = rb[msb:0];
 414 rt = ZE(rev[msb:0]);
 415
uint_xlen_t bmextrev(RA, RB, sh)
{
    int shamt = XLEN-1;
    if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
    shamt = (XLEN-1)-shamt;  // shift other end
    bra = bitreverse(RB);    // swap LSB-MSB
    mask = (2<<sh)-1;
    return mask & (bra >> shamt);
}
 425 ```
 426
 427 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 428 | -- | -- | --- | --- | --- | ------- |--| ------ |
 429 | NN | RT | RA  | RB  | sh  | 1   011 |Rc| bmrevi |
 430
 431
 432 # grevlut
 433
 434 generalised reverse combined with a pair of LUT2s and allowing
 435 a constant `0b0101...0101` when RA=0, and an option to invert
 436 (including when RA=0, giving a constant 0b1010...1010 as the
 437 initial value) provides a wide range of instructions
 438 and a means to set hundreds of regular 64 bit patterns with one
 439 single 32 bit instruction.
 440
 441 the two LUT2s are applied left-half (when not swapping)
 442 and right-half (when swapping) so as to allow a wider
 443 range of options.
 444
 445 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 446
 447 * A value of `0b11001010` for the immediate provides
 448 the functionality of a standard "grev".
 449 * `0b11101110` provides gorc
 450
grevlut should be arranged so as to produce the constants
needed to put into bext (bitextract) so as in turn to
be able to emulate the x86 pmovmskb instruction <https://www.felixcloutier.com/x86/pmovmskb>.
This only requires 2 instructions (grevlut, bext).
 455
 456 Note that if the mask is required to be placed
 457 directly into CR Fields (for use as CR Predicate
masks rather than an integer mask) then sv.cmpi or sv.ori
 459 may be used instead, bearing in mind that sv.ori
 460 is a 64-bit instruction, and `VL` must have been
 461 set to the required length:
 462
 463     sv.ori./elwid=8 r10.v, r10.v, 0
 464
 465 The following settings provide the required mask constants:
 466
 467 | RA       | RB      | imm        | iv | result        |
 468 | -------  | ------- | ---------- | -- | ----------    |
 469 | 0x555..  | 0b10    | 0b01101100 | 0  | 0x111111...   |
 470 | 0x555..  | 0b110   | 0b01101100 | 0  | 0x010101...   |
 471 | 0x555..  | 0b1110  | 0b01101100 | 0  | 0x00010001...   |
 472 | 0x555..  | 0b10    | 0b11000110 | 1  | 0x88888...   |
 473 | 0x555..  | 0b110   | 0b11000110 | 1  | 0x808080...   |
 474 | 0x555..  | 0b1110  | 0b11000110 | 1  | 0x80008000...   |
 475
 476 Better diagram showing the correct ordering of shamt (RB).  A LUT2
 477 is applied to all locations marked in red using the first 4
 478 bits of the immediate, and a separate LUT2 applied to all
 479 locations in green using the upper 4 bits of the immediate.
 480
 481 <img src="/openpower/sv/grevlut.png" width=700 />
 482
 483 demo code [[openpower/sv/grevlut.py]]
 484
 485 ```
 486 lut2(imm, a, b):
 487     idx = b << 1 | a
 488     return imm[idx] # idx by LSB0 order
 489
dorow(imm8, step_i, chunk_size, is32b):
 491     for j in 0 to 31 if is32b else 63:
 492         if (j&chunk_size) == 0
 493            imm = imm8[0..3]
 494         else
 495            imm = imm8[4..7]
 496         step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
 497     return step_o
 498
 499 uint64_t grevlut(uint64_t RA, uint64_t RB, uint8 imm, bool iv, bool is32b)
 500 {
 501     uint64_t x = 0x5555_5555_5555_5555;
 502     if (RA != 0) x = GPR(RA);
 503     if (iv) x = ~x;
    int shamt = RB & (31 if is32b else 63)
 505     for i in 0 to (6-is32b)
 506         step = 1<<i
 507         if (shamt & step) x = dorow(imm, x, step, is32b)
 508     return x;
 509 }
 510
 511 ```
 512
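A runnable Python reading of the pseudocode above, offered as a sketch only. `ra_value=None` is an assumed way of modelling RA=0, and masking the result to 32 bits when `is32b` is set is likewise an assumption, since the pseudocode leaves the upper half unspecified. The demo code at [[openpower/sv/grevlut.py]] remains the reference.

```python
def lut2(imm, a, b):
    return (imm >> ((b << 1) | a)) & 1

def dorow(imm8, step_i, chunk_size, is32b=False):
    step_o = 0
    for j in range(32 if is32b else 64):
        # low nibble where the chunk bit of j is clear, high nibble otherwise
        imm = (imm8 & 0xF) if (j & chunk_size) == 0 else (imm8 >> 4)
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o

def grevlut(ra_value, rb, imm8, iv=False, is32b=False):
    """ra_value=None models RA=0, selecting the 0x5555... constant."""
    width = 32 if is32b else 64
    x = 0x5555_5555_5555_5555 if ra_value is None else ra_value
    if iv:
        x = ~x
    x &= (1 << width) - 1
    shamt = rb & (width - 1)
    for i in range(5 if is32b else 6):
        step = 1 << i
        if shamt & step:
            x = dorow(imm8, x, step, is32b)
    return x

# First and fourth rows of the mask-constant table.
low_bits = grevlut(None, 0b10, 0b01101100)              # 0x1111...
high_bits = grevlut(None, 0b10, 0b11000110, iv=True)    # 0x8888...
```

Under this reading, all six rows of the mask-constant table are reproduced, and an immediate of `0b11101110` (OR in both LUT2s) behaves as gorc.
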
| 0.5|6.10|11.15|16.20 |21..28   | 29.30|31| name |
| -- | -- | --- | ---  | -----   | -----|--| ------ |
| NN | RT | RA  | s0-4 | im0-7   | 1 iv |s5| grevlogi |
| NN | RT | RA  | RB   | im0-7   | 01   |0 | grevlog |
| NN | RT | RA  | RB   | im0-7   | 01   |1 | grevlogw |
 518
 519 # grev
 520
superseded by grevlut
 522
 523 based on RV bitmanip, this is also known as a butterfly network. however
 524 where a butterfly network allows setting of every crossbar setting in
 525 every row and every column, generalised-reverse (grev) only allows
 526 a per-row decision: every entry in the same row must either switch or
 527 not-switch.
 528
 529 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 530
 531 ```
 532 uint64_t grev64(uint64_t RA, uint64_t RB)
 533 {
 534     uint64_t x = RA;
 535     int shamt = RB & 63;
 536     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 537                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 538     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 539                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 540     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 541                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 542     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 543                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 544     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 545                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 546     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 547                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 548     return x;
 549 }
 550
 551 ```
 552
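The six stages differ only in the swap distance, so the function can be written as a loop. The Python sketch below is an equivalent formulation rather than the spec's own code; the per-stage mask is computed as `MASK64 // (2**step + 1)`, which yields `0x5555...`, `0x3333...`, `0x0F0F...` and so on.

```python
MASK64 = (1 << 64) - 1

def grev64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            # ones in the low `step` bits of every 2*step-bit group
            lo = MASK64 // ((1 << step) + 1)
            x = ((x & lo) << step) | ((x >> step) & lo)
    return x

bswap = grev64(0x0102030405060708, 56)   # stages 8, 16, 32: byte reversal
bitrev = grev64(1, 63)                   # all stages: full bit reversal
```

Because each stage is its own inverse and the stages commute, applying grev twice with the same shamt returns the original value.
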
 553 # gorc
 554
based on RV bitmanip, gorc is superseded by grevlut
 556
 557 ```
 558 uint32_t gorc32(uint32_t RA, uint32_t RB)
 559 {
 560     uint32_t x = RA;
 561     int shamt = RB & 31;
 562     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 563     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 564     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 565     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 566     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 567     return x;
 568 }
 569 uint64_t gorc64(uint64_t RA, uint64_t RB)
 570 {
 571     uint64_t x = RA;
 572     int shamt = RB & 63;
 573     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 574                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 575     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 576                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 577     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 578                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 579     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 580                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 581     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 582                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 583     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 584                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 585     return x;
 586 }
 587
 588 ```
 589
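A loop-form Python sketch of `gorc64`, equivalent in structure to the C above (the mask computation is an implementation convenience, not part of the spec). With a shamt of 7 every byte that contains any set bit becomes `0xFF`, which is a common way to build byte-granular masks.

```python
MASK64 = (1 << 64) - 1

def gorc64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            lo = MASK64 // ((1 << step) + 1)
            # OR each bit with its partner instead of swapping
            x |= ((x & lo) << step) | ((x >> step) & lo)
    return x

byte_mask = gorc64(0x0000010000800000, 7)
```
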
 590 # xperm
 591
 592 based on RV bitmanip.
 593
RA contains a vector of indices to select parts of RB to be
copied to RT.  The immediate-variant allows up to an 8 bit
pattern (repeated) to be targeted at different parts of RT.

xperm shares some similarity with one of the uses of bmator
in that xperm indices are binary addressing where bmator
may be considered to be unary addressing.
 601
 602 ```
 603 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
 604 {
 605     uint_xlen_t r = 0;
 606     uint_xlen_t sz = 1LL << sz_log2;
 607     uint_xlen_t mask = (1LL << sz) - 1;
    uint_xlen_t RA = (uint_xlen_t)imm8 * 0x0101010101010101ULL; // imm8 replicated into every byte (XLEN=64)
 609     for (int i = 0; i < XLEN; i += sz) {
 610         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 611         if (pos < XLEN)
 612             r |= ((RB >> pos) & mask) << i;
 613     }
 614     return r;
 615 }
 616 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 617 {
 618     uint_xlen_t r = 0;
 619     uint_xlen_t sz = 1LL << sz_log2;
 620     uint_xlen_t mask = (1LL << sz) - 1;
 621     for (int i = 0; i < XLEN; i += sz) {
 622         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 623         if (pos < XLEN)
 624             r |= ((RB >> pos) & mask) << i;
 625     }
 626     return r;
 627 }
 628 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 629 {  return xperm(RA, RB, 2); }
 630 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 631 {  return xperm(RA, RB, 3); }
 632 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 633 {  return xperm(RA, RB, 4); }
 634 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 635 {  return xperm(RA, RB, 5); }
 636 ```
 637
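A Python model of `xperm` (a sketch with `XLEN = 64` assumed) shows the index semantics: each element of RA names the element of RB to copy, and an out-of-range index yields zero. The index vector below reverses the byte order.

```python
XLEN = 64

def xperm(ra, rb, sz_log2):
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    r = 0
    for i in range(0, XLEN, sz):
        pos = ((ra >> i) & mask) << sz_log2   # bit offset of selected element
        if pos < XLEN:
            r |= ((rb >> pos) & mask) << i
    return r

def xpermi(imm8, rb, sz_log2):
    """Immediate form: the 8-bit pattern is repeated across all bytes."""
    return xperm(imm8 * 0x0101_0101_0101_0101, rb, sz_log2)

data = 0x0102030405060708
reversed_bytes = xperm(0x0001020304050607, data, 3)   # byte i takes byte 7-i
```
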
 638 # bitmatrix
 639
 640 ```
 641 uint64_t bmatflip(uint64_t RA)
 642 {
 643     uint64_t x = RA;
 644     x = shfl64(x, 31);
 645     x = shfl64(x, 31);
 646     x = shfl64(x, 31);
 647     return x;
 648 }
 649 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 650 {
 651     // transpose of RB
 652     uint64_t RBt = bmatflip(RB);
 653     uint8_t u[8]; // rows of RA
 654     uint8_t v[8]; // cols of RB
 655     for (int i = 0; i < 8; i++) {
 656         u[i] = RA >> (i*8);
 657         v[i] = RBt >> (i*8);
 658     }
 659     uint64_t x = 0;
 660     for (int i = 0; i < 64; i++) {
 661         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 662             x |= 1LL << i;
 663     }
 664     return x;
 665 }
 666 uint64_t bmator(uint64_t RA, uint64_t RB)
 667 {
 668     // transpose of RB
 669     uint64_t RBt = bmatflip(RB);
 670     uint8_t u[8]; // rows of RA
 671     uint8_t v[8]; // cols of RB
 672     for (int i = 0; i < 8; i++) {
 673         u[i] = RA >> (i*8);
 674         v[i] = RBt >> (i*8);
 675     }
 676     uint64_t x = 0;
 677     for (int i = 0; i < 64; i++) {
 678         if ((u[i / 8] & v[i % 8]) != 0)
 679             x |= 1LL << i;
 680     }
 681     return x;
 682 }
 683 uint64_t bmatand(uint64_t RA, uint64_t RB)
 684 {
 685     // transpose of RB
 686     uint64_t RBt = bmatflip(RB);
 687     uint8_t u[8]; // rows of RA
 688     uint8_t v[8]; // cols of RB
 689     for (int i = 0; i < 8; i++) {
 690         u[i] = RA >> (i*8);
 691         v[i] = RBt >> (i*8);
 692     }
 693     uint64_t x = 0;
 694     for (int i = 0; i < 64; i++) {
 695         if ((u[i / 8] & v[i % 8]) == 0xff)
 696             x |= 1LL << i;
 697     }
 698     return x;
 699 }
 700 ```
 701
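The operand is treated as an 8x8 bit matrix with one row per byte. In the Python sketch below `bmatflip` is written as a direct transpose, which is assumed here to be equivalent to the three `shfl64` steps above; `bmatxor` is then a matrix multiply over GF(2) and `bmator` the same with OR in place of XOR. Multiplying by the identity matrix is a quick sanity check.

```python
def bmatflip(x):
    """Transpose an 8x8 bit matrix stored one row per byte."""
    out = 0
    for row in range(8):
        for col in range(8):
            if (x >> (row * 8 + col)) & 1:
                out |= 1 << (col * 8 + row)
    return out

def _rows(x):
    return [(x >> (8 * i)) & 0xFF for i in range(8)]

def bmatxor(ra, rb):
    u, v = _rows(ra), _rows(bmatflip(rb))   # rows of RA, columns of RB
    x = 0
    for i in range(64):
        if bin(u[i // 8] & v[i % 8]).count("1") & 1:
            x |= 1 << i
    return x

def bmator(ra, rb):
    u, v = _rows(ra), _rows(bmatflip(rb))
    x = 0
    for i in range(64):
        if (u[i // 8] & v[i % 8]) != 0:
            x |= 1 << i
    return x

IDENTITY = 0x8040201008040201   # bit i set in row i
```
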
 702 # Introduction to Carry-less and GF arithmetic
 703
 704 * obligatory xkcd <https://xkcd.com/2595/>
 705
 706 There are three completely separate types of Galois-Field-based arithmetic
 707 that we implement which are not well explained even in introductory
 708 literature.  A slightly oversimplified explanation is followed by more
 709 accurate descriptions:
 710
 711 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
 712   but is accidentally referred to as GF(2) - see below as to why.
 713 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
 714    Galois Fields
 715 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
 716   (2^N) and a second "reducing" polynomial (similar to a prime number), these
 717   are said to be GF(2^N) arithmetic.
 718
 719 further detailed and more precise explanations are provided below
 720
 721 * **Polynomials with coefficients in `GF(2)`**
 722   (aka. Carry-less arithmetic -- the `cl*` instructions).
 723   This isn't actually a Galois Field, but its coefficients are. This is
 724   basically binary integer addition, subtraction, and multiplication like
 725   usual, except that carries aren't propagated at all, effectively turning
 726   both addition and subtraction into the bitwise xor operation. Division and
 727   remainder are defined to match how addition and multiplication works.
 728 * **Galois Fields with a prime size**
 729   (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
 730   This is basically just the integers mod `p`.
 731 * **Galois Fields with a power-of-a-prime size**
 732   (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
 733   integer `n > 0`).
 734   We only implement these for `p == 2`, called Binary Galois Fields
 735   (`GF(2^n)` -- the `gfb*` instructions).
 736   For any prime `p`, `GF(p^n)` is implemented as polynomials with
 737   coefficients in `GF(p)` and degree `< n`, where the polynomials are the
  remainders of dividing by a specifically chosen polynomial in `GF(p)` called
  the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
  Polynomial must be an irreducible polynomial (like primes, but for
 741   polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
 742   and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
 743   affect `GF(p^n)`'s mathematical shape, all that changes is the specific
 744   polynomials used to implement `GF(p^n)`.
 745
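The three categories can be contrasted in a few lines of Python. This is an illustrative sketch, not the reference implementation: the function names are assumptions, and `red_poly` here includes its top (degree `n`) bit explicitly, which may differ from how an instruction encodes it. The `GF(2^8)` example uses the AES reducing polynomial `0x11B`.

```python
def clmul_full(a, b):
    """Carry-less product of two polynomials over GF(2), untruncated."""
    x = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:
            x ^= a << i      # "add" a shifted copy without carries
        i += 1
    return x

def clmod(a, m):
    """Remainder of polynomial a divided by non-zero polynomial m."""
    while a.bit_length() >= m.bit_length():
        a ^= m << (a.bit_length() - m.bit_length())
    return a

def gfp_mul(a, b, p):
    """GF(p): ordinary integer arithmetic modulo a prime p."""
    return (a * b) % p

def gfb_mul(a, b, red_poly):
    """GF(2^n): carry-less multiply, then reduce by red_poly."""
    return clmod(clmul_full(a, b), red_poly)

carryless = clmul_full(0b11, 0b11)      # (x+1)^2 = x^2+1, i.e. 0b101, not 9
prime = gfp_mul(5, 6, 7)                # 30 mod 7
binary = gfb_mul(0x57, 0x83, 0x11B)     # AES field multiplication
```
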
 746 Many implementations and much of the literature do not make a clear
 747 distinction between these three categories, which makes it confusing
 748 to understand what their purpose and value is.
 749
 750 * carry-less multiply is extremely common and is used for the ubiquitous
 751   CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
 752 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
 753   has significant uses throughout cryptography
 754 * GF(p) is the basis again of a significant quantity of algorithms
 755   (TODO, list them, jacob knows what they are), even though the
 756   modulo is limited to be below 64-bit (size of a scalar int)
 757
 758 # Instructions for Carry-less Operations
 759
 760 aka. Polynomials with coefficients in `GF(2)`
 761
 762 Carry-less addition/subtraction is simply XOR, so a `cladd`
 763 instruction is not provided since the `xor[i]` instruction can be used instead.
 764
 765 These are operations on polynomials with coefficients in `GF(2)`, with the
 766 polynomial's coefficients packed into integers with the following algorithm:
 767
 768 ```python
 769 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
 770 ```
 771
 772 ## Carry-less Multiply Instructions
 773
 774 based on RV bitmanip
 775 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
 776 <https://www.felixcloutier.com/x86/pclmulqdq> and
 777 <https://en.m.wikipedia.org/wiki/Carry-less_product>
 778
 779 They are worth adding as their own non-overwrite operations
 780 (in the same pipeline).
 781
 782 ### `clmul` Carry-less Multiply
 783
 784 ```python
 785 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
 786 ```
 787
 788 ### `clmulh` Carry-less Multiply High
 789
 790 ```python
 791 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
 792 ```
 793
 794 ### `clmulr` Carry-less Multiply (Reversed)
 795
 796 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
 797 bit-reversed inputs.
 798
 799 ```python
 800 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
 801 ```
 802
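The relationship between the three variants can be sketched as follows,
assuming the RISC-V bitmanip definitions: a full `2*XLEN`-bit product, with
`clmul` taking the low `XLEN` bits, `clmulh` the bits from `XLEN` upwards, and
`clmulr` the bits from `XLEN-1` upwards. The `*_sketch` names and the
`bit_reverse` helper are hypothetical; the inlined reference files are
authoritative.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def clmul_full(a, b):
    """Full-width carry-less product of two XLEN-bit values."""
    result = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            result ^= a << i
    return result


def clmul_sketch(a, b):
    return clmul_full(a, b) & MASK                   # low XLEN bits


def clmulh_sketch(a, b):
    return clmul_full(a, b) >> XLEN                  # bits XLEN and up


def clmulr_sketch(a, b):
    return (clmul_full(a, b) >> (XLEN - 1)) & MASK   # bits XLEN-1 and up


def bit_reverse(value):
    """Reverse the order of the XLEN bits in value."""
    return int(format(value, "0%db" % XLEN)[::-1], 2)


a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
via_reversal = bit_reverse(clmul_sketch(bit_reverse(a), bit_reverse(b)))
print(hex(clmulr_sketch(a, b)), via_reversal == clmulr_sketch(a, b))
```

The final comparison checks the bit-reversal equivalence stated for `clmulr`.
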
 803 ## `clmadd` Carry-less Multiply-Add
 804
 805 ```
 806 clmadd RT, RA, RB, RC
 807 ```
 808
 809 ```
 810 (RT) = clmul((RA), (RB)) ^ (RC)
 811 ```
 812
 813 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
 814
 815 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
 816 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
 817 to avoid the need for a temp register. RS is written to as well as RT.
 818
Note: Polynomials over GF(2) form a Ring rather than a Field. The
definition of the Inverse Discrete Fourier Transform involves calculating a
multiplicative inverse, which may not exist in every Ring, so the
Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
TODO: check with someone who knows for sure if that's correct.)
 825
 826 ```
 827 cltmadd RT, RA, RB, RC
 828 ```
 829
 830 TODO: add link to explanation for where `RS` comes from.
 831
 832 ```
 833 a = (RA)
 834 c = (RC)
 835 # read all inputs before writing to any outputs in case
 836 # an input overlaps with an output register.
 837 (RT) = clmul(a, (RB)) ^ c
 838 (RS) = a ^ c
 839 ```
 840
 841 ## `cldivrem` Carry-less Division and Remainder
 842
 843 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
 844 for other instructions.
 845
 846 ```python
 847 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
 848 ```
 849
 850 ## `cldiv` Carry-less Division
 851
 852 ```
 853 cldiv RT, RA, RB
 854 ```
 855
 856 ```
 857 n = (RA)
 858 d = (RB)
 859 q, r = cldivrem(n, d, width=XLEN)
 860 (RT) = q
 861 ```
 862
 863 ## `clrem` Carry-less Remainder
 864
 865 ```
 866 clrem RT, RA, RB
 867 ```
 868
 869 ```
 870 n = (RA)
 871 d = (RB)
 872 q, r = cldivrem(n, d, width=XLEN)
 873 (RT) = r
 874 ```
 875
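A minimal sketch of carry-less long division, to show how `cldiv` and `clrem`
relate to `clmul`. The names `clmul_sketch` and `cldivrem_sketch` are
hypothetical, the `width` parameter of the reference is not modelled, and the
`d == 0` case is left to the inlined reference (the sketch simply raises an
error).

```python
def clmul_sketch(a, b):
    """Carry-less multiply on unbounded integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def cldivrem_sketch(n, d):
    """Polynomial long division over GF(2); returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("d == 0 is not modelled in this sketch")
    q = 0
    r = n
    d_degree = d.bit_length() - 1
    while r.bit_length() - 1 >= d_degree:
        shift = (r.bit_length() - 1) - d_degree
        q |= 1 << shift        # record the quotient term x^shift
        r ^= d << shift        # subtract (XOR) d * x^shift
    return q, r


n, d = 0b110101, 0b101
q, r = cldivrem_sketch(n, d)
print(bin(q), bin(r))
print(clmul_sketch(q, d) ^ r == n)   # n = q * d + r, carry-less
```

The last line checks the defining identity: quotient times divisor, plus
remainder, gives back the dividend.
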
 876 # Instructions for Binary Galois Fields `GF(2^m)`
 877
 878 see:
 879
 880 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 881 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 882 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 883
 884 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
 885 instruction is not provided since the `xor[i]` instruction can be used instead.
 886
 887 ## `GFBREDPOLY` SPR -- Reducing Polynomial
 888
 889 In order to save registers and to make operations orthogonal with standard
 890 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
 891 This also allows hardware to pre-compute useful parameters (such as the
 892 degree, or look-up tables) based on the reducing polynomial, and store them
 893 alongside the SPR in hidden registers, only recomputing them whenever the SPR
 894 is written to, rather than having to recompute those values for every
 895 instruction.
 896
Because Galois Fields require the reducing polynomial to be an irreducible
polynomial, any reducing polynomial of `degree > 1` must have
the LSB set: otherwise it would be divisible by the polynomial `x`,
making it reducible, and whatever we're working on would no longer be a Field.
Since the LSB is therefore always 1 for such polynomials, we can reuse it to
indicate `degree == XLEN`.
 902
 903 ```python
 904 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
 905 ```
 906
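The claim that irreducible polynomials of `degree > 1` always have the LSB set
can be checked by brute force for small degrees. This is only a sketch:
`clrem_sketch` and `is_irreducible` are hypothetical helpers, and nothing here
models the actual `GFBREDPOLY` encoding, which is defined by the inlined
reference above.

```python
def clrem_sketch(n, d):
    """Remainder of carry-less division of n by d (d must be nonzero)."""
    d_degree = d.bit_length() - 1
    while n.bit_length() - 1 >= d_degree:
        n ^= d << ((n.bit_length() - 1) - d_degree)
    return n


def is_irreducible(poly):
    """Brute force: no divisor of degree 1 .. degree(poly) - 1."""
    degree = poly.bit_length() - 1
    if degree < 1:
        return False
    return all(clrem_sketch(poly, d) != 0 for d in range(2, 1 << degree))


# every polynomial of degree 1 to 8, packed with x^0 in the LSB
irreducible = [p for p in range(2, 1 << 9) if is_irreducible(p)]
with_clear_lsb = [p for p in irreducible if (p & 1) == 0]
print(with_clear_lsb)   # only the degree-1 polynomial `x` (0b10) remains
```
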
 907 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
 908
 909 unless this is an immediate op, `mtspr` is completely sufficient.
 910
 911 ```python
 912 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
 913 ```
 914
 915 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
 916
 917 ```
 918 gfbmul RT, RA, RB
 919 ```
 920
 921 ```python
 922 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
 923 ```
 924
 925 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
 926
 927 ```
 928 gfbmadd RT, RA, RB, RC
 929 ```
 930
 931 ```python
 932 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
 933 ```
 934
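As a worked example, the sketch below evaluates multiply and multiply-add in
the Rijndael field, `GF(2^8)` with reducing polynomial
`x^8 + x^4 + x^3 + x + 1` (`0x11B`). The reducing polynomial is passed as a
plain argument here rather than being decoded from `GFBREDPOLY`, inputs are
assumed to be already reduced, and the `*_sketch` names are hypothetical.

```python
AES_RED_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gfbmul_sketch(a, b, red_poly):
    """Shift-and-add multiply with reduction after every shift."""
    degree = red_poly.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> degree) & 1:
            a ^= red_poly      # keep `a` below degree m
    return result


def gfbmadd_sketch(a, b, c, red_poly):
    """a * b + c, where addition in GF(2^m) is XOR."""
    return gfbmul_sketch(a, b, red_poly) ^ c


print(hex(gfbmul_sketch(0x57, 0x83, AES_RED_POLY)))
print(hex(gfbmadd_sketch(0x57, 0x83, 0xC1, AES_RED_POLY)))
```

The first line prints `0xc1`, matching the multiplication example in FIPS-197;
the second prints `0x0` because adding `0xC1` to that product cancels it.
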
 935 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
 936
 937 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
 938 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
 939 for a temp register. RS is written to as well as RT.
 940
 941 ```
 942 gfbtmadd RT, RA, RB, RC
 943 ```
 944
 945 TODO: add link to explanation for where `RS` comes from.
 946
 947 ```
 948 a = (RA)
 949 c = (RC)
 950 # read all inputs before writing to any outputs in case
 951 # an input overlaps with an output register.
 952 (RT) = gfbmadd(a, (RB), c)
 953 # use gfbmadd again since it reduces the result
 954 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
 955 ```
 956
 957 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
 958
 959 ```
 960 gfbinv RT, RA
 961 ```
 962
 963 ```python
 964 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
 965 ```
 966
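One way to picture the inverse is through exponentiation: every nonzero `a`
in `GF(2^m)` satisfies `a^(2^m - 1) = 1`, so `a^(2^m - 2)` is its inverse. The
sketch below uses that identity for brevity; it is not necessarily the method
used by the inlined reference, and the names and the explicit `red_poly`
argument are hypothetical.

```python
AES_RED_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gfbmul_sketch(a, b, red_poly):
    """GF(2^m) multiply for already-reduced inputs."""
    degree = red_poly.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> degree) & 1:
            a ^= red_poly
    return result


def gfbinv_sketch(a, red_poly):
    """Inverse as a^(2^m - 2), computed by square-and-multiply."""
    degree = red_poly.bit_length() - 1
    exponent = (1 << degree) - 2
    result = 1
    base = a
    while exponent:
        if exponent & 1:
            result = gfbmul_sketch(result, base, red_poly)
        base = gfbmul_sketch(base, base, red_poly)
        exponent >>= 1
    return result


inv = gfbinv_sketch(0x53, AES_RED_POLY)
print(hex(inv), gfbmul_sketch(0x53, inv, AES_RED_POLY))
```

In this sketch an input of 0 yields 0, which is a side effect of the
exponentiation and not a statement about what the instruction returns.
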
 967 # Instructions for Prime Galois Fields `GF(p)`
 968
 969 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
 970
 971 ## `gfpadd` Prime Galois Field `GF(p)` Addition
 972
 973 ```
 974 gfpadd RT, RA, RB
 975 ```
 976
 977 ```python
 978 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
 979 ```
 980
 981 the addition happens on infinite-precision integers
 982
 983 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
 984
 985 ```
 986 gfpsub RT, RA, RB
 987 ```
 988
 989 ```python
 990 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
 991 ```
 992
 993 the subtraction happens on infinite-precision integers
 994
 995 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
 996
 997 ```
 998 gfpmul RT, RA, RB
 999 ```
1000
1001 ```python
1002 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
1003 ```
1004
1005 the multiplication happens on infinite-precision integers
1006
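The remark about infinite-precision integers matters because the product of
two `XLEN`-bit values needs up to `2*XLEN` bits before it is reduced. A small
sketch of what goes wrong otherwise, using the Mersenne prime `2^61 - 1` as an
arbitrary example modulus (the function names are hypothetical):

```python
XLEN = 64
P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1, chosen only for illustration


def gfpmul_sketch(a, b, p):
    """Multiply on unbounded integers, then reduce once."""
    return (a * b) % p


def gfpmul_truncated(a, b, p):
    """WRONG on purpose: the product is cut to XLEN bits before reducing."""
    return ((a * b) & ((1 << XLEN) - 1)) % p


a = b = P - 1
print(gfpmul_sketch(a, b, P), gfpmul_truncated(a, b, P))
```

Since `P - 1` is congruent to `-1`, the correct product is 1; the truncated
version gives a different value because the high bits of the product were
discarded before the reduction.
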
1007 ## `gfpinv` Prime Galois Field `GF(p)` Invert
1008
1009 ```
1010 gfpinv RT, RA
1011 ```
1012
1013 Some potential hardware implementations are found in:
1014 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
1015
1016 ```python
1017 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
1018 ```
1019
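For a software view of what the result must satisfy, the sketch below computes
the inverse with Fermat's little theorem (`a^(p-2) mod p` for prime `p`). This
is chosen for brevity only; hardware and the inlined reference may well use a
different algorithm. The name `gfpinv_sketch` and the example prime are
hypothetical.

```python
def gfpinv_sketch(a, p):
    """Inverse via Fermat's little theorem; p must be prime, a nonzero."""
    return pow(a, p - 2, p)


p = 65537          # a small prime, for illustration only
a = 12345
inv = gfpinv_sketch(a, p)
print(inv, (a * inv) % p)   # the second value is 1 when inv is correct
```
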
1020 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
1021
1022 ```
1023 gfpmadd RT, RA, RB, RC
1024 ```
1025
1026 ```python
1027 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
1028 ```
1029
the multiplication and addition happen on infinite-precision integers
1031
1032 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
1033
1034 ```
1035 gfpmsub RT, RA, RB, RC
1036 ```
1037
1038 ```python
1039 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
1040 ```
1041
the multiplication and subtraction happen on infinite-precision integers
1043
1044 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
1045
1046 ```
1047 gfpmsubr RT, RA, RB, RC
1048 ```
1049
1050 ```python
1051 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
1052 ```
1053
the multiplication and subtraction happen on infinite-precision integers
1055
1056 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
1057
1058 Used in combination with SV FFT REMAP to perform
1059 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
1060 to avoid the need for a temp register. RS is written
1061 to as well as RT.
1062
1063 ```
1064 gfpmaddsubr RT, RA, RB, RC
1065 ```
1066
1067 TODO: add link to explanation for where `RS` comes from.
1068
1069 ```
1070 factor1 = (RA)
1071 factor2 = (RB)
1072 term = (RC)
1073 # read all inputs before writing to any outputs in case
1074 # an input overlaps with an output register.
1075 (RT) = gfpmadd(factor1, factor2, term)
1076 (RS) = gfpmsubr(factor1, factor2, term)
1077 ```
1078
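The register-overlap comment can be made concrete with a small model. This is
a sketch under two assumptions that the text above does not confirm: that
`gfpmsubr` computes `term - factor1 * factor2` (the "reversed" subtraction),
and that `RS` can be modelled as an explicit register number. The register
file is a plain dict and all names are hypothetical.

```python
def gfpmadd_sketch(factor1, factor2, term, p):
    return (factor1 * factor2 + term) % p


def gfpmsubr_sketch(factor1, factor2, term, p):
    # ASSUMPTION: "reversed" means the product is subtracted from the term
    return (term - factor1 * factor2) % p


def gfpmaddsubr_sketch(regs, rt, rs, ra, rb, rc, p):
    """Butterfly step: both results are computed from the original inputs."""
    factor1, factor2, term = regs[ra], regs[rb], regs[rc]  # read first
    regs[rt] = gfpmadd_sketch(factor1, factor2, term, p)
    regs[rs] = gfpmsubr_sketch(factor1, factor2, term, p)


p = 17
regs = {3: 5, 4: 7, 5: 9}
# RT overlaps RC and RS overlaps RA, so the update happens in place
gfpmaddsubr_sketch(regs, rt=5, rs=3, ra=3, rb=4, rc=5, p=p)
print(regs)
```

Under those assumptions the two outputs are `term + product` and
`term - product`, so their sum is `2 * term` modulo `p`, which is the usual
shape of an FFT/NTT butterfly.
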
1079 # Already in POWER ISA
1080
1081 ## count leading/trailing zeros with mask
1082
1083 in v3.1 p105
1084
1085 ```
count = 0
do i = 0 to 63
    if((RB)i=1) then do
        if((RS)i=1) then break
        count ← count + 1
    end
end
RA ← EXTZ64(count)
1090 ```
1091
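A Python sketch of the leading-zeros variant, assuming the count only advances
for bit positions selected by the mask in `RB`, and using Power ISA bit
numbering (bit 0 is the MSB). The name `cntlzdm_sketch` and the `xlen`
parameter are hypothetical.

```python
def cntlzdm_sketch(rs, rb, xlen=64):
    """Count leading zeros of rs, looking only at bits where rb is 1."""
    count = 0
    for i in range(xlen):
        shift = xlen - 1 - i          # Power ISA bit i counts from the MSB
        if (rb >> shift) & 1:
            if (rs >> shift) & 1:
                break                 # first masked 1 bit ends the count
            count += 1
    return count


print(cntlzdm_sketch(0x10, 0xF0))     # masked bits 7..4 are 0, 0, 0, 1
print(cntlzdm_sketch(0x00, 0xFF))     # all eight masked bits are zero
```

These print 3 and 8 respectively: unmasked bits are skipped and never counted.
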
1092 ## bit deposit
1093
pdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
1095
1096     do while(m < 64)
1097        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1098           result = VSR[VRA+32].dword[i].bit[63-k]
1099           VSR[VRT+32].dword[i].bit[63-m] = result
1100           k = k + 1
1101        m = m + 1
1102
1103 ```
1104
1105 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1106 {
1107     uint_xlen_t r = 0;
1108     for (int i = 0, j = 0; i < XLEN; i++)
1109         if ((RB >> i) & 1) {
1110             if ((RA >> j) & 1)
1111                 r |= uint_xlen_t(1) << i;
1112             j++;
1113         }
1114     return r;
1115 }
1116
1117 ```
1118
1119 ## bit extract
1120
1121 other way round: identical to RV bext: pextd, found in v3.1 p196
1122
1123 ```
1124 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1125 {
1126     uint_xlen_t r = 0;
1127     for (int i = 0, j = 0; i < XLEN; i++)
1128         if ((RB >> i) & 1) {
1129             if ((RA >> i) & 1)
1130                 r |= uint_xlen_t(1) << j;
1131             j++;
1132         }
1133     return r;
1134 }
1135 ```
1136
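The two C routines above are inverses of each other on the masked bits. A
direct Python port makes that easy to check; the `*_sketch` names are
hypothetical and `XLEN` is fixed at 64 for the example.

```python
XLEN = 64


def bdep_sketch(ra, rb):
    """Deposit the low bits of ra into the positions where rb has a 1."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> j) & 1:
                r |= 1 << i
            j += 1
    return r


def bext_sketch(ra, rb):
    """Extract the bits of ra selected by rb and pack them at the LSB end."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> i) & 1:
                r |= 1 << j
            j += 1
    return r


mask = 0b11010
deposited = bdep_sketch(0b101, mask)
print(bin(deposited), bin(bext_sketch(deposited, mask)))
```

Depositing `0b101` under the mask gives `0b10010`, and extracting with the
same mask returns the original `0b101`.
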
1137 ## centrifuge
1138
1139 found in v3.1 p106 so not to be added here
1140
1141 ```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)i=0) then do
        result[ptr0] = (RS)i
        ptr0 = ptr0 + 1
    end
    if((RB)63-i==1) then do
        result[63-ptr1] = (RS)63-i
        ptr1 = ptr1 + 1
    end
end
RA = result
1154 ```
1155
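A Python sketch of the centrifuge, assuming each pointer advances only when a
bit is actually placed, so that bits of `RS` under a 0 mask bit gather at the
left (MSB) end and bits under a 1 mask bit gather at the right, each group
keeping its original order. The name `cfuged_sketch` and the `xlen` parameter
are hypothetical.

```python
def cfuged_sketch(rs, rb, xlen=64):
    """Separate the bits of rs into two groups according to the mask rb."""
    zeros = []   # bits of rs where the mask bit is 0, MSB first
    ones = []    # bits of rs where the mask bit is 1, MSB first
    for i in range(xlen):
        shift = xlen - 1 - i
        bit = (rs >> shift) & 1
        if (rb >> shift) & 1:
            ones.append(bit)
        else:
            zeros.append(bit)
    result = 0
    for bit in zeros + ones:      # mask-0 group on the left, mask-1 on the right
        result = (result << 1) | bit
    return result


print(bin(cfuged_sketch(0b10110010, 0b11001010, xlen=8)))
```

With the 8-bit example, the mask-0 bits `1,1,0,0` land on the left and the
mask-1 bits `1,0,0,1` on the right, giving `0b11001001`.
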
1156 ## bit to byte permute
1157
similar to matrix permute in RV bitmanip, which has XOR and OR variants;
these perform a transpose. TODO: this looks like VSX, is there a scalar variant
in v3.0/v3.1 already?
1161
1162     do j = 0 to 7
1163       do k = 0 to 7
1164          b = VSR[VRB+32].dword[i].byte[k].bit[j]
1165          VSR[VRT+32].dword[i].byte[j].bit[k] = b
1166
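Treating one doubleword as an 8x8 bit matrix (one byte per row), the loop above
is a matrix transpose. A scalar Python sketch, with the hypothetical name
`bit_transpose_8x8`; it uses LSB-first numbering for both bytes and bits, which
gives the same result as MSB-first numbering provided both use the same
convention.

```python
def bit_transpose_8x8(dword):
    """Result byte[j].bit[k] takes the value of input byte[k].bit[j]."""
    result = 0
    for j in range(8):
        for k in range(8):
            b = (dword >> (8 * k + j)) & 1
            result |= b << (8 * j + k)
    return result


value = 0x00000000000000FF            # byte 0 is all ones
print(hex(bit_transpose_8x8(value)))  # bit 0 of every byte is now set
```

Applying the function twice returns the original value, as expected of a
transpose.
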
1167 # Appendix
1168
1169 see [[bitmanip/appendix]]
1170