openpower/sv/bitmanip.mdwn

   1 [[!tag standards]]
   2
   3 [[!toc levels=1]]
   4
   5 # Implementation Log
   6
   7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
   8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
   9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
  10
  11
  12 # bitmanipulation
  13
  14 **DRAFT STATUS**
  15
  16 pseudocode: [[openpower/isa/bitmanip]]
  17
This extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX.
Also included are DSP/Multimedia operations suitable for
Audio/Video.  Vectorisation and SIMD are removed: these are straight scalar (element) operations, which makes them suitable for embedded applications.
Vectorisation Context is provided by [[openpower/sv]].
  22
  23 When combined with SV, scalar variants of bitmanip operations found in VSX are added so that the Packed SIMD aspects of VSX may be retired as "legacy"
  24 in the far future (10 to 20 years).  Also, VSX is hundreds of opcodes, requires 128 bit pathways, and is wholly unsuited to low power or embedded scenarios.
  25
  26 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD".  It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs.  Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
  27
General-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science.  However, for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added.  The expectation is that these operations would all be covered by the same pipeline.
  29
Note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]] and
[[sv/av_opcodes]], as well as [[sv/setvl]], [[sv/svstep]] and [[sv/remap]].
  32
Useful resources:
  34
  35 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
  36 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
  37
  38 # summary
  39
  40 two major opcodes are needed
  41
  42 ternlog has its own major opcode
  43
  44 |  29.30 |31| name      |
  45 | ------ |--| --------- |
  46 |   0  0   |Rc| ternlogi  |
  47 |   0  1   |sz| ternlogv  |
  48 |   1 iv   |  | grevlogi |
  49
  50 2nd major opcode for other bitmanip: minor opcode allocation
  51
  52 |  28.30 |31| name      |
  53 | ------ |--| --------- |
  54 |  -00   |0 | xpermi    |
  55 |  -00   |1 | grevlog   |
  56 |  -01   |  | crternlog  |
  57 |  010   |Rc| bitmask   |
  58 |  011   |  | SVP64  |
  59 |  110   |Rc| 1/2-op    |
  60 |  111   |  | bmrevi   |
  61
  62
  63 1-op and variants
  64
  65 | dest | src1 | subop | op       |
  66 | ---- | ---- | ----- | -------- |
  67 | RT   | RA   | ..    | bmatflip |
  68
  69 2-op and variants
  70
  71 | dest | src1 | src2 | subop | op       |
  72 | ---- | ---- | ---- | ----- | -------- |
  73 | RT   | RA   | RB   | or    | bmatflip |
  74 | RT   | RA   | RB   | xor   | bmatflip |
  75 | RT   | RA   | RB   |       | grev  |
  76 | RT   | RA   | RB   |       | clmul\*  |
  77 | RT   | RA   | RB   |       | gorc |
  78 | RT   | RA   | RB   | shuf  | shuffle |
  79 | RT   | RA   | RB   | unshuf| shuffle |
  80 | RT   | RA   | RB   | width | xperm  |
  81 | RT   | RA   | RB   | type | av minmax |
  82 | RT   | RA   | RB   |      | av abs avgadd  |
  83 | RT   | RA   | RB   | type | vmask ops |
  84 | RT   | RA   | RB   | type | abs accumulate (overwrite)  |
  85
  86 3 ops
  87
  88 * grevlog
  89 * GF mul-add
  90 * bitmask-reverse
  91
  92 TODO: convert all instructions to use RT and not RS
  93
  94 | 0.5|6.10|11.15|16.20 |21..25   | 26....30  |31| name |
  95 | -- | -- | --- | ---  | -----   | --------  |--| ------ |
  96 | NN | RT | RA  |itype/| im0-4   | im5-7  00 |0 | xpermi  |
  97 | NN | RT | RA  | RB   | im0-4   | im5-7  00 |1 | grevlog |
  98 | NN |    |     |      |         | -----  01 |m3| crternlog |
  99 | NN | RT | RA  | RB   | RC      | mode  010 |Rc| bitmask\* |
 100 | NN |    |     |      |         | 00    011 |  | rsvd |
 101 | NN |    |     |      |         | 01    011 |0 | svshape |
 102 | NN |    |     |      |         | 01    011 |1 | svremap |
 103 | NN |    |     |      |         | 10    011 |Rc| svstep |
 104 | NN |    |     |      |         | 11    011 |Rc| setvl |
 105 | NN |    |     |      |         | ----  110 |  | 1/2 ops |
 106 | NN | RT | RA  | RB   | sh0-4   | sh5 1 111 |Rc| bmrevi |
 107
 108 ops (note that av avg and abs as well as vec scalar mask
 109 are included here [[sv/vector_ops]], and
 110 the [[sv/av_opcodes]])
 111
 112 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
 113 double check that instructions didn't need 3 inputs.
 114
 115 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name      |  Form   |
 116 | -- | -- | --- | --- | -- | ----- | -------- |--| ----      | ------- |
 117 | NN | RS | me  | sh  | SH | ME 0  | nn00 110 |Rc| bmopsi    | {TODO}  |
 118 | NN | RS | RA  | sh  | SH | 0   1 | nn00 110 |Rc| bmopsi    | XB-Form |
 119 | NN | RT | RA  | RB  | 1  |  00   | 0001 110 |Rc| cldiv     | X-Form  |
 120 | NN | RT | RA  | RB  | 1  |  01   | 0001 110 |Rc| clmod     | X-Form  |
 121 | NN | RT | RA  |     | 1  |  10   | 0001 110 |Rc| bmatflip  | X-Form  |
 122 | NN |    |     |     | 1  |  11   | 0001 110 |Rc| rsvd      |         |
 123 | NN | RT | RA  | RB  | 0  |   00  | 0001 110 |Rc| vec sbfm  | X-Form  |
 124 | NN | RT | RA  | RB  | 0  |   01  | 0001 110 |Rc| vec sofm  | X-Form  |
 125 | NN | RT | RA  | RB  | 0  |   10  | 0001 110 |Rc| vec sifm  | X-Form  |
 126 | NN | RT | RA  | RB  | 0  |   11  | 0001 110 |Rc| vec cprop | X-Form  |
 127 | NN |    |     |     | 0  |       | 0101 110 |Rc| rsvd      |         |
 128 | NN | RT | RA  | RB  | 1  | itype | 0101 110 |Rc| xperm     | X-Form  |
 129 | NN | RT | RA  | RB  | 0  | itype | 1001 110 |Rc| av minmax | X-Form  |
 130 | NN | RT | RA  | RB  | 1  |   00  | 1001 110 |Rc| av abss   | X-Form  |
 131 | NN | RT | RA  | RB  | 1  |   01  | 1001 110 |Rc| av absu   | X-Form  |
 132 | NN | RT | RA  | RB  | 1  |   10  | 1001 110 |Rc| av avgadd | X-Form  |
 133 | NN |    |     |     | 1  |   11  | 1001 110 |Rc| rsvd      |         |
 134 | NN | RT | RA  | RB  | 0  |   sh  | 1101 110 |Rc| shadd     | {TODO}  |
 135 | NN | RT | RA  | RB  | 1  |   sh  | 1101 110 |Rc| shadduw   | {TODO}  |
 136 | NN | RT | RA  | RB  | 0  | 00    | 0010 110 |Rc| gorc      | X-Form  |
 137 | NN | RS | RA  | sh  | SH | 00    | 1010 110 |Rc| gorci     | XB-Form |
 138 | NN | RT | RA  | RB  | 0  | 00    | 0110 110 |Rc| gorcw     | X-Form  |
 139 | NN | RS | RA  | SH  | 0  | 00    | 1110 110 |Rc| gorcwi    | X-Form  |
 140 | NN | RT | RA  | RB  | 1  | 00    | 1110 110 |Rc| bmator    | X-Form  |
 141 | NN | RT | RA  | RB  | 0  | 01    | 0010 110 |Rc| grev      | X-Form  |
 142 | NN | RT | RA  | RB  | 1  | 01    | 0010 110 |Rc| clmul     | X-Form  |
 143 | NN | RS | RA  | sh  | SH | 01    | 1010 110 |Rc| grevi     | XB-Form |
 144 | NN | RT | RA  | RB  | 0  | 01    | 0110 110 |Rc| grevw     | X-Form  |
 145 | NN | RS | RA  | SH  | 0  | 01    | 1110 110 |Rc| grevwi    | X-Form  |
 146 | NN | RT | RA  | RB  | 1  | 01    | 1110 110 |Rc| bmatxor   | X-Form  |
 147 | NN | RS | RA  | RB  | 0  | 10    | 0010 110 |Rc| abssa     | X-Form  |
 148 | NN | RS | RA  | RB  | 0  | 10    | 0110 110 |Rc| absua     | X-Form  |
 149 | NN | RS | RA  | RB  | 0  | 10    | 1010 110 |Rc|           | X-Form  |
 150 | NN | RS | RA  | RB  | 0  | 10    | 1110 110 |Rc|           | X-Form  |
 151 | NN |    |     |     | 1  | 10    | --10 110 |Rc| rsvd      |         |
 152 | NN | RT | RA  | RB  | 0  | 11    | 1110 110 |Rc| clmulr    | X-Form  |
 153 | NN | RT | RA  | RB  | 1  | 11    | 1110 110 |Rc| clmulh    | X-Form  |
 154 | NN |    |     |     |    |       | --11 110 |Rc| rsvd      |         |
 155
 156 # ternlog bitops
 157
Similar to FPGA LUTs: for every bit position, the three input bits select one bit from an 8-bit lookup table, which is supplied either as an immediate or in another register.
 159
 160 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
 161
 162 ## ternlogi
 163
 164 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
 165 | -- | -- | --- | --- | ----- | --- |--|
 166 | NN | RT | RA  | RB  | im0-7 |  00 |Rc|
 167
 168     lut3(imm, a, b, c):
 169         idx = c << 2 | b << 1 | a
 170         return imm[idx] # idx by LSB0 order
 171
 172     for i in range(64):
 173         RT[i] = lut3(imm, RB[i], RA[i], RT[i])
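
As a rough illustration of the pseudocode above, here is a minimal Python sketch of `ternlogi`. The helper `make_imm` is hypothetical (it is not part of the spec) and simply builds the 8-bit table from a three-input Boolean function, using the same LSB0 indexing as `lut3`.

```python
XLEN = 64

def lut3(imm, a, b, c):
    """Select one bit of the 8-bit table imm, indexed in LSB0 order."""
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(rt, ra, rb, imm):
    """Bit-by-bit model: RT is both the third input and the destination."""
    result = 0
    for i in range(XLEN):
        a = (rb >> i) & 1
        b = (ra >> i) & 1
        c = (rt >> i) & 1
        result |= lut3(imm, a, b, c) << i
    return result

def make_imm(fn):
    """Hypothetical helper: tabulate fn(a, b, c) into an 8-bit immediate."""
    imm = 0
    for idx in range(8):
        a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
        imm |= (fn(a, b, c) & 1) << idx
    return imm

# three-way XOR and majority, two common choices of table
imm_xor3 = make_imm(lambda a, b, c: a ^ b ^ c)
imm_maj = make_imm(lambda a, b, c: (a & b) | (a & c) | (b & c))
```

With these tables a single `ternlogi` replaces the two or more ordinary logical instructions that a three-input function would otherwise need.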
 174
 175 ## ternlogv
 176
Another possible variant involves swizzle-like selection
and masking. It requires only three 64 bit registers (RA, RS, RB) and
only 16 LUT3s.

Note however that unless XLEN matches sz, this instruction
is a Read-Modify-Write: RS must be read as a second operand
and all unmodified bits preserved.  SVP64 may provide a limited
alternative destination for RS from RS-as-source, but again
all unmodified bits must still be copied.
 186
 187 | 0.5|6.10|11.15|16.20|21.28 | 29.30 |31|
 188 | -- | -- | --- | --- | ---- | ----- |--|
 189 | NN | RS | RA  | RB  |idx0-3|  01   |sz|
 190
    SZ = (1+sz) * 8 # 8 or 16
    raoff = MIN(XLEN, idx0 * SZ)
    rboff = MIN(XLEN, idx1 * SZ)
    rcoff = MIN(XLEN, idx2 * SZ)
    rsoff = MIN(XLEN, idx3 * SZ)
    imm = RB[0:8]
    for i in range(MIN(XLEN, SZ)):
        ra = RA[raoff+i]
        rb = RA[rboff+i]
        rc = RA[rcoff+i]
        res = lut3(imm, ra, rb, rc)
        RS[rsoff+i] = res
 203
 204 ## ternlogcr
 205
Another mode selection would operate on CRs rather than Ints.
 207
 208 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
 209 | -- | -- | --- | --- | --- |-----|----- | -----|--|
 210 | NN | BT | BA  | BB  | BC  |m0-2 | imm  |  01  |m3|
 211
    mask = m0-2,m3
    for i in range(4):
        a,b,c = CRs[BA][i], CRs[BB][i], CRs[BC][i]
        if mask[i] CRs[BT][i] = lut3(imm, a, b, c)
 216
 217 # int ops
 218
## min/max
 220
 221 required for the [[sv/av_opcodes]]
 222
Signed and unsigned min/max for integer.  This is sort-of partly synthesisable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not both signed and unsigned.  When the dest is also one of the sources and the mv fails due to the CR bit-test failing, this will only overwrite the dest where the src is greater (or less).

Providing signed/unsigned min/max directly gives more flexibility.
 226
 227 ```
 228 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
 229 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
 230 }
 231 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
 232 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
 233 }
 234 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
 235 { return rs1 < rs2 ? rs1 : rs2;
 236 }
 237 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
 238 { return rs1 > rs2 ? rs1 : rs2;
 239 }
 240 ```
 241
 242 ## average
 243
 244 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 245 but not scalar
 246
```
uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
     return (rs1 + rs2 + 1) >> 1;
}
```
 252
 253 ## abs
 254
 255 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
 256 but not scalar
 257
```
uint_xlen_t intabs(uint_xlen_t rs1, uint_xlen_t rs2) {
     return (rs1 > rs2) ? (rs1-rs2) : (rs2-rs1);
}
```
 263
 264 # shift-and-add
 265
Power ISA is missing LD/ST with shift, which is present in both ARM and x86.
Adding more LD/ST instructions would be too complex, so the compromise is to add shift-and-add,
which replaces a pair of explicit instructions in hot-loops.
 269
 270 ```
 271 uint_xlen_t shadd(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 272     return (rs1 << (sh+1)) + rs2;
 273 }
 274
 275 uint_xlen_t shadduw(uint_xlen_t rs1, uint_xlen_t rs2, uint8_t sh) {
 276     uint_xlen_t rs1z = rs1 & 0xFFFFFFFF;
 277     return (rs1z << (sh+1)) + rs2;
 278 }
 279 ```
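
A small Python sketch of the two functions above, applied to the typical use-case of scaled array indexing. The names `base` and `index` are purely illustrative; results are truncated to 64 bits on the assumption that XLEN=64.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def shadd(rs1, rs2, sh):
    """Shift rs1 left by sh+1 and add rs2, truncated to XLEN bits."""
    return ((rs1 << (sh + 1)) + rs2) & MASK

def shadduw(rs1, rs2, sh):
    """As shadd, but rs1 is first zero-extended from 32 bits."""
    rs1z = rs1 & 0xFFFFFFFF
    return ((rs1z << (sh + 1)) + rs2) & MASK

# address of element 5 in an array of 8-byte items: sh=2 gives a shift of 3
base = 0x1000
index = 5
addr = shadd(index, base, 2)
```

Because the shift amount is `sh+1`, the 2-bit `sh` field covers scale factors of 2, 4, 8 and 16.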
 280
 281 # cmix
 282
 283 based on RV bitmanip, covered by ternlog bitops
 284
 285 ```
 286 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
 287     return (RA & RB) | (RC & ~RB);
 288 }
 289 ```
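
To illustrate "covered by ternlog bitops", the sketch below derives a `ternlogi` immediate that reproduces `cmix`. It assumes the `ternlogi` operand order given earlier (`lut3(imm, RB[i], RA[i], RT[i])`), with RC supplied in RT; the constant name `CMIX_IMM` is hypothetical.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def cmix(ra, rb, rc):
    return ((ra & rb) | (rc & ~rb)) & MASK

def lut3(imm, a, b, c):
    idx = (c << 2) | (b << 1) | a
    return (imm >> idx) & 1

def ternlogi(rt, ra, rb, imm):
    result = 0
    for i in range(XLEN):
        result |= lut3(imm, (rb >> i) & 1, (ra >> i) & 1, (rt >> i) & 1) << i
    return result

# tabulate "a ? b : c" with a=RB, b=RA, c=RT (the old RT carries RC)
CMIX_IMM = 0
for idx in range(8):
    a, b, c = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1
    CMIX_IMM |= ((b & a) | (c & (a ^ 1))) << idx
```

Under those assumptions the table works out to `0xD8`, so no dedicated cmix opcode is needed.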
 290
 291
 292 # bitmask set
 293
Based on RV bitmanip single-bit set, with an instruction format similar to shift
[[isa/fixedshift]].  bmext is actually covered already (shift-with-mask rldicl, but only the immediate version).
However bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.

The bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, whereas rldicl is a left rotate.  For the immediate version this does not matter, so a bmexti is not required.
For bmrev, however, there is no direct equivalent and consequently a bmrevi is required.
 300
 301 bmset (register for mask amount) is particularly useful for creating
 302 predicate masks where the length is a dynamic runtime quantity.
 303 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
 304
 305 | 0.5|6.10|11.15|16.20|21.25| 26..30  |31| name  |
 306 | -- | -- | --- | --- | --- | ------- |--| ----- |
 307 | NN | RS | RA  | RB  | RC  | mode 010 |Rc| bm\*   |
 308
 309 Immediate-variant is an overwrite form:
 310
 311 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
 312 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
 313 | NN | RS | RB  | sh  | SH | itype | 1000 110 |Rc| bm\*i |
 314
 315 ```
 316 def MASK(x, y):
 317      if x < y:
 318          x = x+1
 319          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 320          mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
 321      elif x == y:
 322          return 1 << x
 323      else:
 324          x = x+1
 325          mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
 326          mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
 327      return mask_a ^ mask_b
 328
 329
 330 uint_xlen_t bmset(RS, RB, sh)
 331 {
 332     int shamt = RB & (XLEN - 1);
 333     mask = (2<<sh)-1;
 334     return RS | (mask << shamt);
 335 }
 336
 337 uint_xlen_t bmclr(RS, RB, sh)
 338 {
 339     int shamt = RB & (XLEN - 1);
 340     mask = (2<<sh)-1;
 341     return RS & ~(mask << shamt);
 342 }
 343
 344 uint_xlen_t bminv(RS, RB, sh)
 345 {
 346     int shamt = RB & (XLEN - 1);
 347     mask = (2<<sh)-1;
 348     return RS ^ (mask << shamt);
 349 }
 350
 351 uint_xlen_t bmext(RS, RB, sh)
 352 {
 353     int shamt = RB & (XLEN - 1);
 354     mask = (2<<sh)-1;
 355     return mask & (RS >> shamt);
 356 }
 357 ```
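
The four functions above can be modelled in Python as follows. This is only a sketch that follows the pseudocode literally, assuming XLEN=64; the helper `field_mask` is hypothetical. Note that `(2<<sh)-1` is a run of `sh+1` ones, so `sh=7` selects an 8-bit field.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def field_mask(sh):
    """Hypothetical helper: a run of sh+1 ones, as in the pseudocode."""
    return ((2 << sh) - 1) & MASK

def bmset(rs, rb, sh):
    shamt = rb & (XLEN - 1)
    return (rs | (field_mask(sh) << shamt)) & MASK

def bmclr(rs, rb, sh):
    shamt = rb & (XLEN - 1)
    return rs & ~(field_mask(sh) << shamt) & MASK

def bminv(rs, rb, sh):
    shamt = rb & (XLEN - 1)
    return (rs ^ (field_mask(sh) << shamt)) & MASK

def bmext(rs, rb, sh):
    shamt = rb & (XLEN - 1)
    return field_mask(sh) & (rs >> shamt)

# a predicate-style mask of 8 ones starting at bit 0, from all-zero inputs
pred = bmset(0, 0, 7)
```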
 358
Bitmask extract with reverse.  This can be done by bit-order-inverting all of RB and getting bits of RB from the opposite end.

When RA is zero, no shift occurs. This makes bmextrev useful for
simply reversing all bits of a register.
 363
```
msb = ra[5:0];
rev[0:msb] = rb[msb:0];
rt = ZE(rev[msb:0]);

uint_xlen_t bmextrev(RA, RB, sh)
{
    int shamt = XLEN-1;
    if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
    shamt = (XLEN-1)-shamt;  // shift other end
    bra = bitreverse(RB);    // swap LSB-MSB
    mask = (2<<sh)-1;
    return mask & (bra >> shamt);
}
```
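
A hedged Python model of `bmextrev` as written above, assuming XLEN=64. Passing `None` for `ra_value` stands in for the "RA field is zero" case; that convention, and the helper `bitreverse`, are assumptions of this sketch rather than anything defined by the spec.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def bitreverse(x):
    """Reverse the order of all XLEN bits."""
    return int(format(x & MASK, "064b")[::-1], 2)

def bmextrev(ra_value, rb, sh):
    """ra_value=None models RA=0 (no shift); otherwise it is the content of GPR(RA)."""
    shamt = XLEN - 1
    if ra_value is not None:
        shamt = ra_value & (XLEN - 1)
    shamt = (XLEN - 1) - shamt      # shift from the other end
    bra = bitreverse(rb)            # swap LSB-MSB
    mask = ((2 << sh) - 1) & MASK
    return mask & (bra >> shamt)

# reverse only the low 8 bits of a value: msb=7, field of 8 bits (sh=7)
low_byte_reversed = bmextrev(7, 0b00000110, 7)
```

With `ra_value=None` and `sh=63` the same function reverses the whole register.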
 379
 380 | 0.5|6.10|11.15|16.20|21.26| 27..30  |31| name   |
 381 | -- | -- | --- | --- | --- | ------- |--| ------ |
 382 | NN | RT | RA  | RB  | sh  | 1   011 |Rc| bmrevi |
 383
 384
 385 # grevlut
 386
Generalised reverse combined with a pair of LUT2s, allowing
a constant `0b0101...0101` when RA=0 and an option to invert
(including when RA=0, giving a constant `0b1010...1010` as the
initial value), provides a wide range of instructions
and a means to set hundreds of regular 64 bit patterns with one
single 32 bit instruction.
 393
 394 the two LUT2s are applied left-half (when not swapping)
 395 and right-half (when swapping) so as to allow a wider
 396 range of options.
 397
 398 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
 399
* A value of `0b11001100` for the immediate provides
the functionality of a standard "grev" (with the pseudocode
below, both LUT2s must select the partner bit).
 402 * `0b11101110` provides gorc
 403
 404 grevlut should be arranged so as to produce the constants
 405 needed to put into bext (bitextract) so as in turn to
 406 be able to emulate x86 pmovmask instructions <https://www.felixcloutier.com/x86/pmovmskb>.
 407 This only requires 2 instructions (grevlut, bext).
 408
 409 Note that if the mask is required to be placed
 410 directly into CR Fields (for use as CR Predicate
 411 masks rather than a integer mask) then sv.cmpi or sv.ori
 412 may be used instead, bearing in mind that sv.ori
 413 is a 64-bit instruction, and `VL` must have been
 414 set to the required length:
 415
 416     sv.ori./elwid=8 r10.v, r10.v, 0
 417
 418 The following settings provide the required mask constants:
 419
 420 | RA       | RB      | imm        | iv | result        |
 421 | -------  | ------- | ---------- | -- | ----------    |
 422 | 0x555..  | 0b10    | 0b01101100 | 0  | 0x111111...   |
 423 | 0x555..  | 0b110   | 0b01101100 | 0  | 0x010101...   |
 424 | 0x555..  | 0b1110  | 0b01101100 | 0  | 0x00010001...   |
 425 | 0x555..  | 0b10    | 0b11000110 | 1  | 0x88888...   |
 426 | 0x555..  | 0b110   | 0b11000110 | 1  | 0x808080...   |
 427 | 0x555..  | 0b1110  | 0b11000110 | 1  | 0x80008000...   |
 428
 429 Better diagram showing the correct ordering of shamt (RB).  A LUT2
 430 is applied to all locations marked in red using the first 4
 431 bits of the immediate, and a separate LUT2 applied to all
 432 locations in green using the upper 4 bits of the immediate.
 433
 434 <img src="/openpower/sv/grevlut.png" width=700 />
 435
 436 demo code [[openpower/sv/grevlut.py]]
 437
```
lut2(imm, a, b):
    idx = b << 1 | a
    return imm[idx] # idx by LSB0 order

dorow(imm8, step_i, chunk_size):
    for j in 0 to 63:
        if (j&chunk_size) == 0
           imm = imm8[0..3]
        else
           imm = imm8[4..7]
        step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
    return step_o

uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 imm, bool iv)
{
    uint64_t x = 0x5555_5555_5555_5555;
    if (RA != 0) x = GPR(RA);
    if (iv) x = ~x;
    int shamt = RB & 63;
    for i in 0 to 5
        step = 1<<i
        if (shamt & step) x = dorow(imm, x, step)
    return x;
}
```
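
The pseudocode above can be turned into a runnable Python sketch, which is handy for checking the mask-constant table. This is not the linked demo file, only an independent reading of the pseudocode: `imm8[0..3]` is taken to be the low nibble in LSB0 order, and `ra=None` stands in for the RA=0 case (both are assumptions of this sketch).

```python
MASK64 = (1 << 64) - 1

def lut2(imm, a, b):
    idx = (b << 1) | a
    return (imm >> idx) & 1          # LSB0 indexing

def dorow(imm8, step_i, chunk_size):
    step_o = 0
    for j in range(64):
        if (j & chunk_size) == 0:
            imm = imm8 & 0xF         # low nibble
        else:
            imm = (imm8 >> 4) & 0xF  # high nibble
        a = (step_i >> j) & 1
        b = (step_i >> (j ^ chunk_size)) & 1
        step_o |= lut2(imm, a, b) << j
    return step_o

def grevlut64(ra, rb, imm, iv):
    """ra=None models RA=0, which selects the constant 0x5555...5555."""
    x = 0x5555555555555555 if ra is None else ra & MASK64
    if iv:
        x = ~x & MASK64
    shamt = rb & 63
    for i in range(6):
        step = 1 << i
        if shamt & step:
            x = dorow(imm, x, step)
    return x

# first row of the table: every 4th bit set
every_fourth = grevlut64(None, 0b10, 0b01101100, False)
```

Under this reading all six rows of the table are reproduced, and the `0b11101110` immediate behaves as gorc (each LUT2 computes `a | b`).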
 465
 466 | 0.5|6.10|11.15|16.20 |21..25   | 26....30    |31| name |
 467 | -- | -- | --- | ---  | -----   | --------    |--| ------ |
 468 | NN | RT | RA  | s0-4 | im0-4   | im5-7  1 iv |s5| grevlogi |
 469 | NN | RT | RA  | RB   | im0-4   | im5-7  00   |1 | grevlog |
 470
 471
 472 # grev
 473
superseded by grevlut
 475
 476 based on RV bitmanip, this is also known as a butterfly network. however
 477 where a butterfly network allows setting of every crossbar setting in
 478 every row and every column, generalised-reverse (grev) only allows
 479 a per-row decision: every entry in the same row must either switch or
 480 not-switch.
 481
 482 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
 483
 484 ```
 485 uint64_t grev64(uint64_t RA, uint64_t RB)
 486 {
 487     uint64_t x = RA;
 488     int shamt = RB & 63;
 489     if (shamt & 1) x = ((x &  0x5555555555555555LL) <<  1) |
 490                         ((x & 0xAAAAAAAAAAAAAAAALL) >>  1);
 491     if (shamt & 2) x = ((x &  0x3333333333333333LL) <<  2) |
 492                         ((x & 0xCCCCCCCCCCCCCCCCLL) >>  2);
 493     if (shamt & 4) x = ((x &  0x0F0F0F0F0F0F0F0FLL) <<  4) |
 494                         ((x & 0xF0F0F0F0F0F0F0F0LL) >>  4);
 495     if (shamt & 8) x = ((x &  0x00FF00FF00FF00FFLL) <<  8) |
 496                         ((x & 0xFF00FF00FF00FF00LL) >>  8);
 497     if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
 498                         ((x & 0xFFFF0000FFFF0000LL) >> 16);
 499     if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
 500                         ((x & 0xFFFFFFFF00000000LL) >> 32);
 501     return x;
 502 }
 503
 504 ```
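
A compact Python equivalent of `grev64` is sketched below, mainly to show what particular `shamt` values do. The table name `STAGES` is hypothetical; the masks are the same ones used in the C code above.

```python
MASK64 = (1 << 64) - 1

# (swap distance, mask selecting the lower element of each pair)
STAGES = [
    (1, 0x5555555555555555),
    (2, 0x3333333333333333),
    (4, 0x0F0F0F0F0F0F0F0F),
    (8, 0x00FF00FF00FF00FF),
    (16, 0x0000FFFF0000FFFF),
    (32, 0x00000000FFFFFFFF),
]

def grev64(ra, rb):
    x = ra & MASK64
    shamt = rb & 63
    for step, lo in STAGES:
        if shamt & step:
            hi = ~lo & MASK64
            x = ((x & lo) << step) | ((x & hi) >> step)
    return x

value = 0x0123456789ABCDEF
byte_swapped = grev64(value, 56)    # stages 8, 16, 32: reverse byte order
bit_reversed = grev64(value, 63)    # all stages: reverse all 64 bits
```

Each stage is its own inverse, so applying grev twice with the same `shamt` returns the original value.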
 505
 506 # gorc
 507
based on RV bitmanip, gorc is superseded by grevlut
 509
 510 ```
 511 uint32_t gorc32(uint32_t RA, uint32_t RB)
 512 {
 513     uint32_t x = RA;
 514     int shamt = RB & 31;
 515     if (shamt & 1) x |= ((x & 0x55555555) << 1)   |  ((x &  0xAAAAAAAA) >> 1);
 516     if (shamt & 2) x |= ((x & 0x33333333) << 2)   |  ((x &  0xCCCCCCCC) >> 2);
 517     if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4)   |  ((x &  0xF0F0F0F0) >> 4);
 518     if (shamt & 8) x |= ((x & 0x00FF00FF) << 8)   |  ((x &  0xFF00FF00) >> 8);
 519     if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) |  ((x &  0xFFFF0000) >> 16);
 520     return x;
 521 }
 522 uint64_t gorc64(uint64_t RA, uint64_t RB)
 523 {
 524     uint64_t x = RA;
 525     int shamt = RB & 63;
 526     if (shamt & 1) x |= ((x & 0x5555555555555555LL)   <<   1) |
 527                          ((x & 0xAAAAAAAAAAAAAAAALL)  >>  1);
 528     if (shamt & 2) x |= ((x & 0x3333333333333333LL)   <<   2) |
 529                          ((x & 0xCCCCCCCCCCCCCCCCLL)  >>  2);
 530     if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL)   <<   4) |
 531                          ((x & 0xF0F0F0F0F0F0F0F0LL)  >>  4);
 532     if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL)   <<   8) |
 533                          ((x & 0xFF00FF00FF00FF00LL)  >>  8);
 534     if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL)  << 16) |
 535                          ((x & 0xFFFF0000FFFF0000LL)  >> 16);
 536     if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL)  << 32) |
 537                          ((x & 0xFFFFFFFF00000000LL)  >> 32);
 538     return x;
 539 }
 540
 541 ```
 542
 543 # xperm
 544
 545 based on RV bitmanip.
 546
RA contains a vector of indices to select parts of RB to be
copied to RT.  The immediate-variant allows up to an 8 bit
pattern (repeated) to be targeted at different parts of RT.

xperm shares some similarity with one of the uses of bmator
in that xperm indices are binary addressing where bmator
may be considered to be unary addressing.
 554
 555 ```
 556 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
 557 {
 558     uint_xlen_t r = 0;
 559     uint_xlen_t sz = 1LL << sz_log2;
 560     uint_xlen_t mask = (1LL << sz) - 1;
 561     uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
 562     for (int i = 0; i < XLEN; i += sz) {
 563         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 564         if (pos < XLEN)
 565             r |= ((RB >> pos) & mask) << i;
 566     }
 567     return r;
 568 }
 569 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
 570 {
 571     uint_xlen_t r = 0;
 572     uint_xlen_t sz = 1LL << sz_log2;
 573     uint_xlen_t mask = (1LL << sz) - 1;
 574     for (int i = 0; i < XLEN; i += sz) {
 575         uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
 576         if (pos < XLEN)
 577             r |= ((RB >> pos) & mask) << i;
 578     }
 579     return r;
 580 }
 581 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
 582 {  return xperm(RA, RB, 2); }
 583 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
 584 {  return xperm(RA, RB, 3); }
 585 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
 586 {  return xperm(RA, RB, 4); }
 587 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
 588 {  return xperm(RA, RB, 5); }
 589 ```
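
A Python transcription of `xperm` and `xpermi` follows, as a sketch assuming XLEN=64, with a byte-reversal example. In `xpermi` the elided `...` of the C code is read as "replicate the immediate into all eight bytes", which is an assumption of this sketch.

```python
XLEN = 64

def xperm(ra, rb, sz_log2):
    r = 0
    sz = 1 << sz_log2
    mask = (1 << sz) - 1
    for i in range(0, XLEN, sz):
        pos = ((ra >> i) & mask) << sz_log2
        if pos < XLEN:               # out-of-range indices yield zero
            r |= ((rb >> pos) & mask) << i
    return r

def xpermi(imm8, rb, sz_log2):
    # assumption: the 8-bit immediate is replicated into every byte of RA
    ra = int.from_bytes(bytes([imm8 & 0xFF] * 8), "little")
    return xperm(ra, rb, sz_log2)

data = 0x1122334455667788
# byte-wise (sz_log2=3): byte i of the result is byte RA[i] of RB
reversed_bytes = xperm(0x0001020304050607, data, 3)
# every index is 0, so the lowest byte of RB is broadcast
broadcast = xpermi(0x00, data, 3)
```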
 590
 591 # bitmatrix
 592
 593 ```
 594 uint64_t bmatflip(uint64_t RA)
 595 {
 596     uint64_t x = RA;
 597     x = shfl64(x, 31);
 598     x = shfl64(x, 31);
 599     x = shfl64(x, 31);
 600     return x;
 601 }
 602 uint64_t bmatxor(uint64_t RA, uint64_t RB)
 603 {
 604     // transpose of RB
 605     uint64_t RBt = bmatflip(RB);
 606     uint8_t u[8]; // rows of RA
 607     uint8_t v[8]; // cols of RB
 608     for (int i = 0; i < 8; i++) {
 609         u[i] = RA >> (i*8);
 610         v[i] = RBt >> (i*8);
 611     }
 612     uint64_t x = 0;
 613     for (int i = 0; i < 64; i++) {
 614         if (pcnt(u[i / 8] & v[i % 8]) & 1)
 615             x |= 1LL << i;
 616     }
 617     return x;
 618 }
 619 uint64_t bmator(uint64_t RA, uint64_t RB)
 620 {
 621     // transpose of RB
 622     uint64_t RBt = bmatflip(RB);
 623     uint8_t u[8]; // rows of RA
 624     uint8_t v[8]; // cols of RB
 625     for (int i = 0; i < 8; i++) {
 626         u[i] = RA >> (i*8);
 627         v[i] = RBt >> (i*8);
 628     }
 629     uint64_t x = 0;
 630     for (int i = 0; i < 64; i++) {
 631         if ((u[i / 8] & v[i % 8]) != 0)
 632             x |= 1LL << i;
 633     }
 634     return x;
 635 }
 636
 637 ```
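
The sketch below models the three bitmatrix operations in Python. Here `bmatflip` is written as a direct 8x8 transpose (one row per byte) rather than via `shfl64`, on the assumption that the two are equivalent; `rows` and `IDENTITY` are hypothetical names used only for this illustration.

```python
def bmatflip(x):
    """Transpose an 8x8 bit matrix stored one row per byte."""
    out = 0
    for r in range(8):
        for c in range(8):
            if (x >> (r * 8 + c)) & 1:
                out |= 1 << (c * 8 + r)
    return out

def rows(x):
    return [(x >> (i * 8)) & 0xFF for i in range(8)]

def bmatxor(ra, rb):
    u = rows(ra)                 # rows of RA
    v = rows(bmatflip(rb))       # columns of RB
    x = 0
    for i in range(64):
        if bin(u[i // 8] & v[i % 8]).count("1") & 1:
            x |= 1 << i
    return x

def bmator(ra, rb):
    u = rows(ra)
    v = rows(bmatflip(rb))
    x = 0
    for i in range(64):
        if (u[i // 8] & v[i % 8]) != 0:
            x |= 1 << i
    return x

IDENTITY = 0x8040201008040201    # ones on the diagonal
a = 0x0123456789ABCDEF
```

`bmatxor` is then a matrix multiply over GF(2), and multiplying by `IDENTITY` on either side leaves `a` unchanged.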
 638
 639 # Introduction to Carry-less and GF arithmetic
 640
 641 * obligatory xkcd <https://xkcd.com/2595/>
 642
 643 There are three completely separate types of Galois-Field-based arithmetic
 644 that we implement which are not well explained even in introductory
 645 literature.  A slightly oversimplified explanation is followed by more
 646 accurate descriptions:
 647
 648 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
 649   but is accidentally referred to as GF(2) - see below as to why.
 650 * `GF(p)` modulo arithmetic with a Prime number, these are "proper"
 651    Galois Fields
 652 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
 653   (2^N) and a second "reducing" polynomial (similar to a prime number), these
 654   are said to be GF(2^N) arithmetic.
 655
 656 further detailed and more precise explanations are provided below
 657
 658 * **Polynomials with coefficients in `GF(2)`**
 659   (aka. Carry-less arithmetic -- the `cl*` instructions).
 660   This isn't actually a Galois Field, but its coefficients are. This is
 661   basically binary integer addition, subtraction, and multiplication like
 662   usual, except that carries aren't propagated at all, effectively turning
 663   both addition and subtraction into the bitwise xor operation. Division and
 664   remainder are defined to match how addition and multiplication works.
 665 * **Galois Fields with a prime size**
 666   (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
 667   This is basically just the integers mod `p`.
* **Galois Fields with a power-of-a-prime size**
  (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
  integer `n > 0`).
  We only implement these for `p == 2`, called Binary Galois Fields
  (`GF(2^n)` -- the `gfb*` instructions).
  For any prime `p`, `GF(p^n)` is implemented as polynomials with
  coefficients in `GF(p)` and degree `< n`, where the polynomials are the
  remainders of dividing by a specifically chosen polynomial in `GF(p)` called
  the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
  Polynomial must be an irreducible polynomial (like primes, but for
  polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
  and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
  affect `GF(p^n)`'s mathematical shape, all that changes is the specific
  polynomials used to implement `GF(p^n)`.
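
The three kinds of arithmetic listed above can be contrasted with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and are not the instruction pseudocode. The worked value uses the AES reducing polynomial `0x11B`, for which FIPS-197 gives `{57} * {83} = {c1}`.

```python
def clmul(a, b):
    """Carry-less multiply: polynomial product with coefficients in GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # "addition" is XOR, no carries
        a <<= 1
        b >>= 1
    return result

def gfp_mul(a, b, p):
    """GF(p) multiply: ordinary integers modulo a prime p."""
    return (a * b) % p

def gfb_mul(a, b, red_poly):
    """GF(2^n) multiply: carry-less product reduced modulo red_poly."""
    product = clmul(a, b)
    deg = red_poly.bit_length() - 1
    while product.bit_length() - 1 >= deg:
        product ^= red_poly << (product.bit_length() - 1 - deg)
    return product

carryless = clmul(0b11, 0b11)          # (x+1)^2 = x^2+1, not 9
prime = gfp_mul(5, 6, 7)               # 30 mod 7
binary = gfb_mul(0x57, 0x83, 0x11B)    # GF(2^8) with the AES polynomial
```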
 682
 683 Many implementations and much of the literature do not make a clear
 684 distinction between these three categories, which makes it confusing
 685 to understand what their purpose and value is.
 686
 687 * carry-less multiply is extremely common and is used for the ubiquitous
 688   CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
 689 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
 690   has significant uses throughout cryptography
 691 * GF(p) is the basis again of a significant quantity of algorithms
 692   (TODO, list them, jacob knows what they are), even though the
 693   modulo is limited to be below 64-bit (size of a scalar int)
 694
 695 # Instructions for Carry-less Operations
 696
 697 aka. Polynomials with coefficients in `GF(2)`
 698
 699 Carry-less addition/subtraction is simply XOR, so a `cladd`
 700 instruction is not provided since the `xor[i]` instruction can be used instead.
 701
 702 These are operations on polynomials with coefficients in `GF(2)`, with the
 703 polynomial's coefficients packed into integers with the following algorithm:
 704
 705 ```python
 706 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
 707 ```
 708
 709 ## Carry-less Multiply Instructions
 710
 711 based on RV bitmanip
 712 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
 713 <https://www.felixcloutier.com/x86/pclmulqdq> and
 714 <https://en.m.wikipedia.org/wiki/Carry-less_product>
 715
 716 They are worth adding as their own non-overwrite operations
 717 (in the same pipeline).
 718
 719 ### `clmul` Carry-less Multiply
 720
 721 ```python
 722 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
 723 ```
 724
 725 ### `clmulh` Carry-less Multiply High
 726
 727 ```python
 728 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
 729 ```
 730
 731 ### `clmulr` Carry-less Multiply (Reversed)
 732
 733 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
 734 bit-reversed inputs.
 735
 736 ```python
 737 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
 738 ```
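
For readers without the inlined reference files to hand, the relationship between the three multiply variants can be sketched as below. This is not the reference implementation: it assumes XLEN=64 and the RV bitmanip definitions (low half, high half, and bits `XLEN-1` to `2*XLEN-2` of the full product); `clmul_full` and `bitrev` are hypothetical helpers.

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def clmul_full(a, b):
    """Full carry-less product, up to 2*XLEN-1 bits wide."""
    a &= MASK
    result = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            result ^= a << i
    return result

def clmul(a, b):
    return clmul_full(a, b) & MASK

def clmulh(a, b):
    return clmul_full(a, b) >> XLEN

def clmulr(a, b):
    return (clmul_full(a, b) >> (XLEN - 1)) & MASK

def bitrev(x):
    return int(format(x & MASK, "064b")[::-1], 2)

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
via_reversal = bitrev(clmul(bitrev(a), bitrev(b)))
```

With these definitions `clmulr(a, b)` and `via_reversal` agree, which is the equivalence stated for `clmulr` above.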
 739
 740 ## `clmadd` Carry-less Multiply-Add
 741
 742 ```
 743 clmadd RT, RA, RB, RC
 744 ```
 745
 746 ```
 747 (RT) = clmul((RA), (RB)) ^ (RC)
 748 ```
 749
 750 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
 751
 752 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
 753 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
 754 to avoid the need for a temp register. RS is written to as well as RT.
 755
Note: Polynomials over GF(2) are a Ring rather than a Field. The
definition of the Inverse Discrete Fourier Transform involves calculating a
multiplicative inverse, which may not exist in every Ring, so the
Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
TODO: check with someone who knows for sure if that's correct.)
 762
 763 ```
 764 cltmadd RT, RA, RB, RC
 765 ```
 766
 767 TODO: add link to explanation for where `RS` comes from.
 768
 769 ```
 770 a = (RA)
 771 c = (RC)
 772 # read all inputs before writing to any outputs in case
 773 # an input overlaps with an output register.
 774 (RT) = clmul(a, (RB)) ^ c
 775 (RS) = a ^ c
 776 ```
 777
 778 ## `cldivrem` Carry-less Division and Remainder
 779
 780 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
 781 for other instructions.
 782
 783 ```python
 784 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
 785 ```
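
As a rough guide to what `cldivrem` computes, here is a minimal long-division sketch in Python. It is not the inlined reference file: the handling of a zero divisor is unknown here, so this sketch simply raises an exception, and the `clmul` helper exists only to check the result.

```python
def clmul(a, b):
    """Unbounded carry-less multiply, used here only for checking."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def cldivrem(n, d, width):
    """Carry-less long division: returns (q, r) with n == clmul(q, d) ^ r."""
    mask = (1 << width) - 1
    n &= mask
    d &= mask
    if d == 0:
        # assumption: the real instruction defines some result for this case
        raise ZeroDivisionError("carry-less division by zero")
    q = 0
    r = n
    d_deg = d.bit_length() - 1
    while r.bit_length() - 1 >= d_deg:
        shift = r.bit_length() - 1 - d_deg
        q |= 1 << shift          # record this quotient term
        r ^= d << shift          # subtract (XOR) the shifted divisor
    return q, r

# x^2 + 1 == (x + 1)^2 over GF(2), so the remainder is zero
q, r = cldivrem(0b101, 0b11, width=64)
```

The remainder always has a lower degree than the divisor, mirroring ordinary integer division.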
 786
 787 ## `cldiv` Carry-less Division
 788
 789 ```
 790 cldiv RT, RA, RB
 791 ```
 792
 793 ```
 794 n = (RA)
 795 d = (RB)
 796 q, r = cldivrem(n, d, width=XLEN)
 797 (RT) = q
 798 ```
 799
 800 ## `clrem` Carry-less Remainder
 801
 802 ```
 803 clrem RT, RA, RB
 804 ```
 805
 806 ```
 807 n = (RA)
 808 d = (RB)
 809 q, r = cldivrem(n, d, width=XLEN)
 810 (RT) = r
 811 ```
 812
 813 # Instructions for Binary Galois Fields `GF(2^m)`
 814
 815 see:
 816
 817 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
 818 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
 819 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
 820
 821 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
 822 instruction is not provided since the `xor[i]` instruction can be used instead.
 823
 824 ## `GFBREDPOLY` SPR -- Reducing Polynomial
 825
 826 In order to save registers and to make operations orthogonal with standard
 827 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
 828 This also allows hardware to pre-compute useful parameters (such as the
 829 degree, or look-up tables) based on the reducing polynomial, and store them
 830 alongside the SPR in hidden registers, only recomputing them whenever the SPR
 831 is written to, rather than having to recompute those values for every
 832 instruction.
 833
Because Galois Fields require the reducing polynomial to be irreducible, any
reducing polynomial of `degree > 1` must have its LSB set: otherwise it would
be divisible by the polynomial `x`, making it reducible, and whatever we're
working on would no longer be a Field. The LSB therefore carries no
information, so we can reuse it to indicate `degree == XLEN`.
 839
 840 ```python
 841 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
 842 ```
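
One way to picture the LSB trick is the sketch below. It is not the reference
decoder: the function name is hypothetical, and the encoding it implements
(LSB set means the `x**XLEN` term is implied, LSB clear means the constant
term `1` is implied) is an assumption drawn from the paragraph above.
Degenerate SPR values such as `0` are deliberately not handled.

```python
def decode_reducing_polynomial_sketch(gfbredpoly, xlen):
    """Return (polynomial, degree) for an SPR value, under the assumed
    encoding described in the lead-in."""
    v = gfbredpoly & ((1 << xlen) - 1)  # only XLEN bits are stored
    if v & 1:
        # LSB set: degree == xlen, so restore the x**xlen term,
        # which does not fit in the SPR
        poly = v | (1 << xlen)
    else:
        # LSB clear: degree < xlen, and the real polynomial's constant
        # term is always 1, so restore it
        poly = v | 1
    return poly, poly.bit_length() - 1


# with XLEN = 8, the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)
# would be stored as 0x1B under this assumed encoding
print(decode_reducing_polynomial_sketch(0x1B, 8))
```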
 843
 844 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
 845
a dedicated instruction is only justified if it is an immediate op:
otherwise `mtspr` is completely sufficient.
 847
 848 ```python
 849 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
 850 ```
 851
 852 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
 853
 854 ```
 855 gfbmul RT, RA, RB
 856 ```
 857
 858 ```python
 859 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
 860 ```
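
For readers unfamiliar with `GF(2^m)` multiplication, here is a minimal
shift-and-XOR sketch. It is not the reference implementation: the name
`gfbmul_sketch` is hypothetical, and the full reducing polynomial (including
its `x**m` term) is passed as an explicit argument rather than being read from
the `GFBREDPOLY` SPR. Inputs are assumed to be already reduced.

```python
def gfbmul_sketch(a, b, poly):
    """Multiply a and b in GF(2^m), reducing modulo `poly` as we go."""
    degree = poly.bit_length() - 1
    product = 0
    while b != 0:
        if b & 1:
            product ^= a  # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1  # multiply a by x
        if a >> degree:
            a ^= poly  # a reached degree m: subtract (XOR) the polynomial
    return product


# worked example from the AES specification (FIPS-197), which uses the
# reducing polynomial x^8 + x^4 + x^3 + x + 1 = 0x11B
print(hex(gfbmul_sketch(0x57, 0x83, 0x11B)))  # 0xc1
```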
 861
 862 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
 863
 864 ```
 865 gfbmadd RT, RA, RB, RC
 866 ```
 867
 868 ```python
 869 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
 870 ```
 871
 872 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
 873
Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
Fourier Transform in-place. This is possible because the instruction is
3-in 2-out, which avoids the need for a temp register: RS is written to as
well as RT.
 877
 878 ```
 879 gfbtmadd RT, RA, RB, RC
 880 ```
 881
 882 TODO: add link to explanation for where `RS` comes from.
 883
 884 ```
 885 a = (RA)
 886 c = (RC)
 887 # read all inputs before writing to any outputs in case
 888 # an input overlaps with an output register.
 889 (RT) = gfbmadd(a, (RB), c)
 890 # use gfbmadd again since it reduces the result
 891 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
 892 ```
 893
 894 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
 895
 896 ```
 897 gfbinv RT, RA
 898 ```
 899
 900 ```python
 901 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
 902 ```
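
One simple way to compute an inverse in `GF(2^m)` is to raise the value to the
power `2^m - 2`, since every non-zero element satisfies `a^(2^m - 1) = 1`. The
sketch below does this by square-and-multiply. It is an illustration only: the
function names are hypothetical, the reducing polynomial is passed explicitly,
the reference implementation may well use a different algorithm, and returning
`0` for an input of `0` is simply what this method produces for `m > 1`, not a
statement about the specified behaviour.

```python
def gf2m_mul(a, b, poly):
    """Shift-and-XOR multiplication in GF(2^m) (helper for this sketch)."""
    degree = poly.bit_length() - 1
    product = 0
    while b != 0:
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a >> degree:
            a ^= poly
    return product


def gfbinv_sketch(a, poly):
    """Inverse of a in GF(2^m), computed as a ** (2^m - 2)."""
    degree = poly.bit_length() - 1
    exponent = (1 << degree) - 2
    result = 1
    base = a
    while exponent != 0:
        if exponent & 1:
            result = gf2m_mul(result, base, poly)
        base = gf2m_mul(base, base, poly)  # square for the next bit
        exponent >>= 1
    return result


inv = gfbinv_sketch(0x53, 0x11B)
print(hex(inv), gf2m_mul(0x53, inv, 0x11B))
```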
 903
 904 # Instructions for Prime Galois Fields `GF(p)`
 905
 906 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
 907
 908 ## `gfpadd` Prime Galois Field `GF(p)` Addition
 909
 910 ```
 911 gfpadd RT, RA, RB
 912 ```
 913
 914 ```python
 915 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
 916 ```
 917
 918 the addition happens on infinite-precision integers
 919
 920 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
 921
 922 ```
 923 gfpsub RT, RA, RB
 924 ```
 925
 926 ```python
 927 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
 928 ```
 929
 930 the subtraction happens on infinite-precision integers
 931
 932 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
 933
 934 ```
 935 gfpmul RT, RA, RB
 936 ```
 937
 938 ```python
 939 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
 940 ```
 941
 942 the multiplication happens on infinite-precision integers
 943
 944 ## `gfpinv` Prime Galois Field `GF(p)` Invert
 945
 946 ```
 947 gfpinv RT, RA
 948 ```
 949
 950 Some potential hardware implementations are found in:
 951 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
 952
 953 ```python
 954 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
 955 ```
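
As background, a textbook way to invert modulo a prime is the extended
Euclidean algorithm, sketched below. The name `gfpinv_sketch` is hypothetical,
the modulus is passed explicitly instead of being read from `GFPRIME`, and
raising an exception for a zero input is an assumption of this sketch. It is
not necessarily the algorithm used by the reference code or by the hardware
designs linked above.

```python
def gfpinv_sketch(a, p):
    """Multiplicative inverse of a modulo the prime p."""
    a %= p
    if a == 0:
        # assumption: behaviour for zero is not modelled in this sketch
        raise ZeroDivisionError("0 has no inverse in GF(p)")
    r0, r1 = p, a
    t0, t1 = 0, 1
    # invariant: t0 * a == r0 (mod p) and t1 * a == r1 (mod p)
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    # for prime p, r0 is now gcd(a, p) == 1, so t0 * a == 1 (mod p)
    return t0 % p


print(gfpinv_sketch(3, 7))  # 5, since 3 * 5 == 15 == 1 (mod 7)
```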
 956
 957 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
 958
 959 ```
 960 gfpmadd RT, RA, RB, RC
 961 ```
 962
 963 ```python
 964 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
 965 ```
 966
the multiplication and addition happen on infinite-precision integers
 968
 969 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
 970
 971 ```
 972 gfpmsub RT, RA, RB, RC
 973 ```
 974
 975 ```python
 976 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
 977 ```
 978
the multiplication and subtraction happen on infinite-precision integers
 980
 981 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
 982
 983 ```
 984 gfpmsubr RT, RA, RB, RC
 985 ```
 986
 987 ```python
 988 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
 989 ```
 990
the multiplication and subtraction happen on infinite-precision integers
 992
 993 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
 994
Used in combination with SV FFT REMAP to perform a full
Number-Theoretic-Transform in-place. This is possible because the instruction
is 3-in 2-out, which avoids the need for a temp register: RS is written to as
well as RT.
 999
1000 ```
1001 gfpmaddsubr RT, RA, RB, RC
1002 ```
1003
1004 TODO: add link to explanation for where `RS` comes from.
1005
1006 ```
1007 factor1 = (RA)
1008 factor2 = (RB)
1009 term = (RC)
1010 # read all inputs before writing to any outputs in case
1011 # an input overlaps with an output register.
1012 (RT) = gfpmadd(factor1, factor2, term)
1013 (RS) = gfpmsubr(factor1, factor2, term)
1014 ```
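
The pair of results has the shape of a Cooley-Tukey butterfly: a sum and a
difference built from the same product. The Python sketch below models it
under two assumptions that are not confirmed by this section: that "reversed"
means the product is subtracted from the addend (`term - factor1 * factor2`),
and that the modulus can be passed explicitly as `p` rather than being read
from `GFPRIME`.

```python
def gfpmadd(factor1, factor2, term, p):
    # Python integers are unbounded, so the product cannot overflow
    return (factor1 * factor2 + term) % p


def gfpmsubr(factor1, factor2, term, p):
    # assumed reading of "reversed": product subtracted from the term
    return (term - factor1 * factor2) % p


def gfpmaddsubr(ra, rb, rc, p):
    """Return the (RT, RS) pair; both are computed from the original inputs."""
    rt = gfpmadd(ra, rb, rc, p)
    rs = gfpmsubr(ra, rb, rc, p)
    return rt, rs


# in GF(17): 3 * 5 + 4 = 19 = 2, and 4 - 3 * 5 = -11 = 6
print(gfpmaddsubr(3, 5, 4, 17))
```

A quick sanity check on any such butterfly is that the two outputs sum to
`2 * term` modulo `p`.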
1015
1016 # Already in POWER ISA
1017
1018 ## count leading/trailing zeros with mask
1019
1020 in v3.1 p105
1021
1022 ```
count ← 0
do i = 0 to 63
    if((RB)i=1) then do
        if((RS)i=1) then break
        count ← count + 1
    end
end
RA ← EXTZ64(count)
1027 ```
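
A small Python model of the leading-zeros variant may help. It assumes Power
ISA MSB0 bit numbering (bit 0 is the most significant bit) and that `count` is
only incremented for bits selected by the mask in `RB`. The function names are
hypothetical.

```python
def bit_msb0(value, i, width=64):
    """Bit i of value in MSB0 numbering (bit 0 is the most significant)."""
    return (value >> (width - 1 - i)) & 1


def cntlzdm_sketch(rs, rb):
    """Count leading zero bits of rs, looking only at bits set in rb."""
    count = 0
    for i in range(64):
        if bit_msb0(rb, i):
            if bit_msb0(rs, i):
                break  # first masked bit that is 1: stop counting
            count += 1
    return count


# mask selects the low 8 bits; the top 4 of those are zero in rs
print(cntlzdm_sketch(0x0F, 0xFF))  # 4
```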
1028
1029 ## bit deposit
1030
pdepd VRT,VRA,VRB, identical to RV bitmanip bdep, found already in v3.1 p106
1032
1033     do while(m < 64)
1034        if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1035           result = VSR[VRA+32].dword[i].bit[63-k]
1036           VSR[VRT+32].dword[i].bit[63-m] = result
1037           k = k + 1
1038        m = m + 1
1039
1040 ```
1041
1042 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1043 {
1044     uint_xlen_t r = 0;
1045     for (int i = 0, j = 0; i < XLEN; i++)
1046         if ((RB >> i) & 1) {
1047             if ((RA >> j) & 1)
1048                 r |= uint_xlen_t(1) << i;
1049             j++;
1050         }
1051     return r;
1052 }
1053
1054 ```
1055
1056 ## bit extract
1057
1058 other way round: identical to RV bext: pextd, found in v3.1 p196
1059
1060 ```
1061 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1062 {
1063     uint_xlen_t r = 0;
1064     for (int i = 0, j = 0; i < XLEN; i++)
1065         if ((RB >> i) & 1) {
1066             if ((RA >> i) & 1)
1067                 r |= uint_xlen_t(1) << j;
1068             j++;
1069         }
1070     return r;
1071 }
1072 ```
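
The following Python transliteration of the two C functions above (with
`XLEN = 64` assumed) shows that deposit and extract undo each other for a
given mask: depositing a value and then extracting with the same mask returns
the original low bits.

```python
XLEN = 64  # assumed register width for this sketch


def bdep(ra, rb):
    """Scatter the low bits of ra into the positions of the set bits of rb."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> j) & 1:
                r |= 1 << i
            j += 1
    return r


def bext(ra, rb):
    """Gather the bits of ra selected by rb into the low bits of the result."""
    r = 0
    j = 0
    for i in range(XLEN):
        if (rb >> i) & 1:
            if (ra >> i) & 1:
                r |= 1 << j
            j += 1
    return r


mask = 0b11010
deposited = bdep(0b101, mask)
print(bin(deposited), bin(bext(deposited, mask)))
```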
1073
1074 ## centrifuge
1075
1076 found in v3.1 p106 so not to be added here
1077
1078 ```
ptr0 = 0
ptr1 = 0
do i = 0 to 63
    if((RB)i=0) then do
        result[ptr0] = (RS)i
        ptr0 = ptr0 + 1
    end
    if((RB)63-i=1) then do
        result[63-ptr1] = (RS)63-i
        ptr1 = ptr1 + 1
    end
end
RA = result
1091 ```
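
The centrifuge operation can be modelled in Python as below. This sketch
assumes MSB0 bit numbering and that each pointer advances only when a bit is
actually placed, so that bits of `RS` under a `0` mask bit are packed to the
left and bits under a `1` mask bit are packed to the right, each group keeping
its original order. The function names are hypothetical.

```python
def bit_msb0(value, i, width=64):
    """Bit i of value in MSB0 numbering (bit 0 is the most significant)."""
    return (value >> (width - 1 - i)) & 1


def cfuged_sketch(rs, rb):
    result_bits = [0] * 64  # indexed in MSB0 order
    ptr0 = 0
    ptr1 = 0
    for i in range(64):
        if bit_msb0(rb, i) == 0:
            # mask bit clear: pack towards the left (most significant end)
            result_bits[ptr0] = bit_msb0(rs, i)
            ptr0 += 1
        if bit_msb0(rb, 63 - i) == 1:
            # mask bit set: pack towards the right (least significant end)
            result_bits[63 - ptr1] = bit_msb0(rs, 63 - i)
            ptr1 += 1
    result = 0
    for bit in result_bits:
        result = (result << 1) | bit
    return result


print(bin(cfuged_sketch(0b1011, 0b0110)))  # 0b1101
```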
1092
1093 ## bit to byte permute
1094
similar to matrix permute in RV bitmanip, which has XOR and OR variants:
these perform a transpose. TODO: this looks like VSX. Is there a scalar
variant in v3.0/v3.1 already?
1098
1099     do j = 0 to 7
1100       do k = 0 to 7
1101          b = VSR[VRB+32].dword[i].byte[k].bit[j]
1102          VSR[VRT+32].dword[i].byte[j].bit[k] = b
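
Treating one doubleword as an 8x8 bit matrix, the loop above is a transpose.
A Python sketch follows. It numbers bytes and bits from the least significant
end, which is an assumption: the result is the same as long as bytes and bits
are numbered with the same convention, because reversing both indices leaves a
transpose unchanged. The function name is hypothetical.

```python
def bit_transpose_8x8(dword):
    """Transpose a 64-bit value viewed as 8 bytes of 8 bits each."""
    result = 0
    for j in range(8):
        for k in range(8):
            b = (dword >> (8 * k + j)) & 1  # source: byte[k].bit[j]
            result |= b << (8 * j + k)      # destination: byte[j].bit[k]
    return result


# one full row (byte 0 all ones) becomes one full column (bit 0 of each byte)
print(hex(bit_transpose_8x8(0xFF)))  # 0x101010101010101
```

Applying the function twice returns the original value, as expected of a
transpose.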
1103
1104 # Appendix
1105
1106 see [[bitmanip/appendix]]
1107