(no commit message)
[libreriscv.git] / openpower / sv / bitmanip.mdwn
1 [[!tag standards]]
2
3 [[!toc levels=1]]
4
5 # Implementation Log
6
7 * ternlogi <https://bugs.libre-soc.org/show_bug.cgi?id=745>
8 * grev <https://bugs.libre-soc.org/show_bug.cgi?id=755>
9 * GF2^M <https://bugs.libre-soc.org/show_bug.cgi?id=782>
10
11
12 # bitmanipulation
13
14 **DRAFT STATUS**
15
16 pseudocode: [[openpower/isa/bitmanip]]
17
18 this extension amalgamates bitmanipulation primitives from many sources, including RISC-V bitmanip, Packed SIMD, AVX-512 and OpenPOWER VSX. Vectorisation and SIMD are removed: these are straight scalar (element) operations making them suitable for embedded applications.
19 Vectorisation Context is provided by [[openpower/sv]].
20
21 When combined with SV, scalar variants of bitmanip operations found in VSX are added so that VSX may be retired as "legacy" in the far future (10 to 20 years). Also, VSX is hundreds of opcodes, requires 128 bit pathways, and is wholly unsuited to low power or embedded scenarios.
22
23 ternlogv is experimental and is the only operation that may be considered a "Packed SIMD". It is added as a variant of the already well-justified ternlog operation (done in AVX512 as an immediate only) "because it looks fun". As it is based on the LUT4 concept it will allow accelerated emulation of FPGAs. Other vendors of ISAs are buying FPGA companies to achieve similar objectives.
24
25 general-purpose Galois Field 2^M operations are added so as to avoid huge custom opcode proliferation across many areas of Computer Science. however for convenience and also to avoid setup costs, some of the more common operations (clmul, crc32) are also added. The expectation is that these operations would all be covered by the same pipeline.
26
27 note that there are brownfield spaces below that could incorporate some of the set-before-first and other scalar operations listed in [[sv/vector_ops]], and
28 the [[sv/av_opcodes]] as well as [[sv/setvl]]
29
30 Useful resource:
31
32 * <https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders>
33 * <https://maths-people.anu.edu.au/~brent/pd/rpb232tr.pdf>
34
35 # summary
36
37 two major opcodes are needed
38
39 ternlog has its own major opcode
40
41 | 29.30 |31| name |
42 | ------ |--| --------- |
43 | 0 0 |Rc| ternlogi |
44 | 0 1 |sz| ternlogv |
45 | 1 iv | | grevlogi |
46
47 2nd major opcode for other bitmanip: minor opcode allocation
48
49 | 28.30 |31| name |
50 | ------ |--| --------- |
51 | -00 |0 | xpermi |
52 | -00 |1 | grevlog |
53 | -01 | | crternlog |
54 | 010 |Rc| bitmask |
55 | 011 | | SVP64 |
56 | 110 |Rc| 1/2-op |
57 | 111 | | bmrevi |
58
59
60 1-op and variants
61
62 | dest | src1 | subop | op |
63 | ---- | ---- | ----- | -------- |
64 | RT | RA | .. | bmatflip |
65
66 2-op and variants
67
68 | dest | src1 | src2 | subop | op |
69 | ---- | ---- | ---- | ----- | -------- |
70 | RT | RA | RB | or | bmatflip |
71 | RT | RA | RB | xor | bmatflip |
72 | RT | RA | RB | | grev |
73 | RT | RA | RB | | clmul* |
74 | RT | RA | RB | | gorc |
75 | RT | RA | RB | shuf | shuffle |
76 | RT | RA | RB | unshuf| shuffle |
77 | RT | RA | RB | width | xperm |
78 | RT | RA | RB | type | av minmax |
79 | RT | RA | RB | | av abs avgadd |
80 | RT | RA | RB | type | vmask ops |
81 | RT | RA | RB | | |
82
83 3 ops
84
85 * grevlog
86 * GF mul-add
87 * bitmask-reverse
88
89 TODO: convert all instructions to use RT and not RS
90
91 | 0.5|6.10|11.15|16.20 |21..25 | 26....30 |31| name |
92 | -- | -- | --- | --- | ----- | -------- |--| ------ |
93 | NN | RT | RA |itype/| im0-4 | im5-7 00 |0 | xpermi |
94 | NN | RT | RA | RB | im0-4 | im5-7 00 |1 | grevlog |
95 | NN | | | | | ----- 01 |m3| crternlog |
96 | NN | RT | RA | RB | RC | mode 010 |Rc| bitmask* |
97 | NN | | | | | 00 011 | | rsvd |
98 | NN | | | | | 01 011 |0 | svshape |
99 | NN | | | | | 01 011 |1 | svremap |
100 | NN | | | | | 10 011 |Rc| svstep |
101 | NN | | | | | 11 011 |Rc| setvl |
102 | NN | | | | | ---- 110 | | 1/2 ops |
103 | NN | RT | RA | RB | sh0-4 | sh5 1 111 |Rc| bmrevi |
104
105 ops (note that av avg and abs as well as vec scalar mask
106 are included here [[sv/vector_ops]], and
107 the [[sv/av_opcodes]])
108
109 TODO: convert from RA, RB, and RC to correct field names of RT, RA, and RB, and
110 double check that instructions didn't need 3 inputs.
111
112 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
113 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
114 | NN | RS | me | sh | SH | ME 0 | nn00 110 |Rc| bmopsi |
115 | NN | RS | RB | sh | SH | 0 1 | nn00 110 |Rc| bmopsi |
116 | NN | RT | RA | RB | 1 | 00 | 0001 110 |Rc| cldiv |
117 | NN | RT | RA | RB | 1 | 01 | 0001 110 |Rc| clmod |
118 | NN | RT | RA | | 1 | 10 | 0001 110 |Rc| bmatflip |
119 | NN | | | | 1 | 11 | 0001 110 |Rc| rsvd |
120 | NN | RA | RB | RC | 0 | 00 | 0001 110 |Rc| vec sbfm |
121 | NN | RA | RB | RC | 0 | 01 | 0001 110 |Rc| vec sofm |
122 | NN | RA | RB | RC | 0 | 10 | 0001 110 |Rc| vec sifm |
123 | NN | RA | RB | RC | 0 | 11 | 0001 110 |Rc| vec cprop |
124 | NN | | | | 0 | | 0101 110 |Rc| rsvd |
125 | NN | RT | RA | RB | 1 | itype | 0101 110 |Rc| xperm |
126 | NN | RA | RB | RC | 0 | itype | 1001 110 |Rc| av minmax |
127 | NN | RA | RB | RC | 1 | 00 | 1001 110 |Rc| av abss |
128 | NN | RA | RB | RC | 1 | 01 | 1001 110 |Rc| av absu |
129 | NN | RA | RB | | 1 | 10 | 1001 110 |Rc| av avgadd |
130 | NN | | | | 1 | 11 | 1001 110 |Rc| rsvd |
131 | NN | | | | | | 1101 110 |Rc| rsvd |
132 | NN | RA | RB | RC | 0 | 00 | 0010 110 |Rc| gorc |
133 | NN | RA | RB | sh | SH | 00 | 1010 110 |Rc| gorci |
134 | NN | RA | RB | RC | 0 | 00 | 0110 110 |Rc| gorcw |
135 | NN | RA | RB | sh | 0 | 00 | 1110 110 |Rc| gorcwi |
136 | NN | RA | RB | RC | 1 | 00 | 1110 110 |Rc| bmator |
137 | NN | RA | RB | RC | 0 | 01 | 0010 110 |Rc| grev |
138 | NN | RA | RB | RC | 1 | 01 | 0010 110 |Rc| clmul |
139 | NN | RA | RB | sh | SH | 01 | 1010 110 |Rc| grevi |
140 | NN | RA | RB | RC | 0 | 01 | 0110 110 |Rc| grevw |
141 | NN | RA | RB | sh | 0 | 01 | 1110 110 |Rc| grevwi |
142 | NN | RA | RB | RC | 1 | 01 | 1110 110 |Rc| bmatxor |
143 | NN | | | | | 10 | --10 110 |Rc| rsvd |
144 | NN | RA | RB | RC | 0 | 11 | 1110 110 |Rc| clmulr |
145 | NN | RA | RB | RC | 1 | 11 | 1110 110 |Rc| clmulh |
146 | NN | | | | | | --11 110 |Rc| rsvd |
147
148 # ternlog bitops
149
150 Similar to FPGA LUTs: for every bit perform a lookup into a table using an 8bit immediate, or in another register.
151
152 Like the x86 AVX512F [vpternlogd/vpternlogq](https://www.felixcloutier.com/x86/vpternlogd:vpternlogq) instructions.
153
154 ## ternlogi
155
156 | 0.5|6.10|11.15|16.20| 21..28|29.30|31|
157 | -- | -- | --- | --- | ----- | --- |--|
158 | NN | RT | RA | RB | im0-7 | 00 |Rc|
159
160 lut3(imm, a, b, c):
161 idx = c << 2 | b << 1 | a
162 return imm[idx] # idx by LSB0 order
163
164 for i in range(64):
165 RT[i] = lut3(imm, RB[i], RA[i], RT[i])
166
167 ## ternlogv
168
169 also, another possible variant involving swizzle-like selection
170 and masking, this only requires 3 64 bit registers (RA, RS, RB) and
171 only 16 LUT3s.
172
173 Note however that unless XLEN matches sz, this instruction
174 is a Read-Modify-Write: RS must be read as a second operand
175 and all unmodified bits preserved. SVP64 may provide limited
176 alternative destination for RS from RS-as-source, but again
177 all unmodified bits must still be copied.
178
179 | 0.5|6.10|11.15|16.20|21.28 | 29.30 |31|
180 | -- | -- | --- | --- | ---- | ----- |--|
181 | NN | RS | RA | RB |idx0-3| 01 |sz|
182
183 SZ = (1+sz) * 8 # 8 or 16
184 raoff = MIN(XLEN, idx0 * SZ)
185 rboff = MIN(XLEN, idx1 * SZ)
186 rcoff = MIN(XLEN, idx2 * SZ)
187 rsoff = MIN(XLEN, idx3 * SZ)
188 imm = RB[0:8]
189 for i in range(MIN(XLEN, SZ)):
190 ra = RA[raoff:+i]
191 rb = RA[rboff+i]
192 rc = RA[rcoff+i]
193 res = lut3(imm, ra, rb, rc)
194 RS[rsoff+i] = res
195
196 ## ternlogcr
197
198 another mode selection would be CRs not Ints.
199
200 | 0.5|6.8 | 9.11|12.14|15.17|18.20|21.28 | 29.30|31|
201 | -- | -- | --- | --- | --- |-----|----- | -----|--|
202 | NN | BT | BA | BB | BC |m0-2 | imm | 01 |m3|
203
204 mask = m0-3,m4
205 for i in range(4):
206 a,b,c = CRs[BA][i], CRs[BB][i], CRs[BC][i])
207 if mask[i] CRs[BT][i] = lut3(imm, a, b, c)
208
209 # int ops
210
211 ## min/m
212
213 required for the [[sv/av_opcodes]]
214
215 signed and unsigned min/max for integer. this is sort-of partly synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg is one of the sources, but not both signed and unsigned. when the dest is also one of the srces and the mv fails due to the CR bittest failing this will only overwrite the dest where the src is greater (or less).
216
217 signed/unsigned min/max gives more flexibility.
218
219 ```
220 uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
221 { return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
222 }
223 uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
224 { return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
225 }
226 uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
227 { return rs1 < rs2 ? rs1 : rs2;
228 }
229 uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
230 { return rs1 > rs2 ? rs1 : rs2;
231 }
232 ```
233
234 ## average
235
236 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
237 but not scalar
238
239 ```
240 uint_xlen_t intavg(uint_xlen_t rs1, uint_xlen_t rs2) {
241 return (rs1 + rs2 + 1) >> 1:
242 }
243 ```
244
245 ## abs
246
247 required for the [[sv/av_opcodes]], these exist in Packed SIMD (VSX)
248 but not scalar
249
250 ```
251 uint_xlen_t intabs(uint_xlen_t rs1, uint_xlen_t rs2) {
252 return (src1 > src2) ? (src1-src2) : (src2-src1)
253 }
254 ```
255
256 # cmix
257
258 based on RV bitmanip, covered by ternlog bitops
259
260 ```
261 uint_xlen_t cmix(uint_xlen_t RA, uint_xlen_t RB, uint_xlen_t RC) {
262 return (RA & RB) | (RC & ~RB);
263 }
264 ```
265
266
267 # bitmask set
268
269 based on RV bitmanip singlebit set, instruction format similar to shift
270 [[isa/fixedshift]]. bmext is actually covered already (shift-with-mask rldicl but only immediate version).
271 however bitmask-invert is not, and set/clr are not covered, although they can use the same Shift ALU.
272
273 bmext (RB) version is not the same as rldicl because bmext is a right shift by RC, where rldicl is a left rotate. for the immediate version this does not matter, so a bmexti is not required.
274 bmrev however there is no direct equivalent and consequently a bmrevi is required.
275
276 bmset (register for mask amount) is particularly useful for creating
277 predicate masks where the length is a dynamic runtime quantity.
278 bmset(RA=0, RB=0, RC=mask) will produce a run of ones of length "mask" in a single instruction without needing to initialise or depend on any other registers.
279
280 | 0.5|6.10|11.15|16.20|21.25| 26..30 |31| name |
281 | -- | -- | --- | --- | --- | ------- |--| ----- |
282 | NN | RS | RA | RB | RC | mode 010 |Rc| bm* |
283
284 Immediate-variant is an overwrite form:
285
286 | 0.5|6.10|11.15|16.20| 21 | 22.23 | 24....30 |31| name |
287 | -- | -- | --- | --- | -- | ----- | -------- |--| ---- |
288 | NN | RS | RB | sh | SH | itype | 1000 110 |Rc| bm*i |
289
290 ```
291 def MASK(x, y):
292 if x < y:
293 x = x+1
294 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
295 mask_b = ((1 << y) - 1) & ((1 << 64) - 1)
296 elif x == y:
297 return 1 << x
298 else:
299 x = x+1
300 mask_a = ((1 << x) - 1) & ((1 << 64) - 1)
301 mask_b = (~((1 << y) - 1)) & ((1 << 64) - 1)
302 return mask_a ^ mask_b
303
304
305 uint_xlen_t bmset(RS, RB, sh)
306 {
307 int shamt = RB & (XLEN - 1);
308 mask = (2<<sh)-1;
309 return RS | (mask << shamt);
310 }
311
312 uint_xlen_t bmclr(RS, RB, sh)
313 {
314 int shamt = RB & (XLEN - 1);
315 mask = (2<<sh)-1;
316 return RS & ~(mask << shamt);
317 }
318
319 uint_xlen_t bminv(RS, RB, sh)
320 {
321 int shamt = RB & (XLEN - 1);
322 mask = (2<<sh)-1;
323 return RS ^ (mask << shamt);
324 }
325
326 uint_xlen_t bmext(RS, RB, sh)
327 {
328 int shamt = RB & (XLEN - 1);
329 mask = (2<<sh)-1;
330 return mask & (RS >> shamt);
331 }
332 ```
333
334 bitmask extract with reverse. can be done by bit-order-inverting all of RB and getting bits of RB from the opposite end.
335
336 when RA is zero, no shift occurs. this makes bmextrev useful for
337 simply reversing all bits of a register.
338
339 ```
340 msb = ra[5:0];
341 rev[0:msb] = rb[msb:0];
342 rt = ZE(rev[msb:0]);
343
344 uint_xlen_t bmextrev(RA, RB, sh)
345 {
346 int shamt = XLEN-1;
347 if (RA != 0) shamt = (GPR(RA) & (XLEN - 1));
348 shamt = (XLEN-1)-shamt; # shift other end
349 bra = bitreverse(RB) # swap LSB-MSB
350 mask = (2<<sh)-1;
351 return mask & (bra >> shamt);
352 }
353 ```
354
355 | 0.5|6.10|11.15|16.20|21.26| 27..30 |31| name |
356 | -- | -- | --- | --- | --- | ------- |--| ------ |
357 | NN | RT | RA | RB | sh | 1 011 |Rc| bmrevi |
358
359
360 # grevlut
361
362 generalised reverse combined with a pair of LUT2s and allowing
363 a constant `0b0101...0101` when RA=0, and an option to invert
364 (including when RA=0, giving a constant 0b1010...1010 as the
365 initial value) provides a wide range of instructions
366 and a means to set hundreds of regular 64 bit patterns with one
367 single 32 bit instruction.
368
369 the two LUT2s are applied left-half (when not swapping)
370 and right-half (when swapping) so as to allow a wider
371 range of options.
372
373 <img src="/openpower/sv/grevlut2x2.jpg" width=700 />
374
375 * A value of `0b11001010` for the immediate provides
376 the functionality of a standard "grev".
377 * `0b11101110` provides gorc
378
379 grevlut should be arranged so as to produce the constants
380 needed to put into bext (bitextract) so as in turn to
381 be able to emulate x86 pmovmask instructions <https://www.felixcloutier.com/x86/pmovmskb>.
382 This only requires 2 instructions (grevlut, bext).
383
384 Note that if the mask is required to be placed
385 directly into CR Fields (for use as CR Predicate
386 masks rather than a integer mask) then sv.cmpi or sv.ori
387 may be used instead, bearing in mind that sv.ori
388 is a 64-bit instruction, and `VL` must have been
389 set to the required length:
390
391 sv.ori./elwid=8 r10.v, r10.v, 0
392
393 The following settings provide the required mask constants:
394
395 | RA | RB | imm | iv | result |
396 | ------- | ------- | ---------- | -- | ---------- |
397 | 0x555.. | 0b10 | 0b01101100 | 0 | 0x111111... |
398 | 0x555.. | 0b110 | 0b01101100 | 0 | 0x010101... |
399 | 0x555.. | 0b1110 | 0b01101100 | 0 | 0x00010001... |
400 | 0x555.. | 0b10 | 0b11000110 | 1 | 0x88888... |
401 | 0x555.. | 0b110 | 0b11000110 | 1 | 0x808080... |
402 | 0x555.. | 0b1110 | 0b11000110 | 1 | 0x80008000... |
403
404 Better diagram showing the correct ordering of shamt (RB). A LUT2
405 is applied to all locations marked in red using the first 4
406 bits of the immediate, and a separate LUT2 applied to all
407 locations in green using the upper 4 bits of the immediate.
408
409 <img src="/openpower/sv/grevlut.png" width=700 />
410
411 demo code [[openpower/sv/grevlut.py]]
412
413 ```
414 lut2(imm, a, b):
415 idx = b << 1 | a
416 return imm[idx] # idx by LSB0 order
417
418 dorow(imm8, step_i, chunksize):
419 for j in 0 to 63:
420 if (j&chunk_size) == 0
421 imm = imm8[0..3]
422 else
423 imm = imm8[4..7]
424 step_o[j] = lut2(imm, step_i[j], step_i[j ^ chunk_size])
425 return step_o
426
427 uint64_t grevlut64(uint64_t RA, uint64_t RB, uint8 imm, bool iv)
428 {
429 uint64_t x = 0x5555_5555_5555_5555;
430 if (RA != 0) x = GPR(RA);
431 if (iv) x = ~x;
432 int shamt = RB & 63;
433 for i in 0 to 6
434 step = 1<<i
435 if (shamt & step) x = dorow(imm, x, step)
436 return x;
437 }
438
439 ```
440
441 | 0.5|6.10|11.15|16.20 |21..25 | 26....30 |31| name |
442 | -- | -- | --- | --- | ----- | -------- |--| ------ |
443 | NN | RT | RA | s0-4 | im0-4 | im5-7 1 iv |s5| grevlogi |
444 | NN | RT | RA | RB | im0-4 | im5-7 00 |1 | grevlog |
445
446
447 # grev
448
449 based on RV bitmanip, this is also known as a butterfly network. however
450 where a butterfly network allows setting of every crossbar setting in
451 every row and every column, generalised-reverse (grev) only allows
452 a per-row decision: every entry in the same row must either switch or
453 not-switch.
454
455 <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Butterfly_Network.jpg/474px-Butterfly_Network.jpg" />
456
457 ```
458 uint64_t grev64(uint64_t RA, uint64_t RB)
459 {
460 uint64_t x = RA;
461 int shamt = RB & 63;
462 if (shamt & 1) x = ((x & 0x5555555555555555LL) << 1) |
463 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
464 if (shamt & 2) x = ((x & 0x3333333333333333LL) << 2) |
465 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
466 if (shamt & 4) x = ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
467 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
468 if (shamt & 8) x = ((x & 0x00FF00FF00FF00FFLL) << 8) |
469 ((x & 0xFF00FF00FF00FF00LL) >> 8);
470 if (shamt & 16) x = ((x & 0x0000FFFF0000FFFFLL) << 16) |
471 ((x & 0xFFFF0000FFFF0000LL) >> 16);
472 if (shamt & 32) x = ((x & 0x00000000FFFFFFFFLL) << 32) |
473 ((x & 0xFFFFFFFF00000000LL) >> 32);
474 return x;
475 }
476
477 ```
478
479 # xperm
480
481 based on RV bitmanip.
482
483 RA contains a vector of indices to select parts of RB to be
484 copied to RT. The immediate-variant allows up to an 8 bit
485 pattern (repeated) to be targetted at different parts of RT.
486
487 xperm shares some similarity with one of the uses of bmator
488 in that xperm indices are binary addressing where bitmator
489 may be considered to be unary addressing.
490
491 ```
492 uint_xlen_t xpermi(uint8_t imm8, uint_xlen_t RB, int sz_log2)
493 {
494 uint_xlen_t r = 0;
495 uint_xlen_t sz = 1LL << sz_log2;
496 uint_xlen_t mask = (1LL << sz) - 1;
497 uint_xlen_t RA = imm8 | imm8<<8 | ... | imm8<<56;
498 for (int i = 0; i < XLEN; i += sz) {
499 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
500 if (pos < XLEN)
501 r |= ((RB >> pos) & mask) << i;
502 }
503 return r;
504 }
505 uint_xlen_t xperm(uint_xlen_t RA, uint_xlen_t RB, int sz_log2)
506 {
507 uint_xlen_t r = 0;
508 uint_xlen_t sz = 1LL << sz_log2;
509 uint_xlen_t mask = (1LL << sz) - 1;
510 for (int i = 0; i < XLEN; i += sz) {
511 uint_xlen_t pos = ((RA >> i) & mask) << sz_log2;
512 if (pos < XLEN)
513 r |= ((RB >> pos) & mask) << i;
514 }
515 return r;
516 }
517 uint_xlen_t xperm_n (uint_xlen_t RA, uint_xlen_t RB)
518 { return xperm(RA, RB, 2); }
519 uint_xlen_t xperm_b (uint_xlen_t RA, uint_xlen_t RB)
520 { return xperm(RA, RB, 3); }
521 uint_xlen_t xperm_h (uint_xlen_t RA, uint_xlen_t RB)
522 { return xperm(RA, RB, 4); }
523 uint_xlen_t xperm_w (uint_xlen_t RA, uint_xlen_t RB)
524 { return xperm(RA, RB, 5); }
525 ```
526
527 # gorc
528
529 based on RV bitmanip
530
531 ```
532 uint32_t gorc32(uint32_t RA, uint32_t RB)
533 {
534 uint32_t x = RA;
535 int shamt = RB & 31;
536 if (shamt & 1) x |= ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
537 if (shamt & 2) x |= ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
538 if (shamt & 4) x |= ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
539 if (shamt & 8) x |= ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
540 if (shamt & 16) x |= ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
541 return x;
542 }
543 uint64_t gorc64(uint64_t RA, uint64_t RB)
544 {
545 uint64_t x = RA;
546 int shamt = RB & 63;
547 if (shamt & 1) x |= ((x & 0x5555555555555555LL) << 1) |
548 ((x & 0xAAAAAAAAAAAAAAAALL) >> 1);
549 if (shamt & 2) x |= ((x & 0x3333333333333333LL) << 2) |
550 ((x & 0xCCCCCCCCCCCCCCCCLL) >> 2);
551 if (shamt & 4) x |= ((x & 0x0F0F0F0F0F0F0F0FLL) << 4) |
552 ((x & 0xF0F0F0F0F0F0F0F0LL) >> 4);
553 if (shamt & 8) x |= ((x & 0x00FF00FF00FF00FFLL) << 8) |
554 ((x & 0xFF00FF00FF00FF00LL) >> 8);
555 if (shamt & 16) x |= ((x & 0x0000FFFF0000FFFFLL) << 16) |
556 ((x & 0xFFFF0000FFFF0000LL) >> 16);
557 if (shamt & 32) x |= ((x & 0x00000000FFFFFFFFLL) << 32) |
558 ((x & 0xFFFFFFFF00000000LL) >> 32);
559 return x;
560 }
561
562 ```
563
564 # bitmatrix
565
566 ```
567 uint64_t bmatflip(uint64_t RA)
568 {
569 uint64_t x = RA;
570 x = shfl64(x, 31);
571 x = shfl64(x, 31);
572 x = shfl64(x, 31);
573 return x;
574 }
575 uint64_t bmatxor(uint64_t RA, uint64_t RB)
576 {
577 // transpose of RB
578 uint64_t RBt = bmatflip(RB);
579 uint8_t u[8]; // rows of RA
580 uint8_t v[8]; // cols of RB
581 for (int i = 0; i < 8; i++) {
582 u[i] = RA >> (i*8);
583 v[i] = RBt >> (i*8);
584 }
585 uint64_t x = 0;
586 for (int i = 0; i < 64; i++) {
587 if (pcnt(u[i / 8] & v[i % 8]) & 1)
588 x |= 1LL << i;
589 }
590 return x;
591 }
592 uint64_t bmator(uint64_t RA, uint64_t RB)
593 {
594 // transpose of RB
595 uint64_t RBt = bmatflip(RB);
596 uint8_t u[8]; // rows of RA
597 uint8_t v[8]; // cols of RB
598 for (int i = 0; i < 8; i++) {
599 u[i] = RA >> (i*8);
600 v[i] = RBt >> (i*8);
601 }
602 uint64_t x = 0;
603 for (int i = 0; i < 64; i++) {
604 if ((u[i / 8] & v[i % 8]) != 0)
605 x |= 1LL << i;
606 }
607 return x;
608 }
609
610 ```
611
612 # Introduction to Carry-less and GF arithmetic
613
614 * obligatory xkcd <https://xkcd.com/2595/>
615
616 There are three completely separate types of Galois-Field-based
617 arithmetic that we implement which are not well explained even in introductory literature. A slightly oversimplified explanation
618 is followed by more accurate descriptions:
619
620 * `GF(2)` carry-less binary arithmetic. this is not actually a Galois Field,
621 but is accidentally referred to as GF(2) - see below as to why.
622 * `GF(p)` modulo arithmetic with a Prime number, these are "proper" Galois Fields
623 * `GF(2^N)` carry-less binary arithmetic with two limits: modulo a power-of-2
624 (2^N) and a second "reducing" polynomial (similar to a prime number), these
625 are said to be GF(2^N) arithmetic.
626
627 further detailed and more precise explanations are provided below
628
629 * **Polynomials with coefficients in `GF(2)`**
630 (aka. Carry-less arithmetic -- the `cl*` instructions).
631 This isn't actually a Galois Field, but its coefficients are. This is
632 basically binary integer addition, subtraction, and multiplication like
633 usual, except that carries aren't propagated at all, effectively turning
634 both addition and subtraction into the bitwise xor operation. Division and
635 remainder are defined to match how addition and multiplication works.
636 * **Galois Fields with a prime size**
637 (aka. `GF(p)` or Prime Galois Fields -- the `gfp*` instructions).
638 This is basically just the integers mod `p`.
639 * **Galois Fields with a power-of-a-prime size**
640 (aka. `GF(p^n)` or `GF(q)` where `q == p^n` for prime `p` and
641 integer `n > 0`).
642 We only implement these for `p == 2`, called Binary Galois Fields
643 (`GF(2^n)` -- the `gfb*` instructions).
644 For any prime `p`, `GF(p^n)` is implemented as polynomials with
645 coefficients in `GF(p)` and degree `< n`, where the polynomials are the
646 remainders of dividing by a specificly chosen polynomial in `GF(p)` called
647 the Reducing Polynomial (we will denote that by `red_poly`). The Reducing
648 Polynomial must be an irreducable polynomial (like primes, but for
649 polynomials), as well as have degree `n`. All `GF(p^n)` for the same `p`
650 and `n` are isomorphic to each other -- the choice of `red_poly` doesn't
651 affect `GF(p^n)`'s mathematical shape, all that changes is the specific
652 polynomials used to implement `GF(p^n)`.
653
654 Many implementations and much of the literature do not make a clear
655 distinction between these three categories, which makes it confusing
656 to understand what their purpose and value is.
657
658 * carry-less multiply is extremely common and is used for the ubiquitous
659 CRC32 algorithm. [TODO add many others, helps justify to ISA WG]
660 * GF(2^N) forms the basis of Rijndael (the current AES standard) and
661 has significant uses throughout cryptography
662 * GF(p) is the basis again of a significant quantity of algorithms
663 (TODO, list them, jacob knows what they are), even though the
664 modulo is limited to be below 64-bit (size of a scalar int)
665
666 # Instructions for Carry-less Operations
667
668 aka. Polynomials with coefficients in `GF(2)`
669
670 Carry-less addition/subtraction is simply XOR, so a `cladd`
671 instruction is not provided since the `xor[i]` instruction can be used instead.
672
673 These are operations on polynomials with coefficients in `GF(2)`, with the
674 polynomial's coefficients packed into integers with the following algorithm:
675
676 ```python
677 [[!inline pagenames="gf_reference/pack_poly.py" raw="yes"]]
678 ```
679
680 ## Carry-less Multiply Instructions
681
682 based on RV bitmanip
683 see <https://en.wikipedia.org/wiki/CLMUL_instruction_set> and
684 <https://www.felixcloutier.com/x86/pclmulqdq> and
685 <https://en.m.wikipedia.org/wiki/Carry-less_product>
686
687 They are worth adding as their own non-overwrite operations
688 (in the same pipeline).
689
690 ### `clmul` Carry-less Multiply
691
692 ```python
693 [[!inline pagenames="gf_reference/clmul.py" raw="yes"]]
694 ```
695
696 ### `clmulh` Carry-less Multiply High
697
698 ```python
699 [[!inline pagenames="gf_reference/clmulh.py" raw="yes"]]
700 ```
701
702 ### `clmulr` Carry-less Multiply (Reversed)
703
704 Useful for CRCs. Equivalent to bit-reversing the result of `clmul` on
705 bit-reversed inputs.
706
707 ```python
708 [[!inline pagenames="gf_reference/clmulr.py" raw="yes"]]
709 ```
710
711 ## `clmadd` Carry-less Multiply-Add
712
713 ```
714 clmadd RT, RA, RB, RC
715 ```
716
717 ```
718 (RT) = clmul((RA), (RB)) ^ (RC)
719 ```
720
721 ## `cltmadd` Twin Carry-less Multiply-Add (for FFTs)
722
723 Used in combination with SV FFT REMAP to perform a full Discrete Fourier
724 Transform of Polynomials over GF(2) in-place. Possible by having 3-in 2-out,
725 to avoid the need for a temp register. RS is written to as well as RT.
726
727 Note: Polynomials over GF(2) are a Ring rather than a Field, so, because the
728 definition of the Inverse Discrete Fourier Transform involves calculating a
729 multiplicative inverse, which may not exist in every Ring, therefore the
730 Inverse Discrete Fourier Transform may not exist. (AFAICT the number of inputs
731 to the IDFT must be odd for the IDFT to be defined for Polynomials over GF(2).
732 TODO: check with someone who knows for sure if that's correct.)
733
734 ```
735 cltmadd RT, RA, RB, RC
736 ```
737
738 TODO: add link to explanation for where `RS` comes from.
739
740 ```
741 a = (RA)
742 c = (RC)
743 # read all inputs before writing to any outputs in case
744 # an input overlaps with an output register.
745 (RT) = clmul(a, (RB)) ^ c
746 (RS) = a ^ c
747 ```
748
749 ## `cldivrem` Carry-less Division and Remainder
750
751 `cldivrem` isn't an actual instruction, but is just used in the pseudo-code
752 for other instructions.
753
754 ```python
755 [[!inline pagenames="gf_reference/cldivrem.py" raw="yes"]]
756 ```
757
758 ## `cldiv` Carry-less Division
759
760 ```
761 cldiv RT, RA, RB
762 ```
763
764 ```
765 n = (RA)
766 d = (RB)
767 q, r = cldivrem(n, d, width=XLEN)
768 (RT) = q
769 ```
770
771 ## `clrem` Carry-less Remainder
772
773 ```
774 clrem RT, RA, RB
775 ```
776
777 ```
778 n = (RA)
779 d = (RB)
780 q, r = cldivrem(n, d, width=XLEN)
781 (RT) = r
782 ```
783
784 # Instructions for Binary Galois Fields `GF(2^m)`
785
786 see:
787
788 * <https://courses.csail.mit.edu/6.857/2016/files/ffield.py>
789 * <https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture7.pdf>
790 * <https://foss.heptapod.net/math/libgf2/-/blob/branch/default/src/libgf2/gf2.py>
791
792 Binary Galois Field addition/subtraction is simply XOR, so a `gfbadd`
793 instruction is not provided since the `xor[i]` instruction can be used instead.
794
795 ## `GFBREDPOLY` SPR -- Reducing Polynomial
796
797 In order to save registers and to make operations orthogonal with standard
798 arithmetic, the reducing polynomial is stored in a dedicated SPR `GFBREDPOLY`.
799 This also allows hardware to pre-compute useful parameters (such as the
800 degree, or look-up tables) based on the reducing polynomial, and store them
801 alongside the SPR in hidden registers, only recomputing them whenever the SPR
802 is written to, rather than having to recompute those values for every
803 instruction.
804
805 Because Galois Fields require the reducing polynomial to be an irreducible
806 polynomial, that guarantees that any polynomial of `degree > 1` must have
807 the LSB set, since otherwise it would be divisible by the polynomial `x`,
808 making it reducible, making whatever we're working on no longer a Field.
809 Therefore, we can reuse the LSB to indicate `degree == XLEN`.
810
811 ```python
812 [[!inline pagenames="gf_reference/decode_reducing_polynomial.py" raw="yes"]]
813 ```
814
815 ## `gfbredpoly` -- Set the Reducing Polynomial SPR `GFBREDPOLY`
816
817 unless this is an immediate op, `mtspr` is completely sufficient.
818
819 ```python
820 [[!inline pagenames="gf_reference/gfbredpoly.py" raw="yes"]]
821 ```
822
823 ## `gfbmul` -- Binary Galois Field `GF(2^m)` Multiplication
824
825 ```
826 gfbmul RT, RA, RB
827 ```
828
829 ```python
830 [[!inline pagenames="gf_reference/gfbmul.py" raw="yes"]]
831 ```
832
833 ## `gfbmadd` -- Binary Galois Field `GF(2^m)` Multiply-Add
834
835 ```
836 gfbmadd RT, RA, RB, RC
837 ```
838
839 ```python
840 [[!inline pagenames="gf_reference/gfbmadd.py" raw="yes"]]
841 ```
842
843 ## `gfbtmadd` -- Binary Galois Field `GF(2^m)` Twin Multiply-Add (for FFT)
844
845 Used in combination with SV FFT REMAP to perform a full `GF(2^m)` Discrete
846 Fourier Transform in-place. Possible by having 3-in 2-out, to avoid the need
847 for a temp register. RS is written to as well as RT.
848
849 ```
850 gfbtmadd RT, RA, RB, RC
851 ```
852
853 TODO: add link to explanation for where `RS` comes from.
854
855 ```
856 a = (RA)
857 c = (RC)
858 # read all inputs before writing to any outputs in case
859 # an input overlaps with an output register.
860 (RT) = gfbmadd(a, (RB), c)
861 # use gfbmadd again since it reduces the result
862 (RS) = gfbmadd(a, 1, c) # "a * 1 + c"
863 ```
864
865 ## `gfbinv` -- Binary Galois Field `GF(2^m)` Inverse
866
867 ```
868 gfbinv RT, RA
869 ```
870
871 ```python
872 [[!inline pagenames="gf_reference/gfbinv.py" raw="yes"]]
873 ```
874
875 # Instructions for Prime Galois Fields `GF(p)`
876
877 ## `GFPRIME` SPR -- Prime Modulus For `gfp*` Instructions
878
879 ## `gfpadd` Prime Galois Field `GF(p)` Addition
880
881 ```
882 gfpadd RT, RA, RB
883 ```
884
885 ```python
886 [[!inline pagenames="gf_reference/gfpadd.py" raw="yes"]]
887 ```
888
889 the addition happens on infinite-precision integers
890
891 ## `gfpsub` Prime Galois Field `GF(p)` Subtraction
892
893 ```
894 gfpsub RT, RA, RB
895 ```
896
897 ```python
898 [[!inline pagenames="gf_reference/gfpsub.py" raw="yes"]]
899 ```
900
901 the subtraction happens on infinite-precision integers
902
903 ## `gfpmul` Prime Galois Field `GF(p)` Multiplication
904
905 ```
906 gfpmul RT, RA, RB
907 ```
908
909 ```python
910 [[!inline pagenames="gf_reference/gfpmul.py" raw="yes"]]
911 ```
912
913 the multiplication happens on infinite-precision integers
914
915 ## `gfpinv` Prime Galois Field `GF(p)` Invert
916
917 ```
918 gfpinv RT, RA
919 ```
920
921 Some potential hardware implementations are found in:
922 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5233&rep=rep1&type=pdf>
923
924 ```python
925 [[!inline pagenames="gf_reference/gfpinv.py" raw="yes"]]
926 ```
927
928 ## `gfpmadd` Prime Galois Field `GF(p)` Multiply-Add
929
930 ```
931 gfpmadd RT, RA, RB, RC
932 ```
933
934 ```python
935 [[!inline pagenames="gf_reference/gfpmadd.py" raw="yes"]]
936 ```
937
938 the multiplication and addition happens on infinite-precision integers
939
940 ## `gfpmsub` Prime Galois Field `GF(p)` Multiply-Subtract
941
942 ```
943 gfpmsub RT, RA, RB, RC
944 ```
945
946 ```python
947 [[!inline pagenames="gf_reference/gfpmsub.py" raw="yes"]]
948 ```
949
950 the multiplication and subtraction happens on infinite-precision integers
951
952 ## `gfpmsubr` Prime Galois Field `GF(p)` Multiply-Subtract-Reversed
953
954 ```
955 gfpmsubr RT, RA, RB, RC
956 ```
957
958 ```python
959 [[!inline pagenames="gf_reference/gfpmsubr.py" raw="yes"]]
960 ```
961
962 the multiplication and subtraction happens on infinite-precision integers
963
964 ## `gfpmaddsubr` Prime Galois Field `GF(p)` Multiply-Add and Multiply-Sub-Reversed (for FFT)
965
966 Used in combination with SV FFT REMAP to perform
967 a full Number-Theoretic-Transform in-place. Possible by having 3-in 2-out,
968 to avoid the need for a temp register. RS is written
969 to as well as RT.
970
971 ```
972 gfpmaddsubr RT, RA, RB, RC
973 ```
974
975 TODO: add link to explanation for where `RS` comes from.
976
977 ```
978 factor1 = (RA)
979 factor2 = (RB)
980 term = (RC)
981 # read all inputs before writing to any outputs in case
982 # an input overlaps with an output register.
983 (RT) = gfpmadd(factor1, factor2, term)
984 (RS) = gfpmsubr(factor1, factor2, term)
985 ```
986
987 # Already in POWER ISA
988
989 ## count leading/trailing zeros with mask
990
991 in v3.1 p105
992
993 ```
994 count = 0
995 do i = 0 to 63 if((RB)i=1) then do
996 if((RS)i=1) then break end end count ← count + 1
997 RA ← EXTZ64(count)
998 ```
999
1000 ## bit deposit
1001
1002 pdepd VRT,VRA,VRB, identical to RV bitmamip bdep, found already in v3.1 p106
1003
1004 do while(m < 64)
1005 if VSR[VRB+32].dword[i].bit[63-m]=1 then do
1006 result = VSR[VRA+32].dword[i].bit[63-k]
1007 VSR[VRT+32].dword[i].bit[63-m] = result
1008 k = k + 1
1009 m = m + 1
1010
1011 ```
1012
1013 uint_xlen_t bdep(uint_xlen_t RA, uint_xlen_t RB)
1014 {
1015 uint_xlen_t r = 0;
1016 for (int i = 0, j = 0; i < XLEN; i++)
1017 if ((RB >> i) & 1) {
1018 if ((RA >> j) & 1)
1019 r |= uint_xlen_t(1) << i;
1020 j++;
1021 }
1022 return r;
1023 }
1024
1025 ```
1026
1027 ## bit extract
1028
1029 other way round: identical to RV bext: pextd, found in v3.1 p196
1030
1031 ```
1032 uint_xlen_t bext(uint_xlen_t RA, uint_xlen_t RB)
1033 {
1034 uint_xlen_t r = 0;
1035 for (int i = 0, j = 0; i < XLEN; i++)
1036 if ((RB >> i) & 1) {
1037 if ((RA >> i) & 1)
1038 r |= uint_xlen_t(1) << j;
1039 j++;
1040 }
1041 return r;
1042 }
1043 ```
1044
1045 ## centrifuge
1046
1047 found in v3.1 p106 so not to be added here
1048
1049 ```
1050 ptr0 = 0
1051 ptr1 = 0
1052 do i = 0 to 63
1053 if((RB)i=0) then do
1054 resultptr0 = (RS)i
1055 end
1056 ptr0 = ptr0 + 1
1057 if((RB)63-i==1) then do
1058 result63-ptr1 = (RS)63-i
1059 end
1060 ptr1 = ptr1 + 1
1061 RA = result
1062 ```
1063
1064 ## bit to byte permute
1065
1066 similar to matrix permute in RV bitmanip, which has XOR and OR variants,
1067 these perform a transpose. TODO this looks VSX is there a scalar variant
1068 in v3.0/1 already
1069
1070 do j = 0 to 7
1071 do k = 0 to 7
1072 b = VSR[VRB+32].dword[i].byte[k].bit[j]
1073 VSR[VRT+32].dword[i].byte[j].bit[k] = b
1074
1075 # Appendix
1076
1077 see [[bitmanip/appendix]]
1078