* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe works
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

This page covers an analysis of big integer operations, to
work out optimal Scalar Instructions to propose for submission to
the OpenPOWER ISA WG that, when combined with Draft SVP64, give
high-performance compact Big Integer Vector Arithmetic. Leverage
of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
*This fits naturally with a Scalable Vector ISA such as SVP64*.
It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector
context and the picture is complete.

* <https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
* <https://web.archive.org/web/20141021201141/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
* <https://news.ycombinator.com/item?id=21151646>
* <https://twitter.com/lkcl/status/1517169267912986624>
* <https://www.youtube.com/watch?v=8hrIG7-E77o>
* <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=817>

# Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce arbitrary-length
big-integer add due to the rules set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their element-level operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentially-contiguous
registers, with the least-significant bits being in the lowest-numbered
register in each case:

    R0,CA = A0+B0+CA    adde r0,a0,b0

    R1,CA = A1+B1+CA    adde r1,a1,b1

    R2,CA = A2+B2+CA    adde r2,a2,b2

This pattern - sequential execution of individual instructions
with incrementing register numbers - is precisely the very definition
of strict Program Order execution.

Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds and send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA on an element
level in back-end hardware that need only:

* read the first incoming XER.CA
* implement a large Vector-aware carry propagation algorithm
* store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU
is entirely at the discretion of the implementer, as is whether to
deploy the above strategy. The only hard requirement for
implementors of SVP64 is to comply with strict and precise Program Order
even at the Element level.

If there is pressure on the register file (or the big integers
have millions of digits)
then a partial-sum may be carried out with LD and ST in a standard
Cray-style Vector Loop:

    li r0, 0                # used to help clear CA
    addic r0, r0, 0         # CA to zero as well
    setmvli 8               # set MAXVL to 8
    loop:
      setvl t0, n           # n is the number of digits
      mulli t1, t0, 8       # 8 bytes per digit/element
      sv.ldu a0, aptr, t1   # update advances pointer
      sv.ldu b0, bptr, t1   # likewise
      sv.adde r0, a0, b0    # takes in CA, updates CA
      sv.stu rptr, r0, t1   # pointer advances too
      sub. n, n, t0         # should not alter CA
      bnz loop              # do more digits

This is not that different from a Scalar Big-Int add, it is
just that like all Cray-style Vectorization, a variable number
of elements are covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.

# Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform row-based multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time. The latter
(row-based) algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64 bit operands actually produce a 128-bit result,
which clearly cannot fit into an orthogonal register file.
Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macro-op fusion", which in effect lets the second instruction
gain access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiply-and-accumulate this becomes quite an expensive operation:
three 64-bit registers in, two 64-bit registers out.

Long-multiplication may be performed a row at a time, starting
with B0:

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a Vector-Add)
things become more manageable.

Demonstrating in C, a Row-based multiply using a temporary vector.
Adapted from a simple implementation
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>

    // this becomes the basis for sv.maddedu in RS=RC Mode,
    // where k is RC. k takes the upper half of product
    // and adds it in on the next iteration
    k = 0;
    for (i = 0; i < m; i++) {
        unsigned product = u[i]*v[j] + k;
        k = product >> 16;
        plo[i] = product; // & 0xffff
    }

    // this is simply sv.adde where k is XER.CA
    k = 0;
    for (i = 0; i < m; i++) {
        t = plo[i] + w[i + j] + k;
        w[i + j] = t;     // (I.e., t & 0xFFFF).
        k = t >> 16;      // carry: should only be 1 bit
    }

We therefore propose an operation that is 3-in, 2-out.
Noting that the connection between successive
mul-adds takes the UPPER half of the previous operation
as its input, the operation writes the UPPER half of the current
product into a second output register, for exactly the
purpose of letting it be added onto the next BigInt digit.

    maddedu RT,RA,RB,RC
    product = (RA) * (RB) + (RC)
    RT = lowerhalf(product)
    RC = upperhalf(product)

Horizontal-First Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64-bit carry, and,
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
`RA*RB+RC` as follows:

    RT0, RC0 = RA0 * RB0 + 0

    RT1, RC1 = RA1 * RB1 + RC0

    RT2, RC2 = RA2 * RB2 + RC1

Following up to add each partially-computed row to what will become
the final result is achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:

    li r16, 0                      # zero accumulator
    addic r16, r16, 0              # CA to zero as well
    sv.maddedu *r0, *r8, r17, r16  # mul vector
    sv.adde *r24, *r24, *r0        # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macro-op
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalar-only
Out-of-Order microarchitecture, although the conditions are harder to
detect.*)

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register is set to RC or whether to RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9-bit space.

In RS=RT+MAXVL Mode:

    RT = lowerhalf(product)
    RS=RT+MAXVL = upperhalf(product)

and in RS=RC Mode:

    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64 bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.

# Vector Shift

The decision here was made to follow the same principle as
for multiply-and-accumulate. Whilst Intel has
[shrd](https://www.felixcloutier.com/x86/shrd)
which is very similar (but only 2-in 2-out),
the decision was made to create a 3-in 2-out
instruction that effectively uses one of the inputs and one of
the outputs as a "sort of 64-bit carry".

Without the `dsrd` instruction, it is necessary to have three
instructions in an inner loop.
Keeping the shift amount within the range of the element (64 bit),
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
that when Vectorized are exactly what is needed, but that critically
requires Vertical-First Mode.

    void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
        for (int i = 0; i < n - 1; i++)
            r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
        r[n - 1] = un[n - 1] >> s;
    }

With SVP64 being on top of the standard scalar regfile the offset by
one of the elements may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
it is possible to reference the full 128 64-bit registers (r0-r127):

    subfic t1, t0, 64     # compute 64-s (s in t0)
    sv.srd *r8, *r24, t0  # shift each element of r24 vector right by s
    sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
    sv.or *r8, *r8, *r16  # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the
last element is zero, avoiding over-run.

The reason why three instructions are needed instead of one in the
case of big-add is because multiple bits chain through to the
next element, where for add it is a single bit (carry-in, carry-out),
and this is precisely what `adde` already does.
For multiply, divide and shift it is worthwhile to use
one scalar register effectively as a full 64-bit carry/chain.

The basic principle of the 3-in 2-out `dsrd` is:

    # r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    temp <- ROT128(RA || RC, RB[58:63])
    RT <- temp[0:63]
    RS <- temp[64:127]

A 128-bit shift is performed, taking the lower half into RS
and the upper half into RT. However there is a trick that
may be applied, which only requires `ROT64`:

    n <- (RB)[58:63]
    v <- ROTL64((RA), 64-n)
    mask <- MASK(n, 63)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
    RS <- v[0:63] & ¬mask

The trick here is that the *entirety* of `RA` is rotated,
then parts of it are masked into the destinations.
RC, if also properly masked, can be ORed into RT, as
long as the bits of RC are in the right place.
The really interesting bit is that when Vectorized,
the upper bits (now in RS) *are* in the right bit-positions
to be ORed into the second `dsrd` operation. This allows
us to create a chain of `sv.dsrd` operations.

For larger shift amounts beyond an element bitwidth standard register move
operations may be used, or, if the shift amount is static,
an alternate starting point may be referenced in
the registers containing the Vector elements,
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.

# Vector Divide

The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:

* Calculation of the quotient estimate. This uses a single
  Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
  phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in C as:

    // Multiply and subtract.
    k = 0;
    for (i = 0; i < n; i++) {
        p = qhat*vn[i]; // 64-bit product
        t = un[i+j] - k - (p & 0xFFFFFFFFLL);
        un[i+j] = t;
        k = (p >> 32) - (t >> 32);
    }
    t = un[j+n] - k;
    un[j+n] = t;

Analysis of this algorithm shows that, if a temporary vector is
acceptable, it can be split into two in exactly the same way as Algorithm M,
this time using subtract instead of add.

    // this is just sv.maddedu again
    uint32_t carry = 0;
    for (int i = 0; i <= n; i++) {
        uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
        carry = (uint32_t)(value >> 32); // upper half for next loop
        product[i] = (uint32_t)value;    // lower into vector
    }

    // this is simply sv.subfe where ca is XER.CA
    bool ca = true;
    for (int i = 0; i <= n; i++) {
        uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
        ca = value >> 32 != 0;
        un_j[i] = (uint32_t)value;
    }

    bool need_fixup = !ca; // for phase 3 correction

In essence, then, the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit divisor**

As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate for the quotient digit. This involves using the three
most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64-bit array). In this
example taken from
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32 bit and, special-casing the overflow, a 64/32 divide
is sufficient (64-bit dividend, 32-bit divisor):

    // Compute estimate qhat of q[j] from top 2 digits
    uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
    if (un[j+n] >= vn[n-1]) {
        // rhat can be bigger than 32-bit when the division overflows
        qhat = UINT32_MAX;
        rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
    } else {
        qhat = dig2 / vn[n - 1]; // 64/32 divide
        rhat = dig2 % vn[n - 1]; // 64/32 modulo
    }
    // use 3rd-from-top digit to obtain better accuracy
    b = 1UL << 32;
    while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
        qhat = qhat - 1;
        rhat = rhat + vn[n - 1];
    }

However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128* bit dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, by placing one operand in the upper half
of a 128-bit dividend, the lower half is zero. Again Power ISA
has a Packed SIMD instruction `vdivuq` which is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
soft-implementations of 128/128 or 128/64 divide show it to be typically
implemented bit-wise, with all that implies.

The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is used as an input into each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.

**Reducing completion time of 128/64-bit Scalar division**

Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1-bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
time-order as Newton-Raphson, using two hardware multipliers.

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar:

    k = 0; // the case of a scalar divisor
    for (j = m - 1; j >= 0; j--)
    {
        uint64_t dig2 = ((k << 32) | u[j]);
        q[j] = dig2 / v[0]; // divisor here.
        k = dig2 % v[0]; // modulo back into next loop
    }

Here, just as with `maddedu` which can put the hi-half of the 128 bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.

    RT0 = (( 0<<64) | RA0) / RB0
    RC0 = (( 0<<64) | RA0) % RB0

    RT1 = ((RC0<<64) | RA1) / RB1
    RC1 = ((RC0<<64) | RA1) % RB1

    RT2 = ((RC1<<64) | RA2) / RB2
    RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation
needed (once, rather than chained) for the `qhat` estimate if it may
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:

`divmod2du RT,RA,RB,RC`

    dividend = (RC) || (RA)
    divisor = EXTZ128(RB)
    RT = UDIV(dividend, divisor)
    RS = UREM(dividend, divisor)

Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether `RS=RC` or
`RS=RT+MAXVL`. Similar flexibility in the scalar-vector settings
allows the instruction to perform full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.

Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit
as long as strict precise-aware Program Order is
preserved (even on the VL for-loops).

Just as with `divdeu`, on which this instruction is based, overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bits. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into `qhat`. With there being so many operands already
in `divmod2du`, a `cmpl` instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop however we can see that overflow
will not occur. The initial value k is zero: as long as a divide-by-zero
is not requested this always fulfils the condition `RC < RB`, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) `RC < RB`
is preserved, as long as RC starts at zero.

# Conclusion

One of the worst things for any ISA is that an algorithm's completion
time is directly affected by different implementations having instructions
take longer or shorter times. Knuth's Big-Integer division is unfortunately
one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on
a small power-efficient embedded design, this time would dominate
compared to the 64 bit multiplications. However if the element width
was reduced to 8, such that the computation of qhat only took 16 cycles,
the calculation of qhat would not dominate, but the number of
multiplications would rise: somewhere in between there would be an
elwidth and a Vector Length that would suit that particular embedded
processor.

By contrast a high performance microarchitecture may deploy Goldschmidt
or other efficient Scalar Division, which could complete 128/64 qhat
computation in say only 5 to 8 cycles, which would be tolerable.
Thus, for general-purpose software, it would be necessary to ship
multiple implementations of the same algorithm and dynamically
select the best one.

The very fact that programmers even have to consider multiple
implementations and compare their performance is an unavoidable nuisance.
SVP64 is supposed to be designed such that only one implementation of
any given algorithm is needed. In some ways it is reassuring that
some algorithms just don't fit. Slightly more reassuring is that
Goldschmidt Divide, which uses two multiplications that can be
performed in parallel, would be a much better fit with SVP64 (and
Vector Processing in general), the only downside being that it is
regarded as worthwhile only for much larger integers.