[[!tag standards]]

# Analysis

**DRAFT SVP64**

* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

**Introduction**

This page covers an analysis of big integer operations in order to
work out optimal Scalar Instructions to propose for submission to
the OpenPOWER ISA WG, which, when combined with Draft SVP64, give
high-performance compact Big Integer Vector Arithmetic. Leverage
of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
*This fits naturally with a Scalable Vector ISA such as SVP64*.
It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector
context and the picture is complete.

34
35 Links
36
37 * <https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
38 * <https://web.archive.org/web/20141021201141/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
39 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
40 * <https://news.ycombinator.com/item?id=21151646>
41 * <https://twitter.com/lkcl/status/1517169267912986624>
42 * <https://www.youtube.com/watch?v=8hrIG7-E77o>
43 * <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
44 * <https://bugs.libre-soc.org/show_bug.cgi?id=817>
45
# Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce an arbitrary-length
big-integer add, due to the rule set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their element-level operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentially-contiguous
registers, with the least-significant bits being in the lowest-numbered
register in each case:

    R0,CA = A0+B0+CA    adde r0,a0,b0
       |
       +----------+
                  |
    R1,CA = A1+B1+CA    adde r1,a1,b1
       |
       +----------+
                  |
    R2,CA = A2+B2+CA    adde r2,a2,b2

This pattern - sequential execution of individual instructions
with incrementing register numbers - is precisely the definition
of how SVP64 works!
Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds, send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA at the element
level in back-end hardware that need only:

* read the first incoming XER.CA
* implement a large Vector-aware carry propagation algorithm
* store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU
is entirely at the discretion of the implementer, as is whether to
deploy the above strategy. The only hard requirement for
implementors of SVP64 is to comply with strict and precise Program Order
even at the Element level.

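To illustrate the element-level semantics, the following is a minimal C
sketch (a model only, not Power ISA pseudocode; it assumes the GCC/Clang
`unsigned __int128` extension and uses an illustrative function name) of
what the chained `adde` elements compute: each 64-bit limb addition
consumes the previous 1-bit carry and produces the next, exactly as
XER.CA chains through `sv.adde`.

```
#include <stdint.h>

// Model of sv.adde over n limbs: r[] = a[] + b[], chaining a 1-bit
// carry (XER.CA) from each element into the next.  Limb 0 is the
// least-significant 64 bits, matching the register layout above.
static uint64_t bigadd(uint64_t r[], const uint64_t a[],
                       const uint64_t b[], int n, uint64_t ca)
{
    for (int i = 0; i < n; i++) {
        unsigned __int128 sum = (unsigned __int128)a[i] + b[i] + ca;
        r[i] = (uint64_t)sum;           // low 64 bits -> result limb
        ca   = (uint64_t)(sum >> 64);   // carry-out (0 or 1) -> next element
    }
    return ca; // final XER.CA
}
```

With `n` set to 1 this collapses to a single scalar `adde` step, matching
the later observation that VL=1 degenerates to a Scalar Big-Int add.
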
If there is pressure on the register file (or the big integers
have multiple millions of digits)
then a partial-sum may be carried out with LD and ST in a standard
Cray-style Vector Loop:

    aptr = A address
    bptr = B address
    rptr = Result address
    li r0, 0                # used to help clear CA
    addic r0, r0, 0         # CA to zero as well
    setmvli 8               # set MAXVL to 8
    loop:
      setvl t0, n           # n is the number of digits
      mulli t1, t0, 8       # 8 bytes per digit/element
      sv.ldu a0, aptr, t1   # update advances pointer
      sv.ldu b0, bptr, t1   # likewise
      sv.adde r0, a0, b0    # takes in CA, updates CA
      sv.stu rptr, r0, t1   # pointer advances too
      sub. n, n, t0         # should not alter CA
      bnz loop              # do more digits

This is not that different from a Scalar Big-Int add; it is
just that, as with all Cray-style Vectorization, a variable number
of elements are covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.

# Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform row-based multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time.
A Row-based
algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64-bit operands actually produce a 128-bit result,
which clearly cannot fit into a single register of an orthogonal register file.
Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macro-op fusion", which is in effect where the second instruction
gains access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiply-and-accumulate this becomes quite an expensive operation
(3 64-bit registers in, 2 64-bit registers out).

Long-multiplication may be performed a row at a time, starting
with B0:

    C4 C3 C2 C1 C0
             A0xB0
          A1xB0
       A2xB0
    A3xB0
    R4 R3 R2 R1 R0

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a Vector-Add)
things become more manageable.

Demonstrating in C, here is a Row-based multiply using a temporary vector,
adapted from a simple implementation
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>

```
// this becomes the basis for sv.maddedu in RS=RC Mode,
// where k is RC. k takes the upper half of product
// and adds it in on the next iteration
k = 0;
for (i = 0; i < m; i++) {
    unsigned product = u[i]*v[j] + k;
    k = product >> 16;
    plo[i] = product; // & 0xffff
}
// this is simply sv.adde where k is XER.CA
k = 0;
for (i = 0; i < m; i++) {
    t = plo[i] + w[i + j] + k;
    w[i + j] = t;     // (I.e., t & 0xFFFF).
    k = t >> 16;      // carry: should only be 1 bit
}
```

We therefore propose an operation that is 3-in, 2-out. Noting that
the connection between successive mul-adds takes the UPPER half of
the previous operation as its input, the proposed instruction writes
the UPPER half of the current product into a second output register,
for exactly the purpose of letting it be added onto the next BigInt digit.

    product = RA*RB+RC
    RT = lowerhalf(product)
    RC = upperhalf(product)

Horizontal-First Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64-bit carry, and,
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow a 128-bit result, so no
additional CA flag needs to be set. We first cover the chain of
`RA*RB+RC` as follows:

    RT0, RC0 = RA0 * RB0 + 0
          |
          +----------------+
                           |
    RT1, RC1 = RA1 * RB1 + RC0
          |
          +----------------+
                           |
    RT2, RC2 = RA2 * RB2 + RC1

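The same chain can be modelled in C. This is a hedged sketch (the
function name `bigmul_row` is illustrative, and the GCC/Clang
`unsigned __int128` extension is assumed) of one row of Algorithm M as a
chained `maddedu`: a vector of 64-bit digits multiplied by one scalar
digit, with RC acting as the 64-bit carry between elements.

```
#include <stdint.h>

// Sketch of the sv.maddedu chain (RS=RC Mode): multiply a vector of
// 64-bit digits a[] by one scalar digit b, with rc acting as a 64-bit
// "carry" between elements.  rt[] receives the LO halves; the returned
// value is the final carry (the last RC), i.e. the top digit of the row.
static uint64_t bigmul_row(uint64_t rt[], const uint64_t a[],
                           uint64_t b, int n)
{
    uint64_t rc = 0; // RC starts at zero
    for (int i = 0; i < n; i++) {
        unsigned __int128 product = (unsigned __int128)a[i] * b + rc;
        rt[i] = (uint64_t)product;          // lowerhalf(product) -> RT
        rc    = (uint64_t)(product >> 64);  // upperhalf(product) -> next RC
    }
    return rc;
}
```
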
Following up to add each partially-computed row to what will become
the final result is achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:

    li r16, 0                     # zero accumulator
    addic r16, r16, 0             # CA to zero as well
    sv.maddedu *r0, *r8, r17, r16 # mul vector
    sv.adde *r24, *r24, *r0       # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macro-op
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalar-only
Out-of-Order microarchitecture, although the conditions are harder to
detect*).

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register (RS) is RC or RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9-bit space.

RS=RT+MAXVL Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RT+MAXVL = upperhalf(product)

and RS=RC Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64-bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.

# Vector Shift

The decision here was made to follow the same principle as
for multiply-and-accumulate. Whilst Intel has
[shrd](https://www.felixcloutier.com/x86/shrd)
which is very similar (but only 2-in 2-out),
the decision was made to create a 3-in 2-out
instruction that effectively uses one of the inputs and one of
the outputs as a "sort of 64-bit carry".

Without the `dsrd` instruction, it is necessary to have three
instructions in an inner loop.
Keeping the shift amount within the range of the element (64 bit),
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
that, when Vectorized, are exactly what is needed, but this
critically requires Vertical-First Mode.

```
void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
    for (int i = 0; i < n - 1; i++)
        r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    r[n - 1] = un[n - 1] >> s;
}
```

With SVP64 being on top of the standard scalar regfile, the offset by
one element may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
it is possible to reference the full 128 64-bit registers (r0-r127):

    subfic t1, t0, 64     # compute 64-s (s in t0)
    sv.srd *r8, *r24, t0  # shift each element of r24 vector right by s
    sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
    sv.or  *r8, *r8, *r16 # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the
last element is zero, avoiding over-run.

The reason why three instructions are needed here, instead of one as in
the case of big-add, is that multiple bits chain through to the
next element, whereas for add it is a single bit (carry-in, carry-out),
and a single bit is precisely what `adde` already handles.
For multiply, divide and shift it is worthwhile to use
one scalar register effectively as a full 64-bit carry/chain.

The limitations of this approach therefore become pretty clear:
not only must Vertical-First Mode be used but also the predication
with zeroing trick. Worse than that, an entire temporary vector
is required, which wastes register space.
A better way would be to create a single
scalar instruction that can do the long-shift in-place.

The basic principle of the 3-in 2-out `dsrd` is:

```
# r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
temp <- ROT128(RA || RC, RB[58:63])
RT <- temp[64:127]
RS <- temp[0:63]
```

A 128-bit shift is performed, taking the lower half of the result
into RT and the upper half into RS. However there is a trick that
may be applied, which only requires `ROT64`:

```
n <- (RB)[58:63]
v <- ROTL64((RA), 64-n)
mask <- MASK(n, 63)
RT <- (v[0:63] & mask) | ((RC) & ¬mask)
RS <- v[0:63] & ¬mask
```

The trick here is that the *entirety* of `RA` is rotated,
then parts of it are masked into the destinations.
RC, if also properly masked, can be ORed into RT, as
long as the bits of RC are in the right place.
The really interesting bit is that when Vectorized,
the upper bits (now in RS) *are* in the right bit-positions
to be ORed into the second `dsrd` operation. This allows
us to create a chain of `sv.dsrd`, and a single instruction
replaces all four above:

    sv.dsrd *r8, *r24, t1, t0

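A hedged C model of this chain may help. The `dsrd` helper below follows
the ROT64 pseudocode above (the names and the top-down loop order are
illustrative assumptions; the actual element ordering depends on how the
limbs are laid out in the vector registers), and the driver reproduces
the earlier `bigrsh` result using one chained operation per element.

```
#include <stdint.h>

// Model of one dsrd operation (ROT64 trick): returns RT, and writes RS,
// the bits shifted out of RA, already positioned to become the next RC.
static uint64_t dsrd(uint64_t ra, unsigned n, uint64_t rc, uint64_t *rs)
{
    uint64_t v    = (n == 0) ? ra : (ra >> n) | (ra << (64 - n));   // ROTR64(ra, n)
    uint64_t mask = (n == 0) ? ~(uint64_t)0
                             : (((uint64_t)1 << (64 - n)) - 1);     // low 64-n bits
    uint64_t rt   = (v & mask) | (rc & ~mask); // (ra >> n) OR'd with chained-in bits
    *rs = v & ~mask;   // ra's low n bits, already in the upper positions
    return rt;
}

// Equivalent of bigrsh() above, but as a single dsrd chain: iterate from
// the most-significant limb downwards so each RS feeds the next RC.
static void bigrsh_dsrd(unsigned s, uint64_t r[], const uint64_t un[], int n)
{
    uint64_t rc = 0; // the top limb has nothing shifted in
    for (int i = n - 1; i >= 0; i--)
        r[i] = dsrd(un[i], s, rc, &rc);
}
```
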
For shift amounts larger than an element bitwidth, standard register move
operations may be used, or, if the shift amount is static,
an alternate starting point may be referenced in
the registers containing the Vector elements,
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.

# Vector Divide

The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:

* Calculation of the quotient estimate. This uses a single
  Scalar divide, which is covered separately in a later section.
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
  phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in C as:

```
// Multiply and subtract.
k = 0;
for (i = 0; i < n; i++) {
    p = qhat*vn[i]; // 64-bit product
    t = un[i+j] - k - (p & 0xFFFFFFFFLL);
    un[i+j] = t;
    k = (p >> 32) - (t >> 32);
}
```

Analysis of this algorithm shows that, if a temporary vector is
acceptable, it can be split into two in exactly the same way as
Algorithm M, this time using subtract instead of add.

```
uint32_t carry = 0;
// this is just sv.maddedu again
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
    carry = (uint32_t)(value >> 32); // upper half for next loop
    product[i] = (uint32_t)value;    // lower into vector
}
bool ca = true;
// this is simply sv.subfe where ca is XER.CA
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
    ca = (value >> 32) != 0;
    un_j[i] = value;
}
bool need_fixup = !ca; // for phase 3 correction
```

In essence, then, the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit division**

As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate for the quotient, `qhat`. This involves using the
three most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64-bit array). In this
example taken from
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32-bit and, special-casing the overflow, a 64/32 divide
is sufficient (64-bit dividend, 32-bit divisor):

```
// Compute estimate qhat of q[j] from top 2 digits
uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
if (un[j+n] >= vn[n-1]) {
    // rhat can be bigger than 32-bit when the division overflows
    qhat = UINT32_MAX;
    rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
} else {
    qhat = dig2 / vn[n - 1]; // 64/32 divide
    rhat = dig2 % vn[n - 1]; // 64/32 modulo
}
// use 3rd-from-top digit to obtain better accuracy
b = 1UL << 32;
while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
    qhat = qhat - 1;
    rhat = rhat + vn[n - 1];
}
```

However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128-bit* dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, which places one operand in the upper half
of a 128-bit dividend, the lower half is zero. Power ISA
also has the Packed SIMD instruction `vdivuq`, but that is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
soft-implementations of 128/128 or 128/64 divide shows them to be typically
implemented bit-wise, with all that implies.

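To make the requirement concrete, here is an illustrative C sketch (not
taken from divmnu64.c, which uses 32-bit digits) of the same qhat
estimate with 64-bit digits, relying on the GCC/Clang `unsigned __int128`
extension: the divide and modulo below are 128/64 operations which,
absent a hardware instruction, the compiler lowers to a software routine
(typically a call such as `__udivti3`).

```
#include <stdint.h>

// Illustrative 64-bit-digit version of the qhat estimate above.
// dig2 is a full 128-bit dividend, so the / and % are 128/64 operations.
static void qhat_estimate(const uint64_t un[], const uint64_t vn[],
                          int j, int n,
                          uint64_t *qhat, unsigned __int128 *rhat)
{
    unsigned __int128 dig2 =
        ((unsigned __int128)un[j + n] << 64) | un[j + n - 1];
    if (un[j + n] >= vn[n - 1]) {
        // rhat can be bigger than 64-bit when the division overflows
        *qhat = UINT64_MAX;
        *rhat = dig2 - (unsigned __int128)UINT64_MAX * vn[n - 1];
    } else {
        *qhat = (uint64_t)(dig2 / vn[n - 1]); // 128/64 divide
        *rhat = dig2 % vn[n - 1];             // 128/64 modulo
    }
}
```
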
The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.

**Reducing completion time of 128/64-bit Scalar division**

Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1-bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
time-order as Newton-Raphson, using two hardware multipliers
and a subtract.

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar
(`v[0]`):

```
k = 0;                            // the case of a
for (j = m - 1; j >= 0; j--) {    // single-digit
    uint64_t dig2 = ((k << 32) | u[j]);
    q[j] = dig2 / v[0];           // divisor here.
    k = dig2 % v[0];              // modulo back into next loop
}
```

Here, just as with `maddedu`, which can put the hi-half of the 128-bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of the next 128/64-bit divide.

    RT0 = (( 0<<64) | RA0) / RB0
    RC0 = (( 0<<64) | RA0) % RB0
     |
     +-------+
             |
    RT1 = ((RC0<<64) | RA1) / RB1
    RC1 = ((RC0<<64) | RA1) % RB1
     |
     +-------+
             |
    RT2 = ((RC1<<64) | RA2) / RB2
    RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation
needed (once, rather than chained) for the `qhat` estimate, if it can
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:

`divmod2du RT,RA,RB,RC`

    dividend = (RC) || (RA)
    divisor = EXTZ128(RB)
    RT = UDIV(dividend, divisor)
    RS = UREM(dividend, divisor)

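As with `maddedu`, the chained behaviour can be modelled directly in C.
This is a hedged sketch (illustrative name, GCC/Clang `unsigned __int128`
assumed) of big-integer division of a vector of 64-bit digits by a single
scalar digit, with the modulo chaining through RC exactly as in the
diagram above:

```
#include <stdint.h>

// Model of a chained sv.divmod2du: divide an n-digit big integer u[]
// (u[n-1] most significant) by a single 64-bit divisor d, producing the
// quotient digits q[] and returning the final remainder.  rc chains the
// modulo from each element into the next as the hi-half of the
// following 128/64 divide.  rc < d holds throughout because rc starts
// at 0 and a remainder is always less than the divisor.
static uint64_t bigdiv_scalar(uint64_t q[], const uint64_t u[],
                              uint64_t d, int n)
{
    uint64_t rc = 0;
    for (int j = n - 1; j >= 0; j--) {
        unsigned __int128 dividend = ((unsigned __int128)rc << 64) | u[j];
        q[j] = (uint64_t)(dividend / d); // RT: quotient digit
        rc   = (uint64_t)(dividend % d); // RS/RC: modulo into next element
    }
    return rc;
}
```
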
Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether `RS=RC` or
`RS=RT+MAXVL`. Similar flexibility in the scalar-vector settings
allows the instruction to perform a full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or perform a single scalar 128/64 div/mod.

Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations,
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit,
as long as strict, precise-exception-aware Program Order is
preserved (even on the VL for-loops).

Just as with `divdeu`, on which this instruction is based, overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bits. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into `qhat`. With there being so many operands already
in `divmod2du`, a `cmpl` instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop, however, we can see that overflow
will not occur. The initial value k is zero: as long as a divide-by-zero
is not requested this always fulfils the condition `RC < RB`, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) `RC < RB`
is preserved, as long as RC starts at zero.

**Limitations**

One of the worst things for any ISA is that an algorithm's completion
time is directly affected by different implementations having instructions
take longer or shorter times. Knuth's Big-Integer division is unfortunately
one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on
a small power-efficient embedded design, this time would dominate
compared to the 64-bit multiplications. However if the element width
were reduced to 8, such that the computation of qhat only took 16 cycles,
the calculation of qhat would not dominate, but the number of
multiplications would rise: somewhere in between there would be an
elwidth and a Vector Length that would suit that particular embedded
processor.

By contrast a high-performance microarchitecture may deploy Goldschmidt
or other efficient Scalar Division, which could complete 128/64 qhat
computation in say only 5 to 8 cycles, which would be tolerable.
Thus, for general-purpose software, it would be necessary to ship
multiple implementations of the same algorithm and dynamically
select the best one.

The very fact that programmers even have to consider multiple
implementations and compare their performance is an unavoidable nuisance.
SVP64 is supposed to be designed such that only one implementation of
any given algorithm is needed. In some ways it is reassuring that
some algorithms just don't fit. Slightly more reassuring is that
Goldschmidt Divide, which uses two multiplications that can be
performed in parallel, would be a much better fit with SVP64 (and
Vector Processing in general), the only downside being that it is
regarded as worthwhile only for much larger integers.

# Conclusion

The majority of ISAs do not have 3-in 2-out instructions for very good
reasons: they are horrendously expensive if implemented as operations
that can write to two registers every clock cycle. Implementations
that take two clock cycles to write to a single write port are frowned upon,
and the Hazard Management complexity also goes up.

However when looked at from a Vectorized perspective and using "chaining"
(the 64-bit carry-out becomes the 64-bit input for the next operation),
"Operand Forwarding" makes one read and one write effectively disappear,
reducing most of the operations to a manageable, efficient 2-in 1-out.
In normal Scalar hardware being used to attempt Vectorized bigint operations
the opportunity for "Forwarding" is quite hard to spot, but because of the
Vectorization it is known *in advance* that a Vector of values is being used,
i.e. only *one* instruction is issued, not several. Therefore it is easy to
micro-code the Operand Forwarding.

In essence, by taking things that extra step further, the complexity of
Scalar ISAs disappears through the introduction of some uniform Vector-Looping
that, although complex in itself in hardware, at least only has to be done once
and becomes uniformly applicable, RISC-style, to *all* instructions. These
comprehensive, powerful Scalar arithmetic instructions will significantly
reduce the complexity of big-integer operations.