[[!tag standards]]

# Analysis

**DRAFT SVP64**

* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

**Introduction**

This page covers an analysis of big integer operations, to
work out optimal Scalar Instructions to propose for submission to
the OpenPOWER ISA WG which, when combined with Draft SVP64, give
high-performance compact Big Integer Vector Arithmetic. Leverage
of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
*This fits naturally with a Scalable Vector ISA such as SVP64*.
It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector
context and the picture is complete.

Links

* <https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
* <https://web.archive.org/web/20141021201141/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
* <https://news.ycombinator.com/item?id=21151646>
* <https://twitter.com/lkcl/status/1517169267912986624>
* <https://www.youtube.com/watch?v=8hrIG7-E77o>
* <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=817>

# Vector Add and Subtract

Surprisingly, no additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce an arbitrary-length
big-integer add, due to the rules set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their element-level operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentially-contiguous
registers, with the least-significant bits being in the lowest-numbered
register in each case:

    R0,CA = A0+B0+CA    adde r0,a0,b0
         |
         +----------+
                    |
    R1,CA = A1+B1+CA    adde r1,a1,b1
         |
         +----------+
                    |
    R2,CA = A2+B2+CA    adde r2,a2,b2

This pattern - sequential execution of individual instructions
with incrementing register numbers - is precisely the very definition
of how SVP64 works!
Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds and send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA on an element
level in back-end hardware that need only:

* read the first incoming XER.CA
* implement a large Vector-aware carry propagation algorithm
* store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU
is entirely at the discretion of the implementer, as is whether to
deploy the above strategy. The only hard requirement for
implementors of SVP64 is to comply with strict and precise Program Order
even at the Element level.
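
To make the element-level semantics concrete, below is a minimal C model
of the carry chain (a sketch only: the function names and the use of
`unsigned __int128` are illustrative assumptions, not part of any proposal).
Each call is one element-level `adde`, consuming and producing CA exactly
as in the diagram above.

```
#include <stdint.h>

// one element-level "adde": 64-bit add with carry-in and carry-out
static uint64_t adde_model(uint64_t a, uint64_t b, unsigned *ca)
{
    unsigned __int128 sum = (unsigned __int128)a + b + *ca;
    *ca = (unsigned)(sum >> 64);       // carry out (0 or 1)
    return (uint64_t)sum;              // 64-bit element result
}

// sv.adde over vl sequentially-contiguous "registers": strict element
// order, with CA chaining from one element to the next
static void sv_adde_model(uint64_t r[], const uint64_t a[],
                          const uint64_t b[], int vl, unsigned *ca)
{
    for (int i = 0; i < vl; i++)
        r[i] = adde_model(a[i], b[i], ca);
}
```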

If there is pressure on the register file (or the big integers
run to millions of digits)
then a partial-sum may be carried out with LD and ST in a standard
Cray-style Vector Loop:

    aptr = A address
    bptr = B address
    rptr = Result address
    li r0, 0                # used to help clear CA
    addic r0, r0, 0         # CA to zero as well
    setmvli 8               # set MAXVL to 8
    loop:
      setvl t0, n           # n is the number of digits
      mulli t1, t0, 8       # 8 bytes per digit/element
      sv.ldu a0, aptr, t1   # update advances pointer
      sv.ldu b0, bptr, t1   # likewise
      sv.adde r0, a0, b0    # takes in CA, updates CA
      sv.stu rptr, r0, t1   # pointer advances too
      sub. n, n, t0         # should not alter CA
      bnz loop              # do more digits

This is not that different from a Scalar Big-Int add; it is
just that, as with all Cray-style Vectorization, a variable number
of elements is covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.
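
The same strip-mined structure can be written out in C as a cross-check
(again a sketch: the names, the `MAXVL` constant and the use of
`unsigned __int128` for the carry are illustrative assumptions):

```
#include <stddef.h>
#include <stdint.h>

#define MAXVL 8   // matches "setmvli 8" above

// process up to MAXVL 64-bit digits per pass, CA carried across passes
static void bigadd_strip(uint64_t *rptr, const uint64_t *aptr,
                         const uint64_t *bptr, size_t n, unsigned *ca)
{
    while (n > 0) {
        size_t vl = (n > MAXVL) ? MAXVL : n;       // setvl t0, n
        for (size_t i = 0; i < vl; i++) {          // sv.adde r0, a0, b0
            unsigned __int128 sum =
                (unsigned __int128)aptr[i] + bptr[i] + *ca;
            rptr[i] = (uint64_t)sum;
            *ca = (unsigned)(sum >> 64);
        }
        aptr += vl; bptr += vl; rptr += vl;        // sv.ldu / sv.stu updates
        n -= vl;                                   // sub. n, n, t0
    }
}
```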

# Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform row-based multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time.
A Row-based
algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64 bit operands actually produce a 128-bit result,
which clearly cannot fit into an orthogonal register file.
Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macro-op fusion", whereby, in effect, the second instruction
gains access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiply-and-accumulate this becomes quite an expensive operation:
(3 64-bit registers in, 2 64-bit registers out).

Long-multiplication may be performed a row at a time, starting
with B0:

    C4 C3 C2 C1 C0
             A0xB0
          A1xB0
       A2xB0
    A3xB0
    R4 R3 R2 R1 R0

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a Vector-Add)
things become more manageable.

Demonstrating in C, below is a Row-based multiply using a temporary
vector, adapted from a simple implementation
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>

```
// this becomes the basis for sv.maddedu in RS=RC Mode,
// where k is RC. k takes the upper half of product
// and adds it in on the next iteration
k = 0;
for (i = 0; i < m; i++) {
    unsigned product = u[i]*v[j] + k;
    k = product >> 16;
    plo[i] = product;       // & 0xffff
}
// this is simply sv.adde where k is XER.CA
k = 0;
for (i = 0; i < m; i++) {
    t = plo[i] + w[i + j] + k;
    w[i + j] = t;           // (I.e., t & 0xFFFF).
    k = t >> 16;            // carry: should only be 1 bit
}
```

We therefore propose an operation that is 3-in, 2-out. Noting
that the connection between successive mul-adds is that the UPPER
half of the previous product becomes an input, the operation writes
the UPPER half of the current product into a second output register,
exactly so that it may be added onto the next BigInt digit.

    product = RA*RB+RC
    RT = lowerhalf(product)
    RC = upperhalf(product)
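
A single element step of this can be modelled in C as follows (a sketch,
with an illustrative function name and `unsigned __int128` assumed):

```
#include <stdint.h>

// one element of the proposed operation:
//   product = RA*RB + RC;  RT = lo64(product);  RC(out) = hi64(product)
static uint64_t maddedu_model(uint64_t ra, uint64_t rb, uint64_t *rc)
{
    unsigned __int128 product = (unsigned __int128)ra * rb + *rc;
    *rc = (uint64_t)(product >> 64);   // upper half: the 64-bit "carry"
    return (uint64_t)product;          // lower half, written to RT
}
```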

Horizontal-First Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64-bit carry, and,
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
`RA*RB+RC` as follows:

    RT0, RC0 = RA0 * RB0 + 0
            |
            +----------------+
                             |
    RT1, RC1 = RA1 * RB1 + RC0
            |
            +----------------+
                             |
    RT2, RC2 = RA2 * RB2 + RC1

Adding each partially-computed row to what will become
the final result is then achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:

    li r16, 0                      # zero accumulator
    addic r16, r16, 0              # CA to zero as well
    sv.maddedu *r0, *r8, r17, r16  # mul vector
    sv.adde *r24, *r24, *r0        # big-add row to result
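
A C sketch of what that four-instruction sequence computes, with 64-bit
digits (names are illustrative and `unsigned __int128` is assumed; the
final 64-bit and 1-bit carries, which Knuth folds into the next digit of
the result, are left out for brevity):

```
#include <stddef.h>
#include <stdint.h>

// one row of Knuth's Algorithm M with 64-bit digits:
// sv.maddedu into a temporary vector, then sv.adde onto the result row
static void mul_row_add(uint64_t w[], const uint64_t u[],
                        uint64_t v_j, size_t m)
{
    uint64_t plo[m];                    // temporary vector (*r0 above)
    uint64_t k = 0;                     // r16: 64-bit carry, zeroed by li
    for (size_t i = 0; i < m; i++) {    // sv.maddedu *r0,*r8,r17,r16
        unsigned __int128 prod = (unsigned __int128)u[i] * v_j + k;
        plo[i] = (uint64_t)prod;
        k = (uint64_t)(prod >> 64);
    }
    unsigned ca = 0;                    // XER.CA, cleared by addic
    for (size_t i = 0; i < m; i++) {    // sv.adde *r24,*r24,*r0
        unsigned __int128 sum = (unsigned __int128)w[i] + plo[i] + ca;
        w[i] = (uint64_t)sum;
        ca = (unsigned)(sum >> 64);
    }
}
```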

Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macro-op
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalar-only
Out-of-Order microarchitecture, although the conditions are harder to
detect*).

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register is taken to be RC or to be RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9-bit space.

RS=RT+MAXVL Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RT+MAXVL = upperhalf(product)

and RS=RC Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64 bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.

# Vector Shift

The decision here was made to follow the same principle as
for multiply-and-accumulate. Whilst Intel has
[shrd](https://www.felixcloutier.com/x86/shrd)
which is very similar (but only 2-in 2-out),
the decision was made to create a 3-in 2-out
instruction that effectively uses one of the inputs and one of
the outputs as a "sort of 64-bit carry".

Without the `dsrd` instruction, it is necessary to have three
instructions in an inner loop.
Keeping the shift amount within the range of the element (64 bit),
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
that, when Vectorized, are exactly what is needed, but that critically
requires Vertical-First Mode.

```
void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
    for (int i = 0; i < n - 1; i++)
        r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    r[n - 1] = un[n - 1] >> s;
}
```

With SVP64 being on top of the standard scalar regfile, the offset by
one element may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
it is possible to reference the full 128 64-bit registers (r0-r127):

    subfic t1, t0, 64     # compute 64-s (s in t0)
    sv.srd *r8, *r24, t0  # shift each element of r24 vector up by s
    sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
    sv.or  *r8, *r8, *r16 # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the
last element is zero, avoiding over-run.

The reason why three instructions are needed here, where big-int add
needed only one, is that multiple bits chain through to the
next element, whereas for add it is a single bit (carry-in, carry-out),
and that is precisely what `adde` already does.
For multiply, divide and shift it is worthwhile to use
one scalar register effectively as a full 64-bit carry/chain.

The basic principle of the 3-in 2-out `dsrd` is:

```
# r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
temp <- ROT128(RA || RC, RB[58:63])
RT <- temp[64:127]
RS <- temp[0:63]
```

A 128-bit shift is performed, taking the lower half into RS
and the upper half into RT. However there is a trick that
may be applied, which only requires `ROT64`:

```
n <- (RB)[58:63]
v <- ROTL64((RA), 64-n)
mask <- MASK(n, 63)
RT <- (v[0:63] & mask) | ((RC) & ¬mask)
RS <- v[0:63] & ¬mask
```

The trick here is that the *entirety* of `RA` is rotated,
then parts of it are masked into the destinations.
RC, if also properly masked, can be ORed into RT, as
long as the bits of RC are in the right place.
The really interesting bit is that when Vectorised,
the upper bits (now in RS) *are* in the right bit-positions
to be ORed into the second `dsrd` operation. This allows
us to create a chain `sv.dsrd`.
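
Below is a C sketch of the masked-rotate form above, together with the
chained use that reproduces `bigrsh` (the function names and the
element-processing order are illustrative assumptions; MSB0 bit-numbering
has been translated into ordinary C shifts):

```
#include <stdint.h>

// one dsrd element, following the masked-rotate pseudocode above
static uint64_t dsrd_model(uint64_t ra, uint64_t rc, unsigned n, uint64_t *rs)
{
    n &= 63;
    // v <- ROTL64(RA, 64-n), i.e. rotate right by n
    uint64_t v = (ra >> n) | (ra << ((64 - n) & 63));
    // MASK(n, 63): the low 64-n bits set
    uint64_t mask = (n == 0) ? ~0ULL : ((1ULL << (64 - n)) - 1);
    *rs = v & ~mask;                   // RA's low n bits, now at the top
    return (v & mask) | (rc & ~mask);  // RA >> n, ORed with the incoming bits
}

// chained use, equivalent to bigrsh() above: the RS output of the more
// significant element becomes the RC input of the next element down
static void bigrsh_dsrd(unsigned s, uint64_t r[], const uint64_t un[], int n)
{
    uint64_t carry = 0;                // nothing shifts into the top element
    for (int i = n - 1; i >= 0; i--)
        r[i] = dsrd_model(un[i], carry, s, &carry);
}
```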

For larger shift amounts beyond an element bitwidth, standard register move
operations may be used, or, if the shift amount is static,
an alternate starting point in
the registers containing the Vector elements may be referenced,
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.

# Vector Divide

The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:

* Calculation of the quotient estimate. This uses a single
  Scalar divide, which is covered separately in a later section.
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
  phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in C as:

```
// Multiply and subtract.
k = 0;
for (i = 0; i < n; i++) {
    p = qhat*vn[i];   // 64-bit product
    t = un[i+j] - k - (p & 0xFFFFFFFFLL);
    un[i+j] = t;
    k = (p >> 32) - (t >> 32);
}
```

Analysis of this algorithm shows that, if a temporary vector is acceptable,
it can be split into two loops in exactly the same way as Algorithm M,
this time using subtract instead of add.

```
uint32_t carry = 0;
// this is just sv.maddedu again
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
    carry = (uint32_t)(value >> 32); // upper half for next loop
    product[i] = (uint32_t)value;    // lower into vector
}
bool ca = true;
// this is simply sv.subfe where ca is XER.CA
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
    ca = (value >> 32) != 0;
    un_j[i] = value;
}
bool need_fixup = !ca; // for phase 3 correction
```

In essence, then, the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit divisor**

As mentioned above, the first part of Knuth's Algorithm D involves
computing an estimate of the quotient. This involves using the three
most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64-bit array). In this
example taken from
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32 bit and, special-casing the overflow, a 64/32 divide is sufficient (64-bit dividend, 32-bit divisor):

```
// Compute estimate qhat of q[j] from top 2 digits
uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
if (un[j+n] >= vn[n-1]) {
    // rhat can be bigger than 32-bit when the division overflows
    qhat = UINT32_MAX;
    rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
} else {
    qhat = dig2 / vn[n - 1]; // 64/32 divide
    rhat = dig2 % vn[n - 1]; // 64/32 modulo
}
// use 3rd-from-top digit to obtain better accuracy
b = 1ULL << 32;
while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
    qhat = qhat - 1;
    rhat = rhat + vn[n - 1];
}
```

However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128-bit* dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although the Power ISA
comes close with `divdeu`, which places one operand in the upper half
of a 128-bit dividend, the lower half is zero. Again the Power ISA
has a Packed SIMD instruction `vdivuq` which is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
soft-implementations of 128/128 or 128/64 divide shows them to be typically
implemented bit-wise, with all that implies.
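
To illustrate the gap, the 64-bit-digit equivalent of the estimate code
above needs a full 128/64 division, which in C can only be expressed
through `unsigned __int128`; compilers typically lower it to a slow
bit-serial library routine rather than to a single instruction
(a sketch, with the overflow special-case omitted):

```
#include <stdint.h>

// 64-bit-digit qhat estimate: top two dividend digits divided by the
// top divisor digit.  With no 128/64 scalar divide available, the
// compiler emits a runtime library call here.
static void qhat_estimate64(const uint64_t un[], const uint64_t vn[],
                            int j, int n, uint64_t *qhat, uint64_t *rhat)
{
    unsigned __int128 dig2 =
        ((unsigned __int128)un[j + n] << 64) | un[j + n - 1];
    *qhat = (uint64_t)(dig2 / vn[n - 1]);   // 128/64 divide
    *rhat = (uint64_t)(dig2 % vn[n - 1]);   // 128/64 modulo
}
```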

The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.

**Reducing completion time of 128/64-bit Scalar division**

Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1-bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
time-order as Newton-Raphson, using two hardware multipliers
and a subtract.
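
For illustration only, the principle behind Goldschmidt division is shown
below in floating point (the fixed-point adaptation and its error analysis
are beyond the scope of this page); the two multiplications in each
iteration are independent, which is what makes the method a good match for
hardware with two multipliers.

```
#include <stdio.h>

// Goldschmidt division, floating-point form, purely to illustrate the
// principle: N/D is refined by repeatedly multiplying both by F = 2 - D,
// driving D towards 1.  The two multiplies per step are independent.
static double goldschmidt_div(double n, double d)
{
    // assumes d has been pre-scaled into (0.5, 1.0]
    for (int i = 0; i < 5; i++) {
        double f = 2.0 - d;
        n *= f;            // these two multiplications
        d *= f;            // may proceed in parallel
    }
    return n;
}

int main(void)
{
    // 7/5: pre-scale both by 1/8 so the divisor lands in (0.5, 1.0]
    printf("%f\n", goldschmidt_div(7.0 / 8.0, 5.0 / 8.0));  // prints ~1.4
    return 0;
}
```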

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar
(`v[0]`):

```
k = 0;                            // the case of a
for (j = m - 1; j >= 0; j--) {    // single-digit
    uint64_t dig2 = ((k << 32) | u[j]);
    q[j] = dig2 / v[0];           // divisor here.
    k = dig2 % v[0];              // modulo back into next loop
}
```

Here, just as with `maddedu` which can put the hi-half of the 128 bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.

    RT0 = ((  0<<64) | RA0) / RB0
    RC0 = ((  0<<64) | RA0) % RB0
         |
         +-------+
                 |
    RT1 = ((RC0<<64) | RA1) / RB1
    RC1 = ((RC0<<64) | RA1) % RB1
         |
         +-------+
                 |
    RT2 = ((RC1<<64) | RA2) / RB2
    RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation
needed (once, rather than chained) for the `qhat` estimate if it may
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:

`divmod2du RT,RA,RB,RC`

    dividend = (RC) || (RA)
    divisor = EXTZ128(RB)
    RT = UDIV(dividend, divisor)
    RS = UREM(dividend, divisor)

Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether `RS=RC` or
`RS=RT+MAXVL`. Similar flexibility in the scalar-vector settings
allows the instruction to perform full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.
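
A C model of the loop-back (scalar-divisor) case, mirroring the chain
diagram above, is sketched below (names are illustrative and
`unsigned __int128` is assumed):

```
#include <stdint.h>

// one divmod2du element:
//   dividend = RC || RA;  RT = dividend / RB;  RC(out) = dividend % RB
// assumes RC < RB so that the quotient fits in 64 bits (see below)
static uint64_t divmod2du_model(uint64_t ra, uint64_t rb, uint64_t *rc)
{
    unsigned __int128 dividend = ((unsigned __int128)*rc << 64) | ra;
    *rc = (uint64_t)(dividend % rb);   // modulo chains into the next element
    return (uint64_t)(dividend / rb);  // quotient digit
}

// big integer u (most significant digit in u[m-1]) divided by scalar v0:
// the single-digit-divisor loop from Algorithm D, with 64-bit digits
static void bigdiv_scalar(uint64_t q[], const uint64_t u[],
                          uint64_t v0, int m)
{
    uint64_t k = 0;                    // RC starts at zero: no overflow
    for (int j = m - 1; j >= 0; j--)
        q[j] = divmod2du_model(u[j], v0, &k);
    // k now holds the final remainder
}
```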

Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations,
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit
as long as strict precise-aware Program Order is
preserved (even on the VL for-loops).

Just as with `divdeu`, on which this instruction is based, an overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bit. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into `qhat`. With there being so many operands already
in `divmod2du` a `cmpl` instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop however we can see that overflow
will not occur. The initial value k is zero: as long as a divide-by-zero
is not requested this always fulfils the condition `RC < RB`, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) `RC < RB`
is preserved, as long as RC starts at zero.

**Limitations**

One of the worst things for any ISA is that an algorithm's completion
time is directly affected by different implementations having instructions
take longer or shorter times. Knuth's Big-Integer division is unfortunately
one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on
a small power-efficient embedded design, this time would dominate
compared to the 64 bit multiplications. However if the element width
was reduced to 8, such that the computation of qhat only took 16 cycles,
the calculation of qhat would not dominate, but the number of
multiplications would rise: somewhere in between there would be an
elwidth and a Vector Length that would suit that particular embedded
processor.

By contrast a high performance microarchitecture may deploy Goldschmidt
or other efficient Scalar Division, which could complete 128/64 qhat
computation in say only 5 to 8 cycles, which would be tolerable.
Thus, for general-purpose software, it would be necessary to ship
multiple implementations of the same algorithm and dynamically
select the best one.

The very fact that programmers even have to consider multiple
implementations and compare their performance is an unavoidable nuisance.
SVP64 is supposed to be designed such that only one implementation of
any given algorithm is needed. In some ways it is reassuring that
some algorithms just don't fit. Slightly more reassuring is that
Goldschmidt Divide, which uses two multiplications that can be
performed in parallel, would be a much better fit with SVP64 (and
Vector Processing in general), the only downside being that it is
regarded as worthwhile for much larger integers.

# Conclusion

TODO