* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe works
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

This page covers an analysis of big integer operations, to
work out optimal Scalar Instructions to propose for submission to
the OpenPOWER ISA WG that, when combined with Draft SVP64, give
high-performance compact Big Integer Vector Arithmetic. Leverage
of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
*This fits naturally with a Scalable Vector ISA such as SVP64*.
It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector
context and the picture is complete.

* <https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
* <https://web.archive.org/web/20141021201141/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
* <https://news.ycombinator.com/item?id=21151646>
* <https://twitter.com/lkcl/status/1517169267912986624>
* <https://www.youtube.com/watch?v=8hrIG7-E77o>
* <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=817>

# Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce arbitrary-length
big-integer add due to the rules set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their element-level operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentially-contiguous
registers, with the least-significant bits being in the lowest-numbered
register in each case:

    R0,CA = A0+B0+CA    adde r0,a0,b0

    R1,CA = A1+B1+CA    adde r1,a1,b1

    R2,CA = A2+B2+CA    adde r2,a2,b2

This pattern - sequential execution of individual instructions
with incrementing register numbers - is precisely the very definition
of strict Program Order execution.

Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds and send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA on an element
level in back-end hardware that need only:

* read the first incoming XER.CA
* implement a large Vector-aware carry propagation algorithm
* store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU
is entirely at the discretion of the implementer, as is whether to
deploy the above strategy. The only hard requirement for
implementors of SVP64 is to comply with strict and precise Program Order
even at the Element level.

If there is pressure on the register file (or the big integers
have millions of digits)
then a partial-sum may be carried out with LD and ST in a standard
Cray-style Vector Loop:

    li r0, 0                # used to help clear CA
    addic r0, r0, 0         # CA to zero as well
    setmvli 8               # set MAXVL to 8
    loop:
      setvl t0, n           # n is the number of digits
      mulli t1, t0, 8       # 8 bytes per digit/element
      sv.ldu a0, aptr, t1   # update advances pointer
      sv.ldu b0, bptr, t1   # likewise
      sv.adde r0, a0, b0    # takes in CA, updates CA
      sv.stu rptr, r0, t1   # pointer advances too
      sub. n, n, t0         # should not alter CA
      bnz loop              # do more digits

This is not that different from a Scalar Big-Int add, it is
just that like all Cray-style Vectorization, a variable number
of elements are covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.

# Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform row-based multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time. The latter
(row-based) algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64 bit operands actually produce a 128-bit result,
which clearly cannot fit into an orthogonal register file.
Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macro-op fusion", which in effect lets the second instruction
gain access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiply-and-accumulate this becomes quite an expensive operation:
three 64-bit registers in, two 64-bit registers out.

Long-multiplication may be performed a row at a time, starting
with B0:

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a Vector-Add)
things become more manageable.

Demonstrating in C, a Row-based multiply using a temporary vector.
Adapted from a simple implementation
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>

    // this becomes the basis for sv.maddedu in RS=RC Mode,
    // where k is RC. k takes the upper half of product
    // and adds it in on the next iteration
    k = 0;
    for (i = 0; i < m; i++) {
        unsigned product = u[i]*v[j] + k;
        k = product >> 16;
        plo[i] = product; // & 0xffff
    }

    // this is simply sv.adde where k is XER.CA
    k = 0;
    for (i = 0; i < m; i++) {
        t = plo[i] + w[i + j] + k;
        w[i + j] = t;     // (I.e., t & 0xFFFF).
        k = t >> 16;      // carry: should only be 1 bit
    }

We therefore propose an operation that is 3-in, 2-out.
Noting that the connection between successive
mul-adds takes the UPPER half of the previous operation
as its input, the operation writes the UPPER half of the current
product into a second output register, for exactly the
purpose of letting it be added onto the next BigInt digit.

    maddedu RT,RA,RB,RC
    product = (RA) * (RB) + (RC)
    RT = lowerhalf(product)
    RC = upperhalf(product)

Horizontal-First Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64-bit carry, and,
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
`RA*RB+RC` as follows:

    RT0, RC0 = RA0 * RB0 + 0

    RT1, RC1 = RA1 * RB1 + RC0

    RT2, RC2 = RA2 * RB2 + RC1

Following up to add each partially-computed row to what will become
the final result is achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:

    li r16, 0                      # zero accumulator
    addic r16, r16, 0              # CA to zero as well
    sv.maddedu *r0, *r8, r17, r16  # mul vector
    sv.adde *r24, *r24, *r0        # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macro-op
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalar-only
Out-of-Order microarchitecture, although the conditions are harder to
detect.*)

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register is set to RC or whether to RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9-bit space.

In RS=RT+MAXVL Mode:

    RT = lowerhalf(product)
    RS=RT+MAXVL = upperhalf(product)

and in RS=RC Mode:

    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64 bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.

# Vector Shift

The decision here was made to follow the same principle as
for multiply-and-accumulate. Whilst Intel has
[shrd](https://www.felixcloutier.com/x86/shrd)
which is very similar (but only 2-in 2-out),
the decision was made to create a 3-in 2-out
instruction that effectively uses one of the inputs and one of
the outputs as a "sort of 64-bit carry".

Without the `dsrd` instruction, it is necessary to have three
instructions in an inner loop.
Keeping the shift amount within the range of the element (64 bit),
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
that when Vectorized are exactly what is needed, but that critically
requires Vertical-First Mode.

    void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
        for (int i = 0; i < n - 1; i++)
            r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
        r[n - 1] = un[n - 1] >> s;
    }

With SVP64 being on top of the standard scalar regfile the offset by
one of the elements may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
it is possible to reference the full 128 64-bit registers (r0-r127):

    subfic t1, t0, 64     # compute 64-s (s in t0)
    sv.srd *r8, *r24, t0  # shift each element of r24 vector right by s
    sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
    sv.or *r8, *r8, *r16  # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the
last element is zero, avoiding over-run.

The reason why three instructions are needed instead of one in the
case of big-add is because multiple bits chain through to the
next element, where for add it is a single bit (carry-in, carry-out),
and this is precisely what `adde` already does.
For multiply, divide and shift it is worthwhile to use
one scalar register effectively as a full 64-bit carry/chain.

The basic principle of the 3-in 2-out `dsrd` is:

    # r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    temp <- ROT128(RA || RC, RB[58:63])
    RT <- temp[0:63]
    RS <- temp[64:127]

A 128-bit shift is performed, taking the lower half into RS
and the upper half into RT. However there is a trick that
may be applied, which only requires `ROT64`:

    n <- (RB)[58:63]
    v <- ROTL64((RA), 64-n)
    mask <- MASK(n, 63)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
    RS <- v[0:63] & ¬mask

The trick here is that the *entirety* of `RA` is rotated,
then parts of it are masked into the destinations.
RC, if also properly masked, can be ORed into RT, as
long as the bits of RC are in the right place.
The really interesting bit is that when Vectorized,
the upper bits (now in RS) *are* in the right bit-positions
to be ORed into the second `dsrd` operation. This allows
us to create a chain of `sv.dsrd` operations.

For larger shift amounts beyond an element bitwidth standard register move
operations may be used, or, if the shift amount is static,
an alternate starting point may be referenced in
the registers containing the Vector elements,
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.

# Vector Divide

The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:

* Calculation of the quotient estimate. This uses a single
  Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
  phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in C as:

    // Multiply and subtract.
    k = 0;
    for (i = 0; i < n; i++) {
        p = qhat*vn[i]; // 64-bit product
        t = un[i+j] - k - (p & 0xFFFFFFFFLL);
        un[i+j] = t;
        k = (p >> 32) - (t >> 32);
    }
    t = un[j+n] - k;
    un[j+n] = t;

Analysis of this algorithm shows that, if a temporary vector is
acceptable, it can be split into two in exactly the same way as Algorithm M,
this time using subtract instead of add.

    // this is just sv.maddedu again
    uint32_t carry = 0;
    for (int i = 0; i <= n; i++) {
        uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
        carry = (uint32_t)(value >> 32); // upper half for next loop
        product[i] = (uint32_t)value;    // lower into vector
    }

    // this is simply sv.subfe where ca is XER.CA
    bool ca = true;
    for (int i = 0; i <= n; i++) {
        uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
        ca = value >> 32 != 0;
        un_j[i] = (uint32_t)value;
    }

    bool need_fixup = !ca; // for phase 3 correction

In essence, then, the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit divisor**

As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate for the quotient digit. This involves using the three
most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64-bit array). In this
example taken from
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32 bit and, special-casing the overflow, a 64/32 divide
is sufficient (64-bit dividend, 32-bit divisor):

    // Compute estimate qhat of q[j] from top 2 digits
    uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
    if (un[j+n] >= vn[n-1]) {
        // rhat can be bigger than 32-bit when the division overflows
        qhat = UINT32_MAX;
        rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
    } else {
        qhat = dig2 / vn[n - 1]; // 64/32 divide
        rhat = dig2 % vn[n - 1]; // 64/32 modulo
    }
    // use 3rd-from-top digit to obtain better accuracy
    b = 1UL << 32;
    while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
        qhat = qhat - 1;
        rhat = rhat + vn[n - 1];
    }

However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128* bit dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, by placing one operand in the upper half
of a 128-bit dividend, the lower half is zero. Again Power ISA
has a Packed SIMD instruction `vdivuq` which is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
soft-implementations of 128/128 or 128/64 divide show it to be typically
implemented bit-wise, with all that implies.

The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is used as an input into each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.

**Reducing completion time of 128/64-bit Scalar division**

Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1-bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
time-order as Newton-Raphson, using two hardware multipliers.

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar:

    k = 0; // the case of a scalar divisor
    for (j = m - 1; j >= 0; j--)
    {
        uint64_t dig2 = ((k << 32) | u[j]);
        q[j] = dig2 / v[0]; // divisor here.
        k = dig2 % v[0]; // modulo back into next loop
    }

Here, just as with `maddedu` which can put the hi-half of the 128 bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.

    RT0 = (( 0<<64) | RA0) / RB0
    RC0 = (( 0<<64) | RA0) % RB0

    RT1 = ((RC0<<64) | RA1) / RB1
    RC1 = ((RC0<<64) | RA1) % RB1

    RT2 = ((RC1<<64) | RA2) / RB2
    RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation
needed (once, rather than chained) for the `qhat` estimate if it may
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:

`divmod2du RT,RA,RB,RC`

    dividend = (RC) || (RA)
    divisor = EXTZ128(RB)
    RT = UDIV(dividend, divisor)
    RS = UREM(dividend, divisor)

Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether `RS=RC` or
`RS=RT+MAXVL`. Similar flexibility in the scalar-vector settings
allows the instruction to perform full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or for a single scalar 128/64 div/mod.

Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit
as long as strict precise-aware Program Order is
preserved (even on the VL for-loops).

Just as with `divdeu`, on which this instruction is based, overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bits. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into `qhat`. With there being so many operands already
in `divmod2du`, a `cmpl` instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop however we can see that overflow
will not occur. The initial value k is zero: as long as a divide-by-zero
is not requested this always fulfils the condition `RC < RB`, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) `RC < RB`
is preserved, as long as RC starts at zero.

# Conclusion

One of the worst things for any ISA is that an algorithm's completion
time is directly affected by different implementations having instructions
take longer or shorter times. Knuth's Big-Integer division is unfortunately
one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on
a small power-efficient embedded design, this time would dominate
compared to the 64 bit multiplications. However if the element width
was reduced to 8, such that the computation of qhat only took 16 cycles,
the calculation of qhat would not dominate, but the number of
multiplications would rise: somewhere in between there would be an
elwidth and a Vector Length that would suit that particular embedded
processor.

By contrast a high performance microarchitecture may deploy Goldschmidt
or other efficient Scalar Division, which could complete 128/64 qhat
computation in say only 5 to 8 cycles, which would be tolerable.
Thus, for general-purpose software, it would be necessary to ship
multiple implementations of the same algorithm and dynamically
select the best one.

The very fact that programmers even have to consider multiple
implementations and compare their performance is an unavoidable nuisance.
SVP64 is supposed to be designed such that only one implementation of
any given algorithm is needed. In some ways it is reassuring that
some algorithms just don't fit. Slightly more reassuring is that
Goldschmidt Divide, which uses two multiplications that can be
performed in parallel, would be a much better fit with SVP64 (and
Vector Processing in general), the only downside being that it is
regarded as worthwhile only for much larger integers.