[[!tag standards]]

# Analysis

**DRAFT SVP64**

* Revision 0.0: 21apr2022 <https://www.youtube.com/watch?v=8hrIG7-E77o>
* Revision 0.01: 22apr2022 removal of msubed because sv.maddedu and sv.subfe work
* Revision 0.02: 22apr2022 128/64 scalar divide, investigate Goldschmidt
* Revision 0.03: 24apr2022 add 128/64 divmod2du, similar loop to maddedu
* Revision 0.04: 26apr2022 Knuth original uses overflow on scalar div
* Revision 0.05: 27apr2022 add vector shift section (no new instructions)

**Introduction**

This page covers an analysis of big integer operations in order to
work out optimal Scalar Instructions to propose for submission to
the OpenPOWER ISA WG, which, when combined with Draft SVP64, give
high-performance compact Big Integer Vector Arithmetic. Leverage
of existing Scalar Power ISA instructions is also explained.

Use of smaller sub-operations is a given: worst-case in a Scalar
context, addition is O(N) whilst multiply and divide are O(N^2),
and their Vectorization would reduce those (for small N) to
O(1) and O(N). Knuth's big-integer scalar algorithms provide
useful real-world grounding into the types of operations needed,
making it easy to demonstrate how they would be Vectorized.

The basic principle behind Knuth's algorithms is to break the
problem down into a single scalar op against a Vector operand.
*This fits naturally with a Scalable Vector ISA such as SVP64*.
It only remains to exploit Carry (1-bit and 64-bit) in a Scalable Vector
context and the picture is complete.

34
35 Links
36
37 * <https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
38 * <https://web.archive.org/web/20141021201141/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
39 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
40 * <https://news.ycombinator.com/item?id=21151646>
41 * <https://twitter.com/lkcl/status/1517169267912986624>
42 * <https://www.youtube.com/watch?v=8hrIG7-E77o>
43 * <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
44 * <https://bugs.libre-soc.org/show_bug.cgi?id=817>
45
# Vector Add and Subtract

Surprisingly, no new additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorized `adde`
or `addex` is perfectly sufficient to produce an arbitrary-length
big-integer add, due to the rule set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of
their element-level operations. Assuming that the two bigints (or
a part thereof) have been loaded into sequentially-contiguous
registers, with the least-significant bits being in the lowest-numbered
register in each case:

    R0,CA = A0+B0+CA    adde r0,a0,b0
       |
       +----------+
                  |
    R1,CA = A1+B1+CA    adde r1,a1,b1
       |
       +----------+
                  |
    R2,CA = A2+B2+CA    adde r2,a2,b2

This pattern - sequential execution of individual instructions
with incrementing register numbers - is precisely the definition
of how SVP64 works!
Thus, due to sequential execution of `adde` both consuming and producing
a CA Flag, with no additions to SVP64 or to the v3.0 Power ISA,
`sv.adde` is in effect an alias for Big-Integer Vectorized add. As such,
implementors are entirely at liberty to recognise Horizontal-First Vector
adds, send the vector of registers to a much larger and wider back-end
ALU, and short-cut the intermediate storage of XER.CA at the element
level in back-end hardware that need only:

* read the first incoming XER.CA
* implement a large Vector-aware carry propagation algorithm
* store the very last XER.CA in the batch

The size and implementation of the underlying back-end SIMD ALU
is entirely at the discretion of the implementer, as is whether to
deploy the above strategy. The only hard requirement for
implementors of SVP64 is to comply with strict and precise Program Order
even at the Element level.

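To illustrate the element-level semantics, the following is a minimal C
sketch (a model only, not Power ISA pseudocode; it assumes the GCC/Clang
`unsigned __int128` extension and uses an illustrative function name) of
what the chained `adde` elements compute: each 64-bit limb addition
consumes the previous 1-bit carry and produces the next, exactly as
XER.CA chains through `sv.adde`.

```
#include <stdint.h>

// Model of sv.adde over n limbs: r[] = a[] + b[], chaining a 1-bit
// carry (XER.CA) from each element into the next.  Limb 0 is the
// least-significant 64 bits, matching the register layout above.
static uint64_t bigadd(uint64_t r[], const uint64_t a[],
                       const uint64_t b[], int n, uint64_t ca)
{
    for (int i = 0; i < n; i++) {
        unsigned __int128 sum = (unsigned __int128)a[i] + b[i] + ca;
        r[i] = (uint64_t)sum;           // low 64 bits -> result limb
        ca   = (uint64_t)(sum >> 64);   // carry-out (0 or 1) -> next element
    }
    return ca; // final XER.CA
}
```

With `n` set to 1 this collapses to a single scalar `adde` step, matching
the later observation that VL=1 degenerates to a Scalar Big-Int add.
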
If there is pressure on the register file (or the big integers
have multiple millions of digits)
then a partial-sum may be carried out with LD and ST in a standard
Cray-style Vector Loop:

    aptr = A address
    bptr = B address
    rptr = Result address
    li r0, 0                # used to help clear CA
    addic r0, r0, 0         # CA to zero as well
    setmvli 8               # set MAXVL to 8
    loop:
      setvl t0, n           # n is the number of digits
      mulli t1, t0, 8       # 8 bytes per digit/element
      sv.ldu a0, aptr, t1   # update advances pointer
      sv.ldu b0, bptr, t1   # likewise
      sv.adde r0, a0, b0    # takes in CA, updates CA
      sv.stu rptr, r0, t1   # pointer advances too
      sub. n, n, t0         # should not alter CA
      bnz loop              # do more digits

This is not that different from a Scalar Big-Int add; it is
just that, as with all Cray-style Vectorization, a variable number
of elements are covered by one instruction. Of interest
to people unfamiliar with Cray-style Vectors: if VL is not
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.

# Vector Multiply

Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
reduces the number of small multiplies at the expense of increasing
the number of additions. Some algorithms follow the Vedic Multiply
pattern by grouping together all multiplies of the same magnitude/power
(same column) whilst others perform row-based multiplication: a single
digit of B multiplies the entirety of A, summed a row at a time.
A Row-based
algorithm is the basis of the analysis below (Knuth's Algorithm M).

Multiply is tricky: 64-bit operands actually produce a 128-bit result,
which clearly cannot fit into a single register of an orthogonal register file.
Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
instructions, whilst some (OpenRISC) have "Accumulators" from which
the results of the multiply must be explicitly extracted. High
performance RISC advocates
recommend "macro-op fusion", which is in effect where the second instruction
gains access to the cached copy of the HI half of the
multiply result, which had already been
computed by the first. This approach quickly complicates the internal
microarchitecture, especially at the decode phase.

Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
both HI and LO halves of the multiply to reach registers with a single
instruction. If however done as a
multiply-and-accumulate this becomes quite an expensive operation
(3 64-bit registers in, 2 64-bit registers out).

Long-multiplication may be performed a row at a time, starting
with B0:

    C4 C3 C2 C1 C0
             A0xB0
          A1xB0
       A2xB0
    A3xB0
    R4 R3 R2 R1 R0

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation:
the upper half of a previous multiply, two new operands
to multiply, and an additional accumulator (C). However if
C is left out (and added afterwards with a Vector-Add)
things become more manageable.

Demonstrating in C, here is a Row-based multiply using a temporary vector,
adapted from a simple implementation
of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>

```
// this becomes the basis for sv.maddedu in RS=RC Mode,
// where k is RC. k takes the upper half of product
// and adds it in on the next iteration
k = 0;
for (i = 0; i < m; i++) {
    unsigned product = u[i]*v[j] + k;
    k = product >> 16;
    plo[i] = product; // & 0xffff
}
// this is simply sv.adde where k is XER.CA
k = 0;
for (i = 0; i < m; i++) {
    t = plo[i] + w[i + j] + k;
    w[i + j] = t;     // (I.e., t & 0xFFFF).
    k = t >> 16;      // carry: should only be 1 bit
}
```

We therefore propose an operation that is 3-in, 2-out. Noting that
the connection between successive mul-adds takes the UPPER half of
the previous operation as its input, the proposed instruction writes
the UPPER half of the current product into a second output register,
for exactly the purpose of letting it be added onto the next BigInt digit.

    product = RA*RB+RC
    RT = lowerhalf(product)
    RC = upperhalf(product)

Horizontal-First Mode therefore may be applied to just this one
instruction.
Successive sequential iterations effectively use RC as a kind of
64-bit carry, and,
as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow a 128-bit result, so no
additional CA flag needs to be set. We first cover the chain of
`RA*RB+RC` as follows:

    RT0, RC0 = RA0 * RB0 + 0
          |
          +----------------+
                           |
    RT1, RC1 = RA1 * RB1 + RC0
          |
          +----------------+
                           |
    RT2, RC2 = RA2 * RB2 + RC1

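The same chain can be modelled in C. This is a hedged sketch (the
function name `bigmul_row` is illustrative, and the GCC/Clang
`unsigned __int128` extension is assumed) of one row of Algorithm M as a
chained `maddedu`: a vector of 64-bit digits multiplied by one scalar
digit, with RC acting as the 64-bit carry between elements.

```
#include <stdint.h>

// Sketch of the sv.maddedu chain (RS=RC Mode): multiply a vector of
// 64-bit digits a[] by one scalar digit b, with rc acting as a 64-bit
// "carry" between elements.  rt[] receives the LO halves; the returned
// value is the final carry (the last RC), i.e. the top digit of the row.
static uint64_t bigmul_row(uint64_t rt[], const uint64_t a[],
                           uint64_t b, int n)
{
    uint64_t rc = 0; // RC starts at zero
    for (int i = 0; i < n; i++) {
        unsigned __int128 product = (unsigned __int128)a[i] * b + rc;
        rt[i] = (uint64_t)product;          // lowerhalf(product) -> RT
        rc    = (uint64_t)(product >> 64);  // upperhalf(product) -> next RC
    }
    return rc;
}
```
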
Following up to add each partially-computed row to what will become
the final result is achieved with a Vectorized big-int
`sv.adde`. Thus, the key inner loop of
Knuth's Algorithm M may be achieved in four instructions, two of
which are scalar initialisation:

    li r16, 0                     # zero accumulator
    addic r16, r16, 0             # CA to zero as well
    sv.maddedu *r0, *r8, r17, r16 # mul vector
    sv.adde *r24, *r24, *r0       # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
that, just as with repeated chained application of `adde`, macro-op
fusion may be internally applied to a sequence of these strange multiply
operations. (*Such a trick works equally as well in a Scalar-only
Out-of-Order microarchitecture, although the conditions are harder to
detect*).

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However
the available space in the prefix is extremely limited (9 bits).
With effectively 5 operands (3 in, 2 out) some compromises are needed.
A little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register (RS) is RC or RT+MAXVL. This then leaves only
4 registers to qualify as scalar/vector, which can use four
EXTRA2 designators and fits into the available 9-bit space.

RS=RT+MAXVL Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RT+MAXVL = upperhalf(product)

and RS=RC Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64-bit Carry. RC as a Vector would produce
a Vector of the HI halves of a Vector of multiplies. RS=RT+MAXVL Mode
would allow that same Vector of HI halves to not be an overwrite of RC.
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.

# Vector Shift

The decision here was made to follow the same principle as
for multiply-and-accumulate. Whilst Intel has
[shrd](https://www.felixcloutier.com/x86/shrd)
which is very similar (but only 2-in 2-out),
the decision was made to create a 3-in 2-out
instruction that effectively uses one of the inputs and one of
the outputs as a "sort of 64-bit carry".

Without the `dsrd` instruction, it is necessary to have three
instructions in an inner loop.
Keeping the shift amount within the range of the element (64 bit),
a Vector bit-shift may be synthesised from a pair of shift operations
and an OR, all of which are standard Scalar Power ISA instructions
that, when Vectorized, are exactly what is needed, but this
critically requires Vertical-First Mode.

```
void bigrsh(unsigned s, uint64_t r[], uint64_t un[], int n) {
    for (int i = 0; i < n - 1; i++)
        r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
    r[n - 1] = un[n - 1] >> s;
}
```

With SVP64 being on top of the standard scalar regfile, the offset by
one element may be achieved simply by referencing the same
vector data offset by one. Given that all three instructions
(`srd`, `sld`, `or`) are an SVP64 type `RM-1P-2S1D` and are `EXTRA3`,
it is possible to reference the full 128 64-bit registers (r0-r127):

    subfic t1, t0, 64     # compute 64-s (s in t0)
    sv.srd *r8, *r24, t0  # shift each element of r24 vector right by s
    sv.sld *r16, *r25, t1 # offset start of vector by one (r25)
    sv.or  *r8, *r8, *r16 # OR two parts together

Predication with zeroing may be utilised on sld to ensure that the
last element is zero, avoiding over-run.

The reason why three instructions are needed here, instead of one as in
the case of big-add, is that multiple bits chain through to the
next element, whereas for add it is a single bit (carry-in, carry-out),
and a single bit is precisely what `adde` already handles.
For multiply, divide and shift it is worthwhile to use
one scalar register effectively as a full 64-bit carry/chain.

The limitations of this approach therefore become pretty clear:
not only must Vertical-First Mode be used but also the predication
with zeroing trick. Worse than that, an entire temporary vector
is required, which wastes register space.
A better way would be to create a single
scalar instruction that can do the long-shift in-place.

The basic principle of the 3-in 2-out `dsrd` is:

```
# r[i] = (un[i] >> s) | (un[i + 1] << (64 - s));
temp <- ROT128(RA || RC, RB[58:63])
RT <- temp[64:127]
RS <- temp[0:63]
```

A 128-bit shift is performed, taking the lower half of the result
into RT and the upper half into RS. However there is a trick that
may be applied, which only requires `ROT64`:

```
n <- (RB)[58:63]
v <- ROTL64((RA), 64-n)
mask <- MASK(n, 63)
RT <- (v[0:63] & mask) | ((RC) & ¬mask)
RS <- v[0:63] & ¬mask
```

The trick here is that the *entirety* of `RA` is rotated,
then parts of it are masked into the destinations.
RC, if also properly masked, can be ORed into RT, as
long as the bits of RC are in the right place.
The really interesting bit is that when Vectorized,
the upper bits (now in RS) *are* in the right bit-positions
to be ORed into the second `dsrd` operation. This allows
us to create a chain of `sv.dsrd`, and a single instruction
replaces all four above:

    sv.dsrd *r8, *r24, t1, t0

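A hedged C model of this chain may help. The `dsrd` helper below follows
the ROT64 pseudocode above (the names and the top-down loop order are
illustrative assumptions; the actual element ordering depends on how the
limbs are laid out in the vector registers), and the driver reproduces
the earlier `bigrsh` result using one chained operation per element.

```
#include <stdint.h>

// Model of one dsrd operation (ROT64 trick): returns RT, and writes RS,
// the bits shifted out of RA, already positioned to become the next RC.
static uint64_t dsrd(uint64_t ra, unsigned n, uint64_t rc, uint64_t *rs)
{
    uint64_t v    = (n == 0) ? ra : (ra >> n) | (ra << (64 - n));   // ROTR64(ra, n)
    uint64_t mask = (n == 0) ? ~(uint64_t)0
                             : (((uint64_t)1 << (64 - n)) - 1);     // low 64-n bits
    uint64_t rt   = (v & mask) | (rc & ~mask); // (ra >> n) OR'd with chained-in bits
    *rs = v & ~mask;   // ra's low n bits, already in the upper positions
    return rt;
}

// Equivalent of bigrsh() above, but as a single dsrd chain: iterate from
// the most-significant limb downwards so each RS feeds the next RC.
static void bigrsh_dsrd(unsigned s, uint64_t r[], const uint64_t un[], int n)
{
    uint64_t rc = 0; // the top limb has nothing shifted in
    for (int i = n - 1; i >= 0; i--)
        r[i] = dsrd(un[i], s, rc, &rc);
}
```
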
For shift amounts larger than an element bitwidth, standard register move
operations may be used, or, if the shift amount is static,
an alternate starting point may be referenced in
the registers containing the Vector elements,
because SVP64 sits on top of a standard Scalar register file.
`sv.sld r16.v, r26.v, t1` for example is equivalent to shifting
by an extra 64 bits, compared to `sv.sld r16.v, r25.v, t1`.

# Vector Divide

The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
Algorithm D performs estimates which, if wrong, are compensated for
afterwards. Essentially there are three phases:

* Calculation of the quotient estimate. This uses a single
  Scalar divide, which is covered separately in a later section.
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
  phase 1 was wrong by one digit.

From Knuth's Algorithm D, implemented in
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD),
Phase 2 is expressed in C as:

```
// Multiply and subtract.
k = 0;
for (i = 0; i < n; i++) {
    p = qhat*vn[i]; // 64-bit product
    t = un[i+j] - k - (p & 0xFFFFFFFFLL);
    un[i+j] = t;
    k = (p >> 32) - (t >> 32);
}
```

Analysis of this algorithm shows that, if a temporary vector is
acceptable, it can be split into two in exactly the same way as
Algorithm M, this time using subtract instead of add.

```
uint32_t carry = 0;
// this is just sv.maddedu again
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
    carry = (uint32_t)(value >> 32); // upper half for next loop
    product[i] = (uint32_t)value;    // lower into vector
}
bool ca = true;
// this is simply sv.subfe where ca is XER.CA
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
    ca = (value >> 32) != 0;
    un_j[i] = value;
}
bool need_fixup = !ca; // for phase 3 correction
```

In essence, then, the primary focus of Vectorized Big-Int divide is in
fact big-integer multiply.

Detection of the fixup (phase 3) is determined by the Carry (borrow)
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorized big-integer add (one instruction).
However this is not the full story.

**128/64-bit division**

As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate for the quotient, `qhat`. This involves using the
three most significant digits, performing a scalar divide, and consequently
requires a scalar division with *twice* the number of bits of the
size of individual digits (for example, a 64-bit array). In this
example taken from
[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;hb=HEAD)
the digits are 32-bit and, special-casing the overflow, a 64/32 divide
is sufficient (64-bit dividend, 32-bit divisor):

```
// Compute estimate qhat of q[j] from top 2 digits
uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
if (un[j+n] >= vn[n-1]) {
    // rhat can be bigger than 32-bit when the division overflows
    qhat = UINT32_MAX;
    rhat = dig2 - (uint64_t)UINT32_MAX * vn[n - 1];
} else {
    qhat = dig2 / vn[n - 1]; // 64/32 divide
    rhat = dig2 % vn[n - 1]; // 64/32 modulo
}
// use 3rd-from-top digit to obtain better accuracy
b = 1UL << 32;
while (rhat < b && qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
    qhat = qhat - 1;
    rhat = rhat + vn[n - 1];
}
```

However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
from a *128-bit* dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, which places one operand in the upper half
of a 128-bit dividend, the lower half is zero. Power ISA
also has the Packed SIMD instruction `vdivuq`, but that is a 128/128
(quad) divide, not a 128/64, and its use would require considerable
effort to move registers to and from GPRs. Some investigation into
soft-implementations of 128/128 or 128/64 divide shows them to be typically
implemented bit-wise, with all that implies.

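To make the requirement concrete, here is an illustrative C sketch (not
taken from divmnu64.c, which uses 32-bit digits) of the same qhat
estimate with 64-bit digits, relying on the GCC/Clang `unsigned __int128`
extension: the divide and modulo below are 128/64 operations which,
absent a hardware instruction, the compiler lowers to a software routine
(typically a call such as `__udivti3`).

```
#include <stdint.h>

// Illustrative 64-bit-digit version of the qhat estimate above.
// dig2 is a full 128-bit dividend, so the / and % are 128/64 operations.
static void qhat_estimate(const uint64_t un[], const uint64_t vn[],
                          int j, int n,
                          uint64_t *qhat, unsigned __int128 *rhat)
{
    unsigned __int128 dig2 =
        ((unsigned __int128)un[j + n] << 64) | un[j + n - 1];
    if (un[j + n] >= vn[n - 1]) {
        // rhat can be bigger than 64-bit when the division overflows
        *qhat = UINT64_MAX;
        *rhat = dig2 - (unsigned __int128)UINT64_MAX * vn[n - 1];
    } else {
        *qhat = (uint64_t)(dig2 / vn[n - 1]); // 128/64 divide
        *rhat = dig2 % vn[n - 1];             // 128/64 modulo
    }
}
```
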
The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorized
would instead
lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.

Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.

**Reducing completion time of 128/64-bit Scalar division**

Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1-bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division. Whilst typically deployed for Floating Point,
there is no reason why it should not be adapted to Fixed Point.
In this way a Scalar Integer divide can be performed in the same
time-order as Newton-Raphson, using two hardware multipliers
and a subtract.

**Back to Vector carry-looping**

There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `maddedu`.
Look closely at Algorithm D when the divisor is only a scalar
(`v[0]`):

```
k = 0;                            // the case of a
for (j = m - 1; j >= 0; j--) {    // single-digit
    uint64_t dig2 = ((k << 32) | u[j]);
    q[j] = dig2 / v[0];           // divisor here.
    k = dig2 % v[0];              // modulo back into next loop
}
```

Here, just as with `maddedu`, which can put the hi-half of the 128-bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of the next 128/64-bit divide.

    RT0 = (( 0<<64) | RA0) / RB0
    RC0 = (( 0<<64) | RA0) % RB0
     |
     +-------+
             |
    RT1 = ((RC0<<64) | RA1) / RB1
    RC1 = ((RC0<<64) | RA1) % RB1
     |
     +-------+
             |
    RT2 = ((RC1<<64) | RA2) / RB2
    RC2 = ((RC1<<64) | RA2) % RB2

By a nice coincidence this is exactly the same 128/64-bit operation
needed (once, rather than chained) for the `qhat` estimate, if it can
produce both the quotient and the remainder.
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as:

`divmod2du RT,RA,RB,RC`

    dividend = (RC) || (RA)
    divisor = EXTZ128(RB)
    RT = UDIV(dividend, divisor)
    RS = UREM(dividend, divisor)

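As with `maddedu`, the chained behaviour can be modelled directly in C.
This is a hedged sketch (illustrative name, GCC/Clang `unsigned __int128`
assumed) of big-integer division of a vector of 64-bit digits by a single
scalar digit, with the modulo chaining through RC exactly as in the
diagram above:

```
#include <stdint.h>

// Model of a chained sv.divmod2du: divide an n-digit big integer u[]
// (u[n-1] most significant) by a single 64-bit divisor d, producing the
// quotient digits q[] and returning the final remainder.  rc chains the
// modulo from each element into the next as the hi-half of the
// following 128/64 divide.  rc < d holds throughout because rc starts
// at 0 and a remainder is always less than the divisor.
static uint64_t bigdiv_scalar(uint64_t q[], const uint64_t u[],
                              uint64_t d, int n)
{
    uint64_t rc = 0;
    for (int j = n - 1; j >= 0; j--) {
        unsigned __int128 dividend = ((unsigned __int128)rc << 64) | u[j];
        q[j] = (uint64_t)(dividend / d); // RT: quotient digit
        rc   = (uint64_t)(dividend % d); // RS/RC: modulo into next element
    }
    return rc;
}
```
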
Again, in an SVP64 context, using EXTRA mode bit 8 allows for
selecting whether `RS=RC` or
`RS=RT+MAXVL`. Similar flexibility in the scalar-vector settings
allows the instruction to perform a full parallel vector div/mod,
or act in loop-back mode for big-int division by a scalar,
or perform a single scalar 128/64 div/mod.

Again, just as with `sv.maddedu` and `sv.adde`, adventurous implementors
may perform massively-wide DIV/MOD by transparently merging (fusing)
the Vector element operations together, only inputting a single RC and
outputting the last RC. Where efficient algorithms such as Goldschmidt
are deployed internally this could dramatically reduce the cycle completion
time for massive Vector DIV/MOD. Thus, just as with the other operations,
the apparent limitation of creating chains is overcome: SVP64 is,
by design, an "expression of intent" where the implementor is free to
achieve that intent in any way they see fit,
as long as strict, precise-exception-aware Program Order is
preserved (even on the VL for-loops).

Just as with `divdeu`, on which this instruction is based, overflow
detection is required. When the divisor is too small compared to
the dividend then the result may not fit into 64 bits. Knuth's
original algorithm detects overflow and manually places 0xffffffff
(all ones) into `qhat`. With there being so many operands already
in `divmod2du`, a `cmpl` instruction can be used instead to detect
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.

Looking closely at the loop, however, we can see that overflow
will not occur. The initial value k is zero: as long as a divide-by-zero
is not requested this always fulfils the condition `RC < RB`, and on
subsequent iterations the new k, being the modulo, is always less than the
divisor as well. Thus the condition (the loop invariant) `RC < RB`
is preserved, as long as RC starts at zero.

**Limitations**

One of the worst things for any ISA is that an algorithm's completion
time is directly affected by different implementations having instructions
take longer or shorter times. Knuth's Big-Integer division is unfortunately
one such algorithm.

Assuming that the computation of qhat takes 128 cycles to complete on
a small power-efficient embedded design, this time would dominate
compared to the 64-bit multiplications. However if the element width
were reduced to 8, such that the computation of qhat only took 16 cycles,
the calculation of qhat would not dominate, but the number of
multiplications would rise: somewhere in between there would be an
elwidth and a Vector Length that would suit that particular embedded
processor.

By contrast a high-performance microarchitecture may deploy Goldschmidt
or other efficient Scalar Division, which could complete 128/64 qhat
computation in say only 5 to 8 cycles, which would be tolerable.
Thus, for general-purpose software, it would be necessary to ship
multiple implementations of the same algorithm and dynamically
select the best one.

The very fact that programmers even have to consider multiple
implementations and compare their performance is an unavoidable nuisance.
SVP64 is supposed to be designed such that only one implementation of
any given algorithm is needed. In some ways it is reassuring that
some algorithms just don't fit. Slightly more reassuring is that
Goldschmidt Divide, which uses two multiplications that can be
performed in parallel, would be a much better fit with SVP64 (and
Vector Processing in general), the only downside being that it is
regarded as worthwhile only for much larger integers.

# Conclusion

The majority of ISAs do not have 3-in 2-out instructions for very good
reasons: they are horrendously expensive if implemented as operations
that can write to two registers every clock cycle. Implementations
that take two clock cycles to write to a single write port are frowned upon,
and the Hazard Management complexity also goes up.

However when looked at from a Vectorized perspective and using "chaining"
(the 64-bit carry-out becomes the 64-bit input for the next operation),
"Operand Forwarding" makes one read and one write effectively disappear,
reducing most of the operations to a manageable, efficient 2-in 1-out.
In normal Scalar hardware being used to attempt Vectorized bigint operations
the opportunity for "Forwarding" is quite hard to spot, but because of the
Vectorization it is known *in advance* that a Vector of values is being used,
i.e. only *one* instruction is issued, not several. Therefore it is easy to
micro-code the Operand Forwarding.

In essence, by taking things that extra step further, the complexity of
Scalar ISAs disappears through the introduction of some uniform Vector-Looping
that, although complex in itself in hardware, at least only has to be done once
and becomes uniformly applicable, RISC-style, to *all* instructions. These
comprehensive, powerful Scalar arithmetic instructions will significantly
reduce the complexity of big-integer operations.