[[!tag standards]]

# Big Integer Arithmetic

**DRAFT STATUS** 19apr2021

BigNum arithmetic is extremely common, especially in cryptography, where
for example RSA relies on arithmetic on numbers 2048 or 4096 bits in
length. The primary operations are add, multiply and divide (and modulo),
with specialisations of subtract and signed multiply.

A reminder that a particular focus of SVP64 is that it is built on top
of Scalar operations, where those scalar operations are useful in their
own right without SVP64. Thus the operations here are proposed first as
Scalar Extensions to the Power ISA.

A secondary focus is that, if Vectorised, implementors may choose to
deploy macro-op fusion targetting back-end 256-bit or greater Dynamic
SIMD ALUs for maximum performance and effectiveness.

# Analysis

This section covers an analysis of big integer operations. Use of
smaller sub-operations is a given: worst-case, addition is O(N) whilst
multiply and divide are O(N^2).

## Add and Subtract

Surprisingly, no new additional instructions are required to perform a
straightforward big-integer add or subtract. Vectorised `adde` or
`addex` is perfectly sufficient to produce an arbitrary-length
big-integer add, due to the rule set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of their
element-level operations. Thus, because sequential execution of `adde`
both consumes and produces a CA Flag, `sv.adde` is in effect an alias
for a Vectorised big-integer add. As such, implementors are entirely at
liberty to recognise Horizontal-First Vector adds and send the vector of
registers to a much larger and wider back-end ALU.

## Multiply

Multiply is tricky: 64-bit operands actually produce a 128-bit result,
which clearly cannot fit into an orthogonal register file. Most Scalar
RISC ISAs have separate `mul-low-half` and `mul-hi-half` instructions,
whilst some (OpenRISC) have "Accumulators" from which the results of the
multiply must be explicitly extracted. High-performance RISC advocates
recommend "macro-op fusion", which is in effect where the second
instruction gains access to the cached copy of the HI half of the
multiply result, which had already been computed by the first. This
approach quickly complicates the internal microarchitecture, especially
at the decode phase. Instead, Intel, in 2012, specifically added a
`mulx` instruction, allowing both HI and LO halves of the multiply to
reach registers. If done as a multiply-and-accumulate this becomes quite
an expensive operation: 3 64-bit registers in, 2 64-bit registers out.

Long-multiplication may be performed a row at a time, starting with B0:

    C4 C3 C2 C1 C0
             A0xB0
          A1xB0
       A2xB0
    A3xB0
    R4 R3 R2 R1 R0

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation: the upper half of a
previous multiply, two new operands to multiply, and an additional
accumulator (C). However, if C is left out (and added afterwards with a
Vector-Add) things become more manageable. We therefore propose an
operation that is 3-in, 2-out which, noting that the connection between
successive mul-adds has the UPPER half of the previous operation as its
input, writes the UPPER half of the current product into a second output
register for exactly that purpose:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RC = upperhalf(product)
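As an illustration of the intended dataflow, below is a minimal C sketch
of this 3-in, 2-out primitive used to compute one row of the
long-multiplication diagram above. It is a model only, not normative:
the helper names `madd64` and `mul_row` are invented here for clarity,
and `unsigned __int128` (a GCC/Clang extension) stands in for the
128-bit intermediate product.

    #include <inttypes.h>
    #include <stdio.h>

    /* Model of the proposed 3-in, 2-out operation: computes
       RA*RB+RC, returns the LO half, and writes the HI half back
       into RC as the carry-in for the next element.  Note that
       (2^64-1)*(2^64-1) + (2^64-1) < 2^128, so no overflow. */
    static uint64_t madd64(uint64_t ra, uint64_t rb, uint64_t *rc)
    {
        unsigned __int128 product = (unsigned __int128)ra * rb + *rc;
        *rc = (uint64_t)(product >> 64); /* upper half carried forward */
        return (uint64_t)product;        /* lower half: result digit   */
    }

    /* One row of long-multiplication, as in the diagram above:
       multiply the n-digit vector a[] by the single digit b0.
       r[] receives n+1 digits; the accumulate of C happens
       separately, e.g. with a Vectorised big-integer sv.adde. */
    static void mul_row(uint64_t *r, const uint64_t *a, uint64_t b0, int n)
    {
        uint64_t carry = 0;              /* plays the role of RC */
        for (int i = 0; i < n; i++)
            r[i] = madd64(a[i], b0, &carry);
        r[n] = carry;                    /* final upper half */
    }

    int main(void)
    {
        /* worked example: (2^128 - 1) * 3 */
        uint64_t a[2] = { 0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL };
        uint64_t r[3];
        mul_row(r, a, 3, 2);
        printf("%016" PRIx64 " %016" PRIx64 " %016" PRIx64 "\n",
               r[2], r[1], r[0]); /* ...0002 ffff...ffff ffff...fffd */
        return 0;
    }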
Successive iterations effectively use RC as a 64-bit carry, and, as
noted by Intel in their notes on `mulx`, `RA*RB+RC+RD` cannot overflow,
so does not require setting an additional CA flag. Combined with a
Vectorised big-integer `sv.adde`, the key inner loop of Knuth's
Algorithm M may be achieved in four instructions, two of which are
scalar initialisation:

    li r16, 0                      # zero accumulator
    addic r16, r16, 0              # CA to zero as well
    sv.madded r0.v, r8.v, r17, r16 # mul vector by scalar r17
    sv.adde r24.v, r24.v, r0.v     # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source and
destination like this would create costly Dependency Hazards, so such an
instruction would never be proposed. However: it turns out that, just as
with repeated chained application of `adde`, macro-op fusion may be
internally applied to a sequence of these strange multiply operations.
Such a trick works equally well in Scalar-only.

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However the
available space in the prefix is extremely limited (9 bits). With
effectively 5 operands (3 in, 2 out) some compromises are needed.
However a little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register, RS, is set to RC or to RT+VL. This then leaves only 4
registers to qualify as scalar/vector, and these can use four EXTRA2
designators, which fits into the available space.

RS=RT+VL Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RT+VL = upperhalf(product)

and RS=RC Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64-bit Carry. RC as a Vector would produce a
Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode would
allow that same Vector of HI halves to not be an overwrite of RC. Also
it is possible to specify that any of RA, RB or RC are scalar or vector.
Overall it is extremely powerful.

## Divide

The simplest implementation of big-integer divide is the standard
schoolbook "Long Division", with a radix of 2^64 (64-bit digits)
instead of base 10. Donald Knuth's Algorithm D performs estimates
which, if wrong, are compensated for afterwards. Essentially there are
three phases:

* Calculation of the quotient estimate. This is Scalar division and can
  be ignored for the scope of this analysis.
* Big-integer multiply and subtract.
* Carry-Correction with a big-integer add, if the estimate from phase 1
  was wrong by one digit.

In essence, then, the primary focus of Vectorised big-integer divide is
in fact big-integer multiply (more specifically, mul-and-subtract):

    product = RC - (RA) * (RB)
    RT = lowerhalf(product)
    RS = upperhalf(product)
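For clarity, here is a minimal C sketch of that mul-and-subtract phase
(phase 2 of Algorithm D), the operation which the proposed `msubed`
below is intended to accelerate. It is illustrative only: the helper
name `mul_sub_row` is invented here, and `unsigned __int128` (a
GCC/Clang extension) stands in for the 128-bit intermediate.

    #include <inttypes.h>
    #include <stdio.h>

    /* Phase 2 of Knuth's Algorithm D, sketched in C: subtract
       qhat * vn[0..n-1] from the n+1 digits un[0..n], returning 1
       if the quotient estimate qhat was one too large (i.e. the
       row subtraction went negative and phase 3, the big-integer
       correction add, must be applied). */
    static int mul_sub_row(uint64_t *un, const uint64_t *vn,
                           uint64_t qhat, int n)
    {
        uint64_t borrow = 0;             /* running 64-bit borrow */
        for (int i = 0; i < n; i++) {
            /* 128-bit multiply-and-subtract on one digit */
            unsigned __int128 t = (unsigned __int128)un[i]
                                - (unsigned __int128)qhat * vn[i]
                                - borrow;
            un[i] = (uint64_t)t;         /* lower half: result digit */
            /* negated upper half is the borrow carried onward */
            borrow = -(uint64_t)(t >> 64);
        }
        unsigned __int128 t = (unsigned __int128)un[n] - borrow;
        un[n] = (uint64_t)t;
        return (t >> 64) != 0;           /* non-zero: qhat too large */
    }

    int main(void)
    {
        uint64_t un[2] = { 5, 0 };  /* dividend digits, value 5  */
        uint64_t vn[1] = { 10 };    /* divisor digit,   value 10 */
        /* qhat=1 is deliberately one too large for 5/10, so the
           phase-3 correction is reported as needed: */
        printf("correction needed: %d\n", mul_sub_row(un, vn, 1, 1));
        return 0;
    }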
# Instructions

**DRAFT**

Both `madded` and `msubed` are VA-Form:

|0.....5|6..10|11..15|16..20|21..25|26..31|
|-------|-----|------|------|------|------|
| EXT04 | RT  | RA   | RB   | RC   | XO   |

For the Opcode map (XO Field) see Power ISA v3.1, Book III, Appendix D,
Table 13 (sheet 7 of 8), p1357. Proposed is the addition of `msubed`
(**DRAFT, NOT APPROVED**) at XO `110110`; a corresponding `madded` is
proposed at XO `110010`:

| 110000 | 110001  | 110010 | 110011 | 110100 | 110101 | 110110 | 110111 |
| ------ | ------- | ------ | ------ | ------ | ------ | ------ | ------ |
| maddhd | maddhdu | madded | maddld | rsvd   | rsvd   | msubed | rsvd   |

For the SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
used, with an additional bit determining RS:

| Field Name    | Field bits | Description                          |
|---------------|------------|--------------------------------------|
| Rdest\_EXTRA2 | `10:11`    | extends RT (R\*\_EXTRA2 Encoding)    |
| Rsrc1\_EXTRA2 | `12:13`    | extends RA (R\*\_EXTRA2 Encoding)    |
| Rsrc2\_EXTRA2 | `14:15`    | extends RB (R\*\_EXTRA2 Encoding)    |
| Rsrc3\_EXTRA2 | `16:17`    | extends RC (R\*\_EXTRA2 Encoding)    |
| EXTRA2_MODE   | `18`       | used by `msubed` and `madded` for RS |

When `EXTRA2_MODE` is set to zero, the implicit RS register takes its
Vector/Scalar setting from Rdest_EXTRA2, and takes the register number
from RT, but with all numbering offset by VL. *Note that element-width
overrides influence this offset* (see the SVP64 [[svp64/appendix]] for
full details).

When `EXTRA2_MODE` is set to one, the implicit RS register is identical
to RC, extended to SVP64 numbering, including whether RC is set Scalar
or Vector.

The pseudocode for `msubed RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sub[0:127] = EXTZ(RC) - prod
    RT <- sub[64:127]
    RS <- sub[0:63] # RS is either RC or RT+VL

Note that RC is zero-extended rather than sign-extended: in a Vector
Loop it contains the top half of the previous multiply-with-subtract,
and the current product must be subtracted from it.

The pseudocode for `madded RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = EXTZ(RC) + prod
    RT <- sum[64:127]
    RS <- sum[0:63] # RS is either RC or RT+VL

Again RC is zero-extended (not shifted), and the 128-bit product is
added to it; the lower half of the result is stored in RT and the upper
half in RS. The difference from `maddhdu` is that `maddhdu` stores the
upper half in RT, where `madded` stores the upper half in RS. There is
no equivalent to `maddld`, because `maddld` performs sign-extension
on RC.

# Appendix

see [[appendix]]