From af96950fc94e0198c4c7cbaa8e015f0e893948ef Mon Sep 17 00:00:00 2001 From: lkcl Date: Thu, 21 Apr 2022 09:20:46 +0100 Subject: [PATCH] --- openpower/sv/biginteger/analysis.mdwn | 155 ++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 openpower/sv/biginteger/analysis.mdwn diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn new file mode 100644 index 000000000..cd2e3bd61 --- /dev/null +++ b/openpower/sv/biginteger/analysis.mdwn @@ -0,0 +1,155 @@ +# Analysis + +This section covers an analysis of big integer operations. Use of +smaller sub-operations is a given: worst-case, addition is O(N) +whilst multiply and divide are O(N^2). + +# Add and Subtract + +Surprisingly, no new additional instructions are required to perform +a straightforward big-integer add or subtract. Vectorised `adde` +or `addex` is perfectly sufficient to produce arbitrary-length +big-integer add due to the rules set in SVP64 that all Vector Operations +are directly equivalent to the strict Program Order Execution of +their element-level operations. + +Thus, due to sequential execution of `adde` both consuming and producing +a CA Flag, `sv.adde` is in effect an alias for Vectorised add. As such, +implementors are entirely at liberty to recognise Horizontal-First Vector +adds and send the vector of registers to a much larger and wider back-end +ALU. + +# Multiply + +Long-multiply, assuming an O(N^2) algorithm, is performed by summing +NxN separate smaller multiplications together. Karatsuba's algorithm +reduces the number of small multiplies at the expense of increasing +the number of additions. Some algorithms follow the Vedic Multiply +pattern by grouping together all multiplies of the same magnitude/power +(same column) whilst others perform row-based multiplication: a single +digit of B multiplies the entirety of A, summed a row at a time. This +algorithm is the basis of the analysis below (Knuth's Algorithm M). + +Multiply is tricky: 64 bit operands actually produce a 128-bit result, +which clearly cannot fit into an orthogonal register file. +Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half` +instructions, whilst some (OpenRISC) have "Accumulators" from which +the results of the multiply must be explicitly extracted. High +performance RISC advocates +recommend "macro-op fusion" which is in effect where the second instruction +gains access to the cached copy of the HI half of the +multiply redult, which had already been +computed by the first. This approach quickly complicates the internal +microarchitecture, especially at the decode phase. + +Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing +both HI and LO halves of the multiply to reach registers. If done as a +multiply-and-accumulate this becomes quite an expensive operation: +3 64-Bit in, 2 64-bit registers out). + +Long-multiplication may be performed a row at a time, starting +with B0: + + C4 C3 C2 C1 C0 + A0xB0 + A1xB0 + A2xB0 + A3xB0 + R4 R3 R2 R1 R0 + +* R0 contains C0 plus the LO half of A0 times B0 +* R1 contains C1 plus the LO half of A1 times B0 + plus the HI half of A0 times B0. +* R2 contains C2 plus the LO half of A2 times B0 + plus the HI half of A1 times B0. + +This would on the face of it be a 4-in operation: +the upper half of a previous multiply, two new operands +to multiply, and an additional accumulator (C). However if +C is left out (and added afterwards with a Vector-Add) +things become more manageable. + +We therefore propose an operation that is 3-in, 2-out, +that, noting that the connection between successive +mul-adds has the UPPER half of the previous operation +as its input, writes the UPPER half of the current +product into a second output register for exactly that +purpose. + + product = RA*RB+RC + RT = lowerhalf(product) + RC = upperhalf(product) + +Successive iterations effectively use RC as a 64-bit carry, and +as noted by Intel in their notes on mulx, +RA*RB+RC+RD cannot overflow, so does not require +setting an additional CA flag. + +Combined with a Vectorised big-int `sv.adde` the key inner loop of +Knuth's Algorithm M may be achieved in four instructions, two of +which are scalar initialisation: + + li r16, 0 # zero accululator + addic r16, r16, 0 # CA to zero as well + sv.madde r0.v, r8.v, r17, r16 # mul vector + sv.adde r24.v, r24.v, r0.v # big-add row to result + +Normally, in a Scalar ISA, the use of a register as both a source +and destination like this would create costly Dependency Hazards, so +such an instruction would never be proposed. However: it turns out +that, just as with repeated chained application of `adde`, macro-op +fusion may be internally applied to a sequence of these strange multiply +operations. (*Such a trick works equally as well in a Scalar-only +Out-of-Order microarchitecture, although the conditions are harder to +detect*). + +**Application of SVP64** + +SVP64 has the means to mark registers as scalar or vector. However +the available space in the prefix is extremely limited (9 bits). +With effectively 5 operands (3 in, 2 out) some compromises are needed. +However a little though gives a useful workaround: two modes, +controlled by a single bit in `RM.EXTRA`, determine whether the 5th +register is set to RC or whether to RT+VL. This then leaves only +4 registers to qualify as scalar/vector, and this can use four +EXTRA2 designators which fits into the available space. + +RS=RT+VL Mode: + + product = RA*RB+RC + RT = lowerhalf(product) + RS=RT+VL = upperhalf(product) + +and RS=RC Mode: + + product = RA*RB+RC + RT = lowerhalf(product) + RS=RC = upperhalf(product) + +Now there is much more potential, including setting RC to a Scalar, +which would be useful as a 64 bit Carry. RC as a Vector would produce +a Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode +would allow that same Vector of HI halves to not be an overwrite of RC. +Also it is possible to specify that any of RA, RB or RC are scalar or +vector. Overall it is extremely powerful. + +# Divide + +The simplest implementation of big-int divide is the standard schoolbook +"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's +Algorithm D performs estimates which, if wrong, are compensated for +afterwards. Essentially there are three phases: + +* Calculation of the quotient estimate. This is Scalar division + and can be ignored for the scope of this analysis +* Big Integer multiply and subtract. +* Carry-Correction with a big integer add, if the estimate from + phase 1 was wrong by one digit. + +In essence then the primary focus of Vectorised Big-Int divide is in +fact big-integer multiply (more specifically, mul-and-subtract). + + product = RC - (RA) * (RB) + RT = lowerhalf(product) + RS = upperhalf(product) + -- 2.30.2