From: lkcl
Date: Thu, 21 Apr 2022 08:27:06 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~2660
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=4cab8a818a7e775a8c3bf7c79f0eeeb3d271b66d;p=libreriscv.git

---

diff --git a/openpower/sv/biginteger.mdwn b/openpower/sv/biginteger.mdwn
index 38ba735e7..73c853f18 100644
--- a/openpower/sv/biginteger.mdwn
+++ b/openpower/sv/biginteger.mdwn
@@ -21,161 +21,12 @@ Dynamic SIMD ALUs for maximum performance and effectiveness.
 # Analysis
 
 Covered in [[biginteger/analysis]] the summary is that standard `adde` is sufficient
-for SVP64 Vectorisation, but that big-integer multiply and divide
-require two extra 3-in 2-out instructions, similar to Intel's `mulx`.
-
-This section covers an analysis of big integer operations. Use of
-smaller sub-operations is a given: worst-case, addition is O(N)
-whilst multiply and divide are O(N^2).
-
-## Add and Subtract
-
-Surprisingly, no new additional instructions are required to perform
-a straightforward big-integer add or subtract. Vectorised `adde`
-or `addex` is perfectly sufficient to produce arbitrary-length
-big-integer add due to the rules set in SVP64 that all Vector Operations
-are directly equivalent to the strict Program Order Execution of
-their element-level operations.
-
-Thus, due to sequential execution of `adde` both consuming and producing
-a CA Flag, `sv.adde` is in effect an alias for Vectorised add. As such,
-implementors are entirely at liberty to recognise Horizontal-First Vector
-adds and send the vector of registers to a much larger and wider back-end
-ALU.
-
-## Multiply
-
-Long-multiply, assuming an O(N^2) algorithm, is performed by summing
-NxN separate smaller multiplications together. Karatsuba's algorithm
-reduces the number of small multiplies at the expense of increasing
-the number of additions. Some algorithms follow the Vedic Multiply
-pattern by grouping together all multiplies of the same magnitude/power
-(same column) whilst others perform row-based multiplication: a single
-digit of B multiplies the entirety of A, summed a row at a time. This
-algorithm is the basis of the analysis below (Knuth's Algorithm M).
-
-Multiply is tricky: 64 bit operands actually produce a 128-bit result,
-which clearly cannot fit into an orthogonal register file.
-Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
-instructions, whilst some (OpenRISC) have "Accumulators" from which
-the results of the multiply must be explicitly extracted. High
-performance RISC advocates
-recommend "macro-op fusion" which is in effect where the second instruction
-gains access to the cached copy of the HI half of the
-multiply redult, which had already been
-computed by the first. This approach quickly complicates the internal
-microarchitecture, especially at the decode phase.
-
-Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
-both HI and LO halves of the multiply to reach registers. If done as a
-multiply-and-accumulate this becomes quite an expensive operation:
-3 64-Bit in, 2 64-bit registers out).
-
-Long-multiplication may be performed a row at a time, starting
-with B0:
-
-    C4 C3 C2 C1 C0
-             A0xB0
-          A1xB0
-       A2xB0
-    A3xB0
-    R4 R3 R2 R1 R0
-
-* R0 contains C0 plus the LO half of A0 times B0
-* R1 contains C1 plus the LO half of A1 times B0
-  plus the HI half of A0 times B0.
-* R2 contains C2 plus the LO half of A2 times B0
-  plus the HI half of A1 times B0.
-
-This would on the face of it be a 4-in operation:
-the upper half of a previous multiply, two new operands
-to multiply, and an additional accumulator (C). However if
-C is left out (and added afterwards with a Vector-Add)
-things become more manageable.
-
-We therefore propose an operation that is 3-in, 2-out,
-that, noting that the connection between successive
-mul-adds has the UPPER half of the previous operation
-as its input, writes the UPPER half of the current
-product into a second output register for exactly that
-purpose.
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RC = upperhalf(product)
-
-Successive iterations effectively use RC as a 64-bit carry, and
-as noted by Intel in their notes on mulx,
-RA*RB+RC+RD cannot overflow, so does not require
-setting an additional CA flag.
-
-Combined with a Vectorised big-int `sv.adde` the key inner loop of
-Knuth's Algorithm M may be achieved in four instructions, two of
-which are scalar initialisation:
-
-    li r16, 0                     # zero accululator
-    addic r16, r16, 0             # CA to zero as well
-    sv.madde r0.v, r8.v, r17, r16 # mul vector
-    sv.adde r24.v, r24.v, r0.v    # big-add row to result
-
-Normally, in a Scalar ISA, the use of a register as both a source
-and destination like this would create costly Dependency Hazards, so
-such an instruction would never be proposed. However: it turns out
-that, just as with repeated chained application of `adde`, macro-op
-fusion may be internally applied to a sequence of these strange multiply
-operations. (*Such a trick works equally as well in a Scalar-only
-Out-of-Order microarchitecture, although the conditions are harder to
-detect*).
-
-**Application of SVP64**
-
-SVP64 has the means to mark registers as scalar or vector. However
-the available space in the prefix is extremely limited (9 bits).
-With effectively 5 operands (3 in, 2 out) some compromises are needed.
-However a little though gives a useful workaround: two modes,
-controlled by a single bit in `RM.EXTRA`, determine whether the 5th
-register is set to RC or whether to RT+VL. This then leaves only
-4 registers to qualify as scalar/vector, and this can use four
-EXTRA2 designators which fits into the available space.
-
-RS=RT+VL Mode:
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RS=RT+VL = upperhalf(product)
-
-and RS=RC Mode:
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RS=RC = upperhalf(product)
-
-Now there is much more potential, including setting RC to a Scalar,
-which would be useful as a 64 bit Carry. RC as a Vector would produce
-a Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode
-would allow that same Vector of HI halves to not be an overwrite of RC.
-Also it is possible to specify that any of RA, RB or RC are scalar or
-vector. Overall it is extremely powerful.
-
-## Divide
-
-The simplest implementation of big-int divide is the standard schoolbook
-"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
-Algorithm D performs estimates which, if wrong, are compensated for
-afterwards. Essentially there are three phases:
-
-* Calculation of the quotient estimate. This is Scalar division
-  and can be ignored for the scope of this analysis
-* Big Integer multiply and subtract.
-* Carry-Correction with a big integer add, if the estimate from
-  phase 1 was wrong by one digit.
-
-In essence then the primary focus of Vectorised Big-Int divide is in
-fact big-integer multiply (more specifically, mul-and-subtract).
-
-    product = RC - (RA) * (RB)
-    RT = lowerhalf(product)
-    RS = upperhalf(product)
+for SVP64 Vectorisation of big-integer addition (and subfe for
+subtraction) but that big-integer multiply and divide
+require two extra 3-in 2-out instructions, similar to Intel's `mulx`,
+to be efficient. Macro-op Fusion and back-end massively-wide SIMD ALUs
+may be deployed in a fashion that is hidden from the user, behind a
+consistent, stable ISA API.
 
 # Instructions