From 4cab8a818a7e775a8c3bf7c79f0eeeb3d271b66d Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Thu, 21 Apr 2022 09:27:06 +0100
Subject: [PATCH]

---
 openpower/sv/biginteger.mdwn | 161 ++---------------------------------
 1 file changed, 6 insertions(+), 155 deletions(-)

diff --git a/openpower/sv/biginteger.mdwn b/openpower/sv/biginteger.mdwn
index 38ba735e7..73c853f18 100644
--- a/openpower/sv/biginteger.mdwn
+++ b/openpower/sv/biginteger.mdwn
@@ -21,161 +21,12 @@ Dynamic SIMD ALUs for maximum performance and effectiveness.
 # Analysis
 
 Covered in [[biginteger/analysis]] the summary is that standard `adde` is sufficient
-for SVP64 Vectorisation, but that big-integer multiply and divide
-require two extra 3-in 2-out instructions, similar to Intel's `mulx`.
-
-This section covers an analysis of big integer operations.  Use of
-smaller sub-operations is a given: worst-case, addition is O(N)
-whilst multiply and divide are O(N^2).
-
-## Add and Subtract
-
-Surprisingly, no new additional instructions are required to perform
-a straightforward big-integer add or subtract.  Vectorised `adde`
-or `addex` is perfectly sufficient to produce arbitrary-length
-big-integer add due to the rules set in SVP64 that all Vector Operations
-are directly equivalent to the strict Program Order Execution of
-their element-level operations.
-
-Thus, due to sequential execution of `adde` both consuming and producing
-a CA Flag, `sv.adde` is in effect an alias for Vectorised add.  As such,
-implementors are entirely at liberty to recognise Horizontal-First Vector
-adds and send the vector of registers to a much larger and wider back-end
-ALU.
-
-## Multiply
-
-Long-multiply, assuming an O(N^2) algorithm, is performed by summing
-NxN separate smaller multiplications together.  Karatsuba's algorithm
-reduces the number of small multiplies at the expense of increasing
-the number of additions.  Some algorithms follow the Vedic Multiply
-pattern by grouping together all multiplies of the same magnitude/power
-(same column) whilst others perform row-based multiplication: a single
-digit of B multiplies the entirety of A, summed a row at a time. This
-algorithm is the basis of the analysis below (Knuth's Algorithm M).
-
-Multiply is tricky: 64 bit operands actually produce a 128-bit result,
-which clearly cannot fit into an orthogonal register file.
-Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
-instructions, whilst some (OpenRISC) have "Accumulators" from which
-the results of the multiply must be explicitly extracted. High
-performance RISC advocates
-recommend "macro-op fusion" which is in effect where the second instruction
-gains access to the cached copy of the HI half of the
-multiply redult, which had already been
-computed by the first. This approach quickly complicates the internal
-microarchitecture, especially at the decode phase.
-
-Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
-both HI and LO halves of the multiply to reach registers.  If done as a
-multiply-and-accumulate this becomes quite an expensive operation:
-3 64-Bit in, 2 64-bit registers out).
-
-Long-multiplication may be performed a row at a time, starting
-with B0:
-
-    C4 C3 C2 C1 C0
-             A0xB0
-          A1xB0
-       A2xB0
-    A3xB0
-    R4 R3 R2 R1 R0
-
-* R0 contains C0 plus the LO half of A0 times B0
-* R1 contains C1 plus the LO half of A1 times B0
-  plus the HI half of A0 times B0.
-* R2 contains C2 plus the LO half of A2 times B0
-  plus the HI half of A1 times B0.
-
-This would on the face of it be a 4-in operation:
-the upper half of a previous multiply, two new operands
-to multiply, and an additional accumulator (C). However if
-C is left out (and added afterwards with a Vector-Add)
-things become more manageable.
-
-We therefore propose an operation that is 3-in, 2-out,
-that, noting that the connection between successive
-mul-adds has the UPPER half of the previous operation
-as its input, writes the UPPER half of the current
-product into a second output register for exactly that
-purpose.
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RC = upperhalf(product)
-
-Successive iterations effectively use RC as a 64-bit carry, and
-as noted by Intel in their notes on mulx,
-RA*RB+RC+RD cannot overflow, so does not require
-setting an additional CA flag.
-
-Combined with a Vectorised big-int `sv.adde` the key inner loop of
-Knuth's Algorithm M may be achieved in four instructions, two of
-which are scalar initialisation:
-
-    li r16, 0                     # zero accululator
-    addic r16, r16, 0             # CA to zero as well
-    sv.madde r0.v, r8.v, r17, r16 # mul vector 
-    sv.adde r24.v, r24.v, r0.v   # big-add row to result
-    
-Normally, in a Scalar ISA, the use of a register as both a source
-and destination like this would create costly Dependency Hazards, so
-such an instruction would never be proposed.  However: it turns out
-that, just as with repeated chained application of `adde`, macro-op
-fusion may be internally applied to a sequence of these strange multiply
-operations. (*Such a trick works equally as well in a Scalar-only
-Out-of-Order microarchitecture, although the conditions are harder to
-detect*).
-
-**Application of SVP64**
-
-SVP64 has the means to mark registers as scalar or vector. However
-the available space in the prefix is extremely limited (9 bits).
-With effectively 5 operands (3 in, 2 out) some compromises are needed.
-However a little though gives a useful workaround: two modes,
-controlled by a single bit in `RM.EXTRA`, determine whether the 5th
-register is set to RC or whether to RT+VL. This then leaves only
-4 registers to qualify as scalar/vector, and this can use four
-EXTRA2 designators which fits into the available space.
-
-RS=RT+VL Mode:
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RS=RT+VL = upperhalf(product)
-
-and RS=RC Mode:
-
-    product = RA*RB+RC
-    RT = lowerhalf(product)
-    RS=RC = upperhalf(product)
-
-Now there is much more potential, including setting RC to a Scalar,
-which would be useful as a 64 bit Carry. RC as a Vector would produce
-a Vector of the HI halves of a Vector of multiplies.  RS=RT+VL Mode
-would allow that same Vector of HI halves to not be an overwrite of RC.
-Also it is possible to specify that any of RA, RB or RC are scalar or
-vector. Overall it is extremely powerful.
-
-## Divide
-
-The simplest implementation of big-int divide is the standard schoolbook
-"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
-Algorithm D performs estimates which, if wrong, are compensated for
-afterwards.  Essentially there are three phases:
-
-* Calculation of the quotient estimate. This is Scalar division
-  and can be ignored for the scope of this analysis
-* Big Integer multiply and subtract.
-* Carry-Correction with a big integer add, if the estimate from
-  phase 1 was wrong by one digit.
-
-In essence then the primary focus of Vectorised Big-Int divide is in
-fact big-integer multiply (more specifically, mul-and-subtract).
-
-    product = RC - (RA) * (RB)
-    RT = lowerhalf(product)
-    RS = upperhalf(product)
+for SVP64 Vectorisation of big-integer addition (and subfe for
+subtraction) but that big-integer multiply and divide
+require two extra 3-in 2-out instructions, similar to Intel's `mulx`,
+to be efficient.  Macro-op Fusion and back-end massively-wide SIMD ALUs
+may be deployed in a fashion that is hidden from the user, behind a
+consistent, stable ISA API.
 
 # Instructions
 
-- 
2.30.2