From af96950fc94e0198c4c7cbaa8e015f0e893948ef Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Thu, 21 Apr 2022 09:20:46 +0100
Subject: [PATCH]

---
 openpower/sv/biginteger/analysis.mdwn | 155 ++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 openpower/sv/biginteger/analysis.mdwn

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
new file mode 100644
index 000000000..cd2e3bd61
--- /dev/null
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -0,0 +1,155 @@
+# Analysis
+
+This section covers an analysis of big integer operations.  Use of
+smaller sub-operations is a given: worst-case, addition is O(N)
+whilst multiply and divide are O(N^2).
+
+# Add and Subtract
+
+Surprisingly, no new additional instructions are required to perform
+a straightforward big-integer add or subtract.  Vectorised `adde`
+or `addex` is perfectly sufficient to produce arbitrary-length
+big-integer add due to the rules set in SVP64 that all Vector Operations
+are directly equivalent to the strict Program Order Execution of
+their element-level operations.
+
+Thus, due to sequential execution of `adde` both consuming and producing
+a CA Flag, `sv.adde` is in effect an alias for Vectorised add.  As such,
+implementors are entirely at liberty to recognise Horizontal-First Vector
+adds and send the vector of registers to a much larger and wider back-end
+ALU.
+
+# Multiply
+
+Long-multiply, assuming an O(N^2) algorithm, is performed by summing
+NxN separate smaller multiplications together.  Karatsuba's algorithm
+reduces the number of small multiplies at the expense of increasing
+the number of additions.  Some algorithms follow the Vedic Multiply
+pattern by grouping together all multiplies of the same magnitude/power
+(same column) whilst others perform row-based multiplication: a single
+digit of B multiplies the entirety of A, summed a row at a time. This
+algorithm is the basis of the analysis below (Knuth's Algorithm M).
+
+Multiply is tricky: 64 bit operands actually produce a 128-bit result,
+which clearly cannot fit into an orthogonal register file.
+Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
+instructions, whilst some (OpenRISC) have "Accumulators" from which
+the results of the multiply must be explicitly extracted. High
+performance RISC advocates
+recommend "macro-op fusion" which is in effect where the second instruction
+gains access to the cached copy of the HI half of the
+multiply redult, which had already been
+computed by the first. This approach quickly complicates the internal
+microarchitecture, especially at the decode phase.
+
+Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
+both HI and LO halves of the multiply to reach registers.  If done as a
+multiply-and-accumulate this becomes quite an expensive operation:
+3 64-Bit in, 2 64-bit registers out).
+
+Long-multiplication may be performed a row at a time, starting
+with B0:
+
+    C4 C3 C2 C1 C0
+             A0xB0
+          A1xB0
+       A2xB0
+    A3xB0
+    R4 R3 R2 R1 R0
+
+* R0 contains C0 plus the LO half of A0 times B0
+* R1 contains C1 plus the LO half of A1 times B0
+  plus the HI half of A0 times B0.
+* R2 contains C2 plus the LO half of A2 times B0
+  plus the HI half of A1 times B0.
+
+This would on the face of it be a 4-in operation:
+the upper half of a previous multiply, two new operands
+to multiply, and an additional accumulator (C). However if
+C is left out (and added afterwards with a Vector-Add)
+things become more manageable.
+
+We therefore propose an operation that is 3-in, 2-out,
+that, noting that the connection between successive
+mul-adds has the UPPER half of the previous operation
+as its input, writes the UPPER half of the current
+product into a second output register for exactly that
+purpose.
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RC = upperhalf(product)
+
+Successive iterations effectively use RC as a 64-bit carry, and
+as noted by Intel in their notes on mulx,
+RA*RB+RC+RD cannot overflow, so does not require
+setting an additional CA flag.
+
+Combined with a Vectorised big-int `sv.adde` the key inner loop of
+Knuth's Algorithm M may be achieved in four instructions, two of
+which are scalar initialisation:
+
+    li r16, 0                     # zero accululator
+    addic r16, r16, 0             # CA to zero as well
+    sv.madde r0.v, r8.v, r17, r16 # mul vector 
+    sv.adde r24.v, r24.v, r0.v   # big-add row to result
+    
+Normally, in a Scalar ISA, the use of a register as both a source
+and destination like this would create costly Dependency Hazards, so
+such an instruction would never be proposed.  However: it turns out
+that, just as with repeated chained application of `adde`, macro-op
+fusion may be internally applied to a sequence of these strange multiply
+operations. (*Such a trick works equally as well in a Scalar-only
+Out-of-Order microarchitecture, although the conditions are harder to
+detect*).
+
+**Application of SVP64**
+
+SVP64 has the means to mark registers as scalar or vector. However
+the available space in the prefix is extremely limited (9 bits).
+With effectively 5 operands (3 in, 2 out) some compromises are needed.
+However a little though gives a useful workaround: two modes,
+controlled by a single bit in `RM.EXTRA`, determine whether the 5th
+register is set to RC or whether to RT+VL. This then leaves only
+4 registers to qualify as scalar/vector, and this can use four
+EXTRA2 designators which fits into the available space.
+
+RS=RT+VL Mode:
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RS=RT+VL = upperhalf(product)
+
+and RS=RC Mode:
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RS=RC = upperhalf(product)
+
+Now there is much more potential, including setting RC to a Scalar,
+which would be useful as a 64 bit Carry. RC as a Vector would produce
+a Vector of the HI halves of a Vector of multiplies.  RS=RT+VL Mode
+would allow that same Vector of HI halves to not be an overwrite of RC.
+Also it is possible to specify that any of RA, RB or RC are scalar or
+vector. Overall it is extremely powerful.
+
+# Divide
+
+The simplest implementation of big-int divide is the standard schoolbook
+"Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
+Algorithm D performs estimates which, if wrong, are compensated for
+afterwards.  Essentially there are three phases:
+
+* Calculation of the quotient estimate. This is Scalar division
+  and can be ignored for the scope of this analysis
+* Big Integer multiply and subtract.
+* Carry-Correction with a big integer add, if the estimate from
+  phase 1 was wrong by one digit.
+
+In essence then the primary focus of Vectorised Big-Int divide is in
+fact big-integer multiply (more specifically, mul-and-subtract).
+
+    product = RC - (RA) * (RB)
+    RT = lowerhalf(product)
+    RS = upperhalf(product)
+
-- 
2.30.2