From: lkcl <lkcl@web>
Date: Fri, 22 Apr 2022 13:18:13 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~2616
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=528e9e960d835b8004040267d9dbe05e54621ad0;p=libreriscv.git

---

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index af7cb91f0..55652bdbe 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -25,7 +25,7 @@ Links
 * <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
 * <https://bugs.libre-soc.org/show_bug.cgi?id=817>
 
-# Add and Subtract
+# Vector Add and Subtract
 
 Surprisingly, no new additional instructions are required to perform
 a straightforward big-integer add or subtract.  Vectorised `adde`
@@ -93,7 +93,7 @@ to people unfamiliar with Cray-style Vectors: if VL is not
 permitted to exceed 1 (because MAXVL is set to 1) then the above
 actually becomes a Scalar Big-Int add algorithm.
 
-# Multiply
+# Vector Multiply
 
 Long-multiply, assuming an O(N^2) algorithm, is performed by summing
 NxN separate smaller multiplications together.  Karatsuba's algorithm
@@ -244,7 +244,7 @@ would allow that same Vector of HI halves to not be an overwrite of RC.
 Also it is possible to specify that any of RA, RB or RC are scalar or
 vector. Overall it is extremely powerful.
 
-# Divide
+# Vector Divide
 
 The simplest implementation of big-int divide is the standard schoolbook
 "Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
@@ -303,7 +303,7 @@ bit at the end. Logically: if borrow was required then the qhat estimate
 was too large and the correction is required, which is, again,
 nothing more than a Vectorised big-integer add (one instruction).
 
-# 128-bit Scalar divisor
+# Scalar 128-bit divisor
 
 As mentioned above, the first part of the Knuth Algorithm D involves
 computing an estimate for the divisor. This involves using the three
@@ -332,19 +332,23 @@ cover a 64/32 operation (64-bit dividend, 32-bit divisor):
 
 However when moving to 64-bit digits (desirable because the algorithm
 is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor.  This operation does
-not exist in most Scalar 64-bit ISAs, and some investigation into
+from a *128* bit dividend and a 64-bit divisor.  Such an operation
+simply does not exist in most Scalar 64-bit ISAs.  For Power ISA
+it would be necessary to implement Packed SIMD instructions
+and infrastructure in order to utilise `vdivuq`, which is a 128/128
+(quad) divide, not a 128/64.  Some investigation into
 soft-implementations of 128/128 or 128/64 divide show it to be typically
-implemented bit-wise.
+implemented bit-wise, with the high latency that implies.
 
 The irony is, therefore, that attempting to
 improve big-integer divide by moving to 64-bit digits in order to take
-advantage of the efficiency of 64-bit scalar multiply would instead
+advantage of the efficiency of 64-bit scalar multiply when Vectorised
+would instead
 lock up CPU time performing a 128/64 division.  With the Vector Multiply
 operations being critically dependent on that `qhat` estimate, and
 because that scalar is as an input into each of the vector digit
-multiples, as a Dependency Hazard it would have the Parallel SIMD Multiply
-back-ends sitting 100% idle, waiting for that one scalar value.
+multiples, as a Dependency Hazard it would cause *all* Parallel
+SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.
 
 Whilst one solution is to reduce the digit width to 32-bit in order to
 go back to 64/32 divide, this increases the completion time by a factor