From 20564d41020fa047ece83faec895775160878a13 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Fri, 22 Apr 2022 10:56:32 +0100
Subject: [PATCH]

---
 openpower/sv/biginteger/analysis.mdwn | 54 ++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 640aed9d8..350fa1623 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -251,8 +251,7 @@ Algorithm D performs estimates which, if wrong, are compensated for
 afterwards.  Essentially there are three phases:
 
 * Calculation of the quotient estimate. This uses a single
-  Scalar divide,
-  which can be ignored for the scope of this analysis
+  Scalar divide, which is covered separately in a later section 
 * Big Integer multiply and subtract.
 * Carry-Correction with a big integer add, if the estimate from
   phase 1 was wrong by one digit.
@@ -310,3 +309,54 @@ Detection of the fixup (phase 3) is determined by the Carry (borrow)
 bit at the end. Logically: if borrow was required then the qhat estimate
 was too large and the correction is required, which is, again,
 nothing more than a Vectorised big-integer add (one instruction).
+
+# 128-bit Scalar divisor
+
+As mentioned above, the first part of the Knuth Algorithm D involves
+computing an estimate for the divisor. This involves using the three
+most significant digits, performing a scalar divide, and consequently
+requires a scalar division with *twice* the number of bits of the
+size of individual digits (for example, a 64-bit array).  In this
+example taken from
+[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/divmnu64.c;hb=HEAD)
+the digits are 32 bit and therefore a 64/64 divide is sufficient to
+cover a 64/32 operation (64-bit dividend, 32-bit divisor):
+
+```
+        // Compute estimate qhat of q[j] from top 2 digits
+        uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
+        qhat = dig2 / vn[n - 1]; // 64/32 divide
+        rhat = dig2 % vn[n - 1]; // 64/32 modulo
+    again:
+        // use 3rd-from-top digit to obtain better accuracy
+        if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2])
+        {
+            qhat = qhat - 1;
+            rhat = rhat + vn[n - 1];
+            if (rhat < b)
+                goto again;
+        }
+```
+
+However when moving to 64-bit digits (desirable because the algorithm
+is `O(N^2)`) this in turn means that the estimate has to be computed
+from a *128* bit dividend and a 64-bit divisor.  This operation does
+not exist in most Scalar 64-bit ISAs, and some investigation into
+soft-implementations of 128/128 or 128/64 divide show it to be typically
+implemented bit-wise.
+
+The irony is, therefore, that attempting to
+improve big-integer divide by moving to 64-bit digits in order to take
+advantage of the efficiency of 64-bit scalar multiply would instead
+lock up CPU time performing a 128/64 division.  With the Vector Multiply
+operations being critically dependent on that `qhat` estimate, this
+Dependency Hazard would have the SIMD Multiply back-ends sitting idle
+waiting for that one scalar value.
+
+Whilst one solution is to reduce the digit width to 32-bit in order to
+go back to 64/32 divide, this increases the completion time by a factor
+of 4 due to the algorithm being `O(N^2)`.
+
+*(TODO continue from here: reference investigation into using Goldschmidt 
+division with fixed-width numbers, to get number of iterations down)*
+
-- 
2.30.2