From b3bb943b370bdf60cc9416858aa9047d0a9a4cd5 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Mon, 25 Apr 2022 16:15:44 +0100
Subject: [PATCH]

---
 openpower/sv/biginteger/analysis.mdwn | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index e5723d02a..8e7d6c0ed 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -303,8 +303,9 @@ Detection of the fixup (phase 3) is determined by the Carry (borrow)
 bit at the end. Logically: if borrow was required then the qhat estimate
 was too large and the correction is required, which is, again,
 nothing more than a Vectorised big-integer add (one instruction).
+However this is not the full story
 
-# Scalar 128-bit divisor
+**128/64-bit divisor**
 
 As mentioned above, the first part of the Knuth Algorithm D involves
 computing an estimate for the divisor. This involves using the three
@@ -345,8 +346,8 @@ The irony is, therefore, that attempting to
 improve big-integer divide by moving to 64-bit digits in order to take
 advantage of the efficiency of 64-bit scalar multiply when Vectorised
 would instead
-lock up CPU time performing a 128/64 division.  With the Vector Multiply
-operations being critically dependent on that `qhat` estimate, and
+lock up CPU time performing a 128/64 scalar division.  With the Vector 
+Multiply operations being critically dependent on that `qhat` estimate, and
 because that scalar is as an input into each of the vector digit
 multiples, as a Dependency Hazard it would cause *all* Parallel
 SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.
@@ -355,8 +356,7 @@ Whilst one solution is to reduce the digit width to 32-bit in order to
 go back to 64/32 divide, this increases the completion time by a factor
 of 4 due to the algorithm being `O(N^2)`.
 
-*(TODO continue from here: reference investigation into using Goldschmidt 
-division with fixed-width numbers, to get number of iterations down)*
+**Reducing completion time of 128/64-bit Scalar division**
 
 Scalar division is a known computer science problem because, as even the
 Big-Int Divide shows, it requires looping around a multiply (or, if reduced
@@ -370,6 +370,8 @@ In this way a Scalar Integer divide can be performed in the same
 time-order as Newton-Raphson, using two hardware multipliers
 and a subtract.
 
+**Back to Vector carry-looping**
+
 There is however another reason for having a 128/64 division
 instruction, and it's effectively the reverse of `madded`.
 Look closely at Algorithm D when the divisor is only a scalar
-- 
2.30.2