bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorised big-integer add (one instruction).
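
The correction step can be illustrated with a minimal scalar sketch in Python. This is not the Vectorised instruction sequence itself: the function name `mulsub_with_addback`, the least-significant-first digit lists, and the 64-bit digit width are assumptions made purely for illustration.

```python
MASK = (1 << 64) - 1  # assumed 64-bit digits, least-significant digit first

def mulsub_with_addback(u, v, qhat):
    """Subtract qhat*v from u, where len(u) == len(v) + 1.

    If the final borrow is set, qhat was one too large: decrement it
    and add v back once. Returns (corrected qhat, remainder digits).
    """
    u = list(u)
    borrow = 0
    carry = 0
    # multiply-and-subtract, one digit at a time
    for i, vd in enumerate(v):
        prod = qhat * vd + carry
        carry = prod >> 64
        t = u[i] - (prod & MASK) - borrow
        borrow = 1 if t < 0 else 0
        u[i] = t & MASK
    t = u[len(v)] - carry - borrow
    borrow = 1 if t < 0 else 0
    u[len(v)] = t & MASK
    if borrow:
        # the correction: a plain big-integer add of v
        qhat -= 1
        carry = 0
        for i, vd in enumerate(v):
            s = u[i] + vd + carry
            u[i] = s & MASK
            carry = s >> 64
        u[len(v)] = (u[len(v)] + carry) & MASK
    return qhat, u
```

The final carry out of the add-back is deliberately discarded: it cancels the borrow that triggered the correction.
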
+However this is not the full story.
-# Scalar 128-bit divisor
+**128/64-bit divisor**
As mentioned above, the first part of Knuth's Algorithm D involves
computing an estimate of the quotient digit (`qhat`). This involves using the three
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorised
would instead
-lock up CPU time performing a 128/64 division. With the Vector Multiply
-operations being critically dependent on that `qhat` estimate, and
+lock up CPU time performing a 128/64 scalar division. With the Vector
+Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input to each of the vector digit
multiplies, as a Dependency Hazard it would cause *all* Parallel
SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.
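
To make the dependency concrete, here is a minimal Python sketch of the `qhat` estimate as given in Knuth's step D3. The function name `estimate_qhat` and its argument names are hypothetical, 64-bit digits are assumed, and the `divmod` call stands in for the 128/64 scalar division discussed above.

```python
B = 1 << 64  # digit base: one 64-bit digit (an assumption of this sketch)

def estimate_qhat(u2, u1, u0, v1, v0):
    """Estimate one quotient digit.

    (u2, u1, u0) are the top three dividend digits and (v1, v0) the top
    two digits of a normalised divisor (v1 >= B // 2), with u2 <= v1.
    """
    # the 128/64 scalar division: everything after this depends on it
    qhat, rhat = divmod(u2 * B + u1, v1)
    # refine using the second divisor digit and third dividend digit
    while qhat >= B or qhat * v0 > rhat * B + u0:
        qhat -= 1
        rhat += v1
        if rhat >= B:
            break
    return qhat
```

Every digit multiply in the subsequent multiply-and-subtract takes the returned value as an operand, which is why nothing else can start until this one division completes.
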
go back to 64/32 divide, this increases the completion time by a factor
of 4 due to the algorithm being `O(N^2)`.
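
The factor of 4 can be checked with a rough cost model, sketched below in Python. The function name `digit_steps` and the model itself (one multiply-and-subtract pass over the divisor per quotient digit) are simplifying assumptions, not measurements.

```python
def digit_steps(dividend_bits, divisor_bits, digit_bits):
    """Approximate count of digit multiplies in schoolbook long division."""
    n = dividend_bits // digit_bits  # dividend length in digits
    m = divisor_bits // digit_bits   # divisor length in digits
    # (n - m + 1) quotient digits, each costing m digit multiplies
    return (n - m + 1) * m

steps_64 = digit_steps(4096, 2048, 64)  # 64-bit digits
steps_32 = digit_steps(4096, 2048, 32)  # 32-bit digits
ratio = steps_32 / steps_64
```

Halving the digit width doubles both digit counts, so under this model the work approaches four times as much for large operands.
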
-*(TODO continue from here: reference investigation into using Goldschmidt
-division with fixed-width numbers, to get number of iterations down)*
+**Reducing completion time of 128/64-bit Scalar division**
Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
time-order as Newton-Raphson, using two hardware multipliers
and a subtract.
+**Back to Vector carry-looping**
+
There is however another reason for having a 128/64 division
instruction, and it's effectively the reverse of `madded`.
Look closely at Algorithm D when the divisor is only a scalar