afterwards. Essentially there are three phases:
* Calculation of the quotient estimate. This uses a single
- Scalar divide,
- which can be ignored for the scope of this analysis
+ Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
phase 1 was wrong by one digit.
bit at the end. Logically: if borrow was required then the qhat estimate
was too large and the correction is required, which is, again,
nothing more than a Vectorised big-integer add (one instruction).
+
+# 128-bit Scalar divisor
+
+As mentioned above, the first part of Knuth's Algorithm D computes an
+estimate of the next quotient digit. The estimate divides the top two
+digits of the dividend by the top digit of the divisor, then refines
+the result using the third-from-top dividend digit. The scalar division
+therefore needs a dividend with *twice* the number of bits of an
+individual digit (for example, a 64-bit dividend for 32-bit digits). In
+this example taken from
+[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/divmnu64.c;hb=HEAD)
+the digits are 32-bit, so a 64/64 divide is sufficient to
+cover the 64/32 operation (64-bit dividend, 32-bit divisor):
+
+```
+        // Compute estimate qhat of q[j] from top 2 digits
+        uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1];
+        qhat = dig2 / vn[n - 1]; // 64/32 divide
+        rhat = dig2 % vn[n - 1]; // 64/32 modulo
+    again:
+        // use 3rd-from-top digit to obtain better accuracy
+        if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2])
+        {
+            qhat = qhat - 1;
+            rhat = rhat + vn[n - 1];
+            if (rhat < b)
+                goto again;
+        }
+```
+
+However, moving to 64-bit digits (desirable because the algorithm
+is `O(N^2)`) means that the estimate has to be computed
+from a *128*-bit dividend and a 64-bit divisor. This operation does
+not exist in most Scalar 64-bit ISAs, and investigation into
+soft-implementations of 128/128 or 128/64 divide shows that they are
+typically implemented bit-wise.
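
To illustrate what "bit-wise" means here, the following is a minimal
sketch of a shift-and-subtract 128/64 divide. It is not taken from
divmnu64.c: the function name `div128by64_bitwise` and its parameters
are hypothetical, and it assumes `hi < d` so that the quotient fits in
64 bits (the `hi == d` case would need separate handling). The point is
that the loop produces only one quotient bit per iteration, so a single
estimate costs 64 dependent iterations.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Divides the 128-bit value (hi:lo) by d using restoring
 * shift-and-subtract division, one quotient bit per iteration.
 * Assumption: hi < d, so the quotient fits in 64 bits. */
uint64_t div128by64_bitwise(uint64_t hi, uint64_t lo, uint64_t d,
                            uint64_t *rem)
{
    uint64_t q = 0;
    uint64_t r = hi; /* running remainder, always < d at loop entry */

    for (int i = 63; i >= 0; i--) {
        uint64_t carry = r >> 63;          /* bit about to be shifted out */
        r = (r << 1) | ((lo >> i) & 1);    /* bring down next dividend bit */
        q <<= 1;
        if (carry || r >= d) {             /* true 65-bit value is >= d */
            r -= d;                        /* wraps correctly modulo 2^64 */
            q |= 1;
        }
    }
    *rem = r;
    return q;
}
```

Each iteration depends on the remainder from the previous one, so the
loop cannot be shortened by issuing the iterations in parallel.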
+
+The irony is, therefore, that attempting to
+improve big-integer divide by moving to 64-bit digits in order to take
+advantage of the efficiency of 64-bit scalar multiply would instead
+lock up CPU time performing a 128/64 division. With the Vector Multiply
+operations being critically dependent on that `qhat` estimate, this
+Dependency Hazard would have the SIMD Multiply back-ends sitting idle
+waiting for that one scalar value.
+
+One solution is to reduce the digit width to 32-bit in order to
+go back to a 64/32 divide. However, halving the digit width doubles the
+number of digits, and because the algorithm is `O(N^2)` this increases
+the completion time by a factor of 4.
+
+*(TODO continue from here: reference investigation into using Goldschmidt
+division with fixed-width numbers, to get number of iterations down)*
+