time-order as Newton-Raphson, using two hardware multipliers
and a subtract.
+There is, however, another reason for having a 128/64 division
+instruction: it is effectively the reverse of `madded`.
+Look closely at Algorithm D when the divisor is only a scalar
+(`v[0]`):
+
```
k = 0; // the case of a
for (j = m - 1; j >= 0; j--)
{ // single-digit
- uint64_t dig2 = (k * b + u[j]);
+ uint64_t dig2 = ((k << 32) | u[j]);
q[j] = dig2 / v[0]; // divisor here.
- k = dig2 % v[0]; // modulo bak into next loop
+ k = dig2 % v[0]; // modulo back into next loop
}
```
+
+Here, just as `madded` can put the hi-half of the 128-bit product
+back in as a form of 64-bit carry, dividing a vector dividend by a scalar
+divisor puts the modulo back in as the hi-half of the next 128/64-bit divide.
+By a nice coincidence this is exactly the same 128/64-bit operation
+needed for the `qhat` estimate, provided that it produces both the quotient
+and the remainder.
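
A minimal sketch of the same loop widened to 64-bit digits may make the carry-in of the remainder easier to see. It is illustrative only: the function name `divmod_by_digit` is hypothetical, and `unsigned __int128` (a GCC/Clang extension) is assumed here as a software stand-in for a 128/64 divide that returns both quotient and remainder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: divide the m-digit number u[] (64-bit digits,
 * least-significant digit first) by the single digit v.
 * Quotient digits are written to q[]; the final remainder is returned.
 * unsigned __int128 is an assumed stand-in for a 128/64 divide. */
uint64_t divmod_by_digit(uint64_t *q, const uint64_t *u, size_t m, uint64_t v)
{
    assert(v != 0);
    uint64_t k = 0;                      /* remainder carried between digits */
    for (size_t j = m; j-- > 0; ) {      /* most-significant digit first */
        /* 128-bit dividend: previous remainder is the hi-half, u[j] the lo-half */
        unsigned __int128 dig2 = ((unsigned __int128)k << 64) | u[j];
        q[j] = (uint64_t)(dig2 / v);     /* fits in 64 bits because k < v */
        k    = (uint64_t)(dig2 % v);     /* becomes the hi-half of the next divide */
    }
    return k;
}
```

Because `k` is always a remainder modulo `v`, the hi-half of each dividend is strictly less than the divisor, so each quotient digit fits in 64 bits and the narrowing cast loses nothing.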