Here, just as `madded` can put the hi-half of the 128-bit product
back in as a form of 64-bit carry, a scalar divisor of a vector dividend
puts the modulo back in as the hi-half of a 128/64-bit divide.
+
RT0 = ((  0<<64) | RA0) / RB0
RC0 = ((  0<<64) | RA0) % RB0
RT1 = ((RC0<<64) | RA1) / RB1
RC1 = ((RC0<<64) | RA1) % RB1
RT2 = ((RC1<<64) | RA2) / RB2
RC2 = ((RC1<<64) | RA2) % RB2
By a nice coincidence this is exactly the same 128/64-bit operation
-needed for the `qhat` estimate if it may produce both the quotient and
-the remainder.
+needed (once, rather than chained) for the `qhat` estimate if it may
+produce both the quotient and the remainder.
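
As a rough illustration of the two uses described above, the chained divide and the one-off `qhat` step can be sketched in Python. This is only a sketch under stated assumptions: the function names `divmod_128_by_64` and `big_div_scalar` are hypothetical, limbs are taken most-significant first (matching RA0 being processed with a zero hi-half), and overflow is left out, so the hi-half is assumed to be smaller than the divisor.

```python
MASK64 = (1 << 64) - 1  # largest value that fits in one 64-bit limb


def divmod_128_by_64(hi, lo, divisor):
    """One 128/64-bit step: divide (hi<<64 | lo) by divisor.

    Hypothetical helper. Overflow is ignored: the caller is assumed to
    guarantee hi < divisor, so that the quotient fits in 64 bits.
    """
    dividend = (hi << 64) | lo
    return dividend // divisor, dividend % divisor


def big_div_scalar(limbs, divisor):
    """Divide a vector of 64-bit limbs (most significant first) by a scalar.

    The remainder of each step is fed back in as the hi-half of the next
    step, in the same way the RC values chain in the equations above.
    """
    quotient = []
    carry = 0  # hi-half for the first limb is zero
    for limb in limbs:
        q, carry = divmod_128_by_64(carry, limb, divisor)
        quotient.append(q)
    return quotient, carry  # the final carry is the overall remainder


# Chained use: a three-limb dividend divided by one 64-bit scalar.
rt, rc = big_div_scalar([0x0123456789ABCDEF, 0xFEDCBA9876543210, 5], 7)

# Single use, as for a qhat estimate: one step giving quotient and remainder.
qhat, rhat = divmod_128_by_64(1, 0, 3)
```

Because each remainder is always smaller than the divisor, every chained step satisfies the `hi < divisor` assumption automatically; only a standalone call needs the caller to check it.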
The pseudocode cleanly covering both scenarios (leaving out
overflow for clarity) can be written as: