Here, just as with `madded`, which can put the hi-half of the 128-bit product
back in as a form of 64-bit carry, dividing a vector dividend by a scalar divisor
puts the remainder back in as the hi-half of the next 128/64-bit divide.
+    RT0      = ((  0<<64) | RA0) / RB0
+         RC0 = ((  0<<64) | RA0) % RB0
+          |
+          +-------+
+                  |
+    RT1      = ((RC0<<64) | RA1) / RB1
+         RC1 = ((RC0<<64) | RA1) % RB1
+          |
+          +-------+
+                  |
+    RT2      = ((RC1<<64) | RA2) / RB2
+         RC2 = ((RC1<<64) | RA2) % RB2
+
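The chain in the diagram can be sketched in Python as a minimal model, under some assumptions: Python integers stand in for 64-bit registers, the dividend digits are listed most-significant first (so the first step has a zero hi-half), the divisor is a single nonzero scalar, and overflow is not modelled. The names `divrem_128_by_64` and `bigint_div_scalar` are hypothetical and exist only for this illustration.

```python
def divrem_128_by_64(hi, lo, divisor):
    """One 128/64-bit step: returns (quotient, remainder).

    Hypothetical helper for illustration only. Overflow and
    divide-by-zero handling are deliberately left out.
    """
    dividend = (hi << 64) | lo
    return dividend // divisor, dividend % divisor


def bigint_div_scalar(digits_msb_first, divisor):
    """Divide a vector of 64-bit digits by a scalar divisor.

    The remainder of each step is fed back in as the hi-half of the
    next step, mirroring RC0 -> RC1 -> RC2 in the diagram.
    """
    rem = 0  # the first step uses (0 << 64) | RA0
    quotient = []
    for digit in digits_msb_first:
        q, rem = divrem_128_by_64(rem, digit, divisor)
        quotient.append(q)
    return quotient, rem


# Small check: (5 << 128 | 7 << 64 | 9) divided by 3.
digits = [5, 7, 9]
q_digits, remainder = bigint_div_scalar(digits, 3)

n = (5 << 128) | (7 << 64) | 9
q_value = 0
for d in q_digits:
    q_value = (q_value << 64) | d

assert q_value == n // 3
assert remainder == n % 3
```

Because each remainder fed back in is strictly smaller than the divisor, every per-step quotient in this sketch fits in 64 bits, which is why overflow can be set aside in the chained case.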
By a nice coincidence this is exactly the same 128/64-bit operation
needed for the `qhat` estimate, provided it produces both the quotient and
the remainder.
+The pseudocode cleanly covering both scenarios (leaving out
+overflow for clarity) can be written as:
`divrem2du RT,RA,RB,RC`