as noted by Intel in their notes on mulx,
`RA*RB+RC+RD` cannot overflow, so does not require
setting an additional CA flag. We first cover the chain of
-RA*RB+RC as follows:
+`RA*RB+RC` as follows:
RT0, RC0 = RA0 * RB0 + 0
|
li r16, 0 # zero accumulator
addic r16, r16, 0 # CA to zero as well
- sv.madde r0.v, r8.v, r17, r16 # mul vector
+ sv.madde r0.v, r8.v, r17, r16 # mul vector
sv.adde r24.v, r24.v, r0.v # big-add row to result
-
+
Normally, in a Scalar ISA, the use of a register as both a source
and destination like this would create costly Dependency Hazards, so
such an instruction would never be proposed. However: it turns out
afterwards. Essentially there are three phases:
* Calculation of the quotient estimate. This uses a single
- Scalar divide, which is covered separately in a later section
+ Scalar divide, which is covered separately in a later section
* Big Integer multiply and subtract.
* Carry-Correction with a big integer add, if the estimate from
phase 1 was wrong by one digit.
However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor. Such an operation
+from a *128* bit dividend and a 64-bit divisor. Such an operation
simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
comes close with `divdeu`, by placing one operand in the upper half
of a 128-bit dividend, the lower half is zero. Again Power ISA
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply when Vectorised
would instead
-lock up CPU time performing a 128/64 scalar division. With the Vector
+lock up CPU time performing a 128/64 scalar division. With the Vector
Multiply operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
multiples, as a Dependency Hazard it would cause *all* Parallel
`divrem2du RT,RA,RB,RC`
- dividend = (RC) || (RB)
- divisor = EXTZ128(RA)
+ dividend = (RC) || (RA)
+ divisor = EXTZ128(RB)
RT = UDIV(dividend, divisor)
RS = UREM(dividend, divisor)
the overflow. This saves having to add an Rc=1 or OE=1 mode when
the available space in VA-Form EXT04 is extremely limited.
-Looking closely at the loop however we can see that overflow
-will not occur. The initial value k is zero, and on subsequent iterations
-new k, being the modulo, is always less than the divisor. Thus the
-condition (the loop invariant) `RC < RA` is preserved, as long as RC
-starts at zero.
+Looking closely at the loop, however, we can see that overflow
+will not occur. The initial value k is zero: as long as a divide-by-zero
+is not requested this always fulfils the condition `RC < RB`, and on
+subsequent iterations the new k, being the modulo, is always less than the
+divisor as well. Thus the condition (the loop invariant) `RC < RB`
+is preserved, as long as RC starts at zero.
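
The loop invariant can be checked with a small software model. The following is a minimal Python sketch, not the reference implementation: it assumes 64-bit digits stored least-significant first, takes the dividend as `(RC) || (RA)` and the divisor as `RB`, and uses a hypothetical helper name `bigdiv_by_digit` for the surrounding loop. Overflow behaviour is deliberately not modelled; an assertion stands in for the precondition that the remainder carried in is below the divisor.

```python
MASK64 = (1 << 64) - 1


def divrem2du(ra, rb, rc):
    """Model of a 128-by-64 divide: dividend = (RC) || (RA), divisor = RB.

    Returns (RT, RS) = (quotient, remainder). The assertion documents the
    no-overflow precondition rather than modelling overflow handling.
    """
    assert rb != 0, "divide-by-zero requested"
    assert rc < rb, "quotient would not fit in 64 bits"
    dividend = (rc << 64) | ra
    return dividend // rb, dividend % rb


def bigdiv_by_digit(digits, divisor):
    """Divide a little-endian list of 64-bit digits by one 64-bit divisor.

    Hypothetical helper for illustration: k is the running remainder, fed
    back in as the upper half of the next 128-bit dividend.
    """
    k = 0  # starts at zero, so k < divisor holds for any non-zero divisor
    quotient = [0] * len(digits)
    for i in reversed(range(len(digits))):  # most significant digit first
        quotient[i], k = divrem2du(digits[i], divisor, k)
        # k is now a remainder, hence still below the divisor
    return quotient, k
```

Because each remainder is strictly less than the divisor, the assertion inside `divrem2du` never fires for a non-zero divisor, and every quotient digit fits in 64 bits.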
+
+# Conclusion
+
+TODO