From cb775d1230fd46a1f8df256c91af448d7618bacd Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Fri, 29 Apr 2022 10:42:48 +0100
Subject: [PATCH] swap RC and RA in divrem2du (oh and do some whitespace, not supposed to combine these, whoops)

---
 openpower/sv/biginteger/analysis.mdwn | 31 ++++++++++++++++-----------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 7e0c72aa6..4cd820b53 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -247,7 +247,7 @@ Successive iterations thus effectively use RC as a 64-bit carry, and
 as noted by Intel in their notes on mulx,
 `RA*RB+RC+RD` cannot overflow, so does not require
 setting an additional CA flag. We first cover the chain of
-RA*RB+RC as follows:
+`RA*RB+RC` as follows:
 
     RT0, RC0 = RA0 * RB0 + 0
           |
@@ -267,9 +267,9 @@ which are scalar initialisation:
 
     li r16, 0 # zero accumulator
     addic r16, r16, 0 # CA to zero as well
-    sv.madde r0.v, r8.v, r17, r16 # mul vector
+    sv.madde r0.v, r8.v, r17, r16 # mul vector
     sv.adde r24.v, r24.v, r0.v # big-add row to result
-    
+
 Normally, in a Scalar ISA, the use of a register as both a source
 and destination like this would create costly Dependency Hazards, so
 such an instruction would never be proposed. However: it turns out
@@ -317,7 +317,7 @@ Algorithm D performs estimates which, if wrong, are compensated for
 afterwards. Essentially there are three phases:
 
 * Calculation of the quotient estimate. This uses a single
-  Scalar divide, which is covered separately in a later section
+  Scalar divide, which is covered separately in a later section
 * Big Integer multiply and subtract.
 * Carry-Correction with a big integer add, if the estimate from
   phase 1 was wrong by one digit.
@@ -400,7 +400,7 @@ the digits are 32 bit and, special-casing the overflow, a 64/32 divide is suffic
 
 However when moving to 64-bit digits (desirable because the algorithm
 is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor. Such an operation
+from a *128* bit dividend and a 64-bit divisor. Such an operation
 simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
 comes close with `divdeu`, by placing one operand in the upper half
 of a 128-bit dividend, the lower half is zero. Again Power ISA
@@ -414,7 +414,7 @@ The irony is, therefore, that attempting to
 improve big-integer divide by moving to 64-bit digits in order to take
 advantage of the efficiency of 64-bit scalar multiply when Vectorised
 would instead
-lock up CPU time performing a 128/64 scalar division. With the Vector
+lock up CPU time performing a 128/64 scalar division. With the Vector
 Multiply operations being critically dependent on that `qhat` estimate, and
 because that scalar is as an input into each of the vector digit
 multiples, as a Dependency Hazard it would cause *all* Parallel
@@ -464,8 +464,8 @@ the remainder.
 
 `divrem2du RT,RA,RB,RC`
 
-    dividend = (RC) || (RB)
-    divisor = EXTZ128(RA)
+    dividend = (RC) || (RA)
+    divisor = EXTZ128(RB)
     RT = UDIV(dividend, divisor)
     RS = UREM(dividend, divisor)
 
@@ -485,8 +485,13 @@ in `divrem2du` a `cmpl` instruction can be used instead to detect
 the overflow. This saves having to add an Rc=1 or OE=1 mode when
 the available space in VA-Form EXT04 is extremely limited.
 
-Looking closely at the loop however we can see that overflow
-will not occur. The initial value k is zero, and on subsequent iterations
-new k, being the modulo, is always less than the divisor. Thus the
-condition (the loop invariant) `RC < RA` is preserved, as long as RC
-starts at zero.
+Looking closely at the loop however we can see that overflow
+will not occur. The initial value k is zero: as long as a divide-by-zero
+is not requested this always fulfils the condition `RC < RA`, and on
+subsequent iterations the new k, being the modulo, is always less than the
+divisor as well. Thus the condition (the loop invariant) `RC < RA`
+is preserved, as long as RC starts at zero.
+
+# Conclusion
+
+TODO
--
2.30.2
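
Review note: the loop-invariant argument in the last hunk can be checked with a small model. The following is a minimal Python sketch, not the reference implementation. It assumes the post-patch operand roles (`dividend = (RC) || (RA)`, `divisor = EXTZ128(RB)`), so the no-overflow condition is expressed here as `k < divisor`, whereas the prose in the patch writes it as `RC < RA`. The function names `divrem2du`, `divide_by_digit`, `to_digits` and `from_digits` are hypothetical, as is the shape of the loop itself, which is not part of this patch. Raising `OverflowError` is also an assumption, since the excerpt does not say what the instruction returns on overflow.

```python
MASK64 = (1 << 64) - 1


def divrem2du(ra, rb, rc):
    """Model of the post-patch pseudocode: returns (RT, RS).

    dividend = (RC) || (RA), divisor = EXTZ128(RB).
    Python integers are unbounded, so the 128-bit dividend is exact.
    A zero divisor raises ZeroDivisionError (hardware behaviour is
    not specified in this excerpt).
    """
    dividend = (rc << 64) | ra
    quotient, remainder = divmod(dividend, rb)
    if quotient > MASK64:
        # Assumption: the real overflow result is not modelled here.
        raise OverflowError("quotient does not fit in 64 bits")
    return quotient, remainder


def divide_by_digit(digits, divisor):
    """Divide a little-endian list of 64-bit digits by one 64-bit digit.

    k plays the role of RC: it starts at zero and is replaced by the
    remainder on each step, so k < divisor holds on every iteration.
    """
    k = 0
    quotient = [0] * len(digits)
    for i in reversed(range(len(digits))):
        assert k < divisor  # the loop invariant discussed in the patch
        quotient[i], k = divrem2du(digits[i], divisor, k)
    return quotient, k


def to_digits(value, count):
    """Split an integer into `count` little-endian 64-bit digits."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_digits(digits):
    """Recombine little-endian 64-bit digits into one integer."""
    return sum(d << (64 * i) for i, d in enumerate(digits))
```

The invariant is what makes the overflow check unnecessary in this model: if `k < divisor`, then `k * 2**64 + digit < divisor * 2**64`, so the quotient is below `2**64`.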