From cb775d1230fd46a1f8df256c91af448d7618bacd Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Fri, 29 Apr 2022 10:42:48 +0100
Subject: [PATCH] swap RB and RA in divrem2du (oh and do some whitespace, not
 supposed to combine these, whoops)

---
 openpower/sv/biginteger/analysis.mdwn | 31 ++++++++++++++++-----------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 7e0c72aa6..4cd820b53 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -247,7 +247,7 @@ Successive iterations thus effectively use RC as a 64-bit carry, and
 as noted by Intel in their notes on mulx,
 `RA*RB+RC+RD` cannot overflow, so does not require
 setting an additional CA flag. We first cover the chain of
-RA*RB+RC as follows:
+`RA*RB+RC` as follows:
 
     RT0, RC0 = RA0 * RB0 + 0
           |
@@ -267,9 +267,9 @@ which are scalar initialisation:
 
     li r16, 0                     # zero accumulator
     addic r16, r16, 0             # CA to zero as well
-    sv.madde r0.v, r8.v, r17, r16 # mul vector 
+    sv.madde r0.v, r8.v, r17, r16 # mul vector
     sv.adde r24.v, r24.v, r0.v   # big-add row to result
-    
+
 Normally, in a Scalar ISA, the use of a register as both a source
 and destination like this would create costly Dependency Hazards, so
 such an instruction would never be proposed.  However: it turns out
@@ -317,7 +317,7 @@ Algorithm D performs estimates which, if wrong, are compensated for
 afterwards.  Essentially there are three phases:
 
 * Calculation of the quotient estimate. This uses a single
-  Scalar divide, which is covered separately in a later section 
+  Scalar divide, which is covered separately in a later section
 * Big Integer multiply and subtract.
 * Carry-Correction with a big integer add, if the estimate from
   phase 1 was wrong by one digit.
@@ -400,7 +400,7 @@ the digits are 32 bit and, special-casing the overflow, a 64/32 divide is suffic
 
 However when moving to 64-bit digits (desirable because the algorithm
 is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor.  Such an operation 
+from a *128* bit dividend and a 64-bit divisor.  Such an operation
 simply does not exist in most Scalar 64-bit ISAs. Although Power ISA
 comes close with `divdeu`, by placing one operand in the upper half
 of a 128-bit dividend, the lower half is zero.  Again Power ISA
@@ -414,7 +414,7 @@ The irony is, therefore, that attempting to
 improve big-integer divide by moving to 64-bit digits in order to take
 advantage of the efficiency of 64-bit scalar multiply when Vectorised
 would instead
-lock up CPU time performing a 128/64 scalar division.  With the Vector 
+lock up CPU time performing a 128/64 scalar division.  With the Vector
 Multiply operations being critically dependent on that `qhat` estimate, and
 because that scalar is as an input into each of the vector digit
 multiples, as a Dependency Hazard it would cause *all* Parallel
@@ -464,8 +464,8 @@ the remainder.
 
 `divrem2du RT,RA,RB,RC`
 
-     dividend = (RC) || (RB)
-     divisor = EXTZ128(RA) 
+     dividend = (RC) || (RA)
+     divisor = EXTZ128(RB)
      RT = UDIV(dividend, divisor)
      RS = UREM(dividend, divisor)
 
@@ -485,8 +485,13 @@ in  `divrem2du`  a  `cmpl` instruction can be used instead to detect
 the overflow. This saves having to add an Rc=1 or OE=1 mode when
 the available space in VA-Form EXT04 is extremely limited.
 
-Looking closely at the loop however we can see that overflow 
-will not occur. The initial value k is zero, and on subsequent iterations
-new k, being the modulo, is always less than the divisor. Thus the
-condition (the loop invariant) `RC < RA` is preserved, as long as RC
-starts at zero.
+Looking closely at the loop, however, we can see that overflow
+will not occur. The initial value of k is zero: as long as a divide-by-zero
+is not requested this always fulfils the condition `RC < RB`, and on
+subsequent iterations the new k, being the remainder, is always less than
+the divisor as well. Thus the condition (the loop invariant) `RC < RB`
+is preserved, as long as RC starts at zero.
+
+# Conclusion
+
+TODO
-- 
2.30.2