From b3bb943b370bdf60cc9416858aa9047d0a9a4cd5 Mon Sep 17 00:00:00 2001 From: lkcl Date: Mon, 25 Apr 2022 16:15:44 +0100 Subject: [PATCH] --- openpower/sv/biginteger/analysis.mdwn | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn index e5723d02a..8e7d6c0ed 100644 --- a/openpower/sv/biginteger/analysis.mdwn +++ b/openpower/sv/biginteger/analysis.mdwn @@ -303,8 +303,9 @@ Detection of the fixup (phase 3) is determined by the Carry (borrow) bit at the end. Logically: if borrow was required then the qhat estimate was too large and the correction is required, which is, again, nothing more than a Vectorised big-integer add (one instruction). +However this is not the full story -# Scalar 128-bit divisor +**128/64-bit divisor** As mentioned above, the first part of the Knuth Algorithm D involves computing an estimate for the divisor. This involves using the three @@ -345,8 +346,8 @@ The irony is, therefore, that attempting to improve big-integer divide by moving to 64-bit digits in order to take advantage of the efficiency of 64-bit scalar multiply when Vectorised would instead -lock up CPU time performing a 128/64 division. With the Vector Multiply -operations being critically dependent on that `qhat` estimate, and +lock up CPU time performing a 128/64 scalar division. With the Vector +Multiply operations being critically dependent on that `qhat` estimate, and because that scalar is as an input into each of the vector digit multiples, as a Dependency Hazard it would cause *all* Parallel SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value. @@ -355,8 +356,7 @@ Whilst one solution is to reduce the digit width to 32-bit in order to go back to 64/32 divide, this increases the completion time by a factor of 4 due to the algorithm being `O(N^2)`. -*(TODO continue from here: reference investigation into using Goldschmidt -division with fixed-width numbers, to get number of iterations down)* +**Reducing completion time of 128/64-bit Scalar division** Scalar division is a known computer science problem because, as even the Big-Int Divide shows, it requires looping around a multiply (or, if reduced @@ -370,6 +370,8 @@ In this way a Scalar Integer divide can be performed in the same time-order as Newton-Raphson, using two hardware multipliers and a subtract. +**Back to Vector carry-looping** + There is however another reason for having a 128/64 division instruction, and it's effectively the reverse of `madded`. Look closely at Algorithm D when the divisor is only a scalar -- 2.30.2