From: lkcl Date: Fri, 22 Apr 2022 09:56:32 +0000 (+0100) Subject: (no commit message) X-Git-Tag: opf_rfc_ls005_v1~2619 X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=20564d41020fa047ece83faec895775160878a13;p=libreriscv.git --- diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn index 640aed9d8..350fa1623 100644 --- a/openpower/sv/biginteger/analysis.mdwn +++ b/openpower/sv/biginteger/analysis.mdwn @@ -251,8 +251,7 @@ Algorithm D performs estimates which, if wrong, are compensated for afterwards. Essentially there are three phases: * Calculation of the quotient estimate. This uses a single - Scalar divide, - which can be ignored for the scope of this analysis + Scalar divide, which is covered separately in a later section * Big Integer multiply and subtract. * Carry-Correction with a big integer add, if the estimate from phase 1 was wrong by one digit. @@ -310,3 +309,54 @@ Detection of the fixup (phase 3) is determined by the Carry (borrow) bit at the end. Logically: if borrow was required then the qhat estimate was too large and the correction is required, which is, again, nothing more than a Vectorised big-integer add (one instruction). + +# 128-bit Scalar divisor + +As mentioned above, the first part of the Knuth Algorithm D involves +computing an estimate for the divisor. This involves using the three +most significant digits, performing a scalar divide, and consequently +requires a scalar division with *twice* the number of bits of the +size of individual digits (for example, a 64-bit array). In this +example taken from +[divmnu64.c](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/divmnu64.c;hb=HEAD) +the digits are 32 bit and therefore a 64/64 divide is sufficient to +cover a 64/32 operation (64-bit dividend, 32-bit divisor): + +``` + // Compute estimate qhat of q[j] from top 2 digits + uint64_t dig2 = ((uint64_t)un[j + n] << 32) | un[j + n - 1]; + qhat = dig2 / vn[n - 1]; // 64/32 divide + rhat = dig2 % vn[n - 1]; // 64/32 modulo + again: + // use 3rd-from-top digit to obtain better accuracy + if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2]) + { + qhat = qhat - 1; + rhat = rhat + vn[n - 1]; + if (rhat < b) + goto again; + } +``` + +However when moving to 64-bit digits (desirable because the algorithm +is `O(N^2)`) this in turn means that the estimate has to be computed +from a *128* bit dividend and a 64-bit divisor. This operation does +not exist in most Scalar 64-bit ISAs, and some investigation into +soft-implementations of 128/128 or 128/64 divide show it to be typically +implemented bit-wise. + +The irony is, therefore, that attempting to +improve big-integer divide by moving to 64-bit digits in order to take +advantage of the efficiency of 64-bit scalar multiply would instead +lock up CPU time performing a 128/64 division. With the Vector Multiply +operations being critically dependent on that `qhat` estimate, this +Dependency Hazard would have the SIMD Multiply back-ends sitting idle +waiting for that one scalar value. + +Whilst one solution is to reduce the digit width to 32-bit in order to +go back to 64/32 divide, this increases the completion time by a factor +of 4 due to the algorithm being `O(N^2)`. + +*(TODO continue from here: reference investigation into using Goldschmidt +division with fixed-width numbers, to get number of iterations down)* +