From: lkcl
Date: Fri, 22 Apr 2022 13:18:13 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~2616
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=528e9e960d835b8004040267d9dbe05e54621ad0;p=libreriscv.git

---

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index af7cb91f0..55652bdbe 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -25,7 +25,7 @@ Links
 *
 *
 
-# Add and Subtract
+# Vector Add and Subtract
 
 Surprisingly, no new additional instructions are required to perform
 a straightforward big-integer add or subtract. Vectorised `adde`
@@ -93,7 +93,7 @@ to people unfamiliar with Cray-style Vectors: if VL is not
 permitted to exceed 1 (because MAXVL is set to 1) then the above
 actually becomes a Scalar Big-Int add algorithm.
 
-# Multiply
+# Vector Multiply
 
 Long-multiply, assuming an O(N^2) algorithm, is performed by summing
 NxN separate smaller multiplications together. Karatsuba's algorithm
@@ -244,7 +244,7 @@ would allow that same Vector of HI halves to not be an overwrite of RC.
 Also it is possible to specify that any of RA, RB or RC are scalar or
 vector. Overall it is extremely powerful.
 
-# Divide
+# Vector Divide
 
 The simplest implementation of big-int divide is the standard schoolbook
 "Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
@@ -303,7 +303,7 @@ bit at the end. Logically: if borrow was required then the qhat estimate
 was too large and the correction is required, which is, again,
 nothing more than a Vectorised big-integer add (one instruction).
 
-# 128-bit Scalar divisor
+# Scalar 128-bit divisor
 
 As mentioned above, the first part of the Knuth Algorithm D involves
 computing an estimate for the divisor.
This involves using the three
@@ -332,19 +332,23 @@ cover a 64/32 operation (64-bit dividend, 32-bit divisor):
 
 However when moving to 64-bit digits (desirable because the algorithm
 is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor. This operation does
-not exist in most Scalar 64-bit ISAs, and some investigation into
+from a *128* bit dividend and a 64-bit divisor. Such an operation
+simply does not exist in most Scalar 64-bit ISAs. For Power ISA
+it would be necessary to implement Packed SIMD instructions
+and infrastructure in order to utilise `vdivuq` which is a 128/128
+(quad) divide, not a 128/64. Some investigation into
 soft-implementations of 128/128 or 128/64 divide show it to be typically
-implemented bit-wise.
+implemented bit-wise, with all that implies.
 
 The irony is, therefore, that attempting to
 improve big-integer divide by moving to 64-bit digits in order to take
-advantage of the efficiency of 64-bit scalar multiply would instead
+advantage of the efficiency of 64-bit scalar multiply when Vectorised
+would instead
 lock up CPU time performing a 128/64 division. With the Vector
 Multiply operations being critically dependent on that `qhat` estimate, and
 because that scalar is as an input into each of the vector digit
-multiples, as a Dependency Hazard it would have the Parallel SIMD Multiply
-back-ends sitting 100% idle, waiting for that one scalar value.
+multiples, as a Dependency Hazard it would cause *all* Parallel
+SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.
 
 Whilst one solution is to reduce the digit width to 32-bit in order to
 go back to 64/32 divide, this increases the completion time by a factor