* <https://www.reddit.com/r/OpenPOWER/comments/u8r4vf/draft_svp64_biginteger_vector_arithmetic_for_the/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=817>
-# Add and Subtract
+# Vector Add and Subtract
Surprisingly, no additional instructions are required to perform
a straightforward big-integer add or subtract. Vectorised `adde`
permitted to exceed 1 (because MAXVL is set to 1) then the above
actually becomes a Scalar Big-Int add algorithm.
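
As a rough illustration of the carry chaining described above, the following Python sketch models a big-integer add over little-endian vectors of 64-bit digits. The names `bigint_add` and `MASK64` are hypothetical and chosen only for this example; each loop iteration stands in for one element of an add-with-carry, with the carry out of one digit feeding the next. It is a model of the arithmetic under those assumptions, not of any particular hardware implementation.

```python
MASK64 = (1 << 64) - 1  # largest value of one 64-bit digit


def bigint_add(a, b, carry_in=0):
    """Add two equal-length little-endian vectors of 64-bit digits.

    Returns (sum_digits, carry_out). The carry produced by digit i
    is consumed by digit i + 1, as in a chain of add-with-carry steps.
    """
    assert len(a) == len(b)
    result = []
    carry = carry_in
    for x, y in zip(a, b):
        total = x + y + carry          # at most 65 bits wide
        result.append(total & MASK64)  # low 64 bits become the digit
        carry = total >> 64            # bit 64 becomes the next carry-in
    return result, carry
```

With a vector length of one, the loop body runs once and the function reduces to a plain scalar add-with-carry.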
-# Multiply
+# Vector Multiply
Long-multiply, assuming an O(N^2) algorithm, is performed by summing
NxN separate smaller multiplications together. Karatsuba's algorithm
Also it is possible to specify that any of RA, RB or RC are scalar or
vector. Overall it is extremely powerful.
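
The O(N^2) summation of partial products can be sketched in Python as below. This is a plain schoolbook model, assuming little-endian vectors of 64-bit digits; `bigint_mul` and `digits_to_int` are hypothetical helper names. The inner loop multiplies one scalar digit of `a` by the whole vector `b` and accumulates into the result, which is the scalar-times-vector pattern the text refers to.

```python
MASK64 = (1 << 64) - 1  # largest value of one 64-bit digit


def digits_to_int(digits):
    """Convert a little-endian vector of 64-bit digits to a Python int."""
    return sum(d << (64 * i) for i, d in enumerate(digits))


def bigint_mul(a, b):
    """Schoolbook multiply of two little-endian 64-bit digit vectors."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            # 64x64 -> 128-bit product, plus the existing partial sum
            # and the carry from the previous digit; fits in 128 bits.
            t = x * y + result[i + j] + carry
            result[i + j] = t & MASK64
            carry = t >> 64
        # This slot has not been written by any earlier row.
        result[i + len(b)] = carry
    return result
```

The result can be checked against Python's native integers with `digits_to_int`.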
-# Divide
+# Vector Divide
The simplest implementation of big-int divide is the standard schoolbook
"Long Division", set with radix 2^64 (64-bit digits) instead of Base 10. Donald Knuth's
was too large and the correction is required, which is, again,
nothing more than a Vectorised big-integer add (one instruction).
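
A minimal Python sketch of the multiply-and-subtract step with its add-back correction follows. It assumes little-endian vectors of 64-bit digits and assumes the estimate `qhat` is too large by at most one, so a single add-back is enough; the function name `mul_sub_correct` is hypothetical and used only for illustration.

```python
MASK64 = (1 << 64) - 1  # largest value of one 64-bit digit


def mul_sub_correct(rem, div, qhat):
    """Compute rem - qhat * div, adding div back once if it goes negative.

    rem has len(div) + 1 little-endian 64-bit digits.
    Returns (new_rem, corrected_q). Assumes qhat is at most one too large.
    """
    assert len(rem) == len(div) + 1
    rem = list(rem)
    carry = 0   # high half of the running product
    borrow = 0  # borrow from the running subtraction
    for i, d in enumerate(div):
        p = qhat * d + carry
        carry = p >> 64
        t = rem[i] - (p & MASK64) - borrow
        rem[i] = t & MASK64
        borrow = 1 if t < 0 else 0
    t = rem[-1] - carry - borrow
    rem[-1] = t & MASK64
    if t < 0:
        # Estimate was too large: decrement it and add the divisor back.
        qhat -= 1
        c = 0
        for i, d in enumerate(div):
            s = rem[i] + d + c
            rem[i] = s & MASK64
            c = s >> 64
        rem[-1] = (rem[-1] + c) & MASK64  # final carry out is discarded
    return rem, qhat
```

The correction branch is itself just a big-integer add over the digit vector, matching the description above.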
-# 128-bit Scalar divisor
+# Scalar 128-bit divisor
As mentioned above, the first part of the Knuth Algorithm D involves
computing an estimate of the quotient digit (`qhat`). This involves using the three
However when moving to 64-bit digits (desirable because the algorithm
is `O(N^2)`) this in turn means that the estimate has to be computed
-from a *128* bit dividend and a 64-bit divisor. This operation does
-not exist in most Scalar 64-bit ISAs, and some investigation into
+from a *128* bit dividend and a 64-bit divisor. Such an operation
+simply does not exist in most Scalar 64-bit ISAs. For Power ISA
+it would be necessary to implement Packed SIMD instructions
+and infrastructure in order to utilise `vdivuq` which is a 128/128
+(quad) divide, not a 128/64. Some investigation into
soft-implementations of 128/128 or 128/64 divide shows it to be typically
-implemented bit-wise.
+implemented bit-wise, producing only one quotient bit per iteration.
The irony is, therefore, that attempting to
improve big-integer divide by moving to 64-bit digits in order to take
-advantage of the efficiency of 64-bit scalar multiply would instead
+advantage of the efficiency of 64-bit scalar multiply when Vectorised
+would instead
lock up CPU time performing a 128/64 division. With the Vector Multiply
operations being critically dependent on that `qhat` estimate, and
because that scalar is an input into each of the vector digit
-multiples, as a Dependency Hazard it would have the Parallel SIMD Multiply
-back-ends sitting 100% idle, waiting for that one scalar value.
+multiples, as a Dependency Hazard it would cause *all* Parallel
+SIMD Multiply back-ends to sit 100% idle, waiting for that one scalar value.
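
To make the cost of a bit-wise soft implementation concrete, here is a hedged Python sketch of a restoring shift-and-subtract 128/64 divide. The name `divmod_128_by_64` is hypothetical, and the sketch assumes the high half of the dividend is smaller than the divisor so that the quotient fits in 64 bits. It is one common way such a routine can be written, not a description of any specific library.

```python
MASK64 = (1 << 64) - 1  # largest value of one 64-bit digit


def divmod_128_by_64(hi, lo, divisor):
    """Divide the 128-bit value (hi:lo) by a 64-bit divisor, bit by bit.

    Requires hi < divisor so the quotient fits in 64 bits.
    Returns (quotient, remainder).
    """
    assert 0 < divisor <= MASK64
    assert 0 <= hi < divisor
    assert 0 <= lo <= MASK64
    rem = hi
    quot = 0
    for bit in range(63, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        # (rem may briefly need 65 bits; Python ints absorb this.)
        rem = (rem << 1) | ((lo >> bit) & 1)
        quot <<= 1
        if rem >= divisor:
            rem -= divisor
            quot |= 1
    return quot, rem
```

Each of the 64 iterations depends on the remainder from the previous one, so the loop cannot be parallelised across bits, and every multiply that consumes the result has to wait for all 64 steps.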
Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a factor