```
uint32_t carry = 0;
uint32_t product[n + 1];
// this becomes the basis for sv.madded in RS=RC Mode,
// where carry is RC
// VL = n + 1
// sv.madded product.v, vn.v, qhat.s, carry.s
for (int i = 0; i <= n; i++) {
    // the word beyond the end of vn is zero
    uint32_t vn_v = i < n ? vn[i] : 0;
    uint64_t value = (uint64_t)vn_v * (uint64_t)qhat + carry;
    carry = (uint32_t)(value >> 32); // upper half for next loop
    product[i] = (uint32_t)value;    // lower half into vector
}
bool ca = true;
uint32_t *un_j = &un[j];
// this is simply sv.subfe where ca is XER.CA
// sv.subfe un_j.v, product.v, un_j.v
for (int i = 0; i <= n; i++) {
    uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
    ca = value >> 32 != 0;
    un_j[i] = (uint32_t)value;
}
bool need_fixup = !ca; // for phase 3 correction
```
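
The `need_fixup` flag feeds the add-back correction referred to above as
phase 3 (step D6 of Knuth's Algorithm D). It is not part of the fragment
above, but a minimal scalar sketch of it, assuming a quotient digit array
`q` and the same `un_j`, `vn`, `n` and `qhat` as before, would look
something like this:

```
if (need_fixup) {
    qhat = qhat - 1;                  // estimate was one too large
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {     // add the divisor back in
        uint64_t value = (uint64_t)un_j[i] + (uint64_t)vn[i] + carry;
        un_j[i] = (uint32_t)value;
        carry = (uint32_t)(value >> 32);
    }
    un_j[n] = un_j[n] + carry;        // carry into the top digit
}
q[j] = (uint32_t)qhat;                // store the corrected quotient digit
```

Knuth notes that this correction is only rarely needed, so leaving it as
scalar code is unlikely to dominate the overall cost.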
In essence then the primary focus of Vectorised Big-Int divide is in
fact the scalar estimation and correction of `qhat`:

```
// dig2 = the two uppermost digits, un[j+n]*b + un[j+n-1]
qhat = dig2 / vn[n - 1]; // 64/32 divide
rhat = dig2 % vn[n - 1]; // 64/32 modulo
again:
// use 3rd-from-top digit to obtain better accuracy
if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
    qhat = qhat - 1;
    rhat = rhat + vn[n - 1];
    if (rhat < b)
        goto again;
}
```

The problem is that any attempt to
improve big-integer divide by moving to 64-bit digits in order to take
advantage of the efficiency of 64-bit scalar multiply would instead
lock up CPU time performing a 128/64 division. With the Vector Multiply
operations being critically dependent on that `qhat` estimate, and
because that scalar is an input to every one of the vector digit
multiplies, the resulting Dependency Hazard would leave the Parallel
SIMD Multiply back-ends sitting 100% idle, waiting for that one
scalar value.
Whilst one solution is to reduce the digit width to 32-bit in order to
go back to 64/32 divide, this increases the completion time by a
significant factor.
*(TODO continue from here: reference investigation into using Goldschmidt
division with fixed-width numbers, to get number of iterations down)*
Scalar division is a known computer science problem because, as even the
Big-Int Divide shows, it requires looping around a multiply (or, if reduced
to 1 bit per loop, a simple compare, shift, and subtract). If the simplest
approach were deployed then the completion time for the 128/64 scalar
divide would be a whopping 128 cycles. To be workable an alternative
algorithm is required, and one of the fastest appears to be
Goldschmidt Division.
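
To give a flavour of why Goldschmidt Division is so much quicker, below is
a purely illustrative Q2.62 fixed-point sketch of the core iteration (the
function names, the six-round count and the use of the gcc/clang
`unsigned __int128` extension are choices made for this example only).
Each round multiplies both numerator and divisor by F = 2 - D, so D
converges quadratically towards 1.0 while N converges towards the
quotient: six rounds give roughly 64 bits of accuracy, in contrast to the
64-plus iterations of a bit-serial divide.

```
#include <stdint.h>
#include <stdio.h>

// Q2.62 fixed point: value = raw / 2^62.  unsigned __int128 is a
// gcc/clang extension, used only to keep the sketch short.
#define FP_SHIFT 62
#define FP_ONE   (1ULL << FP_SHIFT)

static uint64_t fp_mul(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> FP_SHIFT);
}

// Goldschmidt division of two Q2.62 fractions, with the divisor d
// pre-normalised into [0.5, 1.0).  Each round scales numerator and
// divisor by F = 2 - D, so D converges quadratically towards 1.0 and
// N towards the quotient N/D.
static uint64_t goldschmidt(uint64_t n, uint64_t d)
{
    for (int i = 0; i < 6; i++) {      // quadratic: ~64 bits in 6 rounds
        uint64_t f = 2 * FP_ONE - d;   // F = 2 - D
        n = fp_mul(n, f);              // N = N * F
        d = fp_mul(d, f);              // D = D * F
    }
    return n;                          // approximately n / d
}

int main(void)
{
    uint64_t n = (uint64_t)(0.3 * FP_ONE); // 0.3 in Q2.62
    uint64_t d = (uint64_t)(0.6 * FP_ONE); // 0.6, already in [0.5, 1.0)
    // expect approximately 0.5
    printf("%.12f\n", (double)goldschmidt(n, d) / (double)FP_ONE);
    return 0;
}
```

A real 128/64 integer divider built on this idea additionally has to
normalise the divisor into [0.5, 1.0), track the accumulated fixed-point
truncation error, and apply a final compare-and-subtract to recover an
exact quotient and remainder, which is the kind of fixed-width analysis
the TODO above refers to.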