From aee759c303f10f0fda110b5aa0565e7574f84a71 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 22 Apr 2022 13:58:27 +0100
Subject: [PATCH]

---
 openpower/sv/biginteger/analysis.mdwn | 37 +++++++++++++--------------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/openpower/sv/biginteger/analysis.mdwn b/openpower/sv/biginteger/analysis.mdwn
index 2ae5c149f..af7cb91f0 100644
--- a/openpower/sv/biginteger/analysis.mdwn
+++ b/openpower/sv/biginteger/analysis.mdwn
@@ -278,29 +278,21 @@ this time using subtract instead of add.
 
 ```
     uint32_t carry = 0;
-    uint32_t product[n + 1];
     // this becomes the basis for sv.msubed in RS=RC Mode,
     // where carry is RC
-    // VL = n + 1
-    // sv.madded product.v, vn.v, qhat.s, carry.s
-    for (int i = 0; i <= n; i++)
-    {
-        uint32_t vn_v = i < n ? vn[i] : 0;
-        uint64_t value = (uint64_t)vn_v * (uint64_t)qhat + carry;
-        carry = (uint32_t)(value >> 32);
-        product[i] = (uint32_t)value;
+    for (int i = 0; i <= n; i++) {
+        uint64_t value = (uint64_t)vn[i] * (uint64_t)qhat + carry;
+        carry = (uint32_t)(value >> 32); // upper half for next loop
+        product[i] = (uint32_t)value;    // lower half into vector
     }
     bool ca = true;
-    uint32_t *un_j = &un[j];
     // this is simply sv.subfe where ca is XER.CA
-    // sv.subfe un_j.v, product.v, un_j.v
-    for (int i = 0; i <= n; i++)
-    {
+    for (int i = 0; i <= n; i++) {
         uint64_t value = (uint64_t)~product[i] + (uint64_t)un_j[i] + ca;
         ca = value >> 32 != 0;
         un_j[i] = value;
     }
-    bool need_fixup = !ca;
+    bool need_fixup = !ca; // for phase 3 correction
 ```
 
 In essence then the primary focus of Vectorised Big-Int divide is in
@@ -330,8 +322,7 @@ cover a 64/32 operation (64-bit dividend, 32-bit divisor):
         rhat = dig2 % vn[n - 1];
         // 64/32 modulo again:
         // use 3rd-from-top digit to obtain better accuracy
-        if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2])
-        {
+        if (qhat >= b || qhat * vn[n - 2] > b * rhat + un[j + n - 2]) {
             qhat = qhat - 1;
             rhat = rhat + vn[n - 1];
             if (rhat < b)
@@ -350,9 +341,10 @@ The irony is, therefore, that attempting to improve big-integer
 divide by moving to 64-bit digits in order to take advantage of the
 efficiency of 64-bit scalar multiply would instead lock up CPU time
 performing a 128/64 division. With the Vector Multiply
-operations being critically dependent on that `qhat` estimate, this
-Dependency Hazard would have the SIMD Multiply back-ends sitting idle
-waiting for that one scalar value.
+operations being critically dependent on that `qhat` estimate, and
+because that scalar is an input to each of the vector digit
+multiplies, as a Dependency Hazard it would have the Parallel SIMD
+Multiply back-ends sitting 100% idle, waiting for that one scalar value.
 
 Whilst one solution is to reduce the digit width to 32-bit in order
 to go back to 64/32 divide, this increases the completion time by a factor
@@ -361,3 +353,10 @@ of 4 due to the algorithm being `O(N^2)`.
 
 *(TODO continue from here: reference investigation into using Goldschmidt
 division with fixed-width numbers, to get number of iterations down)*
+Scalar division is a well-known problem in computer science because, as
+even the Big-Int Divide shows, it requires looping around a multiply (or,
+if reduced to one bit per loop, a simple compare, shift, and subtract). If
+the simplest approach were deployed then the completion time for the 128/64
+scalar divide would be a whopping 128 cycles. To be workable, an
+alternative algorithm is required, and one of the fastest appears to be
+Goldschmidt Division.
-- 
2.30.2
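To make the "whopping 128 cycles" figure in the final added paragraph concrete, here is a minimal C sketch of the simplest compare/shift/subtract approach to a 128/64 divide. It is illustrative only, not part of the patch above: one quotient bit is produced per iteration, so walking a 128-bit dividend takes 128 iterations. The function name is made up for this sketch, and the GCC/Clang `__uint128_t` extension is assumed purely for brevity.

```
#include <stdint.h>
#include <stdio.h>

// Illustrative sketch only: schoolbook (restoring) division of a 128-bit
// dividend by a 64-bit divisor, one quotient bit per compare/shift/subtract
// step, hence 128 iterations.  Assumes d != 0 and hi < d, so the quotient
// fits in 64 bits (the same precondition as Algorithm D's per-digit divide).
static uint64_t div_128_by_64_bitserial(uint64_t hi, uint64_t lo,
                                        uint64_t d, uint64_t *rem)
{
    __uint128_t n = ((__uint128_t)hi << 64) | lo;
    __uint128_t r = 0; // running remainder
    uint64_t q = 0;    // quotient: the first 64 iterations shift in zero
                       // bits when hi < d, so 64 bits are enough
    for (int i = 127; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1); // shift in the next dividend bit
        q <<= 1;
        if (r >= d) {                  // compare...
            r -= d;                    // ...subtract...
            q |= 1;                    // ...and set the quotient bit
        }
    }
    *rem = (uint64_t)r;
    return q;
}

int main(void)
{
    uint64_t r;
    // 2^64 / 3 = 0x5555555555555555, remainder 1
    uint64_t q = div_128_by_64_bitserial(1, 0, 3, &r);
    printf("q=%016llx r=%llu\n", (unsigned long long)q, (unsigned long long)r);
    return 0;
}
```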
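The closing paragraph names Goldschmidt Division as one of the fastest alternatives. Below is a minimal sketch of its core iteration, in floating point purely to keep the structure visible (the fixed-width-integer version is what the TODO above still has to cover): numerator and denominator are both multiplied by a correction factor F = 2 - D each step, so D converges quadratically towards 1 and N towards the quotient. Every step is a pair of independent multiplies with no loop-carried divide, which is why the scheme suits multiply back-ends. The function name and the pre-scaling convention are assumptions of this sketch, not taken from the document.

```
#include <stdio.h>

// Illustrative sketch of the Goldschmidt iteration.  Both numerator and
// denominator are scaled by F = 2 - D each step; D converges quadratically
// towards 1.0 (the number of correct bits roughly doubles per iteration)
// and N converges towards the original quotient n / d.
static double goldschmidt_divide(double n, double d, int iterations)
{
    // assumes d has been pre-scaled into (0.5, 1]; real implementations
    // shift numerator and divisor by the same power of two beforehand
    for (int i = 0; i < iterations; i++) {
        double f = 2.0 - d; // correction factor
        n *= f;             // N(i+1) = N(i) * F(i)
        d *= f;             // D(i+1) = D(i) * F(i), tends towards 1.0
    }
    return n;               // approximately n / d of the original inputs
}

int main(void)
{
    // 100 / 7: scale numerator and divisor by 1/8 so the divisor is 0.875
    double q = goldschmidt_divide(100.0 / 8.0, 7.0 / 8.0, 5);
    printf("%.12f\n", q);   // prints roughly 14.285714285714
    return 0;
}
```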