openpower/sv/biginteger/discussion.mdwn

   1 # big integer multiply
   2
   3 links
   4
   5 * <http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
   6 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
   7 * <https://news.ycombinator.com/item?id=21151646>
   8
   9 ## Variant 1
  10
  11 Row-based multiply using temporary vector. Simple implementation
  12 of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>
  13
  14 ```
  15       for (i = 0; i < m; i++) {
  16          unsigned product = u[i]*v[j] + w[i + j];
  17          phi[i] = product>>16;
  18          plo[i] = product;
  19       }
  20       for (i = 0; i < m; i++) {
  21          t = (phi[i]<<16) | plo[i] + k;
  22          w[i + j] = t;          // (I.e., t & 0xFFFF).
  23          k = t >> 16;
  24       }
  25 ```
  26
  27 **maddx RT, RA, RB, RC** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
  28
  29     prod[0:127] = (RA) * (RB)
  30     sum[0:127] = EXTZ(RC) + prod
  31     RT <- sum[64:127]
  32     RS <- sum[0:63]
  33
  34 **addxd RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)
  35
  36     cat[0:127] = (RS) || (RB)
  37     sum[0:127] = cat + EXTZ(RA)
  38     RA = sum[0:63]
  39     RT = sum[64:127]
  40
  41 These two combine as, simply:
  42
  43     # assume VL=8, therefore RS starts at r8.v
  44     # q           : r16
  45     # multiplier  : r17
  46     # multiplicand: r20.v
  47     # carry       : r18
  48     li r18, 0
  49     sv.maddx r0.v, r16, r17, r20.v
  50     # here, RS=RB+VL, therefore again RS starts at r8.v
  51     sv.addxd r0.v, r18, r0.v
  52
  53 ## Variant 2
  54
  55 ```
  56       for (i = 0; i < m; i++) {
  57          unsigned product = u[i]*v[j] + k;
  58          k = product>>16;
  59          plo[i] = product; // & 0xffff
  60       }
  61       k = 0;
  62       for (i = 0; i < m; i++) {
  63          t = plo[i] + w[i + j] + k;
  64          w[i + j] = t;          // (I.e., t & 0xFFFF).
  65          k = t >> 16; // carry: should only be 1 bit
  66       }
  67 ```
  68
  69 **maddx RT, RA, RB, RC**
  70
  71     prod[0:127] = (RA) * (RB)
  72     sum[0:127] = EXTZ(RC) + prod
  73     RT <- sum[64:127]
  74     RC <- sum[0:63]
  75
  76 # big integer division
  77
  78 links
  79
  80 * <https://skanthak.homepage.t-online.de/division.html>
  81 * <https://news.ycombinator.com/item?id=26562819>
  82 * <https://gmplib.org/~tege/division-paper.pdf>
  83 * <https://github.com/Richard-Mace/huge-integer-class/blob/master/HugeInt.cpp> nice-looking well-commented c++ implementation
  84 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004739.html>
  85 * <https://github.com/bcoin-org/libtorsion/blob/master/src/mpi.c#L2872>
  86
  87 the most efficient division algorithm is probably Knuth's Algorithm D (with modifications from the exercises section of his book) which is O(n^2) and uses 2N-by-N-bit div/rem
  88
  89 an oversimplified version of the knuth algorithm d with 32-bit words is:
  90 (TODO find original: <https://raw.githubusercontent.com/hcs0/Hackers-Delight/master/divmnu64.c.txt>
  91
  92 ```
  93 void div(uint32_t *n, uint32_t *d, uint32_t* q, int n_bytes, int d_bytes) {
  94     // assumes d[0] != 0, also n, d, and q have their most-significant-word in index 0
  95     int q_bytes = n_bytes - d_bytes;
  96     for(int i = 0; i < q_bytes / sizeof(n[0]); i++) {
  97         // calculate guess for quotient word
  98         q[i] = (((uint64_t)n[i] << 32) + n[i + 1]) / d[0];
  99         // n -= q[i] * d
 100         uint32_t carry = 0, carry2 = 0;
 101         for(int j = d_bytes / sizeof(d[0]) - 1; j >= 0; j--) {
 102             uint64_t v = (uint64_t)q[i] * d[j] + carry;
 103             carry = v >> 32;
 104             v = (uint32_t)v;
 105             v = n[i + j] - v + carry2;
 106             carry2 = v >> 32; // either ~0 or 0
 107             n[i + j] = v;
 108         }
 109         // fixup if carry2 != 0
 110     }
 111     // now remainder is in n and quotient is in q
 112 }
 113 ```
 114
 115 The key loop may be implemented with a 4-in, 2-out mul-twin-add
 116 (which is too much):
 117
 118 ```
 119 On Sat, Apr 16, 2022, 22:06 Jacob Lifshay <programmerjake@gmail.com> wrote:
 120 and a mrsubcarry (the one actually needed by bigint division):
 121
 122     # for big_c - big_a * word_b
 123     result <- RC + ~(RA * RB) + CARRY # wrong, needs further thought
 124     CARRY <- HIGH_HALF(result)
 125     RT <- LOW_HALF(result)
 126
 127 turns out, after some checking with 4-bit words, afaict the correct
 128 algorithm for mrsubcarry is:
 129
 130     # for big_c - big_a * word_b
 131     result <- RC + ~(RA * RB) + CARRY
 132     result_high <- HIGH_HALF(result)
 133     if CARRY <= 1 then # unsigned comparison
 134         result_high <- result_high + 1
 135     end
 136     CARRY <- result_high
 137     RT <- LOW_HALF(result)
 138
 139 afaict, that'll make the following algorithm work:
 140
 141 so the inner loop in the bigint division algorithm would end up being
 142 (assuming n, d, and q all fit in registers):
 143
 144     li r3, 1 # carry in for subtraction
 145     mtspr CARRY, r3 # init carry spr
 146     setvl loop_count
 147     sv.mrsubcarry rn.v, rd.v, rq.s, rn.v
 148 ```
 149
 150 This algorithm may be morphed into a pair of Vector operations by temporary
 151 storage of the products.
 152
 153 ```
 154       uint32_t borrow = 0;
 155       for(int i = 0; i <= n; i++) {
 156          uint32_t vn_i = i < n ? vn[i] : 0;
 157          uint64_t value = un[i + j] - (uint64_t)qhat * vn_i;
 158          plo[i] = value & 0xffffffffLL;
 159          phi[i] = value >> 32;
 160       }
 161       for(int i = 0; i <= n; i++) {
 162          uint64_t value = (((uint64_t)phi[i]<<32) | plo[i]) - borrow;
 163          borrow = ~(value >> 32)+1; // -(uint32_t)(value >> 32);
 164          un[i + j] = (uint32_t)value;
 165       }
 166       bool need_fixup = borrow != 0;
 167 ```
 168
 169
 170 Transformation of 4-in, 2-out into a pair of operations:
 171
 172 * 3-in, 2-out `msubx RT, RA, RB, RC` producing {RT,RS} where RS=RT+VL
 173 * 3-in, 2-out `subxd RT, RA, RB` a hidden RS=RT+VL as input, RA dual
 174
 175 <img src="/openpower/sv/weirdmuladd.jpg" width=800 />
 176
 177 A trick used in the DCT and FFT twin-butterfly instructions,
 178 originally borrowed from `lq` and LD/ST-with-update, is to
 179 have a second hidden (implicit) destination register, RS.
 180 RS is calculated as RT+VL, where all scalar operations
 181 assume VL=1.  With `sv.msubx` *creating* a pair of Vector
 182 results, `sv.weirdaddx` correspondingly has to pick the
 183 pairs up, containing the split lo-hi 128-bit products,
 184 in order to carry on the algorithm.
 185
 186 **msubx RT, RA, RB, RC** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
 187
 188     prod[0:127] = (RA) * (RB)
 189     sub[0:127] = EXTZ(RC) - prod
 190     RT <- sub[64:127]
 191     RS <- sub[0:63]
 192
 193 **subxd RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)
 194
 195     cat[0:127] = (RS) || (RB)
 196     sum[0:127] = cat - EXTS(RA)
 197     RA = ~sum[0:63] + 1
 198     RT = sum[64:127]
 199
 200 These two combine as, simply:
 201
 202     # assume VL=8, therefore RS starts at r8.v
 203     # q       : r16
 204     # dividend: r17
 205     # divisor : r20.v
 206     # carry   : r18
 207     li r18, 0
 208     sv.msubx r0.v, r16, r17, r20.v
 209     # here, RS=RB+VL, therefore again RS starts at r8.v
 210     sv.subxd r0.v, r18, r0.v
 211
 212 As a result, a big-integer subtract and multiply may be carried out
 213 in only 3 instructions, one of which is setting a scalar integer to
 214 zero.
 215
 216 An Rc=1 variant tests not against RT but RA, which allows detection
 217 of a fixup in Knuth Algorithm D: the condition where RA is not zero.
 218