rename madded->maddedu for consistency with PowerISA maddhdu instruction
[libreriscv.git] / openpower / sv / biginteger / discussion.mdwn
1 # big integer multiply
2
3 links
4
5 * <http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf>
6 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004700.html>
7 * <https://news.ycombinator.com/item?id=21151646>
8
9 ## Variant 1
10
11 Row-based multiply using temporary vector. Simple implementation
12 of Knuth M: <https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip/mulmnu.c;hb=HEAD>
13
14 ```
15 for (i = 0; i < m; i++) {
16 unsigned product = u[i]*v[j] + w[i + j];
17 phi[i] = product>>16;
18 plo[i] = product;
19 }
20 for (i = 0; i < m; i++) {
21 t = (phi[i]<<16) | plo[i] + k;
22 w[i + j] = t; // (I.e., t & 0xFFFF).
23 k = t >> 16;
24 }
25 ```
26
27 **maddx RT, RA, RB, RC** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
28
29 prod[0:127] = (RA) * (RB)
30 sum[0:127] = EXTZ(RC) + prod
31 RT <- sum[64:127]
32 RS <- sum[0:63]
33
34 **addxd RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)
35
36 cat[0:127] = (RS) || (RB)
37 sum[0:127] = cat + EXTZ(RA)
38 RA = sum[0:63]
39 RT = sum[64:127]
40
41 These two combine as, simply:
42
43 # assume VL=8, therefore RS starts at r8.v
44 # q : r16
45 # multiplier : r17
46 # multiplicand: r20.v
47 # carry : r18
48 li r18, 0
49 sv.maddx r0.v, r16, r17, r20.v
50 # here, RS=RB+VL, therefore again RS starts at r8.v
51 sv.addxd r0.v, r18, r0.v
52
53 ## Variant 2
54
55 ```
56 for (i = 0; i < m; i++) {
57 unsigned product = u[i]*v[j] + k;
58 k = product>>16;
59 plo[i] = product; // & 0xffff
60 }
61 k = 0;
62 for (i = 0; i < m; i++) {
63 t = plo[i] + w[i + j] + k;
64 w[i + j] = t; // (I.e., t & 0xFFFF).
65 k = t >> 16; // carry: should only be 1 bit
66 }
67 ```
68
69 **maddx RT, RA, RB, RC**
70
71 prod[0:127] = (RA) * (RB)
72 sum[0:127] = EXTZ(RC) + prod
73 RT <- sum[64:127]
74 RC <- sum[0:63]
75
76 # big integer division
77
78 links
79
80 * <https://skanthak.homepage.t-online.de/division.html>
81 * <https://news.ycombinator.com/item?id=26562819>
82 * <https://gmplib.org/~tege/division-paper.pdf>
83 * <https://github.com/Richard-Mace/huge-integer-class/blob/master/HugeInt.cpp> nice-looking well-commented c++ implementation
84 * <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004739.html>
85 * <https://github.com/bcoin-org/libtorsion/blob/master/src/mpi.c#L2872>
86
87 the most efficient division algorithm is probably Knuth's Algorithm D (with modifications from the exercises section of his book) which is O(n^2) and uses 2N-by-N-bit div/rem
88
89 an oversimplified version of the knuth algorithm d with 32-bit words is:
90 (TODO find original: <https://raw.githubusercontent.com/hcs0/Hackers-Delight/master/divmnu64.c.txt>
91
92 ```
93 void div(uint32_t *n, uint32_t *d, uint32_t* q, int n_bytes, int d_bytes) {
94 // assumes d[0] != 0, also n, d, and q have their most-significant-word in index 0
95 int q_bytes = n_bytes - d_bytes;
96 for(int i = 0; i < q_bytes / sizeof(n[0]); i++) {
97 // calculate guess for quotient word
98 q[i] = (((uint64_t)n[i] << 32) + n[i + 1]) / d[0];
99 // n -= q[i] * d
100 uint32_t carry = 0, carry2 = 0;
101 for(int j = d_bytes / sizeof(d[0]) - 1; j >= 0; j--) {
102 uint64_t v = (uint64_t)q[i] * d[j] + carry;
103 carry = v >> 32;
104 v = (uint32_t)v;
105 v = n[i + j] - v + carry2;
106 carry2 = v >> 32; // either ~0 or 0
107 n[i + j] = v;
108 }
109 // fixup if carry2 != 0
110 }
111 // now remainder is in n and quotient is in q
112 }
113 ```
114
115 The key loop may be implemented with a 4-in, 2-out mul-twin-add
116 (which is too much):
117
118 ```
119 On Sat, Apr 16, 2022, 22:06 Jacob Lifshay <programmerjake@gmail.com> wrote:
120 and a mrsubcarry (the one actually needed by bigint division):
121
122 # for big_c - big_a * word_b
123 result <- RC + ~(RA * RB) + CARRY # wrong, needs further thought
124 CARRY <- HIGH_HALF(result)
125 RT <- LOW_HALF(result)
126
127 turns out, after some checking with 4-bit words, afaict the correct
128 algorithm for mrsubcarry is:
129
130 # for big_c - big_a * word_b
131 result <- RC + ~(RA * RB) + CARRY
132 result_high <- HIGH_HALF(result)
133 if CARRY <= 1 then # unsigned comparison
134 result_high <- result_high + 1
135 end
136 CARRY <- result_high
137 RT <- LOW_HALF(result)
138
139 afaict, that'll make the following algorithm work:
140
141 so the inner loop in the bigint division algorithm would end up being
142 (assuming n, d, and q all fit in registers):
143
144 li r3, 1 # carry in for subtraction
145 mtspr CARRY, r3 # init carry spr
146 setvl loop_count
147 sv.mrsubcarry rn.v, rd.v, rq.s, rn.v
148 ```
149
150 This algorithm may be morphed into a pair of Vector operations by temporary
151 storage of the products.
152
153 ```
154 uint32_t borrow = 0;
155 for(int i = 0; i <= n; i++) {
156 uint32_t vn_i = i < n ? vn[i] : 0;
157 uint64_t value = un[i + j] - (uint64_t)qhat * vn_i;
158 plo[i] = value & 0xffffffffLL;
159 phi[i] = value >> 32;
160 }
161 for(int i = 0; i <= n; i++) {
162 uint64_t value = (((uint64_t)phi[i]<<32) | plo[i]) - borrow;
163 borrow = ~(value >> 32)+1; // -(uint32_t)(value >> 32);
164 un[i + j] = (uint32_t)value;
165 }
166 bool need_fixup = borrow != 0;
167 ```
168
169
170 Transformation of 4-in, 2-out into a pair of operations:
171
172 * 3-in, 2-out `msubx RT, RA, RB, RC` producing {RT,RS} where RS=RT+VL
173 * 3-in, 2-out `subxd RT, RA, RB` a hidden RS=RT+VL as input, RA dual
174
175 <img src="/openpower/sv/weirdmuladd.jpg" width=800 />
176
177 A trick used in the DCT and FFT twin-butterfly instructions,
178 originally borrowed from `lq` and LD/ST-with-update, is to
179 have a second hidden (implicit) destination register, RS.
180 RS is calculated as RT+VL, where all scalar operations
181 assume VL=1. With `sv.msubx` *creating* a pair of Vector
182 results, `sv.weirdaddx` correspondingly has to pick the
183 pairs up, containing the split lo-hi 128-bit products,
184 in order to carry on the algorithm.
185
186 **msubx RT, RA, RB, RC** (RS=RT+VL for SVP64, RS=RT+1 for scalar)
187
188 prod[0:127] = (RA) * (RB)
189 sub[0:127] = EXTZ(RC) - prod
190 RT <- sub[64:127]
191 RS <- sub[0:63]
192
193 **subxd RT, RA, RB** (RS=RB+VL for SVP64, RS=RB+1 for scalar)
194
195 cat[0:127] = (RS) || (RB)
196 sum[0:127] = cat - EXTS(RA)
197 RA = ~sum[0:63] + 1
198 RT = sum[64:127]
199
200 These two combine as, simply:
201
202 # assume VL=8, therefore RS starts at r8.v
203 # q : r16
204 # dividend: r17
205 # divisor : r20.v
206 # carry : r18
207 li r18, 0
208 sv.msubx r0.v, r16, r17, r20.v
209 # here, RS=RB+VL, therefore again RS starts at r8.v
210 sv.subxd r0.v, r18, r0.v
211
212 As a result, a big-integer subtract and multiply may be carried out
213 in only 3 instructions, one of which is setting a scalar integer to
214 zero.
215
216 An Rc=1 variant tests not against RT but RA, which allows detection
217 of a fixup in Knuth Algorithm D: the condition where RA is not zero.
218