1 # RFC ls003 Big Integer
5 * <https://libre-soc.org/openpower/sv/biginteger/analysis/>
6 * <https://libre-soc.org/openpower/sv/rfc/ls003/>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=960>
8 * <https://git.openpower.foundation/isa/PowerISA/issues/91>
20 **Books and Section affected**: **UPDATE**
23 Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
24 Appendix E Power ISA sorted by opcode
25 Appendix F Power ISA sorted by version
26 Appendix G Power ISA sorted by Compliancy Subset
27 Appendix H Power ISA sorted by mnemonic
35 maddedu - Multiply-Add Extended Double Unsigned
36 maddedus - Multiply-Add Extended Double Unsigned/Signed
37 divmod2du - Divide/Modulo Quad-Double Unsigned
38 dsld - Double Shift Left Doubleword
39 dsrd - Double Shift Right Doubleword
42 **Submitter**: Luke Leighton (Libre-SOC)
44 **Requester**: Libre-SOC
46 **Impact on processor**:
49 Addition of five new GPR-based instructions
52 **Impact on software**:
55 Requires support for new instructions in assembler, debuggers,
62 GPR, Big-integer, Double-word
67 * Similar to `maddhdu` and `maddld`, but allow for a big-integer rolling
68 accumulation affect: `RC` effectively becomes a 64-bit carry in chains
69 of highly-efficient loop-unrolled arbitrary-length big-integer operations.
70 * Similar to `divdeu`, and has similar advantages to `maddedu`,
71 Modulo result is available with the quotient in a single instruction
72 allowing highly-efficient arbitrary-length big-integer division.
73 * Combining at least three instructions into one, the `dsld` and `dsrd`
74 instructions make shifting an arbitrary-length big-integer vector by
75 a scalar 64-bit quantity highly efficient.
77 **Notes and Observations**:
79 1. It is not practical to add Rc=1 variants when VA-Form is used and
80 there is a **pair** of results produced.
81 2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
82 but VA-Form EXT004 is under severe pressure.
83 3. Both `maddhdu` and `divmod2du` instructions have been present in Intel x86
84 for several decades. Likewise, `dsld` and `dsrd`.
85 4. None of these instruction is present in VSX: these are 128/64 whereas
87 5. `maddedu` and `divmod2du` are full inverses of each other, including
88 when used for arbitrary-length big-integer arithmetic.
89 6. These are all 3-in 2-out instructions. If Power ISA did not already
90 have LD/ST-with-update instructions and instructions with `RAp`
91 and `RTp` then these instructions would not be proposed.
92 7. `maddedus` is the first Scalar signed/unsigned multiply instruction. The
93 only other signed/unsigned multiply instruction is the
94 specialist `vmsummbm` (bytes only), requires VSX,
95 and is unsuited for big-integer or other general arithmetic.
96 8. Unresolved: dsld/dsrd are 3-in 3-out (in the Rc=1 variants) where the
97 normal threshold set is 3-in 2-out.
101 Add the following entries to:
103 * the Appendices of Book I
104 * Instructions of Book I added to Section 3.3.9.1
105 * VA2-Form of Book I Section 1.6.21.1 and 1.6.2
111 # Multiply-Add Extended Double Unsigned
113 `maddedu RT, RA, RB, RC`
115 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
116 |-------|------|-------|-------|-------|-------|---------|
117 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
122 prod[0:127] <- (RA) * (RB) # Multiply RA and RB, result 128-bit
123 sum[0:127] <- EXTZ(RC) + prod # Zero extend RC, add product
124 RT <- sum[64:127] # Store low half in RT
125 RS <- sum[0:63] # RS implicit register, equal to RC
128 Special registers altered:
132 The 64-bit operands are (RA), (RB), and (RC).
133 RC is zero-extended (not shifted, not sign-extended).
134 The 128-bit product of the operands (RA) and (RB) is added to (RC).
135 The low-order 64 bits of the 128-bit sum are
136 placed into register RT.
137 The high-order 64 bits of the 128-bit sum are
138 placed into register RS.
139 RS is implicitly defined as the same register as RC.
141 All three operands and the result are interpreted as
144 The differences here to `maddhdu` are that `maddhdu` stores the upper
145 half in RT, where `maddedu` stores the upper half in RS.
147 The value stored in RT is exactly equivalent to `maddld` despite `maddld`
148 performing sign-extension on RC, because RT is the full mathematical result
149 modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
150 results modulo 2^64. This is why there is no maddldu instruction.
153 To achieve a big-integer rolling-accumulation effect:
154 assuming the scalar to multiply is in r0, and r3 is
155 used (effectively) as a 64-bit carry,
156 the vector to multiply by starts at r4 and the result vector
157 in r20, instructions may be issued `maddedu r20,r4,r0,r3`
158 `maddedu r21,r5,r0,r3` etc. where the first `maddedu` will have
159 stored the upper half of the 128-bit multiply into r3, such
160 that it may be picked up by the second `maddedu`. Repeat inline
161 to construct a larger bigint scalar-vector multiply,
162 as Scalar GPR register file space permits. If register
163 spill is required then r3, as the effective 64-bit carry,
164 continues the chain.*
169 # (r0 * r1) + r2, store lower in r4, upper in r2
170 maddedu r4, r0, r1, r2
172 # Chaining together for larger bigint (see Programmer's Note above)
173 # r3 starts with zero (no carry-in)
183 # Multiply-Add Extended Double Unsigned/Signed
185 `maddedus RT, RA, RB, RC`
187 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
188 |-------|------|-------|-------|-------|-------|---------|
189 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
194 if (RB)[0] != 0 then # workaround no unsigned-signed mul op
195 prod[0:127] <- -((RA) * -(RB))
197 prod[0:127] <- (RA) * (RB)
198 sum[0:127] <- prod + EXTS128((RC))
199 RT <- sum[64:127] # Store low half in RT
200 RS <- sum[0:63] # RS implicit register, equal to RC
203 Special registers altered:
207 The 64-bit operands are (RA), (RB), and (RC).
208 (RC) is sign-extended to 128-bits and then summed with the
209 128-bit product of zero-extended (RA) and sign-extended (RB).
210 The low-order 64 bits of the 128-bit sum are
211 placed into register RT.
212 The high-order 64 bits of the 128-bit sum are
213 placed into register RS.
214 RS is implicitly defined as the same register as RC.
217 To achieve a big-integer rolling-accumulation effect:
218 assuming the signed scalar to multiply is in r0, and r3 is
219 used (effectively) as a 64-bit carry,
220 the unsigned vector to multiply by starts at r4 and the signed result vector
221 in r20, instructions may be issued `maddedus r20,r4,r0,r3`
222 `maddedus r21,r5,r0,r3` etc. where the first `maddedus` will have
223 stored the upper half of the 128-bit multiply into r3, such
224 that it may be picked up by the second `maddedus`. Repeat inline
225 to construct a larger bigint scalar-vector multiply,
226 as Scalar GPR register file space permits. If register
227 spill is required then r3, as the effective 64-bit carry,
228 continues the chain.*
233 # (r0 * r1) + r2, store lower in r4, upper in r2
234 maddedus r4, r0, r1, r2
236 # Chaining together for larger bigint (see Programmer's Note above)
237 # r3 starts with zero (no carry-in)
238 maddedus r20,r4,r0,r3
239 maddedus r21,r5,r0,r3
240 maddedus r22,r6,r0,r3
247 # Divide/Modulo Quad-Double Unsigned
249 `divmod2du RT,RA,RB,RC`
251 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
252 |-------|------|-------|-------|-------|-------|---------|
253 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
258 if ((RA) <u (RB)) & ((RB) != [0]*64) then # Check RA<RB, for divide-by-0
259 dividend[0:127] <- (RA) || (RC) # Combine RA/RC as 128-bit
260 divisor[0:127] <- [0]*64 || (RB) # Extend RB to 128-bit
261 result <- dividend / divisor # Unsigned Division
262 modulo <- dividend % divisor # Unsigned Modulo
263 RT <- result[64:127] # Store result in RT
264 RS <- modulo[64:127] # Modulo in RC, implicit
265 else # In case of error
266 RT <- [1]*64 # RT all 1's
267 RS <- [0]*64 # RS all 0's
270 Special registers altered:
274 The 128-bit dividend is (RA) || (RC). The 64-bit divisor is
275 (RB). If the quotient can be represented in 64 bits, it is
276 placed into register RT. The modulo is placed into register RS.
277 RS is implicitly defined as the same register as RC, similarly to maddedu.
279 The quotient can be represented in 64-bits when both these conditions
282 * (RA) < (RB) (unsigned comparison)
283 * (RB) is NOT 0 (not divide-by-0)
285 If these conditions are not met, RT is set to all 1's, RS to all 0's.
287 All operands, quotient, and modulo are interpreted as unsigned integers.
289 Divide/Modulo Quad-Double Unsigned is a VA-Form instruction
290 that is near-identical to `divdeu` except that:
292 * the lower 64 bits of the dividend, instead of being zero, contain a
294 * it performs a fused divide and modulo in a single instruction, storing
295 the modulo in an implicit RS (similar to `maddedu`)
296 * There is no `UNDEFINED` behaviour.
298 RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
299 division, producing a (pair) of 64 bit result(s), in the same way that
300 Intel [divq](https://www.felixcloutier.com/x86/div) works.
302 are detected in exactly the same fashion as `divdeu`, except that rather
303 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
306 *Programmer's note: there are no Rc variants of any of these VA-Form
307 instructions. `cmpi` will need to be used to detect overflow conditions:
308 the saving in instruction count is that both RT and RS will have already
309 been set to useful values (all 1s and all zeros respectively)
310 needed as part of implementing Knuth's Algorithm D*
312 For Scalar usage, just as for `maddedu`, `RS=RC`
316 # ((r0 << 64) + r2) / r1, store in r4
317 # ((r0 << 64) + r2) % r1, store in r2
318 divmod2du r4, r0, r1, r2
325 # Double-Shift Left Doubleword
329 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form |
330 |-------|------|-------|-------|-------|-------|----|----------|
331 | EXT04 | RT | RA | RB | RC | XO | Rc | VA2-Form |
335 n <- (RB)[58:63] # Use lower 6-bits for shift
336 v <- ROTL64((RA), n) # Rotate RA 64-bit left by n bits
337 mask <- MASK(64, 63-n) # 1s mask in MSBs
338 RT <- (v[0:63] & mask) | ((RC) & ¬mask) # mask-in RC into RT
339 RS <- v[0:63] & ¬mask # part normally lost into RC
340 overflow = 0 # Clear overflow flag
341 if RS != [0]*64: # Check if RS is NOT zero
342 overflow = 1 # Set the overflow flag
344 Special Registers Altered:
348 The contents of register RA are shifted left the number
349 of bits specified by (RB) 58:63. The same number of
350 shifted bits are taken from the **right** (LSB) end of register
351 RC and placed into the **rightmost** (LSB) end of the result, RT.
352 Additionally, the MSB (leftmost) bits of register RA that would normally
353 be discarded by a 64-bit left shift are placed into the
356 When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
357 or cleared if it is zero; all other bits of CR0 are set from RT as normal.
358 XER.OV and XER.SO remain unchanged.
361 similar to maddedu and divmod2du, dsld can be chained (using RC),
362 effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
363 length Scalar-Vector shift may be performed without the additional
364 masking instructions normally needed.*
368 # Double-Shift Right Doubleword
372 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form |
373 |-------|------|-------|-------|-------|-------|----|----------|
374 | EXT04 | RT | RA | RB | RC | XO | Rc | VA2-Form |
378 n <- (RB)[58:63] # Take lower 6-bits for shift
379 v <- ROTL64((RA), 64-n) # Rotate RA 64-bit left by 64-n bits
380 mask <- MASK(n, 63) # 0's mask, set mask[n:63] to 1'
381 RT <- (v[0:63] & mask) | ((RC) & ¬mask) #
382 RS <- v[0:63] & ¬mask
387 Special Registers Altered:
391 The contents of register RA are shifted right the number
392 of bits specified by (RB) 58:63. The same number of
393 shifted bits are taken from the **left** (MSB) end of register RC
394 and placed into the **leftmost** (MSB) end of the result, RT.
395 Additionally, the LSB (rightmost) bits of register RA that would normally
396 be discarded by a 64-bit right shift are placed into the
399 When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
400 or cleared if it is zero; all other bits of CR0 are set from RT as normal.
401 XER.OV and XER.SO remain unchanged.
404 similar to maddedu and divmod2du, dsrd can be chained (using RC),
405 effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
406 length Scalar-Vector shift may be performed without the additional
407 masking instructions normally needed.*
415 Add the following to Book I, 1.6.21.1, VA2-Form
418 |0 |6 |11 |16 |21 |24|26 |31 |
419 | PO | RT | RA | RB | RC | XO | Rc |
422 Add 'VA2-Form' to `RA` thru `XO` Field in Book I, 1.6.2
426 Field used to specify a GPR to be used as a
427 source or as a target.
428 Formats: ... VA2, ...
431 Field used to specify a GPR to be used as a
433 Formats: ... VA2, ...
436 Field used to specify a GPR to be used as a
438 Formats: ... VA2, ...
442 0 Do not alter the Condition Register.
443 1 Set Condition Register Field 0 or Field 1 as
444 described in Section 2.3.1, 'Condition Regis-
446 Formats: ... VA2, ...
449 Field used to specify a GPR to be used as a target.
450 Formats: ... VA2, ...
453 Extended opcode field.
454 Formats: ... VA2, ...
461 Appendix E Power ISA sorted by opcode
462 Appendix F Power ISA sorted by version
463 Appendix G Power ISA sorted by Compliancy Subset
464 Appendix H Power ISA sorted by mnemonic
466 |Form| Book | Page | Version | mnemonic | Description |
467 |----|------|------|---------|----------|-------------|
468 |VA | I | # | 3.2B |maddedu | Multiply-Add Extend Double Unsigned |
469 |VA | I | # | 3.2B |maddedus | Multiply-Add Extend Double Unsigned Signed |
470 |VA | I | # | 3.2B |divmod2du | Divide/Modulo Quad-Double Unsigned |
471 |VA2 | I | # | 3.2B |dsld | Double-Shift Left Doubleword |
472 |VA2 | I | # | 3.2B |dsrd | Double-Shift Right Doubleword |