# RFC ls003 Big Integer

**URLs**:

**Severity**: Major

**Status**: New

**Date**: 20 Oct 2022

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**: **UPDATE**

```
Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
Appendix E Power ISA sorted by opcode
Appendix F Power ISA sorted by version
Appendix G Power ISA sorted by Compliancy Subset
Appendix H Power ISA sorted by mnemonic
```

**Summary**

Instructions added

```
maddedu - Multiply-Add Extended Double Unsigned
maddedus - Multiply-Add Extended Double Unsigned/Signed
divmod2du - Divide/Modulo Quad-Double Unsigned
dsld - Double Shift Left Doubleword
dsrd - Double Shift Right Doubleword
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
Addition of five new GPR-based instructions
```

**Impact on software**:

```
Requires support for new instructions in assembler, debuggers, and related tools.
```

**Keywords**:

```
GPR, Big-integer, Double-word
```

**Motivation**

* `maddedu` is similar to `maddhdu` and `maddld`, but allows for a big-integer rolling-accumulation effect: `RC` effectively becomes a 64-bit carry in chains of highly efficient loop-unrolled arbitrary-length big-integer operations.
* `divmod2du` is similar to `divdeu` and has similar advantages to `maddedu`: the modulo result is available together with the quotient in a single instruction, allowing highly efficient arbitrary-length big-integer division.
* Combining at least three instructions into one, the `dsld` and `dsrd` instructions make shifting an arbitrary-length big-integer vector by a scalar 64-bit quantity highly efficient.

**Notes and Observations**:

1. It is not practical to add Rc=1 variants when VA-Form is used and there is a **pair** of results produced.
2. An overflow variant (XER.OV set) of `divmod2du` would be valuable but VA-Form EXT004 is under severe pressure.
3. Equivalents of both `maddhdu` and `divmod2du` have been present in Intel x86 for several decades, as have equivalents of `dsld` and `dsrd`.
4.
   None of these instructions is present in VSX: these are 128/64 whereas VSX is 128/128.
5. `maddedu` and `divmod2du` are full inverses of each other, including when used for arbitrary-length big-integer arithmetic.
6. These are all 3-in 2-out instructions. If Power ISA did not already have LD/ST-with-update instructions and instructions with `RAp` and `RTp` then these instructions would not be proposed.
7. `maddedus` is the first Scalar signed/unsigned multiply instruction. The only other signed/unsigned multiply instruction is the specialist `vmsummbm`, which is bytes-only, requires VSX, and is unsuited for big-integer or other general arithmetic.
8. Unresolved: `dsld`/`dsrd` are 3-in 3-out (in the Rc=1 variants) where the normal threshold set is 3-in 2-out.

**Changes**

Add the following entries to:

* the Appendices of Book I
* Instructions of Book I added to Section 3.3.9.1
* VA2-Form of Book I Section 1.6.21.1 and 1.6.2

----------------

\newpage{}

# Multiply-Add Extended Double Unsigned

`maddedu RT, RA, RB, RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |

Pseudocode:

```
prod[0:127] <- (RA) * (RB)      # Multiply RA and RB, result 128-bit
sum[0:127] <- EXTZ(RC) + prod   # Zero extend RC, add product
RT <- sum[64:127]               # Store low half in RT
RS <- sum[0:63]                 # RS implicit register, equal to RC
```

Special registers altered: None

The 64-bit operands are (RA), (RB), and (RC). (RC) is zero-extended (not shifted, not sign-extended). The 128-bit product of the operands (RA) and (RB) is added to (RC). The low-order 64 bits of the 128-bit sum are placed into register RT. The high-order 64 bits of the 128-bit sum are placed into register RS. RS is implicitly defined as the same register as RC.

All three operands and the result are interpreted as unsigned integers.
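
As an illustration only, the pseudocode above can be modelled in Python, whose integers are arbitrary-precision. This is a sketch and not part of the proposed specification: the names `MASK64`, `maddedu` (as a Python function) and `bigint_mul_scalar` are hypothetical, all operands are assumed to be unsigned 64-bit values, and big-integer limbs are assumed to be stored least-significant first.

```python
from typing import List, Tuple

MASK64 = (1 << 64) - 1  # hypothetical helper: all-ones 64-bit value


def maddedu(ra: int, rb: int, rc: int) -> Tuple[int, int]:
    """Model of the maddedu pseudocode; returns (RT, RS)."""
    for value in (ra, rb, rc):
        if not 0 <= value <= MASK64:
            raise ValueError("operands must be unsigned 64-bit values")
    total = ra * rb + rc          # 128-bit sum: product plus zero-extended RC
    rt = total & MASK64           # low-order 64 bits go to RT
    rs = total >> 64              # high-order 64 bits go to RS (same register as RC)
    return rt, rs


def bigint_mul_scalar(limbs: List[int], scalar: int) -> List[int]:
    """Multiply a little-endian vector of 64-bit limbs by a 64-bit scalar.

    Each step feeds the previous RS back in as RC, so RC acts as a
    64-bit carry across the chain.
    """
    carry = 0                     # no carry-in for the first limb
    result = []
    for limb in limbs:
        low, carry = maddedu(limb, scalar, carry)
        result.append(low)
    result.append(carry)          # final carry becomes the top limb
    return result
```

Even with all three operands at their maximum, the sum is (2^64 - 1)^2 + (2^64 - 1) = 2^128 - 2^64, which still fits in 128 bits. The high half therefore always fits in RS, and no additional carry-out is needed when the instruction is chained.
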

The difference from `maddhdu` is that `maddhdu` stores the upper half in RT, whereas `maddedu` stores the upper half in RS.

The value stored in RT is exactly equivalent to that of `maddld`, despite `maddld` performing sign-extension on RC, because RT is the full mathematical result modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical results modulo 2^64. This is why there is no `maddldu` instruction.

*Programmer's Note: To achieve a big-integer rolling-accumulation effect: assuming the scalar to multiply is in r0, r3 is used (effectively) as a 64-bit carry, the vector to multiply by starts at r4 and the result vector starts at r20, instructions may be issued `maddedu r20,r4,r0,r3` `maddedu r21,r5,r0,r3` etc. where the first `maddedu` will have stored the upper half of the 128-bit multiply into r3, such that it may be picked up by the second `maddedu`. Repeat inline to construct a larger bigint scalar-vector multiply, as Scalar GPR register file space permits. If register spill is required then r3, as the effective 64-bit carry, continues the chain.*

Examples:

```
# (r0 * r1) + r2, store lower in r4, upper in r2
maddedu r4, r0, r1, r2

# Chaining together for larger bigint (see Programmer's Note above)
# r3 starts with zero (no carry-in)
maddedu r20,r4,r0,r3
maddedu r21,r5,r0,r3
maddedu r22,r6,r0,r3
```

----------

\newpage{}

# Multiply-Add Extended Double Unsigned/Signed

`maddedus RT, RA, RB, RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |

Pseudocode:

```
if (RB)[0] != 0 then                 # workaround no unsigned-signed mul op
    prod[0:127] <- -((RA) * -(RB))
else
    prod[0:127] <- (RA) * (RB)
sum[0:127] <- prod + EXTS128((RC))
RT <- sum[64:127]                    # Store low half in RT
RS <- sum[0:63]                      # RS implicit register, equal to RC
```

Special registers altered: None

The 64-bit operands are (RA), (RB), and (RC).
(RC) is sign-extended to 128 bits and then summed with the 128-bit product of zero-extended (RA) and sign-extended (RB). The low-order 64 bits of the 128-bit sum are placed into register RT. The high-order 64 bits of the 128-bit sum are placed into register RS. RS is implicitly defined as the same register as RC.

*Programmer's Note: To achieve a big-integer rolling-accumulation effect: assuming the signed scalar to multiply is in r0, r3 is used (effectively) as a 64-bit carry, the unsigned vector to multiply by starts at r4 and the signed result vector starts at r20, instructions may be issued `maddedus r20,r4,r0,r3` `maddedus r21,r5,r0,r3` etc. where the first `maddedus` will have stored the upper half of the 128-bit multiply into r3, such that it may be picked up by the second `maddedus`. Repeat inline to construct a larger bigint scalar-vector multiply, as Scalar GPR register file space permits. If register spill is required then r3, as the effective 64-bit carry, continues the chain.*

Examples:

```
# (r0 * r1) + r2, store lower in r4, upper in r2
maddedus r4, r0, r1, r2

# Chaining together for larger bigint (see Programmer's Note above)
# r3 starts with zero (no carry-in)
maddedus r20,r4,r0,r3
maddedus r21,r5,r0,r3
maddedus r22,r6,r0,r3
```

----------

\newpage{}

# Divide/Modulo Quad-Double Unsigned

`divmod2du RT,RA,RB,RC`

| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |

Pseudo-code:

```
if ((RA)