From: Luke Kenneth Casson Leighton
Date: Wed, 24 May 2023 11:05:11 +0000 (+0100)
Subject: rename ls003 to ls003.bignum
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=2c102684ff1e38526f0d2e9ae844860fd7576282;p=libreriscv.git

rename ls003 to ls003.bignum
---

diff --git a/openpower/sv/rfc/ls002.fmi.mdwn b/openpower/sv/rfc/ls002.fmi.mdwn
index 21613b31b..babc8eed2 100644
--- a/openpower/sv/rfc/ls002.fmi.mdwn
+++ b/openpower/sv/rfc/ls002.fmi.mdwn
@@ -3,7 +3,7 @@
 **URLs**:
 *
-*
+*
 *
 *
diff --git a/openpower/sv/rfc/ls003.bignum.mdwn b/openpower/sv/rfc/ls003.bignum.mdwn
new file mode 100644
index 000000000..39b2c12b2
--- /dev/null
+++ b/openpower/sv/rfc/ls003.bignum.mdwn
@@ -0,0 +1,486 @@
+# RFC ls003 Big Integer
+
+**URLs**:
+
+*
+*
+*
+*
+
+**Severity**: Major
+
+**Status**: New
+
+**Date**: 20 Oct 2022 - v2 TODO
+
+**Target**: v3.2B
+
+**Source**: v3.0B
+
+**Books and Section affected**: **UPDATE**
+
+```
+    Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
+    Appendix E Power ISA sorted by opcode
+    Appendix F Power ISA sorted by version
+    Appendix G Power ISA sorted by Compliancy Subset
+    Appendix H Power ISA sorted by mnemonic
+```
+
+**Summary**
+
+Instructions added
+
+```
+    maddedu - Multiply-Add Extended Double Unsigned
+    maddedus - Multiply-Add Extended Double Unsigned/Signed
+    divmod2du - Divide/Modulo Quad-Double Unsigned
+    dsld - Double Shift Left Doubleword
+    dsrd - Double Shift Right Doubleword
+```
+
+**Submitter**: Luke Leighton (Libre-SOC)
+
+**Requester**: Libre-SOC
+
+**Impact on processor**:
+
+```
+    Addition of five new GPR-based instructions
+```
+
+**Impact on software**:
+
+```
+    Requires support for new instructions in assembler, debuggers,
+    and related tools.
+```
+
+**Keywords**:
+
+```
+    GPR, Big-integer, Double-word
+```
+
+**Motivation**
+
+* `maddedu` and `maddedus` are similar to `maddhdu` and `maddld`, but allow
+  for a big-integer rolling accumulation effect: `RC` effectively becomes a
+  64-bit carry in chains of highly-efficient loop-unrolled arbitrary-length
+  big-integer operations.
+* `divmod2du` is similar to `divdeu` and has similar advantages to `maddedu`:
+  the modulo result is available with the quotient in a single instruction,
+  allowing highly-efficient arbitrary-length big-integer division.
+* Combining at least three instructions into one, the `dsld` and `dsrd`
+  instructions make shifting an arbitrary-length big-integer vector by
+  a scalar 64-bit quantity highly efficient.
+
+**Notes and Observations**:
+
+1. It is not practical to add Rc=1 variants when VA-Form is used and
+   there is a **pair** of results produced.
+2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
+   but VA-Form EXT004 is under severe pressure.
+3. Both `maddhdu` and `divmod2du` instructions have been present in Intel x86
+   for several decades. Likewise, `dsld` and `dsrd`.
+4. None of these instructions is present in VSX.
+5. `maddedu` and `divmod2du` are full inverses of each other, including
+   when used for arbitrary-length big-integer arithmetic.
+6. These are all 3-in 2-out instructions. If Power ISA did not already
+   have LD/ST-with-update instructions and instructions with `RAp`
+   and `RTp` then these instructions would not be proposed.
+7. `maddedus` is the first Scalar signed/unsigned multiply instruction. The
+   only other signed/unsigned multiply instruction is the
+   specialist `vmsummbm` (bytes only), which requires VSX
+   and is unsuited for big-integer or other general arithmetic.
+8. Unresolved: dsld/dsrd are 3-in 3-out (in the Rc=1 variants) where the
+   normal threshold set is 3-in 2-out.
+9. Hardware may macro-op fuse inline uses, reducing register use through
+   operand-forwarding and/or higher bit-width ALUs.
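The rolling-accumulation effect described under Motivation can be modelled in a few lines. This is a minimal Python sketch, not part of the patch. It assumes `maddedu` forms the 128-bit value (RA) * (RB) + (RC), writes the low 64 bits to RT and the high 64 bits back to the register named by RC. The helper names `maddedu`, `bigint_mul_scalar` and `limbs_to_int` are hypothetical and chosen only for illustration.

```python
MASK64 = (1 << 64) - 1  # all-ones 64-bit value


def maddedu(ra, rb, rc):
    """Model (assumed semantics): return (RT, RS), the low and high
    64-bit halves of the 128-bit sum ra * rb + rc, all unsigned."""
    total = ra * rb + rc  # at most 2**128 - 2**64, so it fits in 128 bits
    return total & MASK64, total >> 64


def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian vector of 64-bit limbs by a 64-bit scalar.

    Each step feeds the previous high half back in as the carry, which is
    the role RC plays in a chain of maddedu instructions.
    """
    carry = 0  # no carry-in at the start of the chain
    out = []
    for limb in limbs:
        low, carry = maddedu(limb, scalar, carry)
        out.append(low)
    out.append(carry)  # final carry becomes the top limb
    return out


def limbs_to_int(limbs):
    """Reassemble little-endian 64-bit limbs into one Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


if __name__ == "__main__":
    vec = [MASK64, MASK64, 1]
    product = bigint_mul_scalar(vec, MASK64)
    # The limb-wise chain must agree with plain big-integer multiplication.
    assert limbs_to_int(product) == limbs_to_int(vec) * MASK64
```

Because the largest possible sum is (2^64 - 1)^2 + (2^64 - 1) = 2^128 - 2^64, the intermediate value never exceeds 128 bits, which is why a single 64-bit register suffices as the carry between steps.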
+
+**Changes**
+
+Add the following entries to:
+
+* the Appendices of Book I
+* Instructions of Book I added to Section 3.3.9.1
+* VA2-Form of Book I Section 1.6.21.1 and 1.6.2
+
+----------------
+
+\newpage{}
+
+# Multiply-Add Extended Double Unsigned
+
+`maddedu RT, RA, RB, RC`
+
+| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
+|-------|------|-------|-------|-------|-------|---------|
+| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |
+
+Pseudocode:
+
+```
+prod[0:127] <- (RA) * (RB)     # Multiply RA and RB, result 128-bit
+sum[0:127] <- EXTZ(RC) + prod  # Zero extend RC, add product
+RT <- sum[64:127]              # Store low half in RT
+RS <- sum[0:63]                # RS implicit register, equal to RC
+```
+
+Special registers altered:
+
+    None
+
+The 64-bit operands are (RA), (RB), and (RC).
+RC is zero-extended (not shifted, not sign-extended).
+The 128-bit product of the operands (RA) and (RB) is added to (RC).
+The low-order 64 bits of the 128-bit sum are
+placed into register RT.
+The high-order 64 bits of the 128-bit sum are
+placed into register RS.
+RS is implicitly defined as the same register as RC.
+
+All three operands and the result are interpreted as
+unsigned integers.
+
+The difference from `maddhdu` is that `maddhdu` stores the upper
+half in RT, whereas `maddedu` stores the upper half in RS.
+
+The value stored in RT is exactly equivalent to `maddld` despite `maddld`
+performing sign-extension on RC, because RT is the full mathematical result
+modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
+results modulo 2^64. This is why there is no maddldu instruction.
+
+*Programmer's Note:
+To achieve a big-integer rolling-accumulation effect:
+assuming the scalar to multiply is in r0, and r3 is
+used (effectively) as a 64-bit carry,
+the vector to multiply by starts at r4 and the result vector
+in r20, instructions may be issued `maddedu r20,r4,r0,r3`
+`maddedu r21,r5,r0,r3` etc.
+where the first `maddedu` will have
+stored the upper half of the 128-bit multiply into r3, such
+that it may be picked up by the second `maddedu`. Repeat inline
+to construct a larger bigint scalar-vector multiply,
+as Scalar GPR register file space permits. If register
+spill is required then r3, as the effective 64-bit carry,
+continues the chain.*
+
+Examples:
+
+```
+# (r0 * r1) + r2, store lower in r4, upper in r2
+maddedu r4, r0, r1, r2
+
+# Chaining together for larger bigint (see Programmer's Note above)
+# r3 starts with zero (no carry-in)
+maddedu r20,r4,r0,r3
+maddedu r21,r5,r0,r3
+maddedu r22,r6,r0,r3
+```
+
+----------
+
+\newpage{}
+
+# Multiply-Add Extended Double Unsigned/Signed
+
+`maddedus RT, RA, RB, RC`
+
+| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
+|-------|------|-------|-------|-------|-------|---------|
+| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |
+
+Pseudocode:
+
+```
+    if (RB)[0] != 0 then # workaround no unsigned-signed mul op
+        prod[0:127] <- -((RA) * -(RB))
+    else
+        prod[0:127] <- (RA) * (RB)
+    sum[0:127] <- prod + EXTS128((RC))
+    RT <- sum[64:127] # Store low half in RT
+    RS <- sum[0:63] # RS implicit register, equal to RC
+```
+
+Special registers altered:
+
+    None
+
+The 64-bit operands are (RA), (RB), and (RC).
+(RC) is sign-extended to 128 bits and then summed with the
+128-bit product of zero-extended (RA) and sign-extended (RB).
+The low-order 64 bits of the 128-bit sum are
+placed into register RT.
+The high-order 64 bits of the 128-bit sum are
+placed into register RS.
+RS is implicitly defined as the same register as RC.
+
+*Programmer's Note:
+To achieve a big-integer rolling-accumulation effect:
+assuming the signed scalar to multiply is in r0, and r3 is
+used (effectively) as a 64-bit carry,
+the unsigned vector to multiply by starts at r4 and the signed result vector
+in r20, instructions may be issued `maddedus r20,r4,r0,r3`
+`maddedus r21,r5,r0,r3` etc.
+where the first `maddedus` will have
+stored the upper half of the 128-bit multiply into r3, such
+that it may be picked up by the second `maddedus`. Repeat inline
+to construct a larger bigint scalar-vector multiply,
+as Scalar GPR register file space permits. If register
+spill is required then r3, as the effective 64-bit carry,
+continues the chain.*
+
+Examples:
+
+```
+# (r0 * r1) + r2, store lower in r4, upper in r2
+maddedus r4, r0, r1, r2
+
+# Chaining together for larger bigint (see Programmer's Note above)
+# r3 starts with zero (no carry-in)
+maddedus r20,r4,r0,r3
+maddedus r21,r5,r0,r3
+maddedus r22,r6,r0,r3
+```
+
+----------
+
+\newpage{}
+
+# Divide/Modulo Quad-Double Unsigned
+
+`divmod2du RT,RA,RB,RC`
+
+| 0-5   | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
+|-------|------|-------|-------|-------|-------|---------|
+| EXT04 | RT   | RA    | RB    | RC    | XO    | VA-Form |
+
+Pseudo-code:
+
+```
+    if ((RA)