1 [[!tag standards]]

3 # Big Integer Arithmetic

5 **DRAFT STATUS** 19apr2022, last edited 23may2022

7 * [[discussion]] page for notes

8 * <https://bugs.libre-soc.org/show_bug.cgi?id=817> bugreport

9 * <https://bugs.libre-soc.org/show_bug.cgi?id=937> 128/64 shifts

10 * [[biginteger/analysis]]

11 * [[openpower/isa/svfixedarith]] pseudocode

13 BigNum arithmetic is extremely common especially in cryptography,

14 where for example RSA relies on arithmetic of 2048 or 4096 bits

15 in length. The primary operations are add, multiply and divide

16 (and modulo) with specialisations of subtract and signed multiply.

18 A reminder that a particular focus of SVP64 is that it is built on

19 top of Scalar operations, where those scalar operations are useful in

20 their own right without SVP64. Thus the operations here are proposed

21 first as Scalar Extensions to the Power ISA.

23 A secondary focus is that if Vectorised, implementors may choose

24 to deploy macro-op fusion targetting back-end 256-bit or greater

25 Dynamic SIMD ALUs for maximum performance and effectiveness.

27 # Analysis

29 Covered in [[biginteger/analysis]] the summary is that standard `adde`

30 is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`

31 for subtraction) but that big-integer shift, multiply and divide require an

32 extra 3-in 2-out instructions, similar to Intel's

33 [shld](https://www.felixcloutier.com/x86/shld)

34 and [shrd](https://www.felixcloutier.com/x86/shrd),

35 [mulx](https://www.felixcloutier.com/x86/mulx) and

36 [divq](https://www.felixcloutier.com/x86/div),

37 to be efficient.

38 The same instruction (`maddedu`) is used in both

39 big-divide and big-multiply because 'maddedu''s primary

40 purpose is to perform a fused 64-bit scalar multiply with a large vector,

41 where that result is Big-Added for Big-Multiply, but Big-Subtracted for

42 Big-Divide.

44 Chaining the operations together gives Scalar-by-Vector

45 operations, except for `sv.adde` and `sv.subfe` which are

46 Vector-by-Vector Chainable (through the `CA` flag).

47 Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a

48 fashion that is hidden from the user, behind a consistent, stable ISA API.

49 The same macro-op fusion may theoretically be deployed even on Scalar

50 operations.

52 # dsld and dsrd

54 **DRAFT**

56 `dsld` and `dsrd` are similar to v3.0 `sld`, and

57 is Z23-Form in "overwrite" on RT.

59 |0.....5|6..10|11..15|16..20|21.22|23..30|31|

60 |-------|-----|------|------|-----|------|--|

61 | EXT04 | RT | RA | RB | sm | XO |Rc|

63 Both instructions take two 64-bit sources, concatenate

64 them together then extract 64 bits from it, the offset

65 location determined by a third source. So as to avoid

66 costly 4-reg (VA-Form) a 2-bit mode `sm` gives four

67 potential overwrite and zero-source options instead.

69 # maddedu

71 **DRAFT**

73 `maddedu` is similar to v3.0 `madd`, and

74 is VA-Form despite having 2 outputs: the second

75 destination register is implicit.

77 |0.....5|6..10|11..15|16..20|21..25|26..31|

78 |-------|-----|------|------|------|------|

79 | EXT04 | RT | RA | RB | RC | XO |

81 The pseudocode for `maddedu RT, RA, RB, RC` is:

83 prod[0:127] = (RA) * (RB)

84 sum[0:127] = EXTZ(RC) + prod

85 RT <- sum[64:127]

86 RS <- sum[0:63] # RS implicit register, see below

88 RC is zero-extended (not shifted, not sign-extended), the 128-bit product added

89 to it; the lower half of that result stored in RT and the upper half

90 in RS.

92 The differences here to `maddhdu` are that `maddhdu` stores the upper

93 half in RT, where `maddedu` stores the upper half in RS.

95 The value stored in RT is exactly equivalent to `maddld` despite `maddld`

96 performing sign-extension on RC, because RT is the full mathematical result

97 modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical

98 results modulo 2^64. This is why there is no maddldu instruction.

100 *Programmer's Note:

101 As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.

102 To achieve the same big-integer rolling-accumulation effect

103 as SVP64: assuming the scalar to multiply is in r0,

104 the vector to multiply by starts at r4 and the result vector

105 in r20, instructions may be issued `maddedu r20,r4,r0,r20

106 maddedu r21,r5,r0,r21` etc. where the first `maddedu` will have

107 stored the upper half of the 128-bit multiply into r21, such

108 that it may be picked up by the second `maddedu`. Repeat inline

109 to construct a larger bigint scalar-vector multiply,

110 as Scalar GPR register file space permits.*

112 SVP64 overrides the Scalar behaviour of what defines RS.

113 For SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is

114 used with the additional bit set for determining RS.

116 | Field Name | Field bits | Description |

117 |------------|------------|----------------------------------------|

118 | Rdest\_EXTRA2 | `10:11` | extends RT (R\*\_EXTRA2 Encoding) |

119 | Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding) |

120 | Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding) |

121 | Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding) |

122 | EXTRA2_MODE | `18` | used by `maddedu` for determining RS |

124 When `EXTRA2_MODE` is set to zero, the implicit RS register takes

125 its Vector/Scalar setting from Rdest_EXTRA2, and takes

126 the register number from RT, but all numbering

127 is offset by MAXVL. *Note that element-width overrides influence this

128 offset* (see SVP64 [[svp64/appendix]] for full details).

130 When `EXTRA2_MODE` is set to one, the implicit RS register is identical

131 to RC extended with SVP64 using `Rsrc3_EXTRA2` in every respect, including whether RC is set Scalar or

132 Vector.

134 # divmod2du RT,RA,RB,RC

136 **DRAFT**

138 Divide/Modulu Quad-Double Unsigned is another VA-Form instruction

139 that is near-identical to `divdeu` except that:

141 * the lower 64 bits of the dividend, instead of being zero, contain a

142 register, RC.

143 * it performs a fused divide and modulo in a single instruction, storing

144 the modulo in an implicit RS (similar to `maddedu`)

146 RB, the divisor, remains 64 bit. The instruction is therefore a 128/64

147 division, producing a (pair) of 64 bit result(s), in the same way that

148 Intel [divq](https://www.felixcloutier.com/x86/div) works.

149 Overflow conditions

150 are detected in exactly the same fashion as `divdeu`, except that rather

151 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all

152 zeros on overflow.

154 *Programmer's note: there are no Rc variants of any of these VA-Form

155 instructions. `cmpi` will need to be used to detect overflow conditions:

156 the saving in instruction count is that both RT and RS will have already

157 been set to useful values (all 1s and all zeros respectively)

158 needed as part of implementing Knuth's

159 Algorithm D*

161 For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,

162 the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.

163 For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).

165 Pseudo-code:

167 if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then

168 dividend[0:(XLEN*2)-1] <- (RA) || (RC)

169 divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)

170 result <- dividend / divisor

171 modulo <- dividend % divisor

172 RT <- result[XLEN:(XLEN*2)-1]

173 RS <- modulo[XLEN:(XLEN*2)-1]

174 else

175 RT <- [1]*XLEN

176 RS <- [0]*XLEN

178 # [DRAFT] EXT04 Proposed Map

180 For the Opcode map (XO Field)

181 see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.

182 Proposed is the addition of `maddedu` (**DRAFT, NOT APPROVED**) in `110010`

183 and `divmod2du` in `110100`

185 |110000|110001 |110010 |110011|110100 |110101|110110|110111|

186 |------|-------|----------|------|-------------|------|------|------|

187 |maddhd|maddhdu|**maddedu**|maddld|**divmod2du**|rsvd |rsvd |rsvd |