[[!tag standards]]

# Big Integer Arithmetic

**DRAFT STATUS** 19apr2022, last edited 23may2022

* [[discussion]] page for notes
* <https://bugs.libre-soc.org/show_bug.cgi?id=817> bugreport
* <https://bugs.libre-soc.org/show_bug.cgi?id=937> 128/64 shifts
* [[biginteger/analysis]]
* [[openpower/isa/svfixedarith]] pseudocode

BigNum arithmetic is extremely common, especially in cryptography,
where for example RSA relies on arithmetic on operands 2048 or 4096 bits
in length. The primary operations are add, multiply and divide
(and modulo), with specialisations of subtract and signed multiply.

A reminder that a particular focus of SVP64 is that it is built on
top of Scalar operations, where those scalar operations are useful in
their own right without SVP64. Thus the operations here are proposed
first as Scalar Extensions to the Power ISA.

A secondary focus is that, if Vectorised, implementors may choose
to deploy macro-op fusion targeting back-end 256-bit or greater
Dynamic SIMD ALUs for maximum performance and effectiveness.

# Analysis

As covered in [[biginteger/analysis]], the summary is that standard `adde`
is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`
for subtraction), but that big-integer shift, multiply and divide require
extra 3-in 2-out instructions, similar to Intel's
[shld](https://www.felixcloutier.com/x86/shld)
and [shrd](https://www.felixcloutier.com/x86/shrd),
[mulx](https://www.felixcloutier.com/x86/mulx) and
[divq](https://www.felixcloutier.com/x86/div),
to be efficient.
The same instruction (`maddedu`) is used in both
big-divide and big-multiply because the primary purpose of `maddedu`
is to perform a fused multiply of a 64-bit scalar with a large vector,
where that result is Big-Added for Big-Multiply, but Big-Subtracted for
Big-Divide.

Chaining the operations together gives Scalar-by-Vector
operations, except for `sv.adde` and `sv.subfe` which are
Vector-by-Vector Chainable (through the `CA` flag).
Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a
fashion that is hidden from the user, behind a consistent, stable ISA API.
The same macro-op fusion may theoretically be deployed even on Scalar
operations.
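The `CA` chaining mentioned above can be illustrated with a minimal Python sketch. It models only the arithmetic effect of a run of `adde` operations over 64-bit limbs (least-significant limb first). The helper names `adde`, `bigint_add`, `to_limbs` and `from_limbs` are illustrative assumptions, not part of the proposal.

```python
MASK64 = (1 << 64) - 1

def adde(ra, rb, ca):
    """Model of one 64-bit add-with-carry: returns (sum mod 2**64, carry out)."""
    total = ra + rb + ca
    return total & MASK64, total >> 64

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, lowest first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def bigint_add(a_limbs, b_limbs, ca=0):
    """Vector-by-Vector add: each element consumes and produces the carry."""
    result = []
    for ra, rb in zip(a_limbs, b_limbs):
        rt, ca = adde(ra, rb, ca)
        result.append(rt)
    return result, ca
```

The carry returned by the last element plays the role of the final `CA`, so a longer addition can be continued from it.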

# **DRAFT** dsld

|0.....5|6..10|11..15|16..20|21..25|26..30|31|
|-------|-----|------|------|------|------|--|
| EXT04 | RT  |  RA  |  RB  |  RC  |  XO  |Rc|

VA2-Form

* dsld  RT,RA,RB,RC  (Rc=0)
* dsld. RT,RA,RB,RC  (Rc=1)

Pseudo-code:

    n <- (RB)[58:63]
    v <- ROTL64((RA), n)
    mask <- MASK(0, 63-n)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
    RS <- v[0:63] & ¬mask
    overflow = 0
    if RS != [0]*64:
        overflow = 1

Special Registers Altered:

    CR0                    (if Rc=1)
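As a hedged illustration, the `dsld` pseudo-code above can be modelled in Python and chained into a big-integer left shift. The sketch assumes MSB0 bit numbering (so `MASK(0, 63-n)` selects the upper `64-n` bits) and ignores `CR0` and the `overflow` value. The names `rotl64`, `dsld` and `bigint_shl` are illustrative only.

```python
MASK64 = (1 << 64) - 1

def rotl64(value, n):
    """Rotate a 64-bit value left by n (taken modulo 64)."""
    n &= 63
    return ((value << n) | (value >> (64 - n))) & MASK64

def dsld(ra, rb, rc):
    """Model of dsld RT,RA,RB,RC: returns (RT, RS)."""
    n = rb & 0x3F                    # (RB)[58:63], the low six bits
    v = rotl64(ra, n)
    mask = (MASK64 << n) & MASK64    # MASK(0, 63-n) in MSB0 numbering
    rt = (v & mask) | (rc & ~mask & MASK64)
    rs = v & ~mask & MASK64          # bits shifted out of RA
    return rt, rs

def bigint_shl(limbs, n):
    """Shift little-endian 64-bit limbs left by n (0..63) bits."""
    carry = 0
    out = []
    for limb in limbs:               # lowest limb first
        rt, carry = dsld(limb, n, carry)
        out.append(rt)
    return out, carry                # carry holds the bits shifted out
```

Each step feeds the previous `RS` back in as `RC`, which is what makes the operation chainable across limbs.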

# **DRAFT** dsrd

|0.....5|6..10|11..15|16..20|21..25|26..30|31|
|-------|-----|------|------|------|------|--|
| EXT04 | RT  |  RA  |  RB  |  RC  |  XO  |Rc|

VA2-Form

* dsrd  RT,RA,RB,RC  (Rc=0)
* dsrd. RT,RA,RB,RC  (Rc=1)

Pseudo-code:

    n <- (RB)[58:63]
    v <- ROTL64((RA), 64-n)
    mask <- MASK(n, 63)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
    RS <- v[0:63] & ¬mask
    overflow = 0
    if RS != [0]*64:
        overflow = 1

Special Registers Altered:

    CR0                    (if Rc=1)
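The `dsrd` pseudo-code above can be sketched in Python in the same way. Again this assumes MSB0 bit numbering (so `MASK(n, 63)` selects the lower `64-n` bits), and `CR0` and `overflow` are not modelled. The names `rotl64`, `dsrd` and `bigint_shr` are illustrative only.

```python
MASK64 = (1 << 64) - 1

def rotl64(value, n):
    """Rotate a 64-bit value left by n (taken modulo 64)."""
    n &= 63
    return ((value << n) | (value >> (64 - n))) & MASK64

def dsrd(ra, rb, rc):
    """Model of dsrd RT,RA,RB,RC: returns (RT, RS)."""
    n = rb & 0x3F                    # (RB)[58:63], the low six bits
    v = rotl64(ra, 64 - n)           # equivalent to rotating right by n
    mask = MASK64 >> n               # MASK(n, 63) in MSB0 numbering
    rt = (v & mask) | (rc & ~mask & MASK64)
    rs = v & ~mask & MASK64          # bits shifted out, left-aligned
    return rt, rs

def bigint_shr(limbs, n):
    """Shift little-endian 64-bit limbs right by n (0..63) bits."""
    carry = 0
    out = [0] * len(limbs)
    for i in reversed(range(len(limbs))):   # highest limb first
        out[i], carry = dsrd(limbs[i], n, carry)
    return out, carry
```

Unlike the left shift, the chain runs from the most significant limb downwards, because the bits shifted out of one limb enter the top of the limb below it.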


# maddedu

**DRAFT**

`maddedu` is similar to v3.0 `madd`, and
is VA-Form despite having 2 outputs: the second
destination register is implicit.

|0.....5|6..10|11..15|16..20|21..25|26..31|
|-------|-----|------|------|------|------|
| EXT04 | RT  |  RA  |  RB  |  RC  |  XO  |

The pseudocode for `maddedu RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = EXTZ(RC) + prod
    RT <- sum[64:127]
    RS <- sum[0:63] # RS implicit register, see below

RC is zero-extended (not shifted, not sign-extended) and the 128-bit
product is added to it; the lower half of that result is stored in RT
and the upper half in RS.
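A minimal Python model of the pseudocode above, returning the pair `(RT, RS)`. The function name and return convention are illustrative assumptions. Note that the 128-bit sum cannot overflow: the largest possible value is `(2^64-1)^2 + (2^64-1)`, which is below `2^128`.

```python
MASK64 = (1 << 64) - 1

def maddedu(ra, rb, rc):
    """Model of maddedu RT,RA,RB,RC on unsigned 64-bit inputs."""
    total = ra * rb + rc       # EXTZ(RC) + 128-bit product
    rt = total & MASK64        # sum[64:127], the lower half
    rs = total >> 64           # sum[0:63], the upper half
    return rt, rs
```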

The difference from `maddhdu` is that `maddhdu` stores the upper
half in RT, whereas `maddedu` stores the lower half in RT and the
upper half in RS.

The value stored in RT is exactly equivalent to that of `maddld`, despite
`maddld` performing sign-extension on RC, because RT is the full mathematical
result modulo 2^64 and sign/zero extension from 64 to 128 bits produces
identical results modulo 2^64. This is why there is no `maddldu` instruction.
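The modulo-2^64 argument above can be checked numerically with a short Python sketch. `maddld_rt` is an assumed model that treats all three operands as signed 64-bit values and keeps the low 64 bits; it is written for this illustration and is not taken from the ISA pseudocode.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return x - (1 << 64) if x >> 63 else x

def maddedu_rt(ra, rb, rc):
    """Low 64 bits of the unsigned multiply-add (the RT of maddedu)."""
    return (ra * rb + rc) & MASK64

def maddld_rt(ra, rb, rc):
    """Low 64 bits of the signed multiply-add (assumed maddld model)."""
    return (to_signed64(ra) * to_signed64(rb) + to_signed64(rc)) & MASK64

samples = [0, 1, 2, 0x7FFF_FFFF_FFFF_FFFF, 0x8000_0000_0000_0000,
           0xDEAD_BEEF_CAFE_F00D, MASK64]

# Compare both models over every combination of the sample operands.
all_equal = all(maddedu_rt(a, b, c) == maddld_rt(a, b, c)
                for a in samples for b in samples for c in samples)
```

The two agree because the signed and unsigned readings of a 64-bit value differ by a multiple of 2^64, which vanishes when only the low 64 bits are kept.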

*Programmer's Note:
As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.
To achieve the same big-integer rolling-accumulation effect
as SVP64: assuming the scalar to multiply is in r0,
the vector to multiply by starts at r4 and the result vector
is in r20, the instructions `maddedu r20,r4,r0,r20`,
`maddedu r21,r5,r0,r21` etc. may be issued, where the first `maddedu`
will have stored the upper half of the 128-bit multiply into r21, such
that it may be picked up by the second `maddedu`. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits.*
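The rolling accumulation in the note above can be sketched in Python. The list `result` stands in for the registers starting at r20, `vec` for those starting at r4, and `scalar` for r0. All names are illustrative assumptions and `maddedu` is a simplified model, not the reference implementation.

```python
MASK64 = (1 << 64) - 1

def maddedu(ra, rb, rc):
    """Simplified model of maddedu: returns (RT, RS)."""
    total = ra * rb + rc
    return total & MASK64, total >> 64

def bigint_mul_scalar(vec, scalar):
    """Multiply little-endian 64-bit limbs by one 64-bit scalar."""
    result = [0] * (len(vec) + 1)
    for i, limb in enumerate(vec):
        # Mirrors: maddedu r(20+i), r(4+i), r0, r(20+i) with RS=RT+1.
        # RT receives the low half; RS (the next result slot) receives
        # the high half, which the next step reads back as its RC.
        result[i], result[i + 1] = maddedu(limb, scalar, result[i])
    return result
```

The result has one more limb than the input vector, since the final upper half lands in the slot after the last RT.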

SVP64 overrides the Scalar behaviour of what defines RS.
For SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
used with the additional bit set for determining RS.

| Field Name     | Field bits | Description                            |
|----------------|------------|----------------------------------------|
| Rdest\_EXTRA2  | `10:11`    | extends RT (R\*\_EXTRA2 Encoding)      |
| Rsrc1\_EXTRA2  | `12:13`    | extends RA (R\*\_EXTRA2 Encoding)      |
| Rsrc2\_EXTRA2  | `14:15`    | extends RB (R\*\_EXTRA2 Encoding)      |
| Rsrc3\_EXTRA2  | `16:17`    | extends RC (R\*\_EXTRA2 Encoding)      |
| EXTRA2\_MODE   | `18`       | used by `maddedu` for determining RS   |

When `EXTRA2_MODE` is set to zero, the implicit RS register takes
its Vector/Scalar setting from `Rdest_EXTRA2`, and takes
the register number from RT, but all numbering
is offset by MAXVL. *Note that element-width overrides influence this
offset* (see SVP64 [[svp64/appendix]] for full details).

When `EXTRA2_MODE` is set to one, the implicit RS register is identical
to RC extended with SVP64 using `Rsrc3_EXTRA2` in every respect,
including whether RC is set Scalar or Vector.
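A deliberately simplified Python sketch of the two `EXTRA2_MODE` cases described above. It assumes `rt` and `rc` are register numbers already extended by SVP64, and it ignores element-width overrides (which the text notes do influence the offset). The function name `implicit_rs_base` is hypothetical.

```python
def implicit_rs_base(rt, rc, maxvl, extra2_mode):
    """Starting register number of the implicit RS (simplified).

    Assumes default element width; element-width overrides are
    not modelled here.
    """
    if extra2_mode == 0:
        return rt + maxvl      # numbering taken from RT, offset by MAXVL
    return rc                  # RS is identical to the extended RC
```

For example, with RT at r20, RC at r8 and MAXVL=4, mode zero would place RS at r24 under these assumptions, while mode one would place it at r8.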

# divmod2du RT,RA,RB,RC

**DRAFT**

Divide/Modulo Quad-Double Unsigned is another VA-Form instruction
that is near-identical to `divdeu` except that:

* the lower 64 bits of the dividend, instead of being zero, are taken
  from a register, RC.
* it performs a fused divide and modulo in a single instruction, storing
  the modulo in an implicit RS (similar to `maddedu`)

RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
division, producing a pair of 64-bit results, in the same way that
Intel [divq](https://www.felixcloutier.com/x86/div) works.
Overflow conditions
are detected in exactly the same fashion as `divdeu`, except that rather
than having `UNDEFINED` behaviour, RT is set to all ones and RS is set to
all zeros on overflow.

*Programmer's note: there are no Rc variants of any of these VA-Form
instructions. `cmpi` will need to be used to detect overflow conditions:
the saving in instruction count is that both RT and RS will have already
been set to useful values (all 1s and all zeros respectively)
needed as part of implementing Knuth's
Algorithm D.*

For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.
For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).

Pseudo-code:

    if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
        dividend[0:(XLEN*2)-1] <- (RA) || (RC)
        divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
        result <- dividend / divisor
        modulo <- dividend % divisor
        RT <- result[XLEN:(XLEN*2)-1]
        RS <- modulo[XLEN:(XLEN*2)-1]
    else
        RT <- [1]*XLEN
        RS <- [0]*XLEN
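As a hedged sketch, the pseudo-code above can be modelled in Python with XLEN=64 and chained into a big-integer divide by a single 64-bit word. The remainder of each step is fed back as the upper half of the next dividend. The names `divmod2du` (as a Python function) and `bigint_divmod_scalar` are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1

def divmod2du(ra, rb, rc):
    """Model of divmod2du RT,RA,RB,RC with XLEN=64: returns (RT, RS)."""
    if ra < rb and rb != 0:
        dividend = (ra << 64) | rc           # (RA) || (RC)
        return dividend // rb, dividend % rb
    return MASK64, 0                         # overflow: all ones, all zeros

def bigint_divmod_scalar(limbs, divisor):
    """Divide little-endian 64-bit limbs by a non-zero 64-bit divisor."""
    remainder = 0
    quotient = [0] * len(limbs)
    for i in reversed(range(len(limbs))):    # highest limb first
        # The remainder is always below the divisor, so RA < RB holds
        # and the overflow path is never taken inside this loop.
        quotient[i], remainder = divmod2du(remainder, divisor, limbs[i])
    return quotient, remainder
```

Because `RA < RB` guarantees the quotient fits in 64 bits, the chain never needs the overflow result; that path matters for the quotient-estimation step of Knuth's Algorithm D instead.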

# [DRAFT] EXT04 Proposed Map

For the Opcode map (XO Field)
see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
Proposed is the addition of:

* `maddedu` in `110010`
* `divmod2du` in `111010`
* `pcdec` in `111000`

|v >|  000 |   001 |      010 |  011 |    100 |    101 |     110 |    111 |
|---|------|-------|----------|------|--------|--------|---------|--------|
|110|maddhd|maddhdu|maddedu   |maddld|rsvd    |rsvd    |rsvd     |rsvd    |
|111|pcdec.|rsvd   |divmod2du |vpermr|vaddeuqm|vaddecuq|vsubeuqm |vsubecuq|
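The table above is read with the row label as the upper three bits of the 6-bit XO field and the column label as the lower three bits. A small Python sketch of that reading follows; the names `xo_from_table` and `PROPOSED` are illustrative assumptions.

```python
def xo_from_table(row, col):
    """Combine a 3-bit row label and a 3-bit column label into a 6-bit XO."""
    return (row << 3) | col

# (row, column) positions of the proposed additions in the table.
PROPOSED = {
    "maddedu":   (0b110, 0b010),
    "divmod2du": (0b111, 0b010),
    "pcdec":     (0b111, 0b000),
}

# Render each proposed XO as a 6-bit binary string.
encodings = {name: format(xo_from_table(row, col), "06b")
             for name, (row, col) in PROPOSED.items()}
```

The resulting strings match the three XO values in the bullet list.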