ls008 only doing setvl and svstep
[libreriscv.git] / openpower / sv / rfc / ls003.mdwn
1 # RFC ls003 Big Integer
2
3 **URLs**:
4
5 * <https://libre-soc.org/openpower/sv/biginteger/analysis/>
6 * <https://libre-soc.org/openpower/sv/rfc/ls003/>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=960>
8 * <https://git.openpower.foundation/isa/PowerISA/issues/91>
9
10 **Severity**: Major
11
12 **Status**: New
13
14 **Date**: 20 Oct 2022
15
16 **Target**: v3.2B
17
18 **Source**: v3.0B
19
20 **Books and Section affected**: **UPDATE**
21
22 ```
23 Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
24 Appendix E Power ISA sorted by opcode
25 Appendix F Power ISA sorted by version
26 Appendix G Power ISA sorted by Compliancy Subset
27 Appendix H Power ISA sorted by mnemonic
28 ```
29
30 **Summary**
31
32 Instructions added
33
34 ```
35 maddedu - Multiply-Add Extended Double Unsigned
36 maddedus - Multiply-Add Extended Double Unsigned/Signed
37 divmod2du - Divide/Modulo Quad-Double Unsigned
38 dsld - Double Shift Left Doubleword
39 dsrd - Double Shift Right Doubleword
40 ```
41
42 **Submitter**: Luke Leighton (Libre-SOC)
43
44 **Requester**: Libre-SOC
45
46 **Impact on processor**:
47
48 ```
49 Addition of five new GPR-based instructions
50 ```
51
52 **Impact on software**:
53
54 ```
55 Requires support for new instructions in assembler, debuggers,
56 and related tools.
57 ```
58
59 **Keywords**:
60
61 ```
62 GPR, Big-integer, Double-word
63 ```
64
65 **Motivation**
66
67 * Similar to `maddhdu` and `maddld`, but allow for a big-integer rolling
68 accumulation affect: `RC` effectively becomes a 64-bit carry in chains
69 of highly-efficient loop-unrolled arbitrary-length big-integer operations.
70 * Similar to `divdeu`, and has similar advantages to `maddedu`,
71 Modulo result is available with the quotient in a single instruction
72 allowing highly-efficient arbitrary-length big-integer division.
73 * Combining at least three instructions into one, the `dsld` and `dsrd`
74 instructions make shifting an arbitrary-length big-integer vector by
75 a scalar 64-bit quantity highly efficient.
76
77 **Notes and Observations**:
78
79 1. It is not practical to add Rc=1 variants when VA-Form is used and
80 there is a **pair** of results produced.
81 2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
82 but VA-Form EXT004 is under severe pressure.
83 3. Both `maddhdu` and `divmod2du` instructions have been present in Intel x86
84 for several decades. Likewise, `dsld` and `dsrd`.
85 4. None of these instruction is present in VSX: these are 128/64 whereas
86 VSX is 128/128.
87 5. `maddedu` and `divmod2du` are full inverses of each other, including
88 when used for arbitrary-length big-integer arithmetic.
89 6. These are all 3-in 2-out instructions. If Power ISA did not already
90 have LD/ST-with-update instructions and instructions with `RAp`
91 and `RTp` then these instructions would not be proposed.
92 7. `maddedus` is the first Scalar signed/unsigned multiply instruction. The
93 only other signed/unsigned multiply instruction is the
94 specialist `vmsummbm` (bytes only), requires VSX,
95 and is unsuited for big-integer or other general arithmetic.
96 8. Unresolved: dsld/dsrd are 3-in 3-out (in the Rc=1 variants) where the
97 normal threshold set is 3-in 2-out.
98
99 **Changes**
100
101 Add the following entries to:
102
103 * the Appendices of Book I
104 * Instructions of Book I added to Section 3.3.9.1
105 * VA2-Form of Book I Section 1.6.21.1 and 1.6.2
106
107 ----------------
108
109 \newpage{}
110
111 # Multiply-Add Extended Double Unsigned
112
113 `maddedu RT, RA, RB, RC`
114
115 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
116 |-------|------|-------|-------|-------|-------|---------|
117 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
118
119 Pseudocode:
120
121 ```
122 prod[0:127] <- (RA) * (RB) # Multiply RA and RB, result 128-bit
123 sum[0:127] <- EXTZ(RC) + prod # Zero extend RC, add product
124 RT <- sum[64:127] # Store low half in RT
125 RS <- sum[0:63] # RS implicit register, equal to RC
126 ```
127
128 Special registers altered:
129
130 None
131
132 The 64-bit operands are (RA), (RB), and (RC).
133 RC is zero-extended (not shifted, not sign-extended).
134 The 128-bit product of the operands (RA) and (RB) is added to (RC).
135 The low-order 64 bits of the 128-bit sum are
136 placed into register RT.
137 The high-order 64 bits of the 128-bit sum are
138 placed into register RS.
139 RS is implicitly defined as the same register as RC.
140
141 All three operands and the result are interpreted as
142 unsigned integers.
143
144 The differences here to `maddhdu` are that `maddhdu` stores the upper
145 half in RT, where `maddedu` stores the upper half in RS.
146
147 The value stored in RT is exactly equivalent to `maddld` despite `maddld`
148 performing sign-extension on RC, because RT is the full mathematical result
149 modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
150 results modulo 2^64. This is why there is no maddldu instruction.
151
152 *Programmer's Note:
153 To achieve a big-integer rolling-accumulation effect:
154 assuming the scalar to multiply is in r0, and r3 is
155 used (effectively) as a 64-bit carry,
156 the vector to multiply by starts at r4 and the result vector
157 in r20, instructions may be issued `maddedu r20,r4,r0,r3`
158 `maddedu r21,r5,r0,r3` etc. where the first `maddedu` will have
159 stored the upper half of the 128-bit multiply into r3, such
160 that it may be picked up by the second `maddedu`. Repeat inline
161 to construct a larger bigint scalar-vector multiply,
162 as Scalar GPR register file space permits. If register
163 spill is required then r3, as the effective 64-bit carry,
164 continues the chain.*
165
166 Examples:
167
168 ```
169 # (r0 * r1) + r2, store lower in r4, upper in r2
170 maddedu r4, r0, r1, r2
171
172 # Chaining together for larger bigint (see Programmer's Note above)
173 # r3 starts with zero (no carry-in)
174 maddedu r20,r4,r0,r3
175 maddedu r21,r5,r0,r3
176 maddedu r22,r6,r0,r3
177 ```
178
179 ----------
180
181 \newpage{}
182
183 # Multiply-Add Extended Double Unsigned/Signed
184
185 `maddedus RT, RA, RB, RC`
186
187 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
188 |-------|------|-------|-------|-------|-------|---------|
189 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
190
191 Pseudocode:
192
193 ```
194 if (RB)[0] != 0 then # workaround no unsigned-signed mul op
195 prod[0:127] <- -((RA) * -(RB))
196 else
197 prod[0:127] <- (RA) * (RB)
198 sum[0:127] <- prod + EXTS128((RC))
199 RT <- sum[64:127] # Store low half in RT
200 RS <- sum[0:63] # RS implicit register, equal to RC
201 ```
202
203 Special registers altered:
204
205 None
206
207 The 64-bit operands are (RA), (RB), and (RC).
208 (RC) is sign-extended to 128-bits and then summed with the
209 128-bit product of zero-extended (RA) and sign-extended (RB).
210 The low-order 64 bits of the 128-bit sum are
211 placed into register RT.
212 The high-order 64 bits of the 128-bit sum are
213 placed into register RS.
214 RS is implicitly defined as the same register as RC.
215
216 *Programmer's Note:
217 To achieve a big-integer rolling-accumulation effect:
218 assuming the signed scalar to multiply is in r0, and r3 is
219 used (effectively) as a 64-bit carry,
220 the unsigned vector to multiply by starts at r4 and the signed result vector
221 in r20, instructions may be issued `maddedus r20,r4,r0,r3`
222 `maddedus r21,r5,r0,r3` etc. where the first `maddedus` will have
223 stored the upper half of the 128-bit multiply into r3, such
224 that it may be picked up by the second `maddedus`. Repeat inline
225 to construct a larger bigint scalar-vector multiply,
226 as Scalar GPR register file space permits. If register
227 spill is required then r3, as the effective 64-bit carry,
228 continues the chain.*
229
230 Examples:
231
232 ```
233 # (r0 * r1) + r2, store lower in r4, upper in r2
234 maddedus r4, r0, r1, r2
235
236 # Chaining together for larger bigint (see Programmer's Note above)
237 # r3 starts with zero (no carry-in)
238 maddedus r20,r4,r0,r3
239 maddedus r21,r5,r0,r3
240 maddedus r22,r6,r0,r3
241 ```
242
243 ----------
244
245 \newpage{}
246
247 # Divide/Modulo Quad-Double Unsigned
248
249 `divmod2du RT,RA,RB,RC`
250
251 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form |
252 |-------|------|-------|-------|-------|-------|---------|
253 | EXT04 | RT | RA | RB | RC | XO | VA-Form |
254
255 Pseudo-code:
256
257 ```
258 if ((RA) <u (RB)) & ((RB) != [0]*64) then # Check RA<RB, for divide-by-0
259 dividend[0:127] <- (RA) || (RC) # Combine RA/RC as 128-bit
260 divisor[0:127] <- [0]*64 || (RB) # Extend RB to 128-bit
261 result <- dividend / divisor # Unsigned Division
262 modulo <- dividend % divisor # Unsigned Modulo
263 RT <- result[64:127] # Store result in RT
264 RS <- modulo[64:127] # Modulo in RC, implicit
265 else # In case of error
266 RT <- [1]*64 # RT all 1's
267 RS <- [0]*64 # RS all 0's
268 ```
269
270 Special registers altered:
271
272 None
273
274 The 128-bit dividend is (RA) || (RC). The 64-bit divisor is
275 (RB). If the quotient can be represented in 64 bits, it is
276 placed into register RT. The modulo is placed into register RS.
277 RS is implicitly defined as the same register as RC, similarly to maddedu.
278
279 The quotient can be represented in 64-bits when both these conditions
280 are true:
281
282 * (RA) < (RB) (unsigned comparison)
283 * (RB) is NOT 0 (not divide-by-0)
284
285 If these conditions are not met, RT is set to all 1's, RS to all 0's.
286
287 All operands, quotient, and modulo are interpreted as unsigned integers.
288
289 Divide/Modulo Quad-Double Unsigned is a VA-Form instruction
290 that is near-identical to `divdeu` except that:
291
292 * the lower 64 bits of the dividend, instead of being zero, contain a
293 register, RC.
294 * it performs a fused divide and modulo in a single instruction, storing
295 the modulo in an implicit RS (similar to `maddedu`)
296 * There is no `UNDEFINED` behaviour.
297
298 RB, the divisor, remains 64 bit. The instruction is therefore a 128/64
299 division, producing a (pair) of 64 bit result(s), in the same way that
300 Intel [divq](https://www.felixcloutier.com/x86/div) works.
301 Overflow conditions
302 are detected in exactly the same fashion as `divdeu`, except that rather
303 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
304 zeros on overflow.
305
306 *Programmer's note: there are no Rc variants of any of these VA-Form
307 instructions. `cmpi` will need to be used to detect overflow conditions:
308 the saving in instruction count is that both RT and RS will have already
309 been set to useful values (all 1s and all zeros respectively)
310 needed as part of implementing Knuth's Algorithm D*
311
312 For Scalar usage, just as for `maddedu`, `RS=RC`
313 Examples:
314
315 ```
316 # ((r0 << 64) + r2) / r1, store in r4
317 # ((r0 << 64) + r2) % r1, store in r2
318 divmod2du r4, r0, r1, r2
319 ```
320
321 ----------
322
323 \newpage{}
324
325 # Double-Shift Left Doubleword
326
327 `dsld RT,RA,RB,RC`
328
329 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form |
330 |-------|------|-------|-------|-------|-------|----|----------|
331 | EXT04 | RT | RA | RB | RC | XO | Rc | VA2-Form |
332
333 Pseudo-code:
334
335 n <- (RB)[58:63] # Use lower 6-bits for shift
336 v <- ROTL64((RA), n) # Rotate RA 64-bit left by n bits
337 mask <- MASK(64, 63-n) # 1s mask in MSBs
338 RT <- (v[0:63] & mask) | ((RC) & ¬mask) # mask-in RC into RT
339 RS <- v[0:63] & ¬mask # part normally lost into RC
340 overflow = 0 # Clear overflow flag
341 if RS != [0]*64: # Check if RS is NOT zero
342 overflow = 1 # Set the overflow flag
343
344 Special Registers Altered:
345
346 CR0 (if Rc=1)
347
348 The contents of register RA are shifted left the number
349 of bits specified by (RB) 58:63. The same number of
350 shifted bits are taken from the **right** (LSB) end of register
351 RC and placed into the **rightmost** (LSB) end of the result, RT.
352 Additionally, the MSB (leftmost) bits of register RA that would normally
353 be discarded by a 64-bit left shift are placed into the
354 LSBs of RS.
355
356 When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
357 or cleared if it is zero; all other bits of CR0 are set from RT as normal.
358 XER.OV and XER.SO remain unchanged.
359
360 *Programmer's note:
361 similar to maddedu and divmod2du, dsld can be chained (using RC),
362 effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
363 length Scalar-Vector shift may be performed without the additional
364 masking instructions normally needed.*
365
366 ----------
367
368 # Double-Shift Right Doubleword
369
370 `dsrd RT,RA,RB,RC`
371
372 | 0-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form |
373 |-------|------|-------|-------|-------|-------|----|----------|
374 | EXT04 | RT | RA | RB | RC | XO | Rc | VA2-Form |
375
376 Pseudo-code:
377
378 n <- (RB)[58:63] # Take lower 6-bits for shift
379 v <- ROTL64((RA), 64-n) # Rotate RA 64-bit left by 64-n bits
380 mask <- MASK(n, 63) # 0's mask, set mask[n:63] to 1'
381 RT <- (v[0:63] & mask) | ((RC) & ¬mask) #
382 RS <- v[0:63] & ¬mask
383 overflow = 0
384 if RS != [0]*64:
385 overflow = 1
386
387 Special Registers Altered:
388
389 CR0 (if Rc=1)
390
391 The contents of register RA are shifted right the number
392 of bits specified by (RB) 58:63. The same number of
393 shifted bits are taken from the **left** (MSB) end of register RC
394 and placed into the **leftmost** (MSB) end of the result, RT.
395 Additionally, the LSB (rightmost) bits of register RA that would normally
396 be discarded by a 64-bit right shift are placed into the
397 MSBs of RS.
398
399 When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
400 or cleared if it is zero; all other bits of CR0 are set from RT as normal.
401 XER.OV and XER.SO remain unchanged.
402
403 *Programmer's note:
404 similar to maddedu and divmod2du, dsrd can be chained (using RC),
405 effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
406 length Scalar-Vector shift may be performed without the additional
407 masking instructions normally needed.*
408
409 ----------
410
411 \newpage{}
412
413 # VA2-Form
414
415 Add the following to Book I, 1.6.21.1, VA2-Form
416
417 ```
418 |0 |6 |11 |16 |21 |24|26 |31 |
419 | PO | RT | RA | RB | RC | XO | Rc |
420 ```
421
422 Add 'VA2-Form' to `RA` thru `XO` Field in Book I, 1.6.2
423
424 ```
425 RA (11:15)
426 Field used to specify a GPR to be used as a
427 source or as a target.
428 Formats: ... VA2, ...
429
430 RB (16:20)
431 Field used to specify a GPR to be used as a
432 source.
433 Formats: ... VA2, ...
434
435 RC (21:25)
436 Field used to specify a GPR to be used as a
437 source.
438 Formats: ... VA2, ...
439
440 Rc (31)
441 RECORD bit.
442 0 Do not alter the Condition Register.
443 1 Set Condition Register Field 0 or Field 1 as
444 described in Section 2.3.1, 'Condition Regis-
445 ter' on page 30.
446 Formats: ... VA2, ...
447
448 RT (6:10)
449 Field used to specify a GPR to be used as a target.
450 Formats: ... VA2, ...
451
452 XO (26:30)
453 Extended opcode field.
454 Formats: ... VA2, ...
455 ```
456
457 ----------
458
459 # Appendices
460
461 Appendix E Power ISA sorted by opcode
462 Appendix F Power ISA sorted by version
463 Appendix G Power ISA sorted by Compliancy Subset
464 Appendix H Power ISA sorted by mnemonic
465
466 |Form| Book | Page | Version | mnemonic | Description |
467 |----|------|------|---------|----------|-------------|
468 |VA | I | # | 3.2B |maddedu | Multiply-Add Extend Double Unsigned |
469 |VA | I | # | 3.2B |maddedus | Multiply-Add Extend Double Unsigned Signed |
470 |VA | I | # | 3.2B |divmod2du | Divide/Modulo Quad-Double Unsigned |
471 |VA2 | I | # | 3.2B |dsld | Double-Shift Left Doubleword |
472 |VA2 | I | # | 3.2B |dsrd | Double-Shift Right Doubleword |
473
474 ----------------
475
476 [[!tag opf_rfc]]