# RFC ls003 Big Integer

**URLs**:

* <https://libre-soc.org/openpower/sv/biginteger/analysis/>
* <https://libre-soc.org/openpower/sv/rfc/ls003/>
* <https://bugs.libre-soc.org/show_bug.cgi?id=960>
* <https://git.openpower.foundation/isa/PowerISA/issues/91>

**Severity**: Major

**Status**: New

**Date**: 20 Oct 2022

**Target**: v3.2B

**Source**: v3.0B

**Books and Section affected**: **UPDATE**

```
    Book I 64-bit Fixed-Point Arithmetic Instructions 3.3.9.1
    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic
```

**Summary**

Instructions added

```
    maddedu - Multiply-Add Extended Double Unsigned
    maddedus - Multiply-Add Extended Double Unsigned/Signed
    divmod2du - Divide/Modulo Quad-Double Unsigned
    dsld - Double-Shift Left Doubleword
    dsrd - Double-Shift Right Doubleword
```

**Submitter**: Luke Leighton (Libre-SOC)

**Requester**: Libre-SOC

**Impact on processor**:

```
    Addition of five new GPR-based instructions
```

**Impact on software**:

```
    Requires support for new instructions in assembler, debuggers,
    and related tools.
```

**Keywords**:

```
    GPR, Big-integer, Double-word
```

**Motivation**

* `maddedu` and `maddedus` are similar to `maddhdu` and `maddld`, but allow
  for a big-integer rolling-accumulation effect: `RC` effectively becomes a
  64-bit carry in chains of highly efficient loop-unrolled arbitrary-length
  big-integer operations.
* `divmod2du` is similar to `divdeu` and has similar advantages to `maddedu`:
  the modulo result is available with the quotient in a single instruction,
  allowing highly efficient arbitrary-length big-integer division.
* By combining at least three instructions into one, the `dsld` and `dsrd`
  instructions make shifting an arbitrary-length big-integer vector by
  a scalar 64-bit quantity highly efficient.

**Notes and Observations**:

1. It is not practical to add Rc=1 variants when VA-Form is used and
   there is a **pair** of results produced.
2. An overflow variant (XER.OV set) of `divmod2du` would be valuable
   but VA-Form EXT004 is under severe pressure.
3. Counterparts of both `maddedu` and `divmod2du` have been present in
   Intel x86 for several decades, as have counterparts of `dsld` and `dsrd`.
4. None of these instructions is present in VSX: these are 128/64 whereas
   VSX is 128/128.
5. `maddedu` and `divmod2du` are full inverses of each other, including
   when used for arbitrary-length big-integer arithmetic.
6. These are all 3-in 2-out instructions. If Power ISA did not already
   have LD/ST-with-update instructions and instructions with `RAp`
   and `RTp` then these instructions would not be proposed.
7. `maddedus` is the first Scalar signed/unsigned multiply instruction. The
   only other signed/unsigned multiply instruction is the
   specialist `vmsummbm` (bytes only), which requires VSX
   and is unsuited for big-integer or other general arithmetic.
8. Unresolved: `dsld`/`dsrd` are 3-in 3-out (in the Rc=1 variants) where the
   normal threshold set is 3-in 2-out.

**Changes**

Add the following entries to:

* the Appendices of Book I
* Book I Section 3.3.9.1 (the new instructions)
* Book I Sections 1.6.21.1 and 1.6.2 (VA2-Form)

----------------

\newpage{}

# Multiply-Add Extended Double Unsigned

`maddedu RT, RA, RB, RC`

|  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | VA-Form |

Pseudocode:

```
prod[0:127] <- (RA) * (RB)    # Multiply RA and RB, result 128-bit
sum[0:127] <- EXTZ(RC) + prod # Zero extend RC, add product
RT <- sum[64:127]             # Store low half in RT
RS <- sum[0:63]               # RS implicit register, equal to RC
```

Special registers altered:

    None

The 64-bit operands are (RA), (RB), and (RC).
(RC) is zero-extended (not shifted, not sign-extended).
The 128-bit product of the operands (RA) and (RB) is added to (RC).
The low-order 64 bits of the 128-bit sum are
placed into register RT.
The high-order 64 bits of the 128-bit sum are
placed into register RS.
RS is implicitly defined as the same register as RC.

All three operands and the result are interpreted as
unsigned integers.

The difference from `maddhdu` is that `maddhdu` stores the upper
half in RT, whereas `maddedu` stores the lower half in RT and the
upper half in RS.

The value stored in RT is exactly equivalent to `maddld` despite `maddld`
performing sign-extension on RC, because RT is the full mathematical result
modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
results modulo 2^64. This is why there is no `maddldu` instruction.
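
The modulo-2^64 argument above can be checked numerically. The following Python sketch is illustrative only: the helper names `sign_extend_64`, `low64_unsigned` and `low64_signed` are assumptions made for this example and are not part of the Power ISA or of this RFC. It compares the low 64 bits of the multiply-add under zero-extension and under sign-extension of the operands.

```python
MASK64 = (1 << 64) - 1

def sign_extend_64(value):
    """Interpret a 64-bit pattern as a two's complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value

def low64_unsigned(ra, rb, rc):
    """Low 64 bits of RA*RB + RC with every operand zero-extended."""
    return (ra * rb + rc) & MASK64

def low64_signed(ra, rb, rc):
    """Low 64 bits of RA*RB + RC with every operand sign-extended."""
    total = sign_extend_64(ra) * sign_extend_64(rb) + sign_extend_64(rc)
    return total & MASK64

# Corner-case 64-bit patterns (hypothetical sample set, not exhaustive).
samples = [0, 1, 2, (1 << 63) - 1, 1 << 63, MASK64 - 1, MASK64]
mismatches = [(a, b, c)
              for a in samples for b in samples for c in samples
              if low64_unsigned(a, b, c) != low64_signed(a, b, c)]
```

Sign-extending an operand changes it by a multiple of 2^64, so the sum changes by a multiple of 2^64 and its low 64 bits do not; `mismatches` is therefore expected to stay empty for any sample set.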

*Programmer's Note:
To achieve a big-integer rolling-accumulation effect:
assuming the scalar to multiply is in r0, r3 is
used (effectively) as a 64-bit carry,
the vector to multiply by starts at r4 and the result vector
starts at r20, the instructions `maddedu r20,r4,r0,r3`,
`maddedu r21,r5,r0,r3` etc. may be issued, where the first `maddedu`
will have stored the upper half of the 128-bit sum into r3, such
that it may be picked up by the second `maddedu`. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits. If register
spill is required then r3, as the effective 64-bit carry,
continues the chain.*

Examples:

```
# (r0 * r1) + r2, store lower in r4, upper in r2
maddedu r4, r0, r1, r2

# Chaining together for larger bigint (see Programmer's Note above)
# r3 starts with zero (no carry-in)
maddedu r20,r4,r0,r3
maddedu r21,r5,r0,r3
maddedu r22,r6,r0,r3
```
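
The chaining pattern can also be modelled in software. Below is a minimal Python sketch of the `maddedu` pseudocode together with the rolling-carry loop from the Programmer's Note. The helper names (`to_limbs`, `from_limbs`, `bigmul_scalar`) and the test values are assumptions made for this illustration; limbs are stored least-significant first.

```python
MASK64 = (1 << 64) - 1

def maddedu(ra, rb, rc):
    """Model of the pseudocode: return (RT, RS) for RA*RB + RC, unsigned."""
    total = ra * rb + rc          # at most 2**128 - 2**64, fits in 128 bits
    return total & MASK64, total >> 64

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, LSB first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs stored LSB first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def bigmul_scalar(limbs, scalar):
    """Multiply a limb vector by a 64-bit scalar using chained maddedu."""
    carry = 0                     # plays the role of r3 in the example above
    result = []
    for limb in limbs:            # least-significant limb first
        rt, carry = maddedu(limb, scalar, carry)
        result.append(rt)
    result.append(carry)          # the final carry is the top limb
    return result

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
s = 0xFFFFFFFFFFFFFFFB
product = from_limbs(bigmul_scalar(to_limbs(x, 3), s))
```

Because the high half written to RS is fed straight back in as RC, no separate add-with-carry step is needed between limbs; `product` should equal `x * s`.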

----------

\newpage{}

# Multiply-Add Extended Double Unsigned/Signed

`maddedus RT, RA, RB, RC`

|  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | VA-Form |

Pseudocode:

```
if (RB)[0] != 0 then                 # workaround no unsigned-signed mul op
    prod[0:127] <- -((RA) * -(RB))
else
    prod[0:127] <- (RA) * (RB)
sum[0:127] <- prod + EXTS128((RC))
RT <- sum[64:127]                    # Store low half in RT
RS <- sum[0:63]                      # RS implicit register, equal to RC
```

Special registers altered:

    None

The 64-bit operands are (RA), (RB), and (RC).
(RC) is sign-extended to 128 bits and then summed with the
128-bit product of zero-extended (RA) and sign-extended (RB).
The low-order 64 bits of the 128-bit sum are
placed into register RT.
The high-order 64 bits of the 128-bit sum are
placed into register RS.
RS is implicitly defined as the same register as RC.

*Programmer's Note:
To achieve a big-integer rolling-accumulation effect:
assuming the signed scalar to multiply is in r0, r3 is
used (effectively) as a 64-bit carry,
the unsigned vector to multiply by starts at r4 and the signed result vector
starts at r20, the instructions `maddedus r20,r4,r0,r3`,
`maddedus r21,r5,r0,r3` etc. may be issued, where the first `maddedus`
will have stored the upper half of the 128-bit sum into r3, such
that it may be picked up by the second `maddedus`. Repeat inline
to construct a larger bigint scalar-vector multiply,
as Scalar GPR register file space permits. If register
spill is required then r3, as the effective 64-bit carry,
continues the chain.*

Examples:

```
# (r0 * r1) + r2, store lower in r4, upper in r2
maddedus r4, r0, r1, r2

# Chaining together for larger bigint (see Programmer's Note above)
# r3 starts with zero (no carry-in)
maddedus r20,r4,r0,r3
maddedus r21,r5,r0,r3
maddedus r22,r6,r0,r3
```
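
A Python sketch of the same chain for the unsigned/signed variant follows. It models the description above (zero-extended RA, sign-extended RB and RC) directly rather than the negate-and-multiply workaround in the pseudocode. The helper names and test values are assumptions for illustration; limbs are stored least-significant first, and the final carry is read as a signed top limb.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def sign_extend_64(value):
    """Interpret a 64-bit pattern as a two's complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value

def maddedus(ra, rb, rc):
    """Return (RT, RS): RA unsigned, RB and RC signed, 128-bit wrap-around."""
    total = ra * sign_extend_64(rb) + sign_extend_64(rc)
    total &= MASK128              # two's complement 128-bit result
    return total & MASK64, total >> 64

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, LSB first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def bigmul_scalar_signed(limbs, scalar):
    """Multiply an unsigned limb vector by a signed 64-bit scalar."""
    carry = 0                     # plays the role of r3 in the example above
    low_limbs = []
    for limb in limbs:            # least-significant limb first
        rt, carry = maddedus(limb, scalar & MASK64, carry)
        low_limbs.append(rt)
    return low_limbs, sign_extend_64(carry)   # top limb carries the sign

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
s = -12345
low_limbs, top = bigmul_scalar_signed(to_limbs(x, 3), s)
product = sum(l << (64 * i) for i, l in enumerate(low_limbs))
product += top << (64 * len(low_limbs))
```

The carry stays within the signed 64-bit range at every step, which is why sign-extending RC on the next instruction reconstructs it exactly; `product` should equal `x * s`.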

----------

\newpage{}

# Divide/Modulo Quad-Double Unsigned

`divmod2du RT,RA,RB,RC`

|  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-31 | Form    |
|-------|------|-------|-------|-------|-------|---------|
| EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | VA-Form |

Pseudo-code:

```
if ((RA) <u (RB)) & ((RB) != [0]*64) then  # RA<RB (no overflow), RB nonzero
    dividend[0:127] <- (RA) || (RC)        # Combine RA/RC as 128-bit
    divisor[0:127] <- [0]*64 || (RB)       # Extend RB to 128-bit
    result <- dividend / divisor           # Unsigned Division
    modulo <- dividend % divisor           # Unsigned Modulo
    RT <- result[64:127]                   # Store result in RT
    RS <- modulo[64:127]                   # Store modulo in RS (implicitly RC)
else                                       # In case of error
    RT <- [1]*64                           # RT all 1's
    RS <- [0]*64                           # RS all 0's
```

Special registers altered:

    None

The 128-bit dividend is (RA) || (RC). The 64-bit divisor is
(RB). If the quotient can be represented in 64 bits, it is
placed into register RT. The modulo is placed into register RS.
RS is implicitly defined as the same register as RC, similarly to `maddedu`.

The quotient can be represented in 64 bits when both these conditions
are true:

* (RA) < (RB) (unsigned comparison)
* (RB) is NOT 0 (not divide-by-0)

If these conditions are not met, RT is set to all 1's, RS to all 0's.

All operands, quotient, and modulo are interpreted as unsigned integers.

Divide/Modulo Quad-Double Unsigned is a VA-Form instruction
that is near-identical to `divdeu` except that:

* the lower 64 bits of the dividend, instead of being zero, contain the
  contents of a register, RC.
* it performs a fused divide and modulo in a single instruction, storing
  the modulo in an implicit RS (similar to `maddedu`)
* there is no `UNDEFINED` behaviour.

RB, the divisor, remains 64 bit.  The instruction is therefore a 128/64
division, producing a pair of 64-bit results, in the same way that
Intel [divq](https://www.felixcloutier.com/x86/div) works.
Overflow conditions
are detected in exactly the same fashion as `divdeu`, except that rather
than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
zeros on overflow.

*Programmer's note: there are no Rc variants of any of these VA-Form
instructions. `cmpi` will need to be used to detect overflow conditions:
the saving in instruction count is that both RT and RS will have already
been set to useful values (all 1s and all zeros respectively)
needed as part of implementing Knuth's Algorithm D.*

For Scalar usage, just as for `maddedu`, `RS=RC`.

Examples:

```
# ((r0 << 64) + r2) / r1, store in r4
# ((r0 << 64) + r2) % r1, store in r2
divmod2du r4, r0, r1, r2
```
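
A Python sketch of the `divmod2du` pseudo-code, with two uses: dividing a multi-limb integer by a 64-bit scalar (the remainder of each step becomes the high half of the next dividend), and checking the inverse relationship with `maddedu` mentioned in the Notes. The helper names and test values are assumptions for illustration; limbs are stored least-significant first and the divisor is assumed nonzero.

```python
MASK64 = (1 << 64) - 1

def divmod2du(ra, rb, rc):
    """Return (RT, RS): quotient and modulo of (RA || RC) / RB, unsigned."""
    if rb != 0 and ra < rb:               # quotient fits in 64 bits
        dividend = (ra << 64) | rc
        return dividend // rb, dividend % rb
    return MASK64, 0                      # overflow or divide-by-zero

def maddedu(ra, rb, rc):
    """Return (RT, RS): low and high 64 bits of RA*RB + RC, unsigned."""
    total = ra * rb + rc
    return total & MASK64, total >> 64

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, LSB first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs stored LSB first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def bigdiv_scalar(limbs, divisor):
    """Divide a limb vector by a nonzero 64-bit scalar, top limb first."""
    remainder = 0                         # always < divisor, so no overflow
    quotient = [0] * len(limbs)
    for i in reversed(range(len(limbs))):
        quotient[i], remainder = divmod2du(remainder, divisor, limbs[i])
    return quotient, remainder

x = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
d = 0xFFFFFFFFFFFFFFC5
quotient, remainder = bigdiv_scalar(to_limbs(x, 3), d)

# Inverse check: quotient * divisor + modulo restores (RC, RA).
q, r = divmod2du(3, 10, 7)
restored = maddedu(q, 10, r)
```

The running remainder is always smaller than the divisor, so the `(RA) < (RB)` condition holds at every step of the chain and the all-ones result is never produced inside the loop. `restored` should come back as `(7, 3)`, the original low and high halves of the dividend.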

----------

\newpage{}

# Double-Shift Left Doubleword

`dsld RT,RA,RB,RC`

|  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form     |
|-------|------|-------|-------|-------|-------|----|----------|
| EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | Rc | VA2-Form |

Pseudo-code:

    n <- (RB)[58:63]                        # Use lower 6 bits for shift
    v <- ROTL64((RA), n)                    # Rotate RA 64-bit left by n bits
    mask <- MASK(0, 63-n)                   # 1s mask in MSBs
    RT <- (v[0:63] & mask) | ((RC) & ¬mask) # mask-in RC into RT
    RS <- v[0:63] & ¬mask                   # part normally lost, into RS
    overflow <- 0                           # Clear overflow flag
    if RS != [0]*64 then                    # Check if RS is NOT zero
        overflow <- 1                       # Set the overflow flag

Special Registers Altered:

    CR0                    (if Rc=1)

The contents of register RA are shifted left the number
of bits specified by (RB)[58:63]. The same number of
bits are taken from the **right** (LSB) end of register
RC and placed into the **rightmost** (LSB) end of the result, RT.
Additionally, the MSB (leftmost) bits of register RA that would normally
be discarded by a 64-bit left shift are placed into the
LSBs of RS.

When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
or cleared if it is zero; all other bits of CR0 are set from RT as normal.
XER.OV and XER.SO remain unchanged.

*Programmer's note:
similar to `maddedu` and `divmod2du`, `dsld` can be chained (using RC),
effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
length Scalar-Vector shift may be performed without the additional
masking instructions normally needed.*
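
A minimal Python sketch of the behaviour described above, taking `mask` to have ones in the most-significant 64-n bits as the pseudo-code comment states. The names `dsld`, `bigshl`, `to_limbs` and `from_limbs` and the test value are assumptions for illustration, the Rc=1 update of CR0 is omitted, and limbs are stored least-significant first.

```python
MASK64 = (1 << 64) - 1

def dsld(ra, rb, rc):
    """Return (RT, RS) for a double-shift left by the low 6 bits of RB."""
    n = rb & 63
    low_mask = (1 << n) - 1                   # complement of `mask`
    rotated = ((ra << n) | (ra >> (64 - n))) & MASK64
    rt = (rotated & (MASK64 ^ low_mask)) | (rc & low_mask)
    rs = rotated & low_mask                   # bits shifted out of RA
    return rt, rs

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, LSB first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs stored LSB first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def bigshl(limbs, n):
    """Shift a limb vector left by n (0..63) bits using chained dsld."""
    carry = 0                                 # RC: carry-in, then carry-out
    result = []
    for limb in limbs:                        # least-significant limb first
        rt, carry = dsld(limb, n, carry)
        result.append(rt)
    result.append(carry)                      # bits shifted out of the top
    return result

x = 0x8123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
shifted = [from_limbs(bigshl(to_limbs(x, 3), n)) for n in range(64)]
```

Each step merges the previous limb's shifted-out bits into the bottom of the current limb, which is the work that would otherwise need a separate shift, mask and OR. Every entry `shifted[n]` should equal `x << n`.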

----------

# Double-Shift Right Doubleword

`dsrd RT,RA,RB,RC`

|  0-5  | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31 | Form     |
|-------|------|-------|-------|-------|-------|----|----------|
| EXT04 | RT   |  RA   |  RB   |   RC  |  XO   | Rc | VA2-Form |

Pseudo-code:

    n <- (RB)[58:63]                        # Use lower 6 bits for shift
    v <- ROTL64((RA), 64-n)                 # Rotate RA 64-bit left by 64-n bits
    mask <- MASK(n, 63)                     # 1s mask in LSBs (bits n:63)
    RT <- (v[0:63] & mask) | ((RC) & ¬mask) # mask-in RC into RT
    RS <- v[0:63] & ¬mask                   # part normally lost, into RS
    overflow <- 0                           # Clear overflow flag
    if RS != [0]*64 then                    # Check if RS is NOT zero
        overflow <- 1                       # Set the overflow flag

Special Registers Altered:

    CR0                    (if Rc=1)

The contents of register RA are shifted right the number
of bits specified by (RB)[58:63]. The same number of
bits are taken from the **left** (MSB) end of register RC
and placed into the **leftmost** (MSB) end of the result, RT.
Additionally, the LSB (rightmost) bits of register RA that would normally
be discarded by a 64-bit right shift are placed into the
MSBs of RS.

When Rc=1, the overflow flag in CR0 is set if RS is nonzero,
or cleared if it is zero; all other bits of CR0 are set from RT as normal.
XER.OV and XER.SO remain unchanged.

*Programmer's note:
similar to `maddedu` and `divmod2du`, `dsrd` can be chained (using RC),
effectively using RC as a 64-bit carry-in and carry-out. Arbitrary
length Scalar-Vector shift may be performed without the additional
masking instructions normally needed.*
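
A matching Python sketch for the right shift, following the pseudo-code with `mask` covering bits n:63 (the least-significant 64-n bits). The names `dsrd`, `bigshr`, `to_limbs` and `from_limbs` and the test value are assumptions for illustration, the Rc=1 update of CR0 is omitted, and limbs are stored least-significant first, so the chain runs from the top limb downwards.

```python
MASK64 = (1 << 64) - 1

def dsrd(ra, rb, rc):
    """Return (RT, RS) for a double-shift right by the low 6 bits of RB."""
    n = rb & 63
    high_mask = (MASK64 << (64 - n)) & MASK64     # complement of `mask`
    rotated = ((ra >> n) | (ra << (64 - n))) & MASK64
    rt = (rotated & (MASK64 ^ high_mask)) | (rc & high_mask)
    rs = rotated & high_mask                      # bits shifted out of RA
    return rt, rs

def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, LSB first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs stored LSB first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def bigshr(limbs, n):
    """Shift a limb vector right by n (0..63) bits using chained dsrd."""
    carry = 0                                     # RC: carry-in, carry-out
    result = [0] * len(limbs)
    for i in reversed(range(len(limbs))):         # most-significant first
        result[i], carry = dsrd(limbs[i], n, carry)
    return result, carry                          # carry holds dropped bits

x = 0x8123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
shifted = [from_limbs(bigshr(to_limbs(x, 3), n)[0]) for n in range(64)]
```

Every entry `shifted[n]` should equal `x >> n`; the final carry returned by `bigshr` holds the bits that fell off the bottom, left-aligned in a 64-bit word.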

----------

\newpage{}

# VA2-Form

Add the following to Book I, 1.6.21.1, VA2-Form

```
|0      |6     |11     |16     |21     |26  |31  |
| PO    |  RT  |   RA  |   RB  |   RC  | XO | Rc |
```

Add 'VA2-Form' to the `RA` through `XO` field descriptions in Book I, 1.6.2

```
RA (11:15)
    Field used to specify a GPR to be used as a
    source or as a target.
    Formats: ... VA2, ...

RB (16:20)
    Field used to specify a GPR to be used as a
    source.
    Formats: ... VA2, ...

RC (21:25)
    Field used to specify a GPR to be used as a
    source.
    Formats: ... VA2, ...

Rc (31)
    RECORD bit.
    0    Do not alter the Condition Register.
    1    Set Condition Register Field 0 or Field 1 as
         described in Section 2.3.1, 'Condition
         Register' on page 30.
    Formats: ... VA2, ...

RT (6:10)
    Field used to specify a GPR to be used as a target.
    Formats: ... VA2, ...

XO (26:30)
    Extended opcode field.
    Formats: ... VA2, ...
```
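
The field positions above can be turned into a 32-bit instruction word as sketched below in Python. Bit 0 is the most-significant bit, as in the Power ISA, and the primary opcode 4 is taken from the EXT04 entry in the instruction tables. This RFC does not state the XO values, so the `xo` argument used in the usage lines is a hypothetical placeholder, as are the function names `encode_va2` and `decode_va2`.

```python
def encode_va2(rt, ra, rb, rc, xo, rc_bit, po=4):
    """Pack VA2-Form fields into a 32-bit word (bit 0 is the MSB)."""
    fields = (
        ("PO", po, 6), ("RT", rt, 5), ("RA", ra, 5), ("RB", rb, 5),
        ("RC", rc, 5), ("XO", xo, 5), ("Rc", rc_bit, 1),
    )
    word = 0
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value        # append field below the last
    return word

def decode_va2(word):
    """Unpack a 32-bit word into its VA2-Form fields."""
    return {
        "PO": (word >> 26) & 0x3F,
        "RT": (word >> 21) & 0x1F,
        "RA": (word >> 16) & 0x1F,
        "RB": (word >> 11) & 0x1F,
        "RC": (word >> 6) & 0x1F,
        "XO": (word >> 1) & 0x1F,
        "Rc": word & 1,
    }

# XO=0b10101 is a made-up value used only to exercise the packing.
word = encode_va2(rt=1, ra=2, rb=3, rc=4, xo=0b10101, rc_bit=1)
fields = decode_va2(word)
```

The only difference from VA-Form in this layout is that XO shrinks from six bits (26:31) to five (26:30) to make room for Rc in bit 31.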

----------

# Appendices

    Appendix E Power ISA sorted by opcode
    Appendix F Power ISA sorted by version
    Appendix G Power ISA sorted by Compliancy Subset
    Appendix H Power ISA sorted by mnemonic

|Form| Book | Page | Version | mnemonic | Description |
|----|------|------|---------|----------|-------------|
|VA  | I    | #    | 3.2B    |maddedu   | Multiply-Add Extended Double Unsigned |
|VA  | I    | #    | 3.2B    |maddedus  | Multiply-Add Extended Double Unsigned/Signed |
|VA  | I    | #    | 3.2B    |divmod2du | Divide/Modulo Quad-Double Unsigned |
|VA2 | I    | #    | 3.2B    |dsld      | Double-Shift Left Doubleword |
|VA2 | I    | #    | 3.2B    |dsrd      | Double-Shift Right Doubleword |

----------------

[[!tag opf_rfc]]