openpower/sv/biginteger.mdwn

   1 [[!tag standards]]
   2
   3 # Big Integer Arithmetic
   4
   5 **DRAFT STATUS** 19apr2022, last edited 23may2022
   6
   7 * [[discussion]] page for notes
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=817> bugreport
   9 * [[biginteger/analysis]]
  10
  11 BigNum arithmetic is extremely common especially in cryptography,
  12 where for example RSA relies on arithmetic of 2048 or 4096 bits
  13 in length.  The primary operations are add, multiply and divide
  14 (and modulo) with specialisations of subtract and signed multiply.
  15
  16 A reminder that a particular focus of SVP64 is that it is built on
  17 top of Scalar operations, where those scalar operations are useful in
  18 their own right without SVP64. Thus the operations here are proposed
  19 first as Scalar Extensions to the Power ISA.
  20
  21 A secondary focus is that if Vectorised, implementors may choose
  22 to deploy macro-op fusion targetting back-end 256-bit or greater
  23 Dynamic SIMD ALUs for maximum performance and effectiveness.
  24
  25 # Analysis
  26
  27 Covered in [[biginteger/analysis]] the summary is that standard `adde`
  28 is sufficient for SVP64 Vectorisation of big-integer addition (and subfe
  29 for subtraction) but that big-integer multiply and divide require an
  30 extra 3-in 2-out instruction, similar to Intel's `mulx`, to be efficient.
  31 The same instruction (`madded`) is used for both because 'madded''s primary
  32 purpose is to perform a fused 64-bit scalar multiply with a large vector,
  33 where that result is Big-Added for Big-Multiply, but Big-Subtracted for
  34 Big-Divide.
  35
  36 Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a
  37 fashion that is hidden from the user, behind a consistent, stable ISA API.
  38 The same macro-op fusion may theoretically be deployed even on Scalar
  39 operations.
  40
  41 # madded
  42
  43 **DRAFT**
  44
  45 `madded` is similar to v3.0 `madd`, and
  46 is VA-Form despite having 2 outputs: the second
  47 destination register is implicit.
  48
  49 |0.....5|6..10|11..15|16..20|21..25|26..31|
  50 |-------|-----|------|------|------|------|
  51 | EXT04 | RT  |  RA  |  RB  |   RC |  XO  |
  52
  53 The pseudocode for `madded RT, RA, RB, RC` is:
  54
  55     prod[0:127] = (RA) * (RB)
  56     sum[0:127] = EXTZ(RC) + prod
  57     RT <- sum[64:127]
  58     RS <- sum[0:63] # RS implicit register, see below
  59
  60 RC is zero-extended (not shifted, not sign-extended), the 128-bit product added
  61 to it; the lower half of that result stored in RT and the upper half
  62 in RS.
  63
  64 The differences here to `maddhdu` are that `maddhdu` stores the upper
  65 half in RT, where `madded` stores the upper half in RS. There is no
  66 equivalent to `maddld` because `maddld` performs sign-extension on RC.
  67
  68 *Programmer's Note:
  69 As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.
  70 To achieve the same big-integer rolling-accumulation effect
  71 as SVP64: assuming the scalar to multiply is in r0,
  72 the vector to multiply by starts at r4 and the result vector
  73 in r20, instructions may be issued `madded r20,r4,r0,r20
  74 madded r21,r5,r0,r21` etc. where the first `madded` will have
  75 stored the upper half of the 128-bit multiply into r21, such
  76 that it may be picked up by the second `madded`. Repeat inline
  77 to construct a larger bigint scalar-vector multiply,
  78 as Scalar GPR register file space permits.*
  79
  80 SVP64 overrides the Scalar behaviour of what defines RS.
  81 For SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
  82 used with the additional bit set for determining RS.
  83
  84 | Field Name | Field bits | Description                            |
  85 |------------|------------|----------------------------------------|
  86 | Rdest\_EXTRA2 | `10:11` | extends RT (R\*\_EXTRA2 Encoding)   |
  87 | Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding)   |
  88 | Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding)   |
  89 | Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding)   |
  90 | EXTRA2_MODE   | `18`    | used by `madded` for determining RS |
  91
  92 When `EXTRA2_MODE` is set to zero, the implicit RS register takes
  93 its Vector/Scalar setting from Rdest_EXTRA2, and takes
  94 the register number from RT, but all numbering
  95 is offset by MAXVL. *Note that element-width overrides influence this
  96 offset* (see SVP64 [[svp64/appendix]] for full details).
  97
  98 When `EXTRA2_MODE` is set to one, the implicit RS register is identical
  99 to RC extended with SVP64 using `Rsrc3_EXTRA2` in every respect, including whether RC is set Scalar or
 100 Vector.
 101
 102 # divrem2du RT,RA,RB,RC
 103
 104 **DRAFT**
 105
 106 Divide/Modulu Quad-Double Unsigned is another VA-Form instruction
 107 that is near-identical to `divdeu` except that:
 108
 109 * the lower 64 bits of the dividend, instead of being zero, contain a
 110   register, RC.
 111 * it performs a fused divide and modulo in a single instruction, storing
 112   the modulo in an implicit RS (similar to `madded`)
 113
 114 RB, the divisor, remains 64 bit.  The instruction is therefore a 128/64
 115 division, producing a (pair) of 64 bit result(s).  Overflow conditions
 116 are detected in exactly the same fashion as `divdeu`, except that rather
 117 than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
 118 zeros on overflow.
 119
 120 *Programmer's note: there are no Rc variants of any of these VA-Form
 121 instructions. `cmpi` will need to be used to detect overflow conditions:
 122 the saving in instruction count is that both RT and RS will have already
 123 been set to useful values needed as part of implementing Knuth's
 124 Algorithm D*
 125
 126 For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
 127 the exact same EXTRA format and setting of RS is used as for `sv.madded`.
 128 For Scalar usage, just as for `madded`, `RS=RT+1` (similar to `lq` and `stq`).
 129
 130 Pseudo-code:
 131
 132     if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
 133         dividend[0:(XLEN*2)-1] <- (RA) || (RC)
 134         divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
 135         result <- dividend / divisor
 136         modulo <- dividend % divisor
 137         RT <- result[XLEN:(XLEN*2)-1]
 138         RS <- modulo[XLEN:(XLEN*2)-1]
 139     else
 140         RT <- [1]*XLEN
 141         RS <- [0]*XLEN
 142
 143 # [DRAFT] EXT04 Proposed Map
 144
 145 For the Opcode map (XO Field)
 146 see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
 147 Proposed is the addition of `madded` (**DRAFT, NOT APPROVED**) in `110010`
 148 and `divrem2du` in `110100`
 149
 150 |110000|110001 |110010    |110011|110100       |110101|110110|110111|
 151 |------|-------|----------|------|-------------|------|------|------|
 152 |maddhd|maddhdu|**madded**|maddld|**divrem2du**|rsvd  |rsvd  |rsvd  |
 153