openpower/sv/biginteger.mdwn

   1 [[!tag standards]]
   2
   3 # Big Integer Arithmetic
   4
   5 **DRAFT STATUS** 19apr2021
   6
   7 BigNum arithmetic is extremely common especially in cryptography,
   8 where for example RSA relies on arithmetic of 2048 or 4096 bits
   9 in length.  The primary operations are add, multiply and divide
  10 (and modulo) with specialisations of subtract and signed multiply.
  11
  12 A reminder that a particular focus of SVP64 is that it is built on
  13 top of Scalar operations, where those scalar operations are useful in
  14 their own right without SVP64. Thus the operstions here are proposed
  15 first as Scalar Extensions to the Power ISA.
  16
  17 A secondary focus is that if Vectorised, implementors may choose
  18 to deploy macro-op fusion targetting back-end 256-bit or greater
  19 Dynamic SIMD ALUs for maximum performance and effectiveness.
  20
  21 # Analysis
  22
  23 This section covers an analysis of big integer operations.  Use of
  24 smaller sub-operations is a given: worst-case, addition is O(N)
  25 whilst multiply and divide are O(N^2).
  26
  27 ## Add and Subtract
  28
  29 Surprisingly, no new additional instructions are required to perform
  30 a straightforward big-integer add or subtract.  Vectorised `addeo`
  31 or `addex` is perfectly sufficient to produce arbitrary-length
  32 big-integer add due to the rules set in SVP64 that all Vector Operations
  33 are directly equivalent to the strict Program Order Execution of
  34 their element-level operations.
  35
  36 Thus, due to sequential execution of `addeo` both consuming and producing
  37 a CA Flag, `sv.addeo` is in effect an alias for Vectorised add.  As such,
  38 implementors are entirely at liberty to recognise Horizontal-First Vector
  39 adds and send the vector of registers to a much larger and wider back-end
  40 ALU.
  41
  42 ## Multiply
  43
  44 Multiply is tricky: 64 bit operands actually produce a 128-bit result,
  45 which clearly cannot fit into an orthogonal register file.
  46 Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
  47 instructions, whilst some (OpenRISC) have "Accumulators" from which
  48 the results of the multiply must be explicitly extracted. High
  49 performance RISC advocates
  50 recommend "macro-op fusion" which is in effect where the second instruction
  51 gains access to the cached copy of the HI half of the
  52 multiply redult, which had already been
  53 computed by the first. This approach quickly complicates the internal
  54 microarchitecture, especially at the decode phase.
  55
  56 Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
  57 both HI and LO halves of the multiply to reach registers.  If done as a
  58 multiply-and-accumulate this becomes quite an expensive operation:
  59 3 64-Bit in, 2 64-bit registers out).
  60
  61 Long-multiplication may be performed a row at a time, starting
  62 with B0:
  63
  64     C4 C3 C2 C1 C0
  65              A0xB0
  66           A1xB0
  67        A2xB0
  68     A3xB0
  69     R4 R3 R2 R1 R0
  70
  71 * R0 contains C0 plus the LO half of A0 times B0
  72 * R1 contains C1 plus the LO half of A1 times B0
  73   plus the HI half of A0 times B0.
  74 * R2 contains C2 plus the LO half of A2 times B0
  75   plus the HI half of A1 times B0.
  76
  77 This would on the face of it be a 4-in operation:
  78 the upper half of a previous multiply, two new operands
  79 to multiply, and an additional accumulator (C). However if
  80 C is left out (and added afterwards with a Vector-Add)
  81 things become more manageable.
  82
  83 We therefore propose an operation that is 3-in, 2-out,
  84 that, noting that the connection between successive
  85 mul-adds has the UPPER half of the previous operation
  86 as its input, writes the UPPER half of the current
  87 product into a second output register for exactly that
  88 purpose.
  89
  90     product = RA*RB+RC
  91     RT = lowerhalf(product)
  92     RC = upperhalf(product)
  93
  94 Successive iterations effectively use RC as a 64-bit carry, and
  95 as noted by Intel in their notes on mulx,
  96 RA*RB+RC+RD cannot overflow, so does not require
  97 setting an additional CA flag.
  98
  99 Combined with a Vectorised big-int `sv.addeo` the key inner loop of
 100 Knuth's Algorithm M may be achieved in four instructions:
 101
 102     li r16, 0               # carry-accululator to zero
 103     addicc r16, r16, 0      # CA to zero as well
 104     sv.mulx r0.v, r8.v, r16 # mul vector using r16
 105     sv.addeo
 106
 107
 108 Normally, in a Scalar ISA, the use of a register as both a source
 109 and destination like this would create costly Dependency Hazards, so
 110 such an instruction would never be proposed.  However: it turns out
 111 that, just as with repeated chained application of `addeo`, macro-op
 112 fusion may be internally applied to a sequence of these strange multiply
 113 operations. Such a trick works equally as well in Scalar-only.
 114
 115 **Application of SVP64**
 116
 117 SVP64 has the means to mark registers as scalar or vector. However
 118 the available space in the prefix is extremely limited (9 bits).
 119 With effectively 5 operands (3 in, 2 out) some compromises are needed.
 120 However a little though gives a useful workaround: two modes,
 121 controlled by a single bit in `RM.EXTRA`, determine whether the 5th
 122 register is set to RC or whether to RT+VL. This then leaves only
 123 4 registers to qualify as scalar/vector, and this can use four
 124 EXTRA2 designators which fits into the available space.
 125
 126 RS=RT+VL Mode:
 127
 128     product = RA*RB+RC
 129     RT = lowerhalf(product)
 130     RS=RT+VL = upperhalf(product)
 131
 132 and RS=RC Mode:
 133
 134     product = RA*RB+RC
 135     RT = lowerhalf(product)
 136     RS=RC = upperhalf(product)
 137
 138 Now there is much more potential, including setting RC to a Scalar,
 139 which would be useful as a 64 bit Carry. RC as a Vector would produce
 140 a Vector of the HI halves of a Vector of multiplies.  RS=RT+VL Mode
 141 would allow that same Vector of HI halves to not be an overwrite of RC.
 142 Also it is possible to specify that any of RA, RB or RC are scalar or
 143 vector. Overall it is extremely powerful.
 144
 145 ## Divide
 146
 147 The simplest implementation of big-int divide is the standard schoolbook
 148 "Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
 149 Algorithm D performs estimates which, if wrong, are compensated for
 150 afterwards.  Essentiallythere are three phases:
 151
 152 * Calculation of the quotient estimate. This is Scalar division
 153   and can be ignored for the scope of this analysis
 154 * Big Integer multiply and subtract.
 155 * Carry-Correction with a big integer add, if the estimate from
 156   phase 1 was wrong by one digit.
 157
 158 In essence then the primary focus of Vectorised Big-Int divide is in
 159 fact big-integer multiply (or, mul-and-subtract).
 160
 161 # Appendix
 162
 163 see [[appendix]]