(no commit message)
[libreriscv.git] / openpower / sv / biginteger.mdwn
1 [[!tag standards]]
2
3 # Big Integer Arithmetic
4
5 **DRAFT STATUS** 19apr2021
6
7 BigNum arithmetic is extremely common especially in cryptography,
8 where for example RSA relies on arithmetic of 2048 or 4096 bits
9 in length. The primary operations are add, multiply and divide
10 (and modulo) with specialisations of subtract and signed multiply.
11
12 A reminder that a particular focus of SVP64 is that it is built on
13 top of Scalar operations, where those scalar operations are useful in
14 their own right without SVP64. Thus the operstions here are proposed
15 first as Scalar Extensions to the Power ISA.
16
17 A secondary focus is that if Vectorised, implementors may choose
18 to deploy macro-op fusion targetting back-end 256-bit or greater
19 Dynamic SIMD ALUs for maximum performance and effectiveness.
20
21 # Analysis
22
23 This section covers an analysis of big integer operations. Use of
24 smaller sub-operations is a given: worst-case, addition is O(N)
25 whilst multiply and divide are O(N^2).
26
27 ## Add and Subtract
28
29 Surprisingly, no new additional instructions are required to perform
30 a straightforward big-integer add or subtract. Vectorised `addeo`
31 or `addex` is perfectly sufficient to produce arbitrary-length
32 big-integer add due to the rules set in SVP64 that all Vector Operations
33 are directly equivalent to the strict Program Order Execution of
34 their element-level operations.
35
36 Thus, due to sequential execution of `addeo` both consuming and producing
37 a CA Flag, `sv.addeo` is in effect an alias for Vectorised add. As such,
38 implementors are entirely at liberty to recognise Horizontal-First Vector
39 adds and send the vector of registers to a much larger and wider back-end
40 ALU.
41
42 ## Multiply
43
44 Multiply is tricky: 64 bit operands actually produce a 128-bit result,
45 which clearly cannot fit into an orthogonal register file.
46 Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
47 instructions, whilst some (OpenRISC) have "Accumulators" from which
48 the results of the multiply must be explicitly extracted. High
49 performance RISC advocates
50 recommend "macro-op fusion" which is in effect where the second instruction
51 gains access to the cached copy of the HI half of the
52 multiply redult, which had already been
53 computed by the first. This approach quickly complicates the internal
54 microarchitecture, especially at the decode phase.
55
56 Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
57 both HI and LO halves of the multiply to reach registers. If done as a
58 multiply-and-accumulate this becomes quite an expensive operation:
59 3 64-Bit in, 2 64-bit registers out).
60
61 Long-multiplication may be performed a row at a time, starting
62 with B0:
63
64 C4 C3 C2 C1 C0
65 A0xB0
66 A1xB0
67 A2xB0
68 A3xB0
69 R4 R3 R2 R1 R0
70
71 * R0 contains C0 plus the LO half of A0 times B0
72 * R1 contains C1 plus the LO half of A1 times B0
73 plus the HI half of A0 times B0.
74 * R2 contains C2 plus the LO half of A2 times B0
75 plus the HI half of A1 times B0.
76
77 This would on the face of it be a 4-in operation:
78 the upper half of a previous multiply, two new operands
79 to multiply, and an additional accumulator (C). However if
80 C is left out (and added afterwards with a Vector-Add)
81 things become more manageable.
82
83 We therefore propose an operation that is 3-in, 2-out,
84 that, noting that the connection between successive
85 mul-adds has the UPPER half of the previous operation
86 as its input, writes the UPPER half of the current
87 product into a second output register for exactly that
88 purpose.
89
90 product = RA*RB+RC
91 RT = lowerhalf(product)
92 RC = upperhalf(product)
93
94 Successive iterations effectively use RC as a 64-bit carry, and
95 as noted by Intel in their notes on mulx,
96 RA*RB+RC+RD cannot overflow, so does not require
97 setting an additional CA flag.
98
99 Combined with a Vectorised big-int `sv.addeo` the key inner loop of
100 Knuth's Algorithm M may be achieved in four instructions:
101
102 li r16, 0 # carry-accululator to zero
103 addicc r16, r16, 0 # CA to zero as well
104 sv.mulx r0.v, r8.v, r16 # mul vector using r16
105 sv.addeo
106
107
108 Normally, in a Scalar ISA, the use of a register as both a source
109 and destination like this would create costly Dependency Hazards, so
110 such an instruction would never be proposed. However: it turns out
111 that, just as with repeated chained application of `addeo`, macro-op
112 fusion may be internally applied to a sequence of these strange multiply
113 operations. Such a trick works equally as well in Scalar-only.
114
115 **Application of SVP64**
116
117 SVP64 has the means to mark registers as scalar or vector. However
118 the available space in the prefix is extremely limited (9 bits).
119 With effectively 5 operands (3 in, 2 out) some compromises are needed.
120 However a little though gives a useful workaround: two modes,
121 controlled by a single bit in `RM.EXTRA`, determine whether the 5th
122 register is set to RC or whether to RT+VL. This then leaves only
123 4 registers to qualify as scalar/vector, and this can use four
124 EXTRA2 designators which fits into the available space.
125
126 RS=RT+VL Mode:
127
128 product = RA*RB+RC
129 RT = lowerhalf(product)
130 RS=RT+VL = upperhalf(product)
131
132 and RS=RC Mode:
133
134 product = RA*RB+RC
135 RT = lowerhalf(product)
136 RS=RC = upperhalf(product)
137
138 Now there is much more potential, including setting RC to a Scalar,
139 which would be useful as a 64 bit Carry. RC as a Vector would produce
140 a Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode
141 would allow that same Vector of HI halves to not be an overwrite of RC.
142 Also it is possible to specify that any of RA, RB or RC are scalar or
143 vector. Overall it is extremely powerful.
144
145 ## Divide
146
147 The simplest implementation of big-int divide is the standard schoolbook
148 "Long Division", set with RADIX 64 instead of Base 10. Donald Knuth's
149 Algorithm D performs estimates which, if wrong, are compensated for
150 afterwards. Essentiallythere are three phases:
151
152 * Calculation of the quotient estimate. This is Scalar division
153 and can be ignored for the scope of this analysis
154 * Big Integer multiply and subtract.
155 * Carry-Correction with a big integer add, if the estimate from
156 phase 1 was wrong by one digit.
157
158 In essence then the primary focus of Vectorised Big-Int divide is in
159 fact big-integer multiply (or, mul-and-subtract).
160
161 # Appendix
162
163 see [[appendix]]