--- /dev/null
+# Analysis
+
+This section analyses big-integer operations. These are necessarily
+built from smaller sub-operations: in the worst case, where N is the
+number of digits, addition is O(N) whilst multiply and divide are O(N^2).
+
+# Add and Subtract
+
+Surprisingly, no new instructions are required to perform
+a straightforward big-integer add or subtract. Vectorised `adde`
+or `addex` is sufficient to produce an arbitrary-length
+big-integer add, because SVP64 requires that all Vector Operations
+be directly equivalent to the strict Program Order Execution of
+their element-level operations.
+
+Because each sequentially executed `adde` both consumes and produces
+the CA Flag, `sv.adde` is in effect an alias for big-integer add: the
+carry ripples from one element to the next. Implementors are therefore
+entirely at liberty to recognise Horizontal-First Vector
+adds and send the vector of registers to a much larger and wider back-end
+ALU.
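+
+As an illustration only, the carry chaining described above can be
+modelled in a few lines of Python. This is a sketch, not part of the
+specification: the helper names (`adde`, `sv_adde`, `to_limbs`,
+`from_limbs`) are invented for the example, and registers are assumed
+to be 64 bits wide with the least-significant digit first.
+
+```python
+MASK64 = (1 << 64) - 1  # one 64-bit register
+
+def adde(ra, rb, ca):
+    """Model of scalar add-extended: returns (64-bit sum, carry-out)."""
+    total = ra + rb + ca
+    return total & MASK64, total >> 64
+
+def sv_adde(a, b, ca=0):
+    """Element-by-element adde in strict program order.
+
+    The CA produced by element i is consumed by element i+1, which is
+    all that is needed to turn a vector of 64-bit adds into one
+    arbitrary-length add.
+    """
+    result = []
+    for ra, rb in zip(a, b):
+        rt, ca = adde(ra, rb, ca)
+        result.append(rt)
+    return result, ca
+
+def to_limbs(x, n):
+    """Split a non-negative integer into n 64-bit limbs, LSB first."""
+    return [(x >> (64 * i)) & MASK64 for i in range(n)]
+
+def from_limbs(limbs):
+    """Inverse of to_limbs."""
+    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
+```
+
+For example, adding 1 to a value whose two lowest limbs are all ones
+makes the carry ripple through both of them into the third limb.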
+
+# Multiply
+
+Long-multiply, assuming an O(N^2) algorithm, is performed by summing
+NxN separate smaller multiplications together. Karatsuba's algorithm
+reduces the number of small multiplies at the expense of increasing
+the number of additions. Some algorithms follow the Vedic Multiply
+pattern by grouping together all multiplies of the same magnitude/power
+(same column), whilst others perform row-based multiplication: a single
+digit of B multiplies the entirety of A, and the rows are summed one at
+a time. The row-based algorithm (Knuth's Algorithm M) is the basis of
+the analysis below.
+
+Multiply is tricky: 64-bit operands produce a 128-bit result,
+which clearly cannot fit into a single register of an orthogonal
+64-bit register file.
+Most Scalar RISC ISAs have separate `mul-low-half` and `mul-hi-half`
+instructions, whilst some (OpenRISC) have "Accumulators" from which
+the results of the multiply must be explicitly extracted. High
+performance RISC advocates recommend "macro-op fusion", in which the
+second instruction gains access to a cached copy of the HI half of the
+multiply result that had already been computed by the first. This
+approach quickly complicates the internal microarchitecture, especially
+at the decode phase.
+
+Instead, Intel, in 2012, specifically added a `mulx` instruction, allowing
+both HI and LO halves of the multiply to reach registers. If done as a
+multiply-and-accumulate this becomes quite an expensive operation
+(three 64-bit registers in, two 64-bit registers out).
+
+Long-multiplication may be performed a row at a time, starting
+with B0:
+
+    C4 C3 C2 C1 C0
+             A0xB0
+          A1xB0
+       A2xB0
+    A3xB0
+    R4 R3 R2 R1 R0
+
+* R0 contains C0 plus the LO half of A0 times B0
+* R1 contains C1 plus the LO half of A1 times B0
+ plus the HI half of A0 times B0.
+* R2 contains C2 plus the LO half of A2 times B0
+ plus the HI half of A1 times B0.
+
+This would on the face of it be a 4-in operation:
+the upper half of a previous multiply, two new operands
+to multiply, and an additional accumulator (C). However if
+C is left out (and added afterwards with a Vector-Add)
+things become more manageable.
+
+We therefore propose an operation that is 3-in, 2-out. Successive
+mul-adds are connected by the UPPER half of the previous product,
+which is an input to the next operation, so the instruction writes
+the UPPER half of the current product into a second output register
+for exactly that purpose.
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RC = upperhalf(product)
+
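+Successive iterations effectively use RC as a 64-bit carry. As
+noted by Intel in their notes on mulx, RA*RB+RC+RD (with all four
+operands 64-bit) cannot overflow 128 bits: the worst case,
+(2^64-1)^2 + (2^64-1) + (2^64-1), is exactly 2^128-1. No additional
+CA flag therefore needs to be set.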
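+
+The no-overflow claim is easy to check numerically. The sketch below
+assumes all four operands are unsigned 64-bit values, with RD standing
+for the fourth addend; the function name `mul_add_add` is made up for
+the example and does not correspond to any instruction.
+
+```python
+MAX64 = (1 << 64) - 1  # largest unsigned 64-bit value
+
+def mul_add_add(ra, rb, rc, rd):
+    """Return (lo, hi) 64-bit halves of ra*rb + rc + rd."""
+    total = ra * rb + rc + rd
+    # With every operand at most MAX64 the sum never needs a 129th bit.
+    assert total < (1 << 128)
+    return total & MAX64, total >> 64
+
+# Worst case: every operand is all ones.
+lo, hi = mul_add_add(MAX64, MAX64, MAX64, MAX64)
+```
+
+In the worst case both halves come out as all ones, i.e. the result is
+exactly 2^128-1, the largest value that two 64-bit registers can hold.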
+
+Combined with a Vectorised big-int `sv.adde` the key inner loop of
+Knuth's Algorithm M may be achieved in four instructions, two of
+which are scalar initialisation:
+
+    li r16, 0                     # zero accumulator
+    addic r16, r16, 0             # CA to zero as well
+    sv.madde r0.v, r8.v, r17, r16 # mul vector
+    sv.adde r24.v, r24.v, r0.v    # big-add row to result
+
+Normally, in a Scalar ISA, the use of a register as both a source
+and destination like this would create costly Dependency Hazards, so
+such an instruction would never be proposed. However, it turns out
+that, just as with repeated chained application of `adde`, macro-op
+fusion may be internally applied to a sequence of these strange multiply
+operations. (*Such a trick works equally well in a Scalar-only
+Out-of-Order microarchitecture, although the conditions are harder to
+detect*).
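+
+To make the data flow of the four-instruction inner loop concrete, here
+is a rough Python model of it, wrapped in the outer loop over the digits
+of B. It is a sketch under assumptions: limbs are 64-bit and stored
+least-significant first, the final 64-bit carry of each row is appended
+as an extra top limb before the big-add, and all function names are
+invented for illustration.
+
+```python
+MASK64 = (1 << 64) - 1
+
+def madde(ra, rb, rc):
+    """product = RA*RB+RC; returns (RT, new RC) = (lower, upper) half."""
+    product = ra * rb + rc
+    return product & MASK64, product >> 64
+
+def mul_row(a, b_digit):
+    """Model of sv.madde: vector A times scalar digit, RC as 64-bit carry."""
+    carry = 0                              # li r16, 0
+    row = []
+    for ra in a:
+        rt, carry = madde(ra, b_digit, carry)
+        row.append(rt)
+    return row + [carry]                   # assumption: keep the last HI half
+
+def add_row(acc, row, offset):
+    """Model of sv.adde: big-add one row into the result at a limb offset."""
+    ca = 0                                 # addic clears CA
+    for i, limb in enumerate(row):
+        total = acc[offset + i] + limb + ca
+        acc[offset + i] = total & MASK64
+        ca = total >> 64
+    return ca
+
+def big_mul(a, b):
+    """Row-based long multiplication in the style of Algorithm M."""
+    acc = [0] * (len(a) + len(b))
+    for j, b_digit in enumerate(b):
+        add_row(acc, mul_row(a, b_digit), j)
+    return acc
+
+def to_limbs(x, n):
+    return [(x >> (64 * i)) & MASK64 for i in range(n)]
+
+def from_limbs(limbs):
+    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
+```
+
+Note that the carry between elements of `mul_row` is a full 64-bit value
+rather than a single bit, which is exactly the role RC plays above.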
+
+**Application of SVP64**
+
+SVP64 has the means to mark registers as scalar or vector. However,
+the available space in the prefix is extremely limited (9 bits).
+With effectively 5 operands (3 in, 2 out) some compromises are needed.
+A little thought gives a useful workaround: two modes,
+controlled by a single bit in `RM.EXTRA`, determine whether the 5th
+register (RS) is set to RC or to RT+VL. This then leaves only
+4 registers to qualify as scalar/vector, which can be done with four
+EXTRA2 designators that fit into the available space.
+
+RS=RT+VL Mode:
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RS=RT+VL = upperhalf(product)
+
+and RS=RC Mode:
+
+    product = RA*RB+RC
+    RT = lowerhalf(product)
+    RS=RC = upperhalf(product)
+
+Now there is much more potential, including setting RC to a Scalar,
+which would be useful as a 64-bit Carry. RC as a Vector would produce
+a Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode
+would allow that same Vector of HI halves to be written without
+overwriting RC.
+It is also possible to specify that any of RA, RB or RC are scalar or
+vector. Overall it is extremely powerful.
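+
+The difference between the two modes can be sketched with a toy register
+file. Everything here is an assumption made for illustration: the
+function `sv_madde`, its keyword flags standing in for the scalar/vector
+markings, a vector RT, and a plain Python list as the register file.
+It is not a definition of the instruction.
+
+```python
+MASK64 = (1 << 64) - 1
+
+def sv_madde(regs, rt, ra, rb, rc, vl, rs_is_rc,
+             ra_vec=True, rb_vec=False, rc_vec=False):
+    """Toy element loop: RT (vector) gets LO halves, RS gets HI halves.
+
+    rs_is_rc=True  models RS=RC mode (HI half overwrites RC).
+    rs_is_rc=False models RS=RT+VL mode (HI half goes to RT+VL+i).
+    """
+    for i in range(vl):
+        a = regs[ra + i] if ra_vec else regs[ra]
+        b = regs[rb + i] if rb_vec else regs[rb]
+        c_idx = rc + i if rc_vec else rc
+        product = a * b + regs[c_idx]
+        rs = c_idx if rs_is_rc else rt + vl + i
+        regs[rt + i] = product & MASK64
+        regs[rs] = product >> 64
+
+# RS=RC mode with scalar RC: r16 acts as a 64-bit carry between elements.
+carry_regs = [0] * 32
+carry_regs[8:12] = [MASK64, 1, 2, 3]       # vector A in r8..r11
+carry_regs[17] = MASK64                    # scalar multiplier in r17
+sv_madde(carry_regs, rt=0, ra=8, rb=17, rc=16, vl=4, rs_is_rc=True)
+
+# RS=RT+VL mode: independent multiplies, HI halves land in r4..r7.
+wide_regs = [0] * 32
+wide_regs[8:12] = [MASK64, 1, 2, 3]
+wide_regs[17] = MASK64
+sv_madde(wide_regs, rt=0, ra=8, rb=17, rc=16, vl=4, rs_is_rc=False)
+```
+
+In the first call r0..r3 plus the final r16 together hold the full
+big-integer product of A and the scalar; in the second, r16 is left
+untouched and each element's 128-bit product is split across r(i) and
+r(4+i).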
+
+# Divide
+
+The simplest implementation of big-int divide is the standard schoolbook
+"Long Division", using 64-bit digits (radix 2^64) instead of Base 10.
+Donald Knuth's Algorithm D performs estimates which, if wrong, are
+compensated for afterwards. Essentially there are three phases:
+
+* Calculation of the quotient estimate. This is Scalar division
+ and can be ignored for the scope of this analysis.
+* Big Integer multiply and subtract.
+* Carry-Correction with a big integer add, if the estimate from
+ phase 1 was too large by one.
+
+In essence then the primary focus of Vectorised Big-Int divide is in
+fact big-integer multiply (more specifically, mul-and-subtract).
+
+    product = RC - (RA) * (RB)
+    RT = lowerhalf(product)
+    RS = upperhalf(product)
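+
+Phases 2 and 3 can be sketched in Python as follows. This is a generic
+limb-by-limb multiply-and-subtract with add-back, written to show the
+data flow only. It assumes 64-bit limbs stored least-significant first
+and a partial remainder `u` one limb longer than the divisor `v`; the
+function names are hypothetical, and the way the borrow is tracked here
+is not claimed to match the proposed instruction.
+
+```python
+MASK64 = (1 << 64) - 1
+
+def mul_sub_row(u, v, q):
+    """In place: u -= q * v. Returns 1 if the result went negative."""
+    carry = 0                      # HI half passed between multiplies
+    borrow = 0
+    for i, vi in enumerate(v):
+        product = q * vi + carry
+        carry = product >> 64
+        diff = u[i] - (product & MASK64) - borrow
+        u[i] = diff & MASK64
+        borrow = 1 if diff < 0 else 0
+    diff = u[len(v)] - carry - borrow
+    u[len(v)] = diff & MASK64
+    return 1 if diff < 0 else 0
+
+def add_back(u, v):
+    """In place: u += v, discarding the carry out of the top limb."""
+    ca = 0
+    for i, vi in enumerate(v):
+        total = u[i] + vi + ca
+        u[i] = total & MASK64
+        ca = total >> 64
+    u[len(v)] = (u[len(v)] + ca) & MASK64
+
+def divide_step(u, v, q_estimate):
+    """One quotient digit: subtract q*v, correct if q was one too large."""
+    if mul_sub_row(u, v, q_estimate):
+        add_back(u, v)
+        q_estimate -= 1
+    return q_estimate
+
+def to_limbs(x, n):
+    return [(x >> (64 * i)) & MASK64 for i in range(n)]
+
+def from_limbs(limbs):
+    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
+```
+
+The inner loop of `mul_sub_row` has the same shape as the multiply case:
+a 64-bit HI half carried from one element to the next, with the
+subtraction from the accumulator folded in.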
+