[[!tag standards]]

# Big Integer Arithmetic

**DRAFT STATUS** 19apr2021

BigNum arithmetic is extremely common, especially in cryptography, where
for example RSA relies on arithmetic on numbers 2048 or 4096 bits in
length. The primary operations are add, multiply and divide (and modulo),
with specialisations of subtract and signed multiply.

A reminder that a particular focus of SVP64 is that it is built on top
of Scalar operations, where those scalar operations are useful in their
own right without SVP64. Thus the operations here are proposed first as
Scalar Extensions to the Power ISA.

A secondary focus is that, if Vectorised, implementors may choose to
deploy macro-op fusion targetting back-end 256-bit or greater Dynamic
SIMD ALUs for maximum performance and effectiveness.

# Analysis

This section covers an analysis of big integer operations. Use of
smaller sub-operations is a given: worst-case, addition is O(N) whilst
multiply and divide are O(N^2).

## Add and Subtract

Surprisingly, no new additional instructions are required to perform a
straightforward big-integer add or subtract. Vectorised `adde` or
`addex` is perfectly sufficient to produce an arbitrary-length
big-integer add, due to the rule set in SVP64 that all Vector Operations
are directly equivalent to the strict Program Order Execution of their
element-level operations. Thus, because sequential execution of `adde`
both consumes and produces a CA Flag, `sv.adde` is in effect an alias
for a Vectorised big-integer add. As such, implementors are entirely at
liberty to recognise Horizontal-First Vector adds and send the vector of
registers to a much larger and wider back-end ALU.

## Multiply

Multiply is tricky: 64-bit operands actually produce a 128-bit result,
which clearly cannot fit into an orthogonal register file. Most Scalar
RISC ISAs have separate `mul-low-half` and `mul-hi-half` instructions,
whilst some (OpenRISC) have "Accumulators" from which the results of the
multiply must be explicitly extracted. High-performance RISC advocates
recommend "macro-op fusion", which is in effect where the second
instruction gains access to the cached copy of the HI half of the
multiply result, which had already been computed by the first. This
approach quickly complicates the internal microarchitecture, especially
at the decode phase. Instead, Intel, in 2012, specifically added a
`mulx` instruction, allowing both HI and LO halves of the multiply to
reach registers. If done as a multiply-and-accumulate this becomes quite
an expensive operation: 3 64-bit registers in, 2 64-bit registers out.

Long-multiplication may be performed a row at a time, starting with B0:

    C4 C3 C2 C1 C0
             A0xB0
          A1xB0
       A2xB0
    A3xB0
    R4 R3 R2 R1 R0

* R0 contains C0 plus the LO half of A0 times B0
* R1 contains C1 plus the LO half of A1 times B0
  plus the HI half of A0 times B0.
* R2 contains C2 plus the LO half of A2 times B0
  plus the HI half of A1 times B0.

This would on the face of it be a 4-in operation: the upper half of a
previous multiply, two new operands to multiply, and an additional
accumulator (C). However, if C is left out (and added afterwards with a
Vector-Add) things become more manageable. We therefore propose an
operation that is 3-in, 2-out which, noting that the connection between
successive mul-adds has the UPPER half of the previous operation as its
input, writes the UPPER half of the current product into a second output
register for exactly that purpose:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RC = upperhalf(product)
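As an illustration of the intended dataflow, below is a minimal C sketch
of this 3-in, 2-out primitive used to compute one row of the
long-multiplication diagram above. It is a model only, not normative:
the helper names `madd64` and `mul_row` are invented here for clarity,
and `unsigned __int128` (a GCC/Clang extension) stands in for the
128-bit intermediate product.

    #include <inttypes.h>
    #include <stdio.h>

    /* Model of the proposed 3-in, 2-out operation: computes
       RA*RB+RC, returns the LO half, and writes the HI half back
       into RC as the carry-in for the next element.  Note that
       (2^64-1)*(2^64-1) + (2^64-1) < 2^128, so no overflow. */
    static uint64_t madd64(uint64_t ra, uint64_t rb, uint64_t *rc)
    {
        unsigned __int128 product = (unsigned __int128)ra * rb + *rc;
        *rc = (uint64_t)(product >> 64); /* upper half carried forward */
        return (uint64_t)product;        /* lower half: result digit   */
    }

    /* One row of long-multiplication, as in the diagram above:
       multiply the n-digit vector a[] by the single digit b0.
       r[] receives n+1 digits; the accumulate of C happens
       separately, e.g. with a Vectorised big-integer sv.adde. */
    static void mul_row(uint64_t *r, const uint64_t *a, uint64_t b0, int n)
    {
        uint64_t carry = 0;              /* plays the role of RC */
        for (int i = 0; i < n; i++)
            r[i] = madd64(a[i], b0, &carry);
        r[n] = carry;                    /* final upper half */
    }

    int main(void)
    {
        /* worked example: (2^128 - 1) * 3 */
        uint64_t a[2] = { 0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL };
        uint64_t r[3];
        mul_row(r, a, 3, 2);
        printf("%016" PRIx64 " %016" PRIx64 " %016" PRIx64 "\n",
               r[2], r[1], r[0]); /* ...0002 ffff...ffff ffff...fffd */
        return 0;
    }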
Successive iterations effectively use RC as a 64-bit carry, and, as
noted by Intel in their notes on `mulx`, `RA*RB+RC+RD` cannot overflow,
so does not require setting an additional CA flag. Combined with a
Vectorised big-integer `sv.adde`, the key inner loop of Knuth's
Algorithm M may be achieved in four instructions, two of which are
scalar initialisation:

    li r16, 0                      # zero accumulator
    addic r16, r16, 0              # CA to zero as well
    sv.madded r0.v, r8.v, r17, r16 # mul vector by scalar r17
    sv.adde r24.v, r24.v, r0.v     # big-add row to result

Normally, in a Scalar ISA, the use of a register as both a source and
destination like this would create costly Dependency Hazards, so such an
instruction would never be proposed. However: it turns out that, just as
with repeated chained application of `adde`, macro-op fusion may be
internally applied to a sequence of these strange multiply operations.
Such a trick works equally well in Scalar-only.

**Application of SVP64**

SVP64 has the means to mark registers as scalar or vector. However the
available space in the prefix is extremely limited (9 bits). With
effectively 5 operands (3 in, 2 out) some compromises are needed.
However a little thought gives a useful workaround: two modes,
controlled by a single bit in `RM.EXTRA`, determine whether the 5th
register, RS, is set to RC or to RT+VL. This then leaves only 4
registers to qualify as scalar/vector, and these can use four EXTRA2
designators, which fits into the available space.

RS=RT+VL Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RT+VL = upperhalf(product)

and RS=RC Mode:

    product = RA*RB+RC
    RT = lowerhalf(product)
    RS=RC = upperhalf(product)

Now there is much more potential, including setting RC to a Scalar,
which would be useful as a 64-bit Carry. RC as a Vector would produce a
Vector of the HI halves of a Vector of multiplies. RS=RT+VL Mode would
allow that same Vector of HI halves to not be an overwrite of RC. Also
it is possible to specify that any of RA, RB or RC are scalar or vector.
Overall it is extremely powerful.

## Divide

The simplest implementation of big-integer divide is the standard
schoolbook "Long Division", with a radix of 2^64 (64-bit digits)
instead of base 10. Donald Knuth's Algorithm D performs estimates
which, if wrong, are compensated for afterwards. Essentially there are
three phases:

* Calculation of the quotient estimate. This is Scalar division and can
  be ignored for the scope of this analysis.
* Big-integer multiply and subtract.
* Carry-Correction with a big-integer add, if the estimate from phase 1
  was wrong by one digit.

In essence, then, the primary focus of Vectorised big-integer divide is
in fact big-integer multiply (more specifically, mul-and-subtract):

    product = RC - (RA) * (RB)
    RT = lowerhalf(product)
    RS = upperhalf(product)
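For clarity, here is a minimal C sketch of that mul-and-subtract phase
(phase 2 of Algorithm D), the operation which the proposed `msubed`
below is intended to accelerate. It is illustrative only: the helper
name `mul_sub_row` is invented here, and `unsigned __int128` (a
GCC/Clang extension) stands in for the 128-bit intermediate.

    #include <inttypes.h>
    #include <stdio.h>

    /* Phase 2 of Knuth's Algorithm D, sketched in C: subtract
       qhat * vn[0..n-1] from the n+1 digits un[0..n], returning 1
       if the quotient estimate qhat was one too large (i.e. the
       row subtraction went negative and phase 3, the big-integer
       correction add, must be applied). */
    static int mul_sub_row(uint64_t *un, const uint64_t *vn,
                           uint64_t qhat, int n)
    {
        uint64_t borrow = 0;             /* running 64-bit borrow */
        for (int i = 0; i < n; i++) {
            /* 128-bit multiply-and-subtract on one digit */
            unsigned __int128 t = (unsigned __int128)un[i]
                                - (unsigned __int128)qhat * vn[i]
                                - borrow;
            un[i] = (uint64_t)t;         /* lower half: result digit */
            /* negated upper half is the borrow carried onward */
            borrow = -(uint64_t)(t >> 64);
        }
        unsigned __int128 t = (unsigned __int128)un[n] - borrow;
        un[n] = (uint64_t)t;
        return (t >> 64) != 0;           /* non-zero: qhat too large */
    }

    int main(void)
    {
        uint64_t un[2] = { 5, 0 };  /* dividend digits, value 5  */
        uint64_t vn[1] = { 10 };    /* divisor digit,   value 10 */
        /* qhat=1 is deliberately one too large for 5/10, so the
           phase-3 correction is reported as needed: */
        printf("correction needed: %d\n", mul_sub_row(un, vn, 1, 1));
        return 0;
    }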
# Instructions

**DRAFT**

Both `madded` and `msubed` are VA-Form:

|0.....5|6..10|11..15|16..20|21..25|26..31|
|-------|-----|------|------|------|------|
| EXT04 | RT  | RA   | RB   | RC   | XO   |

For the Opcode map (XO Field) see Power ISA v3.1, Book III, Appendix D,
Table 13 (sheet 7 of 8), p1357. Proposed is the addition of `msubed`
(**DRAFT, NOT APPROVED**) at XO `110110`; a corresponding `madded` is
proposed at XO `110010`:

| 110000 | 110001  | 110010 | 110011 | 110100 | 110101 | 110110 | 110111 |
| ------ | ------- | ------ | ------ | ------ | ------ | ------ | ------ |
| maddhd | maddhdu | madded | maddld | rsvd   | rsvd   | msubed | rsvd   |

For the SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
used, with an additional bit determining RS:

| Field Name    | Field bits | Description                          |
|---------------|------------|--------------------------------------|
| Rdest\_EXTRA2 | `10:11`    | extends RT (R\*\_EXTRA2 Encoding)    |
| Rsrc1\_EXTRA2 | `12:13`    | extends RA (R\*\_EXTRA2 Encoding)    |
| Rsrc2\_EXTRA2 | `14:15`    | extends RB (R\*\_EXTRA2 Encoding)    |
| Rsrc3\_EXTRA2 | `16:17`    | extends RC (R\*\_EXTRA2 Encoding)    |
| EXTRA2_MODE   | `18`       | used by `msubed` and `madded` for RS |

When `EXTRA2_MODE` is set to zero, the implicit RS register takes its
Vector/Scalar setting from Rdest_EXTRA2, and takes the register number
from RT, but with all numbering offset by VL. *Note that element-width
overrides influence this offset* (see the SVP64 [[svp64/appendix]] for
full details).

When `EXTRA2_MODE` is set to one, the implicit RS register is identical
to RC, extended to SVP64 numbering, including whether RC is set Scalar
or Vector.

The pseudocode for `msubed RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sub[0:127] = EXTZ(RC) - prod
    RT <- sub[64:127]
    RS <- sub[0:63] # RS is either RC or RT+VL

Note that RC is zero-extended rather than sign-extended: in a Vector
Loop it contains the top half of the previous multiply-with-subtract,
and the current product must be subtracted from it.

The pseudocode for `madded RT, RA, RB, RC` is:

    prod[0:127] = (RA) * (RB)
    sum[0:127] = EXTZ(RC) + prod
    RT <- sum[64:127]
    RS <- sum[0:63] # RS is either RC or RT+VL

Again RC is zero-extended (not shifted), and the 128-bit product is
added to it; the lower half of the result is stored in RT and the upper
half in RS. The difference from `maddhdu` is that `maddhdu` stores the
upper half in RT, where `madded` stores the upper half in RS. There is
no equivalent to `maddld`, because `maddld` performs sign-extension
on RC.

# Appendix

see [[appendix]]