(no commit message)

[libreriscv.git] / openpower / sv / biginteger.mdwn
diff --git a/openpower/sv/biginteger.mdwn b/openpower/sv/biginteger.mdwn

index 168db13a03f0badb213d16e43dd6da76d8e1d3ae..c84971b82731f7fadf7de77a34abcbef480c43df 100644 (file)
--- a/openpower/sv/biginteger.mdwn
+++ b/openpower/sv/biginteger.mdwn
@@ -2,9 +2,13 @@
  
  # Big Integer Arithmetic
  
-**DRAFT STATUS** 19apr2021
+**DRAFT STATUS** 19apr2022, last edited 23may2022
  
-(see [[discussion]] page for notes)
+* [[discussion]] page for notes
+* <https://bugs.libre-soc.org/show_bug.cgi?id=817> bugreport
+* <https://bugs.libre-soc.org/show_bug.cgi?id=937> 128/64 shifts
+* [[biginteger/analysis]]
+* [[openpower/isa/svfixedarith]] pseudocode
  
  BigNum arithmetic is extremely common especially in cryptography,
  where for example RSA relies on arithmetic of 2048 or 4096 bits
@@ -13,7 +17,7 @@ in length.  The primary operations are add, multiply and divide
  
  A reminder that a particular focus of SVP64 is that it is built on
  top of Scalar operations, where those scalar operations are useful in
-their own right without SVP64. Thus the operstions here are proposed
+their own right without SVP64. Thus the operations here are proposed
  first as Scalar Extensions to the Power ISA.
  
  A secondary focus is that if Vectorised, implementors may choose
@@ -23,31 +27,125 @@ Dynamic SIMD ALUs for maximum performance and effectiveness.
  # Analysis
  
  Covered in [[biginteger/analysis]] the summary is that standard `adde`
-is sufficient for SVP64 Vectorisation of big-integer addition (and subfe
-for subtraction) but that big-integer multiply and divide require two
-extra 3-in 2-out instructions, similar to Intel's `mulx`, to be efficient.
+is sufficient for SVP64 Vectorisation of big-integer addition (and `subfe`
+for subtraction) but that big-integer shift, multiply and divide require an
+extra 3-in 2-out instructions, similar to Intel's
+[shld](https://www.felixcloutier.com/x86/shld)
+and [shrd](https://www.felixcloutier.com/x86/shrd),
+[mulx](https://www.felixcloutier.com/x86/mulx) and
+[divq](https://www.felixcloutier.com/x86/div),
+to be efficient.
+The same instruction (`maddedu`) is used in both
+big-divide and big-multiply because 'maddedu''s primary
+purpose is to perform a fused 64-bit scalar multiply with a large vector,
+where that result is Big-Added for Big-Multiply, but Big-Subtracted for
+Big-Divide.
+
+Chaining the operations together gives Scalar-by-Vector 
+operations, except for `sv.adde` and `sv.subfe` which are
+Vector-by-Vector Chainable (through the `CA` flag).
  Macro-op Fusion and back-end massively-wide SIMD ALUs may be deployed in a
  fashion that is hidden from the user, behind a consistent, stable ISA API.
+The same macro-op fusion may theoretically be deployed even on Scalar
+operations.
  
-# Instructions
+# **DRAFT** dsld
+
+|0.....5|6..10|11..15|16..20|21.25|26..30|31|
+|-------|-----|------|------|-----|------|--|
+| EXT04 | RT  |  RA  |  RB  | RC  |  XO  |Rc|
+
+VA2-Form
+
+* dsld    RT,RA,RB,RC  (Rc=0)
+* dsld.   RT,RA,RB,RC  (Rc=1)
+
+Pseudo-code:
+
+    n <- (RB)[58:63]
+    v <- ROTL64((RA), n)
+    mask <- MASK(0, 63-n)
+    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
+    RS <- v[0:63] & ¬mask
+    overflow = 0
+    if RS != [0]*64:
+        overflow = 1
+
+Special Registers Altered:
+
+    CR0                    (if Rc=1)
+
+# **DRAFT** dsrd
+
+|0.....5|6..10|11..15|16..20|21.25|26..30|31|
+|-------|-----|------|------|-----|------|--|
+| EXT04 | RT  |  RA  |  RB  | RC  |  XO  |Rc|
+
+VA2-Form
+
+* dsrd    RT,RA,RB,RC  (Rc=0)
+* dsrd.   RT,RA,RB,RC  (Rc=1)
+
+Pseudo-code:
+
+    n <- (RB)[58:63]
+    v <- ROTL64((RA), 64-n)
+    mask <- MASK(n, 63)
+    RT <- (v[0:63] & mask) | ((RC) & ¬mask)
+    RS <- v[0:63] & ¬mask
+    overflow = 0
+    if RS != [0]*64:
+        overflow = 1
+
+Special Registers Altered:
+
+    CR0                    (if Rc=1)
+
+
+# maddedu
  
  **DRAFT**
  
-Both `madded` and `msubed` are VA-Form:
+`maddedu` is similar to v3.0 `madd`, and
+is VA-Form despite having 2 outputs: the second
+destination register is implicit.
  
  |0.....5|6..10|11..15|16..20|21..25|26..31|
  |-------|-----|------|------|------|------|
  | EXT04 | RT  |  RA  |  RB  |   RC |  XO  |
  
-For the Opcode map (XO Field)
-see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
-Proposed is the addition of `msubed` (**DRAFT, NOT APPROVED**) which is
-in `110110`.  A corresponding `madded` is proposed for `110010`
+The pseudocode for `maddedu RT, RA, RB, RC` is:
  
-| 110000 | 110001  | 110010 | 110011 | 110100 | 110101 | 110110 | 110111 |
-| ------ | ------- | ------ | ------ | ------ | ------ | ------ | ------ |
-| maddhd | maddhdu | madded | maddld | rsvd   | rsvd   | msubed | rsvd   |
+    prod[0:127] = (RA) * (RB)
+    sum[0:127] = EXTZ(RC) + prod
+    RT <- sum[64:127]
+    RS <- sum[0:63] # RS implicit register, see below
+
+RC is zero-extended (not shifted, not sign-extended), the 128-bit product added
+to it; the lower half of that result stored in RT and the upper half
+in RS.
  
+The differences here to `maddhdu` are that `maddhdu` stores the upper
+half in RT, where `maddedu` stores the upper half in RS.
+
+The value stored in RT is exactly equivalent to `maddld` despite `maddld`
+performing sign-extension on RC, because RT is the full mathematical result
+modulo 2^64 and sign/zero extension from 64 to 128 bits produces identical
+results modulo 2^64. This is why there is no maddldu instruction.
+
+*Programmer's Note:
+As a Scalar Power ISA operation, like `lq` and `stq`, RS=RT+1.
+To achieve the same big-integer rolling-accumulation effect
+as SVP64: assuming the scalar to multiply is in r0, 
+the vector to multiply by starts at r4 and the result vector
+in r20, instructions may be issued `maddedu r20,r4,r0,r20
+maddedu r21,r5,r0,r21` etc. where the first `maddedu` will have
+stored the upper half of the 128-bit multiply into r21, such
+that it may be picked up by the second `maddedu`. Repeat inline
+to construct a larger bigint scalar-vector multiply,
+as Scalar GPR register file space permits.*
+
+SVP64 overrides the Scalar behaviour of what defines RS.
  For SVP64 EXTRA register extension, the `RM-1P-3S-1D` format is
  used with the additional bit set for determining RS.
  
@@ -57,45 +155,74 @@ used with the additional bit set for determining RS.
  | Rsrc1\_EXTRA2 | `12:13` | extends RA (R\*\_EXTRA2 Encoding)   |
  | Rsrc2\_EXTRA2 | `14:15` | extends RB (R\*\_EXTRA2 Encoding)   |
  | Rsrc3\_EXTRA2 | `16:17` | extends RC (R\*\_EXTRA2 Encoding)   |
-| EXTRA2_MODE   | `18`    | used by `msubed` and `madded` for RS   |
+| EXTRA2_MODE   | `18`    | used by `maddedu` for determining RS |
  
  When `EXTRA2_MODE` is set to zero, the implicit RS register takes
  its Vector/Scalar setting from Rdest_EXTRA2, and takes
  the register number from RT, but all numbering
-is offset by VL. *Note that element-width overrides influence this
+is offset by MAXVL. *Note that element-width overrides influence this
  offset* (see SVP64 [[svp64/appendix]] for full details).
  
  When `EXTRA2_MODE` is set to one, the implicit RS register is identical
-to RC extended to SVP64 numbering, including whether RC is set Scalar or
+to RC extended with SVP64 using `Rsrc3_EXTRA2` in every respect, including whether RC is set Scalar or
  Vector.
  
-## msubed
-
-The pseudocode for `msubed RT, RA, RB, RC`` is:
-
-    prod[0:127] = (RA) * (RB)
-    sub[0:127] = EXTZ(RC) - prod
-    RT <- sub[64:127]
-    RS <- sub[0:63] # RS is either RC or RT+VL
-
-Note that RC is not sign-extended to 64-bit.  In a Vector Loop
-it contains the top half of the previous multiply-with-subtract,
-and the current product must be subtracted from it.
+# divmod2du RT,RA,RB,RC
  
-## madded
+**DRAFT**
  
-The pseudocode for `madded RT, RA, RB, RC` is:
+Divide/Modulu Quad-Double Unsigned is another VA-Form instruction
+that is near-identical to `divdeu` except that:
+
+* the lower 64 bits of the dividend, instead of being zero, contain a
+  register, RC.
+* it performs a fused divide and modulo in a single instruction, storing
+  the modulo in an implicit RS (similar to `maddedu`)
+
+RB, the divisor, remains 64 bit.  The instruction is therefore a 128/64
+division, producing a (pair) of 64 bit result(s), in the same way that
+Intel [divq](https://www.felixcloutier.com/x86/div) works.
+Overflow conditions
+are detected in exactly the same fashion as `divdeu`, except that rather
+than have `UNDEFINED` behaviour, RT is set to all ones and RS set to all
+zeros on overflow.
+
+*Programmer's note: there are no Rc variants of any of these VA-Form
+instructions. `cmpi` will need to be used to detect overflow conditions:
+the saving in instruction count is that both RT and RS will have already
+been set to useful values (all 1s and all zeros respectively)
+needed as part of implementing Knuth's
+Algorithm D*
+
+For SVP64, given that this instruction is also 3-in 2-out 64-bit registers,
+the exact same EXTRA format and setting of RS is used as for `sv.maddedu`.
+For Scalar usage, just as for `maddedu`, `RS=RT+1` (similar to `lq` and `stq`).
+
+Pseudo-code:
+
+    if ((RA) <u (RB)) & ((RB) != [0]*XLEN) then
+        dividend[0:(XLEN*2)-1] <- (RA) || (RC)
+        divisor[0:(XLEN*2)-1] <- [0]*XLEN || (RB)
+        result <- dividend / divisor
+        modulo <- dividend % divisor
+        RT <- result[XLEN:(XLEN*2)-1]
+        RS <- modulo[XLEN:(XLEN*2)-1]
+    else
+        RT <- [1]*XLEN
+        RS <- [0]*XLEN
+
+# [DRAFT] EXT04 Proposed Map
  
-    prod[0:127] = (RA) * (RB)
-    sum[0:127] = EXTZ(RC) + prod
-    RT <- sum[64:127]
-    RS <- sum[0:63] # RS is either RC or RT+VL
+For the Opcode map (XO Field)
+see Power ISA v3.1, Book III, Appendix D, Table 13 (sheet 7 of 8), p1357.
+Proposed is the addition of:
  
-Again RC is zero-extended (not shifted), the 128-bit product added
-to it; the lower half of the result stored in RT and the upper half
-in RS.
+* `maddedu` in `110010`
+* `divmod2du` in `111010`
+* `pcdec` in `111000`
  
-The differences here to `maddhdu` are that `maddhdu` stores the upper
-half in RT, where `madded` stores the upper half in RS. There is no
-equivalent to `maddld` because `maddld` performs sign-extension on RC.
+|v >|   000|   001 |   010    |   011|   100  |   101  |   110   |   111  |
+|---|------|-------|----------|------|--------|--------|---------|--------|
+|110|maddhd|maddhdu|maddedu   |maddld|rsvd    |rsvd    |rsvd     |rsvd    |
+|111|pcdec.|rsvd   |divmod2du |vpermr|vaddequm|vaddecuq|vsubeuqm |vsubecuq|