add twin butterfly page (stub)

diff --git a/openpower/sv/vector_ops.mdwn b/openpower/sv/vector_ops.mdwn

index a01b4f1a5ff1475e7f64fee7458b8239909bda7d..b2d526af69e34d43f8016980e1b6240c5f3c9c88 100644
--- a/openpower/sv/vector_ops.mdwn
+++ b/openpower/sv/vector_ops.mdwn
@@ -1,6 +1,6 @@
  [[!tag standards]]
  
-# SV Vector Operations.
+# SV Vector-assist Operations.
  
  Links:
  
@@ -15,26 +15,26 @@ Links:
    contains pseudocode for sof, sif, sbf
  * <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>
  
-The core Power ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism.
-Therefore there are not that many cases where *actual* Vector
-instructions are needed. If they are, they are more "assistance"
-functions.  Two traditional Vector instructions were initially
-considered (conflictd and vmiota) however they may be synthesised
-from existing SVP64 instructions: vmiota may use [[svstep]].
-Details in [[discussion]]
+The core Power ISA was designed as scalar: SV provides a level of
+abstraction to add variable-length element-independent parallelism.
+Therefore there are not that many cases where *actual* Vector instructions
+are needed. If they are, they are more "assistance" functions.  Two
+traditional Vector instructions were initially considered (conflictd and
+vmiota) however they may be synthesised from existing SVP64 instructions:
+vmiota may use [[svstep]].  Details in [[discussion]]
  
  Notes:
  
-* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
-* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations.  See [[sv/cr_int_predication]].
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct,
+  normalise) are out of scope: this document is for more general-purpose
+  instructions that underpin and are critical to general-purpose Vector
+  workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as
+  predicate masks are covered separately, by crweird operations.
+  See [[sv/cr_int_predication]].
  
-# Mask-suited Bitmanipulation
+## Mask-suited Bitmanipulation
  
-Based on RVV masked set-before-first, set-after-first etc.
-and Intel and AMD Bitmanip instructions made generalised then
-advanced further to include masks, this is a single instruction
-covering 24 individual instructions in other ISAs.
-*(sbf/sof/sif moved to [[discussion]])*
  
  BM2-Form
  
@@ -44,102 +44,72 @@ BM2-Form
  
  * bmask RS,RA,RB,bm,L
  
-The patterns within the pseudocode for AMD TBM and x86 BMI1 are
-as follows:
+Pseudo-code:
  
-* first pattern A: `x / ~x`
-* second pattern B: `| / & / ^`
-* third pattern C: `x+1 / x-1 / ~(x+1) / (~x)+1`
+```
+    if _RB = 0 then mask <- [1] * XLEN
+    else            mask <- (RB)
+    ra <- (RA) & mask
+    a1 <- ra
+    if bm[4] = 0 then a1 <- ¬ra
+    mode2 <- bm[2:3]
+    if mode2 = 0 then a2 <- (¬ra)+1
+    if mode2 = 1 then a2 <- ra-1
+    if mode2 = 2 then a2 <- ra+1
+    if mode2 = 3 then a2 <- ¬(ra+1)
+    a1 <- a1 & mask
+    a2 <- a2 & mask
+    # select operator
+    mode3 <- bm[0:1]
+    if mode3 = 0 then result <- a1 | a2
+    if mode3 = 1 then result <- a1 & a2
+    if mode3 = 2 then result <- a1 ^ a2
+    if mode3 = 3 then result <- undefined([0]*XLEN)
+    # mask output
+    result <- result & mask
+    # optionally restore masked-out bits
+    if L = 1 then
+        result <- result | (RA & ¬mask)
+    RT <- result
+```
+
+* first pattern A: two options `x` or `~x`
+* second pattern B: three options `|` `&` or `^`
+* third pattern C: four options `x+1`, `x-1`, `~(x+1)` or `(~x)+1`
  
-Thus it makes sense to create a single instruction
-that covers all of these.  A crucial addition that is essential
-for Scalable Vector usage as Predicate Masks, is the second mask parameter
-(RB).  The additional parameter, L, if set, will leave bits of RA masked
-by RB unaltered, otherwise those bits are set to zero. Note that when `RB=0`
-then instead of reading from the register file the mask is set to all ones.
  
-Executable pseudocode demo:
+The lower two bits of `bm` set to 0b11 are `RESERVED`. An illegal instruction
+trap must be raised.
+
+Special Registers Altered:
  
  ```
-[[!inline pages="openpower/sv/bmask.py" quick="yes" raw="yes" ]]
+    None
  ```
  
-# Carry-lookahead
+## Carry-lookahead
  
  As a single scalar 32-bit instruction, up to 64 carry-propagation bits
  may be computed.  When the output is then used as a Predicate mask it can
  be used to selectively perform the "add carry" of biginteger math, with
  `sv.addi/sm=rN RT.v, RA.v, 1`.
  
-* cprop RT,RA,RB
-* cprop. RT,RA,RB
+* cprop RT,RA,RB (Rc=0)
+* cprop. RT,RA,RB (Rc=1)
  
  pseudocode:
  
+```
      P = (RA)
      G = (RB)
      RT = ((P|G)+G)^P 
+```
  
  X-Form
  
-| 0.5|6.10|11.15|16.20| 21..30     |31| name      |  Form   |
+| 0:5|6:10|11:15|16:20| 21:30      |31| name      |  Form   |
  | -- | -- | --- | --- | ---------  |--| ----      | ------- |
-| NN | RT | RA  | RB  | 0110001110 |Rc|     cprop | X-Form  |
+| PO | RT | RA  | RB  | XO         |Rc|     cprop | X-Form  |
  
  Used not just for carry lookahead, but also as a special type of predication mask operation.
  
-* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
-* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
-* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
-* <https://i.stack.imgur.com/QSLKY.png>
-* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
-  `((P|G)+G)^P`
-* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
-
-From QSLKY.png:
-
-```
-    x0 = nand(CIn, P0)
-    C0 = nand(x0, ~G0)
-
-    x1 = nand(CIn, P0, P1)
-    y1 = nand(G0, P1)
-    C1 = nand(x1, y1, ~G1)
-
-    x2 = nand(CIn, P0, P1, P2)
-    y2 = nand(G0, P1, P2)
-    z2 = nand(G1, P2)
-    C2 = nand(x2, y2, z2, ~G2)
-
-    # Gen*
-    x3 = nand(G0, P1, P2, P3)
-    y3 = nand(G1, P2, P3)
-    z3 = nand(G2, P3)
-    G* = nand(x3, y3, z3, ~G3)
-```
-
-```
-     P = (A | B) & Ci
-     G = (A & B)
-```
-
-Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here).  The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
-
-```
-    At each id, compute C[id] = A[id]+B[id]+0
-    Get G[id] = C[id] > radix -1
-    Get P[id] = C[id] == radix-1
-    Join all P[id] together, likewise G[id]
-    Compute newC = ((P|G)+G)^P
-    result[id] = (C[id] + newC[id]) % radix
-```   
-
-two versions: scalar int version and CR based version.
-
-scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument.  the end bits go into XER.CA and CR0.ge
-
-vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
-
-if zero (no propagation) then CR0.eq is zero
-
-CR based version, TODO.
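
Reviewer note: the two pseudocode listings touched by this diff (`bmask` and `cprop`) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the project's reference model (the removed `bmask.py` is not reproduced here). Assumptions: `XLEN` is 64; the three mode fields are passed as separate hypothetical parameters (`bm4`, `mode2`, `mode3`) rather than decoded from `bm`, because the bit numbering of `bm` is not restated in this diff; `mask=None` stands in for the `_RB = 0` case; and `add_limbs` is a hypothetical helper following the per-element steps listed in the removed text.

```python
XLEN = 64                       # assumed register width
ALL_ONES = (1 << XLEN) - 1


def bmask(ra_val, mask=None, *, bm4, mode2, mode3, L=0):
    """Sketch of the bmask pseudocode; mask=None models register number RB = 0."""
    if mask is None:
        mask = ALL_ONES
    ra = ra_val & mask
    # pattern A: x or ~x
    a1 = ra if bm4 else (~ra & ALL_ONES)
    # pattern C: (~x)+1, x-1, x+1 or ~(x+1), truncated to XLEN bits
    a2 = [(~ra) + 1, ra - 1, ra + 1, ~(ra + 1)][mode2] & ALL_ONES
    a1 &= mask
    a2 &= mask
    # pattern B: | & or ^ (the fourth encoding is reserved)
    if mode3 == 0:
        result = a1 | a2
    elif mode3 == 1:
        result = a1 & a2
    elif mode3 == 2:
        result = a1 ^ a2
    else:
        raise ValueError("reserved operator encoding")
    result &= mask
    if L:
        # restore the bits of RA that the mask excluded
        result |= ra_val & ~mask & ALL_ONES
    return result


def cprop(p, g):
    """RT = ((P|G)+G)^P, truncated to XLEN bits (truncation is an assumption)."""
    return (((p | g) + g) ^ p) & ALL_ONES


def add_limbs(a, b, radix):
    """Add two little-endian digit lists of equal length (fewer than XLEN digits)."""
    c = [x + y for x, y in zip(a, b)]
    # G: this digit overflows on its own; P: it overflows only if a carry arrives
    g = sum(1 << i for i, v in enumerate(c) if v > radix - 1)
    p = sum(1 << i for i, v in enumerate(c) if v == radix - 1)
    new_c = cprop(p, g)         # bit i is the carry into digit i
    digits = [(v + ((new_c >> i) & 1)) % radix for i, v in enumerate(c)]
    carry_out = (new_c >> len(c)) & 1
    return digits, carry_out


if __name__ == "__main__":
    # x & -x isolates the lowest set bit (the BMI1 blsi pattern)
    print(bin(bmask(0b10100, bm4=1, mode2=0, mode3=1)))
    # 199 + 1 in radix 10, least-significant digit first
    print(add_limbs([9, 9, 1], [1, 0, 0], radix=10))
```

With these parameter choices, `bm4=1, mode2=0, mode3=1` corresponds to `x & ((~x)+1)` and `bm4=1, mode2=1, mode3=1` to `x & (x-1)`. In `add_limbs`, each bit of the `cprop` result is used as the predicate that decides whether a digit receives the extra `+1`, which mirrors the `sv.addi/sm=rN` usage described in the Carry-lookahead section.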