add twin butterfly page (stub)

diff --git a/openpower/sv/overview.mdwn b/openpower/sv/overview.mdwn

index d9d65fef777fa93af6c3fc774f5abc2e4fcf4e39..a17afdcb001407765de228a0309f7ef77b234679 100644
--- a/openpower/sv/overview.mdwn
+++ b/openpower/sv/overview.mdwn
@@ -45,7 +45,8 @@ more problematic with each power of two SIMD width increase introduced
  through an ISA revision.  The opcode proliferation, at O(N^6), inexorably
  spirals out of control in the ISA, detrimentally impacting the hardware,
  the software, the compilers and the testing and compliance.  Here are
-the typical dimensions that result in such massive proliferation:
+the typical dimensions that result in such massive proliferation,
+based on mass-volume DSPs and Micro-Processors:
  
  * Operation (add, mul)
  * bitwidth (8, 16, 32, 64, 128)
@@ -77,6 +78,8 @@ a register file size
  increase using "tagging" (similar to how x86 originally extended
  registers from 32 to 64 bit).
  
+![Single-Issue concept](/openpower/svp64-primer/img/power_pipelines.svg){ width=40% height=20% }
+
  ## SV
  
  The fundamentals are (just like x86 "REP"):
@@ -93,7 +96,7 @@ The fundamentals are (just like x86 "REP"):
  * Once the loop is completed *only then* is the Program Counter
    allowed to move to the next instruction.
  
-![image](/svp64-primer/img/power_pipelines.svg)
+![Multi-Issue with Predicated SIMD back-end ALUs](/openpower/svp64-primer/img/sv_multi_issue.svg){ width=40% height=40% }
  
  Hardware (and simulator) implementors are free and clear to implement this
  as literally a for-loop, sitting in between instruction decode and issue.
@@ -101,8 +104,16 @@ Higher performance systems may deploy SIMD backends, multi-issue and
  out-of-order execution, although it is strongly recommended to add
  predication capability directly into SIMD backend units.
  
-In Power ISA v3.0B pseudo-code form, an ADD operation, assuming both
-source and destination have been "tagged" as Vectors, is simply:
+A typical Cray-style Scalable Vector ISA (where a SIMD one has a fixed
+non-negotiable static parameter instead of a runtime-dynamic VL)
+performs its arithmetic as:
+
+    for i = 0 to VL-1:
+         VPR(RT)[i] = VPR(RA)[i] + VPR(RB)[i]
+
+In Power ISA v3.0B pseudo-code form, an ADD operation in Simple-V,
+assuming both source and destination have been "tagged" as Vectors,
+is simply:
  
      for i = 0 to VL-1:
           GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
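
Reviewer note: the Simple-V loop in this hunk can be sketched in Python as below. This is only an illustration of the pseudo-code, not the project's simulator. The name `sv_add`, the flat 128-entry list standing in for the register file, and the 64-bit wrap-around are all assumptions made for the example.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF  # assumption: 64-bit registers, wrap on overflow


def sv_add(gpr, RT, RA, RB, VL):
    """Hypothetical helper: apply a scalar add to VL consecutive registers."""
    for i in range(VL):
        # same shape as the pseudo-code: GPR(RT+i) = GPR(RA+i) + GPR(RB+i)
        gpr[RT + i] = (gpr[RA + i] + gpr[RB + i]) & MASK64
    return gpr


# assumption: a flat 128-entry register file, all zero to start with
gpr = [0] * 128
gpr[8:12] = [1, 2, 3, 4]
gpr[16:20] = [10, 20, 30, 40]
sv_add(gpr, RT=24, RA=8, RB=16, VL=4)
print(gpr[24:28])
```

Unlike the Cray-style form, there is no separate vector register file here: the "vector" is simply a run of scalar registers starting at the register number given in the instruction.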
@@ -159,7 +170,6 @@ The rest of this document builds on the above simple loop to add:
  * Compacted operations into registers (normally only provided by SIMD)
  * Fail-on-first (introduced in ARM SVE2)
  * A new concept: Data-dependent fail-first
-* Condition-Register based *post-result* predication (also new)
  * A completely new concept: "Twin Predication"
  * vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
  
@@ -265,7 +275,7 @@ effectively the same as "no predicate".
  
      function op_add(RT, RA, RB) # add not VADD!
        int id=0, irs1=0, irs2=0;
-      predval = get_pred_val(FALSE, rd);
+      predval = get_pred_val(FALSE, RT); # dest mask
        for i = 0 to VL-1:
          if (predval & 1<<i) # predication bit test
             ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
@@ -320,7 +330,7 @@ into account:
  
      function op_add(RT, RA, RB) # add not VADD!
        int id=0, irs1=0, irs2=0;
-      predval = get_pred_val(FALSE, rd);
+      predval = get_pred_val(FALSE, RT); # dest pred
        for i = 0 to VL-1:
          if (predval & 1<<i) # predication bit test
             ireg[RT+id] <= ireg[RA+irs1] + ireg[RB+irs2];
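
Reviewer note: a minimal Python sketch of the predication bit test shown in this hunk, assuming all three operands are Vectors so that the `id`, `irs1` and `irs2` counters all advance together with `i`. The function name, the integer bitmask argument and the choice to leave masked-out destination elements untouched are assumptions for illustration; how `get_pred_val` obtains the mask is not modelled.

```python
def sv_add_predicated(ireg, RT, RA, RB, VL, predval):
    """Hypothetical helper: element i is computed only if bit i of predval is set."""
    for i in range(VL):
        if predval & (1 << i):  # predication bit test
            ireg[RT + i] = ireg[RA + i] + ireg[RB + i]
        # assumption: masked-out destination elements are left unchanged
    return ireg


ireg = list(range(64))  # register n initially holds the value n
sv_add_predicated(ireg, RT=32, RA=0, RB=8, VL=4, predval=0b0101)
print(ireg[32:36])
```

With the mask `0b0101` only elements 0 and 2 are written; elements 1 and 3 keep their previous contents.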
@@ -381,7 +391,7 @@ This means that Vector elements start from locations specified by 64 bit
  "register" but that from that location onwards the elements *overlap
  subsequent registers*.
  
-![image](/svp64-primer/img/svp64_regs.svg){ width=40% }
+![image](/openpower/svp64-primer/img/svp64_regs.svg){ width=40% height=40% }
  
  Here is another way to view the same concept, bearing in mind that it
  is assumed a LE memory order:
@@ -792,49 +802,6 @@ tricky because it typically does not exist in standard scalar ISAs.
  If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
  VGATHER/VSCATTER.
  
-# CR predicate result analysis
-
-Power ISA has Condition Registers.  These store an analysis of the result
-of an operation to test it for being greater, less than or equal to zero.
-What if a test could be done, similar to branch BO testing, which hooked
-into the predication system?
-
-    for i in range(VL):
-        # predication test, skip all masked out elements.
-        if predicate_masked_out(i): continue # skip
-        result = op(iregs[RA+i], iregs[RB+i])
-        CRnew = analyse(result) # calculates eq/lt/gt
-        # Rc=1 always stores the CR
-        if RC1 or Rc=1: crregs[offs+i] = CRnew
-        if RC1: continue # RC1 mode skips result store
-        # now test CR, similar to branch
-        if CRnew[BO[0:1]] == BO[2]:
-            # result optionally stored but CR always is
-            iregs[RT+i] = result
-
-Note that whilst the Vector of CRs is always written to the CR regfile,
-only those result elements that pass the BO test get written to the
-integer regfile (when RC1 mode is not set).  In RC1 mode the CR is always
-stored, but the result never is. This effectively turns every arithmetic
-operation into a type of `cmp` instruction.
-
-Here for example if FP overflow occurred, and the CR testing was carried
-out for that, all valid results would be stored but invalid ones would
-not, but in addition the Vector of CRs would contain the indicators of
-which ones failed.  With the invalid results being simply not written
-this could save resources (save on register file writes).
-
-Also expected is, due to the fact that the predicate mask is effectively
-ANDed with the post-result analysis as a secondary type of predication,
-that there would be savings to be had in some types of operations where
-the post-result analysis, if not included in SV, would need a second
-predicate calculation followed by a predicate mask AND operation.
-
-Note, hilariously, that Vectorised Condition Register Operations (crand,
-cror) may also have post-result analysis applied to them.  With Vectors
-of CRs being utilised *for* predication, possibilities for compact and
-elegant code begin to emerge from this innocuous-looking addition to SV.
-
  # Exception-based Fail-on-first
  
  One of the major issues with Vectorised LD/ST operations is when a
@@ -887,9 +854,7 @@ Note: see <https://bugs.libre-soc.org/show_bug.cgi?id=561>
  
  # Data-dependent fail-first
  
-This is a minor variant on the CR-based predicate-result mode.  Where
-pred-result continues with independent element testing (any of which may
-be parallelised), data-dependent fail-first *stops* at the first failure:
+Data-dependent fail-first *stops* at the first failure:
  
      if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
      for i in range(VL):
@@ -899,7 +864,7 @@ be parallelised), data-dependent fail-first *stops* at the first failure:
          CRnew = analyse(result) # calculates eq/lt/gt
          # now test CR, similar to branch
          if CRnew[BO[0:1]] != BO[2]:
-            VL = i # truncate: only successes allowed
+            VL = i+VLi # truncate: only successes allowed
              break
          # test passed: store result (and CR?)
          if not RC1: iregs[RT+i] = result
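
Reviewer note: the truncation behaviour in this hunk, including the new `VL = i+VLi` line, can be sketched as follows. This is a simplified illustration under stated assumptions: the CR/BO test is replaced by a caller-supplied `passes` function, predicate masking and RC1 mode are left out, and the failing element's result is not stored because the hunk does not show a store on that path. All names are hypothetical.

```python
def ddffirst(iregs, RT, RA, RB, VL, op, passes, vli=0):
    """Hypothetical helper: run op element by element, stop at the first failing test.

    Returns the (possibly truncated) vector length.
    """
    for i in range(VL):
        result = op(iregs[RA + i], iregs[RB + i])
        if not passes(result):
            # truncate: vli=1 counts the failing element in the new VL
            return i + vli
        iregs[RT + i] = result  # test passed: store result
    return VL


iregs = [0] * 32
iregs[0:4] = [5, 6, 7, 8]
iregs[8:12] = [1, 2, 7, 3]
# subtract and stop at the first zero result (stand-in for a CR.eq test)
new_vl = ddffirst(iregs, RT=16, RA=0, RB=8, VL=4,
                  op=lambda a, b: a - b, passes=lambda r: r != 0)
print(new_vl, iregs[16:20])
```

Element 2 produces zero, so the loop stops there and the returned length is 2; calling with `vli=1` would return 3 instead. Element 3 is never computed, which is why this mode cannot be parallelised as freely as ordinary predication.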
@@ -913,12 +878,12 @@ that tested for the possibility of that failure, in advance of doing
  the actual calculation.
  
  The only minor downside here though is the change to VL, which in some
-implementations may cause pipeline stalls.  This was one of the reasons
-why CR-based pred-result analysis was added, because that at least is
-entirely paralleliseable.
+implementations may cause pipeline stalls.
  
  # Vertical-First Mode
  
+![image](/openpower/sv/sv_horizontal_vs_vertical.svg){ width=40% height=40% }
+
  This is a relatively new addition to SVP64 under development as of
  July 2021.  Where Horizontal-First is the standard Cray-style for-loop,
  Vertical-First typically executes just the **one** scalar element
@@ -937,8 +902,6 @@ loop:
    beq loop
  ```
  
-![image](/openpower/sv/sv_horizontal_vs_vertical.svg)
-
  Three examples are illustrated of different types of Scalar-Vector
  operations. Note that in its simplest form  **only one** element is
  executed per instruction **not** multiple elements per instruction.