XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorized
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 1 | 2 | sz=1 but dz=0: dst skips mask[1], src does not |
| 2 | 3 | mask[src=2] and mask[dst=3] are 1 |
| 3 | end | loop has ended because dst reached VL-1 |
Example 2:
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 1 | sz=0 but dz=1: src skips mask[1], dst does not |
| 3 | 2 | mask[src=3] and mask[dst=2] are 1 |
| end | 3 | loop has ended because src reached VL-1 |
In both these examples it is crucial to note that despite there being
a single predicate mask, with sz and dz being different, srcstep and
| 3 | 3 | mask[src=3] and mask[dst=3] are 1 |
| end | end | loop has ended because src and dst reached VL-1 |
Here, both srcstep and dststep remain in lockstep because sz=dz=0
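The stepping behaviour in the examples above can be sketched in plain Python (an illustrative model only, not the spec's pseudocode; `mask` is an integer bitmask and the function name is invented):

```python
def twin_pred_steps(mask, vl, sz, dz):
    """Yield (srcstep, dststep) pairs for one twin-predicated loop.
    With zeroing on (sz=1 or dz=1) the corresponding step advances
    through every element; with zeroing off, masked-out elements
    are skipped entirely. Illustrative sketch only."""
    srcstep = dststep = 0
    while srcstep < vl and dststep < vl:
        if not sz:  # no source zeroing: skip masked-out source elements
            while srcstep < vl and not (mask >> srcstep) & 1:
                srcstep += 1
        if not dz:  # no dest zeroing: skip masked-out dest elements
            while dststep < vl and not (mask >> dststep) & 1:
                dststep += 1
        if srcstep >= vl or dststep >= vl:
            break  # loop has ended: one side reached VL-1
        yield srcstep, dststep
        srcstep += 1
        dststep += 1
```

With `mask=0b1101` and `vl=4` this reproduces Example 1 (`sz=1, dz=0`) and Example 2 (`sz=0, dz=1`) above, including which side terminates the loop.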
## Twin Predication <a name="2p"> </a>
set a different subvector length for destination, and has a slightly
different pseudocode algorithm for Vertical-First Mode.
Ordering is as follows:

* SVSHAPE srcstep, dststep, ssubstep and dsubstep are advanced sequentially
  depending on PACK/UNPACK.
* srcstep and dststep are pushed through REMAP to compute actual Element offsets.
* Swizzle is independently applied to ssubstep and dsubstep

Pack/Unpack is enabled (set up) through [[sv/svstep]].
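The ordering above can be sketched as a small pipeline (illustrative Python only; the `remap_*` and `swizzle_*` callables are hypothetical stand-ins for REMAP and Swizzle, not spec functions):

```python
# Illustrative only: shows the order in which the four step counters
# are consumed once they have been advanced (PACK/UNPACK determines
# the order of advancement, not shown here).
def next_offsets(srcstep, dststep, ssubstep, dsubstep,
                 remap_src, remap_dst, swizzle_src, swizzle_dst):
    src_el = remap_src(srcstep)      # REMAP: actual source Element offset
    dst_el = remap_dst(dststep)      # REMAP: actual dest Element offset
    src_sub = swizzle_src(ssubstep)  # Swizzle applies to sub-steps only
    dst_sub = swizzle_dst(dsubstep)
    return (src_el, src_sub), (dst_el, dst_sub)
```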
## Reduce modes
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar *and*
the same register is used as a source register. Thus it is
up to the programmer to be aware of this, observe some conventions,
and thus end up achieving the desired outcome of scalar reduction.
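Scalar reduction can be modelled in a few lines (illustrative Python, not the spec's pseudocode; the register numbers and helper name are invented):

```python
# Sketch: sv.add RT, RA, RB.v where RT is scalar and RT == RA (the
# "accumulator" convention). Vector Element Operations keep issuing
# even though the destination is scalar, so the vector at RB is
# accumulated into the one scalar register.
def scalar_reduce_add(gpr, RT, RA, RB, vl):
    for i in range(vl):                  # loop does NOT terminate early
        gpr[RT] = gpr[RA] + gpr[RB + i]  # RA == RT acts as accumulator
    return gpr[RT]
```

For example, with a vector `[1, 2, 3]` at GPR 4 and GPR 0 as both scalar source and destination, the result accumulated into GPR 0 is 6.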
* One of the sources is a Vector
* the destination is a scalar
* optionally but most usefully when one source scalar register is
  also the scalar destination (which may be informally termed by
  convention the "accumulator")
* That the source register type is the same as the destination register
type identified as the "accumulator". Scalar reduction on `cmp`,
`setb` or `isel` makes no sense for example because of the mixture
* LD/ST ffirst (not to be confused with *Data-Dependent* LD/ST ffirst)
treats the first LD/ST in a vector (element 0) as an ordinary one.
  Exceptions occur "as normal" on the first element. However for elements
  1 and above, if an exception would occur, then VL is **truncated**
  to the previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
CR-creating operation produces a result (including cmp). Similar to
branch, an analysis of the CR is performed and if the test fails, VL is
truncated, either including or excluding the failing element, depending
on whether VLi (VL "inclusive") is set.
Thus the new VL comprises a contiguous vector of results, all of which
pass the testing criteria (equal to zero, less than zero). Demonstrated
approximately in pseudocode:

```
for i in range(VL):
    GPR[RT+i], CR[i] = operation(GPR[RA+i]... )
    if test(CR[i]) == failure:
        VL = i+VLi
        break
```
The CR-based data-driven fail-on-first is new and not
found in ARM SVE or RVV. At the same time it is also
In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, a vectorized crop
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.
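A sketch of that approach (illustrative Python; CR fields are modelled as lists of 4 bits and the helper name is invented):

```python
# Sketch: combine two CR bits with a vectorized crand, then apply
# fail-first to the combined result instead of to the arithmetic CRs.
def crop_then_ffirst(crs, vl, bit_a, bit_b):
    """crs: list of CR fields, each a list of 4 bits. Returns new VL."""
    combined = [crs[i][bit_a] & crs[i][bit_b] for i in range(vl)]  # sv.crand
    for i in range(vl):          # ffirst applied to the crop result
        if not combined[i]:      # bit-test fails: truncate VL here
            return i             # assumes VLi=0 (failing element excluded)
    return vl                    # all elements passed
```

Truncating on the `crand` result rather than on a single CR bit allows arbitrarily complex multi-bit tests.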
to zero. This is the only means in the entirety of SV that VL may be set
to zero (except via the SV.STATE SPR). When VL is set to
zero due to the first element failing the CR bit-test, all subsequent
vectorized operations are effectively `nops`, which is
*precisely the desired and intended behaviour*.
Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
More details can be found in [[sv/cr_ops]].
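The LD/ST fail-first truncation described in the bullets above can be sketched as follows (illustrative Python; `load` and `MemoryError` stand in for the memory access and its exception):

```python
# Sketch of LD/ST fail-first: element 0 traps "as normal"; a fault
# on any later element truncates VL to the previous element instead.
def ldst_ffirst(load, addrs, vl):
    results = []
    for i in range(vl):
        try:
            results.append(load(addrs[i]))
        except MemoryError:
            if i == 0:
                raise    # first element: exception occurs as normal
            vl = i       # truncate VL to the previous element
            break
    return results, vl
```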
## CR Operations
CRs are slightly more involved than INT or FP registers due to the
```
if extra3_mode:
spec = EXTRA3
elif EXTRA2[0]: # vector mode
    spec = EXTRA2 << 1 # same as EXTRA3, shifted
else: # scalar mode
    spec = (EXTRA2[0] << 2) | EXTRA2[1]
if spec[0]:
# vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
return ((BA >> 2)<<6) | # hi 3 bits shifted up
### CR fields as inputs/outputs of vector operations
CRs (or, the arithmetic operations associated with them)
may be marked as Vectorized or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorized if the destination is Vectorized. Likewise if the destination is scalar then so is the CR.
When vectorized, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields. Vectorized Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.
Note that yes, the CR Fields are genuinely Vectorized. Unlike in SIMD VSX which
has a single CR (CR6) for a given SIMD result, SV Vectorized OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:
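A hedged sketch of that tuple-producing loop (illustrative Python, not the spec's own pseudocode; CR numbering starts at CR8 as noted earlier):

```python
# Illustrative only: each element of a Vectorized Rc=1 operation
# produces a tuple of (arithmetic result, CR element).
def sv_add_rc1(gpr, cr, RT, RA, RB, vl):
    for i in range(vl):
        result = (gpr[RA + i] + gpr[RB + i]) & 0xFFFF_FFFF_FFFF_FFFF
        gpr[RT + i] = result                    # result of the operation...
        cr[8 + i] = {                           # ...and its CR element
            "lt": bool(result >> 63),           # negative (signed)
            "gt": result != 0 and not (result >> 63),
            "eq": result == 0,
        }
```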
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorized
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations; see [[sv/cr_int_predication]].
so FP instructions with Rc=1 write to CR1 (n=1).
CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorized
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mrr: map-reduce, reverse-gear (VL-1 downto 0)
For modes:
* fail-first
- ff=lt/gt/le/ge/eq/ne/so/ns
- RC1 mode
multiply into RT and RT+1.
What, then, of `sv.maddedu`? If the destination is hard-coded to RT and
RT+1, the instruction is not useful when Vectorized because the output
will be overwritten on the next element. Solving this is easy: define
the destination registers as RT and RT+MAXVL respectively. This makes
it easy for compilers to statically allocate registers even when VL