remove Rc=1 for now from bmatflip

diff --git a/openpower/sv/mv.swizzle.mdwn b/openpower/sv/mv.swizzle.mdwn
index 952fabe8a39a9ac35af743d6a7dad8f2248d083c..3dc801822c42e3b110e0273c0042f78570256c30 100644
--- a/openpower/sv/mv.swizzle.mdwn
+++ b/openpower/sv/mv.swizzle.mdwn
@@ -18,7 +18,9 @@ prefixes into a single instruction.
  A compromise is to provide a Swizzle "Move": one such move is
  then required for each operand used in a subsequent instruction.
  The encoding for Swizzle Move embeds static predication into the
-swizzle as well as constants 1/1.0 and 0/0.0.
+swizzle as well as constants 1/1.0 and 0/0.0, and if Saturation
+is enabled maximum arithmetic constants may be placed into the
+destination as well.
  
  An extremely important aspect of 3D GPU workloads is that the source
  and destination subvector lengths may be *different*.  A vector of
@@ -26,13 +28,14 @@ contiguous array of vec3 (XYZ) may only have 2 elements (ZY)
  swizzle-copied to
  a contiguous array of vec2.  A contiguous array of vec2 sources
  may have multiple of each vec2 elements (XY) copied to a contiguous
-vec4 array (YYXX or XYXX). For this reason
+vec4 array (YYXX or XYXX). For this reason, *when Vectorized*
  Swizzle Moves support independent subvector lengths for both
  source and destination.
  
-Although conceptually similar to `vpermd` of Packed SIMD VSX,
+Although conceptually similar to `vpermd` and `vpermdi`
+of Packed SIMD VSX,
  Swizzle Moves come in immediate-only form with only up to four
-selectors, where VSX refers to individual bytes and may not
+selectors, where `vpermd` refers to individual bytes and may not
  copy constants to the destination.
  3D Shader programs commonly use the letters "XYZW"
  when referring to the four swizzle indices, and also often
@@ -40,6 +43,11 @@ use the letters "RGBA"
  if referring to pixel data.  These designations are also
  part of both the OpenGL(TM) and Vulkan(TM) specifications.
  
+As a standalone Scalar operation this instruction is valuable
+if Prefixed with SVP64Single (providing Predication).
+Combined with `cmpi` it synthesises Compare-and-Swap.
+It is also more flexible than `xxpermdi`.
+
  # Format
  
  | 0.5 |6.10|11.15|16.27|28.31|  name        | Form    |
@@ -71,7 +79,8 @@ In very simplistic terms the relationship between swizzle indices
  
      dest[i] = src[swiz[i]]
  
-Note that 7 options are needed (not 6) because option 0b000 allows static 
+Note that 8 options are needed (not 6) because option 0b001 encodes
+the subvector length, and option 0b000 allows static 
  predicate masking (skipping) to be encoded within the swizzle immediate.
  For example it allows "W.Y." to specify: "copy W to position X,
  and Y to position Z, leave the other two positions Y and W unaltered"
@@ -80,11 +89,13 @@ and Y to position Z, leave the other two positions Y and W unaltered"
      X    Y    Z    W  source
           |         |     
           +----+    |
-         |    |    |
+         .    |    |
      +--------------+
-    |    |    |    |
+    |    .    |    .
      W    .    Y    .  swizzle
-    |    |    |    |
+    |    .    |    .
+    |    Y    |    W  Y,W unmodified
+    |    .    |    .
      W    Y    Y    W  dest
  
  **As a Scalar instruction**
@@ -93,8 +104,9 @@ Given that XYZW Swizzle can select simultaneously between one *and four*
  register operands, a full version of this instruction would
  be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
  ISA this not practical. A compromise is to cut the registers required
-by half.
-When part of the Scalar Power ISA (not SVP64 Vectorised)
+by half, placing it on-par with `lq`, `stq` and Indexed
+Load-with-update instructions.
+When part of the Scalar Power ISA (not SVP64 Vectorized)
  mv.swiz and fmv.swiz operate on four 32-bit
  quantities, reducing this instruction to a feasible
  2-in, 2-out pairs of 64-bit registers:
@@ -119,13 +131,15 @@ Also, making life easier, RT and RA are only permitted to be even
  as in `lq` and `stq`.  Scalar Swizzle instructions must be atomically
  indivisible: an Exception or Interrupt may not occur during the Moves.
  
-Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
+Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant
  *must* buffer (read) both 64-bit RA registers before writing to the
-RT pair. This ensures that register file corruption does not occur.
+RT pair (in an Out-of-Order Micro-architecture, both of the register
+pair must be "in-flight").
+This ensures that register file corruption does not occur.
  
-**SVP64 Vectorised**
+**SVP64 Vectorized**
  
-Vectorised Swizzle may be considered to 
+Vectorized Swizzle may be considered to 
  contain an extended static predicate
  mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
  the static predication capability, the destination
@@ -133,10 +147,11 @@ subvector length can be *different* from the source subvector
  length, and consequently the destination subvector length is
  encoded into the Swizzle.
  
-When Vectorised, given the use-case is for a High-performance GPU,
+When Vectorized, given the use-case is for a High-performance GPU,
  the fundamental assumption is that Micro-coding or
  other technique will
-be deployed in hardware to issue multiple Scalar MV operations which
+be deployed in hardware to issue multiple Scalar MV operations and
+full parallel crossbars, which
  would be impractical in a smaller Scalar-only Micro-architecture.
  Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
  quantities as the default is lifted on `sv.mv.swiz`.
@@ -144,7 +159,7 @@ quantities as the default is lifted on `sv.mv.swiz`.
  Additionally, in order to make life easier for implementers, some of
  whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
  the usual strict Element-level Program Order is relaxed.
-An overlap between all and any Vectorised
+An overlap between all and any Vectorized
  sources and destination Elements for the entirety of
  the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.
  
@@ -221,7 +236,7 @@ than operating on the entire vec2/3/4 together would
  violate that expectation.  The exceptions to this, explained
  later, are when Pack/Unpack is enabled.
  
-**Effect of Saturation on Vectorised Swizzle**
+**Effect of Saturation on Vectorized Swizzle**
  
  A useful convenience for pixel data is to be able to insert values
  0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
@@ -233,27 +248,21 @@ the maximum permitted Saturated value is inserted rather than Constant 1.
  zero because there is no encoding space to select between -1, 0 and 1, and
  0 and max values are more useful.
  
-# RM Mode Concept:
-
-MVRM-2P-1S1D:
-
-| Field Name | Field bits | Description                     |
-|------------|------------|----------------------------|
-| Rdest_EXTRA2 | `10:11`  | extends Rdest (R\*\_EXTRA2 Encoding)   |
-| Rsrc_EXTRA2  | `12:13`  | extends Rsrc  (R\*\_EXTRA2 Encoding)   |
-| PACK_en      | `14`     | Enable pack              |
-| UNPACK_en    | `15`     | Enable unpack             |
-| MASK_SRC     | `16:18`  | Execution Mask for Source     |
+# Pack/Unpack Mode:
  
-The inclusion of a separate src SUBVL allows
-`sv.mv.swiz RT.vecN RA.vecN` to mean zip/unzip (pack/unpack).
-This is conceptually achieved by having both source and
+It is possible to apply Pack and Unpack to Vectorized 
+swizzle moves. The interaction requires specific explanation
+because it involves the separate SUBVLs (with destination SUBVL
+being separate). Key to understanding is that the 
+source and
  destination SUBVL be "outer" loops instead of inner loops,
-exactly as in [[sv/remap]] Matrix mode.
+exactly as in [[sv/remap]] Matrix mode, under the control
+of `SVSTATE.PACK` and `SVSTATE.UNPACK`.
  
  Illustrating a
-"normal" SVP64 operation with `SUBVL!=1:` (assuming no elwidth overrides):
+"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):
  
+```
      def index():
          for i in range(VL):
              for j in range(SUBVL):
@@ -261,20 +270,33 @@ Illustrating a
  
      for idx in index():
          operation_on(RA+idx)
+```
  
  For a separate source/dest SUBVL (again, no elwidth overrides):
  
-    # yield an outer-SUBVL, inner VL loop with SRC SUBVL
-    def index_src():
-        for j in range(SUBVL):
+```
+    # yield an outer-SUBVL or inner VL loop with SUBVL
+    def index_dest(outer):
+        if outer:
+            for j in range(dst_subvl):
+                for i in range(VL):
+                    yield j*VL+i
+        else:
              for i in range(VL):
-                yield i+VL*j
+                for j in range(dst_subvl):
+                    yield i*dst_subvl+j
  
-    # yield an outer-SUBVL, inner VL loop with DEST SUBVL
-    def index_dest():
-        for j in range(dst_subvl):
+    # yield an outer-SUBVL or inner VL loop with SUBVL
+    def index_src(outer):
+        if outer:
+            for j in range(SUBVL):
+                for i in range(VL):
+                    yield j*VL+i
+        else:
              for i in range(VL):
-                yield i+VL*j
+                for j in range(SUBVL):
+                    yield i*SUBVL+j
+```
  
  "yield" from python is used here for simplicity and clarity.
  The two Finite State Machines for the generation of the source
@@ -307,7 +329,7 @@ if VERTICAL_FIRST:
      if PACK_en and UNPACK_en:
          num_runs = 1 # both are outer loops
      for substep in num_runs:
-        (src_idx, offs) = yield from index_src()
-        dst_idx = yield from index_dst()
+        (src_idx, offs) = yield from index_src(PACK_en)
+        dst_idx = yield from index_dst(UNPACK_en)
          move_operation(RT+dst_idx, RA+src_idx+offs)
  ```
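
Reviewer note: the relationship `dest[i] = src[swiz[i]]` and the "W.Y." static-predication diagram touched by this patch can be checked with a minimal Python sketch. The textual selector notation (`X`, `Y`, `Z`, `W`, `.`, `0`, `1`) and the helper name `swizzle_move` are assumptions made for illustration only; they are not the binary immediate encoding defined by the specification, and Saturation is not modelled.

```python
# Hypothetical textual selectors: X/Y/Z/W pick a source element,
# "." skips the destination position (static predication),
# "0" and "1" insert a constant.
SELECT = {"X": 0, "Y": 1, "Z": 2, "W": 3}
CONST = {"0": 0, "1": 1}

def swizzle_move(src, dest, swiz):
    """Return a new destination subvector after applying the swizzle."""
    result = list(dest)          # copy, so src is fully read even if src is dest
    for i, sel in enumerate(swiz):
        if sel == ".":
            continue             # masked out: leave dest[i] unaltered
        if sel in CONST:
            result[i] = CONST[sel]
        else:
            result[i] = src[SELECT[sel]]
    return result

vec4 = [10, 20, 30, 40]                        # X, Y, Z, W
print(swizzle_move(vec4, vec4, "W.Y."))        # [40, 20, 20, 40]
print(swizzle_move([1, 2, 3], [0, 0], "ZY"))   # [3, 2]
```

The first call reproduces the "W Y Y W" destination row of the diagram, with positions Y and W left unmodified. The second shows a vec3 source feeding a vec2 destination, which is why the destination subvector length has to be independent of the source length.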