A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0, and if Saturation
is enabled maximum arithmetic constants may be placed into the
destination as well.
An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*. A vector of
vec3 sources may have only two of its three elements (such as X and Y)
swizzle-copied to
a contiguous array of vec2. A contiguous array of vec2 sources
may have multiple copies of each vec2 element (XY) copied to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorized*
Swizzle Moves support independent subvector lengths for both
source and destination.
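
To make the idea concrete, here is a minimal Python sketch (illustrative
only, not the instruction's pseudocode) of a contiguous vec2 source array
being swizzle-copied into a contiguous vec4 array with the selector
pattern XYXX:

```
# Sketch: expand a flat array of vec2 (X,Y) subvectors into a flat
# array of vec4 subvectors using the selector pattern "XYXX"
# (selector indices 0,1,0,0 into each source subvector).
def swizzle_expand(src_vec2, selectors=(0, 1, 0, 0)):
    dst_vec4 = []
    for i in range(0, len(src_vec2), 2):    # step over each vec2 source
        subvec = src_vec2[i:i+2]            # one source subvector (X, Y)
        dst_vec4.extend(subvec[s] for s in selectors)
    return dst_vec4

# two vec2 subvectors, (1, 2) and (3, 4):
print(swizzle_expand([1, 2, 3, 4]))  # [1, 2, 1, 1, 3, 4, 3, 3]
```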
Although conceptually similar to `vpermd` and `xxpermdi`
of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
selectors, where `vpermd` refers to individual bytes and cannot
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
"RGBA" when referring to pixel data. These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.
As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.
It is also more flexible than `xxpermdi`.

# Format
| 0.5 |6.10|11.15|16.27|28.31| name | Form |
dest[i] = src[swiz[i]]
Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered"
```
    X Y Z W     source
      |   |
      +-+ |
        | |
    +---|-+
    |   |
    W . Y .     swizzle
    |   |
    | Y | W     Y,W unmodified
    |   |
    W Y Y W     dest
```
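
The following Python sketch (illustrative only; the selector letters follow
the text above, and the function name is invented for this example) shows
how a swizzle such as "W.Y." is applied to a single subvector, including
the skip (".") option which leaves the corresponding destination element
untouched:

```
SEL = {"X": 0, "Y": 1, "Z": 2, "W": 3}

def apply_swizzle(dest, src, swiz):
    for i, s in enumerate(swiz):
        if s == ".":              # skip (static predicate mask): dest[i] unaltered
            continue
        elif s in "01":           # insert constant 0 or 1
            dest[i] = int(s)
        else:                     # copy the selected source element
            dest[i] = src[SEL[s]]
    return dest

src  = [10, 11, 12, 13]           # source values at positions  X Y Z W
dest = [20, 21, 22, 23]           # destination's existing      X Y Z W
print(apply_swizzle(dest, src, "W.Y."))   # [13, 21, 11, 23]
```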
**As a Scalar instruction**

A full four-in, four-out move on 64-bit registers is not practical
in the Scalar Power ISA. A compromise is to cut the registers required
by half, placing it on-par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorized),
`mv.swiz` and `fmv.swiz` operate on four 32-bit
quantities, reducing this instruction to feasible
2-in, 2-out pairs of 64-bit registers,
as in `lq` and `stq`. Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.
Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.
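
A minimal Python model of this Scalar behaviour may help (a sketch only:
the packing of the four 32-bit quantities into the low and high halves of
each 64-bit register, and the selector representation, are assumptions
made purely for illustration). It shows why buffering both RA registers
first is essential when RT=RA:

```
# Sketch of Scalar mv.swiz: four 32-bit lanes held in a pair of
# 64-bit registers (RA, RA+1), swizzled into the pair (RT, RT+1).
MASK32 = 0xffffffff

def scalar_mv_swiz(regs, RT, RA, swiz):
    # buffer (read) BOTH 64-bit RA registers *before* any write,
    # so that when RT=RA the writes cannot corrupt a needed source
    lo, hi = regs[RA], regs[RA + 1]
    src = [lo & MASK32, (lo >> 32) & MASK32,
           hi & MASK32, (hi >> 32) & MASK32]
    dst = [regs[RT] & MASK32, (regs[RT] >> 32) & MASK32,
           regs[RT + 1] & MASK32, (regs[RT + 1] >> 32) & MASK32]
    for i, s in enumerate(swiz):      # s: source index 0..3, or None to skip
        if s is not None:
            dst[i] = src[s]
    regs[RT]     = dst[0] | (dst[1] << 32)
    regs[RT + 1] = dst[2] | (dst[3] << 32)

regs = {4: (11 << 32) | 10, 5: (13 << 32) | 12}   # X=10 Y=11 Z=12 W=13
scalar_mv_swiz(regs, 4, 4, [3, None, 1, None])    # "W.Y." with RT=RA
print(regs[4] & MASK32, regs[4] >> 32)            # 13 11
```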
**SVP64 Vectorized**

Vectorized Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
length (and consequently the destination subvector length) is
encoded into the Swizzle.
When Vectorized, given that the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
another such technique will
be deployed in hardware to issue multiple Scalar MV operations, or
that full parallel crossbars will be provided: either would
be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.
Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
An overlap between all and any Vectorized
sources and destination Elements for the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour: programs must not
rely on any particular element ordering, because Micro-coded
implementations are permitted to
violate that expectation. The exceptions to this, explained
later, are when Pack/Unpack is enabled.
**Effect of Saturation on Vectorized Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
when Saturation is enabled, the maximum arithmetic value for the
current element width may be placed into the destination as a constant.
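
As a small illustration (a sketch only; the actual values inserted are
determined by the instruction's element width and Saturation mode, not by
this code), the maximum arithmetic constant for a given element width and
signedness is:

```
# e.g. signed 8-bit -> 0x7f, unsigned 8-bit -> 0xff
def max_arith_const(elwidth, signed):
    return (1 << (elwidth - 1)) - 1 if signed else (1 << elwidth) - 1

print(hex(max_arith_const(8, True)), hex(max_arith_const(8, False)))  # 0x7f 0xff
```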
# Pack/Unpack Mode:
It is possible to apply Pack and Unpack to Vectorized
swizzle moves. The interaction requires specific explanation
because it involves two separate SUBVLs (the destination SUBVL
being independent of the source's). Key to understanding is that the
source and
destination SUBVL become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `SVSTATE.PACK` and `SVSTATE.UNPACK`.
Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):
```
# walk through the elements of a "normal" SVP64 operation:
# SUBVL is an inner loop inside the VL (element) loop
def index():
    for i in range(VL):
        for j in range(SUBVL):
            yield i*SUBVL+j

for idx in index():
    operation_on(RA+idx)
```
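
For instance (concrete values chosen arbitrarily for illustration), with
VL=2 and SUBVL=3 the loop above walks each subvector contiguously:

```
VL, SUBVL = 2, 3
print([i*SUBVL+j for i in range(VL) for j in range(SUBVL)])  # [0, 1, 2, 3, 4, 5]
```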
For a separate source/dest SUBVL (again, no elwidth overrides):
```
# yield an outer-SUBVL or inner-SUBVL loop over the destination
def index_dst(outer):
    if outer:
        for j in range(dst_subvl):     # dest SUBVL is the outer loop
            for i in range(VL):
                yield j*VL+i
    else:
        for i in range(VL):
            for j in range(dst_subvl): # dest SUBVL is the inner loop
                yield i*dst_subvl+j

# yield an outer-SUBVL or inner-SUBVL loop over the source
def index_src(outer):
    if outer:
        for j in range(SUBVL):         # source SUBVL is the outer loop
            for i in range(VL):
                yield j*VL+i
    else:
        for i in range(VL):
            for j in range(SUBVL):     # source SUBVL is the inner loop
                yield i*SUBVL+j
```
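
As a quick check (values again chosen arbitrarily), note that the source
and destination generators walk index spaces of different sizes, VL*SUBVL
versus VL*dst_subvl, which is why the Swizzle itself (with its skip and
subvector-length options) is what maps between them:

```
# concrete values purely for illustration
VL, SUBVL, dst_subvl = 2, 3, 2
src_inner = [i*SUBVL+j     for i in range(VL) for j in range(SUBVL)]
dst_inner = [i*dst_subvl+j for i in range(VL) for j in range(dst_subvl)]
print(src_inner)  # [0, 1, 2, 3, 4, 5]  VL*SUBVL source element offsets
print(dst_inner)  # [0, 1, 2, 3]        VL*dst_subvl destination element offsets
```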
"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
```
# walk through source and destination indices simultaneously
if PACK_en and UNPACK_en:
    num_runs = 1  # both SUBVLs become outer loops
# (other PACK_en/UNPACK_en combinations not shown here)
for substep in range(num_runs):
    src_idx = yield from index_src(PACK_en)
    dst_idx = yield from index_dst(UNPACK_en)
    move_operation(RT+dst_idx, RA+src_idx)
```