if referring to pixel data. These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.
As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.

# Format
| 0.5 |6.10|11.15|16.27|28.31| name | Form |
|-----|----|-----|-----|-----|------|------|
```
dest[i] = src[swiz[i]]
```
Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered"
```
X   Y   Z   W       source
    |       |
    +---+   |
        |   |
+-------|---+
|       |
W   .   Y   .       swizzle
|   .   |   .
|   Y   |   W       Y, W unmodified
|   .   |   .
W   Y   Y   W       dest
```
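The dest/src relationship and the "." (skip) encoding above can be
sketched in plain Python; the function name and the "XYZW." string form
of the swizzle immediate are illustrative, not from the specification:

```python
def apply_swizzle(dest, src, swiz):
    """Copy src elements into dest as directed by a 4-char swizzle
    over "XYZW."; a "." (the 0b000 option) statically masks that
    destination position, leaving it unaltered."""
    for i, sel in enumerate(swiz):
        if sel != ".":            # "." = skip: dest[i] keeps its value
            dest[i] = src["XYZW".index(sel)]
    return dest

# the "W.Y." example from the text: positions Y and W stay unmodified
apply_swizzle(list("XYZW"), list("XYZW"), "W.Y.")  # -> ['W', 'Y', 'Y', 'W']
```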
**As a Scalar instruction**
register operands, a full version of this instruction would
be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on-par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pair of 64-bit registers.
Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.
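The read-before-write requirement can be modelled behaviourally.
Everything here (a dict as the register file, the "XYZW." swizzle
string, the function name) is an illustrative assumption, not the
normative pseudocode:

```python
def scalar_mv_swiz(regs, RT, RA, swiz):
    # buffer BOTH 64-bit RA registers *before* any write, so that
    # an overlapping RT=RA pair cannot be corrupted mid-operation
    pair = [regs[RA], regs[RA + 1]]
    # four 32-bit quantities: lo/hi halves of each 64-bit register
    src = [(pair[i // 2] >> (32 * (i % 2))) & 0xffffffff for i in range(4)]
    dest = list(src)
    for i, sel in enumerate(swiz):
        if sel != ".":                      # "." leaves position unaltered
            dest[i] = src["XYZW".index(sel)]
    # writes happen only after all reads have completed
    regs[RT] = dest[0] | (dest[1] << 32)
    regs[RT + 1] = dest[2] | (dest[3] << 32)

regs = {0: (2 << 32) | 1, 1: (4 << 32) | 3}  # X=1 Y=2 Z=3 W=4
scalar_mv_swiz(regs, 0, 0, "W.Y.")           # RT=RA: no corruption
```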
**SVP64 Vectorised**
When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
other techniques will
be deployed in hardware to issue multiple Scalar MV operations and
provide full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.
# Pack/Unpack Mode:

It is possible to apply Pack and Unpack to Vectorised
swizzle moves, and these instructions are of EXTRA type
`RM-2P-1S1D-PU`. The interaction requires specific explanation
because it involves separate source and destination SUBVLs.
Key to understanding is that the source and
destination SUBVL become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `PACK_en` and `UNPACK_en`.
Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

```
def index():
    for i in range(VL):
        for j in range(SUBVL):
            yield i*SUBVL + j
```
For a separate source/dest SUBVL (again, no elwidth overrides):
```
# yield an outer-SUBVL or inner-VL loop with DEST SUBVL
def index_dest(outer):
    if outer:
        for j in range(dst_subvl):
            for i in range(VL):
                ....
    else:
        for i in range(VL):
            for j in range(dst_subvl):
                ....

# yield an outer-SUBVL or inner-VL loop with SRC SUBVL
def index_src(outer):
    if outer:
        for j in range(SUBVL):
            for i in range(VL):
                ....
    else:
        for i in range(VL):
            for j in range(SUBVL):
                ....
```
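The effect of the outer-SUBVL ordering can be seen by flattening the two
loop orders into element offsets (an illustrative enumeration; `offsets`
is not a spec function):

```python
def offsets(VL, SUBVL, outer):
    # element offset for (i, j) in an i*SUBVL+j (grouped) layout
    if outer:   # SUBVL as the outer loop: Pack/Unpack ordering
        return [i * SUBVL + j for j in range(SUBVL) for i in range(VL)]
    else:       # SUBVL as the inner loop: normal SVP64 ordering
        return [i * SUBVL + j for i in range(VL) for j in range(SUBVL)]

offsets(3, 2, outer=False)  # [0, 1, 2, 3, 4, 5]: sequential
offsets(3, 2, outer=True)   # [0, 2, 4, 1, 3, 5]: an "unzip" of the pairs
```

Reading an interleaved x0 y0 x1 y1 x2 y2 vector in the outer order
gathers all x elements then all y elements: exactly the zip/unzip
(pack/unpack) behaviour.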
"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element indices then combine as follows:
```
if PACK_en and UNPACK_en:
    num_runs = 1  # both are outer loops
for substep in range(num_runs):
    (src_idx, offs) = yield from index_src(PACK_en)
    dst_idx = yield from index_dest(UNPACK_en)
    move_operation(RT+dst_idx, RA+src_idx+offs)
```