however it is not practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
-then required for each operand to a subsequent instruction.
-The encoding for this instruction embeds static predication into the
-swizzle as well as constants 1/1.0 and 0/0.0
+then required for each operand used in a subsequent instruction.
+The encoding for Swizzle Move embeds static predication into the
+swizzle as well as constants 1/1.0 and 0/0.0.
An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*. A vector of
-contiguous array of vec3 may only 2 elements swizzle-copied to a contiguous
-array of vec2. Swizzle Moves support independent subvector lengths.
+contiguous vec3 (XYZ) subvectors may have only 2 elements (ZY)
+swizzle-copied to
+a contiguous array of vec2. A contiguous array of vec2 sources
+may have multiple copies of each vec2 element (XY) placed into
+a contiguous vec4 array (YYXX or XYXX). For this reason, *when Vectorised*,
+Swizzle Moves support independent subvector lengths for both
+source and destination.
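The differing-subvector-length copies described above can be sketched in plain python. The function name `swizzle_copy` and its argument layout are illustrative only, not part of the specification:

```python
# Illustrative sketch of a swizzle-copy between contiguous arrays whose
# subvector lengths differ. swiz selects, per destination lane, which
# lane of the source subvector to read.

def swizzle_copy(src, src_subvl, dst_subvl, swiz, vl):
    """Copy VL subvectors, selecting dst_subvl elements out of each
    src_subvl-element source subvector according to swiz."""
    dst = [0] * (vl * dst_subvl)
    for i in range(vl):
        for j in range(dst_subvl):
            dst[i*dst_subvl + j] = src[i*src_subvl + swiz[j]]
    return dst

# vec3 (XYZ) source: copy only Z and Y into a contiguous vec2 array
xyz = [1, 2, 3,  4, 5, 6]             # two vec3: X Y Z per subvector
zy  = swizzle_copy(xyz, 3, 2, [2, 1], vl=2)
# → [3, 2, 6, 5]

# vec2 (XY) source: duplicate elements into a contiguous vec4 (YYXX)
xy   = [7, 8]                         # one vec2
yyxx = swizzle_copy(xy, 2, 4, [1, 1, 0, 0], vl=1)
# → [8, 8, 7, 7]
```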
Although conceptually similar to `vpermd` of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
swizzle indices.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use the letters "RGBA"
-if referring to pixel data.
+if referring to pixel data. These designations are also
+part of both the OpenGL(TM) and Vulkan(TM) specifications.
+
+As a standalone Scalar operation this instruction is valuable
+if Prefixed with SVP64Single (providing Predication).
+Combined with `cmpi` it synthesises Compare-and-Swap.
# Format
dest[i] = src[swiz[i]]
-Note that 7 options are needed (not 6) because option 0b000 allows static
+Note that 8 options are needed (not 6) because option 0b001 encodes
+the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example, it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered"
0 1 2 3
- X Y Z W
+ X Y Z W source
| |
+----+ |
- | | |
+ . | |
+--------------+
- | | | |
- W Y Y W
+ | . | .
+ W . Y . swizzle
+ | . | .
+ | Y | W Y,W unmodified
+ | . | .
+ W Y Y W dest
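A minimal sketch of the per-position semantics illustrated above, using characters as stand-ins for the real 3-bit swizzle fields: `.` for the skip (static-predicate-mask) option and `0`/`1` for the constant options. The character encoding is an assumption for readability only:

```python
# Illustrative model of the swizzle immediate's per-position options:
# a letter selects a source lane, '.' leaves the destination lane
# unaltered (static predication), '0'/'1' write a constant.

def apply_swizzle(dest, src, swiz):
    lanes = {'X': 0, 'Y': 1, 'Z': 2, 'W': 3}
    out = list(dest)
    for i, s in enumerate(swiz):
        if s == '.':
            continue            # skipped: dest lane untouched
        elif s in '01':
            out[i] = int(s)     # constant 0/0.0 or 1/1.0
        else:
            out[i] = src[lanes[s]]
    return out

# "W.Y.": copy W to position X, Y to position Z, leave Y and W alone
print(apply_swizzle([10, 20, 30, 40], [1, 2, 3, 4], "W.Y."))
# → [4, 20, 2, 40]
```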
**As a Scalar instruction**
register operands, a full version of this instruction would
be an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
-by half.
+by half, placing it on-par with `lq`, `stq` and Indexed
+Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible 2-in 2-out
operation on 64-bit register pairs.
Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
-RT pair. This ensures that register file corruption does not occur.
+RT pair (in an Out-of-Order Micro-architecture, both of the register
+pair must be "in-flight").
+This ensures that register file corruption does not occur.
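The read-before-write requirement can be sketched by modelling the register pairs as a small python list (the regfile layout and function name are illustrative):

```python
# Sketch of why the Scalar variant must read the full RA pair before
# writing RT when RT=RA: a naive lane-at-a-time copy would overwrite
# source lanes it has not yet read. Four 32-bit lanes model the two
# 64-bit registers of each pair.

def scalar_swiz(regfile, rt, ra, swiz):
    # read *both* 64-bit RA registers (all four 32-bit lanes) first...
    buffered = list(regfile[ra:ra+4])
    # ...then write the RT pair; safe even when rt == ra
    for i, sel in enumerate(swiz):
        regfile[rt+i] = buffered[sel]

regs = [11, 22, 33, 44]
scalar_swiz(regs, rt=0, ra=0, swiz=[3, 2, 1, 0])  # reverse, in-place
# regs is now [44, 33, 22, 11]; without the buffering, lanes 2 and 3
# would already have been clobbered by the time they were read
```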
**SVP64 Vectorised**
When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or another
technique will
-be deployed in hardware to issue multiple Scalar MV operations which
+be deployed in hardware to issue multiple Scalar MV operations and
+full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.
may be assumed to be 4. SUBVL is considered to be the "source" subvector
length.
+Pseudocode exploiting python "yield"; element-width overrides,
+Saturation and Predication are left out, for clarity:
+
```
def index_src():
    # yield (source subvector base, lane offset) for each dest element
    for i in range(VL):
        for j in range(dst_subvl):
            if swiz[j] == 0b000:   # skip (static predicate mask)
                continue           # (constant options elided here)
            else:
                yield (i*SUBVL, swiz[j]-3)
- # yield an outer-SUBVL, inner VL loop with DEST SUBVL
def index_dest():
    for i in range(VL):
        for j in range(dst_subvl):
            if swiz[j] == 0b000:   # keep the two FSMs in lock-step
                continue
            yield i*dst_subvl + j

for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
    move_operation(RT+dst_idx, RA+src_idx+offs)
```
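The generator-based pseudocode above can be fleshed out into a self-contained, runnable model, under *assumed* swizzle encodings (0b000 = skip, 0b011 through 0b110 = X through W); element-width overrides, Saturation and Predication are again omitted:

```python
# Runnable model of the Vectorised Swizzle Move index generators,
# under assumed 3-bit swizzle encodings (an illustrative sketch, not
# the definitive instruction semantics).

def sv_mv_swiz(ra, vl, subvl, dst_subvl, swiz):
    """Return the RT vector produced by a Vectorised Swizzle Move."""
    rt = [None] * (vl * dst_subvl)

    def index_src():
        for i in range(vl):
            for j in range(dst_subvl):
                if swiz[j] == 0b000:            # skip (static predication)
                    continue
                yield (i * subvl, swiz[j] - 3)  # subvector base, lane offset

    def index_dest():
        for i in range(vl):
            for j in range(dst_subvl):
                if swiz[j] == 0b000:            # skipped lanes stay unmodified
                    continue
                yield i * dst_subvl + j

    for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
        rt[dst_idx] = ra[src_idx + offs]
    return rt

# two vec3 sources, swizzled ZY into vec2 destinations
# (X=0b011 Y=0b100 Z=0b101 W=0b110)
print(sv_mv_swiz([1, 2, 3, 4, 5, 6], vl=2, subvl=3,
                 dst_subvl=2, swiz=[0b101, 0b100]))
# → [3, 2, 6, 5]
```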
+**Vertical-First Mode**
+
+It is important to appreciate that *only* the main loop VL
+is Vertical-First: the SUBVL loop is not. This makes sense
+from the perspective that the Swizzle Move is a group of
+moves, but is still a single instruction that happens to take
+vec2/3/4 as operands. Were Vertical-First
+to perform only one of the *sub*-elements at a time rather
+than operating on the entire vec2/3/4 together, it would
+violate that expectation. The exceptions to this, explained
+later, are when Pack/Unpack is enabled.
+
**Effect of Saturation on Vectorised Swizzle**
A useful convenience for pixel data is to be able to insert values
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.
-# RM Mode Concept:
+# Pack/Unpack Mode:
-MVRM-2P-1S1D:
-
-| Field Name | Field bits | Description |
-|------------|------------|----------------------------|
-| Rdest_EXTRA2 | `10:11` | extends Rdest (R\*\_EXTRA2 Encoding) |
-| Rsrc_EXTRA2 | `12:13` | extends Rsrc (R\*\_EXTRA2 Encoding) |
-| PACK_en | `14` | Enable pack |
-| UNPACK_en | `15` | Enable unpack |
-| MASK_SRC | `16:18` | Execution Mask for Source |
-
-The inclusion of a separate src SUBVL allows
-`sv.mv.swiz RT.vecN RA.vecN` to mean zip/unzip (pack/unpack).
-This is conceptually achieved by having both source and
-destination SUBVL be "outer" loops instead of inner loops.
+It is possible to apply Pack and Unpack to Vectorised
+swizzle moves, and these instructions are of EXTRA type
+`RM-2P-1S1D-PU`. The interaction requires specific explanation
+because it involves two separate SUBVLs, one for source and
+one for destination. Key to understanding is that the source
+and destination SUBVL loops may become "outer" loops instead of
+inner loops, exactly as in [[sv/remap]] Matrix mode, under the
+control of `PACK_en` and `UNPACK_en`.
Illustrating a
-"normal" SVP64 operation with `SUBVL!=1:` (assuming no elwidth overrides):
+"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):
def index():
    for i in range(VL):
        for j in range(SUBVL):
            yield i*SUBVL + j
For a separate source/dest SUBVL (again, no elwidth overrides):
- # yield an outer-SUBVL, inner VL loop with SRC SUBVL
- def index_src():
- for j in range(SRC_SUBVL):
+    # yield an outer-SUBVL or inner-VL loop with DEST SUBVL
+ def index_dest(outer):
+ if outer:
+ for j in range(dst_subvl):
+ for i in range(VL):
+ ....
+ else:
for i in range(VL):
- yield i+VL*j
+ for j in range(dst_subvl):
+ ....
- # yield an outer-SUBVL, inner VL loop with DEST SUBVL
- def index_dest():
- for j in range(SUBVL):
+ # yield an outer-SUBVL or inner VL loop with SUBVL
+ def index_src(outer):
+ if outer:
+ for j in range(SUBVL):
+ for i in range(VL):
+ ....
+ else:
for i in range(VL):
- yield i+VL*j
-
- # inner looping when SUBVLs are equal
- if SRC_SUBVL == SUBVL:
- for idx in index():
- move_operation(RT+idx, RA+idx)
- else:
- # walk through both source and dest indices simultaneously
- for src_idx, dst_idx in zip(index_src(), index_dst()):
- move_operation(RT+dst_idx, RA+src_idx)
+ for j in range(SUBVL):
+ ....
"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.
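The inner-versus-outer loop-order swap can be exercised as a runnable sketch; the element-offset formula `i*subvl + j` is an assumption kept consistent with the inner-loop pseudocode:

```python
# Sketch of the loop-order swap that produces the pack/unpack
# (zip/unzip) effect: both generators visit the same element offsets,
# but in a different order, so pairing them in lock-step reorders data.

def index_inner(vl, subvl):
    for i in range(vl):          # VL outer, SUBVL inner (normal SVP64)
        for j in range(subvl):
            yield i * subvl + j

def index_outer(vl, subvl):
    for j in range(subvl):       # SUBVL outer, VL inner (pack/unpack)
        for i in range(vl):
            yield i * subvl + j

# stepping the two FSMs in lock-step unzips an array of vec2:
src = ['x0', 'y0', 'x1', 'y1']   # two vec2, interleaved
dst = [None] * 4
for s, d in zip(index_inner(2, 2), index_outer(2, 2)):
    dst[d] = src[s]
print(dst)
# → ['x0', 'x1', 'y0', 'y1']    (all X elements, then all Y elements)
```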
-Ether `SRC_SUBVL=1, SUBVL=2/3/4` gives
-a "pack" effect, and `SUBVL=1, SRC_SUBVL=2/3/4` gives an
-"unpack". Setting both SUBVL and SRC_SUBVL to greater than
-1 is `UNDEFINED`.
+Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
+that swaps to Outer-subvector loops, and when `UNPACK_en` is set
+it is the destination that swaps its loop-order. Setting both
+`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
+because the behaviour is fully deterministic.
+
+*However*, in
+Vertical-First Mode, when both are enabled,
+with both source and destination being outer loops a **single**
+step of srstep and dststep is performed. Contrast this with when
+only `PACK_en` is set: it is the *destination* that is an inner
+subvector loop, and therefore Vertical-First runs through the
+entire `dst_subvl` group. Likewise when `UNPACK_en` is set it
+is the source subvector that is run through as a group.
+
+```
+if VERTICAL_FIRST:
+ # must run through SUBVL or dst_subvl elements, to keep
+ # the subvector "together". weirdness occurs due to
+ # PACK_en/UNPACK_en
+ num_runs = SUBVL # 1-4
+ if PACK_en:
+ num_runs = dst_subvl # destination still an inner loop
+ if PACK_en and UNPACK_en:
+ num_runs = 1 # both are outer loops
+    for substep in range(num_runs):
+        (src_idx, offs) = yield from index_src(PACK_en)
+        dst_idx = yield from index_dest(UNPACK_en)
+ move_operation(RT+dst_idx, RA+src_idx+offs)
+```