A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0, and, if Saturation
is enabled, maximum arithmetic constants may be placed into the
destination as well.
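The behaviour described above can be sketched as a small model. This is
an illustrative sketch only: the selector names (`SKIP`, `ZERO`, `ONE`)
and the function are assumptions for clarity, not the actual instruction
encoding.

```
# Hypothetical model of a four-lane Swizzle Move. Each selector picks a
# source lane (0-3), an embedded constant (0/0.0 or 1/1.0), or "skip"
# (static predication: the destination lane is simply not produced).
SKIP, ZERO, ONE = "skip", "zero", "one"

def swizzle_mv(src, selectors):
    """src: four source lanes; selectors: one entry per destination lane."""
    dst = []
    for sel in selectors:
        if sel == SKIP:
            continue            # statically predicated out
        elif sel == ZERO:
            dst.append(0.0)     # embedded constant 0/0.0
        elif sel == ONE:
            dst.append(1.0)     # embedded constant 1/1.0
        else:
            dst.append(src[sel])
    return dst

# "XX1Z": copy X twice, insert the constant 1.0, then copy Z
print(swizzle_mv([10.0, 20.0, 30.0, 40.0], [0, 0, ONE, 2]))
# -> [10.0, 10.0, 1.0, 30.0]
```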
An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*. A vector of
vec4 sources may have selected elements (such as XY)
swizzle-copied to
a contiguous array of vec2. A contiguous array of vec2 sources
may have multiple of each vec2's elements (XY) copied to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorized*
Swizzle Moves support independent subvector lengths for both
source and destination.
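The vec2-to-vec4 case above can be sketched in a few lines. The function
and flat-array layout here are illustrative assumptions, not the
instruction's actual operand format.

```
# Sketch: a Vectorized Swizzle Move reading SUBVL=2 (vec2) sources and
# producing a destination subvector of length 4 (vec4) per source vec2.
def swizzle_vec2_to_vec4(src, selectors):
    """src: flat list of vec2 elements; selectors: indices into each vec2."""
    dst = []
    for i in range(0, len(src), 2):
        vec2 = src[i:i+2]
        dst.extend(vec2[s] for s in selectors)
    return dst

# Two vec2 sources (1, 2) and (3, 4), swizzled as YYXX = [1, 1, 0, 0]
print(swizzle_vec2_to_vec4([1, 2, 3, 4], [1, 1, 0, 0]))
# -> [2, 2, 1, 1, 4, 4, 3, 3]
```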
Although conceptually similar to `vpermd` and `vpermdi`
of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
selectors, where `vpermd` refers to individual bytes and may not
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use "RGBA" when the four elements refer to pixel data.
As a standalone Scalar operation this instruction is valuable
when Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.
It is also more flexible than `xxpermdi`.
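A hedged sketch of the Compare-and-Swap synthesis: `cmpi` sets a
condition, and the SVP64Single Predicate then gates whether the Swizzle
Move (which exchanges the register pair) is performed. Register names
and the helper function below are illustrative, not the ISA encoding.

```
# Model: cmpi produces a predicate bit; the Swizzle Move that performs
# the swap only executes when the predicate holds.
def compare_and_swap(regs, rt, ra, expected):
    cond = (regs[rt] == expected)       # cmpi: compare against immediate
    if cond:                            # predicate gates the Swizzle Move
        regs[rt], regs[ra] = regs[ra], regs[rt]
    return cond

regs = {"r3": 5, "r4": 9}
compare_and_swap(regs, "r3", "r4", 5)   # matches: swap occurs
print(regs)
# -> {'r3': 9, 'r4': 5}
```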
# Format
In the Scalar Power ISA this is not practical. A compromise is to cut
the number of registers required by half, placing it on par with `lq`,
`stq` and Indexed Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorized),
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out operation on pairs of 64-bit registers,
as in `lq` and `stq`. Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.
Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.
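The read-before-write rule can be sketched as follows. The GPR pairing
and lane layout here are illustrative assumptions: the point is solely
that the entire RA pair is buffered before either RT register is
written, making `RT=RA` safe.

```
# Sketch: Scalar Swizzle buffers all four 32-bit lanes of the RA pair
# before writing the RT pair, so an in-place (RT=RA) swizzle cannot
# corrupt its own source.
def scalar_swizzle(gprs, rt, ra, selectors):
    # Read (buffer) the whole RA pair first: four 32-bit lanes
    lanes = [
        gprs[ra] & 0xFFFFFFFF, (gprs[ra] >> 32) & 0xFFFFFFFF,
        gprs[ra + 1] & 0xFFFFFFFF, (gprs[ra + 1] >> 32) & 0xFFFFFFFF,
    ]
    out = [lanes[s] for s in selectors]   # apply the swizzle to the buffer
    # Only now write the RT pair (safe even when rt == ra)
    gprs[rt] = out[0] | (out[1] << 32)
    gprs[rt + 1] = out[2] | (out[3] << 32)

gprs = {4: (2 << 32) | 1, 5: (4 << 32) | 3}     # lanes 1, 2, 3, 4
scalar_swizzle(gprs, 4, 4, [3, 2, 1, 0])        # in-place reverse: WZYX
print([gprs[4] & 0xFFFFFFFF, gprs[4] >> 32,
       gprs[5] & 0xFFFFFFFF, gprs[5] >> 32])
# -> [4, 3, 2, 1]
```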
**SVP64 Vectorized**

Vectorized Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
length, and consequently the destination subvector length, is
encoded into the Swizzle.
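A brief sketch of why the destination subvector length must be encoded:
static predication skips lanes, so a SUBVL=4 source can produce a
shorter destination subvector. The `None`-as-skip convention below is an
illustrative assumption, not the instruction format.

```
# Sketch: statically-predicated lanes are skipped, shrinking the
# destination subvector relative to the SUBVL=4 source.
def swizzled_subvector(src_vec4, selectors):
    """selectors: lane indices, or None for a statically-predicated skip."""
    return [src_vec4[s] for s in selectors if s is not None]

# Swizzle "X.Z." skips lanes 1 and 3: dst_subvl becomes 2, not 4
print(swizzled_subvector([7, 8, 9, 10], [0, None, 2, None]))
# -> [7, 9]
```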
When Vectorized, given that the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
other techniques will
be deployed in hardware to issue multiple Scalar MV operations.
Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorized
source and destination Elements for the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour:
hardware is permitted to assume that no such overlap occurs,
and programs must not
violate that expectation. The exceptions to this, explained
later, are when Pack/Unpack is enabled.
**Effect of Saturation on Vectorized Swizzle**
A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R, G, B or A. Therefore,
when Saturation is enabled, the maximum arithmetic constant is
placed into the destination in place of 1/1.0.
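A sketch of this Saturation convenience, e.g. forcing an alpha channel
to fully opaque. The function and the `"one"` selector marker are
illustrative assumptions.

```
# Sketch: with Saturation enabled, the Swizzle constant 1/1.0 places the
# maximum arithmetic value (0x7f signed, 0xff unsigned) into the lane.
def swizzle_rgba(src, selectors, signed=False):
    maxval = 0x7F if signed else 0xFF
    return [maxval if s == "one" else src[s] for s in selectors]

# Copy R, G, B unchanged and force A to opaque (0xff)
print(swizzle_rgba([0x10, 0x20, 0x30, 0x40], [0, 1, 2, "one"]))
# -> [16, 32, 48, 255]
```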
# Pack/Unpack Mode
It is possible to apply Pack and Unpack to Vectorized
swizzle moves. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and destination element orderings are each produced by
their own loop:

```
# yield an outer-dst_subvl or inner VL loop with dst_subvl
def index_dest(outer):
    if outer:
        for j in range(dst_subvl):
            for i in range(VL):
                yield j*VL+i
    else:
        for i in range(VL):
            for j in range(dst_subvl):
                yield i*dst_subvl+j

# yield an outer-SUBVL or inner VL loop with SUBVL
def index_src(outer):
    if outer:
        for j in range(SUBVL):
            for i in range(VL):
                yield j*VL+i
    else:
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j
```
"yield" from Python is used here for simplicity and clarity.
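The effect of the `outer` flag in the loops above is easiest to see by
listing which (element, lane) coordinate each successive transfer
touches. A minimal sketch, assuming VL=2 and dst_subvl=3:

```
# Outer-subvector ordering walks one lane j across the whole vector
# first (packed layout); inner ordering walks each element's lanes
# together. The loop bounds mirror index_dest above.
VL, dst_subvl = 2, 3

def dest_order(outer):
    if outer:
        return [(i, j) for j in range(dst_subvl) for i in range(VL)]
    return [(i, j) for i in range(VL) for j in range(dst_subvl)]

print(dest_order(True))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
print(dest_order(False))
# -> [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```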