[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a permute shorthand that allows arbitrary selection
of elements from a vec2/3/4 to create a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
In 3D GPU ISAs Swizzle is usually applied on a per-vec-operand basis, making
for extremely long instructions (64 bits or greater).
It is not, however, practical to add two or more sets of 12-bit
prefixes to a single instruction.
The compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle, as well as the constants 1/1.0 and 0/0.0. If Saturation
is enabled, maximum arithmetic constants may be placed into the
destination as well.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*.  A
contiguous array of vec3 (XYZ) may have only 2 elements (ZY)
swizzle-copied to
a contiguous array of vec2.  A contiguous array of vec2 sources
may have each vec2 element (XY) copied multiple times to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorised*
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vpermd` of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with at most four
selectors, whereas VSX refers to individual bytes and cannot
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and often
use the letters "RGBA"
when referring to pixel data.  These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.

# Format

| 0.5 |6.10|11.15|16.27|28.31|  name        | Form    |
|-----|----|-----|-----|-----|--------------|-------- |
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index.  3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip".  This is equivalent to predicate masking
* 0b001 to indicate the subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN to indicate index NN (0 through 3): copy from the subelement in position XYZW
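
The field layout and option codes above can be sketched in Python.
This is an illustrative decoder only, not part of the specification:
the names `decode_swizzle` and `describe` are hypothetical, and the sketch
assumes Power ISA MSB0 numbering, so that immediate bits 0..2 (field X)
are the most significant three bits of the 12-bit value.

```python
# Illustrative sketch only: names are hypothetical, not from the spec.
# Assumption: MSB0 bit numbering, so imm bits 0..2 (X) are the top 3 bits.

def decode_swizzle(imm):
    """Split a 12-bit immediate into the four 3-bit fields [X, Y, Z, W]."""
    assert 0 <= imm < (1 << 12)
    return [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]

def describe(field):
    """Return a human-readable meaning for one 3-bit swizzle field."""
    if field == 0b000:
        return "skip"
    if field == 0b001:
        return "end"
    if field == 0b010:
        return "const 0"
    if field == 0b011:
        return "const 1"
    return "copy " + "XYZW"[field & 0b11]  # 0b1NN: NN selects X, Y, Z or W

# Build the immediate for: W into position X, skip, Y into position Z, skip
imm = (0b111 << 9) | (0b000 << 6) | (0b101 << 3) | 0b000
fields = decode_swizzle(imm)
print([describe(f) for f in fields])
# prints ['copy W', 'skip', 'copy Y', 'skip']
```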

In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]

Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static 
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered".

    0    1    2    3
    X    Y    Z    W  source
         |         |     
         +----+    |
         .    |    |
    +--------------+
    |    .    |    .
    W    .    Y    .  swizzle
    |    .    |    .
    |    Y    |    W  Y,W unmodified
    |    .    |    .
    W    Y    Y    W  dest
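
The "W.Y." diagram can be reproduced with a few lines of Python. This is a
minimal sketch for a single subvector: `apply_swizzle` is a hypothetical
helper, Saturation is ignored, and the end marker is assumed to stop all
further writes.

```python
# Illustrative sketch only: apply_swizzle is a hypothetical helper.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def apply_swizzle(swiz, src, dest):
    """Return the destination subvector after one swizzle.

    swiz: four 3-bit option codes, one per destination position
    src:  source subelements in X, Y, Z, W order
    dest: existing destination subelements (skipped positions keep these)
    """
    out = list(dest)
    for i, code in enumerate(swiz):
        if code == END:        # assumption: nothing further is written
            break
        if code == SKIP:       # static predicate mask: leave out[i] alone
            continue
        if code == CONST0:
            out[i] = 0
        elif code == CONST1:
            out[i] = 1
        else:                  # 0b1NN: copy source subelement NN
            out[i] = src[code & 0b11]
    return out

src = ["X", "Y", "Z", "W"]
swiz = [0b111, SKIP, 0b101, SKIP]        # "W.Y."
# in-place: the destination starts out holding the source values
print(apply_swizzle(swiz, src, src))
# prints ['W', 'Y', 'Y', 'W']
```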

**As a Scalar instruction**

Given that an XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
need an eye-popping eight 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on a par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pair of 64-bit registers:

| swizzle name | source | dest | half    |
|--            | --     | --   | --      |
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT or RT+1 not covered by
the Swizzle is unmodified.  For example a Swizzle of "..XY"
(X to position Z, Y to position W)
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero.  A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.
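
The Scalar register-pair behaviour can be modelled as follows. This is a
sketch under stated assumptions, not a reference implementation:
`scalar_mv_swiz` is a hypothetical name, only the integer form is modelled,
and "lo-half" is assumed to mean the least-significant 32 bits of a register.

```python
# Illustrative model only; names are hypothetical.
# Assumption: "lo-half" means the least-significant 32 bits of a register.
MASK32 = 0xFFFF_FFFF

def scalar_mv_swiz(regs, rt, ra, swiz):
    """Model Scalar mv.swiz on a list of 64-bit integer registers."""
    assert rt % 2 == 0 and ra % 2 == 0, "RT and RA must be even (pairs)"
    # read both source registers before writing anything
    src = [regs[ra] & MASK32, (regs[ra] >> 32) & MASK32,          # X, Y
           regs[ra + 1] & MASK32, (regs[ra + 1] >> 32) & MASK32]  # Z, W
    if rt == ra:
        dest = list(src)            # in-place: uncovered parts unmodified
    else:
        dest = [0, 0, 0, 0]         # otherwise uncovered parts become zero
    for i, code in enumerate(swiz):
        if code == 0b001:           # end marker
            break
        if code == 0b000:           # skip
            continue
        if code in (0b010, 0b011):  # constant 0 or constant 1
            dest[i] = code & 1
        else:                       # 0b1NN: copy source subelement NN
            dest[i] = src[code & 0b11]
    regs[rt] = dest[0] | (dest[1] << 32)
    regs[rt + 1] = dest[2] | (dest[3] << 32)

YX_SKIP_SKIP = [0b101, 0b100, 0b000, 0b000]    # "YX.."

regs = [0] * 8
regs[2], regs[3] = 0x1111_1111_2222_2222, 0x3333_3333_4444_4444
scalar_mv_swiz(regs, 2, 2, YX_SKIP_SKIP)       # RA=RT: halves of r2 swap
print(hex(regs[2]), hex(regs[3]))
# prints 0x2222222211111111 0x3333333344444444
```

With `RA!=RT` (for example `rt=4, ra=2`) the same Swizzle writes the swapped
halves into `regs[4]` and sets `regs[5]` to zero.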

Also, making life easier, RT and RA are only permitted to be even
(no overlapping can occur).  This makes RT (and RA) a "pair" exactly
as in `lq` and `stq`.  Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.

**SVP64 Vectorised**

Vectorised Swizzle may be considered to 
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
other technique will
be deployed in hardware to issue multiple Scalar MV operations and
full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorised
source and destination Elements over the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves, because due to there
being no overlap there is no risk of destroying a source with
an overwrite.  This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
    swiz[0] = imm[0:3]   # X
    swiz[1] = imm[3:6]   # Y
    swiz[2] = imm[6:9]   # Z
    swiz[3] = imm[9:12]  # W
    # determine implied subvector length from Swizzle
    dst_subvl = 4
    for i in range(4):
        if swiz[i] == 0b001:
            dst_subvl = i+1
            break
```

What is going on here is that the option is provided to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate.  With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
may be assumed to be 4.  SUBVL is considered to be the "source" subvector
length.

The pseudocode below uses python "yield" for clarity; element-width overrides,
Saturation and Predication are left out for the same reason:

```
    def index_src():
        for i in range(VL):
            for j in range(SUBVL):
                if swiz[j] == 0b000: # skip
                    continue
                if swiz[j] == 0b001: # end
                    break
                if swiz[j] in [0b010, 0b011]:
                    yield (i*SUBVL, CONSTANT)
                else:
                    # 0b1NN: NN (0..3) selects subelement X/Y/Z/W
                    yield (i*SUBVL, swiz[j]-4)

    def index_dest():
        for i in range(VL):
            for j in range(dst_subvl):
                if swiz[j] == 0b000: # skip
                    continue
                yield i*dst_subvl+j

    # walk through both source and dest indices simultaneously
    for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
        if offs == CONSTANT:
             set(RT+dst_idx, CONSTANT)
        else:
             move_operation(RT+dst_idx, RA+src_idx+offs)

**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not.  This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands.  Vertical-First
only performing one of the *sub*-elements at a time rather
than operating on the entire vec2/3/4 together would
violate that expectation.  The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorised Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R,G,B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.
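
The constant-selection rule can be summarised in a short Python sketch.
The function `swizzle_constant` and its parameters are hypothetical, only
integer elements are modelled, and "maximum permitted Saturated value" is
assumed to mean the largest value representable in the element width for
the chosen signedness.

```python
# Illustrative sketch only; the function and its parameters are hypothetical.

def swizzle_constant(code, elwidth, sat=None):
    """Value written to the destination for a constant swizzle option.

    code:    0b010 (constant 0) or 0b011 (constant 1)
    elwidth: element width in bits, e.g. 8
    sat:     None (Saturation off), "signed" or "unsigned"
    """
    if code == 0b010:
        return 0                          # Constant 0 is never altered
    if code != 0b011:
        raise ValueError("not a constant swizzle option")
    if sat is None:
        return 1
    if sat == "signed":
        return (1 << (elwidth - 1)) - 1   # 0x7f when elwidth is 8
    if sat == "unsigned":
        return (1 << elwidth) - 1         # 0xff when elwidth is 8
    raise ValueError("unknown saturation mode")

print(hex(swizzle_constant(0b011, 8, "signed")))    # prints 0x7f
print(hex(swizzle_constant(0b011, 8, "unsigned")))  # prints 0xff
```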

# Pack/Unpack Mode

It is possible to apply Pack and Unpack to Vectorised
swizzle moves. The interaction requires specific explanation
because it involves two separate SUBVLs (the destination SUBVL
being distinct from the source). The key to understanding is that the
source and
destination SUBVL become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `SVSTATE.PACK` and `SVSTATE.UNPACK`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

```
    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)
```

For a separate source/dest SUBVL (again, no elwidth overrides):

```
    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    yield j*VL+i
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    yield i*dst_subvl+j

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    yield j*VL+i
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    yield i*SUBVL+j
```

"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.

Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order.  Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed.  Contrast this with the case where
only `PACK_en` is set: it is the *destination* that is an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```
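
The step-count selection above can be isolated into a small function to
make the four Pack/Unpack combinations easy to tabulate. This is a sketch
that mirrors the pseudocode only; `vertical_first_num_runs` is a
hypothetical name.

```python
# Illustrative sketch only; mirrors the num_runs selection in the pseudocode.

def vertical_first_num_runs(subvl, dst_subvl, pack_en, unpack_en):
    """Number of sub-steps performed per Vertical-First step."""
    if pack_en and unpack_en:
        return 1            # both source and destination are outer loops
    if pack_en:
        return dst_subvl    # destination is still an inner loop
    return subvl            # source subvector is run through as a group

# source vec3, destination vec2
for pack_en in (False, True):
    for unpack_en in (False, True):
        runs = vertical_first_num_runs(3, 2, pack_en, unpack_en)
        print(pack_en, unpack_en, runs)
```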