[[!tag standards]]

# mv.swizzle

Links

*
*

Swizzle is usually done on a per-operand basis in 3D GPU ISAs, making for
extremely long instructions (64 bits or greater). The value of Swizzle lies
in its high occurrence in 3D Shader Binaries (over 10% of all instructions);
however it is not practical to add two or more sets of 12-bit prefixes into
a single instruction. A compromise is to provide a Swizzle "Move". The
encoding for this instruction embeds static predication into the swizzle,
as well as the constants 1/1.0 and 0/0.0.

# Format

| 0.5 | 6.10 | 11.15 | 16.27 | 28.31 | name     | Form    |
|-----|------|-------|-------|-------|----------|---------|
| PO  | RTp  | RAp   | imm   | 0011  | mv.swiz  | DQ-Form |
| PO  | RTp  | RAp   | imm   | 1011  | fmv.swiz | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27. Each swizzle mnemonic
(XYZW), commonly known from 3D GPU programming, has an associated index.
3 bits of the immediate are allocated to each:

| imm     | 0.2 | 3.5 | 6.8 | 9.11 |
|---------|-----|-----|-----|------|
| swizzle | X   | Y   | Z   | W    |
| index   | 0   | 1   | 2   | 3    |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 is not needed (reserved)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN to copy from the sub-element at index NN (0 thru 3, i.e. position
  X, Y, Z or W)

Efforts to encode the 12-bit swizzle into fewer bits proved unsuccessful:
7^4 comes out to 2,401, which is more than will fit in 11 bits (2,048).
Note that 7 options are needed (not 6) because the 7th option allows static
predicate masking to be encoded within the swizzle immediate.

For example this allows "W.Y." to specify: "copy W to position X, and Y to
position Z, leave the other two positions Y and W unaltered":

    index:  0 1 2 3
    src:    X Y Z W
              |   |
              |   +------> copied to position 0
              +----------> copied to position 2
    dest:   W Y Y W        (positions 1 and 3 unaltered)

**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would be an
eye-popping 8 64-bit operands: 4-in, 4-out.
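The per-lane options listed above can be sketched in Python. This is a
behavioural illustration only: the function name and the list-based
interface are invented here, not taken from the specification.

```python
# Behavioural sketch of the per-lane swizzle options. Lane codes follow
# the table above: 0b000 skip, 0b010 constant 0, 0b011 constant 1,
# 0b1NN copy from sub-element NN; 0b001 is reserved.

def apply_swizzle(codes, src, dest, one=1):
    """codes: four 3-bit values for lanes X, Y, Z, W (positions 0..3).
       src:   the four source sub-elements.
       dest:  the four destination sub-elements; skipped lanes keep
              their existing value (the RA=RT in-place case).
       one:   the value used for "constant 1" (1.0 for fmv.swiz)."""
    result = list(dest)
    for pos, code in enumerate(codes):
        if code == 0b000:          # skip: leave dest lane unaltered
            continue
        elif code == 0b010:        # constant 0 (or 0.0)
            result[pos] = 0
        elif code == 0b011:        # constant 1 (or 1.0)
            result[pos] = one
        elif code & 0b100:         # 0b1NN: copy from src sub-element NN
            result[pos] = src[code & 0b011]
        else:                      # 0b001: reserved
            raise ValueError("reserved swizzle code 0b001")
    return result
```

With this sketch, the "W.Y." example above is codes
`[0b111, 0b000, 0b101, 0b000]` applied in-place to `XYZW`, yielding `WYYW`.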
As part of a Scalar ISA this is not practical. A compromise is to cut the
number of registers required in half. When part of the Scalar Power ISA
(not SVP64 Vectorised), mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to 2-in, 2-out pairs of 64-bit
registers:

| swizzle name | source | dest | half    |
|--------------|--------|------|---------|
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT not covered by the
Swizzle is unmodified. For example a Swizzle of "ZW.." will copy the
contents of RA+1 into RT but leave RT+1 unmodified.

When `RA!=RT`, any part of RT or RT+1 not set as a destination by the
Swizzle will be set to zero. A Swizzle of "ZW.." would copy the contents
of RA+1 into RT, but set RT+1 to zero.

Also, making life easier, RT and RA are only permitted to be even (no
overlapping can occur). This makes RT (and RA) a "pair", exactly like `lq`
and `stq`. Swizzle instructions must be atomically indivisible: an
Exception or Interrupt may not occur during the pair of Moves.

**SVP64 Vectorised**

When Vectorised, given that the use-case is a High-performance GPU, the
fundamental assumption is that Micro-coding or another technique will be
deployed in hardware to issue multiple Scalar MV operations, which would
be impractical in a smaller Scalar-only Micro-architecture. Therefore the
restriction imposed on the Scalar `mv.swiz` to 32-bit quantities as the
default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of whom
may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed, but only for
Horizontal-First Mode:

* In Horizontal-First Mode, an overlap between all and any Vectorised
  sources and destination Elements for the entirety of the Vector Loop
  `0..VL-1` is `UNDEFINED` behaviour.
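The register-pair mapping and the `RA=RT` / `RA!=RT` zeroing rules can be
modelled as follows. This is an illustrative sketch, not the official
pseudocode: all helper names are invented, and registers are modelled as
plain Python integers.

```python
# Model of the Scalar mv.swiz register mapping: sub-elements X,Y,Z,W
# live in the lo/hi 32-bit halves of the even register pairs RA/RA+1
# and RT/RT+1 (see the table above).

def split32(pair):
    """Split a pair of 64-bit registers into sub-elements [X, Y, Z, W]."""
    return [pair[0] & 0xFFFFFFFF, (pair[0] >> 32) & 0xFFFFFFFF,
            pair[1] & 0xFFFFFFFF, (pair[1] >> 32) & 0xFFFFFFFF]

def join32(e):
    """Reassemble sub-elements [X, Y, Z, W] into a 64-bit register pair."""
    return [e[0] | (e[1] << 32), e[2] | (e[3] << 32)]

def scalar_mv_swiz(codes, ra_pair, rt_pair, in_place):
    src = split32(ra_pair)
    # RA=RT: lanes not covered by the Swizzle keep RT's existing value;
    # RA!=RT: they are set to zero.
    dest = split32(rt_pair) if in_place else [0, 0, 0, 0]
    for pos, code in enumerate(codes):
        if code & 0b100:            # 0b1NN: copy sub-element NN
            dest[pos] = src[code & 0b011]
        elif code == 0b010:         # constant 0
            dest[pos] = 0
        elif code == 0b011:         # constant 1
            dest[pos] = 1
        # 0b000: skip (leave as initialised above)
    return join32(dest)
```

For example, the "ZW.." Swizzle is codes `[0b110, 0b111, 0b000, 0b000]`:
with `in_place=True` the result is RT = old RA+1 and RT+1 unmodified;
with `in_place=False`, RT+1 comes out as zero.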
* In Vertical-First Mode, an overlap on any given single execution of the
  Swizzle instruction requires that all Swizzled source elements be copied
  into intermediary buffers (in-flight Reservation Stations, pipeline
  registers) **before** being swapped and placed in their destinations.
  Strict Program Order is required in full.

*Implementor's note: the cost in an Embedded design of storing four 64-bit
in-flight elements for Vertical-First Mode may be considered too high. If
this is the case it is acceptable to throw an Illegal Instruction Trap and
emulate the instruction in software. Performance will obviously be
adversely affected. See [[sv/compliancy_levels]]: all aspects of Swizzle
are entirely optional in hardware at the Embedded Level.*

Implementors must consider Swizzle instructions to be atomically
indivisible, even if implemented as Micro-coded. The rest of SVP64 permits
element-level operations to be Precise-Interrupted: *Swizzle moves do not*.
All XYZW elements *must* be completed in full before any Trap or Interrupt
is permitted to be serviced. Out-of-Order Micro-architectures may of course
cancel the in-flight instruction as usual if the Interrupt requires fast
servicing.

# RM Mode Concept: MVRM-2P-2S1D

| Field Name   | Field bits | Description                          |
|--------------|------------|--------------------------------------|
| Rdest_EXTRA2 | `10:11`    | extends Rdest (R\*\_EXTRA2 Encoding) |
| Rsrc_EXTRA2  | `12:13`    | extends Rsrc (R\*\_EXTRA2 Encoding)  |
| src_SUBVL    | `14:15`    | SUBVL for Source                     |
| MASK_SRC     | `16:18`    | Execution Mask for Source            |

The inclusion of a separate src SUBVL allows `sv.mv.swiz RT.vecN RA.vecN`
to mean zip/unzip (pack/unpack). This is conceptually achieved by having
both source and destination SUBVL be "outer" loops instead of inner loops.
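Field extraction for the table above might look like the following. This
assumes the bit numbers index a 24-bit RM in Power-ISA-style MSB0 order
(bit 0 most significant); that numbering convention is an assumption here,
and the helper names are invented for illustration.

```python
# Illustrative extraction of the MVRM-2P-2S1D fields from a 24-bit RM,
# assuming MSB0 bit numbering (bit 0 = most significant of the 24 bits).

def rm_field(rm, start, end, rm_width=24):
    """Extract the inclusive MSB0 bit range start:end from rm."""
    shift = rm_width - 1 - end
    mask = (1 << (end - start + 1)) - 1
    return (rm >> shift) & mask

def decode_mvrm(rm):
    return {
        "Rdest_EXTRA2": rm_field(rm, 10, 11),
        "Rsrc_EXTRA2":  rm_field(rm, 12, 13),
        "src_SUBVL":    rm_field(rm, 14, 15),
        "MASK_SRC":     rm_field(rm, 16, 18),
    }
```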
Illustrating a "normal" SVP64 operation with `SUBVL!=1` (assuming no
elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL, inner-VL loop with SRC SUBVL
    def index_src():
        for j in range(SRC_SUBVL):
            for i in range(VL):
                yield i+VL*j

    # yield an outer-SUBVL, inner-VL loop with DEST SUBVL
    def index_dest():
        for j in range(SUBVL):
            for i in range(VL):
                yield i+VL*j

    # walk through both source and dest indices simultaneously
    for src_idx, dst_idx in zip(index_src(), index_dest()):
        move_operation(RT+dst_idx, RA+src_idx)

Python's "yield" is used here for simplicity and clarity. The two Finite
State Machines for the generation of the source and destination element
offsets progress incrementally in lock-step.

Although not prohibited, it is not expected that Software would set both
source and destination SUBVL at the same time. Usually, either
`SRC_SUBVL=1, SUBVL=2/3/4` is used to give a "pack" effect, or
`SUBVL=1, SRC_SUBVL=2/3/4` to give an "unpack".
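The pack/unpack effect described above amounts, conceptually, to a
VL x SUBVL transpose between Array-of-Structures and Structure-of-Arrays
element layouts. The following runnable sketch illustrates that concept
only; it is not the specification's exact FSM, and the names and example
values (VL=3, SUBVL=2) are chosen purely for illustration.

```python
# Conceptual illustration of "unpack": sequential AoS elements
# (x0 y0 x1 y1 x2 y2) land at transposed SoA offsets (x0 x1 x2 y0 y1 y2).

def pack_indices(VL, SUBVL):
    """Destination offset for each sequential source element: AoS -> SoA."""
    return [j * VL + i for i in range(VL) for j in range(SUBVL)]

# VL=3 vec2 elements laid out Array-of-Structures style
aos = ["x0", "y0", "x1", "y1", "x2", "y2"]
soa = [None] * len(aos)
for src_off, dst_off in zip(range(len(aos)), pack_indices(3, 2)):
    soa[dst_off] = aos[src_off]
# soa is now ["x0", "x1", "x2", "y0", "y1", "y2"]
```

Reading with one ordering and writing with the other is what produces the
zip/unzip behaviour; running the transpose in the opposite direction gives
the corresponding "pack".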