| swizzle name | source | dest | half |
|-- | -- | -- | -- |
-| X | RS | RT | lo-half |
-| Y | RS | RT | hi-half |
-| Z | RS+1 | RT+1 | lo-half |
-| W | RS+1 | RT+1 | hi-half |
+| X | RA | RT | lo-half |
+| Y | RA | RT | hi-half |
+| Z | RA+1 | RT+1 | lo-half |
+| W | RA+1 | RT+1 | hi-half |
-When RS=RT (in-place swizzle) any portion of RT not covered by
+When RA=RT (in-place swizzle) any portion of RT not covered by
the Swizzle is unmodified. For example a Swizzle of "..XY"
-will copy the contents RS+1 into RT but leave RT+1 unmodified.
+will copy the contents of RA+1 into RT but leave RT+1 unmodified.
-When RS!=RT any part of RT or RT+1 not set as a destination by
+When RA!=RT any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero. A Swizzle of "..XY" would
-copy the contents RS+1 into RT, but set RT+1 to zero.
+copy the contents of RA+1 into RT, but set RT+1 to zero.
Also, making life easier, RT and RA are only permitted to be even
-(no overlapping can occur). This makes RT (and RS) a "pair" exactly
+(no overlapping can occur). This makes RT (and RA) a "pair" exactly
like `lq` and `stq`
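
The in-place versus zeroing rule above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the specification's pseudocode: the helper names (`read_lane`, `write_lane`, `mv_swiz`) are hypothetical, the swizzle is represented as a plain destination-to-source dictionary rather than the 12-bit immediate, and "lo-half" is assumed to mean the least-significant 32 bits of a 64-bit register.

```python
# Illustrative model only: names and the dict-based swizzle are assumptions.
LANES = ("X", "Y", "Z", "W")   # X,Y = lo/hi half of reg; Z,W = lo/hi half of reg+1
MASK32 = 0xFFFF_FFFF

def read_lane(regs, base, lane):
    """Return the 32-bit lane of the register pair starting at `base`."""
    idx = LANES.index(lane)
    shift = 32 * (idx % 2)     # assumption: lo-half is the least-significant half
    return (regs[base + idx // 2] >> shift) & MASK32

def write_lane(regs, base, lane, value):
    """Overwrite one 32-bit lane of the register pair starting at `base`."""
    idx = LANES.index(lane)
    shift = 32 * (idx % 2)
    reg = base + idx // 2
    regs[reg] = (regs[reg] & ~(MASK32 << shift)) | ((value & MASK32) << shift)

def mv_swiz(regs, rt, ra, mapping):
    """mapping: {destination lane: source lane}, e.g. {"X": "Z", "Y": "W"}."""
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must both be even (register pairs)")
    # Read every source lane before writing anything, so that an
    # in-place swizzle never sees a half-updated register.
    staged = {dst: read_lane(regs, ra, src) for dst, src in mapping.items()}
    if rt != ra:
        # Not in-place: lanes not named as a destination become zero.
        regs[rt] = 0
        regs[rt + 1] = 0
    # In-place (RA == RT): lanes not named as a destination are left alone.
    for dst, value in staged.items():
        write_lane(regs, rt, dst, value)
```

With the mapping `{"X": "Z", "Y": "W"}` (the contents of RA+1 copied into RT), calling `mv_swiz(regs, 4, 2, ...)` leaves `regs[5]` at zero, whereas `mv_swiz(regs, 2, 2, ...)` leaves `regs[3]` untouched. Because both register numbers must be even, the two pairs are either identical or disjoint, which is why no partial-overlap case appears in the model.
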
**SVP64 Vectorised**
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.
+Additionally, in order to make life easier for implementers, some of
+whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
+the usual strict Element-level Program Order is relaxed but only for
+Horizontal-First Mode:
+
+* In Horizontal-First Mode, any overlap between Vectorised source and
+ destination Elements, anywhere across the entirety of
+ the Vector Loop `0..VL-1`, is `UNDEFINED` behaviour.
+* In Vertical-First Mode, an overlap on any given one execution of
+ the Swizzle instruction requires that all Swizzled source elements be
+ copied into intermediary buffers (in-flight Reservation Stations,
+ pipeline registers) **before** being swapped and placed in
+ destinations. Strict Program Order is required in full.
+
+*Implementor's note: in an Embedded design, the cost in Vertical-First Mode
+of storing four 64-bit in-flight elements may be too high. If this is the
+case it is acceptable to raise an Illegal Instruction Trap.
+See [[sv/compliancy_levels]]*
+
# Format
| 0.5 |6.10|11.15|16.27|28.31| name |
|-----|----|-----|-----|-----|--------------|
-|PO | RTp| RSp |imm | 0011| mv.swiz |
-|PO | RTp| RSp |imm | 1011| fmv.swiz |
+|PO | RTp| RAp |imm | 0011| mv.swiz |
+|PO | RTp| RAp |imm | 1011| fmv.swiz |
this gives a 12 bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,