[[!tag standards]]

# mv.swizzle

Links

*
*

Swizzle is usually done on a per-operand basis in 3D GPU ISAs, making for
extremely long instructions (64 bits or greater). The value of Swizzle lies
in its high occurrence in 3D Shader Binaries (over 10% of all instructions);
however it is not practical to add two or more sets of 12-bit prefixes into
a single instruction. A compromise is to provide a Swizzle "Move". The
encoding for this instruction embeds static predication into the swizzle,
as well as the constants 1/1.0 and 0/0.0.

# Format

| 0.5 | 6.10 | 11.15 | 16.27 | 28.31 | name     | Form    |
|-----|------|-------|-------|-------|----------|---------|
| PO  | RTp  | RAp   | imm   | 0011  | mv.swiz  | DQ-Form |
| PO  | RTp  | RAp   | imm   | 1011  | fmv.swiz | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27. Each swizzle mnemonic
(XYZW), commonly known from 3D GPU programming, has an associated index.
3 bits of the immediate are allocated to each:

| imm     | 0.2 | 3.5 | 6.8 | 9.11 |
|---------|-----|-----|-----|------|
| swizzle | X   | Y   | Z   | W    |
| index   | 0   | 1   | 2   | 3    |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 is not needed (reserved)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN to copy from the sub-element at index NN (0 thru 3, i.e. position
  X, Y, Z or W)

Efforts to encode the 12-bit swizzle into fewer bits proved unsuccessful:
7^4 comes out to 2,401, which is more than will fit in 11 bits (2,048).
Note that 7 options are needed (not 6) because the 7th option allows static
predicate masking to be encoded within the swizzle immediate.

For example this allows "W.Y." to specify: "copy W to position X, and Y to
position Z, leave the other two positions Y and W unaltered":

    index:  0 1 2 3
    src:    X Y Z W
              |   |
              |   +------> copied to position 0
              +----------> copied to position 2
    dest:   W Y Y W        (positions 1 and 3 unaltered)

**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would be an
eye-popping 8 64-bit operands: 4-in, 4-out.
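The per-lane options listed above can be sketched in Python. This is a
behavioural illustration only: the function name and the list-based
interface are invented here, not taken from the specification.

```python
# Behavioural sketch of the per-lane swizzle options. Lane codes follow
# the table above: 0b000 skip, 0b010 constant 0, 0b011 constant 1,
# 0b1NN copy from sub-element NN; 0b001 is reserved.

def apply_swizzle(codes, src, dest, one=1):
    """codes: four 3-bit values for lanes X, Y, Z, W (positions 0..3).
       src:   the four source sub-elements.
       dest:  the four destination sub-elements; skipped lanes keep
              their existing value (the RA=RT in-place case).
       one:   the value used for "constant 1" (1.0 for fmv.swiz)."""
    result = list(dest)
    for pos, code in enumerate(codes):
        if code == 0b000:          # skip: leave dest lane unaltered
            continue
        elif code == 0b010:        # constant 0 (or 0.0)
            result[pos] = 0
        elif code == 0b011:        # constant 1 (or 1.0)
            result[pos] = one
        elif code & 0b100:         # 0b1NN: copy from src sub-element NN
            result[pos] = src[code & 0b011]
        else:                      # 0b001: reserved
            raise ValueError("reserved swizzle code 0b001")
    return result
```

With this sketch, the "W.Y." example above is codes
`[0b111, 0b000, 0b101, 0b000]` applied in-place to `XYZW`, yielding `WYYW`.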
As part of a Scalar ISA this is not practical. A compromise is to cut the
number of registers required in half. When part of the Scalar Power ISA
(not SVP64 Vectorised), mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to 2-in, 2-out pairs of 64-bit
registers:

| swizzle name | source | dest | half    |
|--------------|--------|------|---------|
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT not covered by the
Swizzle is unmodified. For example a Swizzle of "ZW.." will copy the
contents of RA+1 into RT but leave RT+1 unmodified.

When `RA!=RT`, any part of RT or RT+1 not set as a destination by the
Swizzle will be set to zero. A Swizzle of "ZW.." would copy the contents
of RA+1 into RT, but set RT+1 to zero.

Also, making life easier, RT and RA are only permitted to be even (no
overlapping can occur). This makes RT (and RA) a "pair", exactly like `lq`
and `stq`. Swizzle instructions must be atomically indivisible: an
Exception or Interrupt may not occur during the pair of Moves.

**SVP64 Vectorised**

When Vectorised, given that the use-case is a High-performance GPU, the
fundamental assumption is that Micro-coding or another technique will be
deployed in hardware to issue multiple Scalar MV operations, which would
be impractical in a smaller Scalar-only Micro-architecture. Therefore the
restriction imposed on the Scalar `mv.swiz` to 32-bit quantities as the
default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of whom
may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed, but only for
Horizontal-First Mode:

* In Horizontal-First Mode, an overlap between all and any Vectorised
  sources and destination Elements for the entirety of the Vector Loop
  `0..VL-1` is `UNDEFINED` behaviour.
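The register-pair mapping and the `RA=RT` / `RA!=RT` zeroing rules can be
modelled as follows. This is an illustrative sketch, not the official
pseudocode: all helper names are invented, and registers are modelled as
plain Python integers.

```python
# Model of the Scalar mv.swiz register mapping: sub-elements X,Y,Z,W
# live in the lo/hi 32-bit halves of the even register pairs RA/RA+1
# and RT/RT+1 (see the table above).

def split32(pair):
    """Split a pair of 64-bit registers into sub-elements [X, Y, Z, W]."""
    return [pair[0] & 0xFFFFFFFF, (pair[0] >> 32) & 0xFFFFFFFF,
            pair[1] & 0xFFFFFFFF, (pair[1] >> 32) & 0xFFFFFFFF]

def join32(e):
    """Reassemble sub-elements [X, Y, Z, W] into a 64-bit register pair."""
    return [e[0] | (e[1] << 32), e[2] | (e[3] << 32)]

def scalar_mv_swiz(codes, ra_pair, rt_pair, in_place):
    src = split32(ra_pair)
    # RA=RT: lanes not covered by the Swizzle keep RT's existing value;
    # RA!=RT: they are set to zero.
    dest = split32(rt_pair) if in_place else [0, 0, 0, 0]
    for pos, code in enumerate(codes):
        if code & 0b100:            # 0b1NN: copy sub-element NN
            dest[pos] = src[code & 0b011]
        elif code == 0b010:         # constant 0
            dest[pos] = 0
        elif code == 0b011:         # constant 1
            dest[pos] = 1
        # 0b000: skip (leave as initialised above)
    return join32(dest)
```

For example, the "ZW.." Swizzle is codes `[0b110, 0b111, 0b000, 0b000]`:
with `in_place=True` the result is RT = old RA+1 and RT+1 unmodified;
with `in_place=False`, RT+1 comes out as zero.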
* In Vertical-First Mode, an overlap on any given single execution of the
  Swizzle instruction requires that all Swizzled source elements be copied
  into intermediary buffers (in-flight Reservation Stations, pipeline
  registers) **before** being swapped and placed in their destinations.
  Strict Program Order is required in full.

*Implementor's note: the cost in an Embedded design of storing four 64-bit
in-flight elements for Vertical-First Mode may be considered too high. If
this is the case it is acceptable to throw an Illegal Instruction Trap and
emulate the instruction in software. Performance will obviously be
adversely affected. See [[sv/compliancy_levels]]: all aspects of Swizzle
are entirely optional in hardware at the Embedded Level.*

Implementors must consider Swizzle instructions to be atomically
indivisible, even if implemented as Micro-coded. The rest of SVP64 permits
element-level operations to be Precise-Interrupted: *Swizzle moves do not*.
All XYZW elements *must* be completed in full before any Trap or Interrupt
is permitted to be serviced. Out-of-Order Micro-architectures may of course
cancel the in-flight instruction as usual if the Interrupt requires fast
servicing.

# RM Mode Concept: MVRM-2P-2S1D

| Field Name   | Field bits | Description                          |
|--------------|------------|--------------------------------------|
| Rdest_EXTRA2 | `10:11`    | extends Rdest (R\*\_EXTRA2 Encoding) |
| Rsrc_EXTRA2  | `12:13`    | extends Rsrc (R\*\_EXTRA2 Encoding)  |
| src_SUBVL    | `14:15`    | SUBVL for Source                     |
| MASK_SRC     | `16:18`    | Execution Mask for Source            |

The inclusion of a separate src SUBVL allows `sv.mv.swiz RT.vecN RA.vecN`
to mean zip/unzip (pack/unpack). This is conceptually achieved by having
both source and destination SUBVL be "outer" loops instead of inner loops.
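Field extraction for the table above might look like the following. This
assumes the bit numbers index a 24-bit RM in Power-ISA-style MSB0 order
(bit 0 most significant); that numbering convention is an assumption here,
and the helper names are invented for illustration.

```python
# Illustrative extraction of the MVRM-2P-2S1D fields from a 24-bit RM,
# assuming MSB0 bit numbering (bit 0 = most significant of the 24 bits).

def rm_field(rm, start, end, rm_width=24):
    """Extract the inclusive MSB0 bit range start:end from rm."""
    shift = rm_width - 1 - end
    mask = (1 << (end - start + 1)) - 1
    return (rm >> shift) & mask

def decode_mvrm(rm):
    return {
        "Rdest_EXTRA2": rm_field(rm, 10, 11),
        "Rsrc_EXTRA2":  rm_field(rm, 12, 13),
        "src_SUBVL":    rm_field(rm, 14, 15),
        "MASK_SRC":     rm_field(rm, 16, 18),
    }
```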
Illustrating a "normal" SVP64 operation with `SUBVL!=1` (assuming no
elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL, inner-VL loop with SRC SUBVL
    def index_src():
        for j in range(SRC_SUBVL):
            for i in range(VL):
                yield i+VL*j

    # yield an outer-SUBVL, inner-VL loop with DEST SUBVL
    def index_dest():
        for j in range(SUBVL):
            for i in range(VL):
                yield i+VL*j

    # walk through both source and dest indices simultaneously
    for src_idx, dst_idx in zip(index_src(), index_dest()):
        move_operation(RT+dst_idx, RA+src_idx)

Python's "yield" is used here for simplicity and clarity. The two Finite
State Machines for the generation of the source and destination element
offsets progress incrementally in lock-step.

Although not prohibited, it is not expected that Software would set both
source and destination SUBVL at the same time. Usually, either
`SRC_SUBVL=1, SUBVL=2/3/4` is used to give a "pack" effect, or
`SUBVL=1, SRC_SUBVL=2/3/4` to give an "unpack".
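The pack/unpack effect described above amounts, conceptually, to a
VL x SUBVL transpose between Array-of-Structures and Structure-of-Arrays
element layouts. The following runnable sketch illustrates that concept
only; it is not the specification's exact FSM, and the names and example
values (VL=3, SUBVL=2) are chosen purely for illustration.

```python
# Conceptual illustration of "unpack": sequential AoS elements
# (x0 y0 x1 y1 x2 y2) land at transposed SoA offsets (x0 x1 x2 y0 y1 y2).

def pack_indices(VL, SUBVL):
    """Destination offset for each sequential source element: AoS -> SoA."""
    return [j * VL + i for i in range(VL) for j in range(SUBVL)]

# VL=3 vec2 elements laid out Array-of-Structures style
aos = ["x0", "y0", "x1", "y1", "x2", "y2"]
soa = [None] * len(aos)
for src_off, dst_off in zip(range(len(aos)), pack_indices(3, 2)):
    soa[dst_off] = aos[src_off]
# soa is now ["x0", "x1", "x2", "y0", "y1", "y2"]
```

Reading with one ordering and writing with the other is what produces the
zip/unzip behaviour; running the transpose in the opposite direction gives
the corresponding "pack".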