[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is usually done on a per-operand basis in 3D GPU ISAs, making
for extremely long instructions (64 bits or greater).
The value of per-operand Swizzle lies in its high occurrence
in 3D Shader Binaries (over 10% of all instructions).
However, it is not practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move".
The encoding for this instruction embeds static predication into the
swizzle, as well as the constants 1/1.0 and 0/0.0.

# Format

| 0.5 |6.10|11.15|16.27|28.31|  name        |
|-----|----|-----|-----|-----|--------------|
|PO   | RTp| RAp |imm  | 0011| mv.swiz      |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index.  3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip".  This is equivalent to predicate masking
* 0b001 is not needed (reserved)
* 0b010 to indicate "constant 0" (or 0.0)
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 through 3 to copy from the subelement in position XYZW

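The field layout and option values above can be sketched as a small encoder and decoder. This is only an illustration: the function names are hypothetical, the use of `.` for "skip" and `0`/`1` for the constants follows the swizzle strings used elsewhere on this page, and the placement of X in the most-significant three bits assumes the immediate's bits 0.2 follow Power ISA MSB0 numbering.

```python
# Hypothetical helper names. Assumes MSB0 numbering inside the immediate,
# so imm bits 0..2 (position X) are the most-significant 3 of the 12 bits.
OPTION = {".": 0b000,               # skip (predicate-masked out)
          "0": 0b010, "1": 0b011,   # constants 0 and 1
          "X": 0b100, "Y": 0b101, "Z": 0b110, "W": 0b111}

def encode_swizzle(text):
    """Pack a four-character swizzle such as "W.Y." into a 12-bit immediate."""
    if len(text) != 4:
        raise ValueError("a swizzle names exactly four positions")
    imm = 0
    for ch in text:
        imm = (imm << 3) | OPTION[ch.upper()]
    return imm

def decode_swizzle(imm):
    """Unpack a 12-bit immediate into four 3-bit options, X position first."""
    fields = [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]
    if 0b001 in fields:
        raise ValueError("0b001 is reserved")
    return fields
```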
Efforts to encode the 12-bit swizzle in fewer bits proved unsuccessful: 7^4 comes out to 2,401, which is larger than the 2,048 values that 11 bits can hold.

Note that 7 options are needed (not 6) because the 7th option allows static
predicate masking to be encoded within the swizzle immediate.
For example this allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered".

    0    1    2    3
    X    Y    Z    W
         |         |
         +----+    |
         |    |    |
    +--------------+
    |    |    |    |
    W    Y    Y    W

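As a rough model of the skip semantics, the following sketch applies a swizzle string to a four-element vector. The function name and the list-based representation are illustrative assumptions, not part of the specification; the character at each position names what is written to that position, and `.` leaves the destination untouched.

```python
def apply_swizzle(swizzle, src, dest):
    """Return a copy of dest with the swizzle applied, reading from src.

    '.' leaves dest[i] alone, '0'/'1' write integer constants,
    and X/Y/Z/W copy the corresponding element of src.
    """
    index = {"X": 0, "Y": 1, "Z": 2, "W": 3}
    out = list(dest)
    for pos, ch in enumerate(swizzle):
        if ch == ".":
            continue                      # statically masked out
        out[pos] = int(ch) if ch in "01" else src[index[ch]]
    return out

vec = ["x", "y", "z", "w"]
print(apply_swizzle("W.Y.", vec, vec))    # ['w', 'y', 'y', 'w']
```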
**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
need an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half.
When part of the Scalar Power ISA (not SVP64 Vectorised),
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to 2-in, 2-out pairs of 64-bit
registers:

| swizzle name | source | dest | half    |
|--            | --     | --   | --      |
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT or RT+1 not covered by
the Swizzle is unmodified.  For example a Swizzle of "..XY"
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero.  A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

Also, making life easier, RT and RA are only permitted to be even
(so no partial overlap between pairs can occur).  This makes RT (and RA)
a "pair" exactly like `lq` and `stq`.

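The scalar rules can be modelled roughly as follows. This is a sketch under stated assumptions rather than reference pseudocode: the function name and the register-file-as-list representation are hypothetical, "lo-half" is assumed to mean the least-significant 32 bits, and only the integer `mv.swiz` constants are modelled (not the 1.0/0.0 of `fmv.swiz`).

```python
MASK32 = 0xFFFFFFFF

def mv_swiz_scalar(regs, rt, ra, swizzle):
    """Model scalar mv.swiz on a list of 64-bit register values, in place.

    Assumption: "lo-half" is the least-significant 32 bits of a register.
    """
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must be even (register pairs)")
    # Gather X, Y, Z, W from the source pair before writing anything.
    src = []
    for reg in (regs[ra], regs[ra + 1]):
        src += [reg & MASK32, (reg >> 32) & MASK32]
    # In-place keeps unwritten parts; otherwise they default to zero.
    dst = list(src) if ra == rt else [0, 0, 0, 0]
    index = {"X": 0, "Y": 1, "Z": 2, "W": 3}
    for pos, ch in enumerate(swizzle):
        if ch == ".":
            continue
        dst[pos] = int(ch) if ch in "01" else src[index[ch]]
    regs[rt] = dst[0] | (dst[1] << 32)
    regs[rt + 1] = dst[2] | (dst[3] << 32)
```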
**SVP64 Vectorised**

When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will be deployed in hardware to issue multiple
Scalar MV operations, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementors, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed, but only for
Horizontal-First Mode:

* In Horizontal-First Mode, any overlap between Vectorised
  source and destination Elements across the entirety of
  the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.
* In Vertical-First Mode, an overlap on any given one execution of
  the Swizzle instruction requires that all Swizzled source elements be
  copied into intermediary buffers (in-flight Reservation Stations,
  pipeline registers) **before** being swapped and placed in
  destinations.  Strict Program Order is required in full.

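To illustrate why the Vertical-First rule asks for intermediary buffers, the sketch below compares an unbuffered in-place swizzle with a buffered one. Both function names are hypothetical, a plain Python list stands in for the in-flight buffers, and only the X/Y/Z/W copy options are modelled (skip and constants are ignored for brevity).

```python
INDEX = {"X": 0, "Y": 1, "Z": 2, "W": 3}

def swizzle_in_place_naive(vec, swizzle):
    """Write each destination immediately, with no buffering."""
    for pos, ch in enumerate(swizzle):
        if ch in INDEX:
            vec[pos] = vec[INDEX[ch]]   # may read an already-overwritten slot

def swizzle_in_place_buffered(vec, swizzle):
    """Snapshot every source element first, then write destinations."""
    buffered = list(vec)                # stands in for in-flight buffers
    for pos, ch in enumerate(swizzle):
        if ch in INDEX:
            vec[pos] = buffered[INDEX[ch]]

a, b = [1, 2, 3, 4], [1, 2, 3, 4]
swizzle_in_place_naive(a, "YXWZ")
swizzle_in_place_buffered(b, "YXWZ")
print(a)   # [2, 2, 4, 4]: sources were clobbered before being read
print(b)   # [2, 1, 4, 3]: the intended pairwise swap
```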
*Implementor's note: the cost of Vertical-First Mode in an Embedded design
of storing four 64-bit in-flight elements may be considered
too high. If this is the case, it is acceptable to throw an
Illegal Instruction Trap and emulate
the instruction in software. Performance will obviously be adversely affected.
See [[sv/compliancy_levels]]: all aspects of
Swizzle are entirely optional in hardware at the Embedded Level.*

Implementors must consider `SUBVL` to have been implicitly set by
the Swizzle instructions. Hardware may statically calculate `SUBVL`
from the immediate.  "W.0Z" is SUBVL=4, whereas "X0Z." is SUBVL=3,
and ".W.." sets SUBVL=2.  `SUBVL` has a different meaning
in Swizzle Move instructions,
as explained below.

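One rule consistent with the three examples above is that the implied `SUBVL` is one more than the index of the last non-skip position. The sketch below encodes that reading; it is inferred from the examples only, the function name is hypothetical, and the all-skip case is rejected because the text does not define it.

```python
def implied_subvl(swizzle):
    """SUBVL implied by a swizzle: last non-skip position, plus one.

    Inferred from the examples in the text, not a normative rule.
    """
    used = [pos for pos, ch in enumerate(swizzle) if ch != "."]
    if not used:
        raise ValueError("all-skip swizzle: SUBVL not defined by the text")
    return used[-1] + 1

for text in ("W.0Z", "X0Z.", ".W.."):
    print(text, implied_subvl(text))
```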
# RM Mode Concept:

MVRM-2P-2S1D:

| Field Name   | Field bits | Description                          |
|--------------|------------|--------------------------------------|
| Rdest_EXTRA2 | `10:11`    | extends Rdest (R\*\_EXTRA2 Encoding) |
| Rsrc_EXTRA2  | `12:13`    | extends Rsrc  (R\*\_EXTRA2 Encoding) |
| src_SUBVL    | `14:15`    | SUBVL for Source                     |
| MASK_SRC     | `16:18`    | Execution Mask for Source            |

The inclusion of a separate src SUBVL allows
`sv.mv.swiz RT.vecN RA.vecN` to mean zip/unzip (pack/unpack).
This is conceptually achieved by having both source and
destination SUBVL be "outer" loops instead of inner loops.