openpower/sv/mv.swizzle.mdwn

[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a type of permute shorthand allowing arbitrary selection
of elements from a vec2/3/4 to create a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
In 3D GPU ISAs Swizzle is usually done on a per-vec-operand basis, making
for extremely long instructions (64 bits or greater).
Here, however, it is not practical to add two or more sets of 12-bit
prefixes to a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*.  A contiguous
array of vec3 (XYZ) may have only 2 elements (ZY) of each vec3
swizzle-copied to a contiguous array of vec2.  A contiguous array of
vec2 sources may have each of its elements (XY) copied multiple times
to a contiguous vec4 array (YYXX or XYXX). For this reason, *when Vectorised*,
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vpermd` of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with up to four
selectors, whereas VSX refers to individual bytes and may not
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and often
use the letters "RGBA" when referring to pixel data.
These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

# Format

| 0.5 |6.10|11.15|16.27|28.31|  name        | Form    |
|-----|----|-----|-----|-----|--------------|-------- |
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index.  3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip".  This is equivalent to predicate masking
* 0b001 subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0" (or 0.0)
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 through 3 (NN) to copy from the subelement in position XYZW

In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]
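The field layout and selector options above can be sketched in Python as follows. This is only an illustrative sketch, not part of the specification: the names `decode_swizzle` and `selector_name` are hypothetical, and it assumes Power ISA MSB0 bit numbering, so that immediate bits 0.2 (the X field) are the three most significant bits of the 12-bit value.

```python
# Selector encodings, taken from the list of options above.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def decode_swizzle(imm):
    """Split a 12-bit immediate into four 3-bit selectors [X, Y, Z, W].

    Assumption: MSB0 bit numbering, i.e. the X field (bits 0.2) sits in
    the three most significant bits of the 12-bit immediate.
    """
    if not 0 <= imm < (1 << 12):
        raise ValueError("immediate must fit in 12 bits")
    return [(imm >> (9 - 3 * pos)) & 0b111 for pos in range(4)]

def selector_name(sel):
    """Human-readable name for one selector (hypothetical notation)."""
    fixed = {SKIP: "skip", END: "end", CONST0: "0", CONST1: "1"}
    if sel in fixed:
        return fixed[sel]
    return "XYZW"[sel & 0b11]   # 0b1NN: NN selects source X, Y, Z or W

# Example immediate: W into position X, skip, Y into position Z, skip
imm = 0b111_000_101_000
names = [selector_name(s) for s in decode_swizzle(imm)]
print(names)   # ['W', 'skip', 'Y', 'skip']
```

Keeping the decode separate from the data movement mirrors the way the immediate is purely static: it can be fully decoded before any register is read.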

Note that 7 options are needed (not 6) because option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, and leave the other two positions Y and W unaltered":

    0    1    2    3
    X    Y    Z    W  source
         |         |
         +----+    |
         |    |    |
    +--------------+
    |    |    |    |
    W    .    Y    .  swizzle
    |    |    |    |
    W    Y    Y    W  dest
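The diagram can be reproduced with a short Python sketch operating on a single vec4. The function name `swizzle_vec` and its `one` parameter are hypothetical, chosen only for illustration; the selector values are those listed earlier, and the end marker is treated simply as "stop writing".

```python
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def swizzle_vec(src, dest, swiz, one=1):
    """Return a copy of dest updated from src according to swiz.

    swiz holds one 3-bit selector per destination position (X, Y, Z, W).
    `one` is the value used for "constant 1" (1 or 1.0).
    """
    out = list(dest)
    for pos, sel in enumerate(swiz):
        if sel == END:        # end marker: no further positions are written
            break
        if sel == SKIP:       # static predicate masking: leave untouched
            continue
        if sel == CONST0:
            out[pos] = 0
        elif sel == CONST1:
            out[pos] = one
        else:                 # 0b1NN: copy source subelement NN
            out[pos] = src[sel & 0b11]
    return out

SEL_W, SEL_Y = 0b111, 0b101
vec = ["X", "Y", "Z", "W"]
# In-place "W.Y.": source and destination are the same vec4
result = swizzle_vec(vec, vec, [SEL_W, SKIP, SEL_Y, SKIP])
print(result)   # ['W', 'Y', 'Y', 'W']
```

Because the sketch copies the destination before writing, every selector reads the original source values, which is what the diagram depicts.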

**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
need an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on a par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised),
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:

| swizzle name | source | dest | half    |
|--            | --     | --   | --      |
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of the RT pair not covered by
the Swizzle is unmodified.  For example a Swizzle of "..XY"
(X to position Z, Y to position W)
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero.  A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

Also, to make life easier, RT and RA are only permitted to be even
(so no partial overlap of the pairs can occur).  This makes RT (and RA) a "pair" exactly
as in `lq` and `stq`.  Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair. This ensures that register file corruption does not occur.
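The Scalar rules above (register pairs, 32-bit halves, zeroing when `RA!=RT`, buffering the source pair) can be modelled in a few lines of Python. This is a sketch under stated assumptions, not a reference implementation: the register file is a plain list of 64-bit integers, "lo-half" is assumed to mean the least significant 32 bits, only the integer `mv.swiz` is modelled, and the names `read_pair` and `scalar_mv_swiz` are hypothetical.

```python
MASK32 = 0xFFFF_FFFF
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011

def read_pair(regs, r):
    """Return [X, Y, Z, W] as 32-bit values from registers r and r+1.

    Assumption: "lo-half" means the least significant 32 bits.
    """
    return [regs[r] & MASK32, regs[r] >> 32,
            regs[r + 1] & MASK32, regs[r + 1] >> 32]

def scalar_mv_swiz(regs, rt, ra, swiz):
    """Model the Scalar integer swizzle move on a list of 64-bit registers."""
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must be even")
    src = read_pair(regs, ra)                   # buffer the whole source pair first
    dest = list(src) if rt == ra else [0] * 4   # in-place keeps, otherwise zero
    for pos, sel in enumerate(swiz):
        if sel == END:
            break
        if sel == SKIP:
            continue
        if sel == CONST0:
            dest[pos] = 0
        elif sel == CONST1:
            dest[pos] = 1
        else:
            dest[pos] = src[sel & 0b11]
    regs[rt] = dest[0] | (dest[1] << 32)
    regs[rt + 1] = dest[2] | (dest[3] << 32)

regs = [0] * 8
regs[4] = 0x0000_0002_0000_0001   # X=1 (lo-half), Y=2 (hi-half)
regs[5] = 0x0000_0004_0000_0003   # Z=3 (lo-half), W=4 (hi-half)
regs[6] = regs[7] = 0xFFFF_FFFF_FFFF_FFFF
SEL_X, SEL_Y = 0b100, 0b101
scalar_mv_swiz(regs, 6, 4, [SEL_Y, SEL_X, SKIP, SKIP])   # "YX.." with RA != RT
scalar_mv_swiz(regs, 4, 4, [SEL_Y, SEL_X, SKIP, SKIP])   # "YX.." in place
```

With `RA!=RT` the skipped positions Z and W leave `regs[7]` zeroed, whereas the in-place call leaves `regs[5]` untouched; in both cases the two halves of the first register are exchanged.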

**SVP64 Vectorised**

Vectorised Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorised, given that the use-case is a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations,
together with full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorised
source and destination Elements across the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves: because there
is no overlap, there is no risk of destroying a source with
an overwrite.  This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
    swiz[0] = imm[0:3]   # X
    swiz[1] = imm[3:6]   # Y
    swiz[2] = imm[6:9]   # Z
    swiz[3] = imm[9:12]  # W
    # determine implied subvector length from Swizzle
    dst_subvl = 4
    for i in range(4):
        if swiz[i] == 0b001:
            # the end marker occupies position i and carries no
            # selector, so positions 0..i-1 form the destination
            dst_subvl = i
            break
```

What is going on here is that the option is provided to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate.  With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
is 4.  SUBVL is considered to be the "source" subvector
length.

Pseudocode exploiting python "yield" for clarity. Element-width overrides,
Saturation and Predication are left out, also for clarity:

```
    def index_src():
        for i in range(VL):
            for j in range(4): # one selector per destination position
                if swiz[j] == 0b000: # skip
                    continue
                if swiz[j] == 0b001: # end
                    break
                if swiz[j] in [0b010, 0b011]: # constant 0 or 1
                    yield (i*SUBVL, CONSTANT)
                else: # 0b1NN: offset NN (0..3) within the source subvector
                    yield (i*SUBVL, swiz[j]-4)

    def index_dest():
        for i in range(VL):
            for j in range(dst_subvl):
                if swiz[j] == 0b000: # skip
                    continue
                if swiz[j] == 0b001: # end (keeps in step with index_src)
                    break
                yield i*dst_subvl+j

    # walk through both source and dest indices simultaneously
    for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
        if offs == CONSTANT:
            set(RT+dst_idx, CONSTANT)
        else:
            move_operation(RT+dst_idx, RA+src_idx+offs)
```

**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not.  This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands.  If Vertical-First
performed only one of the *sub*-elements at a time rather
than operating on the entire vec2/3/4 together, it would
violate that expectation.  The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorised Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R, G, B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero, because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.
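The choice of value for the "Constant 1" selector can be sketched as below. The function name `constant_one` is hypothetical. The pairing of 0x7f with signed saturation (`sats`) comes from the example above; pairing 0xff with an unsigned saturation mode, written here as `satu`, is an assumption inferred from the mention of 0xff.

```python
def constant_one(elwidth, sat=None):
    """Value inserted for selector 0b011 at a given element width in bits.

    sat=None   : plain Constant 1
    sat="sats" : signed saturation, the signed maximum (0x7f at ew=8)
    sat="satu" : unsigned saturation (assumed), all ones (0xff at ew=8)
    """
    if sat is None:
        return 1
    if sat == "sats":
        return (1 << (elwidth - 1)) - 1
    if sat == "satu":
        return (1 << elwidth) - 1
    raise ValueError("unknown saturation mode: %r" % (sat,))

print(hex(constant_one(8, "sats")))   # 0x7f
print(hex(constant_one(8, "satu")))   # 0xff
print(constant_one(8))                # 1
```

Constant 0 needs no equivalent helper because, as stated above, it always inserts zero.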

# Pack/Unpack Mode:

It is possible to apply Pack and Unpack to Vectorised
swizzle moves, and these instructions are of EXTRA type
`RM-2P-1S1D-PU`. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and
destination SUBVL loops become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `PACK_en` and `UNPACK_en`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    ....

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    ....

"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.
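The difference between the two loop orders can be seen by listing the order in which (element, subelement) pairs are visited. The sketch below deliberately yields only the `(i, j)` pair and does not fill in the element offsets elided as `....` above; the name `visit_order` is hypothetical.

```python
def visit_order(vl, subvl, outer):
    """Yield (i, j) pairs: i is the element index, j the subelement index.

    outer=True  : subvector loop is the outer loop (Pack/Unpack style)
    outer=False : subvector loop is the inner loop (normal SVP64 order)
    """
    if outer:
        for j in range(subvl):
            for i in range(vl):
                yield (i, j)
    else:
        for i in range(vl):
            for j in range(subvl):
                yield (i, j)

inner_order = list(visit_order(2, 3, outer=False))
outer_order = list(visit_order(2, 3, outer=True))
print(inner_order)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(outer_order)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

With the inner order each vec3 is completed before moving to the next; with the outer order all first subelements are visited, then all second subelements, and so on.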

Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order.  Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed.  Contrast this with the case
where only `PACK_en` is set: it is the *destination* that is an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```
 318         move_operation(RT+dst_idx, RA+src_idx+offs)
 319 ```