[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a type of permute shorthand allowing arbitrary selection
of elements from a vec2/3/4 to create a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
Swizzle is usually done on a per-vec-operand basis in 3D GPU ISAs, making
for extremely long instructions (64 bits or greater).
However, it is not practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0, and if Saturation
is enabled maximum arithmetic constants may be placed into the
destination as well.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*.  A
contiguous array of vec3 (XYZ) may have only 2 elements (ZY)
swizzle-copied to
a contiguous array of vec2.  A contiguous array of vec2 sources
may have each of its elements (XY) copied multiple times to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorized*,
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vpermd` and `vpermdi`
of Packed SIMD VSX,
Swizzle Moves come in immediate-only form with only up to four
selectors, whereas `vpermd` refers to individual bytes and may not
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use the letters "RGBA"
when referring to pixel data.  These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

As a standalone Scalar operation this instruction is valuable
if Prefixed with SVP64Single (providing Predication).
Combined with `cmpi` it synthesises Compare-and-Swap.
It is also more flexible than `xxpermdi`.

# Format

| 0.5 |6.10|11.15|16.27|28.31|  name        | Form    |
|-----|----|-----|-----|-----|--------------|---------|
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index.  3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 through 3 to copy from subelement in pos XYZW

In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]

Note that 8 options are needed (not 6) because option 0b001 encodes
the subvector length, and option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered".

    0    1    2    3
    X    Y    Z    W  source
         |         |
         +----+    |
         .    |    |
    +--------------+
    |    .    |    .
    W    .    Y    .  swizzle
    |    .    |    .
    |    Y    |    W  Y,W unmodified
    |    .    |    .
    W    Y    Y    W  dest
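
The packing of four 3-bit fields into the 12-bit immediate can be sketched
in Python. This is only an illustration, not part of the specification: the
helper names `encode_swizzle` and `decode_swizzle` and the `|` character for
the end marker are assumptions made here, and the immediate is assumed to use
MSB0 numbering, so that field X (bits 0.2) occupies the most significant
three bits.

```python
# Illustrative only: helper names and the '|' end-marker character are
# assumptions, not part of the specification.
SWIZZLE_CODES = {
    ".": 0b000,  # skip (static predicate mask)
    "|": 0b001,  # subvector length end marker (hypothetical character)
    "0": 0b010,  # constant 0
    "1": 0b011,  # constant 1 (or 1.0)
    "X": 0b100, "Y": 0b101, "Z": 0b110, "W": 0b111,  # 0b1NN: copy element NN
}

def encode_swizzle(text):
    """Pack a 4-character swizzle string into a 12-bit immediate.

    Assumes MSB0 bit numbering: imm bits 0.2 (field X) are the most
    significant three bits, imm bits 9.11 (field W) the least.
    """
    if len(text) != 4:
        raise ValueError("expected exactly four swizzle characters")
    imm = 0
    for ch in text.upper():
        imm = (imm << 3) | SWIZZLE_CODES[ch]
    return imm

def decode_swizzle(imm):
    """Split a 12-bit immediate back into four 3-bit fields [X, Y, Z, W]."""
    return [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]

print(f"{encode_swizzle('W.Y.'):012b}")        # 111000101000
print(decode_swizzle(encode_swizzle("W.Y.")))  # [7, 0, 5, 0]
```

The "W.Y." example therefore has the fields 0b111 (W), 0b000 (skip),
0b101 (Y) and 0b000 (skip), in that order.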

**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
need an eye-popping eight 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on-par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorized)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:

| swizzle name | source | dest | half    |
|--------------|--------|------|---------|
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT not covered by
the Swizzle is unmodified.  For example a Swizzle of "..XY"
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero.  A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

To simplify implementation, RT and RA are only permitted to be even,
so that register pairs can never partially overlap.  This makes RT
(and RA) a "pair" exactly
as in `lq` and `stq`.  Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorized variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair (in an Out-of-Order Micro-architecture, both of the register
pair must be "in-flight").
This ensures that register file corruption does not occur.
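
The Scalar behaviour described above can be modelled with a short Python
sketch. It is only an illustration under stated assumptions: the function
name `scalar_mv_swiz` and the list-based register file are hypothetical,
"lo-half" is taken to mean the least significant 32 bits, only the integer
form is modelled, and the end marker is assumed simply to stop the move.

```python
# Illustrative model only: names and the register-file representation are
# assumptions. "lo-half" is taken to mean the least significant 32 bits.
MASK32 = 0xFFFF_FFFF

def scalar_mv_swiz(regs, rt, ra, swiz):
    """Model integer mv.swiz on a list `regs` of 64-bit register values.

    `swiz` is a list of four 3-bit fields [X, Y, Z, W].
    """
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must be even (register pairs)")
    # Buffer (read) both source registers before any write, so that an
    # in-place swizzle (RT == RA) cannot overwrite a source it still needs.
    src = [regs[ra] & MASK32, (regs[ra] >> 32) & MASK32,
           regs[ra + 1] & MASK32, (regs[ra + 1] >> 32) & MASK32]
    if rt == ra:
        dst = list(src)        # in-place: uncovered positions unmodified
    else:
        dst = [0, 0, 0, 0]     # otherwise: uncovered positions are zeroed
    for i, code in enumerate(swiz):
        if code == 0b001:      # end marker: assumed to stop the scalar move
            break
        if code == 0b000:      # skip
            continue
        if code == 0b010:
            dst[i] = 0
        elif code == 0b011:
            dst[i] = 1
        else:
            dst[i] = src[code & 0b11]   # 0b1NN: copy source element NN
    regs[rt] = (dst[1] << 32) | dst[0]
    regs[rt + 1] = (dst[3] << 32) | dst[2]

regs = [0] * 8
regs[2] = (0xB << 32) | 0xA   # X=0xA (lo-half), Y=0xB (hi-half)
regs[3] = (0xD << 32) | 0xC   # Z=0xC, W=0xD
W_DOT_Y_DOT = [0b111, 0b000, 0b101, 0b000]   # "W.Y."

scalar_mv_swiz(regs, 4, 2, W_DOT_Y_DOT)      # RT != RA: Y and W zeroed
print(hex(regs[4]), hex(regs[5]))            # 0xd 0xb

scalar_mv_swiz(regs, 2, 2, W_DOT_Y_DOT)      # in-place: Y and W kept
print(hex(regs[2]), hex(regs[3]))            # 0xb0000000d 0xd0000000b
```

The in-place result holds W, Y, Y, W in positions X, Y, Z, W, matching the
"W.Y." diagram.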

**SVP64 Vectorized**

Vectorized Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorized, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations,
together with full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorized
source and destination Elements anywhere within
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves: because there is
no overlap, there is no risk of destroying a source with
an overwrite.  This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
    swiz[0] = imm[0:3]   # X
    swiz[1] = imm[3:6]   # Y
    swiz[2] = imm[6:9]   # Z
    swiz[3] = imm[9:12]  # W
    # determine implied subvector length from Swizzle
    dst_subvl = 4
    for i in range(4):
        if swiz[i] == 0b001:
            dst_subvl = i+1
            break
```

The option is provided here to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate.  With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
may be assumed to be 4.  SUBVL is considered to be the "source" subvector
length.

Pseudocode using Python "yield", with element-width overrides,
Saturation and Predication left out for clarity:

```
    def index_src():
        for i in range(VL):
            for j in range(SUBVL):
                if swiz[j] == 0b000: # skip
                    continue
                if swiz[j] == 0b001: # end
                    break
                if swiz[j] in [0b010, 0b011]:
                    yield (i*SUBVL, CONSTANT)
                else:
                    # 0b1NN: NN (0..3) selects subelement X/Y/Z/W
                    yield (i*SUBVL, swiz[j]-4)

    def index_dest():
        for i in range(VL):
            for j in range(dst_subvl):
                if swiz[j] == 0b000: # skip
                    continue
                yield i*dst_subvl+j

    # walk through both source and dest indices simultaneously
    for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
        if offs == CONSTANT:
            set(RT+dst_idx, CONSTANT)
        else:
            move_operation(RT+dst_idx, RA+src_idx+offs)
```
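
A direct (non-generator) Python sketch of the same walk may help, covering
only the case with no element-width overrides, Saturation or Predication.
It is illustrative rather than normative: the name `sv_mv_swiz` and its
parameters are hypothetical, registers are modelled as plain lists, the
destination subvector length is passed in rather than derived, and it is
assumed here that swizzle position `j` describes destination subelement `j`
for every `j` below the destination subvector length.

```python
# Illustrative sketch; function and parameter names are assumptions.
def sv_mv_swiz(src, dst, swiz, vl, subvl, dst_subvl):
    """Swizzle-copy `vl` subvectors from list `src` into list `dst`.

    subvl is the source subvector length (SUBVL); dst_subvl is the
    destination subvector length derived from the swizzle immediate.
    Assumes every swizzle position j < dst_subvl describes destination
    subelement j, and that src and dst do not overlap.
    """
    for i in range(vl):
        for j in range(dst_subvl):
            code = swiz[j]
            if code == 0b001:          # end marker
                break
            if code == 0b000:          # skip: destination left untouched
                continue
            if code == 0b010:
                value = 0
            elif code == 0b011:
                value = 1
            else:
                value = src[i * subvl + (code & 0b11)]   # 0b1NN selects NN
            dst[i * dst_subvl + j] = value
    return dst

# vec2 source array copied to a vec4 destination array with swizzle "YYXX"
src = ["x0", "y0", "x1", "y1"]
dst = [None] * 8
YYXX = [0b101, 0b101, 0b100, 0b100]
print(sv_mv_swiz(src, dst, YYXX, vl=2, subvl=2, dst_subvl=4))
# ['y0', 'y0', 'x0', 'x0', 'y1', 'y1', 'x1', 'x1']
```

Each vec2 source element pair expands into one vec4 destination group,
which is the "YYXX" case of differing source and destination subvector
lengths.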

**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not.  This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands.  Having Vertical-First
perform only one of the *sub*-elements at a time, rather
than operating on the entire vec2/3/4 together, would
violate that expectation.  The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorized Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R, G, B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.
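
The choice of constant can be sketched as a small Python helper. This is an
illustration for integer elements only: the function name
`swizzle_constant` and the `"sats"`/`"satu"` mode strings are assumptions
made here, with the unsigned case taken to give the all-ones value (0xff at
8 bits) mentioned above.

```python
# Illustrative only: the function name and mode strings are assumptions.
def swizzle_constant(code, elwidth=64, sat=None):
    """Return the integer written for a constant swizzle field.

    code: 0b010 (constant 0) or 0b011 (constant 1)
    sat:  None, "sats" (signed) or "satu" (unsigned)
    """
    if code == 0b010:
        return 0                          # constant 0 ignores saturation
    if code != 0b011:
        raise ValueError("not a constant swizzle field")
    if sat == "sats":
        return (1 << (elwidth - 1)) - 1   # signed maximum, e.g. 0x7f
    if sat == "satu":
        return (1 << elwidth) - 1         # unsigned maximum, e.g. 0xff
    return 1                              # no saturation: plain constant 1

print(hex(swizzle_constant(0b011, elwidth=8, sat="sats")))  # 0x7f
print(hex(swizzle_constant(0b011, elwidth=8, sat="satu")))  # 0xff
```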

# Pack/Unpack Mode:

It is possible to apply Pack and Unpack to Vectorized
swizzle moves. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and
destination SUBVL may become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `SVSTATE.PACK` and `SVSTATE.UNPACK`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

```
    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)
```

For a separate source/dest SUBVL (again, no elwidth overrides):

```
    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    yield j*VL+i
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    yield i*dst_subvl+j

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    yield j*VL+i
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    yield i*SUBVL+j
```

Python "yield" is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.

Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order.  Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed.  Contrast this with the
case where only `PACK_en` is set: it is the *destination* that is an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```