[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is a type of permute shorthand allowing arbitrary selection
of elements from a vec2/3/4 to create a new vec2/3/4.
Its value lies in the high occurrence of Swizzle
in 3D Shader Binaries (over 10% of all instructions).
In 3D GPU ISAs Swizzle is usually done on a per-vec-operand basis, making
for extremely long instructions (64 bits or greater).
It is not, however, practical to add two or more sets of 12-bit
prefixes into a single instruction.
A compromise is to provide a Swizzle "Move": one such move is
then required for each operand used in a subsequent instruction.
The encoding for Swizzle Move embeds static predication into the
swizzle as well as constants 1/1.0 and 0/0.0.

An extremely important aspect of 3D GPU workloads is that the source
and destination subvector lengths may be *different*. A contiguous
array of vec3 (XYZ) may have only 2 elements (ZY) swizzle-copied to
a contiguous array of vec2. A contiguous array of vec2 sources
may have each vec2 element (XY) copied multiple times to a contiguous
vec4 array (YYXX or XYXX). For this reason, *when Vectorised*,
Swizzle Moves support independent subvector lengths for both
source and destination.

Although conceptually similar to `vperm` of Packed SIMD VSX,
Swizzle Moves come only in immediate form with at most four
selectors, whereas VSX selects individual bytes and cannot
copy constants to the destination.
3D Shader programs commonly use the letters "XYZW"
when referring to the four swizzle indices, and also often
use the letters "RGBA"
when referring to pixel data. These designations are also
part of both the OpenGL(TM) and Vulkan(TM) specifications.

# Format

| 0.5 |6.10|11.15|16.27|28.31| name         | Form    |
|-----|----|-----|-----|-----|--------------|---------|
|PO   | RTp| RAp |imm  | 0011| mv.swiz      | DQ-Form |
|PO   | RTp| RAp |imm  | 1011| fmv.swiz     | DQ-Form |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index. 3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|pixel  |R   | G  | B | A  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 subvector length end marker (length=4 if not present)
* 0b010 to indicate "constant 0"
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 thru 3 to copy from subelement in pos XYZW

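The options above can be illustrated with a small Python sketch that turns a textual swizzle into its four 3-bit fields. The helper name `encode_swizzle` and the characters used (`.` for skip, `0` and `1` for the constants) are hypothetical, as is the rule that a string shorter than four entries is terminated by the 0b001 marker and padded with zeros. The packed value assumes MSB0 numbering, so that field X lands in the most significant three bits of the immediate.

```python
# Hypothetical helper for illustration: not part of any Libre-SOC tooling.
SKIP, END, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011
CODES = {".": SKIP, "0": CONST0, "1": CONST1,
         "X": 0b100, "Y": 0b101, "Z": 0b110, "W": 0b111}

def encode_swizzle(text):
    """Return (fields, imm) for a swizzle string of 1 to 4 entries."""
    if not 1 <= len(text) <= 4:
        raise ValueError("swizzle must have 1 to 4 entries")
    fields = [CODES[ch] for ch in text.upper()]
    if len(fields) < 4:
        fields.append(END)                    # marks the destination length
        fields += [SKIP] * (4 - len(fields))  # assumed: unused fields are zero
    imm = 0
    for field in fields:                      # X ends up in the top 3 bits
        imm = (imm << 3) | field
    return fields, imm
```

Under these assumptions `encode_swizzle("W.Y.")` gives the fields `[0b111, 0b000, 0b101, 0b000]`, and `encode_swizzle("ZY")` gives `[0b110, 0b101, 0b001, 0b000]`, a two-element destination.
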
In very simplistic terms the relationship between swizzle indices
(NN, above), source, and destination is:

    dest[i] = src[swiz[i]]

Note that 7 options are needed (not 6) because option 0b000 allows static
predicate masking (skipping) to be encoded within the swizzle immediate.
For example it allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered".

    0    1    2    3
    X    Y    Z    W    source
         |         |
         +----+    |
              |    |
    +---------|----+
    |         |
    W    .    Y    .    swizzle
    |         |
    W    Y    Y    W    dest

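The "W.Y." example can be traced with a minimal Python sketch of a single subvector swizzle. The function name `apply_swizzle` is hypothetical, and the sketch assumes that nothing is written at or beyond a 0b001 end marker.

```python
def apply_swizzle(swiz, src, dest):
    """Return a new destination list after applying one swizzle."""
    out = list(dest)
    for i, code in enumerate(swiz):
        if code == 0b001:        # end marker: nothing further is written
            break
        if code == 0b000:        # skip: destination position left unaltered
            continue
        if code == 0b010:        # constant 0
            out[i] = 0
        elif code == 0b011:      # constant 1
            out[i] = 1
        else:
            out[i] = src[code & 0b11]   # 0b1NN selects source element NN
    return out

# "W.Y." : W to position X, Y to position Z, positions Y and W untouched
print(apply_swizzle([0b111, 0b000, 0b101, 0b000],
                    ["X", "Y", "Z", "W"], ["x", "y", "z", "w"]))
```

Lower-case letters stand for the previous destination contents, so the printed list is `['W', 'y', 'Y', 'w']`.
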
**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously from between one *and four*
register operands, a full version of this instruction would
need an eye-popping eight 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half, placing it on a par with `lq`, `stq` and Indexed
Load-with-update instructions.
When part of the Scalar Power ISA (not SVP64 Vectorised),
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to a feasible
2-in, 2-out pairs of 64-bit registers:

| swizzle name | source | dest | half    |
|--------------|--------|------|---------|
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT or RT+1 not covered by
the Swizzle is unmodified. For example a Swizzle of "..XY"
will copy the contents of RA into RT+1 but leave RT unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero. A Swizzle of "..XY" would
copy the contents of RA into RT+1, but set RT to zero.

To make life easier, RT and RA are only permitted to be even
(no partial overlap can occur). This makes RT (and RA) a "pair" exactly
as in `lq` and `stq`. Scalar Swizzle instructions must be atomically
indivisible: an Exception or Interrupt may not occur during the Moves.

Note that unlike the Vectorised variant, when `RT=RA` the Scalar variant
*must* buffer (read) both 64-bit RA registers before writing to the
RT pair. This ensures that register file corruption does not occur.

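The Scalar behaviour can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than a reference implementation: the register file is a plain list of 64-bit integers, "lo-half" is taken to mean the least significant 32 bits, only the integer form is modelled (constant 1, not 1.0), and the names `split_pair` and `scalar_mv_swiz` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def split_pair(regs, r):
    """Read the register pair r, r+1 as four 32-bit values X, Y, Z, W."""
    return [regs[r] & MASK32, regs[r] >> 32,
            regs[r + 1] & MASK32, regs[r + 1] >> 32]

def scalar_mv_swiz(regs, rt, ra, swiz):
    """Model of scalar mv.swiz on a list of 64-bit register values."""
    assert rt % 2 == 0 and ra % 2 == 0, "RT and RA must be even"
    src = split_pair(regs, ra)            # buffer both RA registers first
    # in-place keeps uncovered parts; otherwise they are zeroed
    out = split_pair(regs, rt) if rt == ra else [0, 0, 0, 0]
    for i, code in enumerate(swiz):
        if code == 0b001:                 # end marker
            break
        if code == 0b000:                 # skip
            continue
        if code in (0b010, 0b011):
            out[i] = code & 1             # constant 0 or 1
        else:
            out[i] = src[code & 0b11]     # 0b1NN selects X, Y, Z or W
    regs[rt] = out[0] | (out[1] << 32)
    regs[rt + 1] = out[2] | (out[3] << 32)
```

Because the whole source pair is read into `src` before either destination register is written, the in-place case cannot corrupt its own inputs. In this model a swizzle of "ZW.." with `RA!=RT` leaves RT holding the old contents of RA+1 and RT+1 holding zero.
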
**SVP64 Vectorised**

Vectorised Swizzle may be considered to
contain an extended static predicate
mask for subvectors (SUBVL=2/3/4). Due to the skipping caused by
the static predication capability, the destination
subvector length can be *different* from the source subvector
length, and consequently the destination subvector length is
encoded into the Swizzle.

When Vectorised, given that the use-case is a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations and
full parallel crossbars, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementers, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed.
Any overlap between Vectorised
source and destination Elements over the entirety of
the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.

This in turn implies that Traps and Exceptions are, as usual,
permitted in between element-level moves: because there is
no overlap there is no risk of destroying a source with
an overwrite. This is *unlike* the Scalar variant which, when
`RT=RA`, must buffer both halves of the RT pair.

Determining the source and destination subvector lengths is tricky.
Swizzle Pseudocode:

```
swiz[0] = imm[0:3]  # X
swiz[1] = imm[3:6]  # Y
swiz[2] = imm[6:9]  # Z
swiz[3] = imm[9:12] # W
# determine implied subvector length from Swizzle
dst_subvl = 4
for i in range(4):
    if swiz[i] == 0b001:
        dst_subvl = i  # entries 0..i-1 precede the end marker
        break
```

What is going on here is that the option is provided to have different
source and destination subvector lengths, by exploiting redundancy in
the Swizzle Immediate. With the Swizzles marking what goes into
each destination position, the marker "0b001" may be used to indicate
the end. If no marker is present then the destination subvector length
may be assumed to be 4. SUBVL is considered to be the "source" subvector
length.

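A runnable Python sketch of this decode step follows. It takes the destination length to be the number of entries that precede the 0b001 marker, and it assumes MSB0 numbering so that field X is the most significant three bits of the 12-bit immediate; the function name `decode_swizzle` is hypothetical.

```python
def decode_swizzle(imm):
    """Split a 12-bit immediate into four 3-bit fields and find dst_subvl."""
    # assumption: MSB0 numbering, so field X is the top three bits
    swiz = [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]
    dst_subvl = 4                    # default when no end marker is present
    for i, code in enumerate(swiz):
        if code == 0b001:            # end marker found at position i
            dst_subvl = i
            break
    return swiz, dst_subvl

# fields Z, Y, end, skip: a two-element destination
print(decode_swizzle(0b110_101_001_000))
```
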
Pseudocode exploiting python "yield" for clarity. Element-width overrides,
Saturation and Predication are also left out, for clarity:

```
def index_src():
    for i in range(VL):
        for j in range(SUBVL):
            if swiz[j] == 0b000: # skip
                continue
            if swiz[j] == 0b001: # end
                break
            if swiz[j] in [0b010, 0b011]:
                yield (i*SUBVL, CONSTANT)
            else:
                yield (i*SUBVL, swiz[j]-4) # 0b1NN: NN is the index

def index_dest():
    for i in range(VL):
        for j in range(dst_subvl):
            if swiz[j] == 0b000: # skip
                continue
            yield i*dst_subvl+j

# walk through both source and dest indices simultaneously
for (src_idx, offs), dst_idx in zip(index_src(), index_dest()):
    if offs == CONSTANT:
        set(RT+dst_idx, CONSTANT)
    else:
        move_operation(RT+dst_idx, RA+src_idx+offs)
```

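For experimentation, the element-level behaviour can also be written as one runnable Python function. This is an interpretation, not a statement of the specification: it collapses the two generators into a single loop that visits each destination position once, takes the destination length to be the number of entries before the 0b001 marker, models the register file as a flat list of elements, and does not check that a selected index lies within the source subvector. The name `sv_mv_swiz` and its parameters are hypothetical.

```python
def sv_mv_swiz(regs, rt, ra, swiz, vl, subvl):
    """Element-level model: regs is a flat list of elements."""
    # destination length: entries before the 0b001 marker, else 4
    dst_subvl = swiz.index(0b001) if 0b001 in swiz else 4
    for i in range(vl):
        for j in range(dst_subvl):
            code = swiz[j]
            if code == 0b000:                     # skip: static predication
                continue
            dst = rt + i * dst_subvl + j
            if code in (0b010, 0b011):
                regs[dst] = code & 1              # constant 0 or 1
            else:
                # 0b1NN selects element NN of the i-th source subvector
                regs[dst] = regs[ra + i * subvl + (code & 0b11)]

# two vec3 sources (XYZ) copied as "ZY" into two vec2 destinations
regs = [100, 101, 102, 103, 104, 105] + [0] * 4
sv_mv_swiz(regs, rt=6, ra=0, swiz=[0b110, 0b101, 0b001, 0b000], vl=2, subvl=3)
print(regs[6:10])
```

The same function handles the widening case: with `subvl=2` and a swizzle of YYXX each vec2 source fills a vec4 destination.
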
**Vertical-First Mode**

It is important to appreciate that *only* the main loop VL
is Vertical-First: the SUBVL loop is not. This makes sense
from the perspective that the Swizzle Move is a group of
moves, but is still a single instruction that happens to take
vec2/3/4 as operands. For Vertical-First
to perform only one of the *sub*-elements at a time, rather
than operating on the entire vec2/3/4 together, would
violate that expectation. The exceptions to this, explained
later, are when Pack/Unpack is enabled.

**Effect of Saturation on Vectorised Swizzle**

A useful convenience for pixel data is to be able to insert values
0x7f or 0xff as magic constants for arbitrary R, G, B or A. Therefore,
when Saturation is enabled and a Swizzle=0b011 (Constant 1) is requested,
the maximum permitted Saturated value is inserted rather than Constant 1.
`sv.mv.swiz/sats/vec2/ew=8 RT.v, RA.v, Y1` would insert the 2nd subelement
(Y) into the first destination subelement and the signed-maximum constant
0x7f into the second. A Constant 0 Swizzle on the other hand still inserts
zero because there is no encoding space to select between -1, 0 and 1, and
0 and max values are more useful.

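The choice of constant can be sketched as a small Python function. The name `swizzle_constant` and its `saturate` parameter are hypothetical, and the unsigned maximum (0xff at an element width of 8) is assumed by analogy with the signed case described above.

```python
def swizzle_constant(code, elwidth=64, saturate=None):
    """Value inserted for a constant swizzle (0b010 or 0b011).

    saturate is None, "signed" or "unsigned" (hypothetical parameter).
    """
    if code == 0b010:
        return 0                          # Constant 0 is never altered
    if code != 0b011:
        raise ValueError("not a constant swizzle")
    if saturate == "signed":
        return (1 << (elwidth - 1)) - 1   # e.g. 0x7f at elwidth 8
    if saturate == "unsigned":
        return (1 << elwidth) - 1         # e.g. 0xff at elwidth 8
    return 1                              # no Saturation: plain Constant 1
```
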
# Pack/Unpack Mode:

It is possible to apply Pack and Unpack to Vectorised
swizzle moves, and these instructions are of EXTRA type
`RM-2P-1S1D-PU`. The interaction requires specific explanation
because it involves the separate SUBVLs (with destination SUBVL
being separate). Key to understanding is that the
source and
destination SUBVL may each become "outer" loops instead of inner loops,
exactly as in [[sv/remap]] Matrix mode, under the control
of `PACK_en` and `UNPACK_en`.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For a separate source/dest SUBVL (again, no elwidth overrides):

    # yield an outer-SUBVL or inner VL loop with dst_subvl
    def index_dest(outer):
        if outer:
            for j in range(dst_subvl):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(dst_subvl):
                    ....

    # yield an outer-SUBVL or inner VL loop with SUBVL
    def index_src(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    ....
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    ....

"yield" from python is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.

Just as in [[sv/mv.vec]], when `PACK_en` is set it is the source
that swaps to Outer-subvector loops, and when `UNPACK_en` is set
it is the destination that swaps its loop-order. Setting both
`PACK_en` and `UNPACK_en` is neither prohibited nor `UNDEFINED`
because the behaviour is fully deterministic.

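The effect of swapping the loop order can be seen with a small runnable generator. The text elides what the loop bodies yield, so the offset `i*subvl+j` used here is an assumption carried over from the "normal" SVP64 example; swizzle selection and skipping are left out, and the name `index_order` is hypothetical.

```python
def index_order(vl, subvl, outer):
    """Yield element offsets with SUBVL as the outer or the inner loop."""
    if outer:
        for j in range(subvl):        # subvector position is the slow index
            for i in range(vl):
                yield i * subvl + j   # assumed offset, see lead-in
    else:
        for i in range(vl):           # element number is the slow index
            for j in range(subvl):
                yield i * subvl + j

print(list(index_order(3, 2, outer=False)))   # inner-SUBVL order
print(list(index_order(3, 2, outer=True)))    # outer-SUBVL order
```

With `vl=3` and `subvl=2` the inner order visits offsets 0 to 5 consecutively, while the outer order visits 0, 2, 4 and then 1, 3, 5. Zipping an outer-order source against an inner-order destination therefore gathers all first subelements ahead of all second subelements.
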
*However*, in
Vertical-First Mode, when both are enabled,
with both source and destination being outer loops, a **single**
step of srcstep and dststep is performed. Contrast this with when
only `PACK_en` is set: it is the *destination* that remains an inner
subvector loop, and therefore Vertical-First runs through the
entire `dst_subvl` group. Likewise when only `UNPACK_en` is set it
is the source subvector that is run through as a group.

```
if VERTICAL_FIRST:
    # must run through SUBVL or dst_subvl elements, to keep
    # the subvector "together".  weirdness occurs due to
    # PACK_en/UNPACK_en
    num_runs = SUBVL # 1-4
    if PACK_en:
        num_runs = dst_subvl # destination still an inner loop
    if PACK_en and UNPACK_en:
        num_runs = 1 # both are outer loops
    for substep in range(num_runs):
        (src_idx, offs) = yield from index_src(PACK_en)
        dst_idx = yield from index_dest(UNPACK_en)
        move_operation(RT+dst_idx, RA+src_idx+offs)
```