[[!tag standards]]

# mv.swizzle

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=139>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-June/004913.html>

Swizzle is usually done on a per-operand basis in 3D GPU ISAs, making
for extremely long instructions (64 bits or greater).
The value of Swizzle lies in its high occurrence
in 3D Shader Binaries (over 10% of all instructions).
However, it is not practical to add two or more sets of 12-bit
prefixes to a single instruction.
A compromise is to provide a Swizzle "Move".
The encoding for this instruction embeds static predication into the
swizzle, as well as the constants 1/1.0 and 0/0.0.

# Format

| 0.5 |6.10|11.15|16.27|28.31| name         |
|-----|----|-----|-----|-----|--------------|
| PO  | RTp| RAp |imm  | 0011| mv.swiz      |
| PO  | RTp| RAp |imm  | 1011| fmv.swiz     |

This gives a 12-bit immediate across bits 16 to 27.
Each swizzle mnemonic (XYZW), commonly known from 3D GPU programming,
has an associated index. 3 bits of the immediate are allocated
to each:

| imm   |0.2 |3.5 |6.8|9.11|
|-------|----|----|---|----|
|swizzle|X   | Y  | Z | W  |
|index  |0   | 1  | 2 | 3  |

The options for each Swizzle are:

* 0b000 to indicate "skip". This is equivalent to predicate masking
* 0b001 is not needed (reserved)
* 0b010 to indicate "constant 0" (or 0.0)
* 0b011 to indicate "constant 1" (or 1.0)
* 0b1NN index 0 thru 3 to copy from subelement in pos XYZW
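As a minimal sketch of how the selector options above might be decoded, the Python below splits a 12-bit immediate into its four 3-bit fields. The names `decode_swizzle_imm` and `describe` are hypothetical, and the code assumes Power ISA MSB0 bit numbering within the immediate, so that imm bits 0.2 (X) are the three most significant bits. That assumption is not stated on this page.

```python
SKIP, RESERVED, CONST0, CONST1 = 0b000, 0b001, 0b010, 0b011


def decode_swizzle_imm(imm):
    """Split a 12-bit immediate into 3-bit selectors, ordered X, Y, Z, W."""
    if not 0 <= imm < (1 << 12):
        raise ValueError("immediate must fit in 12 bits")
    # Assumption: MSB0 numbering, so the X selector is the top three bits.
    return [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]


def describe(sel):
    """Return a human-readable meaning for one 3-bit selector."""
    if sel == SKIP:
        return "skip"
    if sel == RESERVED:
        return "reserved"
    if sel == CONST0:
        return "const 0"
    if sel == CONST1:
        return "const 1"
    # 0b1NN: the low two bits index the source subelement.
    return "copy " + "XYZW"[sel & 0b11]


imm = 0b111_000_101_000  # selectors for X, Y, Z, W in that order
print([describe(s) for s in decode_swizzle_imm(imm)])
# ['copy W', 'skip', 'copy Y', 'skip']
```

Keeping the selectors in X, Y, Z, W order means the list index doubles as the destination position.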

Efforts to encode the 12-bit swizzle into fewer bits proved unsuccessful:
7^4 comes out to 2,401, which exceeds the 2,048 values that 11 bits can represent.

Note that 7 options are needed (not 6) because the 7th option allows static
predicate masking to be encoded within the swizzle immediate.
For example this allows "W.Y." to specify: "copy W to position X,
and Y to position Z, leave the other two positions Y and W unaltered"

    0    1    2    3
    X    Y    Z    W
         |         |
         +----+    |
         |    |    |
    +--------------+
    |    |    |    |
    W    Y    Y    W
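The "W.Y." example can be modelled in a few lines of Python. This is only an illustrative sketch: `apply_swizzle` is a hypothetical helper, and the textual form (one character per destination position, "." for skip, "0" and "1" for the constants) follows the strings used on this page, with "1" assumed by analogy with "0".

```python
def apply_swizzle(swizzle, src, dest):
    """Return a new 4-element destination after applying a swizzle string."""
    if not (len(swizzle) == len(src) == len(dest) == 4):
        raise ValueError("swizzle, src and dest must each have 4 entries")
    result = list(dest)
    for pos, ch in enumerate(swizzle):
        if ch == ".":
            continue                      # skip: destination left unaltered
        if ch in "01":
            result[pos] = int(ch)         # constant 0 or constant 1
        else:
            result[pos] = src["XYZW".index(ch)]  # copy source subelement
    return result


vec = ["x", "y", "z", "w"]
print(apply_swizzle("W.Y.", vec, vec))
# ['w', 'y', 'y', 'w']
```

Because `result` is a copy of `dest` and `src` is never written, the function behaves the same whether or not source and destination are the same vector.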

**As a Scalar instruction**

Given that XYZW Swizzle can select simultaneously between one *and four*
register operands, a full version of this instruction would
need an eye-popping 8 64-bit operands: 4-in, 4-out. As part of a Scalar
ISA this is not practical. A compromise is to cut the registers required
by half.
When part of the Scalar Power ISA (not SVP64 Vectorised)
mv.swiz and fmv.swiz operate on four 32-bit
quantities, reducing this instruction to 2-in, 2-out pairs of 64-bit
registers:

| swizzle name | source | dest | half    |
|--            | --     | --   | --      |
| X            | RA     | RT   | lo-half |
| Y            | RA     | RT   | hi-half |
| Z            | RA+1   | RT+1 | lo-half |
| W            | RA+1   | RT+1 | hi-half |

When `RA=RT` (in-place swizzle) any portion of RT not covered by
the Swizzle is unmodified. For example a Swizzle of "ZW.."
will copy the contents of RA+1 into RT but leave RT+1 unmodified.

When `RA!=RT` any part of RT or RT+1 not set as a destination by
the Swizzle will be set to zero. A Swizzle of "ZW.." would
copy the contents of RA+1 into RT, but set RT+1 to zero.

Also, making life easier, RT and RA are only permitted to be even
(no overlapping can occur). This makes RT (and RA) a "pair" exactly
like `lq` and `stq`.
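The scalar behaviour described above can be sketched as a small Python model. Everything here is an assumption made for illustration: the register file is a plain list of 64-bit integers, "lo-half" is taken to mean the least-significant 32 bits, only the integer `mv.swiz` constants are modelled (not the 1.0/0.0 of `fmv.swiz`), and the helper names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF


def unpack_pair(first, second):
    """Split two 64-bit registers into 32-bit elements [X, Y, Z, W]."""
    # Assumption: "lo-half" is the least-significant 32 bits of a register.
    return [first & MASK32, (first >> 32) & MASK32,
            second & MASK32, (second >> 32) & MASK32]


def mv_swiz_scalar(regs, rt, ra, swizzle):
    """Model scalar mv.swiz, updating regs[rt] and regs[rt + 1]."""
    if rt % 2 or ra % 2:
        raise ValueError("RT and RA must be even register numbers")
    src = unpack_pair(regs[ra], regs[ra + 1])
    # RA=RT (in-place): skipped elements keep their old value.
    # RA!=RT: skipped elements are set to zero.
    out = list(src) if rt == ra else [0, 0, 0, 0]
    for pos, ch in enumerate(swizzle):
        if ch == ".":
            continue
        if ch in "01":
            out[pos] = int(ch)
        else:
            out[pos] = src["XYZW".index(ch)]
    regs[rt] = out[0] | (out[1] << 32)
    regs[rt + 1] = out[2] | (out[3] << 32)


regs = [0] * 8
regs[2] = 0x2222_2222_1111_1111  # RA:   Y in hi-half, X in lo-half
regs[3] = 0x4444_4444_3333_3333  # RA+1: W in hi-half, Z in lo-half

mv_swiz_scalar(regs, 4, 2, "ZW..")  # RA != RT: unused destinations zeroed
print(hex(regs[4]), hex(regs[5]))

mv_swiz_scalar(regs, 2, 2, "YX..")  # in-place: RT+1 left untouched
print(hex(regs[2]), hex(regs[3]))
```

All four source elements are unpacked before any destination is written, so the in-place case needs no special handling beyond the choice of initial `out` value.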

**SVP64 Vectorised**

When Vectorised, given the use-case is for a High-performance GPU,
the fundamental assumption is that Micro-coding or
another technique will
be deployed in hardware to issue multiple Scalar MV operations, which
would be impractical in a smaller Scalar-only Micro-architecture.
Therefore the restriction imposed on the Scalar `mv.swiz` to 32-bit
quantities as the default is lifted on `sv.mv.swiz`.

Additionally, in order to make life easier for implementors, some of
whom may wish, especially for Embedded GPUs, to use multi-cycle Micro-coding,
the usual strict Element-level Program Order is relaxed but only for
Horizontal-First Mode:

* In Horizontal-First Mode, an overlap between any Vectorised
  source and destination Elements over the entirety of
  the Vector Loop `0..VL-1` is `UNDEFINED` behaviour.
* In Vertical-First Mode, an overlap on any given one execution of
  the Swizzle instruction requires that all Swizzled source elements be
  copied into intermediary buffers (in-flight Reservation Stations,
  pipeline registers) **before** being swapped and placed in
  destinations. Strict Program Order is required in full.
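To illustrate why overlapping sources and destinations need the intermediary buffering described for Vertical-First Mode, the sketch below compares an element-by-element in-place swizzle with one that snapshots the sources first. Both functions are hypothetical software models, not a description of any hardware implementation.

```python
def swizzle_unbuffered(vec, swizzle):
    """Write each destination immediately; later reads may see new values."""
    for pos, ch in enumerate(swizzle):
        if ch in "XYZW":
            vec[pos] = vec["XYZW".index(ch)]
    return vec


def swizzle_buffered(vec, swizzle):
    """Copy all source elements into a buffer before writing any destination."""
    buf = list(vec)
    for pos, ch in enumerate(swizzle):
        if ch in "XYZW":
            vec[pos] = buf["XYZW".index(ch)]
    return vec


print(swizzle_unbuffered([1, 2, 3, 4], "YX.."))  # [2, 2, 3, 4]
print(swizzle_buffered([1, 2, 3, 4], "YX.."))    # [2, 1, 3, 4]
```

With "YX.." the unbuffered version overwrites X before it is read as the source for position Y, so the intended swap is lost. The buffered version preserves it.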

*Implementor's note: the cost in an Embedded design of storing
four 64-bit in-flight elements for Vertical-First Mode may be considered
too high. If this is the
case it is acceptable to raise an Illegal Instruction Trap and emulate
the instruction in software. Performance will obviously be adversely affected.
See [[sv/compliancy_levels]]: all aspects of
Swizzle are entirely optional in hardware at the Embedded Level.*

Implementors must consider `SUBVL` to have been implicitly set by
the Swizzle instructions. Hardware may statically calculate `SUBVL`
from the immediate. "W.0Z" is SUBVL=4, whereas "X0Z." is SUBVL=3,
and ".W.." sets SUBVL=2. Setting `SUBVL` has a different meaning
in Swizzle Move instructions,
as explained below.
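One way to read the three examples above is that the implied `SUBVL` is the highest destination position that is not skipped, plus one. The sketch below encodes that reading. The rule is inferred from the examples and is an assumption rather than a statement of the specification, as are the function name, the MSB0 placement of X in the top three bits, and the choice to raise an error when every position is skipped.

```python
def subvl_from_imm(imm):
    """Return the destination SUBVL implied by a 12-bit swizzle immediate."""
    # Assumption: X occupies the most significant three bits (MSB0 numbering).
    selectors = [(imm >> shift) & 0b111 for shift in (9, 6, 3, 0)]
    written = [pos for pos, sel in enumerate(selectors) if sel != 0b000]
    if not written:
        # The page does not say what an all-skip immediate implies.
        raise ValueError("all four positions are skipped")
    return max(written) + 1


print(subvl_from_imm(0b111_000_010_110))  # "W.0Z" -> 4
print(subvl_from_imm(0b100_010_110_000))  # "X0Z." -> 3
print(subvl_from_imm(0b000_111_000_000))  # ".W.." -> 2
```

The reserved selector 0b001 is not treated specially here and would count as a written position.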

# RM Mode Concept:

MVRM-2P-2S1D:

| Field Name   | Field bits | Description                          |
|--------------|------------|--------------------------------------|
| Rdest_EXTRA2 | `10:11`    | extends Rdest (R\*\_EXTRA2 Encoding) |
| Rsrc_EXTRA2  | `12:13`    | extends Rsrc (R\*\_EXTRA2 Encoding)  |
| src_SUBVL    | `14:15`    | SUBVL for Source                     |
| MASK_SRC     | `16:18`    | Execution Mask for Source            |

The inclusion of a separate src SUBVL allows
`sv.mv.swiz RT.vecN RA.vecN` to mean zip/unzip (pack/unpack).
This is conceptually achieved by having both source and
destination SUBVL be "outer" loops instead of inner loops.