swizzle needs a MV instruction (there are 2 of them: swizzle and swizzle2).
see below for a potential way to use funct7 to indicate that rs2 holds a swizzle.
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + imm[11:0]                       + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + fn4[3:0]   + swizzle[7:0]       + rs1[4:0] + 0b000  | rd[4:0]  + OP-V   + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
* funct3 = MV: 0b000 for FP, 0b001 for INT
* fn4 = 4-bit function.
* fn4 = 0b0000 - MV-SWIZZLE
* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - MV-X.SUBVL, NN=elwidth (default/8/16/32)
swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
+-----+-----+-----+-----+
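
As an illustration of the 8-bit swizzle field above, here is a minimal Python sketch that splits it into four 2-bit selectors and applies them to one sub-vector. The names `decode_swizzle` and `apply_swizzle` are hypothetical, and the mapping of bits 1:0 to destination sub-element 0 is an assumption: the draft does not pin that down.

```python
def decode_swizzle(swizzle8):
    """Split an 8-bit swizzle into four 2-bit source selectors.

    Assumption (not stated in the draft): bits 1:0 select the source for
    destination sub-element 0, bits 3:2 for sub-element 1, and so on.
    """
    if not 0 <= swizzle8 <= 0xFF:
        raise ValueError("swizzle must fit in 8 bits")
    return [(swizzle8 >> (2 * i)) & 0b11 for i in range(4)]


def apply_swizzle(src, swizzle8, subvl):
    """Return the first `subvl` destination sub-elements picked from `src`."""
    selectors = decode_swizzle(swizzle8)[:subvl]
    return [src[sel] for sel in selectors]


if __name__ == "__main__":
    # selectors 3,2,1,0: reverses a 4-element sub-vector under the assumption above
    print(apply_swizzle(["x", "y", "z", "w"], 0b00011011, 4))
```

With SUBVL below 4 the upper fields are simply ignored in this sketch; whether they should instead be required to be zero is an open question.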
MV.X has two modes: SUBVL mode applies the element offsets only within a SUBVL inner loop. This can be used for transposition.

    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally:

    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+k]]
Pseudocode for element width part of MV.X:

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth / 8 # get bytes per el

        addr = (unsigned char *)&regs[rs1]
        offset = addr + bytewidth # get offset within regfile as SRAM
        # TODO, actually, needs to respect rd and rs1 element width,
        # here, as well. this pseudocode just illustrates that the
        # MV.X operation contains a way to compact the indices into
        regs[rd] = (unsigned char*)(regs)[offset]
The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, such that rather than doing this:

    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus compacting four indices into the one register. The element widths of
x3 and x8 are *independent* of the MV.X elwidth, thus allowing both the source
and destination element widths of the *elements* to be moved to be overridden,
whilst *at the same time* allowing the *indices* to be compacted as well.
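
The compaction idea can be sketched in a few lines of Python: four 8-bit indices are packed into one integer standing in for an XLEN-sized register, then pulled back out by position. The helper names are hypothetical, and placing index 0 in the least significant bits is an assumption rather than something the draft specifies.

```python
def pack_indices(indices, bitwidth):
    """Pack small unsigned indices into one integer.

    Assumption: index 0 goes in the least significant bits.
    """
    mask = (1 << bitwidth) - 1
    packed = 0
    for pos, idx in enumerate(indices):
        if idx < 0 or idx > mask:
            raise ValueError("index does not fit in the element width")
        packed |= idx << (pos * bitwidth)
    return packed


def unpack_index(packed, pos, bitwidth):
    """Extract the index stored at position `pos`."""
    return (packed >> (pos * bitwidth)) & ((1 << bitwidth) - 1)


if __name__ == "__main__":
    reg = pack_indices([1, 2, 3, 0], 8)  # four indices in one register
    print(hex(reg), [unpack_index(reg, i, 8) for i in range(4)])
```

With elwidth=default the same four indices would occupy four separate registers, which is the cost the elwidth=8 form avoids.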
potential MV.X? register-version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + funct7        + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + 0b0000000     + rs2[4:0] + rs1[4:0] + 0b001  | rd[4:0]  + OP-V   + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?
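
The four funct7 patterns listed above can be decoded as follows. This is only a sketch of the proposal as written: `decode_funct7` is a hypothetical helper, and treating every unlisted pattern as reserved is an assumption.

```python
ELWIDTHS = ("default", 8, 16, 32)  # NN = 0b00, 0b01, 0b10, 0b11


def decode_funct7(funct7):
    """Return (operation, is_fp, elwidth) for the proposed funct7 values.

    Bit 0 marks MV.swizzle, bit 1 marks FP, bits 3:2 hold NN for MV.X.
    Assumption: any pattern not listed in the draft is reserved.
    """
    if not 0 <= funct7 <= 0x7F or funct7 >> 4:
        raise ValueError("reserved funct7 value")
    is_fp = bool(funct7 & 0b10)
    if funct7 & 0b01:
        if funct7 & 0b1100:
            raise ValueError("reserved funct7 value")
        return ("MV.swizzle", is_fp, None)
    return ("MV.X", is_fp, ELWIDTHS[(funct7 >> 2) & 0b11])


if __name__ == "__main__":
    for value in (0b0000100, 0b0001110, 0b0000001, 0b0000011):
        print(format(value, "07b"), decode_funct7(value))
```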
question: do we need a swizzle MV.X as well?

    regs[rd] = regs[rs1 + regs[rs2]]

Similar to LD/ST with the same twin predication rules
there is the potential for macro-op fusion of mv-swizzle with the following instruction and/or preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

additional idea: a VBLOCK context that says that if a given register is used, it indicates that the
register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.
    __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
           _MM_SHUFFLE(hi3,hi2,lo1,lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.
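
For reference, the behaviour of `_mm_shuffle_ps` can be modelled on plain Python lists. The functions `mm_shuffle` and `shuffle_ps` below are illustrative stand-ins for the macro and intrinsic, not real bindings: the two low output floats come from the first operand and the two high output floats from the second.

```python
def mm_shuffle(z, y, x, w):
    """Model of the _MM_SHUFFLE(z, y, x, w) macro: four 2-bit fields, w lowest."""
    return (z << 6) | (y << 4) | (x << 2) | w


def shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps on 4-element lists."""
    return [
        lo[imm8 & 3],         # output 0 from the first operand
        lo[(imm8 >> 2) & 3],  # output 1 from the first operand
        hi[(imm8 >> 4) & 3],  # output 2 from the second operand
        hi[(imm8 >> 6) & 3],  # output 3 from the second operand
    ]


if __name__ == "__main__":
    a = [10.0, 11.0, 12.0, 13.0]
    print(shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2)))  # broadcast of a[2]
```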
assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step to give 8 instructions in total:

transpose 4 corner 2x2 matrices

<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>

    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Assigning transposed values back into I[0-3] */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);
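
To check that these eight unpack operations really do transpose a 4x4 matrix, the sequence can be replayed on Python lists. The four helper functions are list-based models of the SSE2 intrinsics of the same name (each 128-bit register is treated as four 32-bit lanes), written for illustration only.

```python
def unpacklo_epi32(a, b):
    """Interleave the low two 32-bit lanes of a and b."""
    return [a[0], b[0], a[1], b[1]]


def unpackhi_epi32(a, b):
    """Interleave the high two 32-bit lanes of a and b."""
    return [a[2], b[2], a[3], b[3]]


def unpacklo_epi64(a, b):
    """Concatenate the low 64-bit halves of a and b."""
    return [a[0], a[1], b[0], b[1]]


def unpackhi_epi64(a, b):
    """Concatenate the high 64-bit halves of a and b."""
    return [a[2], a[3], b[2], b[3]]


def transpose4x4(rows):
    """Replay the two-step SSE2 sequence on four 4-lane rows."""
    i0, i1, i2, i3 = rows
    t0 = unpacklo_epi32(i0, i1)
    t1 = unpacklo_epi32(i2, i3)
    t2 = unpackhi_epi32(i0, i1)
    t3 = unpackhi_epi32(i2, i3)
    return [
        unpacklo_epi64(t0, t1),
        unpackhi_epi64(t0, t1),
        unpacklo_epi64(t2, t3),
        unpackhi_epi64(t2, t3),
    ]


if __name__ == "__main__":
    matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for row in transpose4x4(matrix):
        print(row)
```

Both steps are 2-input, 1-output permutations of 4 lanes, which is the same shape as the (4+4) -> 4 swizzle discussed above.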

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

swizzle2 takes 2 vector arguments and interleaves them under the control of a 3rd argument (the swizzle selector)

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+

swizzlei would still need the 12-bit format due to not having enough immediate bits. we can get away with only 3 i-type funct3s used for [f]swizzlei by having one funct3 shared between the int and fp versions for destsubvl 1 through 3 (bit 31 selects int or fp), and a separate funct3 each for the int and fp versions with destsubvl = 4:

+--------+-----------+----+-----------+----------+-------+-------+------+
| int/fp | DESTSUBVL | 31 | 30:29     | 28:20    | 19:15 | 14:12 | 11:7 |
+========+===========+====+===========+==========+=======+=======+======+
| int    | 1 to 3    | 0  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| fp     | 1 to 3    | 1  | DESTSUBVL | selector | rs    | 000   | rd   |
+--------+-----------+----+-----------+----------+-------+-------+------+
| int    | 4         | selector[11:0]            | rs    | 001   | rd   |
+--------+-----------+---------------------------+-------+-------+------+
| fp     | 4         | selector[11:0]            | rs    | 010   | rd   |
+--------+-----------+---------------------------+-------+-------+------+
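
A decoder for this immediate layout might look like the sketch below. The function name is hypothetical, and two points are assumptions rather than part of the proposal: DESTSUBVL is taken to be stored as its plain value (1 to 3) in bits 30:29, and the value 0 in that field is treated as reserved.

```python
def decode_swizzlei(funct3, imm12):
    """Return (is_fp, destsubvl, selector) for the [f]swizzlei proposal.

    imm12 is instruction bits 31:20, so imm12 bit 11 is instruction bit 31.
    """
    if not 0 <= imm12 < (1 << 12):
        raise ValueError("immediate must fit in 12 bits")
    if funct3 == 0b000:
        is_fp = bool(imm12 >> 11)         # bit 31: int or fp
        destsubvl = (imm12 >> 9) & 0b11   # bits 30:29
        if destsubvl == 0:
            raise ValueError("DESTSUBVL field of 0 assumed reserved")
        return is_fp, destsubvl, imm12 & 0x1FF  # bits 28:20, 3 fields of 3 bits
    if funct3 == 0b001:
        return False, 4, imm12            # int, full 12-bit selector
    if funct3 == 0b010:
        return True, 4, imm12             # fp, full 12-bit selector
    raise ValueError("funct3 not used by [f]swizzlei")


if __name__ == "__main__":
    print(decode_swizzlei(0b000, (1 << 11) | (2 << 9) | 0b101))
```

The 9-bit selector holds exactly three 3-bit fields, which is why destsubvl = 4 needs the whole 12-bit immediate and therefore its own funct3 values.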

the rest could be encoded as follows:

+-----------+-------+-----------+-------+-------+-------+------+
|           | 31:27 | 26:25     | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+===========+=======+=======+=======+======+
| swizzle2  | rs3   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| swizzle   | rs1   | DESTSUBVL | rs2   | rs1   | 100   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle2 | rs3   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+
| fswizzle  | rs1   | DESTSUBVL | rs2   | rs1   | 101   | rd   |
+-----------+-------+-----------+-------+-------+-------+------+

note how for [f]swizzle, rs3 == rs1

so it uses 5 funct3 values overall, which seems appropriate, since swizzle is probably second only to muladd in frequency of use in graphics shaders.

+--------+-----------+----------+-------+-------+------+
| int/fp | 31:28     | 27:20    | 19:15 | 14:12 | 11:7 |
+========+===========+==========+=======+=======+======+
| int    | DESTMASK  | selector | rs    | 000   | rd   |
+--------+-----------+----------+-------+-------+------+
| fp     | DESTMASK  | selector | rs    | 001   | rd   |
+--------+-----------+----------+-------+-------+------+
| int    | DESTMASK  | constsel | rs    | 010   | rd   |
+--------+-----------+----------+-------+-------+------+
| fp     | DESTMASK  | constsel | rs    | 011   | rd   |
+--------+-----------+----------+-------+-------+------+

Matrix 4x4 Vector mul
=====================

    pfscale,3 F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.

In effect what this is doing is:

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

Here f2, f1 and f10-f13 are all vec4. The selected element of f1 (x, y, z or w) is copied at a fixed index, while the indices of the other vec4 operands progress.
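
The fmul/fmac sequence can be written out in Python to make the data flow explicit. This is a sketch on plain lists: `mat4_vec_mul` is a hypothetical name, and `mat` stands for the four vec4 registers f10 to f13.

```python
def mat4_vec_mul(f1, mat):
    """Compute f2 = f1.x*f10 + f1.y*f11 + f1.z*f12 + f1.w*f13 element-wise.

    f1 is a vec4; mat is a list of four vec4s standing in for f10..f13.
    """
    # fmul f2, f1.xxxx, f10: f1[0] is broadcast across all four lanes
    f2 = [f1[0] * m for m in mat[0]]
    # fmac f2, f1.yyyy/zzzz/wwww, f11/f12/f13, f2
    for i in (1, 2, 3):
        f2 = [f1[i] * m + acc for m, acc in zip(mat[i], f2)]
    return f2


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(mat4_vec_mul([1, 2, 3, 4], identity))
```

Each step needs only a broadcast swizzle of one f1 element, which is why fusing the shuffle with the multiply or multiply-accumulate is attractive here.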

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }

    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }

    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }

    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }

    // impl for other types too...

    // NOTE: parameter list reconstructed from how the names are used below
    pub fn swizzle<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize,
    ) where
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // bit 2 clear: read from rs1; bit 2 set: read a constant
                let src: &[Elm] = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                } else {
                    &Elm::CONSTANTS[..]
                };
                // assumed: only the low 2 bits index into src
                sel_field &= 0b11;
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }

    // NOTE: parameter list reconstructed from how the names are used below
    fn swizzle2<Elm, Selector>(
        rd: &mut [Elm],
        rs1: &[Elm],
        rs2: &[Selector],
        rs3: &[Elm],
        vl: usize,
        destsubvl: usize,
        srcsubvl: usize,
    ) where
        // Elm is a copyable type
        Elm: Copy,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
    {
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector: u64 = rs2[vindex].into();
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            }
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                // assumed: which of rs1/rs3 bit 2 selects is not recoverable
                let src = if (sel_field & 0b100) != 0 { rs1 } else { rs3 };
                sel_field &= 0b11; // assumed: low 2 bits index into src
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                }
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
            }
        }
    }
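
A small worked example of the 3-bit selector fields may help. The Python sketch below models one sub-vector of the single-source swizzle, using the u8 constants (0, 1, 0xFF, 0x7F). The function names are hypothetical. It assumes that bit 2 of each field chooses between the source and the constant table and that the low 2 bits are the index, and it simplifies the range check to the length of whichever table is selected.

```python
CONSTANTS_U8 = (0, 1, 0xFF, 0x7F)  # mirrors the u8 constants in the Rust code


def pack_selector(fields):
    """Pack 3-bit selector fields, field 0 in the lowest bits."""
    selector = 0
    for i, field in enumerate(fields):
        if not 0 <= field <= 0b111:
            raise ValueError("each selector field is 3 bits")
        selector |= field << (3 * i)
    return selector


def swizzle_one(src, selector, destsubvl, constants=CONSTANTS_U8):
    """Swizzle a single sub-vector `src` into `destsubvl` output elements."""
    if selector >> (3 * destsubvl):
        raise ValueError("illegal instruction: unused selector bits set")
    out = []
    for i in range(destsubvl):
        field = (selector >> (3 * i)) & 0b111
        table = constants if field & 0b100 else src
        index = field & 0b11
        if index >= len(table):
            raise ValueError("illegal instruction: index out of range")
        out.append(table[index])
    return out


if __name__ == "__main__":
    sel = pack_selector([3, 0, 0b101, 0b100])  # src[3], src[0], const 1, const 0
    print(sel, swizzle_one([10, 20, 30, 40], sel, 4))
```

The example builds a (w, x, 1, 0) style vector in one operation, which is the kind of pattern the constant table is there for.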