Swizzle needs a MV instruction. There are two variants: swizzle and swizzle2.
See below for a potential way to use funct7 to perform a swizzle taken from rs2.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + imm[11:0]                      + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + fn4[3:0]    + swizzle[7:0]     + rs1[4:0] + 0b000  | rd[4:0]  + OP-V   + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV: 0b000 for FP, 0b001 for INT
* fn4 = 4 bit function.
    * fn4 = 0b0000 - MV-SWIZZLE
    * fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
    * fn4 = 0bNN11 - MV-X.SUBVL, NN=elwidth (default/8/16/32)
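
As an illustration of the fn4 list above, here is a minimal decoding sketch in Python. The function name `decode_fn4` and the string labels are hypothetical, chosen only for this example; values of fn4 not mentioned in the list are reported as unassigned rather than guessed.

```python
# Hypothetical decoder for the 4-bit fn4 field, following the list above.
ELWIDTH_NAMES = {0: "default", 1: "8", 2: "16", 3: "32"}

def decode_fn4(fn4):
    """Return (mnemonic, elwidth) for a 4-bit fn4 value, or None if unassigned."""
    if not 0 <= fn4 <= 0xF:
        raise ValueError("fn4 is a 4-bit field")
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    elwidth = ELWIDTH_NAMES[(fn4 >> 2) & 0b11]  # NN is the top two bits
    low = fn4 & 0b11
    if low == 0b01:
        return ("MV-X", elwidth)
    if low == 0b11:
        return ("MV-X.SUBVL", elwidth)
    return None  # encoding not assigned by the list above

print(decode_fn4(0b0101))  # MV-X with 8-bit indices
```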

swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
+-----+-----+-----+-----+
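
The 8-bit swizzle field splits into four 2-bit selectors. The sketch below shows one way to unpack and apply them. It assumes (this is not stated above) that the field at bits 1:0 controls destination element 0 and that each selector is an index into the source sub-vector. `decode_swizzle` and `apply_swizzle` are illustrative names only.

```python
def decode_swizzle(imm8):
    """Split an 8-bit swizzle immediate into four 2-bit selectors.

    The list is returned lowest field first: entry 0 is bits 1:0,
    entry 3 is bits 7:6.
    """
    return [(imm8 >> (2 * i)) & 0b11 for i in range(4)]

def apply_swizzle(src, imm8):
    # ASSUMPTION: destination element i is src[selector i].
    return [src[sel] for sel in decode_swizzle(imm8)]

vec = ["x", "y", "z", "w"]
print(apply_swizzle(vec, 0b11100100))  # selectors 0,1,2,3: identity
print(apply_swizzle(vec, 0b00011011))  # selectors 3,2,1,0: reversed
```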

MV.X has two modes. SUBVL mode applies the element offsets only within a SUBVL inner loop, which can be used for transposition.

    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+j]]

Normal mode will apply the element offsets incrementally:

    for j in range(SUBVL):
        regs[rd] = regs[rd+regs[rs+k]]

Pseudocode for element width part of MV.X:

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth / 8 # get bytes per el
        addr = (unsigned char *)&regs[rs1]
        offset = addr + bytewidth # get offset within regfile as SRAM
        # TODO, actually, needs to respect rd and rs1 element width,
        # here, as well. this pseudocode just illustrates that the
        # MV.X operation contains a way to compact the indices into
        regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, such that rather than doing this:

    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus compacting four indices into the one register. The element widths of
x3 and x8 are *independent* of the MV.X elwidth, thus allowing both source
and destination element widths of the *elements* to be moved to be over-ridden,
whilst *at the same time* allowing the *indices* to be compacted, as well.
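
To make the compaction concrete, the following Python sketch packs several 8-bit indices into one XLEN-sized integer and reads them back. It assumes XLEN=64 and that index 0 sits in the least significant byte; neither is specified above, and the helper names are hypothetical.

```python
XLEN = 64  # ASSUMPTION: a 64-bit register, for illustration only

def pack_indices(indices, bits=8):
    """Pack small indices into one register-sized integer, index 0 lowest."""
    if len(indices) * bits > XLEN:
        raise ValueError("too many indices for one register")
    reg = 0
    for n, idx in enumerate(indices):
        if not 0 <= idx < (1 << bits):
            raise ValueError("index does not fit in the element width")
        reg |= idx << (n * bits)
    return reg

def unpack_index(reg, n, bits=8):
    """Read back the n-th packed index."""
    return (reg >> (n * bits)) & ((1 << bits) - 1)

reg = pack_indices([3, 0, 2, 1])
print(hex(reg))                                # four indices in one register
print([unpack_index(reg, n) for n in range(4)])
```

With elwidth=default each of the four indices would occupy a full register. With elwidth=8 they share one.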

Potential MV.X? Register version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + funct7        + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + 0b0000000     + rs2[4:0] + rs1[4:0] + 0b001  | rd[4:0]  + OP-V   + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+

* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

Question: do we need a swizzle MV.X as well?

There is the potential for macro-op fusion of mv-swizzle with the following and/or preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

Additional idea: a VBLOCK context which specifies that whenever a given register is used, that
register is to be "swizzled", with the VBLOCK swizzle context holding the swizzle to be carried out.

    __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
           _MM_SHUFFLE(hi3,hi2,lo1,lo0))
    Interleave inputs into low 2 floats and high 2 floats of output. Basically
    For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
    a[i] into all 4 output floats.
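
The behaviour quoted above can be modelled in a few lines of Python. This is only a sketch of the SSE semantics using plain lists; `mm_shuffle` and `mm_shuffle_ps` are stand-in names for the `_MM_SHUFFLE` macro and the `_mm_shuffle_ps` intrinsic.

```python
def mm_shuffle(z, y, x, w):
    """Model of the _MM_SHUFFLE(z, y, x, w) macro: four 2-bit fields."""
    return (z << 6) | (y << 4) | (x << 2) | w

def mm_shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps on 4-element lists.

    The low two outputs are picked from `lo`, the high two from `hi`.
    """
    return [
        lo[imm8 & 3],
        lo[(imm8 >> 2) & 3],
        hi[(imm8 >> 4) & 3],
        hi[(imm8 >> 6) & 3],
    ]

a = [10.0, 11.0, 12.0, 13.0]
print(mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2)))  # a[2] in all four lanes
```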

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
using standard (4+4) -> 4 swizzle instructions with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total:

transpose 4 corner 2x2 matrices

<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>

    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Assigning transposed values back into I[0-3] */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);
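
To check that the eight unpack operations above really produce a transpose, they can be modelled on Python lists of four 32-bit lanes. The helper functions below are plain-list stand-ins for the SSE2 intrinsics of the same name, not bindings to them.

```python
def unpacklo_epi32(a, b):
    # Interleave the low two 32-bit lanes of a and b.
    return [a[0], b[0], a[1], b[1]]

def unpackhi_epi32(a, b):
    # Interleave the high two 32-bit lanes of a and b.
    return [a[2], b[2], a[3], b[3]]

def unpacklo_epi64(a, b):
    # Low 64 bits of a followed by low 64 bits of b.
    return [a[0], a[1], b[0], b[1]]

def unpackhi_epi64(a, b):
    # High 64 bits of a followed by high 64 bits of b.
    return [a[2], a[3], b[2], b[3]]

def transpose4x4(rows):
    i0, i1, i2, i3 = rows
    t0 = unpacklo_epi32(i0, i1)
    t1 = unpacklo_epi32(i2, i3)
    t2 = unpackhi_epi32(i0, i1)
    t3 = unpackhi_epi32(i2, i3)
    return [
        unpacklo_epi64(t0, t1),
        unpackhi_epi64(t0, t1),
        unpacklo_epi64(t2, t3),
        unpackhi_epi64(t2, t3),
    ]

m = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for row in transpose4x4(m):
    print(row)
```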

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

swizzle2 takes 2 arguments, interleaving the two vectors depending on a 3rd (the swizzle selector).

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+

Matrix 4x4 Vector mul
=====================

    pfscale,3 F2, F1, F10
    pfscaleadd,2 F2, F1, F11, F2
    pfscaleadd,1 F2, F1, F12, F2
    pfscaleadd,0 F2, F1, F13, F2

pfscale is a 4 vec mv.shuffle followed by a fmul. pfscaleadd is a 4 vec mv.shuffle followed by a fmac.
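
The four-instruction sequence above can be sketched in Python as a broadcast-then-multiply followed by three broadcast-then-multiply-accumulate steps. Several things here are assumptions rather than anything stated above: that the `,n` suffix selects which element of F1 is broadcast to all four lanes, that F1 is the shuffled operand, and that F10 to F13 each hold one 4-element slice of the matrix. All function names are hypothetical.

```python
def broadcast(vec, n):
    """4-vec shuffle that copies element n into every lane."""
    return [vec[n]] * len(vec)

def pfscale(n, vec, mat_slice):
    # ASSUMPTION: shuffle (broadcast) of vec followed by a lane-wise multiply.
    return [s * m for s, m in zip(broadcast(vec, n), mat_slice)]

def pfscaleadd(n, vec, mat_slice, acc):
    # ASSUMPTION: shuffle (broadcast) of vec followed by a multiply-accumulate.
    return [s * m + a for s, m, a in zip(broadcast(vec, n), mat_slice, acc)]

def mat4_mul_vec(f10_to_f13, f1):
    """Mirror the instruction sequence: the result plays the role of F2."""
    f2 = pfscale(3, f1, f10_to_f13[0])
    f2 = pfscaleadd(2, f1, f10_to_f13[1], f2)
    f2 = pfscaleadd(1, f1, f10_to_f13[2], f2)
    f2 = pfscaleadd(0, f1, f10_to_f13[3], f2)
    return f2

slices = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(mat4_mul_vec(slices, [1, 2, 3, 4]))
```

Under these assumptions the result is F1[3]*F10 + F1[2]*F11 + F1[1]*F12 + F1[0]*F13, so how the matrix is laid out across F10 to F13 determines which product is computed.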

    pub trait SwizzleConstants: Copy + 'static {
        const CONSTANTS: &'static [Self; 4];
    }

    impl SwizzleConstants for u8 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFF, 0x7F];
    }

    impl SwizzleConstants for u16 {
        const CONSTANTS: &'static [Self; 4] = &[0, 1, 0xFFFF, 0x7FFF];
    }

    impl SwizzleConstants for f32 {
        const CONSTANTS: &'static [Self; 4] = &[0.0, 1.0, -1.0, 0.5];
    }

    // impl for other types too...

    pub fn swizzle<Elm, Selector>(
        Elm: SwizzleConstants,
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector = rs2[vindex].into();
            // selector's type is u64
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                let src = if (sel_field & 0b100) == 0 {
                    &rs1[(vindex * srcsubvl)..]
                    SwizzleConstants::CONSTANTS
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                let value = src[sel_field as usize];
                rd[vindex * destsubvl + i] = value;

    fn swizzle2<Elm, Selector>(
        // Elm is a copyable type
        // Selector is a copyable type that can be converted into u64
        Selector: Copy + Into<u64>,
        const FIELD_SIZE: usize = 3;
        const FIELD_MASK: u64 = 0b111;
        for vindex in 0..vl {
            let selector = rs2[vindex].into();
            // selector's type is u64
            if selector >> (FIELD_SIZE * destsubvl) != 0 {
                // handle illegal instruction trap
            for i in 0..destsubvl {
                let mut sel_field = selector >> (FIELD_SIZE * i);
                sel_field &= FIELD_MASK;
                let src = if (sel_field & 0b100) != 0 {
                if sel_field as usize >= srcsubvl {
                    // handle illegal instruction trap
                let value = src[vindex * srcsubvl + (sel_field as usize)];
                rd[vindex * destsubvl + i] = value;
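
For experimenting with the selector encoding, the single-source `swizzle` loop shown earlier can be modelled in Python. This is a sketch under two assumptions that the listing does not settle: when bit 2 of a field is set, the low two bits index the constants table, and the `srcsubvl` bounds check applies only to fields that read from rs1. The constants are the ones from the `f32` impl, and the illegal instruction trap is modelled as an exception.

```python
class IllegalInstruction(Exception):
    """Stand-in for the illegal instruction trap."""

F32_CONSTANTS = [0.0, 1.0, -1.0, 0.5]  # taken from the f32 impl above

FIELD_SIZE = 3
FIELD_MASK = 0b111

def swizzle(rs1, rs2, vl, destsubvl, srcsubvl, constants=F32_CONSTANTS):
    rd = [None] * (vl * destsubvl)
    for vindex in range(vl):
        selector = rs2[vindex]
        # Any bits set above the fields actually used are illegal.
        if selector >> (FIELD_SIZE * destsubvl) != 0:
            raise IllegalInstruction("selector has unused bits set")
        for i in range(destsubvl):
            sel_field = (selector >> (FIELD_SIZE * i)) & FIELD_MASK
            index = sel_field & 0b11
            if (sel_field & 0b100) == 0:
                # Register source: element `index` of this sub-vector.
                if index >= srcsubvl:
                    raise IllegalInstruction("source element out of range")
                value = rs1[vindex * srcsubvl + index]
            else:
                # ASSUMPTION: low two bits pick one of the four constants.
                value = constants[index]
            rd[vindex * destsubvl + i] = value
    return rd

# Fields, lowest first: src[3], src[0], constant 1.0, constant 0.0
sel = 0b100_101_000_011
print(swizzle([1.0, 2.0, 3.0, 4.0], [sel], vl=1, destsubvl=4, srcsubvl=4))
```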