[[!tag standards]]

MV.X and MV.swizzle
===================

Swizzle needs a MV instruction. See below for a potential way to use funct7 to perform a swizzle held in rs2.

+-------------+-------+-------+--------+----------+--------+---------+--------+------+
| Encoding    | 31:27 | 26:25 | 24:20  | 19:15    | 14:12  | 11:7    | 6:2    | 1:0  |
+-------------+-------+-------+--------+----------+--------+---------+--------+------+
| RV32-I-type | imm[11:0]              | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
+-------------+------------------------+----------+--------+---------+--------+------+
| RV32-I-type | fn4[3:0], swizzle[7:0] | rs1[4:0] | 0b000  | rd[4:0] | OP-V   | 0b11 |
+-------------+------------------------+----------+--------+---------+--------+------+

* funct3 = MV: 0b000 for FP, 0b001 for INT
* OP-V = 0b1010111
* fn4 = 4-bit function.
* fn4 = 0b0000 - MV-SWIZZLE
* fn4 = 0bNN01 - MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - MV-X.SUBVL, NN=elwidth (default/8/16/32)

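As an illustration of the fn4 allocation listed above, here is a minimal Python sketch of a decoder. The function name ``decode_fn4`` and the "unallocated" result for values not listed above are assumptions made for this example only; they are not part of the proposal.

```python
# Hypothetical decoder for the 4-bit fn4 field (illustration only).
ELWIDTH_NAMES = {0: "default", 1: "8", 2: "16", 3: "32"}


def decode_fn4(fn4):
    """Return (operation, elwidth) for a 4-bit fn4 value."""
    if not 0 <= fn4 <= 0xF:
        raise ValueError("fn4 is a 4-bit field")
    if fn4 == 0b0000:
        return ("MV-SWIZZLE", None)
    low = fn4 & 0b11                            # bottom two bits select the operation
    elwidth = ELWIDTH_NAMES[(fn4 >> 2) & 0b11]  # NN = top two bits
    if low == 0b01:
        return ("MV-X", elwidth)
    if low == 0b11:
        return ("MV-X.SUBVL", elwidth)
    # assumption: anything else is not allocated by the list above
    return ("unallocated", None)
```

For instance, ``decode_fn4(0b0101)`` gives MV-X with 8-bit indices.
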
swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
| w   | z   | y   | x   |
+-----+-----+-----+-----+

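The field layout above can be sketched in Python as follows. ``decode_swizzle`` simply splits the byte as per the table. ``apply_swizzle`` is an assumption about the semantics (output lane x/y/z/w takes the source lane named by its 2-bit selector, with SUBVL=4) and is not confirmed by this page.

```python
def decode_swizzle(swizzle):
    """Split the 8-bit swizzle field into 2-bit selectors for x, y, z, w."""
    return {
        "x": swizzle & 0b11,
        "y": (swizzle >> 2) & 0b11,
        "z": (swizzle >> 4) & 0b11,
        "w": (swizzle >> 6) & 0b11,
    }


def apply_swizzle(subvec, swizzle):
    """Assumed semantics: each output lane copies the source lane its selector names."""
    if len(subvec) != 4:
        raise ValueError("this sketch only models SUBVL=4")
    sel = decode_swizzle(swizzle)
    return [subvec[sel[name]] for name in "xyzw"]
```

Under that assumption, ``0b11100100`` is the identity swizzle and ``0b00011011`` reverses the four lanes.
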
MV.X has two modes. SUBVL mode applies the element offsets only within
the SUBVL inner loop, restarting from the first offset on each outer
iteration. This can be used for transposition.

::

    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+j]]

Normal mode applies the element offsets incrementally across the whole vector:

::

    k = 0
    for i in range(VL):
        for j in range(SUBVL):
            regs[rd] = regs[rd+regs[rs+k]]
            k += 1

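To make the difference between the two modes concrete, here is a small runnable Python sketch. It is one interpretation only: a flat Python list stands in for the register file, the destination element is assumed to advance along with the loop (the pseudocode above does not show a destination offset), and reads come from a snapshot of the vector. The function names are invented for the example.

```python
def mv_x_subvl(vec, indices, vl, subvl):
    """SUBVL mode: the same SUBVL offsets are reused inside every sub-vector."""
    out = list(vec)
    for i in range(vl):
        base = i * subvl                      # start of the i-th sub-vector
        for j in range(subvl):
            out[base + j] = vec[base + indices[j]]
    return out


def mv_x_normal(vec, indices, vl, subvl):
    """Normal mode: the index counter k keeps running across the whole vector."""
    out = list(vec)
    k = 0
    for i in range(vl):
        for j in range(subvl):
            out[k] = vec[indices[k]]
            k += 1
    return out
```

With VL=2, SUBVL=4 and offsets 1, 3, 2, 0, SUBVL mode permutes each group of four independently, whereas normal mode needs eight offsets and can move elements between groups.
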
Pseudocode for element width part of MV.X:

::

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth // 8 # get bytes per el
        for i in range(VL):
            addr = (unsigned char *)&regs[rs1]
            offset = addr + i * bytewidth # get offset within regfile as SRAM
            # TODO, actually, needs to respect rd and rs1 element width,
            # here, as well. this pseudocode just illustrates that the
            # MV.X operation contains a way to compact the indices into
            # less space.
            regs[rd] = (unsigned char*)(regs)[offset]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, so that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

the alternative is this:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

This compacts four indices into one register. The element widths of x3 and x8
are *independent* of the MV.X elwidth, which allows both the source
and destination element widths of the *elements* being moved to be overridden,
whilst *at the same time* allowing the *indices* to be compacted as well.

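The packing used in the example above can be checked with a short Python sketch. It assumes the first index sits in the least significant bits of the register, which is what the ``0x00020301`` example implies. The helper names are hypothetical.

```python
def pack_indices(indices, elwidth_bits):
    """Pack indices into one integer, first index in the lowest bits."""
    value = 0
    for n, idx in enumerate(indices):
        if idx < 0 or idx >> elwidth_bits:
            raise ValueError("index does not fit in the element width")
        value |= idx << (n * elwidth_bits)
    return value


def unpack_indices(reg_value, elwidth_bits, count):
    """Extract `count` indices of `elwidth_bits` bits each from one register value."""
    mask = (1 << elwidth_bits) - 1
    return [(reg_value >> (n * elwidth_bits)) & mask for n in range(count)]
```

Here ``pack_indices([1, 3, 2, 0], 8)`` reproduces the immediate loaded into x8, and ``unpack_indices`` recovers the four separate values that the elwidth=default version keeps in x8 to x11.
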
----

Potential MV.X? A register version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+---------+--------+------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7    | 6:2    | 1:0  |
+-------------+-------+-------+----------+----------+--------+---------+--------+------+
| RV32-R-type | funct7        | rs2[4:0] | rs1[4:0] | funct3 | rd[4:0] | opcode | 0b11 |
+-------------+---------------+----------+----------+--------+---------+--------+------+
| RV32-R-type | 0b0000000     | rs2[4:0] | rs1[4:0] | 0b001  | rd[4:0] | OP-V   | 0b11 |
+-------------+---------------+----------+----------+--------+---------+--------+------+

* funct3 = MV.X
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle to say that rs2 is a swizzle argument?

Question: do we need a swizzle MV.X as well?

macro-op fusion
===============

There is the potential for macro-op fusion of mv-swizzle with the following instruction and/or the preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context which says that if a given register is used, that
register is to be "swizzled", and the VBLOCK swizzle context contains the swizzling to be carried out.

mm_shuffle_ps?
==============

::

    __m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
           _MM_SHUFFLE(hi3,hi2,lo1,lo0))

Interleave inputs into low 2 floats and high 2 floats of output. Basically:

::

    out[0]=lo[lo0];
    out[1]=lo[lo1];
    out[2]=hi[hi2];
    out[3]=hi[hi3];

For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float
a[i] into all 4 output floats.

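For reference, the behaviour described above can be modelled in a few lines of Python. This is only a model of the lane selection, using lists in place of ``__m128``; ``mm_shuffle`` and ``mm_shuffle_ps`` are stand-in names for the macro and the intrinsic.

```python
def mm_shuffle(hi3, hi2, lo1, lo0):
    """Model of the _MM_SHUFFLE macro: pack four 2-bit selectors into one byte."""
    return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0


def mm_shuffle_ps(lo, hi, imm8):
    """Model of _mm_shuffle_ps: low two lanes from `lo`, high two lanes from `hi`."""
    return [
        lo[imm8 & 3],
        lo[(imm8 >> 2) & 3],
        hi[(imm8 >> 4) & 3],
        hi[(imm8 >> 6) & 3],
    ]
```

Calling ``mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))`` broadcasts ``a[2]`` to all four lanes, matching the example in the text.
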
Transpose
=========

Assume a vector of 4x4 matrices is stored as 4 separate vectors with subvl=4, in struct-of-array-of-struct form (the form I've been planning on using). Using standard (4+4) -> 4 swizzle instructions, with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, giving 8 instructions in total.

input:

::

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:

::

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

finish transpose

output:

::

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |

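The two steps above can be sketched in Python for a single 4x4 matrix. ``swizzle2`` is a hypothetical model of a (4+4) -> 4 swizzle: it picks 4 lanes out of the 8 lanes of two inputs, with selectors 0-3 naming the first input and 4-7 the second. The particular selector values are one choice that reproduces the intermediate and output layouts shown. They are not taken from any encoding on this page.

```python
def swizzle2(a, b, sel):
    """Hypothetical (4+4) -> 4 swizzle: sel holds 4 lane numbers into a+b."""
    both = list(a) + list(b)
    return [both[s] for s in sel]


def transpose_4x4(r0, r1, r2, r3):
    """Two steps of four swizzles each; returns (intermediate, output) rows."""
    # step 1: transpose the four 2x2 corner blocks
    t0 = swizzle2(r0, r1, (0, 4, 2, 6))
    t1 = swizzle2(r0, r1, (1, 5, 3, 7))
    t2 = swizzle2(r2, r3, (0, 4, 2, 6))
    t3 = swizzle2(r2, r3, (1, 5, 3, 7))
    # step 2: exchange the two off-diagonal 2x2 blocks
    o0 = swizzle2(t0, t2, (0, 1, 4, 5))
    o1 = swizzle2(t1, t3, (0, 1, 4, 5))
    o2 = swizzle2(t0, t2, (2, 3, 6, 7))
    o3 = swizzle2(t1, t3, (2, 3, 6, 7))
    return [t0, t1, t2, t3], [o0, o1, o2, o3]
```

In the vectorized case each of the eight swizzles would operate on every matrix in the vector at once, so the instruction count stays at 8 regardless of VL.
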
<http://web.archive.org/web/20100111104515/http://www.randombit.net:80/bitbashing/programming/integer_matrix_transpose_in_sse2.html>

::

    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);

    /* Assigning transposed values back into I[0-3] */
    I0 = _mm_unpacklo_epi64(T0, T1);
    I1 = _mm_unpackhi_epi64(T0, T1);
    I2 = _mm_unpacklo_epi64(T2, T3);
    I3 = _mm_unpackhi_epi64(T2, T3);

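A quick way to convince oneself that the SSE2 sequence above really is a transpose is to model the four unpack intrinsics on Python lists of four 32-bit lanes. The helper names below are shortened stand-ins for the intrinsics, not real APIs.

```python
def unpacklo32(a, b):
    """Model of _mm_unpacklo_epi32: interleave the low two 32-bit lanes."""
    return [a[0], b[0], a[1], b[1]]


def unpackhi32(a, b):
    """Model of _mm_unpackhi_epi32: interleave the high two 32-bit lanes."""
    return [a[2], b[2], a[3], b[3]]


def unpacklo64(a, b):
    """Model of _mm_unpacklo_epi64: low 64 bits of a, then low 64 bits of b."""
    return [a[0], a[1], b[0], b[1]]


def unpackhi64(a, b):
    """Model of _mm_unpackhi_epi64: high 64 bits of a, then high 64 bits of b."""
    return [a[2], a[3], b[2], b[3]]


def transpose_sse2(i0, i1, i2, i3):
    """Same eight steps as the C code, returning the new I0..I3."""
    t0 = unpacklo32(i0, i1)
    t1 = unpacklo32(i2, i3)
    t2 = unpackhi32(i0, i1)
    t3 = unpackhi32(i2, i3)
    return [unpacklo64(t0, t1), unpackhi64(t0, t1),
            unpacklo64(t2, t3), unpackhi64(t2, t3)]
```

Like the swizzle-based version, this takes 8 two-input operations, but each step is a fixed interleave rather than an arbitrary lane selection.
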
Transforms for DCT
==================

<https://opencores.org/websvn/filedetails?repname=mpeg2fpga&path=%2Fmpeg2fpga%2Ftrunk%2Frtl%2Fmpeg2%2Fidct.v>

Table to evaluate
=================

+-----------+-------+-------+-------+-------+-------+------+
|           | 31:27 | 26:25 | 24:20 | 19:15 | 14:12 | 11:7 |
+===========+=======+=======+=======+=======+=======+======+
| swizzle2  | rs3   | 00    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle2 | rs3   | 01    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzle   | 0     | 10    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| fswizzle  | 0     | 11    | rs2   | rs1   | 000   | rd   |
+-----------+-------+-------+-------+-------+-------+------+
| swizzlei  | imm                   | rs1   | 001   | rd   |
+-----------+                       +-------+-------+------+
| fswizzlei |                       | rs1   | 010   | rd   |
+-----------+-------+-------+-------+-------+-------+------+