MV.X and MV.swizzle
===================

Swizzle needs a MV operation. See below for a potential way to use
funct7 to do a swizzle in rs2.

+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| Encoding      | 31:27       | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + imm[11:0]                      + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+
| RV32-I-type   + fn4[3:0]    + swizzle[7:0]     + rs1[4:0] + 0b000  | rd[4:0]  + OP-V   + 0b11   |
+---------------+-------------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV
* OP-V = 0b1010111
* fn4 = 4-bit function.
* fn4 = 0b0000 - INT MV-SWIZZLE ?
* fn4 = 0b0001 - FP MV-SWIZZLE ?
* fn4 = 0bNN10 - INT MV-X, NN=elwidth (default/8/16/32)
* fn4 = 0bNN11 - FP MV-X, NN=elwidth (default/8/16/32)

swizzle (only active on SV or P48/P64 when SUBVL!=0):

+-----+-----+-----+-----+
| 7:6 | 5:4 | 3:2 | 1:0 |
+-----+-----+-----+-----+
| w   | z   | y   | x   |
+-----+-----+-----+-----+
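
As a sketch of the layout above (the helper names are illustrative, not
part of the spec), the 8-bit swizzle field holds four 2-bit lane
selectors, x in bits 1:0 up to w in bits 7:6, each naming which source
lane a destination lane copies:

.. parsed-literal::

    def decode_swizzle(imm8):
        # x is bits 1:0, y bits 3:2, z bits 5:4, w bits 7:6
        return [(imm8 >> (2 * lane)) & 0b11 for lane in range(4)]

    def apply_swizzle(group, imm8):
        # each destination lane copies the source lane its selector names
        return [group[sel] for sel in decode_swizzle(imm8)]

    # 0b00_01_10_11 selects x=3, y=2, z=1, w=0: a lane reversal
    reversed_group = apply_swizzle([10, 20, 30, 40], 0b00011011)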

Pseudocode for the element-width part of MV.X:

::

    def mv_x(rd, rs1, funct4):
        elwidth = (funct4>>2) & 0x3
        bitwidth = {0:XLEN, 1:8, 2:16, 3:32}[elwidth] # get bits per el
        bytewidth = bitwidth / 8                      # get bytes per el
        for i in range(VL):
            # locate the i'th index within the regfile viewed as SRAM
            addr = (unsigned char *)&regs[rs1]
            index = *(addr + i*bytewidth)
            # TODO: actually, this needs to respect the rd and rs1
            # element widths here as well.  this pseudocode just
            # illustrates that the MV.X operation contains a way to
            # compact the indices into less space.
            regs[rd+i] = regs[index]

The idea here is to allow 8-bit indices to be stored inside XLEN-sized
registers, such that rather than doing this:

.. parsed-literal::

    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:

.. parsed-literal::

    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus four indices are compacted into the one register. x3 and x8's
element widths are *independent* of the MV.X elwidth, thus allowing
both the source and destination element widths of the *elements* being
moved to be over-ridden, whilst *at the same time* allowing the
*indices* to be compacted.
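
The compacted form can be sketched in Python (toy register file,
register numbers, and values are purely illustrative):

.. parsed-literal::

    # toy register file: regs[i] holds i*10, purely for illustration
    regfile = [i * 10 for i in range(32)]

    packed = 0x00020301   # four 8-bit indices in one register
    indices = [(packed >> (8 * i)) & 0xFF for i in range(4)]

    # MV.X with VL=4, elwidth=8: regs[rd+i] = regs[indices[i]]
    rd = 16
    for i, idx in enumerate(indices):
        regfile[rd + i] = regfile[idx]

Unpacking the bytes of 0x00020301 least-significant first yields the
indices 1, 3, 2, 0, matching the four separate ldimm instructions in
the first example.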

----

Potential MV.X: a register-version of MV-swizzle?

+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| Encoding    | 31:27 | 26:25 | 24:20    | 19:15    | 14:12  | 11:7     | 6:2    | 1:0    |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + funct7        + rs2[4:0] + rs1[4:0] + funct3 | rd[4:0]  + opcode + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+
| RV32-R-type + 0b0000000     + rs2[4:0] + rs1[4:0] + 0b001  | rd[4:0]  + OP-V   + 0b11   |
+-------------+-------+-------+----------+----------+--------+----------+--------+--------+

* funct3 = MV.X
* OP-V = 0b1010111
* funct7 = 0b000NN00 - INT MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b000NN10 - FP MV.X, elwidth=NN (default/8/16/32)
* funct7 = 0b0000001 - INT MV.swizzle, to say that rs2 is a swizzle argument?
* funct7 = 0b0000011 - FP MV.swizzle, to say that rs2 is a swizzle argument?

Question: do we need a swizzle MV.X as well?
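
One possible decode of the funct7 encodings listed above, as a sketch
(the helper and its return convention are assumptions for illustration,
not part of any spec):

.. parsed-literal::

    ELWIDTH = {0b00: "default", 0b01: 8, 0b10: 16, 0b11: 32}

    def decode_funct7(funct7):
        fp = bool(funct7 & 0b10)      # bit 1 selects INT vs FP
        if funct7 & 0b1:              # bit 0 set: rs2 is a swizzle argument
            return ("FP" if fp else "INT", "MV.swizzle", None)
        nn = (funct7 >> 2) & 0b11     # NN field: element width
        return ("FP" if fp else "INT", "MV.X", ELWIDTH[nn])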

macro-op fusion
===============

There is the potential for macro-op fusion of MV-swizzle with the
following instruction and/or the preceding instruction.
<http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-August/002486.html>

VBLOCK context?
===============

Additional idea: a VBLOCK context that says that if a given register is
used, the register is to be "swizzled", with the VBLOCK swizzle context
containing the swizzling to be carried out.

mm_shuffle_ps?
==============

::

    __m128 _mm_shuffle_ps(__m128 lo, __m128 hi,
                          _MM_SHUFFLE(hi3, hi2, lo1, lo0))

Interleave the inputs into the low 2 floats and high 2 floats of the
output. Basically::

    out[0] = lo[lo0];
    out[1] = lo[lo1];
    out[2] = hi[hi2];
    out[3] = hi[hi3];

For example, ``_mm_shuffle_ps(a, a, _MM_SHUFFLE(i, i, i, i))`` copies the
float a[i] into all 4 output floats.
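
A Python model of these semantics may help (the real intrinsic operates
on __m128 SSE registers; plain 4-element lists stand in here):

.. parsed-literal::

    def mm_shuffle(hi3, hi2, lo1, lo0):
        # same packing as the _MM_SHUFFLE macro: two bits per selector
        return (hi3 << 6) | (hi2 << 4) | (lo1 << 2) | lo0

    def mm_shuffle_ps(lo, hi, imm8):
        # low 2 output floats come from lo, high 2 from hi
        return [lo[imm8 & 3], lo[(imm8 >> 2) & 3],
                hi[(imm8 >> 4) & 3], hi[(imm8 >> 6) & 3]]

    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    interleaved = mm_shuffle_ps(a, b, mm_shuffle(3, 2, 1, 0))
    splat = mm_shuffle_ps(a, a, mm_shuffle(2, 2, 2, 2))  # broadcast a[2]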

Transpose
=========

Assuming a vector of 4x4 matrices is stored as 4 separate vectors with
subvl=4, in struct-of-array-of-struct form (the form I've been planning
on using): using standard (4+4) -> 4 swizzle instructions, with 2 input
vectors of subvl=4 and 1 output vector of subvl=4, a vectorised matrix
transpose operation can be done in 2 steps, with 4 instructions per
step, giving 8 instructions in total.

input::

    | m00 m10 m20 m30 |
    | m01 m11 m21 m31 |
    | m02 m12 m22 m32 |
    | m03 m13 m23 m33 |

Step 1: transpose the four corner 2x2 matrices.

intermediate::

    | m00 m01 m20 m21 |
    | m10 m11 m30 m31 |
    | m02 m03 m22 m23 |
    | m12 m13 m32 m33 |

Step 2: finish the transpose.

output::

    | m00 m01 m02 m03 |
    | m10 m11 m12 m13 |
    | m20 m21 m22 m23 |
    | m30 m31 m32 m33 |
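
The two steps can be sketched in Python. The swizzle2 helper below is
an illustrative model of a (4+4) -> 4 swizzle instruction: selectors
0-3 pick from the first input vector, 4-7 from the second.

.. parsed-literal::

    def swizzle2(a, b, sel):
        # model of a (4+4) -> 4 swizzle over two subvl=4 vectors
        src = a + b
        return [src[s] for s in sel]

    # one 4x4 matrix as 4 subvl=4 vectors, with distinct toy values
    m = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

    # step 1: transpose the four corner 2x2 matrices (4 instructions)
    i0 = swizzle2(m[0], m[1], [0, 4, 2, 6])
    i1 = swizzle2(m[0], m[1], [1, 5, 3, 7])
    i2 = swizzle2(m[2], m[3], [0, 4, 2, 6])
    i3 = swizzle2(m[2], m[3], [1, 5, 3, 7])

    # step 2: finish the transpose (4 more instructions)
    t = [swizzle2(i0, i2, [0, 1, 4, 5]),
         swizzle2(i1, i3, [0, 1, 4, 5]),
         swizzle2(i0, i2, [2, 3, 6, 7]),
         swizzle2(i1, i3, [2, 3, 6, 7])]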