# Questions

Confirmation is needed as to whether subvector extraction can be covered
by twin predication (it probably can; that is one of the many purposes it
is for).

Answer:

Yes, it can, but VL needs to be changed for it to work, since predicates
work at the size of a whole subvector instead of an element of that
subvector. To avoid needing to constantly change VL, and since swizzles
are a very common operation, I think we should have a separate instruction:
a subvector element swizzle instruction::

    velswizzle x32, x64, SRCSUBVL=3, DESTSUBVL=4, ELTYPE=u8, elements=[0, 0, 2, 1]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint8_t ELTYPE;
    const int SRCSUBVL = 3;
    const int DESTSUBVL = 4;
    const int elements[] = {0, 0, 2, 1};
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // destination element j of sub-vector i takes source element elements[j]
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
        rd[i * DESTSUBVL + 1] = rs1[i * SRCSUBVL + elements[1]];
        rd[i * DESTSUBVL + 2] = rs1[i * SRCSUBVL + elements[2]];
        rd[i * DESTSUBVL + 3] = rs1[i * SRCSUBVL + elements[3]];
    }

To use the subvector element swizzle instruction to extract a single subvector
element, all that needs to be done is to set DESTSUBVL to 1::

    // extract element index 2
    velswizzle rd, rs1, SRCSUBVL=4, DESTSUBVL=1, ELTYPE=u32, elements=[2]

Example pseudocode:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint32_t ELTYPE;
    const int SRCSUBVL = 4;
    const int DESTSUBVL = 1;
    const int elements[] = {2};
    ELTYPE *rd = (ELTYPE *)&regs[...];
    ELTYPE *rs1 = (ELTYPE *)&regs[...];
    for(int i = 0; i < VL; i++)
    {
        // one destination element per sub-vector: source element 2
        rd[i * DESTSUBVL + 0] = rs1[i * SRCSUBVL + elements[0]];
    }

> ok, I like that idea - adding to TODO list

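The two velswizzle examples above differ only in their field values, so both can be folded into one routine. The following is a minimal sketch, not part of the proposal: `velswizzle_u8` and its parameter names are hypothetical, the element type is fixed to u8 for brevity, and plain arrays stand in for the register file. With two source sub-vectors `{10, 11, 12}` and `{20, 21, 22}`, the selectors `[0, 0, 2, 1]` should give `{10, 10, 12, 11, 20, 20, 22, 21}`, and the single selector `[2]` should give `{12, 22}`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): applies one swizzle to vl
 * sub-vectors of 8-bit elements. sel[j] is the index of the source
 * element that is copied into destination element j. */
static void velswizzle_u8(uint8_t *rd, const uint8_t *rs1, int vl,
                          int srcsubvl, int destsubvl, const int *sel)
{
    for (int i = 0; i < vl; i++) {
        for (int j = 0; j < destsubvl; j++) {
            /* a selector must name an element inside the source sub-vector */
            assert(sel[j] >= 0 && sel[j] < srcsubvl);
            rd[i * destsubvl + j] = rs1[i * srcsubvl + sel[j]];
        }
    }
}
```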
----

What is SUBVL and how does it work?

Answer:

SUBVL is the instruction field in P48 instructions that specifies
the sub-vector length. The sub-vector length is the number of scalars
that are grouped together and treated as a single element by both VL and
predication. This is used to support operations where the elements are
short vectors (2-4 elements), as in Vulkan and OpenGL. Those short vectors
are mostly used as mathematical vectors to handle directions, positions,
and colors, rather than as a pure optimization.

For example, when VL is 5::

    add x32, x48, x64, SUBVL=3, ELTYPE=u16, PRED=!x9

performs the following operation:

.. code:: C

    // processor state:
    uint64_t regs[128];
    int VL = 5;

    // instruction fields:
    typedef uint16_t ELTYPE;
    const int SUBVL = 3;
    ELTYPE *rd = (ELTYPE *)&regs[32];
    ELTYPE *rs1 = (ELTYPE *)&regs[48];
    ELTYPE *rs2 = (ELTYPE *)&regs[64];
    for(int i = 0; i < VL; i++)
    {
        // PRED=!x9: bit i of the inverted predicate gates the whole sub-vector i
        if((~regs[9] >> i) & 0x1)
        {
            rd[i * SUBVL + 0] = rs1[i * SUBVL + 0] + rs2[i * SUBVL + 0];
            rd[i * SUBVL + 1] = rs1[i * SUBVL + 1] + rs2[i * SUBVL + 1];
            rd[i * SUBVL + 2] = rs1[i * SUBVL + 2] + rs2[i * SUBVL + 2];
        }
    }

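To make the "treated like an element by predication" point concrete, here is a small sketch that can be compiled and run. It is an illustration under assumptions rather than the specified behaviour: `subvl_add_u16` and its parameters are hypothetical names, plain arrays replace the register file, and masked-out sub-vectors are assumed to be left untouched. One predicate bit covers all SUBVL scalars of a sub-vector, so with VL=2, SUBVL=3 and an inverted predicate of `0b01`, only the second sub-vector is written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): u16 add over vl sub-vectors of
 * subvl scalars each. Bit i of pred gates sub-vector i as a whole;
 * invert != 0 models an inverted predicate such as PRED=!x9. */
static void subvl_add_u16(uint16_t *rd, const uint16_t *rs1,
                          const uint16_t *rs2, int vl, int subvl,
                          uint64_t pred, int invert)
{
    assert(vl >= 0 && vl <= 64); /* one 64-bit predicate register */
    if (invert)
        pred = ~pred;
    for (int i = 0; i < vl; i++) {
        if (((pred >> i) & 1) == 0)
            continue; /* assumption: a masked sub-vector is left unchanged */
        for (int j = 0; j < subvl; j++)
            rd[i * subvl + j] =
                (uint16_t)(rs1[i * subvl + j] + rs2[i * subvl + j]);
    }
}
```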
----

SVorig goes to a lot of effort to keep 1 <= VL <= MAXVL, with MAXVL in
1..64, so that both CSRs can be stored internally in only 6 bits.

Thus, CSRRWI (whose immediate is 5 bits wide) can reach 1..32 for VL and
MAXVL.

In addition, instead of setting a hardware loop to zero (turning its
instructions into NOPs), code can just branch over them: start the first
loop at the end, on the test for the loop variable being zero, a la C
"while" instead of "do while".

Or does it not matter that VL only goes up to 31 on a CSRRWI, and that
it only goes to a max of 63 rather than 64?

Answer:

I think supporting SETVL where VL would be set to 0 should be done. That
way, the branch can be put after SETVL, allowing SETVL to execute
earlier and giving more time for VL to propagate to the instruction
decoder (preventing stalling). I have no problem with having 0 stored to
VL via CSRW resulting in VL=64 (or whatever maximum value is supported
in hardware).

One related idea would be to support VL > XLEN, but to only allow unpredicated
instructions when VL > XLEN. This would allow later implementing register
pairs/triplets/etc. as predicates as an extension.

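The two encodings being weighed here can be put side by side. This is a rough sketch under stated assumptions, not text from the spec: the function names are hypothetical, the offset-by-one storage is one reading of how 1..64 fits into 6 bits, and what a CSRW of a value above the hardware maximum should do is not covered by the discussion, so it is left as a precondition.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reading of the question: VL in 1..64 is kept in 6 bits as VL - 1. */
static uint8_t vl_to_6bit(int vl)
{
    assert(vl >= 1 && vl <= 64);
    return (uint8_t)(vl - 1);
}

static int vl_from_6bit(uint8_t stored)
{
    return (stored & 0x3f) + 1;
}

/* The answer's alternative for CSRW: a written 0 selects the hardware
 * maximum, any other value is taken literally. */
static int vl_from_csrw(uint64_t written, int hw_max_vl)
{
    assert(hw_max_vl >= 1);
    assert(written <= (uint64_t)hw_max_vl); /* larger values: not specified */
    return written == 0 ? hw_max_vl : (int)written;
}
```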
----

Is MV.X good enough a substitute for swizzle?

Answer:

No, since the swizzle instruction specifies in the opcode which elements are
used and where they go, so it can run much faster since the execution engine
doesn't need to pessimize. Additionally, swizzles almost always have constant
element selectors. MV.X is meant more as a last-resort instruction that is
better than load/store, but worse than everything else.

----

Is vectorised srcbase OK as a gather/scatter, and an OK substitute for
register stride? 5 dependency registers (reg stride being the 5th)
is quite scary.

----

Why are integer conversion instructions needed, when the main SV spec
covers them by allowing elwidth to be set on both src and dest regs?

----

Why are the SETVL rules so complex? What is the reason, and how are loops
carried out?

Partial Answer:

The idea is that the compiler knows maxVL at compile time, since it allocated the
backing registers, so SETVL takes maxVL as an immediate value. No
maxVL CSR is needed for just SVPrefix.

> When looking at a loop assembly sequence,
> I think you'll find this approach will not work.
> RVV loops, on which SV loops are directly based, need an understanding
> of the use of MIN. Yes, MVL is known at compile time;
> however, unless MVL is communicated to the hardware, SETVL just
> does not work.
> The only other option which does work is to set a mandatory
> hardcoded MVL baked into the actual hardware.
> That results in loss of flexibility and defeats the purpose of SV.

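The role of MIN in such a loop can be sketched in C. This is a simplified model, assuming SETVL computes VL = MIN(requested, MVL) in the style of RVV strip-mining; `setvl` and `strip_mined_add` are hypothetical names, and the inner `for` loop stands in for a single vectorised add. Whether MVL arrives as an immediate, a CSR, or a hardwired value, the hardware has to know it to compute the MIN. A request of 0 yields VL=0 and the loop body is skipped, which matches the "while" form discussed earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of SETVL: the hardware clamps the request to MVL. */
static size_t setvl(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}

/* Strip-mined loop: adds n elements, at most mvl per pass.
 * Returns the number of passes taken. */
static size_t strip_mined_add(uint32_t *dst, const uint32_t *a,
                              const uint32_t *b, size_t n, size_t mvl)
{
    assert(mvl >= 1);
    size_t passes = 0;
    while (n > 0) {
        size_t vl = setvl(n, mvl);      /* VL = MIN(remaining, MVL) */
        for (size_t i = 0; i < vl; i++) /* stands in for one vector add */
            dst[i] = a[i] + b[i];
        dst += vl;
        a += vl;
        b += vl;
        n -= vl;
        passes++;
    }
    return passes;
}
```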
----

With SUBVL (sub-vector length) being both a CSR and also part of the 48/64
bit opcode, how does that work?

Answer:

I think we should just ignore the SUBVL CSR and use the value from the
SUBVL field when executing 48/64-bit instructions. For just SVPrefix,
I would say that the only user-visible CSR needed is VL. This is ignoring
all the state for context-switching and exception handling.

> The consequence of that would be that P48/64 would need
> its own CSR State to track the subelement index,
> or that any exceptions would need to occur on a group
> basis, which is less than ideal,
> and interrupts would have to be stalled.
> Interacting with SUBVL and requiring P48/64 to save the
> STATE CSR if needed is a workable compromise that
> does not result in huge CSR proliferation.

----

What are the interaction rules when a 48/64 prefix opcode has a rd/rs
that already has a Vector Context for either predication or a register?

It would perhaps make sense (and for svlen as well) to make 48/64 isolated
and unaffected by VLIW context, with the exception of VL/MVL.

MVL and VL should be modifiable by the 64-bit prefix, as they are global
in nature.

Possible solution: svlen and VLtyp are allowed to share the STATE CSR; however,
the programmer becomes responsible for push and pop of state during use of
a sequence of P48 and P64 ops.

----

Can bit 60 of P64 be put to use (in all but the FR4 case)?