# Simple-V (Parallelism Extension Proposal) Vector Block Format

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 29 jun 2019

[[!toc ]]

# Vector Block Format <a name="vliw-format"></a>

This is a way to give Vector and Predication Context to a group of
standard scalar RISC-V instructions, in a highly compact form.

The format is:

* The standard RISC-V 80 to 192 bit encoding sequence, with bits
  defining the options to follow within the block
* An optional VL Block (16-bit)
* Optional predicate entries (8/16-bit blocks: see Predicate Table Format, below)
* Optional register entries (8/16-bit blocks: see Register Table Format, below)
* Finally, some 16/32/48 bit standard RV or SVPrefix opcodes

Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
as follows:

[[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]

Note: The VL Block format is similar to that used in [[sv_prefix_proposal]].

* Mode 0b00: set VL to the immediate, truncated to not exceed
  MVL. Register rd is also set to the same value, if not x0.
* Mode 0b01: follow [[specification/sv.setvl]] rules, with RVC
  style registers in the range x8-x15 for rs1 and rd.
* Mode 0b10: set both MVL and VL to the immediate. Register rd is also
  set if not x0.
* Mode 0b11: reserved. All fields must be zero.
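
The two immediate modes above can be sketched as follows. This is a minimal illustration only, not a reference implementation: the names `VectorState` and `apply_vl_block` are hypothetical, `imm` is assumed to be the already-decoded immediate value (the actual field encoding is defined in the inlined format table, not here), and Mode 0b01 is left out because its rules live in [[specification/sv.setvl]].

```python
from dataclasses import dataclass


@dataclass
class VectorState:
    """Hypothetical container for the two lengths the VL Block touches."""
    mvl: int  # MVL (MAXVL)
    vl: int   # VL


def apply_vl_block(mode, imm, state, regs, rd=0):
    """Apply VL Block Mode 0b00 or 0b10 and return the value written to rd.

    `imm` is assumed to be the decoded immediate; `regs` is a plain list
    standing in for the integer register file.
    """
    if mode == 0b00:
        # VL is the immediate, truncated so that it does not exceed MVL.
        state.vl = min(imm, state.mvl)
        result = state.vl
    elif mode == 0b10:
        # Both MVL and VL are set to the immediate.
        state.mvl = imm
        state.vl = imm
        result = imm
    elif mode == 0b01:
        raise NotImplementedError("Mode 0b01 follows the sv.setvl rules")
    else:
        raise ValueError("Mode 0b11 is reserved")
    if rd != 0:
        # x0 is never written.
        regs[rd] = result
    return result
```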

Mode 0b01 will typically be used to start vectorised loops, where
the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
sequence (in compact form).

Modes 0b00 and 0b10 are less likely to be used for loops than for
one-off operations, such as saving the entire register file to the
stack with a single Vectorised and predicated LD/ST, or saving and
restoring registers around a function call with a single instruction.

Unlike in RVV, VL is set (within the limits of MVL) to exactly the value
requested, specifically so that LD/ST-MULTI style behaviour can be done
in a single instruction.

# VBLOCK Prefix

* Bit 7 specifies if the prefix block format is the full 16 bit format
  (1) or the compact, less expressive format (0). In the 8 bit format,
  pplen is multiplied by 2.
* In the 8 bit format, predicate numbering is implicit and begins from x9.
  Thus it is critical to put blocks in the correct order as required.
* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
  (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
  of entries is needed, the last may be set to 0x00, indicating "unused".
* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
  immediately follows the VBLOCK instruction Prefix.
* Bits 8 and 9 define how many RegCam entries (0,1,2,4 if bit 7 is 1,
  otherwise 0,2,4,8) follow the (optional) VL Block.
* Bits 10 and 11 define how many PredCam entries (0,1,2,4 if bit 7 is 1,
  otherwise 0,2,4,8) follow the (optional) RegCam entries.
* Bits 14 to 12 (IL) define the actual length of the instruction: the total
  number of bits is 80 + 16 times IL. Standard RV32, RVC and also
  SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
  (optional) VL / RegCam / PredCam entries.
* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
  format *MUST* have the RegCam and PredCam entries applied to the
  operation (and the Vectorisation loop activated).
* P48 and P64 opcodes do **not** take their Register or predication
  context from the VBLOCK tables: they do however have VL or SUBVL
  applied (overridden when VLtyp or svlen are set).
* At the end of the VBLOCK Group, the RegCam and PredCam entries
  *no longer apply*. VL, MAXVL and SUBVL on the other hand remain at
  the values set by the last instruction (whether a CSRRW or the VL
  Block header).
* Although an inefficient use of resources, it is fine to set the MAXVL,
  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.
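
The bit assignments listed above can be pulled apart with a few shifts and masks. The sketch below is illustrative only: `VBlockPrefix`, `decode_prefix` and the other helper names are hypothetical, and the mapping of each 2-bit count field to 0, 1, 2, 4 entries (field values 0b00 to 0b11 in that order) is an assumption, since the text gives the possible counts but not the encoding.

```python
from typing import NamedTuple

# Assumed mapping from a 2-bit count field to an entry count (full format).
ENTRY_COUNTS = (0, 1, 2, 4)


class VBlockPrefix(NamedTuple):
    vlset: bool        # bit 15: VL Block present
    il: int            # bits 14:12: instruction length field
    pplen: int         # bits 11:10: PredCam count field
    rplen: int         # bits 9:8: RegCam count field
    full_format: bool  # bit 7: 16 bit (True) or 8 bit (False) entries


def decode_prefix(halfword: int) -> VBlockPrefix:
    """Split the first 16 bits of a VBLOCK into its fields."""
    return VBlockPrefix(
        vlset=bool((halfword >> 15) & 1),
        il=(halfword >> 12) & 0b111,
        pplen=(halfword >> 10) & 0b11,
        rplen=(halfword >> 8) & 0b11,
        full_format=bool((halfword >> 7) & 1),
    )


def total_bits(p: VBlockPrefix) -> int:
    """Total VBLOCK length: 80 + 16 times IL."""
    return 80 + 16 * p.il


def entry_count(field: int, full_format: bool) -> int:
    """Number of table entries; doubled in the 8 bit format."""
    n = ENTRY_COUNTS[field]
    return n if full_format else 2 * n


def header_bits(p: VBlockPrefix) -> int:
    """Bits used by the prefix, the optional VL Block and both tables."""
    entry_bits = 16 if p.full_format else 8
    bits = 16  # the prefix itself
    if p.vlset:
        bits += 16  # VL Block
    bits += entry_count(p.rplen, p.full_format) * entry_bits
    bits += entry_count(p.pplen, p.full_format) * entry_bits
    return bits


def opcode_bits(p: VBlockPrefix) -> int:
    """Space left over for the 16/32/48 bit opcodes."""
    return total_bits(p) - header_bits(p)
```

Under these assumptions, a prefix with IL=2, a VL Block, one 16 bit PredCam entry and two 16 bit RegCam entries describes a 112 bit VBLOCK with 32 bits left for opcodes.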

All this greatly reduces the amount of space utilised by Vectorised
instructions, given that setting a 64-bit CSR with CSRRW requires three,
even four, 32-bit opcodes: the CSRRW itself, a LI, and the setting up of
the value in the RS register of the CSRRW, which again requires a
LI / LUI to get the 32-bit data into the CSR. To get 64-bit data into
the register in order to put it into the CSR(s), LOAD operations from
memory are needed!

Given that each 64-bit CSR can hold only 4 PredCam entries (or 4 RegCam
entries), that's potentially six to eight 32-bit instructions, just to
establish the Vector State!

Not only that: even CSRRW on VL and MAXVL requires 64 bits (even more
bits if VL needs to be set to greater than 32). Bear in mind that in SV,
both MAXVL and VL need to be set.

By contrast, the VBLOCK prefix is only 16 bits, the VL/MAXVL/SUBVL block
is only 16 bits, and as long as not too many predicates and register vector
qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
the format. If the full flexibility of the 16 bit block formats is not
needed, more space is saved by using the 8 bit formats.

In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
into a VBLOCK format makes a lot of sense.

Bear in mind the warning in an earlier section that use of VLtyp or svlen
in a P48 or P64 opcode within a VBLOCK Group will result in corruption
(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
avoid this situation, the STATE CSR may be copied into a temp register
and restored afterwards.

# Register Table Format

The register table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]

# Predicate Table Format

The predicate table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]

# CSRs:

The CSRs needed, in addition to those from the main [[specification]], are:

* pcvblk
* mepcvblk
* sepcvblk
* uepcvblk
* hepcvblk

To greatly simplify implementations, which would otherwise require a
way to track (cache) VBLOCK instructions, it is required to treat the
VBLOCK group as a separate sub-program with its own separate PC. The
sub-pc advances separately whilst the main PC remains "frozen", pointing
at the beginning of the VBLOCK instruction (not to be confused with how
VL works, which is exactly the same principle, except it is VStart in
the STATE CSR that increments).

This has implications, namely that a new set of CSRs identical to (x)epc
(mepc, sepc, hepc and uepc) must be created, managed and respected
as being a sub-extension of the (x)epc set of CSRs. Thus, (x)epcvblk CSRs
must be context switched and saved / restored in traps.

The srcoffs and destoffs indices in the STATE CSR may be similarly
regarded as another sub-execution context, giving in effect two sets of
nested sub-levels of the RISC-V Program Counter (actually, three including
SUBVL and ssvoffs).

# PCVBLK CSR Format

Using PCVBLK to store the progression of decoding and subsequent execution
of opcodes in a VBLOCK allows a simple single issue design to only need to
fetch 32 or 64 bits from the instruction cache on any given clock cycle.

*(This approach also alleviates one of the main concerns with the VBLOCK
Format: unlike a VLIW engine, a FSM no longer requires full buffering
of the entire VBLOCK opcode in order to begin execution. Future versions
may therefore potentially lift the 192 bit limit).*

To support this option (where more complex implementations may skip some
of these phases), PCVBLK contains partial decode state, which allows a
trap to occur even part-way through decode, in order to reduce latency.

The format is as follows (the last row gives each field's width in bits):

| 31:30 | 29 | 28:26 | 25:24 | 23:22 | 21 | 20:5 | 4:0 |
|--------|-------|-------|-------|-------|------|-------|-------|
| status | vlset | 16xil | pplen | rplen | mode | vlblk | opptr |
| 2 | 1 | 3 | 2 | 2 | 1 | 16 | 5 |

* status is the key field that effectively exposes the inner FSM (Finite
  State Machine) directly.
* status=0b00 indicates that the processor is not in "VBLOCK Mode". It
  is instead in standard RV Scalar opcode execution mode. The processor
  will leave this mode only after it encounters the beginning of a valid
  VBLOCK opcode.
* status=0b01 indicates that vlset, 16xil, pplen, rplen and mode have
  all been copied directly from the VBLOCK so that they do not need to be
  read again from the instruction stream.
* status=0b10 indicates that the VL Block has been read from the
  instruction stream and decoded (and copied into vlblk).
* status=0b11 indicates that the Predicate and Register Blocks have been
  read from the instruction stream (and put into internal Vector Context).
  Simpler implementations are permitted to reset status back to 0b10 and
  re-read the data after return from a trap that happened to occur in the
  middle of a VBLOCK. They are not however permitted to destroy opptr in
  the process, and after re-reading the Predicate and Register Blocks must
  resume execution at the point indicated by opptr.
* opptr points to the current position within the VBLOCK's opcodes, in
  multiples of 16 bits (2 bytes); 0 indicates the start of the opcodes.
  This is the equivalent of a Program Counter, for VBLOCKs.
* At the end of a VBLOCK, when the last instruction executes (assuming it
  does not change opptr to earlier in the block), status is reset to 0b00
  to indicate exit from the VBLOCK FSM, and the current Vector Predicate
  and Register Context is destroyed (Note: the STATE CSR is **not** altered
  purely by exit from a VBLOCK Context).
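
As an illustration of the layout in the table above, the fields can be packed and unpacked as below. This is a sketch under stated assumptions rather than a normative definition: the helper names (`pack_pcvblk`, `unpack_pcvblk`, `resume_after_trap`) are hypothetical, and the 16xil field is called `il` here purely so that it is a valid Python identifier.

```python
# (field name, lowest bit position, width in bits), taken from the table.
FIELDS = (
    ("status", 30, 2),
    ("vlset", 29, 1),
    ("il", 26, 3),     # shown as "16xil" in the table
    ("pplen", 24, 2),
    ("rplen", 22, 2),
    ("mode", 21, 1),
    ("vlblk", 5, 16),
    ("opptr", 0, 5),
)


def unpack_pcvblk(value: int) -> dict:
    """Split a 32-bit PCVBLK value into a dict of its fields."""
    return {name: (value >> low) & ((1 << width) - 1)
            for name, low, width in FIELDS}


def pack_pcvblk(**fields: int) -> int:
    """Build a PCVBLK value; fields that are not given default to 0."""
    value = 0
    for name, low, width in FIELDS:
        v = fields.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value |= v << low
    return value


def resume_after_trap(value: int) -> int:
    """Model the simpler implementation described for status=0b11:
    drop back to 0b10 so the tables are re-read, leaving opptr intact."""
    f = unpack_pcvblk(value)
    if f["status"] == 0b11:
        f["status"] = 0b10
    return pack_pcvblk(**f)
```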

When status=0b11, opptr may be written to using CSRRWI. Doing so will
cause execution to jump within the block, exactly as if PC had been set
in normal RISC-V execution. Writing a value outside of the range of the
instruction block will cause an illegal instruction exception. Writing
a value (any value) when status is not 0b11 likewise causes an illegal
instruction exception.

In privileged modes, the above rules obviously do not apply to the
completely separate (x)ePCVBLK CSRs, because these are copies of state,
not the actual active PCVBLK. When writing to PCVBLK itself during a
trap, however, the rules clearly must apply.

If PCVBLK is written to with CSRRW, the same rules apply, however the
entire register in rs1 is treated as the new opptr.

Note that the value returned in the register rd is the *full* PCVBLK,
not just the opptr part.
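
The write rules above might be modelled as follows. This is a rough sketch only: `write_opptr` and `IllegalInstruction` are hypothetical names, and `opcode_halfwords` (the number of 16-bit units occupied by the opcodes in the current VBLOCK) is an assumed parameter that a real implementation would derive from the prefix fields.

```python
class IllegalInstruction(Exception):
    """Stands in for raising an illegal instruction exception."""


OPPTR_MASK = 0x1F   # opptr occupies bits 4:0 of PCVBLK
STATUS_SHIFT = 30   # status occupies bits 31:30 of PCVBLK


def write_opptr(pcvblk: int, new_opptr: int, opcode_halfwords: int):
    """Model a CSRRWI/CSRRW write of opptr.

    Returns (new PCVBLK value, value placed in rd). The rd value is the
    full old PCVBLK, not just its opptr part.
    """
    status = (pcvblk >> STATUS_SHIFT) & 0b11
    if status != 0b11:
        # Any write at all is illegal unless the tables have been read.
        raise IllegalInstruction("PCVBLK written while status != 0b11")
    if not 0 <= new_opptr < opcode_halfwords:
        # The target lies outside the instruction block.
        raise IllegalInstruction("opptr outside the instruction block")
    return (pcvblk & ~OPPTR_MASK) | new_opptr, pcvblk
```

For CSRRWI, `new_opptr` would be the 5-bit immediate; for CSRRW it would be the entire value of rs1, which is why the range check is applied to the whole value rather than to a masked copy.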

# Limitations on instructions

As the pcvblk CSR is relative to the beginning of the VBLOCK, branch
and jump opcodes MUST NOT be used to point to a location inside a block:
only at the beginning of an opcode (including another VBLOCK, including
the current one). However, setting the PCVBLK CSR is permitted, to
unconditionally jump to any opcode within a block.

Also: calling subroutines is likewise not permitted, because PCVBLK
context cannot be atomically reestablished on return from the function.

ECALL, on the other hand, which will cause a trap that saves and restores
the full state, is permitted.

Prohibited instructions will cause an illegal instruction trap. If at
that point, software is capable of then working out how to emulate a
branch or function call successfully, by manipulating (x)ePCVBLK and
other state, it is not prohibited from doing so.

To reiterate: a normal jump, normal conditional branch and a normal
function call may only be taken by letting the VBLOCK group finish,
returning to "normal" standard RV mode, and then using standard RVC,
32 bit or P48/64-\*-type opcodes.
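
One conservative reading of these restrictions is that a decoder flags every branch, jump and call encoding found inside a VBLOCK, while letting ECALL through. The sketch below takes that reading as an assumption; it is not stated in this form in the text. It only looks at 32-bit encodings (the RVC forms such as C.J or C.BEQZ would need equivalent checks), and `is_prohibited_in_vblock` is a hypothetical name.

```python
# Major opcodes (bits 6:0) of the base RISC-V 32-bit instruction formats.
OPCODE_BRANCH = 0b1100011  # BEQ, BNE, BLT, BGE, BLTU, BGEU
OPCODE_JALR = 0b1100111    # indirect jump / call / return
OPCODE_JAL = 0b1101111     # direct jump / call


def is_prohibited_in_vblock(insn32: int) -> bool:
    """Return True if a 32-bit opcode inside a VBLOCK should trap.

    Assumption: all branches, jumps and calls are treated as prohibited.
    ECALL uses the SYSTEM major opcode and is therefore not flagged.
    """
    return (insn32 & 0x7F) in (OPCODE_BRANCH, OPCODE_JALR, OPCODE_JAL)
```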

# Links

* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001824.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001880.html>

# Open Questions:

* Is it necessary to stick to the format from Section 1.5 of the RISC-V
  ISA? Why not use the 15th bit to allow 80 + 16\*0bnnnn bits? Perhaps,
  to be sane, limit to 256 bits (16 times 0-11).
* Could a "hint" be used to set which operations are parallel and which
  are sequential?
* Could a new sub-instruction opcode format be used, one that does not
  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
  There would then be no need for byte or bit-alignment.
* Could a hardware compression algorithm be deployed? Quite likely,
  because of the sub-execution context (sub-VBLOCK PC).