# Simple-V (Parallelism Extension Proposal) Vector Block Format

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Last edited: 2 sep 2019

# Vector Block Format <a name="vliw-format"></a>

This is a way to give Vector and Predication Context to a group of
standard scalar RISC-V instructions, in a highly compact form. Program
Execution Order is still preserved (unlike VLIW), just with "context"
that would otherwise require much longer instructions.

* the standard RISC-V 80 to 192 bit encoding sequence, with bits
  defining the options to follow within the block
* An optional VL Block (16-bit)
* Optional predicate entries (8/16-bit blocks: see Predicate Table, below)
* Optional register entries (8/16-bit blocks: see Register Table, below)
* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.

Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used:

[[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]

Note: The VL Block format is similar to that used in [[sv_prefix_proposal]].

* Mode 0b00: set VL to the immediate, truncated to not exceed
  MVL. Register rd is also set to the same value, if not x0.
* Mode 0b01: follow [[specification/sv.setvl]] rules, with RVC
  style registers in the range x8-x15 for rs1 and rd.
* Mode 0b10: set both MVL and VL to the immediate. Register rd is also
  set to the same value, if not x0.
* Mode 0b11: reserved. All fields must be zero.

Mode 0b01 will typically be used to start vectorised loops, where
the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
sequence (in compact form).

Modes 0b00 and 0b10 will typically not be used so much for loops as they
will be for one-off instructions such as saving the entire register file
to the stack with a single one-off Vectorised and predicated LD/ST,
or as a way to save or restore registers in a function call with a
single instruction.

Unlike in RVV, VL is set (within the limits of MVL) to exactly the value
requested, specifically so that LD/ST-MULTI style behaviour can be done
in a single instruction.

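As an illustration of modes 0b00 and 0b10, the following Python sketch models the effect of a VL Block on MVL, VL and rd. It is not taken from the specification: the function name `apply_vl_block` is hypothetical, the immediate is assumed to be already decoded to its final value, and mode 0b10 is assumed to write rd in the same way as mode 0b00. Mode 0b01 is left out because its rules are defined in [[specification/sv.setvl]].

```python
def apply_vl_block(mode, imm, mvl):
    """Return (new_mvl, new_vl, rd_value) for VL Block modes 0b00 and 0b10.

    rd_value is the value that would be written to rd when rd is not x0.
    """
    if mode == 0b00:
        vl = min(imm, mvl)       # VL is truncated so that it never exceeds MVL
        return mvl, vl, vl
    if mode == 0b10:
        # Both MVL and VL take the immediate (rd write assumed, as in mode 0b00)
        return imm, imm, imm
    if mode == 0b01:
        raise NotImplementedError("mode 0b01 follows the sv.setvl rules")
    raise ValueError("mode 0b11 is reserved")
```

For example, `apply_vl_block(0b00, 10, 4)` leaves MVL at 4 and sets VL (and rd) to 4, because the requested value of 10 exceeds MVL.
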
The purpose of the VBLOCK Prefix is to specify the context in which a
block of RV Scalar instructions are "vectorised" and/or predicated.

As there are not very many bits available without going into a prefix
format longer than 16 bits, some abbreviations are used. Two bits are
dedicated to specifying whether the Register and Predicate formats are
16 or 8 bit.

Also, the number of entries in each table is specified with an unusual
encoding, on the basis that if registers are to be Vectorised, it is
highly likely that they will be predicated as well.

The VL Block is optional and also only 16 bits: this is because an RVC
opcode is limited by comparison.

The format is explained as follows:

* Bit 7 specifies if the register prefix block format is the full 16 bit format
  (1) or the compact less expressive format (0).
* 8 bit format predicate numbering is implicit and begins from x9. Thus
  it is critical to put blocks in the correct order as required.
* Bit 8 specifies if the predicate block format is 16 bit (1) or 8 bit
  (0).
* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
  immediately follows the VBLOCK instruction Prefix.
* Bits 10 and 11 (rplen) define how many RegCam entries (0,1,2,4 if bit 7 is 1,
  otherwise 0,2,4,8) follow the (optional) VL Block.
* Bit 9 (pplen) defines how many PredCam entries follow the (optional) RegCam block.
  If pplen is 1, it is equal to rplen. Otherwise, half rplen, rounded up.
* If the exact number of entries is not required, PredCam and RegCam
  entries may be set to all zero to indicate "unused" (no effect).
* Bits 14 to 12 (IL) define the actual length of the instruction: total
  number of bits is 80 + 16 times IL. Standard RV32, RVC and also
  SVPrefix (P32C/P48/64-\*-Type) instructions fit into this space, after the
  (optional) VL / RegCam / PredCam entries.
* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
  format *MUST* have the RegCam and PredCam entries applied to the
  operation (and the Vectorisation loop activated).
* P48 and P64 opcodes do **not** take their Register or predication
  context from the VBLOCK tables: they do however have VL or SUBVL
  applied (overridden when VLtyp or svlen are set).
* At the end of the VBLOCK Group, the RegCam and PredCam entries
  *no longer apply*. VL, MAXVL and SUBVL on the other hand remain at
  the values set by the last instruction (whether a CSRRW or the VL
  Block).
* Although an inefficient use of resources, it is fine to set the MAXVL,
  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.

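The bit assignments listed above can be sketched as a small Python decoder for the 16-bit VBLOCK prefix. This is an illustration only: the names `VBlockPrefix` and `decode_vblock_prefix` are hypothetical, the two-bit entry-count field is assumed to index the listed counts in the order given, and "half rplen" is assumed to mean half the number of RegCam entries.

```python
from dataclasses import dataclass

@dataclass
class VBlockPrefix:
    reg_16bit: bool    # bit 7: 16 bit (True) or compact (False) register format
    pred_16bit: bool   # bit 8: 16 bit (True) or 8 bit (False) predicate format
    pplen: int         # bit 9
    rplen: int         # bits 11:10
    il: int            # bits 14:12
    vl_present: bool   # bit 15
    total_bits: int    # 80 + 16 * IL
    n_regcam: int      # number of RegCam entries
    n_predcam: int     # number of PredCam entries

def decode_vblock_prefix(hw):
    """Decode the 16-bit VBLOCK prefix halfword into its fields."""
    reg_16bit = bool((hw >> 7) & 1)
    pred_16bit = bool((hw >> 8) & 1)
    pplen = (hw >> 9) & 1
    rplen = (hw >> 10) & 0b11
    il = (hw >> 12) & 0b111
    vl_present = bool((hw >> 15) & 1)
    # Assumed mapping: the 2-bit field selects from the counts in listed order
    counts = (0, 1, 2, 4) if reg_16bit else (0, 2, 4, 8)
    n_regcam = counts[rplen]
    # pplen=1: same as the RegCam count; otherwise half of it, rounded up
    n_predcam = n_regcam if pplen else (n_regcam + 1) // 2
    return VBlockPrefix(reg_16bit, pred_16bit, pplen, rplen, il, vl_present,
                        80 + 16 * il, n_regcam, n_predcam)
```
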
All this would greatly reduce the amount of space utilised by Vectorised
instructions, given that a 64-bit CSRRW requires three, even four, 32-bit
opcodes: the CSR itself, a LI, and the setting up of the value into the RS
register of the CSR, which, again, requires a LI / LUI to get the 32
bit data into the CSR. To get 64-bit data into the register in order
to put it into the CSR(s), LOAD operations from memory are needed!

Given that each 64-bit CSR can hold only four PredCam entries (or four RegCam
entries), that's potentially six to eight 32-bit instructions, just to
establish the Vector State!

Not only that: even CSRRW on VL and MAXVL requires 64 bits (even more
bits if VL needs to be set to greater than 32). Bear in mind that in SV,
both MAXVL and VL need to be set.

By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is
only 16 bits, and as long as not too many predicates and register vector
qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
the format. If the full flexibility of the 16 bit block formats is not
needed, more space is saved by using the 8 bit formats.

In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
into a VBLOCK format makes a lot of sense.

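To make the space argument concrete, the sketch below computes how many bits of a VBLOCK remain for opcodes once the prefix, the optional VL Block and the table entries are accounted for. The function name `opcode_bits_available` and its parameters are hypothetical, and no alignment padding between the tables and the opcodes is modelled, which is an assumption.

```python
def opcode_bits_available(il, vl_present, n_regcam, reg_16bit,
                          n_predcam, pred_16bit):
    """Bits left for opcodes in a VBLOCK of 80 + 16*IL bits."""
    total = 80 + 16 * il
    used = 16                                      # VBLOCK prefix
    used += 16 if vl_present else 0                # optional VL Block
    used += n_regcam * (16 if reg_16bit else 8)    # RegCam entries
    used += n_predcam * (16 if pred_16bit else 8)  # PredCam entries
    return total - used
```

With IL=7 (192 bits), a VL Block, two 16 bit RegCam entries and two 8 bit PredCam entries, `opcode_bits_available(7, True, 2, True, 2, False)` gives 112 bits, enough for three 32-bit opcodes and one 16-bit opcode.
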
Bear in mind the warning in an earlier section that use of VLtyp or svlen
in a P48 or P64 opcode within a VBLOCK Group will result in corruption
(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
avoid this situation, the STATE CSR may be copied into a temp register
and restored afterwards.

# Register Table Format

The register table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]

# Predicate Table Format

The predicate table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]

# Swizzle Table Format<a name="swizzle_format"></a>

The swizzle table format is included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/swizzle_table_format" ]]

Swizzle blocks are only accessible using the "VBLOCK2" format.

The swizzles activate on SUBVL and only when used in an operation where
a register matches a SwizzleCAM register entry.

On a match the register element index will be redirected through the
swizzle format. If however the type is set to "constants" then, instead
of reading the register file, the relevant constant is substituted.

Setting const type on a destination element will cause an illegal
instruction exception.

# REMAP Area Format<a name="remap_format"></a>

REMAP is an algorithmic version of in-place vector "vgather" or "swizzle".

The REMAP area is divided into two areas:

* Register-to-SHAPE. This defines which registers have which shapes.
  Each entry is 8-bits in length.
* SHAPE Table entries. These are 32-bits in length and are aligned
  to (start on) a 16 bit boundary.

| shapeidx | regnum |
| -------- | ------ |

When both shapeidx and regnum are zero, this indicates the end of the
REMAP Register-to-SHAPE section. The REMAP Table section size is then
aligned to a 16-bit boundary. 32-bit SHAPE Table Entries then fill the
remainder of the REMAP area, and are indexed in order by shapeidx.

In this way, multiple registers may share the same "shape" characteristics.

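The layout described above can be sketched as a Python parser that splits a REMAP area into its two parts. Several details here are assumptions rather than statements from the specification: the function name `parse_remap_area` is hypothetical, shapeidx is assumed to occupy the upper 3 bits and regnum the lower 5 bits of each 8-bit entry, the terminating zero entry is assumed to be counted before alignment, and SHAPE entries are assumed to be stored little-endian.

```python
import struct

def parse_remap_area(data):
    """Split a REMAP area (bytes) into ([(regnum, shapeidx)], [shape words])."""
    reg_to_shape = []
    pos = 0
    while pos < len(data):
        entry = data[pos]
        pos += 1
        if entry == 0:             # shapeidx == 0 and regnum == 0: end marker
            break
        shapeidx = entry >> 5      # assumed position: upper 3 bits
        regnum = entry & 0x1F      # assumed position: lower 5 bits
        reg_to_shape.append((regnum, shapeidx))
    pos += pos % 2                 # align to a 16-bit (2-byte) boundary
    n_shapes = (len(data) - pos) // 4
    # 32-bit SHAPE entries fill the remainder, indexed in order by shapeidx
    shapes = list(struct.unpack_from("<%dI" % n_shapes, data, pos))
    return reg_to_shape, shapes
```

Because the SHAPE entries are looked up by shapeidx, two register entries carrying the same shapeidx refer to the same list element, which is how several registers come to share one shape.
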
# SHAPE Table Format<a name="shape_format"></a>

The shape table format is included here for convenience. See
[[simple_v_extension/remap]] for full details on how SHAPE applies,
including pseudo-code.

[[!inline raw="yes" pages="simple_v_extension/shape_table_format" ]]

REMAP Shape blocks are only accessible using the "VBLOCK2" format.

The CSRs needed, in addition to those from the main [[specification]] are:

To greatly simplify implementations, which would otherwise require a
way to track (cache) VBLOCK instructions, it is required to treat the
VBLOCK group as a separate sub-program with its own separate PC. The
sub-pc advances separately whilst the main PC remains "frozen", pointing
at the beginning of the VBLOCK instruction (not to be confused with how
VL works, which is exactly the same principle, except it is VStart in
the STATE CSR that increments).

This has implications, namely that a new set of CSRs identical to (x)epc
(mepc, sepc, hepc and uepc) must be created and managed and respected
as being a sub extension of the (x)epc set of CSRs. Thus, (x)epcvblk CSRs
must be context switched and saved / restored in traps.

The srcoffs and destoffs indices in the STATE CSR may be similarly
regarded as another sub-execution context, giving in effect two sets of
nested sub-levels of the RISC-V Program Counter (actually, three including

Using PCVBLK to store the progression of decoding and subsequent execution
of opcodes in a VBLOCK allows a simple single issue design to only need to
fetch 32 or 64 bits from the instruction cache on any given clock cycle.

*(This approach also alleviates one of the main concerns with the VBLOCK
Format: unlike a VLIW engine, a FSM no longer requires full buffering
of the entire VBLOCK opcode in order to begin execution. Future versions
may therefore potentially lift the 192 bit limit).*

To support this option (where more complex implementations may skip some
of these phases), PCVBLK contains partial decode state, which allows a
trap to occur even part-way through decode, in order to reduce latency.

The format is as follows:

| 31:30  | 29    | 28:26 | 25:24 | 23:22 | 21   | 20:5    | 4:0   |
|--------|-------|-------|-------|-------|------|---------|-------|
| status | vlset | 16xil | pplen | rplen | mode | vblock2 | opptr |
| 2      | 1     | 3     | 2     | 2     | 1    | 16      | 5     |

* status is the key field that effectively exposes the inner FSM (Finite
  State Machine) directly.
* status=0b00 indicates that the processor is not in "VBLOCK Mode". It
  is instead in standard RV Scalar opcode execution mode. The processor
  will leave this mode only after it encounters the beginning of a valid
  VBLOCK.
* status=0b01 indicates that vlset, 16xil, pplen, rplen and mode have
  all been copied directly from the VBLOCK so that they do not need to be
  read again from the instruction stream, and that VBLOCK2 has also been
  read and stored, if 16xil was equal to 0b111.
* status=0b10 indicates that the VL Block has been read from the instruction
  stream and actioned. (This means that a SETVL instruction has been
  created and executed). It also indicates that the Predicate, Register
  and Swizzle Blocks are now being read.
* status=0b11 indicates that the Predicate and Register Blocks have been
  read from the instruction stream (and put into internal Vector Context).
  Simpler implementations are permitted to reset status back to 0b10 and
  re-read the data after return from a trap that happened to occur in the
  middle of a VBLOCK. They are not however permitted to destroy opptr in
  the process, and after re-reading the Predicate and Register Blocks must
  resume execution at the opcode pointed to by opptr.
* opptr points to where instructions begin in the VBLOCK. 0 indicates
  the start of the opcodes (not the start of the VBLOCK), and is in
  multiples of 16 bits (2 bytes). This is the equivalent of a Program
  Counter, for VBLOCKs.
* at the end of a VBLOCK, when the last instruction executes (assuming it
  does not change opptr to earlier in the block), status is reset to 0b00
  to indicate exit from the VBLOCK FSM, and the current Vector Predicate
  and Register Context destroyed (Note: the STATE CSR is **not** altered
  purely by exit from a VBLOCK Context).

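The field layout in the table above can be expressed as a pair of Python helpers that unpack a 32-bit PCVBLK value into its fields and pack it again. The helper names `unpack_pcvblk` and `pack_pcvblk` are hypothetical and serve only as an illustration; the bit positions and widths are those given in the table.

```python
# (name, least significant bit, width), taken from the PCVBLK table
PCVBLK_FIELDS = (
    ("opptr",   0,  5),
    ("vblock2", 5,  16),
    ("mode",    21, 1),
    ("rplen",   22, 2),
    ("pplen",   24, 2),
    ("16xil",   26, 3),
    ("vlset",   29, 1),
    ("status",  30, 2),
)

def unpack_pcvblk(value):
    """Split a 32-bit PCVBLK value into a dict of named fields."""
    return {name: (value >> lsb) & ((1 << width) - 1)
            for name, lsb, width in PCVBLK_FIELDS}

def pack_pcvblk(fields):
    """Build a 32-bit PCVBLK value; fields that are not given default to 0."""
    value = 0
    for name, lsb, width in PCVBLK_FIELDS:
        field = fields.get(name, 0)
        if field >> width:         # also rejects negative values
            raise ValueError("%s does not fit in %d bits" % (name, width))
        value |= field << lsb
    return value
```
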
During the transition from status=0b00 to status=0b01, it is assumed
that the instruction stream is being read at a minimum of 32 bits at
a time. Therefore it is reasonable to expect that VBLOCK2 would be
successfully read simultaneously with the initial VBLOCK header.
For this reason there is no separate state in the FSM for updating
of the vblock2 field in PCVBLK.

When the transition from status=0b01 to status=0b10 occurs, actioning the
VL Block state *actually* and literally **must** be as if a SETVL instruction
had occurred. This can result in updating of the VL and MVL CSRs (and
the VL destination register target). Note, below, that this means that
a context-switch may save/restore VL and MVL (and the integer register file),
where the remaining tables have no such opportunity.

When status=0b10, and before status=0b11, there is no external indicator
as to how far the hardware has got in the process of reading the
Predicate, Register, and Swizzle Blocks. Implementations are free to use
any internal means to track progress, however given that if a trap occurs
the read process will need to be restarted (in simpler implementations),
there is no point having external indicators of progress. By complete
contrast, given that a SETVL actually writes to VL (and MVL), the VL
Block state *has* been actioned and thus would be successfully restored
by a context-switch.

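The four status values and the trap-resume rule for simpler implementations can be sketched as follows. The enum member names, the function `resume_after_trap` and its `tables_preserved` flag are all hypothetical, introduced only to illustrate that a restart drops status back to 0b10 while opptr is kept.

```python
from enum import IntEnum

class Status(IntEnum):
    SCALAR = 0b00       # not in VBLOCK mode
    HEADER_READ = 0b01  # prefix fields (and VBLOCK2, if present) captured
    VL_DONE = 0b10      # VL Block actioned; tables are being read
    TABLES_READ = 0b11  # Predicate and Register Blocks loaded

def resume_after_trap(status, opptr, tables_preserved):
    """Return the (status, opptr) pair to resume with after a trap."""
    if status == Status.TABLES_READ and not tables_preserved:
        # Simpler implementation: re-read the tables, but never lose opptr
        return Status.VL_DONE, opptr
    return status, opptr
```

An implementation that keeps its internal Vector Context across the trap would pass `tables_preserved=True` and continue directly from status=0b11.
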
When status=0b11, opptr may be written to using CSRRWI. Doing so will
cause execution to jump within the block, exactly as if PC had been set
in normal RISC-V execution. Writing a value outside of the range of the
instruction block will cause an illegal instruction exception. Writing
a value (any value) when status is not 0b11 likewise causes an illegal
instruction exception. To be clear: CSRRWI PCVBLK does **not** have the same
behaviour as CSRRW PCVBLK.

In privileged modes, obviously the above rules do not apply to the completely
separate (x)ePCVBLK CSRs because these are (inactive) *copies* of state,
not the actual active PCVBLK. When writing to PCVBLK itself during a trap,
however, the rules clearly must apply.

If PCVBLK is written to with CSRRW, the same rules apply, however the
entire register in rs1 is treated as the new opptr.

Note that the value returned in the register rd is the *full* PCVBLK,
not just the opptr part.

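The write rules above can be modelled with a short Python sketch. The names `write_opptr`, `IllegalInstruction` and the parameter `n_units` (the size of the opcode area in 16-bit units) are hypothetical. The value returned for rd is assumed to be the PCVBLK contents before the write, as with ordinary CSR read-write instructions.

```python
class IllegalInstruction(Exception):
    """Stands in for an illegal instruction exception."""

OPPTR_MASK = 0x1F    # opptr occupies bits 4:0
STATUS_SHIFT = 30    # status occupies bits 31:30

def write_opptr(pcvblk, new_opptr, n_units):
    """Model a CSRRWI (5-bit immediate) or CSRRW (whole rs1) write to PCVBLK.

    Returns (new_pcvblk, rd_value).
    """
    status = (pcvblk >> STATUS_SHIFT) & 0b11
    if status != 0b11:
        raise IllegalInstruction("PCVBLK written while status is not 0b11")
    if not 0 <= new_opptr < min(n_units, OPPTR_MASK + 1):
        raise IllegalInstruction("opptr outside the instruction block")
    new_pcvblk = (pcvblk & ~OPPTR_MASK) | new_opptr  # only opptr changes
    return new_pcvblk, pcvblk                        # rd gets the full PCVBLK
```

For CSRRW the whole of rs1 is passed as `new_opptr`, so any value with bits set above the opcode area is rejected by the same range check.
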
# Limitations on instructions

As the pcvblk CSR is relative to the beginning of the VBLOCK, branch
and jump opcodes MUST NOT be used to point to a location inside a block:
only at the beginning of an opcode (including another VBLOCK, including
the current one). However, setting the PCVBLK CSR is permitted, to
unconditionally jump to any opcode within a block.

Also: calling subroutines is likewise not permitted, because PCVBLK
context cannot be atomically reestablished on return from the function.

ECALL, on the other hand, which will cause a trap that saves and restores
the full state, is permitted.

Prohibited instructions will cause an illegal instruction trap. If at
that point, software is capable of then working out how to emulate a
branch or function call successfully, by manipulating (x)ePCVBLK and
other state, it is not prohibited from doing so.

To reiterate: a normal jump, normal conditional branch and a normal
function call may only be taken by letting the VBLOCK group finish,
returning to "normal" standard RV mode, and then using standard RVC,
32 bit or P48/64-\*-type opcodes.

The exception to this rule is if the branch or jump within the VBLOCK is
back to the start of the same VBLOCK. If this is used, the VBLOCK is,
clearly, to be re-executed, including any optional VL blocks and any
predication, register table context etc.

Given however that the tables are already established, it is only the
VL block that needs to be re-run. The other tables may be left as-is.

* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001824.html>
* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-June/001880.html>

* Is it necessary to stick to the RISC-V 1.5 format? Why not go with
  using the 15th bit to allow 80 + 16\*0bnnnn bits? Perhaps to be sane,
  limit to 256 bits (16 times 0-11).
* Could a "hint" be used to set which operations are parallel and which
  are sequential?
* Could a new sub-instruction opcode format be used, one that does not
  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
  There would then be no need for byte or bit-alignment.
* Could a hardware compression algorithm be deployed? Quite likely,
  because of the sub-execution context (sub-VBLOCK PC)