# Simple-V (Parallelism Extension Proposal) Vector Block Format

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 28 jun 2019

[[!toc ]]

# Vector Block Format <a name="vliw-format"></a>

This is a way to give Vector and Predication Context to a group of
standard scalar RISC-V instructions, in a highly compact form.

The format is:

* The standard RISC-V 80 to 192 bit encoding sequence, with bits
  defining the options to follow within the block
* An optional VL Block (16-bit)
* Optional predicate entries (8/16-bit blocks: see Predicate Table, below)
* Optional register entries (8/16-bit blocks: see Register Table, below)
* Finally, some 16/32/48 bit standard RV or SVPrefix opcodes follow.

Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
as follows:

[[!inline raw="yes" pages="simple_v_extension/vblock_format_table" ]]

Note: this format is very similar to that used in [[sv_prefix_proposal]]

If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e.
a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1,
VLEN instead specifies the scalar register from which VL is set by this
VBLOCK instruction group. VL, whether set from the register or the immediate,
is then modified (truncated) to be MIN(VL, MAXVL), and the result stored
in the scalar register specified in VLdest. If VLdest is zero, no store
in the regfile occurs (however VL is still set).

This option will typically be used to start vectorised loops, where
the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
sequence (in compact form).

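The rule for setting VL from the VL Block can be sketched in a few lines of Python. This is only an illustrative model, not part of the specification: the function name `apply_vl_block`, the list `regs` standing in for the scalar register file, and the argument names are all assumptions made for the example. It shows the offset-by-one immediate, the register-sourced alternative, the truncation to MAXVL and the conditional store into VLdest.

```python
def apply_vl_block(vlt, vlen_field, vldest, maxvl, regs):
    """Hypothetical model of the vlt / VLEN / VLdest rule.

    regs is a plain list standing in for the scalar register file.
    Returns the new VL.
    """
    if vlt == 0:
        # VLEN is a 5 bit immediate, offset by one: 0b00000 means VL=1
        requested = (vlen_field & 0x1F) + 1
    else:
        # VLEN names the scalar register that VL is taken from
        requested = regs[vlen_field & 0x1F]
    # VL is truncated to MIN(VL, MAXVL)
    vl = min(requested, maxvl)
    # A VLdest of zero means "no store in the regfile"; VL is still set
    if vldest != 0:
        regs[vldest] = vl
    return vl
```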
When bit 15 is set to 1, MAXVL and VL are both set to the immediate,
VLEN (again, offset by one), which is 6 bits in length, and the same
value is stored in scalar register VLdest (if that register is nonzero).
A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will
set MAXVL=VL=2 and so on.

This option will typically not be used so much for loops as it will be
for one-off instructions such as saving the entire register file to the
stack with a single one-off Vectorised and predicated LD/ST, or as a way
to save or restore registers in a function call with a single instruction.

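The second form, where MAXVL and VL are set together from a 6 bit immediate, is simpler still. The sketch below is a hypothetical model under the same assumptions as any software emulation of this text would need: `set_maxvl_and_vl` and `regs` are names invented for the example, and only the behaviour described in the paragraphs above is modelled.

```python
def set_maxvl_and_vl(vlen_field, vldest, regs):
    """Hypothetical model: MAXVL and VL both set from a 6 bit immediate.

    Returns the pair (MAXVL, VL).
    """
    # 6 bit immediate, offset by one: 0b000000 means MAXVL=VL=1
    value = (vlen_field & 0x3F) + 1
    # The same value is stored in VLdest, unless VLdest is zero
    if vldest != 0:
        regs[vldest] = value
    return value, value
```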
Notes:

* Bit 7 specifies if the predicate block format is the full 16 bit format
  (1) or the compact less expressive 8 bit format (0). In the 8 bit format,
  pplen is multiplied by 2.
* 8 bit format predicate numbering is implicit and begins from x9. Thus
  it is critical to put blocks in the correct order as required.
* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
  (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
  of entries is needed the last may be set to 0x00, indicating "unused".
* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
  immediately follows the VBLOCK instruction Prefix.
* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 7 is 1,
  otherwise 0 to 6) follow the (optional) VL Block.
* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1,
  otherwise 0 to 6) follow the (optional) RegCam entries.
* Bits 14 to 12 (IL) define the actual length of the instruction: total
  number of bits is 80 + 16 times IL. Standard RV32, RVC and also
  SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
  (optional) VL / RegCam / PredCam entries.
* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
  format *MUST* have the RegCam and PredCam entries applied to the
  operation (and the Vectorisation loop activated).
* P48 and P64 opcodes do **not** take their Register or predication
  context from the VBLOCK tables: they do however have VL or SUBVL
  applied (unless VLtyp or svlen are set).
* At the end of the VBLOCK Group, the RegCam and PredCam entries
  *no longer apply*. VL, MAXVL and SUBVL on the other hand remain at
  the values set by the last instruction (whether a CSRRW or the VL
  Block header).
* Although an inefficient use of resources, it is fine to set the MAXVL,
  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.

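The bit assignments in the notes above can be collected into a small decoder. The following is a sketch, not a reference implementation: `VBlockPrefix` and `decode_vblock_prefix` are hypothetical names, and the check on the low 7 bits assumes that the VBLOCK prefix uses the base RISC-V length encoding for 80 bit and longer instructions (bits 6:0 all set, length field in bits 14:12). Entry counts follow the statement that pplen and rplen are multiplied by 2 in the 8 bit format. The order of the tables is not modelled, only their total size.

```python
from typing import NamedTuple


class VBlockPrefix(NamedTuple):
    wide_tables: bool   # bit 7: True = 16 bit entries, False = 8 bit entries
    rplen: int          # bits 9:8, as encoded
    pplen: int          # bits 11:10, as encoded
    il: int             # bits 14:12
    vl_block: bool      # bit 15: a 16 bit VL Block follows the prefix
    reg_entries: int    # number of RegCam entries present
    pred_entries: int   # number of PredCam entries present
    total_bits: int     # 80 + 16 * IL
    opcode_bits: int    # bits left over for RVC / RV32 / SVPrefix opcodes


def decode_vblock_prefix(prefix: int) -> VBlockPrefix:
    """Hypothetical decoder for the 16 bit VBLOCK prefix."""
    if prefix & 0x7F != 0x7F:
        raise ValueError("low 7 bits are not the 80+ bit length encoding")
    wide = bool((prefix >> 7) & 1)
    rplen = (prefix >> 8) & 0b11
    pplen = (prefix >> 10) & 0b11
    il = (prefix >> 12) & 0b111
    vl_block = bool((prefix >> 15) & 1)
    # In the 8 bit format the encoded lengths are multiplied by 2
    scale, entry_bits = (1, 16) if wide else (2, 8)
    reg_entries = rplen * scale
    pred_entries = pplen * scale
    total_bits = 80 + 16 * il
    header_bits = (16                                   # the prefix itself
                   + (16 if vl_block else 0)            # optional VL Block
                   + (reg_entries + pred_entries) * entry_bits)
    return VBlockPrefix(wide, rplen, pplen, il, vl_block,
                        reg_entries, pred_entries,
                        total_bits, total_bits - header_bits)
```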
All this would greatly reduce the amount of space utilised by Vectorised
instructions, given that a 64-bit CSRRW requires 3, or even 4, 32-bit opcodes:
the CSR instruction itself, a LI, and the setting up of the value into the RS
register of the CSR, which, again, requires a LI / LUI to get the 32
bit data into the CSR. To get 64-bit data into the register in order
to put it into the CSR(s), LOAD operations from memory are needed!

Given that each 64-bit CSR can hold only 4 PredCam entries (or 4 RegCam
entries), that's potentially 6 to 8 32-bit instructions, just to
establish the Vector State!

Not only that: even CSRRW on VL and MAXVL requires 64 bits (even more
bits if VL needs to be set to greater than 32). Bear in mind that in SV,
both MAXVL and VL need to be set.

By contrast, the VBLOCK prefix is only 16 bits, the VL/MAXVL/SUBVL block is
only 16 bits, and as long as not too many predicates and register vector
qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
the format. If the full flexibility of the 16 bit block formats is not
needed, more space is saved by using the 8 bit formats.

In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
into a VBLOCK format makes a lot of sense.

Bear in mind the warning in an earlier section that use of VLtyp or svlen
in a P48 or P64 opcode within a VBLOCK Group will result in corruption
(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
avoid this situation, the STATE CSR may be copied into a temp register
and restored afterwards.

# Register Table Format

The register table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/reg_table_format" ]]

# Predicate Table Format

The predicate table format is covered in the main [[specification]],
included here for convenience:

[[!inline raw="yes" pages="simple_v_extension/pred_table_format" ]]

# CSRs:

The CSRs needed, in addition to those from the main [[specification]] are:

* pcvblk
* mepcvblk
* sepcvblk
* uepcvblk
* hepcvblk

To greatly simplify implementations, which would otherwise require a
way to track (cache) VBLOCK instructions, it is required to treat the
VBLOCK group as a separate sub-program with its own separate PC. The
sub-PC advances separately whilst the main PC remains "frozen", pointing
at the beginning of the VBLOCK instruction (not to be confused with how
VL works, which is exactly the same principle, except it is VStart in
the STATE CSR that increments).

This has implications, namely that a new set of CSRs identical to (x)epc
(mepc, sepc, hepc and uepc) must be created and managed and respected
as being a sub extension of the (x)epc set of CSRs. Thus, (x)epcvblk CSRs
must be context switched and saved / restored in traps.

The srcoffs and destoffs indices in the STATE CSR may be similarly
regarded as another sub-execution context, giving in effect two sets of
nested sub-levels of the RISC-V Program Counter (actually, three including
SUBVL and ssvoffs).

# PCVBLK CSR Format

Using PCVBLK to store the progression of decoding and subsequent execution of opcodes in a VBLOCK allows a simple single issue design to only need to fetch 32 or 64 bits from the instruction cache on any given clock cycle.

*(This approach also alleviates one of the main concerns with the VBLOCK Format: unlike a VLIW engine, an FSM no longer requires full buffering of the entire VBLOCK opcode in order to begin execution. Future versions may therefore potentially lift the 192 bit limit).*

To support this option (where more complex implementations may skip some of these phases), PCVBLK contains partial decode state, which allows a trap to occur even part-way through decode, in order to reduce latency.

The format is as follows:

| 31:30  | 29    | 28:26 | 25:24 | 23:22 | 21   | 20:5  | 4:0   |
|--------|-------|-------|-------|-------|------|-------|-------|
| status | vlset | 16xil | pplen | rplen | mode | vlblk | opptr |
| 2      | 1     | 3     | 2     | 2     | 1    | 16    | 5     |

* status is the key field that effectively exposes the inner FSM (Finite State Machine) directly.
* status = 0b00 indicates that the processor is not in "VBLOCK Mode". It is instead in standard RV Scalar opcode execution mode. The processor will leave this mode only after it encounters the beginning of a valid VBLOCK opcode.
* status = 0b01 indicates that vlset, 16xil, pplen, rplen and mode have all been copied directly from the VBLOCK so that they do not need to be read again from the instruction stream.
* status = 0b10 indicates that the VL Block has been read from the instruction stream and decoded (and copied into vlblk).
* status = 0b11 indicates that the Predicate and Register Blocks have been read from the instruction stream (and put into internal Vector Context). Simpler implementations are permitted to reset status back to 0b10 and re-read the data after return from a trap that happened to occur in the middle of a VBLOCK. They are not however permitted to destroy opptr in the process, and after re-reading the Predicate and Register Blocks must resume execution at the opcode pointed to by opptr.
* opptr is an offset, in multiples of 16 bits (2 bytes), from where the instructions begin in the VBLOCK: 0 indicates the start of the opcodes. This is the equivalent of a Program Counter, for VBLOCKs.
* At the end of a VBLOCK, when the last instruction executes (assuming it does not change opptr to earlier in the block), status is reset to 0b00 to indicate exit from the VBLOCK FSM, and the current Vector Predicate and Register Context is destroyed (Note: the STATE CSR is **not** altered purely by exit from a VBLOCK Context).

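The field layout in the table above lends itself to a table-driven pack and unpack routine. The sketch below is illustrative only: `PCVBLK_FIELDS`, `unpack_pcvblk` and `pack_pcvblk` are names invented for the example, and the shifts and widths are taken directly from the bit ranges in the table (31:30, 29, 28:26, 25:24, 23:22, 21, 20:5 and 4:0).

```python
# (name, lowest bit position, width in bits), taken from the table above
PCVBLK_FIELDS = (
    ("status", 30, 2),
    ("vlset", 29, 1),
    ("16xil", 26, 3),
    ("pplen", 24, 2),
    ("rplen", 22, 2),
    ("mode", 21, 1),
    ("vlblk", 5, 16),
    ("opptr", 0, 5),
)


def unpack_pcvblk(value):
    """Split a 32 bit PCVBLK value into a dict of named fields."""
    return {name: (value >> shift) & ((1 << width) - 1)
            for name, shift, width in PCVBLK_FIELDS}


def pack_pcvblk(fields):
    """Build a 32 bit PCVBLK value; fields that are missing default to 0."""
    value = 0
    for name, shift, width in PCVBLK_FIELDS:
        field = fields.get(name, 0)
        if field >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        value |= field << shift
    return value
```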
When status=0b11, opptr may be written to using CSRRWI. Doing so will cause execution to jump within the block, exactly as if PC had been set in normal RISC-V execution. Writing a value outside of the range of the instruction block will cause an illegal instruction exception. Writing a value (any value) when status is not 0b11 likewise causes an illegal instruction exception.

In privileged modes, the above rules obviously do not apply to the completely separate (x)ePCVBLK CSRs, because these are copies of state, not the actual active PCVBLK. When PCVBLK itself is written to during a trap, however, the rules clearly must apply.

If PCVBLK is written to with CSRRW, the same rules apply; however, the entire register in rs1 is treated as the new opptr.

Note that the value returned in the register rd is the *full* PCVBLK, not just the opptr part.

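The rules for writing opptr can be modelled as a single checked update. This is a sketch under stated assumptions rather than a definitive implementation: `IllegalInstruction`, `write_opptr` and the parameter `num_slots` (the number of 16 bit slots occupied by opcodes in the current block) are hypothetical, since the text does not say how an implementation tracks the size of the instruction block. The function returns the old full PCVBLK, as would be placed in rd, together with the updated value.

```python
class IllegalInstruction(Exception):
    """Stands in for an illegal instruction exception."""


def write_opptr(pcvblk, new_opptr, num_slots):
    """Hypothetical model of a CSR write to the opptr field of PCVBLK.

    num_slots is an assumed parameter: the number of 16 bit opcode
    slots in the current VBLOCK. Returns (old PCVBLK, new PCVBLK).
    """
    status = (pcvblk >> 30) & 0b11
    if status != 0b11:
        # Any write is illegal unless the block is fully decoded
        raise IllegalInstruction("status is not 0b11")
    if not 0 <= new_opptr < num_slots:
        # Writes outside the range of the instruction block are illegal
        raise IllegalInstruction("opptr outside the instruction block")
    # Only the low 5 bits (opptr) change; the rest of PCVBLK is kept
    new_pcvblk = (pcvblk & ~0x1F) | new_opptr
    return pcvblk, new_pcvblk
```

Because the whole rs1 value is treated as the new opptr for CSRRW, the same range check also rejects any value that does not fit in the 5 bit field, provided `num_slots` is at most 32.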
# Limitations on instructions

As the pcvblk CSR is relative to the beginning of the VBLOCK, branch and jump opcodes MUST NOT be used to point to a location inside a block: only at the beginning of an opcode (including another VBLOCK, including the current one). However, setting the PCVBLK CSR is permitted, to unconditionally jump to any opcode within a block.

Also: calling subroutines is likewise not permitted, because PCVBLK context cannot be atomically reestablished on return from the function.

ECALL, on the other hand, which will cause a trap that saves and restores the full state, is permitted.

Prohibited instructions will cause an illegal instruction trap. If, at that point, software is capable of working out how to emulate a branch or function call successfully, by manipulating (x)ePCVBLK and other state, it is not prohibited from doing so.

To reiterate: a normal jump, normal conditional branch and a normal function call may only be taken
by letting the VBLOCK group finish, returning to "normal" standard RV mode,
and then using standard RVC, 32 bit or P48/64-\*-type opcodes.

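As a rough illustration of the restriction, a decoder could flag the relevant 32 bit opcodes while in VBLOCK Mode. The sketch below is an assumption-laden example and not part of the proposal: `prohibited_in_vblock` is a hypothetical helper, it follows the reading in the "To reiterate" paragraph above (jumps, conditional branches and calls are not taken inside a block), and it only looks at the standard RV32 major opcodes BRANCH, JALR and JAL. Compressed (RVC) branches and jumps, and P48/P64 forms, are deliberately left out.

```python
# Standard RISC-V major opcodes (bits 6:0 of a 32 bit instruction)
BRANCH = 0x63
JALR = 0x67
JAL = 0x6F


def prohibited_in_vblock(insn32):
    """Hypothetical check: True if this 32 bit opcode is a jump, call
    or conditional branch, which would trap inside a VBLOCK.

    ECALL (major opcode SYSTEM) is not flagged, as it is permitted.
    """
    return (insn32 & 0x7F) in (BRANCH, JALR, JAL)
```

For example, `prohibited_in_vblock(0x00008067)` (the usual encoding of `ret`, i.e. `jalr x0, 0(x1)`) is flagged, whereas `prohibited_in_vblock(0x00000073)` (`ecall`) is not.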
# Links

* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>

# Open Questions:

* Is it necessary to stick to the RISC-V Section 1.5 format? Why not go with
  using the 15th bit to allow 80 + 16\*0bnnnn bits? Perhaps to be sane,
  limit to 256 bits (16 times 0-11).
* Could a "hint" be used to set which operations are parallel and which
  are sequential?
* Could a new sub-instruction opcode format be used, one that does not
  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
  There would then be no need for byte or bit-alignment.
* Could a hardware compression algorithm be deployed? Quite likely,
  because of the sub-execution context (sub-VBLOCK PC).
