# Simple-V (Parallelism Extension Proposal) Vector Block Format

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 21 jun 2019

[[!toc ]]

# Vector Block Format <a name="vliw-format"></a>

This is a way to give Vector and Predication Context to a group of
standard scalar RISC-V instructions, in a highly compact form.

The format is:

* The standard RISC-V 80 to 192 bit encoding sequence, with bits
  defining the options to follow within the block
* An optional VL Block (16-bit)
* Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
* Optional register entries (8/16-bit blocks: see Register Table, above)
* Finally, some 16/32/48 bit standard RV or SVPrefix opcodes

Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
as follows:

| base+4 ... base+2          | base             | number of bits             |
| -------------------------- | ---------------- | -------------------------- |
| ..xxxx   xxxxxxxxxxxxxxxx  | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 |
| {ops}{Pred}{Reg}{VL Block} | SV Prefix        |                            |

A suitable prefix, which fits the Expanded Instruction-Length encoding
for "(80 + 16 times instruction-length)", as defined in Section 1.5
of the RISC-V ISA, is as follows:

| 15    | 14:12 | 11:10 | 9:8   | 7    | 6:0     |
| ----- | ----- | ----- | ----- | ---- | ------- |
| vlset | 16xil | pplen | rplen | mode | 1111111 |

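As an illustration only, the prefix fields above can be pulled out of the
first 16-bit parcel as in the following Python sketch. The names
`VBlockPrefix` and `decode_vblock_prefix` are hypothetical and not part of
the proposal, and `16xil` is read here as the 3-bit IL field (the nnn of
the Section 1.5 encoding).

```python
from dataclasses import dataclass


@dataclass
class VBlockPrefix:
    vlset: int       # bit 15: VL Block present
    il: int          # bits 14:12: instruction-length field (nnn)
    pplen: int       # bits 11:10: predicate entry count field
    rplen: int       # bits 9:8: register entry count field
    mode: int        # bit 7: 16 bit (1) or 8 bit (0) table entries
    total_bits: int  # 80 + 16 * il


def decode_vblock_prefix(halfword: int) -> VBlockPrefix:
    """Split a 16-bit VBLOCK prefix into its fields (illustrative only)."""
    if halfword & 0x7F != 0x7F:
        raise ValueError("low 7 bits are not 1111111")
    il = (halfword >> 12) & 0x7
    if il == 0b111:
        # the table above excludes nnn == 111
        raise ValueError("nnn == 111 is not an (80+16*nnn)-bit encoding")
    return VBlockPrefix(
        vlset=(halfword >> 15) & 0x1,
        il=il,
        pplen=(halfword >> 10) & 0x3,
        rplen=(halfword >> 8) & 0x3,
        mode=(halfword >> 7) & 0x1,
        total_bits=80 + 16 * il,
    )
```

For example, `decode_vblock_prefix(0xA6FF)` gives vlset=1, IL=2 (a 112-bit
block), pplen=1, rplen=2 and mode=1.
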
The VL/MAXVL/SubVL Block format:

| 31:30 | 29:28 | 27:22  | 21:17                         | 16  |
| ----- | ----- | ------ | ----------------------------- | --- |
| 0     | SubVL | VLdest | VLEN (5-bit)                  | vlt |
| 1     | SubVL | VLdest | VLEN (6-bit, occupying 21:16) |     |

Note: this format is very similar to that used in [[sv_prefix_proposal]].

If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e.
a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1,
VLEN specifies the scalar register from which VL is set by this VBLOCK
instruction group. VL, whether set from the register or the immediate,
is then modified (truncated) to be MIN(VL, MAXVL), and the result is stored
in the scalar register specified in VLdest. If VLdest is zero, no store
in the regfile occurs (however VL is still set).

This option will typically be used to start vectorised loops, where
the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
sequence (in compact form).

When bit 15 is set to 1, MAXVL and VL are both set to the immediate
VLEN (again, offset by one), which in this case is 6 bits in length
(the second row of the table above), and the same value is stored in
scalar register VLdest (if that register is nonzero). A value of
0b000000 will set MAXVL=VL=1, a value of 0b000001 will set MAXVL=VL=2
and so on.

This option will typically not be used so much for loops as it will be
for one-off instructions such as saving the entire register file to the
stack with a single one-off Vectorised and predicated LD/ST, or as a way
to save or restore registers in a function call with a single instruction.

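The two VL Block modes described above can be summarised in a small Python
model. This is a sketch under stated assumptions: the fields are taken as
already extracted (no bit positions are assumed), `mode` is the leading
field of the table (the case the text calls "bit 15" is taken to be
`mode == 1`), the register file is modelled as a plain list, and SubVL is
left out because its encoding is not given here. The name `apply_vl_block`
is hypothetical.

```python
def apply_vl_block(mode, vlen, vlt, vldest, regs, maxvl):
    """Return the new (VL, MAXVL) pair; regs is updated in place.

    mode   -- leading field of the VL Block (0 or 1)
    vlen   -- VLEN field: 5 bits when mode == 0, 6 bits when mode == 1
    vlt    -- only meaningful when mode == 0
    vldest -- scalar register number that receives VL (0 means no store)
    """
    if mode == 1:
        # 6-bit immediate, offset by one, sets both MAXVL and VL
        vl = maxvl = (vlen & 0x3F) + 1
    else:
        if vlt == 0:
            # 5-bit immediate, offset by one
            vl = (vlen & 0x1F) + 1
        else:
            # VLEN names the scalar register holding the requested VL
            vl = regs[vlen & 0x1F]
        # truncate to MIN(VL, MAXVL)
        vl = min(vl, maxvl)
    if vldest != 0:
        regs[vldest] = vl
    return vl, maxvl
```

With `regs[5] = 100` and MAXVL of 8, `apply_vl_block(0, 5, 1, 10, regs, 8)`
truncates VL to 8 and writes 8 into register 10, which is the shape of a
typical loop-start sequence.
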
CSRs needed:

* mepcvliw
* sepcvliw
* uepcvliw
* hepcvliw

Notes:

* Bit 7 specifies if the predicate block format is the full 16 bit format
  (1) or the compact less expressive 8 bit format (0). In the 8 bit format,
  pplen is multiplied by 2.
* 8 bit format predicate numbering is implicit and begins from x9. Thus
  it is critical to put blocks in the correct order as required.
* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
  (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
  of entries is needed the last may be set to 0x00, indicating "unused".
* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
  immediately follows the VBLOCK instruction Prefix.
* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 7 is 1,
  otherwise 0 to 6) follow the (optional) VL Block.
* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1,
  otherwise 0 to 6) follow the (optional) RegCam entries.
* Bits 14 to 12 (IL) define the actual length of the instruction: total
  number of bits is 80 + 16 times IL. Standard RV32, RVC and also
  SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
  (optional) VL / RegCam / PredCam entries.
* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
  format *MUST* have the RegCam and PredCam entries applied to the
  operation (and the Vectorisation loop activated).
* P48 and P64 opcodes do **not** take their Register or predication
  context from the VBLOCK tables: they do however have VL or SUBVL
  applied (unless VLtyp or svlen are set).
* At the end of the VBLOCK Group, the RegCam and PredCam entries
  *no longer apply*. VL, MAXVL and SUBVL on the other hand remain at
  the values set by the last instruction (whether a CSRRW or the VL
  Block header).
* Although an inefficient use of resources, it is fine to set the MAXVL,
  VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.

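To make the space budget in these notes concrete, the following Python
sketch works out how many table entries a given prefix implies and how many
bits remain for opcodes. It assumes that the entry-count fields double in
the 8 bit format (bit 7 = 0), as the notes state for pplen and rplen, and
that the 80 + 16 times IL total includes the 16-bit prefix itself. The
function name `vblock_layout` is hypothetical, and treating a negative
remainder as an error is an assumption rather than something the text
specifies.

```python
def vblock_layout(vlset, il, pplen, rplen, mode):
    """Return entry counts and the bits left over for opcodes."""
    total_bits = 80 + 16 * il
    entry_bits = 16 if mode == 1 else 8
    scale = 1 if mode == 1 else 2        # 8 bit format: counts are doubled
    reg_entries = rplen * scale
    pred_entries = pplen * scale
    header_bits = 16 + (16 if vlset else 0)   # prefix + optional VL Block
    table_bits = (reg_entries + pred_entries) * entry_bits
    ops_bits = total_bits - header_bits - table_bits
    if ops_bits < 0:
        # assumption: such a prefix is simply treated as malformed here
        raise ValueError("tables do not fit in the declared block length")
    return {
        "total_bits": total_bits,
        "reg_entries": reg_entries,
        "pred_entries": pred_entries,
        "ops_bits": ops_bits,
    }
```

With vlset=1, IL=2, pplen=1 and rplen=2, both formats leave 32 bits for
opcodes: the 16 bit format holds 2 register and 1 predicate entries, the
8 bit format holds 4 and 2 in the same space.
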
All this would greatly reduce the amount of space utilised by Vectorised
instructions, given that a 64-bit CSRRW requires 3, even 4 32-bit opcodes:
the CSR instruction itself, a LI, and the setting up of the value in the RS
register of the CSR instruction, which, again, requires a LI / LUI to get
the 32 bit data into the CSR. To get 64-bit data into the register in order
to put it into the CSR(s), LOAD operations from memory are needed!

Given that each 64-bit CSR can hold only 4 PredCam entries (or 4 RegCam
entries), that's potentially six to eight 32-bit instructions, just to
establish the Vector State!

Not only that: even CSRRW on VL and MAXVL requires 64 bits (even more
bits if VL needs to be set to greater than 32). Bear in mind that in SV,
both MAXVL and VL need to be set.

By contrast, the VBLOCK prefix is only 16 bits, the VL/MAXVL/SubVL block is
only 16 bits, and as long as not too many predicates and register vector
qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
the format. If the full flexibility of the 16 bit block formats is not
needed, more space is saved by using the 8 bit formats.

In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
into a VBLOCK format makes a lot of sense.

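The comparison above can be put into rough numbers. The sketch below only
restates the figures quoted in the text (3 to 4 32-bit opcodes per 64-bit
CSR write, a 16-bit prefix, a 16-bit VL Block, 8 or 16 bit entries). Reading
"six to eight" as two CSR writes (one PredCam CSR and one RegCam CSR of 4
entries each) is an assumption, and the function names are hypothetical.

```python
OPCODE_BITS = 32


def csr_setup_bits(n_csr_writes, opcodes_per_write):
    """Bits of code needed to fill CSRs using standard 32-bit opcodes."""
    return n_csr_writes * opcodes_per_write * OPCODE_BITS


def vblock_overhead_bits(n_reg, n_pred, entry_bits, with_vl_block=True):
    """Bits of VBLOCK header: prefix, optional VL Block and table entries."""
    prefix_bits = 16
    vl_block_bits = 16 if with_vl_block else 0
    return prefix_bits + vl_block_bits + (n_reg + n_pred) * entry_bits
```

Under these assumptions, `csr_setup_bits(2, 3)` and `csr_setup_bits(2, 4)`
give 192 and 256 bits for the two tables alone (before the further 64 bits
quoted for VL and MAXVL), whereas `vblock_overhead_bits(4, 4, 8)` gives
96 bits for the same eight entries in the 8 bit format, VL Block included.
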
Bear in mind the warning in an earlier section that use of VLtyp or svlen
in a P48 or P64 opcode within a VBLOCK Group will result in corruption
(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
avoid this situation, the STATE CSR may be copied into a temp register
and restored afterwards.

Open Questions:

* Is it necessary to stick to the RISC-V ISA Section 1.5 format? Why not go
  with using the 15th bit to allow 80 + 16\*0bnnnn bits? Perhaps to be sane,
  limit to 256 bits (16 times 0-11).
* Could a "hint" be used to set which operations are parallel and which
  are sequential?
* Could a new sub-instruction opcode format be used, one that does not
  conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
  There would be no need for byte or bit-alignment.
* Could a hardware compression algorithm be deployed? Quite likely,
  because of the sub-execution context (sub-VBLOCK PC).

## Limitations on instructions

To greatly simplify implementations, it is required to treat the VBLOCK
group as a separate sub-program with its own separate PC. The sub-pc
advances separately whilst the main PC remains pointing at the beginning
of the VBLOCK instruction (not to be confused with how VL works, which
is exactly the same principle, except it is VStart in the STATE CSR
that increments).

This has implications, namely that a new set of CSRs identical to xepc
(mepc, sepc, hepc and uepc) must be created and managed and respected
as being a sub extension of the xepc set of CSRs. Thus, xepcvliw CSRs
must be context switched and saved / restored in traps.

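The relationship between the main PC, the sub-pc and the two sets of
exception CSRs can be pictured with a toy Python model. This is only a
sketch: the class and function names are hypothetical, a single privilege
level stands in for the m/s/h/u variants, and counting the sub-pc as a byte
offset from the start of the VBLOCK is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HartState:
    pc: int = 0        # main PC: stays at the start of the VBLOCK
    sub_pc: int = 0    # offset of the next opcode within the VBLOCK
    epc: int = 0       # stands in for xepc
    epcvliw: int = 0   # stands in for xepcvliw


def step_in_vblock(state, opcode_bytes):
    """Executing an opcode inside the block advances only the sub-pc."""
    state.sub_pc += opcode_bytes


def take_trap(state, handler_pc):
    """Save both levels of program counter, then enter the handler."""
    state.epc, state.epcvliw = state.pc, state.sub_pc
    state.pc, state.sub_pc = handler_pc, 0


def return_from_trap(state):
    """Restore both levels, resuming mid-block where execution stopped."""
    state.pc, state.sub_pc = state.epc, state.epcvliw
```

The point of the model is that restoring xepc alone would restart the block
from its first opcode; xepcvliw is what allows execution to resume part-way
through.
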
The srcoffs and destoffs indices in the STATE CSR may be similarly
regarded as another sub-execution context, giving in effect two sets of
nested sub-levels of the RISC-V Program Counter (actually, three including
SUBVL and ssvoffs).

In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK,
branches MUST be restricted to within (relative to) the block,
i.e. addressing is now restricted to the start and (very short) length
of the block.

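One way to picture the branch restriction is as a range check on the
sub-pc. The Python sketch below is illustrative only: the function name,
the use of byte offsets, the `ops_start` parameter (the offset of the first
opcode after the prefix and any VL / RegCam / PredCam entries) and the
16-bit alignment check are all assumptions, not requirements stated in the
text.

```python
def branch_target_in_block(sub_pc, offset, ops_start, block_bytes):
    """Return True if a block-relative branch lands on an opcode parcel.

    sub_pc      -- byte offset of the branch within the VBLOCK
    offset      -- signed branch displacement in bytes
    ops_start   -- byte offset of the first opcode in the VBLOCK
    block_bytes -- total VBLOCK length in bytes, i.e. (80 + 16 * IL) // 8
    """
    target = sub_pc + offset
    in_range = ops_start <= target < block_bytes
    aligned = target % 2 == 0   # assumed 16-bit parcel alignment
    return in_range and aligned
```

For a 112-bit block (14 bytes) whose opcodes start at byte 4, a branch at
sub-pc 8 with displacement -4 stays inside the block, while a displacement
of +8 would leave it and so could not be expressed.
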
Also: calling subroutines is inadvisable, unless they can be entirely
accomplished within a block.

A normal jump, normal branch and a normal function call may only be taken
by letting the VBLOCK group end, returning to "normal" standard RV mode,
and then using standard RVC, 32 bit or P48/64-\*-type opcodes.

## Links

* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>