# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
2
3 Key insight: Simple-V is intended as an abstraction layer to provide
4 a consistent "API" to parallelisation of existing *and future* operations.
5 *Actual* internal hardware-level parallelism is *not* required, such
6 that Simple-V may be viewed as providing a "compact" or "consolidated"
7 means of issuing multiple near-identical arithmetic instructions to an
8 instruction queue (FIFO), pending execution.
9
10 *Actual* parallelism, if added independently of Simple-V in the form
11 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
12 implementations, or SIMD, or anything else, would then benefit from
13 the uniformity of a consistent API.
14
**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
16
17 Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
18
19 [[!toc ]]
20
# Introduction

# CSRs <a name="csrs"></a>
24
25 There are two CSR tables needed to create lookup tables which are used at
26 the register decode phase.
27
## MAXVECTORLENGTH
29
30 MAXVECTORLENGTH is the same concept as MVL in RVV. However in Simple-V,
31 given that its primary (base, unextended) purpose is for 3D, Video and
32 other purposes (not requiring supercomputing capability), it makes sense
33 to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
34 and so on).
35
36 The reason for setting this limit is so that predication registers, when
37 marked as such, may fit into a single register as opposed to fanning out
38 over several registers. This keeps the implementation a little simpler.
39 Note also (as also described in the VSETVL section) that the *minimum*
40 for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
41 and 31 for RV32 or RV64).
42
43 Note that RVV on top of Simple-V may choose to over-ride this decision.
44
## Predication CSR <a name="predication_csr_table"></a>
63
64 The Predication CSR is a key-value store indicating whether, if a given
65 destination register (integer or floating-point) is referred to in an
66 instruction, it is to be predicated. However it is important to note
67 that the *actual* register is *different* from the one that ends up
68 being used, due to the level of indirection through the lookup table.
69
70 * regidx is the actual register that in combination with the
71 i/f flag, if that integer or floating-point register is referred to,
72 results in the lookup table being referenced to find the predication
73 mask to use on the operation in which that (regidx) register has
74 been used
75 * predidx (in combination with the bank bit in the future) is the
76 *actual* register to be used for the predication mask. Note:
77 in effect predidx is actually a 6-bit register address, as the bank
78 bit is the MSB (and is nominally set to zero for now).
79 * inv indicates that the predication mask bits are to be inverted
80 prior to use *without* actually modifying the contents of the
81 register itself.
82 * zeroing is either 1 or 0, and if set to 1, the operation must
83 place zeros in any element position where the predication mask is
84 set to zero. If zeroing is set to 0, unpredicated elements *must*
85 be left alone. Some microarchitectures may choose to interpret
86 this as skipping the operation entirely. Others which wish to
87 stick more closely to a SIMD architecture may choose instead to
88 interpret unpredicated elements as an internal "copy element"
89 operation (which would be necessary in SIMD microarchitectures
90 that perform register-renaming)
91
92 | PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
93 | ----- | - | - | - | - | ------- | ------- |
94 | 0 | bank0 | zero0 | inv0 | i/f | regidx | predkey |
95 | 1 | bank1 | zero1 | inv1 | i/f | regidx | predkey |
96 | .. | bank.. | zero.. | inv.. | i/f | regidx | predkey |
97 | 15 | bank15 | zero15 | inv15 | i/f | regidx | predkey |
98
99 The Predication CSR Table is a key-value store, so implementation-wise
100 it will be faster to turn the table around (maintain topologically
101 equivalent state):
102
    struct pred {
        bool zero;
        bool inv;
        bool bank;    // 0 for now, 1=rsvd
        bool enabled;
        int predidx;  // redirection: actual int register to use
    };

    struct pred fp_pred_reg[32];  // 64 in future (bank=1)
    struct pred int_pred_reg[32]; // 64 in future (bank=1)

    for (i = 0; i < 16; i++)
        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
        idx = CSRpred[i].regidx
        tb[idx].zero = CSRpred[i].zero
        tb[idx].inv = CSRpred[i].inv
        tb[idx].bank = CSRpred[i].bank
        tb[idx].predidx = CSRpred[i].predidx
        tb[idx].enabled = true
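
The expansion of the key-value entries into two directly indexed tables can be sketched in Python as below. This is only an illustration under stated assumptions: the names `PredEntry`, `unpack_pred_csr` and `build_pred_tables` are hypothetical, the bit positions are read off the PrCSR table, bits 4..0 are assumed to hold predidx, and bit 10 is assumed to mean 1 = floating-point.

```python
from dataclasses import dataclass


@dataclass
class PredEntry:
    zero: bool = False     # zeroing mode
    inv: bool = False      # invert the mask before use
    bank: bool = False     # 0 for now, 1 = reserved
    enabled: bool = False  # True once a CSR entry targets this register
    predidx: int = 0       # integer register that holds the mask


def unpack_pred_csr(value):
    """Split one PrCSR entry into named fields (layout assumed from the table)."""
    return {
        "predidx": value & 0x1F,           # bits 4..0 (assumed to be predidx)
        "regidx": (value >> 5) & 0x1F,     # bits 9..5
        "is_fp": bool((value >> 10) & 1),  # bit 10, assuming 1 = floating-point
        "inv": bool((value >> 11) & 1),    # bit 11
        "zero": bool((value >> 12) & 1),   # bit 12
        "bank": bool((value >> 13) & 1),   # bit 13
    }


def build_pred_tables(csr_values):
    """Turn the key-value CSR entries into two 32-entry lookup tables."""
    int_pred_reg = [PredEntry() for _ in range(32)]
    fp_pred_reg = [PredEntry() for _ in range(32)]
    for value in csr_values:
        f = unpack_pred_csr(value)
        table = fp_pred_reg if f["is_fp"] else int_pred_reg
        table[f["regidx"]] = PredEntry(
            zero=f["zero"], inv=f["inv"], bank=f["bank"],
            enabled=True, predidx=f["predidx"])
    return int_pred_reg, fp_pred_reg
```

With the tables built once at CSR-write time, the decode phase needs only a single indexed read per register rather than a search through up to 16 entries.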
122
123 So when an operation is to be predicated, it is the internal state that
124 is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
125 pseudo-code for operations is given, where p is the explicit (direct)
126 reference to the predication register to be used:
127
128 for (int i=0; i<vl; ++i)
129 if ([!]preg[p][i])
130 (d ? vreg[rd][i] : sreg[rd]) =
131 iop(s1 ? vreg[rs1][i] : sreg[rs1],
132 s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
133
This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows.
137
    if type(iop) == INT:
        preg = int_pred_reg[rd]
    else:
        preg = fp_pred_reg[rd]

    for (int i=0; i<vl; ++i)
        predidx = preg.predidx;  // the indirection takes place HERE
        if (!preg.enabled)
            predicate = ~0x0;    // all parallel ops enabled
        else:
            predicate = intregfile[predidx]; // get actual reg contents HERE
            if (preg.inv)        // invert if requested
                predicate = ~predicate;
        if (predicate & (1<<i))  // bitwise test of element i's mask bit
            (d ? regfile[rd+i] : regfile[rd]) =
                iop(s1 ? regfile[rs1+i] : regfile[rs1],
                    s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (preg.zero)
            // TODO: place zero in dest reg
157
158 Note:
159
* d, s1 and s2 are booleans indicating whether destination,
  source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity. rd, rs1 and rs2 must ALSO go through
  register-level redirection (from the Register CSR table) if they are
  vectors.
166
167 If written as a function, obtaining the predication mask (but not whether
168 zeroing takes place) may be done as follows:
169
    def get_pred_val(bool is_fp_op, int reg):
        tb = fp_pred_reg if is_fp_op else int_pred_reg
        if (!tb[reg].enabled):
            return ~0x0              // all ops enabled
        predidx = tb[reg].predidx    // redirection occurs HERE
        predicate = intreg[predidx]  // actual predicate HERE
        if (tb[reg].inv):
            predicate = ~predicate   // invert ALL bits
        return predicate
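
As a rough, runnable illustration of the mask lookup and its use in a predicated operation, the Python sketch below models the register file as a list and the internal predication table as a dictionary keyed by destination register. The names `predicated_add`, `pred_table` and the fixed `XLEN` of 64 are assumptions made for the example, and register redirection is left out, as in the pseudo-code above.

```python
XLEN = 64                # assumed register width for this sketch
MASK = (1 << XLEN) - 1


def get_pred_val(pred_table, intregs, reg):
    """Return the predicate mask for `reg`; all ones if it is not predicated."""
    entry = pred_table.get(reg)
    if entry is None:                       # no entry: all elements enabled
        return MASK
    predicate = intregs[entry["predidx"]]   # redirection to the mask register
    if entry["inv"]:
        predicate = ~predicate & MASK       # invert without touching the register
    return predicate


def predicated_add(regfile, pred_table, rd, rs1, rs2, vl,
                   d=True, s1=True, s2=True):
    """Element-wise add; d, s1, s2 say whether each operand is a vector."""
    predicate = get_pred_val(pred_table, regfile, rd)
    zeroing = pred_table.get(rd, {}).get("zero", False)
    for i in range(vl):
        dest = rd + i if d else rd
        if predicate & (1 << i):
            a = regfile[rs1 + i] if s1 else regfile[rs1]
            b = regfile[rs2 + i] if s2 else regfile[rs2]
            regfile[dest] = (a + b) & MASK
        elif zeroing:
            regfile[dest] = 0               # zeroing mode: masked-out element cleared
        # otherwise the masked-out element is left alone
```

The zeroing branch here fills in the step that the pseudo-code marks as TODO; treat it as one possible reading of the zeroing rule rather than a settled definition.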
179
## Register CSR key-value (CAM) table
181
182 The purpose of the Register CSR table is four-fold:
183
* To mark integer and floating-point registers as requiring "redirection"
  if they are ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table
  (where the 6th bit - bank - is always set to 0 for now).
188 * To indicate whether, after redirection through the lookup table, the
189 register is a vector (or remains a scalar).
190 * To over-ride the implicit or explicit bitwidth that the operation would
191 normally give the register.
192 * To indicate if the register is to be interpreted as "packed" (SIMD)
193 i.e. containing multiple contiguous elements of size equal to "bitwidth".
194
195 | RgCSR | 15 | 14 | 13 | (12..11) | 10 | (9..5) | (4..0) |
196 | ----- | - | - | - | - | - | ------- | ------- |
197 | 0 | simd0 | bank0 | isvec0 | vew0 | i/f | regidx | predidx |
198 | 1 | simd1 | bank1 | isvec1 | vew1 | i/f | regidx | predidx |
199 | .. | simd.. | bank.. | isvec.. | vew.. | i/f | regidx | predidx |
200 | 15 | simd15 | bank15 | isvec15 | vew15 | i/f | regidx | predidx |
201
202 vew may be one of the following (giving a table "bytestable", used below):
203
204 | vew | bitwidth |
205 | --- | --------- |
206 | 00 | default |
207 | 01 | default/2 |
208 | 10 | 8 |
209 | 11 | 16 |
210
211 Extending this table (with extra bits) is covered in the section
212 "Implementing RVV on top of Simple-V".
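
The vew encoding can be read as a small lookup, sketched here in Python. The function names are hypothetical, and `default_width` stands for whatever bitwidth the operation would normally imply (for example XLEN), which is an assumption about how "default" is resolved.

```python
def element_bitwidth(vew, default_width):
    """Map the 2-bit vew field to an element width in bits."""
    if vew == 0b00:
        return default_width        # default
    if vew == 0b01:
        return default_width // 2   # default/2
    if vew == 0b10:
        return 8
    if vew == 0b11:
        return 16
    raise ValueError("vew is a 2-bit field")


def elements_per_register(vew, default_width):
    """Number of packed (SIMD) elements that fit in one register."""
    return default_width // element_bitwidth(vew, default_width)
```

For instance, with a 64-bit default and vew = 10 (8-bit elements), a packed register would hold eight elements.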
213
214 As the above table is a CAM (key-value store) it may be appropriate
215 to expand it as follows:
216
    struct vectorised fp_vec[32], int_vec[32]; // 64 in future

    for (i = 0; i < 16; i++) // 16 CSRs?
        tb = int_vec if CSRvec[i].type == 0 else fp_vec
        idx = CSRvec[i].regkey                 // INT/FP src/dst reg in opcode
        tb[idx].elwidth = CSRvec[i].elwidth
        tb[idx].regidx = CSRvec[i].regidx      // indirection
        tb[idx].isvector = CSRvec[i].isvector  // 0=scalar
        tb[idx].packed = CSRvec[i].packed      // SIMD or not
        tb[idx].bank = CSRvec[i].bank          // 0 (1=rsvd)
227
228 TODO: move elsewhere
229
230 # TODO: use elsewhere (retire for now)
231 vew = CSRbitwidth[rs1]
232 if (vew == 0)
233 bytesperreg = (XLEN/8) # or FLEN as appropriate
234 elif (vew == 1)
235 bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
236 else:
237 bytesperreg = bytestable[vew] # 8 or 16
238 simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
239 vlen = CSRvectorlen[rs1] * simdmult
240 CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
241
242 The reason for multiplying the vector length by the number of SIMD elements
243 (in each individual register) is so that each SIMD element may optionally be
244 predicated.
245
246 An example of how to subdivide the register file when bitwidth != default
247 is given in the section "Bitwidth Virtual Register Reordering".
248
# Instructions

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, Simple-V needs no new instructions.
*All* RVV instructions can be re-mapped; however, xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
*all instructions from RVV are topologically re-mapped and retain their
complete functionality, intact*.
258
259 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
260 equivalents, so are left out of Simple-V. VSELECT could be included if
261 there existed a MV.X instruction in RV (MV.X is a hypothetical
262 non-immediate variant of MV that would allow another register to
263 specify which register was to be copied). Note that if any of these three
264 instructions are added to any given RV extension, their functionality
265 will be inherently parallelised.
266
## Instruction Format

The instruction format for Simple-V does not actually have *any* explicit
compare operations, *any* arithmetic, floating point or *any*
memory instructions. There are in fact **no operations added at all**.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV, FCVT, and
LOAD/STORE depending on CSR configurations for bitwidth and predication.
**Everything** becomes parallelised. *This includes Compressed
instructions* as well as any future instructions and Custom Extensions.
279
## VSETVL
281
282 NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
283 with the instruction format remaining the same.
284
285 VSETVL is slightly different from RVV in that the minimum vector length
286 is required to be at least the number of registers in the register file,
287 and no more than XLEN. This allows vector LOAD/STORE to be used to switch
288 the entire bank of registers using a single instruction (see Appendix,
289 "Context Switch Example"). The reason for limiting VSETVL to XLEN is
290 down to the fact that predication bits fit into a single register of length
291 XLEN bits.
292
The second change is that when VSETVL is requested to be stored
into x0, it is *ignored* silently (VSETVL x0, x5, #4).

The third change is that there is an additional immediate added to VSETVL,
to which VL is set after first going through MIN-filtering.
So when using the "vsetvl rs1, rs2, #vlen" instruction, it becomes:
299
300 VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
301
302 where RegfileLen <= MAXVECTORDEPTH < XLEN
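
The MIN-filtering can be written out directly. In this small Python sketch the function name `vsetvl` and its parameter names are illustrative only; `rs2_value` is the content of the rs2 register and `vlen_imm` is the #vlen immediate.

```python
def vsetvl(vlen_imm, rs2_value, max_vector_depth):
    """VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2), with no implementation latitude."""
    return min(min(vlen_imm, max_vector_depth), rs2_value)
```

For example, with MAXVECTORDEPTH = 32, an immediate of 31 and rs2 holding 100, VL becomes exactly 31; an immediate of 40 is clamped to 32, and rs2 holding 8 clamps VL to 8.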
303
This has implications for the microarchitecture, as VL is required to be
set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
requested in the #immediate parameter. RVV has the option to set VL
to an arbitrary value that suits the conditions and the micro-architecture:
SV does *not* permit that.
309
310 The reason is so that if SV is to be used for a context-switch or as a
311 substitute for LOAD/STORE-Multiple, the operation can be done with only
312 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
313 single LD/ST operation). If VL does *not* get set to the register file
314 length when VSETVL is called, then a software-loop would be needed.
315 To avoid this need, VL *must* be set to exactly what is requested
316 (limits notwithstanding).
317
318 Therefore, in turn, unlike RVV, implementors *must* provide
319 pseudo-parallelism (using sequential loops in hardware) if actual
320 hardware-parallelism in the ALUs is not deployed. A hybrid is also
321 permitted (as used in Broadcom's VideoCore-IV) however this must be
322 *entirely* transparent to the ISA.
323
## Branch Instruction:
325
326 Branch operations use standard RV opcodes that are reinterpreted to
327 be "predicate variants" in the instance where either of the two src
328 registers are marked as vectors (isvector=1). When this reinterpretation
329 is enabled the "immediate" field of the branch operation is taken to be a
330 predication target register, rs3. The predicate target register rs3 is
331 to be treated as a bitfield (up to a maximum of XLEN bits corresponding
332 to a maximum of XLEN elements).
333
If either of src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
goes ahead as vector-scalar or scalar-vector. Implementors should note that
this could require considerable multi-porting of the register file in order
to parallelise properly, so may have to involve the use of register caching
and transparent copying (see Multiple-Banked Register File Architectures
paper).

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
343
344 This is the overloaded table for Integer-base Branch operations. Opcode
345 (bits 6..0) is set in all cases to 1100011.
346
347 [[!table data="""
348 31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
349 imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
350 7 | 5 | 5 | 3 | 4 | 1 | 7 |
351 reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
352 reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
353 reserved | src2 | src1 | 001 | predicate rs3 || BNE |
354 reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
355 reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
356 reserved | src2 | src1 | 100 | predicate rs3 || BLT |
357 reserved | src2 | src1 | 101 | predicate rs3 || BGE |
358 reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
359 reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
360 """]]
361
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2.
365
366 Below is the overloaded table for Floating-point Predication operations.
367 Interestingly no change is needed to the instruction format because
368 FP Compare already stores a 1 or a zero in its "rd" integer register
369 target, i.e. it's not actually a Branch at all: it's a compare.
370 The target needs to simply change to be a predication bitfield (done
371 implicitly).
372
As with
Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
Likewise for Single-precision, fmt (bits 26..25) is still set to 00.
Double-precision is still set to 01, whilst Quad-precision
appears not to have a definition in V2.3-Draft (but should be unaffected).
378
379 It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
380 and whilst in ordinary branch code this is fine because the standard
381 RVF compare can always be followed up with an integer BEQ or a BNE (or
382 a compressed comparison to zero or non-zero), in predication terms that
383 becomes more of an impact. To deal with this, SV's predication has
384 had "invert" added to it.
385
[[!table data="""
31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
5 | 2 | 5 | 5 | 3 | 5 | 7 |
10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | rsvd |
10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
"""]]
395
396 Note (**TBD**): floating-point exceptions will need to be extended
397 to cater for multiple exceptions (and statuses of the same). The
398 usual approach is to have an array of status codes and bit-fields,
399 and one exception, rather than throw separate exceptions for each
400 Vector element.
401
402 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
403 for predicated compare operations of function "cmp":
404
405 for (int i=0; i<vl; ++i)
406 if ([!]preg[p][i])
407 preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
408 s2 ? vreg[rs2][i] : sreg[rs2]);
409
410 With associated predication, vector-length adjustments and so on,
411 and temporarily ignoring bitwidth (which makes the comparisons more
412 complex), this becomes:
413
    if I/F == INT: # integer type cmp
        preg = int_pred_reg[rd]
        reg = int_regfile
    else:
        preg = fp_pred_reg[rd]
        reg = fp_regfile

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);
    if (!s2 && !s1) goto branch;
    for (int i = 0; i < VL; ++i)
        if (cmp(s1 ? reg[src1+i] : reg[src1],
                s2 ? reg[src2+i] : reg[src2]))
            preg[rs3] |= 1<<i;  # bitfield not vector
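
A runnable Python sketch of the same idea follows, ignoring bitwidth and incoming predication. The function name `vector_compare` and the `is_vec` callback (standing in for `reg_is_vectorised`) are assumptions for the example. The function returns the bitfield that would be written to the predicate target rs3, or `None` when both sources are scalar and the instruction falls back to an ordinary branch.

```python
import operator


def vector_compare(regfile, is_vec, vl, src1, src2, cmp=operator.lt):
    """Compare element-wise and return a bitfield, one bit per element."""
    s1, s2 = is_vec(src1), is_vec(src2)
    if not s1 and not s2:
        return None                      # standard scalar branch instead
    result = 0
    for i in range(vl):
        a = regfile[src1 + i] if s1 else regfile[src1]
        b = regfile[src2 + i] if s2 else regfile[src2]
        if cmp(a, b):
            result |= 1 << i             # bitfield, not a vector of booleans
    return result
```

Passing `operator.eq`, `operator.ne` or `operator.ge` as `cmp` gives sketches of the other comparators under the same assumptions.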
428
429 Notes:
430
* Predicated SIMD comparisons would break src1 and src2 further down
  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
  Reordering") setting Vector-Length times (number of SIMD elements) bits
  in Predicate Register rs3 as opposed to just Vector-Length bits.
* Predicated Branches do not actually have an adjustment to the Program
  Counter, so all of bits 25 through 30 in every case are not needed.
* There are plenty of reserved opcodes for which bits 25 through 30 could
  be put to good use if there is a suitable use-case.
* FLT and FLE may be inverted to FGT and FGE if needed by swapping
  src1 and src2 (likewise the integer counterparts).
441
## Compressed Branch Instruction:
443
444 Compressed Branch instructions are likewise re-interpreted as predicated
445 2-register operations, with the result going into rs3. All the bits of
446 the immediate are re-interpreted for different purposes, to extend the
447 number of comparator operations to beyond the original specification,
448 but also to cater for floating-point comparisons as well as integer ones.
449
450 [[!table data="""
451 15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
452 funct3 | imm | rs10 | imm | | op | |
453 3 | 3 | 3 | 2 | 3 | 2 | |
454 C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
455 110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
456 111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
457 110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
458 111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
459 """]]
460
461 Notes:
462
* Bits 5, 13, 14 and 15 make up the comparator type
* Bit 6 indicates whether to use integer or floating-point comparisons
* In both floating-point and integer cases there are four predication
  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping
  src1 and src2).
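
To make the field layout concrete, here is a hedged Python sketch that pulls the proposed fields out of a 16-bit encoding and names the comparator. The function name and dictionary keys are invented for the example, and which value of bit 6 selects floating-point is not stated in the table, so the bit is returned uninterpreted.

```python
# (funct3, bit 5) selects the comparator, per the table above
COMPARATORS = {
    (0b110, 0): "P.EQ",
    (0b111, 0): "P.NE",
    (0b110, 1): "P.LT",
    (0b111, 1): "P.LE",
}


def decode_c_bpr(insn):
    """Extract the predicated compressed-branch fields from a 16-bit word."""
    fields = {
        "op": insn & 0x3,              # bits 1..0
        "src2": (insn >> 2) & 0x7,     # bits 4..2
        "b": (insn >> 5) & 0x1,        # bit 5: comparator extension
        "i_f": (insn >> 6) & 0x1,      # bit 6: integer/float selector
        "src1": (insn >> 7) & 0x7,     # bits 9..7
        "rs3": (insn >> 10) & 0x7,     # bits 12..10: predicate target
        "funct3": (insn >> 13) & 0x7,  # bits 15..13
    }
    fields["name"] = COMPARATORS.get((fields["funct3"], fields["b"]))
    return fields
```

Any funct3 other than 110 or 111 yields a `name` of `None`, since only the two compressed-branch encodings are reinterpreted.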
468
## LOAD / STORE Instructions <a name="load_store"></a>
470
471 For full analysis of topological adaptation of RVV LOAD/STORE
472 see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
473 may be implicitly overloaded into the one base RV LOAD instruction,
474 and likewise for STORE.
475
476 Revised LOAD:
477
478 [[!table data="""
479 31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
480 imm[11:0] |||| rs1 | funct3 | rd | opcode |
481 1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
482 ? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
483 """]]
484
485 The exact same corresponding adaptation is also carried out on the single,
486 double and quad precision floating-point LOAD-FP and STORE-FP operations,
487 which fit the exact same instruction format. Thus all three types
488 (unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
489 as well as FSW, FSD and FSQ.
490
491 Notes:
492
493 * LOAD remains functionally (topologically) identical to RVV LOAD
494 (for both integer and floating-point variants).
495 * Predication CSR-marking register is not explicitly shown in instruction, it's
496 implicit based on the CSR predicate state for the rd (destination) register
497 * rs2, the source, may *also be marked as a vector*, which implicitly
498 is taken to indicate "Indexed Load" (LD.X)
499 * Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
500 * Bit 31 is reserved (ideas under consideration: auto-increment)
501 * **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
502 * **TODO**: clarify where width maps to elsize
503
504 Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
505
    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided

    preg = int_pred_reg[rd]

    for (int i=0; i<vl; ++i)
        if ([!]preg[rd] & 1<<i)
            for (int j=0; j<seglen+1; j++)
            {
                if (CSRvectorised[rs2])
                    offs = vreg[rs2+i]
                else
                    offs = i*(seglen+1)*stride;
                vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
            }
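
The address arithmetic in the pseudo-code can be isolated into a small Python function, shown below as a sketch only. `load_element_addresses` and its parameters are hypothetical names; `index` plays the role of the vectorised rs2 (indexed load), `stride=None` means unit-strided, and the predicate is passed in already resolved.

```python
def load_element_addresses(base, vl, predicate, elsize,
                           seglen=0, stride=None, index=None):
    """Return (element, segment, address) tuples for each enabled element."""
    if stride is None:
        stride = elsize                        # unit-strided
    accesses = []
    for i in range(vl):
        if not predicate & (1 << i):
            continue                           # element masked out
        for j in range(seglen + 1):
            if index is not None:
                offs = index[i]                # indexed: offset from vector rs2
            else:
                offs = i * (seglen + 1) * stride
            accesses.append((i, j, base + offs + j * stride))
    return accesses
```

Listing the addresses separately from the memory access makes it easy to see that the unit-strided, constant-strided and indexed forms differ only in how `offs` is produced.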
521
522 Taking CSR (SIMD) bitwidth into account involves using the vector
523 length and register encoding according to the "Bitwidth Virtual Register
524 Reordering" scheme shown in the Appendix (see function "regoffs").
525
526 A similar instruction exists for STORE, with identical topological
527 translation of all features. **TODO**
528
## Compressed LOAD / STORE Instructions
530
531 Compressed LOAD and STORE are of the same format, where bits 2-4 are
532 a src register instead of dest:
533
534 [[!table data="""
535 15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
536 funct3 | imm | rs10 | imm | rd0 | op |
537 3 | 3 | 3 | 2 | 3 | 2 |
538 C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
539 """]]
540
541 Unfortunately it is not possible to fit the full functionality
542 of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
543 require another operand (rs2) in addition to the operand width
544 (which is also missing), offset, base, and src/dest.
545
546 However a close approximation may be achieved by taking the top bit
547 of the offset in each of the five types of LD (and ST), reducing the
548 offset to 4 bits and utilising the 5th bit to indicate whether "stride"
549 is to be enabled. In this way it is at least possible to introduce
550 that functionality.
551
552 (**TODO**: *assess whether the loss of one bit from offset is worth having
553 "stride" capability.*)
554
555 We also assume (including for the "stride" variant) that the "width"
556 parameter, which is missing, is derived and implicit, just as it is
557 with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
558 and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
559 C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
560
561 Interestingly we note that the Vectorised Simple-V variant of
562 LOAD/STORE (Compressed and otherwise), due to it effectively using the
563 standard register file(s), is the direct functional equivalent of
564 standard load-multiple and store-multiple instructions found in other
565 processors.
566
Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments on
page 76, "For virtual memory systems some data accesses could be resident
in physical memory and some not". The interesting question then arises:
how does RVV deal with the exact same scenario?
Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
of detecting early page / segmentation faults and adjusting the TLB
in advance, accordingly; other strategies are explored in the Appendix
Section "Virtual Memory Page Faults".
575
## Vectorised Copy/Move (and conversion) instructions
577
578 There is a series of 2-operand instructions involving copying (and
579 alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
580 follow the same pattern, as it is *both* the source *and* destination
581 predication masks that are taken into account. This is different from
582 the three-operand arithmetic instructions, where the predication mask
583 is taken from the *destination* register, and applied uniformly to the
584 elements of the source register(s), element-for-element.
585
### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).
591
592 If either the source or the destination register are marked as vectors
593 C.MV is reinterpreted to be a vectorised (multi-register) predicated
594 move operation. The actual instruction's format does not change:
595
[[!table data="""
15 12 | 11 7 | 6 2 | 1 0 |
funct4 | rd | rs | op |
4 | 5 | 5 | 2 |
C.MV | dest | src | C2 |
"""]]
602
603 A simplified version of the pseudocode for this operation is as follows:
604
    function op_mv(rd, rs) # MV not VMV!
        rd_isvec = int_vec[rd].isvector; # capture before redirection
        rs_isvec = int_vec[rs].isvector;
        rd = rd_isvec ? int_vec[rd].regidx : rd;
        rs = rs_isvec ? int_vec[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, j = 0; i < VL && j < VL;):
            if (rs_isvec) while (i < VL && !(ps & 1<<i)) i++;
            if (rd_isvec) while (j < VL && !(pd & 1<<j)) j++;
            if (i == VL || j == VL) break; # ran out of enabled elements
            ireg[rd+j] <= ireg[rs+i];
            if (rs_isvec) i++;
            if (rd_isvec) j++;
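
The twin-predication loop can be exercised with the Python sketch below. It is a simplified model: register redirection and elwidth are omitted, the masks `ps` and `pd` are passed in already resolved, and the function signature is an assumption for the example. A guard is added for the scalar-to-scalar case, which the pseudo-code never reaches because C.MV is only reinterpreted when at least one operand is a vector.

```python
def op_mv(ireg, vl, rd, rs, rd_is_vec, rs_is_vec, ps, pd):
    """Copy enabled source elements to enabled destination slots."""
    i = j = 0
    while i < vl and j < vl:
        if rs_is_vec:
            while i < vl and not (ps >> i) & 1:
                i += 1                    # skip masked-out source elements
        if rd_is_vec:
            while j < vl and not (pd >> j) & 1:
                j += 1                    # skip masked-out destination slots
        if i >= vl or j >= vl:
            break                         # ran out of enabled elements
        ireg[rd + j] = ireg[rs + i]
        if rs_is_vec:
            i += 1
        if rd_is_vec:
            j += 1
        if not rs_is_vec and not rd_is_vec:
            break                         # plain scalar move: one copy only
```

With a sparse source mask and a full destination mask the loop packs the enabled source elements contiguously (gather); with a scalar source and a sparse destination mask it behaves as a sparse splat.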
616
617 Note that:
618
619 * elwidth (SIMD) is not covered above
620 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
621 not covered
622
623 There are several different instructions from RVV that are covered by
624 this one opcode:
625
626 [[!table data="""
627 src | dest | predication | op |
628 scalar | vector | none | VSPLAT |
629 scalar | vector | destination | sparse VSPLAT |
630 scalar | vector | 1-bit dest | VINSERT |
631 vector | scalar | 1-bit? src | VEXTRACT |
632 vector | vector | none | VCOPY |
633 vector | vector | src | Vector Gather |
634 vector | vector | dest | Vector Scatter |
635 vector | vector | src & dest | Gather/Scatter |
636 vector | vector | src == dest | sparse VCOPY |
637 """]]
638
639 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
640 operations with inversion on the src and dest predication for one of the
641 two C.MV operations.
642
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
648
### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.
657
658 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
659
### FCVT Instructions
661
662 These are again identical in form to C.MV, except that they cover
663 floating-point to integer and integer to floating-point. When element
664 width in each vector is set to default, the instructions behave exactly
665 as they are defined for standard RV (scalar) operations, except vectorised
666 in exactly the same fashion as outlined in C.MV.
667
668 However when the source or destination element width is not set to default,
669 the opcode's explicit element widths are *over-ridden* to new definitions,
670 and the opcode's element width is taken as indicative of the SIMD width
671 (if applicable i.e. if packed SIMD is requested) instead.
672
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point
number in rd. If however the source rs1 is set to be a vector, where
elwidth is set to default/2 and "packed SIMD" is enabled, then the first
32 bits of rs1 are converted to a floating-point number to be stored in
rd's first element and the higher 32 bits are *also* converted to
floating-point and stored in the second. The 32 bit size comes from the
fact that FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set
to divide that by two it means that rs1 element width is to be taken as 32.
682
683 Similar rules apply to the destination register.
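
The packed-SIMD reading of the source register in that example can be sketched in Python as follows. This only models the source side: how the two results are laid out in rd depends on the destination's own elwidth and packing settings, so the sketch simply returns them as a pair. The function name is hypothetical, and the elements are treated as signed, matching FCVT.S.L.

```python
import struct


def fcvt_packed_halves(rs1_value):
    """Convert the two signed 32-bit halves of a 64-bit register to floats."""
    def to_signed32(x):
        return x - (1 << 32) if x & 0x80000000 else x

    def to_single(x):
        # round-trip through a 4-byte float to get single-precision rounding
        return struct.unpack("<f", struct.pack("<f", float(x)))[0]

    lo = to_signed32(rs1_value & 0xFFFFFFFF)          # first element
    hi = to_signed32((rs1_value >> 32) & 0xFFFFFFFF)  # second element
    return to_single(lo), to_single(hi)
```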
684
# Exceptions
686
687 > What does an ADD of two different-sized vectors do in simple-V?
688
* if the two source operands are not the same length, throw an exception.
* if the destination operand is also a vector, and the source is longer
  than the destination, throw an exception.
691 than the destination, throw an exception.
692
693 > And what about instructions like JALR? 
694 > What does jumping to a vector do?
695
696 * Throw an exception. Whether that actually results in spawning threads
697 as part of the trap-handling remains to be seen.
698
# Under consideration <a name="issues"></a>

## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>

One of the goals of this parallelism proposal is to avoid instruction
duplication. However, with the base ISA having been designed explicitly
to *avoid* condition-codes entirely, shoe-horning predication into it
becomes quite challenging.
707
708 However what if all branch instructions, if referencing a vectorised
709 register, were instead given *completely new analogous meanings* that
710 resulted in a parallel bit-wise predication register being set? This
711 would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
712 BLT and BGE.
713
We might imagine that FEQ, FLT and FLE would also need to be converted,
however these are effectively *already* in the precise form needed and
do not need to be converted *at all*! The difference is that FEQ, FLT
and FLE *specifically* write a 1 to an integer register if the condition
holds, and 0 if not. All that needs to be done here is to say, "if
the integer register is tagged with a bit that says it is a predication
register, the **bit** in the integer register is set based on the
current vector index" instead.
722
723 There is, in the standard Conditional Branch instruction, more than
724 adequate space to interpret it in a similar fashion:
725
726 [[!table data="""
727 31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
728 imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
729 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
730 offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
731 """]]
732
733 This would become:
734
735 [[!table data="""
736 31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
737 imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
738 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
739 reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
740 """]]
741
742 Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
743 with the interesting side-effect that there is space within what is presently
744 the "immediate offset" field to reinterpret that to add in not only a bit
745 field to distinguish between floating-point compare and integer compare,
746 not only to add in a second source register, but also use some of the bits as
747 a predication target as well.
748
749 [[!table data="""
750 15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
751 funct3 | imm | rs10 | imm | op |
752 3 | 3 | 3 | 5 | 2 |
753 C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
754 """]]
755
756 Now uses the CS format:
757
758 [[!table data="""
759 15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
760 funct3 | imm | rs10 | imm | | op |
761 3 | 3 | 3 | 2 | 3 | 2 |
762 C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
763 """]]
764
765 Bit 6 would be decoded as "operation refers to Integer or Float" including
766 interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
767 "C" Standard, version 2.0,
768 whilst Bit 5 would allow the operation to be extended, in combination with
769 funct3 = 110 or 111: a combination of four distinct (predicated) comparison
770 operators. In both floating-point and integer cases those could be
771 EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
772
773