1 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
3 Key insight: Simple-V is intended as an abstraction layer to provide
4 a consistent "API" to parallelisation of existing *and future* operations.
5 *Actual* internal hardware-level parallelism is *not* required, such
6 that Simple-V may be viewed as providing a "compact" or "consolidated"
7 means of issuing multiple near-identical arithmetic instructions to an
8 instruction queue (FIFO), pending execution.
10 *Actual* parallelism, if added independently of Simple-V in the form
11 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
12 implementations, or SIMD, or anything else, would then benefit from
13 the uniformity of a consistent API.
**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
17 Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
23 # CSRs <a name="csrs"></a>
25 There are two CSR tables needed to create lookup tables which are used at
26 the register decode phase.
30 MAXVECTORLENGTH is the same concept as MVL in RVV. However in Simple-V,
31 given that its primary (base, unextended) purpose is for 3D, Video and
32 other purposes (not requiring supercomputing capability), it makes sense
to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64).
36 The reason for setting this limit is so that predication registers, when
37 marked as such, may fit into a single register as opposed to fanning out
38 over several registers. This keeps the implementation a little simpler.
39 Note also (as also described in the VSETVL section) that the *minimum*
40 for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
41 and 31 for RV32 or RV64).
43 Note that RVV on top of Simple-V may choose to over-ride this decision.
62 ## Predication CSR <a name="predication_csr_table"></a>
64 The Predication CSR is a key-value store indicating whether, if a given
65 destination register (integer or floating-point) is referred to in an
66 instruction, it is to be predicated. However it is important to note
67 that the *actual* register is *different* from the one that ends up
68 being used, due to the level of indirection through the lookup table.
* regidx is the actual register that, in combination with the
  i/f flag, if that integer or floating-point register is referred to,
  results in the lookup table being referenced to find the predication
  mask to use on the operation in which that (regidx) register has
  been used.
* predidx (in combination with the bank bit in the future) is the
  *actual* register to be used for the predication mask. Note:
  in effect predidx is actually a 6-bit register address, as the bank
  bit is the MSB (and is nominally set to zero for now).
* inv indicates that the predication mask bits are to be inverted
  prior to use *without* actually modifying the contents of the
  register itself.
* zeroing is either 1 or 0, and if set to 1, the operation must
  place zeros in any element position where the predication mask is
  set to zero. If zeroing is set to 0, unpredicated elements *must*
  be left alone. Some microarchitectures may choose to interpret
  this as skipping the operation entirely. Others which wish to
  stick more closely to a SIMD architecture may choose instead to
  interpret unpredicated elements as an internal "copy element"
  operation (which would be necessary in SIMD microarchitectures
  that perform register-renaming).

92 | PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
93 | ----- | - | - | - | - | ------- | ------- |
| 0     | bank0  | zero0  | inv0  | i/f | regidx | predidx |
| 1     | bank1  | zero1  | inv1  | i/f | regidx | predidx |
| ..    | bank.. | zero.. | inv.. | i/f | regidx | predidx |
| 15    | bank15 | zero15 | inv15 | i/f | regidx | predidx |

The Predication CSR Table is a key-value store, so implementation-wise
it will be faster to turn the table around (maintain topologically
equivalent state):

    struct pred {
        bool zero;
        bool inv;
        bool bank;    // 0 for now, 1=rsvd
        bool enabled;
        int predidx;  // redirection: actual int register to use
    }

    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
    struct pred int_pred_reg[32];  // 64 in future (bank=1)

    for (i = 0; i < 16; i++)
        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
        idx = CSRpred[i].regidx
        tb[idx].zero = CSRpred[i].zero
        tb[idx].inv = CSRpred[i].inv
        tb[idx].bank = CSRpred[i].bank
        tb[idx].predidx = CSRpred[i].predidx
        tb[idx].enabled = true

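
The table turn-around can be sketched in Python as follows. This is only an illustration under stated assumptions: the bit positions are taken from the PrCSR table, bits 4..0 are treated as the value called predidx in the pseudo-code, and the names `Pred`, `decode_pred_entry` and `build_pred_tables` are hypothetical. Only the entries actually passed in are expanded, whereas the pseudo-code walks all 16 slots.

```python
from dataclasses import dataclass


@dataclass
class Pred:
    enabled: bool = False
    zero: bool = False
    inv: bool = False
    bank: bool = False
    predidx: int = 0  # integer register that holds the mask


def decode_pred_entry(entry):
    """Split one 14-bit Predication CSR entry into named fields."""
    return {
        "bank": bool((entry >> 13) & 1),
        "zero": bool((entry >> 12) & 1),
        "inv": bool((entry >> 11) & 1),
        "is_fp": bool((entry >> 10) & 1),  # 0 = integer, as in the loop above
        "regidx": (entry >> 5) & 0x1F,     # key: register named in the opcode
        "predidx": entry & 0x1F,           # value: register holding the mask
    }


def build_pred_tables(csr_entries):
    """Turn the key-value entries into two directly indexed tables."""
    int_pred = [Pred() for _ in range(32)]
    fp_pred = [Pred() for _ in range(32)]
    for entry in csr_entries:
        f = decode_pred_entry(entry)
        table = fp_pred if f["is_fp"] else int_pred
        table[f["regidx"]] = Pred(True, f["zero"], f["inv"], f["bank"], f["predidx"])
    return int_pred, fp_pred
```

With the tables indexed directly by register number, the decode phase needs a single array access per operand instead of a search over the 16 CSR entries.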
So when an operation is to be predicated, it is the internal state that
is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
pseudo-code for operations is given, where p is the explicit (direct)
reference to the predication register to be used:

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            (d ? vreg[rd][i] : sreg[rd]) =
                iop(s1 ? vreg[rs1][i] : sreg[rs1],
                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs

This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows:

    if I/F == INT:
        preg = int_pred_reg
    else:
        preg = fp_pred_reg

    for (int i=0; i<vl; ++i)
        predidx = preg[rd].predidx; // the indirection takes place HERE
        if (!preg[rd].enabled)
            predicate = ~0x0; // all parallel ops enabled
        else:
            predicate = intregfile[predidx]; // get actual reg contents HERE
            if (preg[rd].inv) // invert if requested
                predicate = ~predicate;
        if (predicate & (1<<i)) // test bit i of the mask
            (d ? regfile[rd+i] : regfile[rd]) =
                iop(s1 ? regfile[rs1+i] : regfile[rs1],
                    s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (preg[rd].zero)
            // TODO: place zero in dest reg

* d, s1 and s2 are booleans indicating whether destination,
  source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity. rd, rs1 and rs2 must ALSO go through
  register-level redirection (from the Register CSR table) if they are
  marked in that table.

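
As a rough, runnable illustration of the predicated loop, the following Python sketch applies a two-input operation element by element. It assumes the predicate has already been fetched (and inverted if requested), leaves out register redirection just as the pseudo-code does, and fills in the zeroing branch by writing 0, which is this sketch's reading of the zeroing bullet rather than something the pseudo-code spells out. The name `predicated_op` is hypothetical.

```python
import operator


def predicated_op(op, regfile, rd, rs1, rs2, vl, predicate,
                  zeroing=False, d=True, s1=True, s2=True):
    """Apply op to vl elements; d/s1/s2 say whether each operand is a vector."""
    for i in range(vl):
        dest = rd + i if d else rd
        if predicate & (1 << i):
            a = regfile[rs1 + i] if s1 else regfile[rs1]
            b = regfile[rs2 + i] if s2 else regfile[rs2]
            regfile[dest] = op(a, b)
        elif zeroing:
            regfile[dest] = 0  # masked-out element is zeroed
        # otherwise the masked-out element is left untouched


regs = list(range(32))
predicated_op(operator.add, regs, rd=8, rs1=16, rs2=24, vl=4,
              predicate=0b0101, zeroing=True)
```

After the call, elements 0 and 2 of the destination hold sums, while elements 1 and 3 are zero because zeroing was requested.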
If written as a function, obtaining the predication mask (but not whether
zeroing takes place) may be done as follows:

    def get_pred_val(bool is_fp_op, int reg):
        tb = fp_pred if is_fp_op else int_pred
        if (!tb[reg].enabled):
            return ~0x0               // all ops enabled
        predidx = tb[reg].predidx     // redirection occurs HERE
        predicate = intreg[predidx]   // actual predicate HERE
        if (tb[reg].inv):
            predicate = ~predicate    // invert ALL bits
        return predicate

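
A runnable Python version of this lookup might look like the sketch below. It assumes XLEN = 32, represents each table as a dictionary holding only the enabled registers, and passes the tables in as arguments; those representation choices are assumptions of the sketch and not part of the proposal.

```python
XLEN = 32
ALL_ONES = (1 << XLEN) - 1


def get_pred_val(is_fp_op, reg, int_pred, fp_pred, intreg):
    """Return the predicate mask that applies to register reg."""
    table = fp_pred if is_fp_op else int_pred
    entry = table.get(reg)
    if entry is None:               # not marked as predicated
        return ALL_ONES             # all elements enabled
    predicate = intreg[entry["predidx"]]  # redirection to the mask register
    if entry["inv"]:
        predicate = ~predicate & ALL_ONES  # invert all XLEN bits
    return predicate


intreg = [0] * 32
intreg[10] = 0b0101
int_pred = {3: {"predidx": 10, "inv": False},
            4: {"predidx": 10, "inv": True}}
fp_pred = {}
mask = get_pred_val(False, 3, int_pred, fp_pred, intreg)
```

Note that the mask always comes from the integer register file, even for floating-point operations; only the choice of lookup table depends on the operation type.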
180 ## Register CSR key-value (CAM) table
182 The purpose of the Register CSR table is four-fold:
* To mark integer and floating-point registers as requiring "redirection"
  if they are ever used as a source or destination in any given operation.
186 This involves a level of indirection through a 5-to-6-bit lookup table
187 (where the 6th bit - bank - is always set to 0 for now).
188 * To indicate whether, after redirection through the lookup table, the
189 register is a vector (or remains a scalar).
190 * To over-ride the implicit or explicit bitwidth that the operation would
191 normally give the register.
192 * To indicate if the register is to be interpreted as "packed" (SIMD)
193 i.e. containing multiple contiguous elements of size equal to "bitwidth".
195 | RgCSR | 15 | 14 | 13 | (12..11) | 10 | (9..5) | (4..0) |
196 | ----- | - | - | - | - | - | ------- | ------- |
| 0     | simd0  | bank0  | isvec0  | vew0  | i/f | regidx | regkey |
| 1     | simd1  | bank1  | isvec1  | vew1  | i/f | regidx | regkey |
| ..    | simd.. | bank.. | isvec.. | vew.. | i/f | regidx | regkey |
| 15    | simd15 | bank15 | isvec15 | vew15 | i/f | regidx | regkey |

202 vew may be one of the following (giving a table "bytestable", used below):
211 Extending this table (with extra bits) is covered in the section
212 "Implementing RVV on top of Simple-V".
As the above table is a CAM (key-value store) it may be appropriate
to expand it as follows:

    struct vectorised fp_vec[32], int_vec[32]; // 64 in future

    for (i = 0; i < 16; i++) // 16 CSRs?
        tb = int_vec if CSRvec[i].type == 0 else fp_vec
        idx = CSRvec[i].regkey                  // INT/FP src/dst reg in opcode
        tb[idx].elwidth  = CSRvec[i].elwidth
        tb[idx].regidx   = CSRvec[i].regidx     // indirection
        tb[idx].isvector = CSRvec[i].isvector   // 0=scalar
        tb[idx].packed   = CSRvec[i].packed     // SIMD or not
        tb[idx].bank     = CSRvec[i].bank       // 0 (1=rsvd)

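
The expansion and the resulting register redirection can be sketched in Python. The bit positions follow the RgCSR table, with bits 4..0 treated as the lookup key (called regkey in the pseudo-code); that reading, the dictionary representation and the names `decode_reg_entry`, `build_vec_tables` and `resolve` are all assumptions made for illustration.

```python
def decode_reg_entry(entry):
    """Split one 16-bit Register CSR entry into named fields."""
    return {
        "packed": bool((entry >> 15) & 1),    # SIMD or not
        "bank": bool((entry >> 14) & 1),      # 0 for now
        "isvector": bool((entry >> 13) & 1),  # 0 = scalar
        "vew": (entry >> 11) & 0x3,           # element width selector
        "is_fp": bool((entry >> 10) & 1),     # 0 = integer
        "regidx": (entry >> 5) & 0x1F,        # redirection target
        "regkey": entry & 0x1F,               # register named in the opcode
    }


def build_vec_tables(csr_entries):
    """Expand the CAM into per-register dictionaries keyed by regkey."""
    int_vec, fp_vec = {}, {}
    for entry in csr_entries:
        f = decode_reg_entry(entry)
        (fp_vec if f["is_fp"] else int_vec)[f["regkey"]] = f
    return int_vec, fp_vec


def resolve(table, reg):
    """Return (actual register, is_vector) for a register named in an opcode."""
    if reg not in table:
        return reg, False
    return table[reg]["regidx"], table[reg]["isvector"]
```

A register without an entry resolves to itself as a scalar, which is how unmodified scalar code keeps its standard behaviour.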

    # TODO: use elsewhere (retire for now)
    vew = CSRbitwidth[rs1]
    if (vew == 0)
        bytesperreg = (XLEN/8) # or FLEN as appropriate
    elif (vew == 1)
        bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
    else:
        bytesperreg = bytestable[vew] # 8 or 16
    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
    vlen = CSRvectorlen[rs1] * simdmult
    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)

The reason for multiplying the vector length by the number of SIMD elements
(in each individual register) is so that each SIMD element may optionally be
predicated.

246 An example of how to subdivide the register file when bitwidth != default
247 is given in the section "Bitwidth Virtual Register Reordering".
251 Despite being a 98% complete and accurate topological remap of RVV
252 concepts and functionality, no new instructions are needed.
253 *All* RVV instructions can be re-mapped, however xBitManip
254 becomes a critical dependency for efficient manipulation of predication
255 masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
256 *all instructions from RVV are topologically re-mapped and retain their
257 complete functionality, intact*.
259 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
260 equivalents, so are left out of Simple-V. VSELECT could be included if
261 there existed a MV.X instruction in RV (MV.X is a hypothetical
262 non-immediate variant of MV that would allow another register to
263 specify which register was to be copied). Note that if any of these three
264 instructions are added to any given RV extension, their functionality
265 will be inherently parallelised.
267 ## Instruction Format
269 The instruction format for Simple-V does not actually have *any* explicit
270 compare operations, *any* arithmetic, floating point or *any*
271 memory instructions. There are in fact **no operations added at all**.
272 Instead it *overloads* pre-existing branch operations into predicated
273 variants, and implicitly overloads arithmetic operations, MV,
275 depending on CSR configurations for bitwidth and
276 predication. **Everything** becomes parallelised. *This includes
277 Compressed instructions* as well as any
278 future instructions and Custom Extensions.
282 NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
283 with the instruction format remaining the same.
285 VSETVL is slightly different from RVV in that the minimum vector length
286 is required to be at least the number of registers in the register file,
287 and no more than XLEN. This allows vector LOAD/STORE to be used to switch
the entire bank of registers using a single instruction (see Appendix,
"Context Switch Example"). The reason for limiting VSETVL to XLEN is
that predication bits must fit into a single register of length
XLEN.

The second change is that when VSETVL is requested to be stored
into x0, it is *ignored* silently (VSETVL x0, x5, #4).

The third change is that there is an additional immediate added to VSETVL,
to which VL is set after first going through MIN-filtering.
So when using the "vsetvl rs1, rs2, #vlen" instruction, it becomes:

    VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)

where RegfileLen <= MAXVECTORDEPTH < XLEN

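
The MIN-filtering can be expressed as a one-line Python function. This is only a sketch of the formula above: it takes the immediate, the value held in rs2 and MAXVECTORDEPTH as plain integers, and does not model the x0 special case or the CSR write. The function name `vsetvl` and its parameter names are illustrative.

```python
def vsetvl(vlen_imm, rs2_value, maxvectordepth):
    """VL is the requested immediate, capped by MAXVECTORDEPTH and by rs2."""
    return min(min(vlen_imm, maxvectordepth), rs2_value)


# RV32 example: 31 registers requested, 100 elements remaining, depth 32
vl = vsetvl(31, 100, 32)
```

Because the result is a pure function of the three inputs, software can rely on VL being exactly the requested value whenever neither limit applies.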
304 This has implication for the microarchitecture, as VL is required to be
305 set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
306 requested in the #immediate parameter. RVV has the option to set VL
307 to an arbitrary value that suits the conditions and the micro-architecture:
308 SV does *not* permit that.
310 The reason is so that if SV is to be used for a context-switch or as a
311 substitute for LOAD/STORE-Multiple, the operation can be done with only
312 2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
313 single LD/ST operation). If VL does *not* get set to the register file
314 length when VSETVL is called, then a software-loop would be needed.
315 To avoid this need, VL *must* be set to exactly what is requested
316 (limits notwithstanding).
318 Therefore, in turn, unlike RVV, implementors *must* provide
319 pseudo-parallelism (using sequential loops in hardware) if actual
320 hardware-parallelism in the ALUs is not deployed. A hybrid is also
321 permitted (as used in Broadcom's VideoCore-IV) however this must be
322 *entirely* transparent to the ISA.
324 ## Branch Instruction:
326 Branch operations use standard RV opcodes that are reinterpreted to
327 be "predicate variants" in the instance where either of the two src
328 registers are marked as vectors (isvector=1). When this reinterpretation
329 is enabled the "immediate" field of the branch operation is taken to be a
330 predication target register, rs3. The predicate target register rs3 is
331 to be treated as a bitfield (up to a maximum of XLEN bits corresponding
332 to a maximum of XLEN elements).
If either of src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
goes ahead as vector-scalar or scalar-vector. Implementors should note that
this could require considerable multi-porting of the register file in order
to parallelise properly, so may have to involve the use of register caching
and transparent copying (see Multiple-Banked Register File Architectures).

341 In instances where no vectorisation is detected on either src registers
342 the operation is treated as an absolutely standard scalar branch operation.
344 This is the overloaded table for Integer-base Branch operations. Opcode
345 (bits 6..0) is set in all cases to 1100011.

| 31 .. 25     | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8       | 7       | 6 .. 0 |
| ------------ | -------- | -------- | -------- | ------------- | ------- | ------ |
| imm[12,10:5] | rs2      | rs1      | funct3   | imm[4:1]      | imm[11] | opcode |
| 7            | 5        | 5        | 3        | 4             | 1       | 7      |
| reserved     | src2     | src1     | BPR      | predicate rs3 |         | BRANCH |
| reserved     | src2     | src1     | 000      | predicate rs3 |         | BEQ    |
| reserved     | src2     | src1     | 001      | predicate rs3 |         | BNE    |
| reserved     | src2     | src1     | 010      | predicate rs3 |         | rsvd   |
| reserved     | src2     | src1     | 011      | predicate rs3 |         | rsvd   |
| reserved     | src2     | src1     | 100      | predicate rs3 |         | BLT    |
| reserved     | src2     | src1     | 101      | predicate rs3 |         | BGE    |
| reserved     | src2     | src1     | 110      | predicate rs3 |         | BLTU   |
| reserved     | src2     | src1     | 111      | predicate rs3 |         | BGEU   |

Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.

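
To make the reinterpretation concrete, here is a small Python decoder for the overloaded integer branch format. It assumes that the predicate target rs3 is read as a single 5-bit number from bits 11..7 (the old imm[4:1] and imm[11] fields, most significant bit at bit 11); the proposal only states that these fields hold rs3, so that bit ordering is an assumption, as is the function name `decode_pred_branch`.

```python
BRANCH_OPCODE = 0b1100011
FUNCT3_NAMES = {0b000: "BEQ", 0b001: "BNE", 0b100: "BLT",
                0b101: "BGE", 0b110: "BLTU", 0b111: "BGEU"}


def decode_pred_branch(insn):
    """Decode a 32-bit branch as a predicated compare, or None if not a branch."""
    if insn & 0x7F != BRANCH_OPCODE:
        return None
    funct3 = (insn >> 12) & 0x7
    return {
        "cmp": FUNCT3_NAMES.get(funct3, "rsvd"),
        "src1": (insn >> 15) & 0x1F,
        "src2": (insn >> 20) & 0x1F,
        "rs3": (insn >> 7) & 0x1F,  # assumed layout of the predicate target
    }


fields = decode_pred_branch((6 << 20) | (5 << 15) | (0b100 << 12) | (9 << 7) | 0x63)
```

Bits 31..25 are ignored by the decoder, matching their "reserved" status in the table.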
366 Below is the overloaded table for Floating-point Predication operations.
367 Interestingly no change is needed to the instruction format because
368 FP Compare already stores a 1 or a zero in its "rd" integer register
369 target, i.e. it's not actually a Branch at all: it's a compare.
370 The target needs to simply change to be a predication bitfield (done
For standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
Likewise Single-precision (fmt bits 26..25) is still set to 00.
376 Double-precision is still set to 01, whilst Quad-precision
377 appears not to have a definition in V2.3-Draft (but should be unaffected).
379 It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
380 and whilst in ordinary branch code this is fine because the standard
381 RVF compare can always be followed up with an integer BEQ or a BNE (or
382 a compressed comparison to zero or non-zero), in predication terms that
383 becomes more of an impact. To deal with this, SV's predication has
384 had "invert" added to it.

| 31 .. 27 | 26 .. 25 | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 7  | 6 .. 0 |
| -------- | -------- | -------- | -------- | -------- | -------- | ------ |
| funct5   | fmt      | rs2      | rs1      | funct3   | rd       | opcode |
| 5        | 2        | 5        | 5        | 3        | 5        | 7      |
| 10100    | 00/01/11 | src2     | src1     | 010      | pred rs3 | FEQ    |
| 10100    | 00/01/11 | src2     | src1     | **011**  | pred rs3 | rsvd   |
| 10100    | 00/01/11 | src2     | src1     | 001      | pred rs3 | FLT    |
| 10100    | 00/01/11 | src2     | src1     | 000      | pred rs3 | FLE    |

396 Note (**TBD**): floating-point exceptions will need to be extended
397 to cater for multiple exceptions (and statuses of the same). The
398 usual approach is to have an array of status codes and bit-fields,
399 and one exception, rather than throw separate exceptions for each
In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
for predicated compare operations of function "cmp":

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);

410 With associated predication, vector-length adjustments and so on,
411 and temporarily ignoring bitwidth (which makes the comparisons more
412 complex), this becomes:

    if I/F == INT: # integer type cmp
        preg = int_pred_reg[rd]
        reg = int_regfile
    else:
        preg = fp_pred_reg[rd]
        reg = fp_regfile

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);
    if (!s2 && !s1) goto branch;
    for (int i = 0; i < VL; ++i)
        if (cmp(s1 ? reg[src1+i]:reg[src1],
                s2 ? reg[src2+i]:reg[src2]))
            preg[rs3] |= 1<<i; # bitfield not vector

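
The same compare loop, reduced to its essentials, can be run in Python. This sketch ignores bitwidth and predication of the compare itself, returns a fresh bitfield instead of OR-ing into an existing register, and raises an exception where the pseudo-code falls back to a scalar branch; those simplifications, and the name `predicated_compare`, are assumptions of the sketch.

```python
import operator


def predicated_compare(cmp, reg, src1, src2, s1, s2, vl):
    """Return a bitfield with bit i set where cmp holds for element i."""
    if not s1 and not s2:
        raise ValueError("no vector operand: treat as a standard scalar branch")
    result = 0
    for i in range(vl):
        a = reg[src1 + i] if s1 else reg[src1]
        b = reg[src2 + i] if s2 else reg[src2]
        if cmp(a, b):
            result |= 1 << i  # one bit per element, not one register per element
    return result


reg = [0] * 32
reg[8:12] = [1, 5, 3, 7]   # vector operand in x8..x11
reg[4] = 4                 # scalar operand in x4
bits = predicated_compare(operator.lt, reg, src1=8, src2=4, s1=True, s2=False, vl=4)
```

Here elements 0 and 2 are less than the scalar, so bits 0 and 2 of the result are set.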
431 * Predicated SIMD comparisons would break src1 and src2 further down
432 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
433 Reordering") setting Vector-Length times (number of SIMD elements) bits
434 in Predicate Register rs3 as opposed to just Vector-Length bits.
435 * Predicated Branches do not actually have an adjustment to the Program
436 Counter, so all of bits 25 through 30 in every case are not needed.
437 * There are plenty of reserved opcodes for which bits 25 through 30 could
438 be put to good use if there is a suitable use-case.
* FLT and FLE may be inverted to FGT and FGE if needed by swapping
  src1 and src2 (likewise the integer counterparts).

442 ## Compressed Branch Instruction:
444 Compressed Branch instructions are likewise re-interpreted as predicated
445 2-register operations, with the result going into rs3. All the bits of
446 the immediate are re-interpreted for different purposes, to extend the
447 number of comparator operations to beyond the original specification,
448 but also to cater for floating-point comparisons as well as integer ones.

| 15..13 | 12..10   | 9..7 | 6..5  | 4..2 | 1..0 | name |
| ------ | -------- | ---- | ----- | ---- | ---- | ---- |
| funct3 | imm      | rs1' | imm   |      | op   |      |
| 3      | 3        | 3    | 2     | 3    | 2    |      |
| C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
| 110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
| 111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
| 110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
| 111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |

* Bits 5, 13, 14 and 15 make up the comparator type
* Bit 6 indicates whether to use integer or floating-point comparisons
* In both floating-point and integer cases there are four predication
  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
  src1 and src2).

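
A decoder for the reinterpreted compressed format could be sketched as below. Field positions come from the table in this section; the register fields are returned as raw 3-bit values, and the I/F bit is returned unchanged because its polarity is not stated here. The function name `decode_c_pred_branch` is hypothetical.

```python
C_PRED_NAMES = {
    (0b110, 0): "P.EQ",
    (0b111, 0): "P.NE",
    (0b110, 1): "P.LT",
    (0b111, 1): "P.LE",
}


def decode_c_pred_branch(insn):
    """Decode a 16-bit C.BEQZ/C.BNEZ encoding as a predicated compare."""
    funct3 = (insn >> 13) & 0x7
    op = insn & 0x3
    if op != 0b01 or funct3 not in (0b110, 0b111):
        return None                      # not in the C1 branch space
    b = (insn >> 5) & 1                  # bit 5 extends the comparator type
    return {
        "name": C_PRED_NAMES[(funct3, b)],
        "rs3": (insn >> 10) & 0x7,       # predicate target (3-bit field)
        "src1": (insn >> 7) & 0x7,
        "if_bit": (insn >> 6) & 1,       # integer or floating-point select
        "src2": (insn >> 2) & 0x7,
    }


insn = (0b111 << 13) | (2 << 10) | (3 << 7) | (1 << 6) | (1 << 5) | (4 << 2) | 0b01
decoded = decode_c_pred_branch(insn)
```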
469 ## LOAD / STORE Instructions <a name="load_store"></a>
471 For full analysis of topological adaptation of RVV LOAD/STORE
472 see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
473 may be implicitly overloaded into the one base RV LOAD instruction,
474 and likewise for STORE.

| 31        | 30 | 29 .. 25 | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 7 | 6 .. 0 |
| --------- | -- | -------- | -------- | -------- | -------- | ------- | ------ |
| imm[11:0] |    |          |          | rs1      | funct3   | rd      | opcode |
| 1         | 1  | 5        | 5        | 5        | 3        | 5       | 7      |
| ?         | s  | rs2      | imm[4:0] | base     | width    | dest    | LOAD   |

485 The exact same corresponding adaptation is also carried out on the single,
486 double and quad precision floating-point LOAD-FP and STORE-FP operations,
487 which fit the exact same instruction format. Thus all three types
488 (unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
489 as well as FSW, FSD and FSQ.
493 * LOAD remains functionally (topologically) identical to RVV LOAD
494 (for both integer and floating-point variants).
495 * Predication CSR-marking register is not explicitly shown in instruction, it's
496 implicit based on the CSR predicate state for the rd (destination) register
497 * rs2, the source, may *also be marked as a vector*, which implicitly
498 is taken to indicate "Indexed Load" (LD.X)
499 * Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
500 * Bit 31 is reserved (ideas under consideration: auto-increment)
501 * **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
502 * **TODO**: clarify where width maps to elsize
Pseudo-code (excludes CSR SIMD bitwidth for simplicity):

    if (unit-strided) stride = elsize;
    else stride = areg[as2]; // constant-strided

    preg = int_pred_reg[rd]

    for (int i=0; i<vl; ++i)
        if ([!]preg[rd] & 1<<i)
            for (int j=0; j<seglen+1; j++)
            {
                if (CSRvectorised[rs2])
                    offs = vreg[rs2][i]; // indexed
                else
                    offs = i*(seglen+1)*stride;
                vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
            }

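
The three addressing modes (unit-stride, constant-stride and indexed) differ only in how the per-element offset is formed. The Python sketch below shows just that part, assuming seglen = 0, a memory modelled as a dictionary from address to value, and a predicate that has already been resolved; the function name `vector_load` and its parameters are illustrative only.

```python
def vector_load(mem, base_addr, vl, predicate, elsize, stride=None, index=None):
    """Return {element number: loaded value} for the enabled elements."""
    loaded = {}
    for i in range(vl):
        if not predicate & (1 << i):
            continue                      # element masked out
        if index is not None:
            offs = index[i]               # indexed: per-element offset
        elif stride is not None:
            offs = i * stride             # constant-stride
        else:
            offs = i * elsize             # unit-stride
        loaded[i] = mem[base_addr + offs]
    return loaded


mem = {addr: addr * 10 for addr in range(64)}
unit = vector_load(mem, base_addr=8, vl=4, predicate=0b1011, elsize=4)
strided = vector_load(mem, base_addr=8, vl=4, predicate=0b1011, elsize=4, stride=8)
indexed = vector_load(mem, base_addr=8, vl=4, predicate=0b1011, elsize=4,
                      index=[3, 0, 9, 1])
```

Element 2 is absent from every result because its predicate bit is clear.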
522 Taking CSR (SIMD) bitwidth into account involves using the vector
523 length and register encoding according to the "Bitwidth Virtual Register
524 Reordering" scheme shown in the Appendix (see function "regoffs").
526 A similar instruction exists for STORE, with identical topological
527 translation of all features. **TODO**
529 ## Compressed LOAD / STORE Instructions
531 Compressed LOAD and STORE are of the same format, where bits 2-4 are
532 a src register instead of dest:

| 15 .. 13 | 12 .. 10    | 9 .. 7 | 6 .. 5       | 4 .. 2 | 1 .. 0 |
| -------- | ----------- | ------ | ------------ | ------ | ------ |
| funct3   | imm         | rs1'   | imm          | rd'    | op     |
| 3        | 3           | 3      | 2            | 3      | 2      |
| C.LW     | offset[5:3] | base   | offset[2\|6] | dest   | C0     |

541 Unfortunately it is not possible to fit the full functionality
542 of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
543 require another operand (rs2) in addition to the operand width
544 (which is also missing), offset, base, and src/dest.
546 However a close approximation may be achieved by taking the top bit
547 of the offset in each of the five types of LD (and ST), reducing the
548 offset to 4 bits and utilising the 5th bit to indicate whether "stride"
549 is to be enabled. In this way it is at least possible to introduce
552 (**TODO**: *assess whether the loss of one bit from offset is worth having
553 "stride" capability.*)
555 We also assume (including for the "stride" variant) that the "width"
556 parameter, which is missing, is derived and implicit, just as it is
557 with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
558 and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
559 C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
561 Interestingly we note that the Vectorised Simple-V variant of
562 LOAD/STORE (Compressed and otherwise), due to it effectively using the
563 standard register file(s), is the direct functional equivalent of
564 standard load-multiple and store-multiple instructions found in other
Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments on
page 76, that "For virtual memory systems some data accesses could be resident
in physical memory and some not". The interesting question then arises:
how does RVV deal with the exact same scenario?
571 Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
572 of detecting early page / segmentation faults and adjusting the TLB
573 in advance, accordingly: other strategies are explored in the Appendix
574 Section "Virtual Memory Page Faults".
576 ## Vectorised Copy/Move (and conversion) instructions
578 There is a series of 2-operand instructions involving copying (and
579 alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
580 follow the same pattern, as it is *both* the source *and* destination
581 predication masks that are taken into account. This is different from
582 the three-operand arithmetic instructions, where the predication mask
583 is taken from the *destination* register, and applied uniformly to the
584 elements of the source register(s), element-for-element.
586 ### C.MV Instruction <a name="c_mv"></a>
588 There is no MV instruction in RV however there is a C.MV instruction.
589 It is used for copying integer-to-integer registers (vectorised FMV
590 is used for copying floating-point).
592 If either the source or the destination register are marked as vectors
593 C.MV is reinterpreted to be a vectorised (multi-register) predicated
594 move operation. The actual instruction's format does not change:

| 15 .. 12 | 11 .. 7 | 6 .. 2 | 1 .. 0 |
| -------- | ------- | ------ | ------ |
| funct4   | rd      | rs     | op     |
| 4        | 5       | 5      | 2      |
| C.MV     | dest    | src    | C2     |

A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
        rd_isvec = int_vec[rd].isvector; # capture before redirection
        rs_isvec = int_vec[rs].isvector;
        rd = rd_isvec ? int_vec[rd].regidx : rd;
        rs = rs_isvec ? int_vec[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (rs_isvec) while (i < VL && !(ps & 1<<i)) i++; # skip masked src
            if (rd_isvec) while (j < VL && !(pd & 1<<j)) j++; # skip masked dest
            if (i == VL || j == VL) break;
            ireg[rd+j] <= ireg[rs+i];
            if (rs_isvec) i++;
            if (rd_isvec) j++;

619 * elwidth (SIMD) is not covered above
* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
  not covered above

There are several different instructions from RVV that are covered by
this one instruction:

| src    | dest   | predication | op             |
| ------ | ------ | ----------- | -------------- |
| scalar | vector | none        | VSPLAT         |
| scalar | vector | destination | sparse VSPLAT  |
| scalar | vector | 1-bit dest  | VINSERT        |
| vector | scalar | 1-bit? src  | VEXTRACT       |
| vector | vector | none        | VCOPY          |
| vector | vector | src         | Vector Gather  |
| vector | vector | dest        | Vector Scatter |
| vector | vector | src & dest  | Gather/Scatter |
| vector | vector | src == dest | sparse VCOPY   |

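
Several rows of the table above can be reproduced with one small Python model of the twin-predicated move. The sketch takes the vector flags and the two predicate masks as arguments instead of looking them up through the CSR tables, bounds the mask-skipping loops by VL, and stops after one copy when both operands are scalar; these are simplifications made for the example, and the argument names are hypothetical.

```python
def op_mv(ireg, rd, rs, vl, rd_isvec, rs_isvec, pd, ps):
    """Copy elements from rs to rd, honouring both source and dest masks."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1                      # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1                      # skip masked-out dest elements
        if i >= vl or j >= vl:
            break
        ireg[rd + j] = ireg[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        if not rs_isvec and not rd_isvec:
            break                           # plain scalar move


splat = list(range(32))
op_mv(splat, rd=8, rs=3, vl=4, rd_isvec=True, rs_isvec=False, pd=0xF, ps=0xF)

gather = list(range(32))
op_mv(gather, rd=8, rs=16, vl=4, rd_isvec=True, rs_isvec=True, pd=0xF, ps=0b1010)

scatter = list(range(32))
op_mv(scatter, rd=8, rs=16, vl=4, rd_isvec=True, rs_isvec=True, pd=0b1010, ps=0xF)
```

The splat case fills x8..x11 with the value of x3, the gather case packs the enabled source elements into consecutive destination registers, and the scatter case spreads consecutive source elements over the enabled destination positions.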
639 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
640 operations with inversion on the src and dest predication for one of the
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
649 ### FMV, FNEG and FABS Instructions
651 These are identical in form to C.MV, except covering floating-point
652 register copying. The same double-predication rules also apply.
However, when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
655 operation of the appropriate size covering the source and destination
658 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
### FCVT Instructions
662 These are again identical in form to C.MV, except that they cover
663 floating-point to integer and integer to floating-point. When element
664 width in each vector is set to default, the instructions behave exactly
665 as they are defined for standard RV (scalar) operations, except vectorised
666 in exactly the same fashion as outlined in C.MV.
668 However when the source or destination element width is not set to default,
669 the opcode's explicit element widths are *over-ridden* to new definitions,
670 and the opcode's element width is taken as indicative of the SIMD width
671 (if applicable i.e. if packed SIMD is requested) instead.
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision (32-bit) floating-point number in rd.
675 If however the source rs1 is set to be a vector, where elwidth is set to
676 default/2 and "packed SIMD" is enabled, then the first 32 bits of
677 rs1 are converted to a floating-point number to be stored in rd's
678 first element and the higher 32-bits *also* converted to floating-point
679 and stored in the second. The 32 bit size comes from the fact that
680 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
681 divide that by two it means that rs1 element width is to be taken as 32.
683 Similar rules apply to the destination register.
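
The packed-SIMD reinterpretation of FCVT.S.L can be illustrated with a short Python sketch. It assumes XLEN = 64 with the element width halved to 32 bits, signed elements, and element 0 in the low-order bits of rs1; the element ordering and the function name `fcvt_packed` are assumptions. Python floats are double precision, so single-precision rounding is not modelled.

```python
def fcvt_packed(rs1_value, xlen=64, elwidth=32):
    """Convert each packed signed integer element of rs1 to a float."""
    count = xlen // elwidth
    mask = (1 << elwidth) - 1
    results = []
    for k in range(count):
        raw = (rs1_value >> (k * elwidth)) & mask
        if raw >> (elwidth - 1):          # sign bit set: two's complement
            raw -= 1 << elwidth
        results.append(float(raw))
    return results


# low 32 bits hold 7, high 32 bits hold -1
elements = fcvt_packed((0xFFFFFFFF << 32) | 7)
```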
687 > What does an ADD of two different-sized vectors do in simple-V?
* if the two source operands are not the same length, throw an exception.
690 * if the destination operand is also a vector, and the source is longer
691 than the destination, throw an exception.
693 > And what about instructions like JALR?
694 > What does jumping to a vector do?
696 * Throw an exception. Whether that actually results in spawning threads
697 as part of the trap-handling remains to be seen.
699 # Under consideration <a name="issues"></a>
701 ## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
703 One of the goals of this parallelism proposal is to avoid instruction
duplication. However, with the base ISA having been designed explicitly
to *avoid* condition-codes entirely, shoe-horning predication into it
becomes quite challenging.

708 However what if all branch instructions, if referencing a vectorised
709 register, were instead given *completely new analogous meanings* that
710 resulted in a parallel bit-wise predication register being set? This
711 would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
We might imagine that FEQ, FLT and FLE would also need to be converted,
715 however these are effectively *already* in the precise form needed and
716 do not need to be converted *at all*! The difference is that FEQ, FLT
717 and FLE *specifically* write a 1 to an integer register if the condition
718 holds, and 0 if not. All that needs to be done here is to say, "if
719 the integer register is tagged with a bit that says it is a predication
720 register, the **bit** in the integer register is set based on the
721 current vector index" instead.
723 There is, in the standard Conditional Branch instruction, more than
724 adequate space to interpret it in a similar fashion:

| 31              | 30 .. 25  | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8        | 7       | 6 .. 0 |
| --------------- | --------- | -------- | -------- | -------- | -------------- | ------- | ------ |
| imm[12]         | imm[10:5] | rs2      | rs1      | funct3   | imm[4:1]       | imm[11] | opcode |
| 1               | 6         | 5        | 5        | 3        | 4              | 1       | 7      |
| offset[12,10:5] |           | src2     | src1     | BEQ      | offset[11,4:1] |         | BRANCH |

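
With the offset fields reinterpreted, this becomes:

| 31       | 30 .. 25  | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8       | 7       | 6 .. 0 |
| -------- | --------- | -------- | -------- | -------- | ------------- | ------- | ------ |
| imm[12]  | imm[10:5] | rs2      | rs1      | funct3   | imm[4:1]      | imm[11] | opcode |
| 1        | 6         | 5        | 5        | 3        | 4             | 1       | 7      |
| reserved |           | src2     | src1     | BEQ      | predicate rs3 |         | BRANCH |
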
742 Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
743 with the interesting side-effect that there is space within what is presently
744 the "immediate offset" field to reinterpret that to add in not only a bit
745 field to distinguish between floating-point compare and integer compare,
746 not only to add in a second source register, but also use some of the bits as
747 a predication target as well.

| 15..13 | 12 .. 10      | 9 .. 7 | 6 .. 2            | 1 .. 0 |
| ------ | ------------- | ------ | ----------------- | ------ |
| funct3 | imm           | rs1'   | imm               | op     |
| 3      | 3             | 3      | 5                 | 2      |
| C.BEQZ | offset[8,4:3] | src    | offset[7:6,2:1,5] | C1     |

756 Now uses the CS format:

| 15..13 | 12 .. 10 | 9 .. 7 | 6 .. 5 | 4 .. 2 | 1 .. 0 |
| ------ | -------- | ------ | ------ | ------ | ------ |
| funct3 | imm      | rs1'   | imm    |        | op     |
| 3      | 3        | 3      | 2      | 3      | 2      |
| C.BEQZ | pred rs3 | src1   | I/F B  | src2   | C1     |

765 Bit 6 would be decoded as "operation refers to Integer or Float" including
766 interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
767 "C" Standard, version 2.0,
768 whilst Bit 5 would allow the operation to be extended, in combination with
769 funct3 = 110 or 111: a combination of four distinct (predicated) comparison
770 operators. In both floating-point and integer cases those could be
771 EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).