# Simple-V (Parallelism Extension Proposal) Specification (Abridged)

* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 25 jun 2019

[[!toc ]]

# Introduction

Simple-V is a uniform parallelism API for RISC-V hardware that allows
the Program Counter to enter "sub-contexts" in which, ultimately, standard
RISC-V scalar opcodes are executed.

The sub-context execution is "nested" in "re-entrant" form, in the
following order:

* Main standard RISC-V Program Counter (PC)
* VBLOCK sub-execution context (PCVBLK increments whilst PC is paused).
* VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause).
  Predication bits may be individually applied per element.
* SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses).
  Individual predicate bits from VL loops apply to the *group* of SUBVL
  elements.

An ancillary "SVPrefix" Format (P48/P64) [[sv_prefix_proposal]] may
run its own VL/SUBVL "loops" and specifies its own Register and Predication
format on the 32-bit RV scalar opcode embedded within it.

The [[vblock_format]] specifies how VBLOCK sub-execution contexts
operate.

SV is never actually switched "off". VL or SUBVL may be equal to 1, and
Register or Predicate over-ride tables may be empty: under such circumstances
the behaviour becomes effectively identical to standard RV execution, but
the SV context is still present.

Note: **there are *no* new opcodes**. The scheme works *entirely*
on hidden context that augments *scalar* RISC-V instructions. Thus it
may cover existing, future and custom scalar extensions, turning all
existing, all future and all custom scalar operations parallel, without
requiring any special opcodes to do so.

# CSRs <a name="csrs"></a>

There are five CSRs, available in any privilege level:

* MVL (the Maximum Vector Length)
* VL (which has different characteristics from standard CSRs)
* SUBVL (effectively a kind of SIMD)
* STATE (containing copies of MVL, VL and SUBVL as well as context information)
* PCVBLK (the current operation being executed within a VBLOCK Group)

For Privilege Levels (trap handling) there are the following CSRs,
where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor
Modes respectively:

* (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative
  to the start of the current VBLOCK Group, set on a trap).
* (x)eSTATE (useful for saving and restoring during context switch,
  and for providing fast transitions)

The u/m/s CSRs are treated and handled exactly like their (x)epc
equivalents. On entry to or exit from a privilege level, the contents
of its (x)eSTATE are swapped with STATE.

(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See VBLOCK section for details.

## MAXVECTORLENGTH (MVL) <a name="mvl" />

MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
is variable length and may be dynamically set. MVL is
however limited to the regfile bitwidth XLEN (1-32 for RV32,
1-64 for RV64 and so on).

## Vector Length (VL) <a name="vl" />

VSETVL is slightly different from RVV. Similar to RVV, VL is set to be within
the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN):

    VL = rd = MIN(vlen, MVL)

where 1 <= MVL <= XLEN

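The VL rule above can be sketched in a few lines of Python. This is only an illustration: the function name `setvl`, the `xlen` default and the lower clamp to 1 are assumptions made here (the text states the range 1 <= VL <= MVL but does not say how a request of zero is handled).

```python
def setvl(vlen: int, mvl: int, xlen: int = 64) -> int:
    """Return the value written to both VL and rd for a requested length."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must satisfy 1 <= MVL <= XLEN")
    # VL = rd = MIN(vlen, MVL); the lower clamp to 1 is an assumption
    return max(1, min(vlen, mvl))
```

A request larger than MVL is simply cut down to MVL, which is what lets a software loop ask for "as many elements as remain" on every pass.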
## SUBVL - Sub Vector Length

This is a "group by quantity" that asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
operation issued, SUBVL operations are issued.

The main effect of SUBVL is that predication bits are applied per
**group**, rather than by individual element.

## STATE

This is a standard CSR that contains sufficient information for a
full context save/restore. It contains (and permits setting of):

* MVL
* VL
* destoffs - the destination element offset of the current parallel
  instruction being executed
* srcoffs - for twin-predication, the source element offset as well.
* SUBVL
* svdestoffs - the subvector destination element offset of the current
  parallel instruction being executed
* svsrcoffs - for twin-predication, the subvector source element offset
  as well.

The format of the STATE CSR is as follows:

| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
| -------- | -------- | -------- | -------- | -------- | ------- | ------ |
| dsvoffs  | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl  |

Notes:

* dsvoffs and ssvoffs in the table hold svdestoffs and svsrcoffs
  respectively.
* The entries are truncated to be within range. Attempts to set VL to
  greater than MAXVL will truncate VL.
* Both VL and MAXVL are stored offset by one. 0b000000 represents VL=1,
  0b000001 represents VL=2. This allows the full range 1 to XLEN instead
  of 0 to only 63.

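As a rough illustration of the layout and the offset-by-one storage, the following Python sketch packs and unpacks a STATE word. The names `FIELDS`, `pack_state` and `unpack_state` are hypothetical, and treating subvl and the offset fields as raw (not offset-by-one) values is an assumption, since the notes only mention VL and MAXVL. The truncation of VL to MAXVL is not modelled.

```python
FIELDS = {  # name: (low bit, width), taken from the STATE table
    "maxvl": (0, 6), "vl": (6, 6), "srcoffs": (12, 6), "destoffs": (18, 6),
    "subvl": (24, 2), "ssvoffs": (26, 2), "dsvoffs": (28, 2),
}
OFFSET_BY_ONE = {"vl", "maxvl"}  # stored as (value - 1) per the notes

def pack_state(**values: int) -> int:
    """Build a STATE word; unspecified fields default to VL=MAXVL=1, else 0."""
    word = 0
    for name, (lo, width) in FIELDS.items():
        raw = values.get(name, 1 if name in OFFSET_BY_ONE else 0)
        if name in OFFSET_BY_ONE:
            raw -= 1
        word |= (raw & ((1 << width) - 1)) << lo  # mask to the field width
    return word

def unpack_state(word: int) -> dict:
    """Split a STATE word back into its named fields."""
    out = {}
    for name, (lo, width) in FIELDS.items():
        raw = (word >> lo) & ((1 << width) - 1)
        out[name] = raw + 1 if name in OFFSET_BY_ONE else raw
    return out
```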
## VL, MVL and SUBVL instruction aliases

This table contains pseudo-assembly instruction aliases. Note the
subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
reduced range of the 5 bit immediate.

| alias           | CSR                  |
| -               | -                    |
| SETVL rd, rs    | CSRRW VL, rd, rs     |
| SETVLi rd, #n   | CSRRWI VL, rd, #n-1  |
| GETVL rd        | CSRRW VL, rd, x0     |
| SETMVL rd, rs   | CSRRW MVL, rd, rs    |
| SETMVLi rd, #n  | CSRRWI MVL, rd, #n-1 |
| GETMVL rd       | CSRRW MVL, rd, x0    |

Note: CSRRC and the other bit-setting CSR operations may still be used;
they are however not particularly useful (very obscure).

## Register key-value (CAM) table <a name="regcsrtable" />

The purpose of the Register table is to mark which registers change behaviour
if used in a "Standard" (normally scalar) opcode.

16 bit format:

| RegCAM | 15     | (14..8) | 7   | (6..5) | (4..0) |
| ------ | -      | -       | -   | ------ | ------ |
| 0      | isvec0 | regidx0 | i/f | vew0   | regkey |
| 1      | isvec1 | regidx1 | i/f | vew1   | regkey |
| 2      | isvec2 | regidx2 | i/f | vew2   | regkey |
| 3      | isvec3 | regidx3 | i/f | vew3   | regkey |

8 bit format:

| RegCAM | | 7   | (6..5) | (4..0) |
| ------ | | -   | ------ | ------ |
| 0      | | i/f | vew0   | regnum |

Mapping the 8-bit to 16-bit format:

| RegCAM | 15      | (14..8)    | 7   | (6..5) | (4..0)  |
| ------ | -       | -          | -   | ------ | ------- |
| 0      | isvec=1 | regnum0<<2 | i/f | vew0   | regnum0 |
| 1      | isvec=1 | regnum1<<2 | i/f | vew1   | regnum1 |
| 2      | isvec=1 | regnum2<<2 | i/f | vew2   | regnum2 |
| 3      | isvec=1 | regnum3<<2 | i/f | vew3   | regnum3 |

Fields:

* i/f is set to "1" to indicate that the redirection/tag entry is to
  be applied to integer registers; 0 indicates that it is relevant to
  floating-point registers.
* isvec indicates that the register (whether a src or dest) is to progress
  incrementally forward on each loop iteration. This gives the "effect"
  of vectorisation. isvec set to zero indicates "do not progress", giving
  the "effect" of that register being scalar.
* vew overrides the operation's default width. See table below.
* regkey is the register which, if encountered in an op (as src or dest),
  is to be "redirected".
* in the 16-bit format, regidx is the *actual* register to be used
  for the operation (note that it is 7 bits wide).

| vew | bitwidth            |
| --- | ------------------- |
| 00  | default (XLEN/FLEN) |
| 01  | 8 bit               |
| 10  | 16 bit              |
| 11  | 32 bit              |

As the above table is a CAM (key-value store) it may be appropriate
(faster, fewer gates, implementation-wise) to expand it as follows:

    struct vectorised {
        bool isvector:1;
        int  elwidth:2;   // the vew field
        bool enabled:1;
        int  regidx:7;    // redirection: actual register to use
    }

    struct vectorised fp_vec[32], int_vec[32];

    for (i = 0; i < len; i++) // from VBLOCK Format
       tb = int_vec if CSRvec[i].type == 0 else fp_vec
       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
       tb[idx].elwidth  = CSRvec[i].elwidth
       tb[idx].regidx   = CSRvec[i].regidx  // indirection
       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
       tb[idx].enabled  = true;

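The bit layouts in the tables above can be exercised with a short Python sketch. `RegEntry`, `decode16` and `expand8` are names invented here for illustration; the field positions and the `regnum<<2` mapping are taken directly from the tables.

```python
from typing import NamedTuple

class RegEntry(NamedTuple):
    isvec: int    # bit 15
    regidx: int   # bits 14..8: actual register (7 bits)
    i_f: int      # bit 7
    vew: int      # bits 6..5
    regkey: int   # bits 4..0: register number matched in the opcode

def decode16(entry: int) -> RegEntry:
    """Split a 16-bit Register table entry into its fields."""
    return RegEntry(isvec=(entry >> 15) & 1, regidx=(entry >> 8) & 0x7F,
                    i_f=(entry >> 7) & 1, vew=(entry >> 5) & 3,
                    regkey=entry & 0x1F)

def expand8(entry: int) -> int:
    """Map an 8-bit entry to 16-bit form: isvec=1, regidx=regnum<<2."""
    regnum = entry & 0x1F
    return (1 << 15) | ((regnum << 2) << 8) | (entry & 0xFF)
```

For example, an 8-bit entry naming register 3 expands to a vector entry whose actual register is 12, so consecutive keys land four registers apart.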
## Predication Table <a name="predication_csr_table"></a>

The Predication Table is a key-value store indicating whether, if a
given destination register (integer or floating-point) is referred to
in an instruction, it is to be predicated. Like the Register table, it
is an indirect lookup that allows the RV opcodes to not need modification.

* regidx is the lookup key, in combination with the i/f flag: if that
  integer or floating-point register is referred to in a (standard RV)
  instruction, the lookup table is referenced to find the predication
  mask to use for this operation.
* predidx is the *actual* (full, 7 bit) register to be used for the
  predication mask.
* inv indicates that the predication mask bits are to be inverted
  prior to use *without* actually modifying the contents of the
  register from which those bits originated.
* zeroing is either 1 or 0, and if set to 1, the operation must
  place zeros in any element position where the predication mask is
  set to zero. If zeroing is set to 0, unpredicated elements *must*
  be left alone (unaltered), even when elwidth != default.
* ffirst is a special mode that stops sequential element processing when
  a data-dependent condition occurs, whether a trap or a conditional test.
  The handling of each (trap or conditional test) is slightly different:
  see Instruction sections for further details.

16 bit format:

| PrCSR | (15..11) | 10    | 9    | 8   | (7..1) | 0       |
| ----- | -        | -     | -    | -   | ------ | ------- |
| 0     | predidx  | zero0 | inv0 | i/f | regidx | ffirst0 |
| 1     | predidx  | zero1 | inv1 | i/f | regidx | ffirst1 |
| 2     | predidx  | zero2 | inv2 | i/f | regidx | ffirst2 |
| 3     | predidx  | zero3 | inv3 | i/f | regidx | ffirst3 |

Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding. Its use must
generate an illegal instruction trap.

8 bit format:

| PrCSR | 7     | 6    | 5   | (4..0) |
| ----- | -     | -    | -   | ------ |
| 0     | zero0 | inv0 | i/f | regnum |

Mapping from 8 to 16 bit format, the table becomes:

| PrCSR | (15..11) | 10    | 9    | 8   | (7..1) | 0    |
| ----- | -        | -     | -    | -   | ------ | ---- |
| 0     | x9       | zero0 | inv0 | i/f | regnum | ff=0 |
| 1     | x10      | zero1 | inv1 | i/f | regnum | ff=0 |
| 2     | x11      | zero2 | inv2 | i/f | regnum | ff=0 |
| 3     | x12      | zero3 | inv3 | i/f | regnum | ff=0 |

Pseudocode for predication:

    struct pred {
        bool zero;    // zeroing
        bool inv;     // register at predidx is inverted
        bool ffirst;  // fail-on-first
        bool enabled; // use this to tell if the table-entry is active
        int predidx;  // redirection: actual int register to use
    }

    struct pred fp_pred_reg[32];
    struct pred int_pred_reg[32];

    for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
      tb = int_pred_reg if PredicateTable[i].type == 0 else fp_pred_reg;
      idx = PredicateTable[i].regidx
      tb[idx].zero     = PredicateTable[i].zero
      tb[idx].inv      = PredicateTable[i].inv
      tb[idx].ffirst   = PredicateTable[i].ffirst
      tb[idx].predidx  = PredicateTable[i].predidx
      tb[idx].enabled  = true

    def get_pred_val(bool is_fp_op, int reg):
       tb = fp_vec if is_fp_op else int_vec     // Register table first
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       tb = fp_pred_reg if is_fp_op else int_pred_reg
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       predidx = tb[reg].predidx   // redirection occurs HERE
       predicate = intreg[predidx] // actual predicate HERE
       if (tb[reg].inv):
          predicate = ~predicate   // invert ALL bits
       return predicate, tb[reg].zero

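A minimal runnable model of the lookup, restricted to integer operations, might look as follows. The container names (`intreg`, `int_vec_enabled`, `int_pred_reg`), the `Pred` dataclass and the fixed `XLEN = 64` are assumptions of this sketch, not part of the specification.

```python
from dataclasses import dataclass

@dataclass
class Pred:
    enabled: bool = False
    zero: bool = False
    inv: bool = False
    predidx: int = 0

XLEN = 64
MASK = (1 << XLEN) - 1
intreg = [0] * 128                       # integer register file (x0 stays 0)
int_vec_enabled = [False] * 32           # Register table "enabled" flags
int_pred_reg = [Pred() for _ in range(32)]

def get_pred_val(reg: int):
    """Return (predicate mask, zeroing) for an integer-op register."""
    if not int_vec_enabled[reg]:
        return MASK, False               # no register entry: all elements on
    entry = int_pred_reg[reg]
    if not entry.enabled:
        return MASK, False               # no predicate entry: all elements on
    predicate = intreg[entry.predidx]    # redirection happens here
    if entry.inv:
        predicate = ~predicate & MASK    # inverted copy; register untouched
    return predicate, entry.zero
```

Pointing an entry at x0 with `inv` set yields a mask of all ones, since x0 always reads as zero.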
## Fail-on-First Mode <a name="ffirst-mode"></a>

ffirst is a special data-dependent predicate mode. There are two
variants. One is for faults: typically for LOAD/STORE operations,
which may encounter end of page faults during a series of operations.
The other variant is for comparisons such as FEQ (or the augmented behaviour
of Branch), and any operation that returns a result of zero (whether
integer or floating-point). In the FP case, this includes negative-zero.

Note that the execution order must "appear" to be sequential for ffirst
mode to work correctly. An in-order architecture must execute the element
operations in sequence, whilst an out-of-order architecture must *commit*
the element operations in sequence (giving the appearance of in-order
execution).

Note also that if ffirst mode is needed without predication, a special
"always-on" Predicate Table Entry may be constructed by setting
inverse-on and using x0 as the predicate register. This
will have the effect of creating a mask of all ones, allowing ffirst
to be set.

### Fail-on-first traps

Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element instruction require a trap,
instead it and subsequent indexed elements are ignored (or cancelled in
out-of-order designs), and VL is truncated to end at the *last* element
that did not take the trap.

Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur). However, note that
the loop still had to test the predicate bit: thus on return,
VL is set to include elements that did not take the trap *and* the
elements that were predicated (masked) out (not tested), up to the
point where the trap occurred.

If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst is not set); subsequently,
the trap must not occur in any later *sub-group* of elements. SUBVL
will **NOT** be modified.

Given that predication bits apply to SUBVL groups, the same rules apply
to predicated-out (masked-out) sub-groups in calculating the value that VL
is set to.

### Fail-on-first conditional tests

ffirst stops sequential element conditional testing on the first element result
being zero. VL is set to the number of elements that were processed before
the fail-condition was encountered.

Note that just as with traps, if SUBVL!=1, the first failure of any element
in a *sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL). i.e. VL is set
to the total number of *sub-groups* that had no fail-condition up until
execution was stopped.

Note again that, just as with traps, predicated-out (masked-out) elements
are included in the count leading up to the fail-condition, even though they
were not tested.

The pseudo-code for Predication makes this clearer and simpler than it is
in words (the loop ends, VL is set to the current element index, "i").

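The conditional-test variant can be sketched in Python as below, for SUBVL=1 only. The function name `ffirst_compare` and the representation of element results as a list are assumptions made for illustration; what happens when the very first element fails is not covered by the text above, so that case is not modelled specially here.

```python
def ffirst_compare(values, predicate, vl):
    """Return the new VL after a fail-on-first conditional test.

    values[i] is the result of element i's operation; zero means "fail".
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue          # masked out: not tested, but still counted
        if values[i] == 0:
            return i          # loop ends, VL is set to the element index
    return vl                 # no fail-condition: VL is unchanged
```

With results `[1, 0, 0, 1]` and predicate `0b1101`, element 1 is masked out and never tested, so the first failure is seen at element 2 and VL becomes 2.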
# Instructions <a name="instructions" />

To illustrate how Scalar operations are turned "vector" and "predicated",
simplified example pseudo-code for an integer ADD operation is shown below.
Floating-point would use the FP Register Table.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
        if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }

Note that for simplicity there is quite a lot missing from the above
pseudo-code.

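For readers who want to step through the loop, here is a small Python model of the same ADD. It is a sketch under assumptions: `RegTag` and the dictionary-based `int_vec` are invented for the example, the predicate is passed in directly rather than looked up, and element width, zeroing and STATE saving are all left out.

```python
from dataclasses import dataclass

@dataclass
class RegTag:
    isvector: bool = False
    regidx: int = 0

def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval=~0):
    """Scalar ADD applied as a predicated element loop."""
    regs = (rd, rs1, rs2)
    tags = [int_vec.get(r, RegTag()) for r in regs]
    # redirect each vectorised register to its actual base register
    base = [t.regidx if t.isvector else r for t, r in zip(tags, regs)]
    offs = [0, 0, 0]
    for i in range(vl):
        if (predval >> i) & 1:
            ireg[base[0] + offs[0]] = (ireg[base[1] + offs[1]]
                                       + ireg[base[2] + offs[2]])
        if not tags[0].isvector:
            break                        # scalar destination: one result only
        for k in range(3):
            if tags[k].isvector:
                offs[k] += 1             # only vector registers progress
```

Marking x1 and x2 as vectors while leaving x3 untagged turns a plain `add x1, x2, x3` into "add the scalar x3 to every element".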
## SUBVL Pseudocode <a name="subvl-pseudocode"></a>

Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
inner part. Note that the predication is still taken from the VL index.

So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
            # actual add is here (at last)
            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
          if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }

NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
elwidth handling etc. all left out.

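The indexing rule (elements by `i * SUBVL + s`, predicate bits by `i`) is easy to check with a short Python sketch. This is an illustration only: `subvl_add` is a hypothetical helper that works on plain lists, with all three operands treated as vectors.

```python
def subvl_add(dest, src1, src2, vl, subvl, predval):
    """Element (i*subvl + s) is written only when predicate bit i is set."""
    for i in range(vl):                 # VL loop: one predicate bit each
        for s in range(subvl):          # SUBVL loop: shares predicate bit i
            if (predval >> i) & 1:
                e = i * subvl + s
                dest[e] = src1[e] + src2[e]
    return dest
```

With VL=3, SUBVL=2 and a predicate of `0b101`, the middle pair of elements is skipped as a group while the first and last pairs are both written.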
## Instruction Format

It is critical to appreciate that there are
**no operations added to SV, at all**.

Examples are given below where "standard" RV scalar behaviour is augmented.

## Branch Instructions

Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
See ffirst mode in the Predication Table section.

### Standard Branch <a name="standard_branch"></a>

Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers are marked as vectors (active=1, vector=1).

Note that the predication register to use (if one is enabled) is taken from
the *first* src register, and that this is used, just as with predicated
arithmetic operations, to mask whether the comparison operations take
place or not. If the second register is also marked as predicated,
that (scalar) predicate register is used as a **destination** to store
the results of all the comparisons.

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).

Pseudo-code for branch:

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
      if (zeroing)
        if not (ps & (1<<i))
           result &= ~(1<<i);
      else if (ps & (1<<i))
          if (cmp(s1 ? reg[src1+i]:reg[src1],
                  s2 ? reg[src2+i]:reg[src2]))
              result |= 1<<i;
          else
              result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch

Notes:

* Predicated SIMD comparisons would break src1 and src2 further down
  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
  Reordering") setting Vector-Length times (number of SIMD elements) bits
  in Predicate Register rd, as opposed to just Vector-Length bits.
* The execution of "parallelised" instructions **must** be implemented
  as "re-entrant" (to use a term from software). If an exception (trap)
  occurs during the middle of a vectorised
  Branch (now a SV predicated compare) operation, the partial results
  of any comparisons must be written out to the destination
  register before the trap is permitted to begin. If however there
  is no predicate, the **entire** set of comparisons must be **restarted**,
  with the offset loop indices set back to zero. This is because
  there is no place to store the temporary result during the handling
  of traps.

Note also that where normally, predication requires that there must
also be a CSR register entry for the register being used in order
for the **predication** CSR register entry to also be active,
for branches this is **not** the case. src2 does **not** have
to have its CSR register entry marked as active in order for
predication on src2 to be active.

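The "branch only if all unmasked tests succeed" rule can be modelled with the Python sketch below. Several things here are assumptions rather than specification: the function name `sv_branch`, passing vectors as lists and scalars as ints, masking the predicate to VL bits before the final comparison, and performing the comparison on active elements regardless of the zeroing setting.

```python
def sv_branch(cmp, src1, src2, vl, ps=~0, dest=None, zeroing=False):
    """Return (branch_taken, result_mask) for a vectorised compare.

    src1/src2 are lists (vector) or ints (scalar); dest is the prior
    contents of the optional destination predicate register.
    """
    ps &= (1 << vl) - 1                  # only VL predicate bits matter here
    result = 0 if dest is None or zeroing else dest
    for i in range(vl):
        bit = 1 << i
        if not ps & bit:
            if zeroing:
                result &= ~bit           # masked-out element is zeroed
            continue                     # masked-out element is not compared
        a = src1[i] if isinstance(src1, list) else src1
        b = src2[i] if isinstance(src2, list) else src2
        if cmp(a, b):
            result |= bit
        else:
            result &= ~bit
    return result == ps, result
```

A failing element that is predicated out does not prevent the branch, because the result mask is compared against the predicate rather than against all ones.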
### Floating-point Comparisons

There are no floating-point branch operations, only compare operations.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.

As RV Scalar does not have "FNE", predication inversion must be used.
Also: note that FP Compare may be predicated, using the destination
integer register (rd) to determine the predicate. FP Compare is **not**
a twin-predication operation, as, again, just as with SV Branches,
there are three registers involved: FP src1, FP src2 and INT rd.

Also: note that ffirst (fail first mode) applies directly to this operation.

### Compressed Branch Instruction

Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.

The specific required use of x0 is, with a little thought, quite obvious,
but is counterintuitive. Clearly it is **not** recommended to redirect
x0 with a CSR register entry, however as a means to opaquely obtain
a predication target it is the only sensible option that does not involve
additional special CSRs (or, worse, additional special opcodes).

Note also that, just as with standard branches, the 2nd source
(in this case x0 rather than src2) does **not** have to have its CSR
register table marked as "active" in order for predication to work.

## Vectorised Dual-operand instructions

There is a series of 2-operand instructions involving copying (and
sometimes alteration):

* C.MV
* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
* LOAD(-FP) and STORE(-FP)

All of these operations follow the same two-operand pattern, so it is
*both* the source *and* destination predication masks that are taken into
account. This is different from
the three-operand arithmetic instructions, where the predication mask
is taken from the *destination* register, and applied uniformly to the
elements of the source register(s), element-for-element.

The pseudo-code pattern for twin-predicated operations is as
follows:

    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break

This pattern covers scalar-scalar, scalar-vector, vector-scalar
and vector-vector, and predicated variants of all of those.
Zeroing is not presently included (TODO). As such, when compared
to RVV, the twin-predicated variants of C.MV and FMV cover
**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.

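To see how one loop yields splat, gather and extract behaviour, the twin-predicated pattern can be modelled in Python. This is a sketch under assumptions: `twin_pred_mv` is a hypothetical name, the operation is a plain copy, register redirection is omitted, and explicit bounds checks on the skip loops have been added so the model cannot run past VL.

```python
def twin_pred_mv(reg, rd, rs, vl, rd_isvec, rs_isvec, pd=~0, ps=~0):
    """Copy reg[rs+i] to reg[rd+j], skipping masked-out src/dest elements."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1                  # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1                  # skip masked-out destination elements
        if i >= vl or j >= vl:
            break                       # bounds check added in this sketch
        reg[rd + j] = reg[rs + i]
        if rs_isvec:
            i += 1
        if not rd_isvec:
            break                       # scalar destination: single copy
        j += 1
```

A scalar source with a vector destination behaves as a splat, a source predicate alone packs the selected elements together (gather), and a vector source with a scalar destination picks out the first unmasked element (extract).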
### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change.

There are several different instructions from RVV that are covered by
this one opcode:

[[!table  data="""
src    | dest    | predication   | op             |
scalar | vector  | none          | VSPLAT         |
scalar | vector  | destination   | sparse VSPLAT  |
scalar | vector  | 1-bit dest    | VINSERT        |
vector | scalar  | 1-bit? src    | VEXTRACT       |
vector | vector  | none          | VCOPY          |
vector | vector  | src           | Vector Gather  |
vector | vector  | dest          | Vector Scatter |
vector | vector  | src & dest    | Gather/Scatter |
vector | vector  | src == dest   | sparse VCOPY   |
"""]]

Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
operations with zeroing off, and inversion on the src and dest
predication for one of the two C.MV operations.

### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions)

### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point. When element
width in each vector is set to default, the instructions behave exactly
as they are defined for standard RV (scalar) operations, except vectorised
in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to new definitions,
and the opcode's element width is taken as indicative of the SIMD width
(if applicable i.e. if packed SIMD is requested) instead.

## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>

In vectorised architectures there are usually at least two different modes
for LOAD/STORE:

* Read (or write for STORE) from sequential locations, where one
  register specifies the address, and the one address is incremented
  by a fixed amount. This is usually known as "Unit Stride" mode.
* Read (or write) from multiple indirected addresses, where the
  vector elements each specify separate and distinct addresses.

To support these different addressing modes, the CSR Register "isvector"
bit is used. So, for a LOAD, when the src register is set to
scalar, the LOADs are sequentially incremented by the src register
element width, and when the src register is set to "vector", the
elements are treated as indirection addresses. Simplified
pseudo-code would look like this:

    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode): src is a vector of addresses
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;

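The two addressing modes can be contrasted with a deliberately reduced Python model. It assumes a vector destination and no predication, represents memory as a dictionary from byte address to value, and advances the unit-stride offset with the destination element index; the name `op_ld` mirrors the pseudo-code but the signature is invented for this sketch.

```python
def op_ld(ireg, mem, rd, rs, vl, rs_isvec, imm_offs=0, elbytes=8):
    """Load vl elements into ireg[rd...]; mem maps byte address -> value."""
    for j in range(vl):
        if rs_isvec:
            addr = ireg[rs + j]             # indirect: one address per element
        else:
            addr = ireg[rs] + j * elbytes   # unit stride from a single base
        ireg[rd + j] = mem[addr + imm_offs]
```

The same scalar LD opcode therefore acts as a contiguous vector load or as a gather, depending only on whether the source register is tagged as a vector.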
## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>

C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
It is therefore possible to use predicated C.LWSP to efficiently
pop registers off the stack (by predicating x2 as the source), cherry-picking
which registers to store to (by predicating the destination). Likewise
for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.

**Note**: it is still possible to redirect x2 to an alternative target
register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
general-purpose LOAD/STORE operations.

## Compressed LOAD / STORE Instructions

Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD and dest for STORE switches mode from "Unit Stride"
to "Multi-indirection", respectively.

713 # Element bitwidth polymorphism <a name="elwidth"></a>
714
715 Element bitwidth is best covered as its own special section, as it
716 is quite involved and applies uniformly across-the-board. SV restricts
717 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
718
719 The effect of setting an element bitwidth is to re-cast each entry
720 in the register table, and for all memory operations involving
721 load/stores of certain specific sizes, to a completely different width.
722 Thus In c-style terms, on an RV64 architecture, effectively each register
723 now looks like this:
724
725 typedef union {
726 uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
727 uint8_t b[0]; // array of type uint8_t
728 uint16_t s[0];
729 uint32_t i[0];
730 uint64_t l[0];
731 uint128_t d[0];
732 } reg_t;
733
734 reg_t int_regfile[128];
735
736 Implementors must ensure that over-runs of the register file throw
737 an exception.
738
739 The pseudo-code is as follows, to demonstrate how the sign-extending
740 and width-extending works:
741
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val
787
    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = bw(int_csr[rd].elwidth)     # destination element width
    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication uses intregs
         // TODO, calculate if over-run occurs, for each elwidth
         src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
         // TODO, sign/zero-extend src1 and src2 as operation requires
         if (op_requires_sign_extend_src1)
            src1 = sign_extend(src1, maxsrcwid)
         src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
         result = src1 + src2 # actual add here
         // TODO, sign/zero-extend result, as operation requires
         if (op_requires_sign_extend_dest)
            result = sign_extend(result, maxsrcwid)
         set_polymorphed_reg(rd, destwid, ird, result)
         if (!int_vec[rd].isvector) break
      if (int_vec[rd ].isvector)  { ird += 1; }
      if (int_vec[rs1].isvector)  { irs1 += 1; }
      if (int_vec[rs2].isvector)  { irs2 += 1; }
807
808 ## Polymorphic floating-point operation exceptions and error-handling
809
For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NaN (or an appropriate value) is stored if the result
is beyond the range of the destination and, just as with scalar
operations, the floating-point flag is raised (FCSR). Again, just as
with scalar operations, it is software's responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
819
Note that it is perfectly legitimate for floating-point bitwidths of
only 8 to be specified. However, whilst it is possible to apply IEEE 754
principles, no actual standard yet exists. Implementors wishing to
provide hardware-level 8-bit support rather than throw a trap to emulate
in software should contact the author of this specification before
proceeding.
826
827 ## Polymorphic shift operators
828
829 A special note is needed for changing the element width of left and right
830 shift operators, particularly right-shift.
831
832 For SV, where each operand's element bitwidth may be over-ridden, the
833 rule about determining the operation's bitwidth *still applies*, being
834 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
835 **also applies to the truncation of RS2**. In other words, *after*
836 determining the maximum bitwidth, RS2's range must **also be truncated**
837 to ensure a correct answer. Example:
838
839 * RS1 is over-ridden to a 16-bit width
840 * RS2 is over-ridden to an 8-bit width
841 * RD is over-ridden to a 64-bit width
842 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
843 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
844
845 Pseudocode (in spike) for this example would therefore be:
846
    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
848
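The truncation rule can be sketched in Python. This is an illustration
under stated assumptions rather than spike code: RV64 is assumed, the
helper names `zext`, `sext` and `polymorphic_sll` are hypothetical, and
RS1 is assumed to be zero-extended to the operation width, as in the
example above.

```python
XLEN = 64  # assumption: RV64

def zext(val, bits):
    """Keep only the low `bits` bits (zero-extension)."""
    return val & ((1 << bits) - 1)

def sext(val, bits):
    """Interpret the low `bits` bits of val as a signed two's complement value."""
    val = zext(val, bits)
    sign = 1 << (bits - 1)
    return (val ^ sign) - sign

def polymorphic_sll(rs1_val, rs2_val, rs1_wid, rs2_wid):
    # the operation width is the maximum of the two source element widths
    opwid = max(rs1_wid, rs2_wid)
    # the shift amount is truncated to the range 0 .. opwid-1
    shamt = rs2_val & (opwid - 1)
    # assumption: RS1 is zero-extended to the operation width before shifting
    return sext(zext(rs1_val, opwid) << shamt, XLEN)
```

With a 16-bit RS1 and an 8-bit RS2, a shift amount of 17 is truncated to
17 & 15 = 1, so `polymorphic_sll(0x0001, 17, 16, 8)` gives 2.
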
849 ## Polymorphic MULH/MULHU/MULHSU
850
851 MULH is designed to take the top half MSBs of a multiply that
852 does not fit within the range of the source operands, such that
853 smaller width operations may produce a full double-width multiply
854 in two cycles. The issue is: SV allows the source operands to
855 have variable bitwidth.
856
857 Here again special attention has to be paid to the rules regarding
858 bitwidth, which, again, are that the operation is performed at
859 the maximum bitwidth of the **source** registers. Therefore:
860
861 * An 8-bit x 8-bit multiply will create a 16-bit result that must
862 be shifted down by 8 bits
863 * A 16-bit x 8-bit multiply will create a 24-bit result that must
864 be shifted down by 16 bits (top 8 bits being zero)
865 * A 16-bit x 16-bit multiply will create a 32-bit result that must
866 be shifted down by 16 bits
867 * A 32-bit x 16-bit multiply will create a 48-bit result that must
868 be shifted down by 32 bits
869 * A 32-bit x 8-bit multiply will create a 40-bit result that must
870 be shifted down by 32 bits
871
So again, just as with shift-left and shift-right, the result
is shifted down by the maximum of the two source register bitwidths.
And, exactly as before, truncation or sign-extension is performed on the
result. If sign-extension is to be carried out, it is performed
from the same maximum of the two source register bitwidths out
to the result element's bitwidth.

If truncation occurs, i.e. the top MSBs of the result are lost,
this is "Officially Not Our Problem": it is assumed that the
programmer actually desires the result to be truncated. If the
programmer wanted all of the bits, they would have set the destination
elwidth to accommodate them.
884
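The shift-down rule can be sketched for the unsigned case (MULHU), which
avoids the sign-handling questions of MULH and MULHSU. This is a hedged
illustration only: `zext` and `polymorphic_mulhu` are hypothetical names,
and the final truncation to the destination element width is modelled as a
simple mask.

```python
def zext(val, bits):
    """Keep only the low `bits` bits (zero-extension)."""
    return val & ((1 << bits) - 1)

def polymorphic_mulhu(rs1_val, rs2_val, rs1_wid, rs2_wid, rd_wid):
    # the operation is performed at the maximum of the *source* widths
    opwid = max(rs1_wid, rs2_wid)
    product = zext(rs1_val, rs1_wid) * zext(rs2_val, rs2_wid)
    # the result is shifted down by that same maximum width
    high = product >> opwid
    # truncation to the destination element width (assumed to be a plain mask)
    return zext(high, rd_wid)
```

For a 16-bit x 8-bit multiply of 0xFFFF by 0xFF the 24-bit product is
0xFEFF01; shifting down by 16 leaves 0xFE, with the top 8 bits of the
16-bit result being zero, as the list above describes.
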
885 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
886
Polymorphic element widths in vectorised form mean that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
891
892 This makes for a slightly more complex algorithm when using indirection
893 on the "addressed" register (source for LOAD and destination for STORE),
894 particularly given that the LOAD/STORE instruction provides important
895 information about the width of the data to be reinterpreted.
896
897 As LOAD/STORE may be twin-predicated, it is important to note that
898 the rules on twin predication still apply. Where in previous
899 pseudo-code (elwidth=default for both source and target) it was
900 the *registers* that the predication was applied to, it is now the
901 **elements** that the predication is applied to.
902
903 The pseudocode for all LD operations may be written out
904 as follows:
905
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth) # never fewer than one
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(destwid)        # ... converted to a width in bits
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, bitwidth, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
943
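The extension step of `load_elwidthed` can be looked at in isolation. The
sketch below is an assumption-laden illustration, not the specification:
`extend_loaded` is a hypothetical name, the value is extended from
min(opwidth, destination width) bits as in the pseudo-code above, and the
store into the destination element is modelled as a mask to the
destination width.

```python
def extend_loaded(raw, opwidth, destwidth, unsigned):
    """Extend a loaded value and fit it to the destination element width."""
    # extension starts from the smaller of the operation and element widths
    extwid = min(opwidth, destwidth)
    val = raw & ((1 << extwid) - 1)
    if not unsigned and val & (1 << (extwid - 1)):
        val -= 1 << extwid              # sign-extend (value becomes negative)
    # storing into the element keeps only destwidth bits
    return val & ((1 << destwidth) - 1)
```

For example, an LB of the byte 0x80 into a 16-bit destination element
gives 0xFF80, whereas LBU of the same byte gives 0x0080.
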
944 # Predication Element Zeroing
945
946 The decision to add the *option* to zero unpredicated (masked-out)
947 elements was based on whether it would be useful, rather than on
948 how the microarchitecture is implemented (or optimised). Therefore,
949 both zeroing and non-zeroing are mandatory.
950
951 ## Single-predication (based on destination register)
952
Zeroing on predication for arithmetic operations is taken from
the destination register's predicate, i.e. the predication *and*
zeroing settings to be applied to the whole operation come from the
CSR Predication table entry for the destination register.

Thus when zeroing is set on predication of a destination element,
if the predication bit is clear, then the destination element is *set*
to zero (twin-predication is slightly different, and is covered below).

Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
964
    for (i = 0; i < VL; i++)
      if not zeroing: # an optimisation
         while (i < VL && !(predval & 1<<i))
           if (int_vec[rd ].isvector)  { ird += 1; }
           if (int_vec[rs1].isvector)  { irs1 += 1; }
           if (int_vec[rs2].isvector)  { irs2 += 1; }
           i += 1 # skip the masked-out element
         if i == VL:
           return
      if (predval & 1<<i)
         src1 = ....
         src2 = ...
         result = src1 + src2 # actual add (or other op) here
         set_polymorphed_reg(rd, destwid, ird, result)
         if int_vec[rd].ffirst and result == 0:
            VL = i # result was zero, end loop early, return VL
            return
         if (!int_vec[rd].isvector) return
      else if zeroing:
         result = 0
         set_polymorphed_reg(rd, destwid, ird, result)
      if (int_vec[rd ].isvector)  { ird += 1; }
      else if (predval & 1<<i) return
      if (int_vec[rs1].isvector)  { irs1 += 1; }
      if (int_vec[rs2].isvector)  { irs2 += 1; }
      if (ird == VL or irs1 == VL or irs2 == VL): return
991
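The difference between zeroing and non-zeroing can be seen in a small
Python sketch. It assumes that all three registers are vectors at the
default element width and that fail-first is not in use; lists stand in
for the registers, and `predicated_add` is a hypothetical name.

```python
def predicated_add(dest, src1, src2, predval, vl, zeroing):
    """Single-predicated vector add; returns the new destination contents."""
    out = list(dest)
    for i in range(vl):
        if predval & (1 << i):
            out[i] = src1[i] + src2[i]   # element is active
        elif zeroing:
            out[i] = 0                   # masked-out element is zeroed
        # otherwise the masked-out element is left untouched
    return out

old = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
nonzeroed = predicated_add(old, a, b, 0b0101, 4, zeroing=False)
zeroed = predicated_add(old, a, b, 0b0101, 4, zeroing=True)
```

Here `nonzeroed` is `[11, 9, 33, 9]` and `zeroed` is `[11, 0, 33, 0]`:
the masked-out elements 1 and 3 either keep their old contents or are
cleared.
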
992 ## Twin-predication (based on source and destination register)
993
994 In twin-predication, the source is independently zero-predicated from
995 the destination. This means that the source may be zero-predicated *or*
996 the destination zero-predicated *or both*, or neither.
997
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1004
1005 When zeroing is set on the destination and not the source, then just
1006 as with single-predicated operations, a zero is stored into the destination
1007 element (or target memory address for a STORE).
1008
Zeroing on both source and destination effectively results in a bitwise
AND operation of the source and destination predicates: the result is that
where either source predicate OR destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1013
1014 However: this may not necessarily be the case for all operations;
1015 implementors, particularly of custom instructions, clearly need to
1016 think through the implications in each and every case.
1017
1018 Here is (simplified) pseudo-code for a twin zero-predicated MV operation:
1019
    function op_mv(rd, rs) # MV, not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
            # a masked-out (zeroed) source element passes a zero through
            ireg[rd+j] <= (ps & 1<<i) ? ireg[rs+i] : 0
        else if (zerodst)
            ireg[rd+j] <= 0
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
        else if ((pd & 1<<j)) break;
1035
1036 Note that in the instance where the destination is a scalar, the hardware
1037 loop is ended the moment a value *or a zero* is placed into the destination
1038 register/element. Also note that, for clarity, variable element widths
1039 have been left out of the above.
1040
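A runnable Python sketch of the twin-predicated MV described above,
restricted to the case where both source and destination are vectors at
the default element width. Lists stand in for the registers, the
predicates are plain integers used as bitmasks, and `twin_pred_mv` is a
hypothetical name; this is an illustration under those assumptions, not a
reference implementation.

```python
def twin_pred_mv(dest, src, ps, pd, vl, zerosrc, zerodst):
    """Twin-predicated MV on vectors; returns the new destination contents."""
    out = list(dest)
    i = j = 0
    while i < vl and j < vl:
        if not zerosrc:                      # skip masked-out source elements
            while i < vl and not (ps >> i) & 1:
                i += 1
        if not zerodst:                      # skip masked-out dest elements
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        if (pd >> j) & 1:
            # a zeroed source element passes a zero through the operation
            out[j] = src[i] if (ps >> i) & 1 else 0
        elif zerodst:
            out[j] = 0
        i += 1
        j += 1
    return out

src, old = [1, 2, 3, 4], [9, 9, 9, 9]
packed = twin_pred_mv(old, src, 0b0110, 0b1010, 4, False, False)
both = twin_pred_mv(old, src, 0b0110, 0b1010, 4, True, True)
```

Without zeroing, the two active source elements are moved into the two
active destination slots, giving `[9, 2, 9, 3]`. With zeroing on both
sides, only element 1 has both predicate bits set, so the result is
`[0, 2, 0, 0]`.
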
1041 # Exceptions
1042
1043 TODO: expand.
1044
1045 # Hints
1046
1047 With Simple-V being capable of issuing *parallel* instructions where
1048 rd=x0, the space for possible HINTs is expanded considerably. VL
1049 could be used to indicate different hints. In addition, if predication
1050 is set, the predication register itself could hypothetically be passed
1051 in as a *parameter* to the HINT operation.
1052
No specific hints are yet defined in Simple-V.
1054
1055 # Vector Block Format <a name="vliw-format"></a>
1056
1057 See ancillary resource: [[vblock_format]]
1058
1059 # Subsets of RV functionality
1060
1061 It is permitted to only implement SVprefix and not the VBLOCK instruction
1062 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1063 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1064 traps may emulate the format.
1065
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
**MUST** raise illegal instruction on implementations that do not support
VL or SUBVL.

It is permitted to limit the size of either (or both) of the register files
down to the original size of the standard RV architecture. However, reducing
them below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1075