1 # Simple-V (Parallelism Extension Proposal) Appendix
2
3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
4 * Status: DRAFTv0.6
5 * Last edited: 30 jun 2019
6 * main spec [[specification]]
7
8 [[!toc ]]
9
10 # Fail-on-first modes <a name="ffirst"></a>
11
Fail-on-first data dependency has different behaviour for traps than
for conditional testing. "Conditional" is taken to mean "anything
that is zero"; with traps, however, the first element has to
be given the opportunity to throw the exact same trap that would
be thrown if this were a scalar operation (when VL=1).
17
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time. This advice applies
to custom opcode writers as well as future extension writers.
22
23 ## Fail-on-first traps
24
Except for the first element, ffirst stops sequential element processing
when a trap occurs. The first element is treated normally (as if ffirst
is clear). Should any subsequent element require a trap, it and all
following (indexed) elements are instead ignored (or cancelled in
out-of-order designs), and VL is set to the *last* in-sequence element
that did not take the trap.
31
Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur). However,
note that the loop still had to test the predicate bit: thus on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out (and thus not tested),
up to the point where the trap occurred.
38
39 Unlike conditional tests, "fail-on-first trap" instruction behaviour is
40 unaltered by setting zero or non-zero predication mode.
41
If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
will cause a trap as normal (as if ffirst were not set); subsequent
*sub-groups* must not take the trap (ffirst behaviour applies). SUBVL
will **NOT** be modified. Traps must analyse (x)eSTATE (subvl offset
indices) to determine the element that caused the trap.
47
48 Given that predication bits apply to SUBVL groups, the same rules apply
49 to predicated-out (masked-out) sub-groups in calculating the value that
50 VL is set to.
51
52 ## Fail-on-first conditional tests
53
54 ffirst stops sequential (or sequentially-appearing in the case of
55 out-of-order designs) element conditional testing on the first element
56 result being zero (or other "fail" condition). VL is set to the number
57 of elements that were (sequentially) processed before the fail-condition
58 was encountered.
59
60 Unlike trap fail-on-first, fail-on-first conditional testing behaviour
61 responds to changes in the zero or non-zero predication mode. Whilst
62 in non-zeroing mode, masked-out elements are simply not tested (and
63 thus considered "never to fail"), in zeroing mode, masked-out elements
64 may be viewed as *always* (unconditionally) failing. This effectively
65 turns VL into something akin to a software-controlled loop.
66
Note that just as with traps, if SUBVL!=1, the first fail-condition
within a *sub-group* will cause processing to end, and, even if there
were elements within the *sub-group* that passed the test, that sub-group
is still (entirely) excluded from the count (from setting VL). i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped. However, again: SUBVL must not be modified:
(x)eSTATE (subvl offset indices) must be analysed to determine the
element that caused the fail.
75
76 Note again that, just as with traps, predicated-out (masked-out) elements
77 are included in the (sequential) count leading up to the fail-condition,
78 even though they were not tested.
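
As an illustration of the above rules, here is a minimal python sketch
(illustrative only, not part of the specification) of how the new VL is
determined for a fail-on-first conditional test with SUBVL=1; the helper
name `ffirst_conditional_vl` and the `test` callback are hypothetical
stand-ins for the data-dependent condition that the instruction produces:

    def ffirst_conditional_vl(VL, pred, zeroing, test):
        """Return the new VL after a fail-on-first conditional sweep.

        pred    -- predicate bitmask (bit i covers element i)
        zeroing -- True: masked-out elements "always fail"
                   False: masked-out elements are simply not tested
        test    -- test(i) returns True (pass) / False (fail) for element i
        """
        for i in range(VL):
            if not (pred & (1 << i)):
                if zeroing:      # zeroing: masked-out element fails
                    return i
                continue         # non-zeroing: element is never tested
            if not test(i):      # data-dependent fail condition
                return i         # VL = elements processed before the fail
        return VL                # nothing failed: VL is unchanged

    # elements 0..2 pass, element 3 fails, element 1 is masked out:
    # the masked-out element is still counted, so the new VL is 3
    print(ffirst_conditional_vl(8, 0b11111101, False, lambda i: i != 3))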
79
80 # Instructions <a name="instructions" />
81
Despite SV being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all dedicated vector
opcodes, and with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*. Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
92
93 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
94 equivalents, so are left out of Simple-V. VSELECT could be included if
95 there existed a MV.X instruction in RV (MV.X is a hypothetical
96 non-immediate variant of MV that would allow another register to
97 specify which register was to be copied). Note that if any of these three
98 instructions are added to any given RV extension, their functionality
99 will be inherently parallelised.
100
101 With some exceptions, where it does not make sense or is simply too
102 challenging, all RV-Base instructions are parallelised:
103
* CSR instructions: whilst a case could be made for fast-polling of
  a CSR into multiple registers, or for being able to copy multiple
  contiguously addressed CSRs into contiguous registers, and so on,
  CSRs are the fundamental core basis of SV. If parallelised, extreme
  care would need to be taken. Additionally, CSR reads are done
  using x0, and it is *really* inadvisable to tag x0.
110 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
111 left as scalar.
112 * LR/SC could hypothetically be parallelised however their purpose is
113 single (complex) atomic memory operations where the LR must be followed
114 up by a matching SC. A sequence of parallel LR instructions followed
115 by a sequence of parallel SC instructions therefore is guaranteed to
116 not be useful. Not least: the guarantees of a Multi-LR/SC
117 would be impossible to provide if emulated in a trap.
118 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
119 paralleliseable anyway.
120
121 All other operations using registers are automatically parallelised.
122 This includes AMOMAX, AMOSWAP and so on, where particular care and
123 attention must be paid.
124
125 Example pseudo-code for an integer ADD operation (including scalar
126 operations). Floating-point uses the FP Register Table.
127
128 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
129
130 Note that for simplicity there is quite a lot missing from the above
131 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
132 reshaping and offsets and so on. However it demonstrates the basic
133 principle. Augmentations that produce the full pseudo-code are covered in
134 other sections.
135
136 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
137
138 Adding in support for SUBVL is a matter of adding in an extra inner
139 for-loop, where register src and dest are still incremented inside the
140 inner part. Note that the predication is still taken from the VL index.
141
So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".
144
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
             if (int_vec[rd ].isvector)  { id += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
             if (id == VL or irs1 == VL or irs2 == VL) {
               # end VL hardware loop
               xSTATE.srcoffs = 0; # reset
               xSTATE.ssvoffs = 0; # reset
               return;
             }
168
169
170 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
171 elwidth handling etc. all left out.
172
173 ## Instruction Format
174
175 It is critical to appreciate that there are
176 **no operations added to SV, at all**.
177
178 Instead, by using CSRs to tag registers as an indication of "changed
179 behaviour", SV *overloads* pre-existing branch operations into predicated
180 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
181 LOAD/STORE depending on CSR configurations for bitwidth and predication.
182 **Everything** becomes parallelised. *This includes Compressed
183 instructions* as well as any future instructions and Custom Extensions.
184
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V. UXL, SXL and MXL change the behaviour so that
XLEN=32/64/128. FRM changes the behaviour of the floating-point unit, to
alter the rounding mode. Other architectures change the LOAD/STORE
byte-order from big-endian to little-endian on a per-instruction basis.
SV is just a little more... comprehensive in its effect on instructions.
191
192 ## Branch Instructions
193
194 Branch operations are augmented slightly to be a little more like FP
195 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
196 of multiple comparisons into a register (taken indirectly from the predicate
197 table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
198 See ffirst mode in the Predication Table section.
199
200 There are two registers for the comparison operation, therefore there is
201 the opportunity to associate two predicate registers. The first is a
202 "normal" predicate register, which acts just as it does on any other
203 single-predicated operation: masks out elements where a bit is zero,
204 applies an inversion to the predicate mask, and enables zeroing / non-zeroing
205 mode.
206
207 The second is utilised to indicate where the results of each comparison
208 are to be stored, as a bitmask. Additionally, the behaviour of the branch
209 - when it occurs - may also be modified depending on whether the predicate
210 "invert" bit is set.
211
* If the "invert" bit is zero, then the branch will occur if and only
  if all tests pass
* If the "invert" bit is set, the branch will occur if and only if all
  tests *fail*.
216
This inversion capability, with some careful boolean logic manipulation,
covers AND, OR, NAND and NOR branching based on multiple element comparisons.
Note that, unlike the early termination of chains of AND or OR conditional
tests in normal computer programming, the chain does *not* terminate early
unless fail-on-first is set, and even then ffirst ends on the first
data-dependent zero. When ffirst mode is not set, *all* conditional element
tests must be performed (and the result optionally stored in the result
mask), with a "post-analysis" phase carried out which checks whether to branch.
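
As a distilled (non-normative) python sketch of the "post-analysis"
decision for the simple non-zeroing case, with masked-out elements simply
not participating; the full pseudocode later in this section, which also
covers zeroing, ffirst and the destination predicate, is the authoritative
version:

    def branch_taken(result, ps, invert):
        # result: per-element comparison results (bit set = test passed)
        # ps:     predicate mask from src1 (elements actually tested)
        # invert: the "invert" bit on the target predicate register
        if invert:
            # branch if and only if all predicated-in tests *fail* (NOR)
            return (result & ps) == 0
        # branch if and only if all predicated-in tests pass (AND)
        return (result & ps) == ps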
225
226 ### Standard Branch <a name="standard_branch"></a>
227
228 Branch operations use standard RV opcodes that are reinterpreted to
229 be "predicate variants" in the instance where either of the two src
230 registers are marked as vectors (active=1, vector=1).
231
232 Note that the predication register to use (if one is enabled) is taken from
233 the *first* src register, and that this is used, just as with predicated
234 arithmetic operations, to mask whether the comparison operations take
235 place or not. The target (destination) predication register
236 to use (if one is enabled) is taken from the *second* src register.
237
238 If either of src1 or src2 are scalars (whether by there being no
239 CSR register entry or whether by the CSR entry specifically marking
240 the register as "scalar") the comparison goes ahead as vector-scalar
241 or scalar-vector.
242
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
248
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) is set to zero. Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared. This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always expected to be set to zero:
this is **not** the case. The destination predicate is only set
to zero if **zeroing** is enabled.
259
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2; note however that in doing so, the predicate table
setup must also be correspondingly adjusted.
264
265 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
266 for predicated compare operations of function "cmp":
267
    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);
272
273 With associated predication, vector-length adjustments and so on,
274 and temporarily ignoring bitwidth (which makes the comparisons more
275 complex), this becomes:
276
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    ffirst_mode, zeroing = get_pred_flags(rs1)
    if exists(rd):
        pred_inversion = get_pred_invert(rs2)
    else
        pred_inversion = False

    if not exists(rd) or zeroing:
        result = (1<<VL)-1 # all 1s
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);
                if ffirst_mode:
                    break

    if exists(rd):
        preg[rd] = result # store in destination

    if pred_inversion:
        if result == 0:
            goto branch
    else:
        if (result & ps) == result:
            goto branch
324
325 Notes:
326
327 * Predicated SIMD comparisons would break src1 and src2 further down
328 into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
329 Reordering") setting Vector-Length times (number of SIMD elements) bits
330 in Predicate Register rd, as opposed to just Vector-Length bits.
331 * The execution of "parallelised" instructions **must** be implemented
332 as "re-entrant" (to use a term from software). If an exception (trap)
333 occurs during the middle of a vectorised
334 Branch (now a SV predicated compare) operation, the partial results
335 of any comparisons must be written out to the destination
336 register before the trap is permitted to begin. If however there
337 is no predicate, the **entire** set of comparisons must be **restarted**,
338 with the offset loop indices set back to zero. This is because
339 there is no place to store the temporary result during the handling
340 of traps.
341
342 TODO: predication now taken from src2. also branch goes ahead
343 if all compares are successful.
344
Note also that, whereas normally predication requires that there must
also be a CSR register entry for the register being used in order
for the **predication** CSR register entry to also be active,
for branches this is **not** the case: src2 does **not** have
to have its CSR register entry marked as active in order for
predication on src2 to be active.
351
352 Also note: SV Branch operations are **not** twin-predicated
353 (see Twin Predication section). This would require three
354 element offsets: one to track src1, one to track src2 and a third
355 to track where to store the accumulation of the results. Given
356 that the element offsets need to be exposed via CSRs so that
357 the parallel hardware looping may be made re-entrant on traps
358 and exceptions, the decision was made not to make SV Branches
359 twin-predicated.
360
361 ### Floating-point Comparisons
362
There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
367
368 In RV (scalar) Base, a branch on a floating-point compare is
369 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
370 This does extend to SV, as long as x1 (in the example sequence given)
371 is vectorised. When that is the case, x1..x(1+VL-1) will also be
372 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
373 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
374 so on. Consequently, unlike integer-branch, FP Compare needs no
375 modification in its behaviour.
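
A small python model of the sequence above (illustrative only: the
floating-point register contents are arbitrary), showing the vectorised
FEQ results landing in x1..x(VL) and the subsequent vectorised BEQ testing
each of them against x0:

    VL = 5
    freg = [0.0] * 32
    xreg = [0] * 32
    freg[0:5]  = [1.0, 2.0, 3.0, 4.0, 5.0]   # f0..f4
    freg[5:10] = [1.0, 9.0, 3.0, 9.0, 5.0]   # f5..f9

    # vectorised FEQ x1, f0, f5: x1..x5 <= (f0==f5), (f1==f6), ...
    for i in range(VL):
        xreg[1 + i] = 1 if freg[0 + i] == freg[5 + i] else 0

    # vectorised BEQ x1, x0, #jumploc: taken iff x1==x0, x2==x0, ... all hold
    taken = all(xreg[1 + i] == xreg[0] for i in range(VL))
    print(xreg[1:6], taken)   # [1, 0, 1, 0, 1] False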
376
In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing, and whilst in ordinary branch code this is fine because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero), in predication
terms the omission has more of an impact. To deal with this, SV's
predication has had "invert" added to it.
383
384 Also: note that FP Compare may be predicated, using the destination
385 integer register (rd) to determine the predicate. FP Compare is **not**
386 a twin-predication operation, as, again, just as with SV Branches,
387 there are three registers involved: FP src1, FP src2 and INT rd.
388
389 Also: note that ffirst (fail first mode) applies directly to this operation.
390
391 ### Compressed Branch Instruction
392
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
399
400 The specific required use of x0 is, with a little thought, quite obvious,
401 but is counterintuitive. Clearly it is **not** recommended to redirect
402 x0 with a CSR register entry, however as a means to opaquely obtain
403 a predication target it is the only sensible option that does not involve
404 additional special CSRs (or, worse, additional special opcodes).
405
406 Note also that, just as with standard branches, the 2nd source
407 (in this case x0 rather than src2) does **not** have to have its CSR
408 register table marked as "active" in order for predication to work.
409
410 ## Vectorised Dual-operand instructions
411
412 There is a series of 2-operand instructions involving copying (and
413 sometimes alteration):
414
415 * C.MV
416 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
417 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
418 * LOAD(-FP) and STORE(-FP)
419
420 All of these operations follow the same two-operand pattern, so it is
421 *both* the source *and* destination predication masks that are taken into
422 account. This is different from
423 the three-operand arithmetic instructions, where the predication mask
424 is taken from the *destination* register, and applied uniformly to the
425 elements of the source register(s), element-for-element.
426
427 The pseudo-code pattern for twin-predicated operations is as
428 follows:
429
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
443
444 This pattern covers scalar-scalar, scalar-vector, vector-scalar
445 and vector-vector, and predicated variants of all of those.
446 Zeroing is not presently included (TODO). As such, when compared
447 to RVV, the twin-predicated variants of C.MV and FMV cover
448 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
449 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
450
451 Note that:
452
453 * elwidth (SIMD) is not covered in the pseudo-code above
454 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
455 not covered
456 * zero predication is also not shown (TODO).
457
458 ### C.MV Instruction <a name="c_mv"></a>
459
460 There is no MV instruction in RV however there is a C.MV instruction.
461 It is used for copying integer-to-integer registers (vectorised FMV
462 is used for copying floating-point).
463
464 If either the source or the destination register are marked as vectors
465 C.MV is reinterpreted to be a vectorised (multi-register) predicated
466 move operation. The actual instruction's format does not change:
467
468 [[!table data="""
469 15 12 | 11 7 | 6 2 | 1 0 |
470 funct4 | rd | rs | op |
471 4 | 5 | 5 | 2 |
472 C.MV | dest | src | C0 |
473 """]]
474
475 A simplified version of the pseudocode for this operation is as follows:
476
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
490
491 There are several different instructions from RVV that are covered by
492 this one opcode:
493
494 [[!table data="""
495 src | dest | predication | op |
496 scalar | vector | none | VSPLAT |
497 scalar | vector | destination | sparse VSPLAT |
498 scalar | vector | 1-bit dest | VINSERT |
499 vector | scalar | 1-bit? src | VEXTRACT |
500 vector | vector | none | VCOPY |
501 vector | vector | src | Vector Gather |
502 vector | vector | dest | Vector Scatter |
503 vector | vector | src & dest | Gather/Scatter |
504 vector | vector | src == dest | sparse VCOPY |
505 """]]
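
The table may be easier to follow with a small python sketch of the same
loop (illustrative only: the `rd_isvec`/`rs_isvec` flags and the `pd`/`ps`
masks stand in for the CSR table entries and predicate registers, and
bounds checks are omitted just as in the pseudocode above):

    def sv_mv(regs, VL, rd, rs, rd_isvec, rs_isvec, pd, ps):
        """Twin-predicated move: the same loop as the C.MV pseudocode."""
        i = j = 0
        while i < VL and j < VL:
            if rs_isvec:
                while not (ps & (1 << i)): i += 1   # skip masked src elements
            if rd_isvec:
                while not (pd & (1 << j)): j += 1   # skip masked dest elements
            regs[rd + j] = regs[rs + i]
            if rs_isvec: i += 1
            if rd_isvec: j += 1
            else: break
        return regs

    regs = list(range(32))
    # scalar src, vector dest, no predication: VSPLAT of x5 into x8..x11
    print(sv_mv(regs[:], 4, 8, 5, True, False, 0b1111, 0b1111)[8:12])  # [5,5,5,5]
    # vector src, scalar dest, 1-bit src predicate: VEXTRACT element 2 into x3
    print(sv_mv(regs[:], 4, 3, 8, False, True, 0b1111, 0b0100)[3])     # 10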
506
507 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
508 operations with zeroing off, and inversion on the src and dest predication
509 for one of the two C.MV operations. The non-inverted C.MV will place
510 one set of registers into the destination, and the inverted one the other
511 set. With predicate-inversion, copying and inversion of the predicate mask
512 need not be done as a separate (scalar) instruction.
513
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
519
520 ### FMV, FNEG and FABS Instructions
521
These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However, when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
530
### FCVT Instructions
532
533 These are again identical in form to C.MV, except that they cover
534 floating-point to integer and integer to floating-point. When element
535 width in each vector is set to default, the instructions behave exactly
536 as they are defined for standard RV (scalar) operations, except vectorised
537 in exactly the same fashion as outlined in C.MV.
538
539 However when the source or destination element width is not set to default,
540 the opcode's explicit element widths are *over-ridden* to new definitions,
541 and the opcode's element width is taken as indicative of the SIMD width
542 (if applicable i.e. if packed SIMD is requested) instead.
543
For example, FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision (32-bit) floating-point
number in rd.
546 If however the source rs1 is set to be a vector, where elwidth is set to
547 default/2 and "packed SIMD" is enabled, then the first 32 bits of
548 rs1 are converted to a floating-point number to be stored in rd's
549 first element and the higher 32-bits *also* converted to floating-point
550 and stored in the second. The 32 bit size comes from the fact that
551 FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
552 divide that by two it means that rs1 element width is to be taken as 32.
553
554 Similar rules apply to the destination register.
555
556 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
557
558 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
559 the interpretation of the instruction fields). This
560 actually undermined the fundamental principle of SV, namely that there
561 be no modifications to the scalar behaviour (except where absolutely
562 necessary), in order to simplify an implementor's task if considering
563 converting a pre-existing scalar design to support parallelism.
564
565 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
566 do not change in SV, however just as with C.MV it is important to note
567 that dual-predication is possible.
568
569 In vectorised architectures there are usually at least two different modes
570 for LOAD/STORE:
571
572 * Read (or write for STORE) from sequential locations, where one
573 register specifies the address, and the one address is incremented
574 by a fixed amount. This is usually known as "Unit Stride" mode.
575 * Read (or write) from multiple indirected addresses, where the
576 vector elements each specify separate and distinct addresses.
577
578 To support these different addressing modes, the CSR Register "isvector"
579 bit is used. So, for a LOAD, when the src register is set to
580 scalar, the LOADs are sequentially incremented by the src register
581 element width, and when the src register is set to "vector", the
582 elements are treated as indirection addresses. Simplified
583 pseudo-code would look like this:
584
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode)
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
604
605 Notes:
606
607 * For simplicity, zeroing and elwidth is not included in the above:
608 the key focus here is the decision-making for srcbase; vectorised
609 rs means use sequentially-numbered registers as the indirection
610 address, and scalar rs is "offset" mode.
611 * The test towards the end for whether both source and destination are
612 scalar is what makes the above pseudo-code provide the "standard" RV
613 Base behaviour for LD operations.
614 * The offset in bytes (XLEN/8) changes depending on whether the
615 operation is a LB (1 byte), LH (2 byes), LW (4 bytes) or LD
616 (8 bytes), and also whether the element width is over-ridden
617 (see special element width section).
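
To illustrate the srcbase decision-making described in the notes above,
here is a minimal python sketch (illustrative only; `rs_isvec` stands for
the CSR table "isvector" bit and `opwidth_bytes` for the access size of
the particular LOAD opcode):

    def srcbase_addr(ireg, rs, i, rs_isvec, opwidth_bytes):
        """Effective base address for element i of a (vectorised) LOAD."""
        if rs_isvec:
            # indirect ("multi") mode: each element supplies its own address
            return ireg[rs + i]
        # unit-stride mode: one base address, stepped by the access size
        # (1 for LB, 2 for LH, 4 for LW, 8 for LD)
        return ireg[rs] + i * opwidth_bytes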
618
619 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
620
621 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
622 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
623 It is therefore possible to use predicated C.LWSP to efficiently
624 pop registers off the stack (by predicating x2 as the source), cherry-picking
625 which registers to store to (by predicating the destination). Likewise
626 for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
627
628 The two modes ("unit stride" and multi-indirection) are still supported,
629 as with standard LD/ST. Essentially, the only difference is that the
630 use of x2 is hard-coded into the instruction.
631
632 **Note**: it is still possible to redirect x2 to an alternative target
633 register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
634 general-purpose LOAD/STORE operations.
635
636 ## Compressed LOAD / STORE Instructions
637
638 Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
639 where the same rules apply and the same pseudo-code apply as for
640 non-compressed LOAD/STORE. Again: setting scalar or vector mode
641 on the src for LOAD and dest for STORE switches mode from "Unit Stride"
642 to "Multi-indirection", respectively.
643
644 # Element bitwidth polymorphism <a name="elwidth"></a>
645
646 Element bitwidth is best covered as its own special section, as it
647 is quite involved and applies uniformly across-the-board. SV restricts
648 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
649
650 The effect of setting an element bitwidth is to re-cast each entry
651 in the register table, and for all memory operations involving
652 load/stores of certain specific sizes, to a completely different width.
Thus, in c-style terms, on an RV64 architecture, effectively each register
654 now looks like this:
655
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
665
666 where the CSR Register table entry (not the instruction alone) determines
667 which of those union entries is to be used on each operation, and the
668 VL element offset in the hardware-loop specifies the index into each array.
669
670 However a naive interpretation of the data structure above masks the
671 fact that setting VL greater than 8, for example, when the bitwidth is 8,
672 accessing one specific register "spills over" to the following parts of
673 the register file in a sequential fashion. So a much more accurate way
674 to reflect this would be:
675
    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0]; // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
686
687 where when accessing any individual regfile[n].b entry it is permitted
688 (in c) to arbitrarily over-run the *declared* length of the array (zero),
689 and thus "overspill" to consecutive register file entries in a fashion
690 that is completely transparent to a greatly-simplified software / pseudo-code
691 representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any access beyond the "real" register
bytes is ever attempted.
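
A small python sketch (illustrative only) of the byte-level "overspill"
indexing that this union describes, on RV64 (XLEN=64); the helper name is
hypothetical:

    XLEN_BYTES = 8   # RV64

    def element_location(regidx, elwidth_bytes, element):
        """Return (actual register, byte offset) holding a given element."""
        byte_addr = regidx * XLEN_BYTES + element * elwidth_bytes
        return byte_addr // XLEN_BYTES, byte_addr % XLEN_BYTES

    # 8-bit elements starting at x5: element 9 "spills over" into x6
    print(element_location(5, 1, 9))   # -> (6, 1)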
696
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size, where this pseudo-code is
otherwise identical to its "non"-polymorphic version (above):
700
    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
        ...
        ...
        // TODO, calculate if over-run occurs, for each elwidth
        if (elwidth == 8) {
           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                    int_regfile[rs2].b[irs2];
        } else if elwidth == 16 {
           int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                    int_regfile[rs2].s[irs2];
        } else if elwidth == 32 {
           int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                    int_regfile[rs2].i[irs2];
        } else { // elwidth == 64
           int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                    int_regfile[rs2].l[irs2];
        }
        ...
        ...
724 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
725 following sequentially on respectively from the same) are "type-cast"
726 to 8-bit; for 16-bit entries likewise and so on.
727
728 However that only covers the case where the element widths are the same.
729 Where the element widths are different, the following algorithm applies:
730
731 * Analyse the bitwidth of all source operands and work out the
732 maximum. Record this as "maxsrcbitwidth"
733 * If any given source operand requires sign-extension or zero-extension
734 (ldb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
735 sign-extension / zero-extension or whatever is specified in the standard
736 RV specification, **change** that to sign-extending from the respective
737 individual source operand's bitwidth from the CSR table out to
738 "maxsrcbitwidth" (previously calculated), instead.
739 * Following separate and distinct (optional) sign/zero-extension of all
740 source operands as specifically required for that operation, carry out the
741 operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
742 this may be a "null" (copy) operation, and that with FCVT, the changes
743 to the source and destination bitwidths may also turn FVCT effectively
744 into a copy).
745 * If the destination operand requires sign-extension or zero-extension,
746 instead of a mandatory fixed size (typically 32-bit for arithmetic,
747 for subw for example, and otherwise various: 8-bit for sb, 16-bit for sw
748 etc.), overload the RV specification with the bitwidth from the
749 destination register's elwidth entry.
750 * Finally, store the (optionally) sign/zero-extended value into its
751 destination: memory for sb/sw etc., or an offset section of the register
752 file for an arithmetic operation.
753
754 In this way, polymorphic bitwidths are achieved without requiring a
755 massive 64-way permutation of calculations **per opcode**, for example
756 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
757 rd bitwidths). The pseudo-code is therefore as follows:
758
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0: return xlen
        if elwidth == 1: return 8
        if elwidth == 2: return 16
        // elwidth == 3:
        return 32

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
    destwid = int_csr[rd].elwidth # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
            // TODO, calculate if over-run occurs, for each elwidth
            src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
            // TODO, sign/zero-extend src1 and src2 as operation requires
            if (op_requires_sign_extend_src1)
                src1 = sign_extend(src1, maxsrcwid)
            src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
            result = src1 + src2 # actual add here
            // TODO, sign/zero-extend result, as operation requires
            if (op_requires_sign_extend_dest)
                result = sign_extend(result, maxsrcwid)
            set_polymorphed_reg(rd, destwid, id, result)
            if (!int_vec[rd].isvector) break
            if (int_vec[rd ].isvector)  { id += 1; }
            if (int_vec[rs1].isvector)  { irs1 += 1; }
            if (int_vec[rs2].isvector)  { irs2 += 1; }
824
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
828
829 * the source operands are extended out to the maximum bitwidth of all
830 source operands
831 * the operation takes place at that maximum source bitwidth (the
832 destination bitwidth is not involved at this point, at all)
833 * the result is extended (or potentially even, truncated) before being
834 stored in the destination. i.e. truncation (if required) to the
835 destination width occurs **after** the operation **not** before.
836 * when the destination is not marked as "vectorised", the **full**
837 (standard, scalar) register file entry is taken up, i.e. the
838 element is either sign-extended or zero-extended to cover the
839 full register bitwidth (XLEN) if it is not already XLEN bits long.
840
841 Implementors are entirely free to optimise the above, particularly
842 if it is specifically known that any given operation will complete
843 accurately in less bits, as long as the results produced are
844 directly equivalent and equal, for all inputs and all outputs,
845 to those produced by the above algorithm.
846
847 ## Polymorphic floating-point operation exceptions and error-handling
848
849 For floating-point operations, conversion takes place without raising any
850 kind of exception. Exactly as specified in the standard RV specification,
851 NAN (or appropriate) is stored if the result is beyond the range of the
852 destination, and, again, exactly as with the standard RV specification
853 just as with scalar operations, the floating-point flag is raised
854 (FCSR). And, again, just as with scalar operations, it is software's
855 responsibility to check this flag. Given that the FCSR flags are
856 "accrued", the fact that multiple element operations could have occurred
857 is not a problem.
858
859 Note that it is perfectly legitimate for floating-point bitwidths of
860 only 8 to be specified. However whilst it is possible to apply IEEE 754
861 principles, no actual standard yet exists. Implementors wishing to
862 provide hardware-level 8-bit support rather than throw a trap to emulate
863 in software should contact the author of this specification before
864 proceeding.
865
866 ## Polymorphic shift operators
867
868 A special note is needed for changing the element width of left and
869 right shift operators, particularly right-shift. Even for standard RV
870 base, in order for correct results to be returned, the second operand
871 RS2 must be truncated to be within the range of RS1's bitwidth.
872 spike's implementation of sll for example is as follows:
873
874 WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
875
876 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
877 range 0..31 so that RS1 will only be left-shifted by the amount that
878 is possible to fit into a 32-bit register. Whilst this appears not
879 to matter for hardware, it matters greatly in software implementations,
880 and it also matters where an RV64 system is set to "RV32" mode, such
881 that the underlying registers RS1 and RS2 comprise 64 hardware bits
882 each.
883
884 For SV, where each operand's element bitwidth may be over-ridden, the
885 rule about determining the operation's bitwidth *still applies*, being
886 defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
887 **also applies to the truncation of RS2**. In other words, *after*
888 determining the maximum bitwidth, RS2's range must **also be truncated**
889 to ensure a correct answer. Example:
890
891 * RS1 is over-ridden to a 16-bit width
892 * RS2 is over-ridden to an 8-bit width
893 * RD is over-ridden to a 64-bit width
894 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
895 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
896
897 Pseudocode (in spike) for this example would therefore be:
898
899 WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
900
901 This example illustrates that considerable care therefore needs to be
902 taken to ensure that left and right shift operations are implemented
903 correctly. The key is that
904
905 * The operation bitwidth is determined by the maximum bitwidth
906 of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
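
A small python sketch of the 16-bit/8-bit example above (illustrative
only; sign-extension of the result out to the destination elwidth is
omitted):

    def sv_sll(rs1_val, rs2_val, rs1_bw, rs2_bw):
        """Elwidth-overridden shift-left for a single element (sketch)."""
        opwidth = max(rs1_bw, rs2_bw)     # operation bitwidth: max of sources
        shamt = rs2_val & (opwidth - 1)   # RS2 truncated *after* the max
        return (rs1_val << shamt) & ((1 << opwidth) - 1)

    # 16-bit RS1, 8-bit RS2: a shift amount of 0x21 is truncated to 1, not 33
    print(hex(sv_sll(0x00ff, 0x21, 16, 8)))  # -> 0x1fe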
908
909 ## Polymorphic MULH/MULHU/MULHSU
910
911 MULH is designed to take the top half MSBs of a multiply that
912 does not fit within the range of the source operands, such that
913 smaller width operations may produce a full double-width multiply
914 in two cycles. The issue is: SV allows the source operands to
915 have variable bitwidth.
916
917 Here again special attention has to be paid to the rules regarding
918 bitwidth, which, again, are that the operation is performed at
919 the maximum bitwidth of the **source** registers. Therefore:
920
921 * An 8-bit x 8-bit multiply will create a 16-bit result that must
922 be shifted down by 8 bits
923 * A 16-bit x 8-bit multiply will create a 24-bit result that must
924 be shifted down by 16 bits (top 8 bits being zero)
925 * A 16-bit x 16-bit multiply will create a 32-bit result that must
926 be shifted down by 16 bits
927 * A 32-bit x 16-bit multiply will create a 48-bit result that must
928 be shifted down by 32 bits
929 * A 32-bit x 8-bit multiply will create a 40-bit result that must
930 be shifted down by 32 bits
931
932 So again, just as with shift-left and shift-right, the result
933 is shifted down by the maximum of the two source register bitwidths.
934 And, exactly again, truncation or sign-extension is performed on the
935 result. If sign-extension is to be carried out, it is performed
936 from the same maximum of the two source register bitwidths out
937 to the result element's bitwidth.
938
939 If truncation occurs, i.e. the top MSBs of the result are lost,
940 this is "Officially Not Our Problem", i.e. it is assumed that the
941 programmer actually desires the result to be truncated. i.e. if the
942 programmer wanted all of the bits, they would have set the destination
943 elwidth to accommodate them.
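
A small python sketch of a single element pair (illustrative only), using
the 16-bit x 8-bit case from the list above:

    def sv_mulh(a, b, a_bw, b_bw, signed=False):
        """Elwidth-overridden MULH: full product shifted down by the
        maximum of the two *source* bitwidths (sketch)."""
        opwidth = max(a_bw, b_bw)
        if signed:
            # re-interpret the inputs as two's-complement at their own widths
            if a & (1 << (a_bw - 1)): a -= (1 << a_bw)
            if b & (1 << (b_bw - 1)): b -= (1 << b_bw)
        return (a * b) >> opwidth

    # 16-bit x 8-bit unsigned: 24-bit product, MULH result = product >> 16
    print(sv_mulh(0xffff, 0xff, 16, 8))   # -> 254 (0xfe, top 8 bits zero)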
944
945 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
946
947 Polymorphic element widths in vectorised form means that the data
948 being loaded (or stored) across multiple registers needs to be treated
949 (reinterpreted) as a contiguous stream of elwidth-wide items, where
950 the source register's element width is **independent** from the destination's.
951
952 This makes for a slightly more complex algorithm when using indirection
953 on the "addressed" register (source for LOAD and destination for STORE),
954 particularly given that the LOAD/STORE instruction provides important
955 information about the width of the data to be reinterpreted.
956
957 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
958 was as follows, and i is the loop from 0 to VL-1:
959
960 srcbase = ireg[rs+i];
961 return mem[srcbase + imm]; // returns XLEN bits
962
963 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
964 chunks are taken from the source memory location addressed by the current
965 indexed source address register, and only when a full 32-bits-worth
966 are taken will the index be moved on to the next contiguous source
967 address register:
968
969 bitwidth = bw(elwidth); // source elwidth from CSR reg entry
970 elsperblock = 32 / bitwidth // 1 if bw=32, 2 if bw=16, 4 if bw=8
971 srcbase = ireg[rs+i/(elsperblock)]; // integer divide
972 offs = i % elsperblock; // modulo
973 return &mem[srcbase + imm + offs]; // re-cast to uint8_t*, uint16_t* etc.
974
975 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
976 and 128 for LQ.
977
978 The principle is basically exactly the same as if the srcbase were pointing
979 at the memory of the *register* file: memory is re-interpreted as containing
980 groups of elwidth-wide discrete elements.
981
982 When storing the result from a load, it's important to respect the fact
983 that the destination register has its *own separate element width*. Thus,
984 when each element is loaded (at the source element width), any sign-extension
985 or zero-extension (or truncation) needs to be done to the *destination*
986 bitwidth. Also, the storing has the exact same analogous algorithm as
987 above, where in fact it is just the set\_polymorphed\_reg pseudocode
988 (completely unchanged) used above.
989
990 One issue remains: when the source element width is **greater** than
991 the width of the operation, it is obvious that a single LB for example
992 cannot possibly obtain 16-bit-wide data. This condition may be detected
993 where, when using integer divide, elsperblock (the width of the LOAD
994 divided by the bitwidth of the element) is zero.
995
996 The issue is "fixed" by ensuring that elsperblock is a minimum of 1:
997
    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
999
1000 The elements, if the element bitwidth is larger than the LD operation's
1001 size, will then be sign/zero-extended to the full LD operation size, as
1002 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
1003 being passed on to the second phase.
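
For example (a tiny python sketch of just this calculation, illustrative
only):

    def elsperblock(op_bitwidth, element_bitwidth):
        # number of elements packed into one load/store access, minimum 1
        return max(1, op_bitwidth // element_bitwidth)

    print(elsperblock(64, 16))   # LD with 16-bit elements: 4 per access
    print(elsperblock(8, 16))    # LB with 16-bit elements: clamped to 1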
1004
1005 As LOAD/STORE may be twin-predicated, it is important to note that
1006 the rules on twin predication still apply, except where in previous
1007 pseudo-code (elwidth=default for both source and target) it was
1008 the *registers* that the predication was applied to, it is now the
1009 **elements** that the predication is applied to.
1010
1011 Thus the full pseudocode for all LD operations may be written out
1012 as follows:
1013
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = int_csr[rd].elwidth # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;
1051
1052 Note:
1053
1054 * when comparing against for example the twin-predicated c.mv
1055 pseudo-code, the pattern of independent incrementing of rd and rs
1056 is preserved unchanged.
1057 * just as with the c.mv pseudocode, zeroing is not included and must be
1058 taken into account (TODO).
1059 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1060 take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1061 VSCATTER characteristics.
1062 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1063 a destination that is not vectorised (marked as scalar) will
1064 result in the element being fully sign-extended or zero-extended
1065 out to the full register file bitwidth (XLEN). When the source
1066 is also marked as scalar, this is how the compatibility with
1067 standard RV LOAD/STORE is preserved by this algorithm.
1068
1069 ### Example Tables showing LOAD elements
1070
1071 This section contains examples of vectorised LOAD operations, showing
1072 how the two stage process works (three if zero/sign-extension is included).
1073
1074
1075 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1076
1077 This is:
1078
1079 * a 64-bit load, with an offset of zero
1080 * with a source-address elwidth of 16-bit
1081 * into a destination-register with an elwidth of 32-bit
1082 * where VL=7
1083 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1084 * RV64, where XLEN=64 is assumed.
1085
1086 First, the memory table, which, due to the element width being 16 and the
1087 operation being LD (64), the 64-bits loaded from memory are subdivided
1088 into groups of **four** elements. And, with VL being 7 (deliberately
1089 to illustrate that this is reasonable and possible), the first four are
1090 sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1092
1093 [[!table data="""
1094 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1095 @x5 | elem 0 || elem 1 || elem 2 || elem 3 ||
1096 @x6 | elem 4 || elem 5 || elem 6 || not loaded ||
1097 """]]
1098
1099 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1100 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1101
[[!table  data="""
byte 3 | byte 2 | byte 1 | byte 0 |
0x0    | 0x0    | elem0  ||
0x0    | 0x0    | elem1  ||
0x0    | 0x0    | elem2  ||
0x0    | 0x0    | elem3  ||
0x0    | 0x0    | elem4  ||
0x0    | 0x0    | elem5  ||
0x0    | 0x0    | elem6  ||
"""]]
1113
1114 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1115 byte-addressable "memory". That "memory" happens to cover registers
1116 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1117
1118 [[!table data="""
1119 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1120 x8 | 0x0 | 0x0 | elem 1 || 0x0 | 0x0 | elem 0 ||
1121 x9 | 0x0 | 0x0 | elem 3 || 0x0 | 0x0 | elem 2 ||
1122 x10 | 0x0 | 0x0 | elem 5 || 0x0 | 0x0 | elem 4 ||
1123 x11 | **UNMODIFIED** |||| 0x0 | 0x0 | elem 6 ||
1124 """]]
1125
1126 Thus we have data that is loaded from the **addresses** pointed to by
1127 x5 and x6, zero-extended from 16-bit to 32-bit, stored in the **registers**
1128 x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1132
1133 Note that whilst the memory addressing table is shown left-to-right byte order,
1134 the registers are shown in right-to-left (MSB) order. This does **not**
1135 imply that bit or byte-reversal is carried out: it's just easier to visualise
1136 memory as being contiguous bytes, and emphasises that registers are not
1137 really actually "memory" as such.
1138
1139 ## Why SV bitwidth specification is restricted to 4 entries
1140
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate; surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides. The answer here
is that it gets too complex: no RV128 implementation yet exists, and RV64's
default is 64 bit, so the 4 major element widths are covered anyway.
1151
There is an absolutely crucial aspect of SV here that explicitly
needs spelling out, and it's whether the "vectorised" bit is set in
the Register's CSR entry.
1155
If "vectorised" is clear (not set), this indicates that the operation
is "scalar". Under these circumstances, when an elwidth override is set
on a destination (RD), sign-extension and zero-extension, whilst changed
to match the override bitwidth, will overwrite the **full** register
entry (64-bit if RV64).
1161
1162 When vectorised is *set*, this indicates that the operation now treats
1163 **elements** as if they were independent registers, so regardless of
1164 the length, any parts of a given actual register that are not involved
1165 in the operation are **NOT** modified, but are **PRESERVED**.
1166
1167 For example:
1168
1169 * when the vector bit is clear and elwidth set to 16 on the destination
1170 register, operations are truncated to 16 bit and then sign or zero
1171 extended to the *FULL* XLEN register width.
1172 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1173 groups of elwidth sized elements do not fill an entire XLEN register),
1174 the "top" bits of the destination register do *NOT* get modified, zero'd
1175 or otherwise overwritten.
1176
1177 SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of the
multi-element operation.
1180
1181 Other microarchitectures may choose to provide byte-level write-enable
1182 lines on the register file, such that each 64 bit register in an RV64
1183 system requires 8 WE lines. Scalar RV64 operations would require
1184 activation of all 8 lines, where SV elwidth based operations would
1185 activate the required subset of those byte-level write lines.
1186
1187 Example:
1188
1189 * rs1, rs2 and rd are all set to 8-bit
1190 * VL is set to 3
1191 * RV64 architecture is set (UXL=64)
1192 * add operation is carried out
1193 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1194 concatenated with similar add operations on bits 15..8 and 7..0
1195 * bits 24 through 63 **remain as they originally were**.
1196
1197 Example SIMD micro-architectural implementation:
1198
1199 * SIMD architecture works out the nearest round number of elements
1200 that would fit into a full RV64 register (in this case: 8)
1201 * SIMD architecture creates a hidden predicate, binary 0b00000111
1202 i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1203 * SIMD architecture goes ahead with the add operation as if it
1204 was a full 8-wide batch of 8 adds
1205 * SIMD architecture passes top 5 elements through the adders
1206 (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 (unmodified) 8-bit elements back
  and stores them in rd.
1209
1210 This requires a read on rd, however this is required anyway in order
1211 to support non-zeroing mode.
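
A small python sketch of the example above (illustrative only): elwidth=8,
VL=3 on RV64, with bits 24 through 63 of the destination preserved exactly
as a hidden-predicate or byte-level write-enable implementation would
preserve them:

    def simd_add_elwidth8(rd_old, rs1, rs2, VL):
        """Byte-wise add of the lowest VL bytes; upper bytes preserved."""
        result = rd_old
        for byte in range(VL):
            a = (rs1 >> (8 * byte)) & 0xff
            b = (rs2 >> (8 * byte)) & 0xff
            s = (a + b) & 0xff
            result &= ~(0xff << (8 * byte))   # clear the target byte...
            result |= s << (8 * byte)         # ...and write the 8-bit sum
        return result

    print(hex(simd_add_elwidth8(0x1122334455667788, 0x01010101, 0x02020202, 3)))
    # -> 0x1122334455030303  (bits 24..63 remain as they originally were)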
1212
1213 ## Polymorphic floating-point
1214
1215 Standard scalar RV integer operations base the register width on XLEN,
1216 which may be changed (UXL in USTATUS, and the corresponding MXL and
1217 SXL in MSTATUS and SSTATUS respectively). Integer LOAD, STORE and
1218 arithmetic operations are therefore restricted to an active XLEN bits,
1219 with sign or zero extension to pad out the upper bits when XLEN has
1220 been dynamically set to less than the actual register size.
1221
1222 For scalar floating-point, the active (used / changed) bits are
1223 specified exclusively by the operation: ADD.S specifies an active
1224 32-bits, with the upper bits of the source registers needing to
1225 be all 1s ("NaN-boxed"), and the destination upper bits being
1226 *set* to all 1s (including on LOAD/STOREs).
1227
Where elwidth is set to default (on any source or the destination)
it is obvious that this NaN-boxing behaviour can and should be
preserved. When elwidth is non-default, things are less obvious and
need to be thought through. Here is a normal (scalar) sequence,
assuming an RV64 which supports Quad (128-bit) FLEN:
1233
1234 * FLD loads 64-bit wide from memory. Top 64 MSBs are set to all 1s
1235 * ADD.D performs a 64-bit-wide add. Top 64 MSBs of destination set to 1s.
1236 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1237 top 64 MSBs ignored.
1238
1239 Therefore it makes sense to mirror this behaviour when, for example,
1240 elwidth is set to 32. Assume elwidth set to 32 on all source and
1241 destination registers:
1242
1243 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1244 floating-point numbers.
1245 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1246 in bits 0-31 and the second in bits 32-63.
1247 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1248
1249 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1250 of the registers either during the FLD **or** the ADD.D. The reason
1251 is that, effectively, the top 64 MSBs actually represent a completely
1252 independent 64-bit register, so overwriting it is not only gratuitous
1253 but may actually be harmful for a future extension to SV which may
1254 have a way to directly access those top 64 bits.
1255
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry. Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
1261
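Expressed as a rough decision rule (a sketch only; the function name is
illustrative and the real behaviour is per-register, driven by the CSR
table entries):

    # Sketch: when do the upper bits of an FP register get modified
    # (NaN-boxed to all 1s)?  Illustrative only.
    def fp_upper_bits_modified(elwidth_is_default, isvec):
        # only the scalar, default-elwidth case follows standard RV
        # NaN-boxing; every other combination preserves the upper parts
        return elwidth_is_default and not isvec
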
1262 Ultimately if elwidth is default and isvec false on *all* source
1263 and destination registers, a SimpleV instruction defaults completely
1264 to standard RV scalar behaviour (this holds true for **all** operations,
1265 right across the board).
1266
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still
perform multiple ADD operations, just at different widths. A future
extension to SimpleV may actually allow ADD.S to access the upper bits
of the register, effectively breaking down a 128-bit register into a
bank of 4 independently-accessible 32-bit registers.
1273
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value. On a superscalar OoO architecture
there may be absolutely no difference; however, simpler SIMD-style
microarchitectures may not have the infrastructure in place to detect
the difference, such that when VL=8 and an ADD.D instruction is issued,
it completes in two cycles (or more) rather than one, where an ADD.Q
issued instead on such simpler microarchitectures would complete in one.
1284
1285 ## Specific instruction walk-throughs
1286
1287 This section covers walk-throughs of the above-outlined procedure
1288 for converting standard RISC-V scalar arithmetic operations to
1289 polymorphic widths, to ensure that it is correct.
1290
1291 ### add
1292
1293 Standard Scalar RV32/RV64 (xlen):
1294
1295 * RS1 @ xlen bits
1296 * RS2 @ xlen bits
1297 * add @ xlen bits
1298 * RD @ xlen bits
1299
1300 Polymorphic variant:
1301
1302 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1303 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1304 * add @ max(rs1, rs2) bits
1305 * RD @ rd bits. zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1306
1307 Note here that polymorphic add zero-extends its source operands,
1308 where addw sign-extends.
1309
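A minimal sketch of the above rules (illustrative only; operands are plain
unsigned integers and the helper name is not part of the spec):

    # Sketch: polymorphic add, zero-extension variant (illustrative only).
    def poly_add(src1, src2, rs1_bits, rs2_bits, rd_bits):
        opwidth = max(rs1_bits, rs2_bits)
        opmask = (1 << opwidth) - 1
        result = ((src1 & opmask) + (src2 & opmask)) & opmask  # add @ opwidth
        # RD: zero-extend if rd is wider (a no-op on an unsigned value),
        # otherwise truncate to rd bits
        return result & ((1 << rd_bits) - 1)

    # e.g. poly_add(0xFF, 0x01, 8, 16, 8) == 0x00 (truncated back to 8 bits)
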
1310 ### addw
1311
1312 The RV Specification specifically states that "W" variants of arithmetic
1313 operations always produce 32-bit signed values. In a polymorphic
1314 environment it is reasonable to assume that the signed aspect is
1315 preserved, where it is the length of the operands and the result
1316 that may be changed.
1317
1318 Standard Scalar RV64 (xlen):
1319
1320 * RS1 @ xlen bits
1321 * RS2 @ xlen bits
1322 * add @ xlen bits
1323 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1324
1325 Polymorphic variant:
1326
1327 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1328 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1329 * add @ max(rs1, rs2) bits
1330 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1331
1332 Note here that polymorphic addw sign-extends its source operands,
1333 where add zero-extends.
1334
This requires a little more in-depth analysis. Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extension will occur. It is
only where the bitwidths of rs1 and rs2 differ that the lesser-width
operand will be sign-extended.
1339
1340 Effectively however, both rs1 and rs2 are being sign-extended (or
1341 truncated), where for add they are both zero-extended. This holds true
1342 for all arithmetic operations ending with "W".
1343
1344 ### addiw
1345
1346 Standard Scalar RV64I:
1347
1348 * RS1 @ xlen bits, truncated to 32-bit
1349 * immed @ 12 bits, sign-extended to 32-bit
1350 * add @ 32 bits
* RD @ xlen bits. sign-extend the 32-bit result to xlen.
1352
1353 Polymorphic variant:
1354
1355 * RS1 @ rs1 bits
1356 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1357 * add @ max(rs1, 12) bits
1358 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1359
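The sign-extending variants can be sketched in the same style (illustrative
only; helper names are not part of the spec):

    # Sketch: sign-extension helper and polymorphic addiw (illustrative only).
    def sext(val, bits):
        val &= (1 << bits) - 1
        return val - (1 << bits) if val & (1 << (bits - 1)) else val

    def poly_addiw(src1, imm12, rs1_bits, rd_bits):
        opwidth = max(rs1_bits, 12)
        result = sext(src1, rs1_bits) + sext(imm12, 12)    # add @ opwidth
        # RD: sign-extend if rd is wider, otherwise truncate to rd bits
        return sext(result, min(rd_bits, opwidth)) & ((1 << rd_bits) - 1)

The same pattern (sign-extend both operands to the maximum width, then
sign-extend or truncate into RD) applies to addw and the other "W"
operations.
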
1360 # Predication Element Zeroing
1361
1362 The introduction of zeroing on traditional vector predication is usually
1363 intended as an optimisation for lane-based microarchitectures with register
1364 renaming to be able to save power by avoiding a register read on elements
1365 that are passed through en-masse through the ALU. Simpler microarchitectures
1366 do not have this issue: they simply do not pass the element through to
1367 the ALU at all, and therefore do not store it back in the destination.
1368 More complex non-lane-based micro-architectures can, when zeroing is
1369 not set, use the predication bits to simply avoid sending element-based
1370 operations to the ALUs, entirely: thus, over the long term, potentially
1371 keeping all ALUs 100% occupied even when elements are predicated out.
1372
1373 SimpleV's design principle is not based on or influenced by
1374 microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(i.e. whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1379
1380 ## Single-predication (based on destination register)
1381
1382 Zeroing on predication for arithmetic operations is taken from
1383 the destination register's predicate. i.e. the predication *and*
1384 zeroing settings to be applied to the whole operation come from the
1385 CSR Predication table entry for the destination register.
Thus, when zeroing is set for the predication of a destination element
and the predication bit is clear, the destination element is *set*
to zero (twin-predication is slightly different, and will be covered
next).
1390
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
1393
    for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
            while (!(predval & 1<<i) && i < VL)
                if (int_vec[rd ].isvector)  { ird += 1; }
                if (int_vec[rs1].isvector)  { irs1 += 1; }
                if (int_vec[rs2].isvector)  { irs2 += 1; }
            if i == VL:
                return
        if (predval & 1<<i)
            src1 = ....
            src2 = ...
            result = src1 + src2 # actual add (or other op) here
            set_polymorphed_reg(rd, destwid, ird, result)
            if int_vec[rd].ffirst and result == 0:
                VL = i # result was zero, end loop early, return VL
                return
            if (!int_vec[rd].isvector) return
        else if zeroing:
            result = 0
            set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { ird += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
1420
1421 The optimisation to skip elements entirely is only possible for certain
1422 micro-architectures when zeroing is not set. However for lane-based
1423 micro-architectures this optimisation may not be practical, as it
1424 implies that elements end up in different "lanes". Under these
1425 circumstances it is perfectly fine to simply have the lanes
1426 "inactive" for predicated elements, even though it results in
1427 less than 100% ALU utilisation.
1428
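A tiny worked example (illustrative values only) of the difference the
zeroing flag makes to the destination vector:

    # Sketch: destination contents after a predicated op, VL=4, pred=0b0101.
    VL, pred = 4, 0b0101
    old_rd   = [9, 9, 9, 9]          # previous destination contents
    computed = [1, 2, 3, 4]          # per-element ALU results

    non_zeroing = [computed[i] if pred & (1 << i) else old_rd[i]
                   for i in range(VL)]   # -> [1, 9, 3, 9]  (preserved)
    zeroing     = [computed[i] if pred & (1 << i) else 0
                   for i in range(VL)]   # -> [1, 0, 3, 0]  (zeroed)
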
1429 ## Twin-predication (based on source and destination register)
1430
Twin-predication is not that much different, except that the source
is independently zero-predicated from the destination. This means that
the source may be zero-predicated *or* the destination zero-predicated
*or both*, or neither.
1435
When, with twin-predication, zeroing is set on the source and not
the destination, a *clear* source predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up
an *address* of zero).
1442
1443 When zeroing is set on the destination and not the source, then just
1444 as with single-predicated operations, a zero is stored into the destination
1445 element (or target memory address for a STORE).
1446
Zeroing on both source and destination effectively results in the
destination predicate being the bitwise AND of the source and destination
predicates: wherever either the source predicate OR the destination
predicate is zero, a zero element will ultimately end up in the
destination register.
1451
1452 However: this may not necessarily be the case for all operations;
1453 implementors, particularly of custom instructions, clearly need to
1454 think through the implications in each and every case.
1455
1456 Here is pseudo-code for a twin zero-predicated operation:
1457
    function op_mv(rd, rs) # MV not VMV!
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
        pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL):
            if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
            if ((pd & 1<<j))
                if ((ps & 1<<i))
                    sourcedata = ireg[rs+i];
                else
                    sourcedata = 0
                ireg[rd+j] <= sourcedata
            else if (zerodst)
                ireg[rd+j] <= 0
            if (int_csr[rs].isvec)
                i++;
            if (int_csr[rd].isvec)
                j++;
            else
                if ((pd & 1<<j))
                    break;
1481
1482 Note that in the instance where the destination is a scalar, the hardware
1483 loop is ended the moment a value *or a zero* is placed into the destination
1484 register/element. Also note that, for clarity, variable element widths
1485 have been left out of the above.
1486
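For completeness, here is a small sketch (illustrative only, mirroring the
pseudo-code above) of what lands in a destination element when *both*
source and destination zeroing are enabled, so that no elements are
skipped:

    # Sketch: one element of a twin-predicated MV with both zeroing flags set.
    # ps_bit / pd_bit are the relevant source / destination predicate bits.
    def mv_element_both_zeroing(src, ps_bit, pd_bit):
        if pd_bit:
            return src if ps_bit else 0   # zero substituted for masked source
        return 0                          # masked destination written as zero

    # data survives only where both bits are set:
    assert mv_element_both_zeroing(0x55, 1, 1) == 0x55
    assert mv_element_both_zeroing(0x55, 0, 1) == 0
    assert mv_element_both_zeroing(0x55, 1, 0) == 0
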
1487 # Subsets of RV functionality
1488
1489 This section describes the differences when SV is implemented on top of
1490 different subsets of RV.
1491
1492 ## Common options
1493
1494 It is permitted to only implement SVprefix and not the VBLOCK instruction
1495 format option, and vice-versa. UNIX Platforms **MUST** raise illegal
1496 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1497 traps may emulate the format.
1498
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
*MUST* raise illegal instruction on implementations that do not support
VL or SUBVL.
1503
It is permitted to limit the size of either (or both) of the register
files down to the original size of the standard RV architecture. However,
reducing them below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1508
1509 ## RV32 / RV32F
1510
1511 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1512 maximum limit for predication is also restricted to 32 bits. Whilst not
1513 actually specifically an "option" it is worth noting.
1514
1515 ## RV32G
1516
Normally in standard RV32 it does not make much sense to have RV32G.
The critical instructions that are missing in standard RV32 are those
for moving data between the double-width floating-point registers and
the integer ones, as well as the FCVT routines.
1521
1522 In an earlier draft of SV, it was possible to specify an elwidth
1523 of double the standard register size: this had to be dropped,
1524 and may be reintroduced in future revisions.
1525
1526 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1527
1528 When floating-point is not implemented, the size of the User Register and
1529 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1530 per table).
1531
1532 ## RV32E
1533
1534 In embedded scenarios the User Register and Predication CSRs may be
1535 dropped entirely, or optionally limited to 1 CSR, such that the combined
1536 number of entries from the M-Mode CSR Register table plus U-Mode
1537 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1538 zero) only 2 16-bit entries (M-Mode CSR table only). Likewise for
1539 the Predication CSR tables.
1540
1541 RV32E is the most likely candidate for simply detecting that registers
1542 are marked as "vectorised", and generating an appropriate exception
1543 for the VL loop to be implemented in software.
1544
1545 ## RV128
1546
1547 RV128 has not been especially considered, here, however it has some
1548 extremely large possibilities: double the element width implies
1549 256-bit operands, spanning 2 128-bit registers each, and predication
1550 of total length 128 bit given that XLEN is now 128.
1551
1552 # Example usage
1553
1554 TODO evaluate strncpy and strlen
1555 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1556
## strncpy <a name="strncpy"></a>
1558
1559 RVV version:
1560
    strncpy:
        c.mv a3, a0              # Copy dst
    loop:
        setvli x0, a2, vint8     # Vectors of bytes.
        vlbff.v v1, (a1)         # Get src bytes
        vseq.vi v0, v1, 0        # Flag zero bytes
        vmfirst a4, v0           # Zero found?
        vmsif.v v0, v0           # Set mask up to and including zero byte.
        vsb.v v1, (a3), v0.t     # Write out bytes
        c.bgez a4, exit          # Done
        csrr t1, vl              # Get number of bytes fetched
        c.add a1, a1, t1         # Bump src pointer
        c.sub a2, a2, t1         # Decrement count.
        c.add a3, a3, t1         # Bump dst pointer
        c.bnez a2, loop          # Anymore?

    exit:
        c.ret
1579
1580 SV version (WIP):
1581
    strncpy:
        c.mv a3, a0
        VBLK.RegCSR[t0] = 8bit, t0, vector
        VBLK.PredTb[t0] = ffirst, x0, inv
    loop:
        VBLK.SETVLI a2, t4, 8    # t4 and VL now 1..8 (MVL=8)
        c.ldb t0, (a1)           # t0 fail first mode
        c.bne t0, x0, allnonzero # still ff
        # VL (t4) points to last nonzero
        c.addi t4, t4, 1         # include zero
        c.stb t0, (a3)           # store incl zero
        c.ret                    # end subroutine
    allnonzero:
        c.stb t0, (a3)           # VL legal range
        c.add a1, a1, t4         # Bump src pointer
        c.sub a2, a2, t4         # Decrement count.
        c.add a3, a3, t4         # Bump dst pointer
        c.bnez a2, loop          # Anymore?
    exit:
        c.ret
1602
1603 Notes:
1604
1605 * Setting MVL to 8 is just an example. If enough registers are spare it
1606 may be set to XLEN which will require a bank of 8 scalar registers for
1607 a1, a3 and t0.
1608 * obviously if that is done, t0 is not separated by 8 full registers, and
1609 would overwrite t1 thru t7. x80 would work well, as an example, instead.
1610 * with the exception of the GETVL (a pseudo code alias for csrr), every
1611 single instruction above may use RVC.
1612 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1613 registers through redirection
1614 * RVC C.LW and C.SW may be used because the W format may be overridden by
1615 the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1616 * with the exception of the GETVL, all Vector Context may be done in
1617 VBLOCK form.
1618 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1619 just ffirst on t0
1620 * ldb and bne are both using t0, both in ffirst mode
1621 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1622 vectorised, no (un)sign-extension or truncation" mode.
1623 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
1624 into t0 (could contain zeros).
1625 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1626 scalar x0
1627 * however as t0 is in ffirst mode, the first fail will ALSO stop the
1628 compares, and reduce VL as well
1629 * the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1632 * SETVL sets *exactly* the requested amount into VL.
1633 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1634 activates but the bne allzeros does not.
1635 * this would cause the stb to copy up to the end of the legal memory
1636 * of course, on the next loop the ldb would throw a trap, as a1 now
1637 points to the first illegal mem location.
1638
1639 ## strcpy
1640
1641 RVV version:
1642
        mv a3, a0                # Save start
    loop:
        setvli a1, x0, vint8     # byte vec, x0 (Zero reg) => use max hardware len
        vldbff.v v1, (a3)        # Get bytes
        csrr a1, vl              # Get bytes actually read e.g. if fault
        vseq.vi v0, v1, 0        # Set v0[i] where v1[i] = 0
        add a3, a3, a1           # Bump pointer
        vmfirst a2, v0           # Find first set bit in mask, returns -1 if none
        bltz a2, loop            # Not found?
        add a0, a0, a1           # Sum start + bump
        add a3, a3, a2           # Add index of zero byte
        sub a0, a3, a0           # Subtract start address+bump
        ret
1656
1657 ## DAXPY <a name="daxpy"></a>
1658
1659 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]
1660
1661 Notes:
1662
1663 * Setting MVL to 4 is just an example. With enough space between the
1664 FP regs, MVL may be set to larger values
1665 * VBLOCK header takes 16 bits, 8-bit mode may be used on the registers,
1666 taking only another 16 bits, VBLOCK.SETVL requires 16 bits. Total
1667 overhead for use of VBLOCK: 48 bits (3 16-bit words).
1668 * All instructions except fmadd may use Compressed variants. Total
1669 number of 16-bit instruction words: 11.
1670 * Total: 14 16-bit words. By contrast, RVV requires around 18 16-bit words.
1671
1672 ## BigInt add <a name="bigadd"></a>
1673
1674 [[!inline raw="yes" pages="simple_v_extension/bigadd_example" ]]