   1 # Simple-V (Parallelism Extension Proposal) Appendix
   2
   3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
   4 * Status: DRAFTv0.6
   5 * Last edited: 28 jun 2019
   6 * main spec [[specification]]
   7
   8 [[!toc ]]
   9
  10 # Fail-on-first modes
  11
  12 Fail-on-first data dependency has different behaviour for traps than
  13 for conditional testing.  "Conditional" is taken to mean "anything
  14 that is zero", however with traps, the first element has to
  15 be given the opportunity to throw the exact same trap that would
  16 be thrown if this were a scalar operation (when VL=1).
  17
Note that implementors are required to choose one mode or the other,
mutually exclusively: an instruction is **not** permitted to fail on a
trap *and* fail a conditional test at the same time.  This advice applies
to custom opcode writers as well as future extension writers.
  22
  23 ## Fail-on-first traps
  24
  25 Except for the first element, ffirst stops sequential element processing
  26 when a trap occurs.  The first element is treated normally (as if ffirst
  27 is clear).  Should any subsequent element instruction require a trap,
  28 instead it and subsequent indexed elements are ignored (or cancelled in
  29 out-of-order designs), and VL is set to the *last* in-sequence instruction
  30 that did not take the trap.
  31
Note that predicated-out elements (where the predicate mask bit is
zero) are clearly excluded (i.e. the trap will not occur).  However,
note that the loop still had to test the predicate bit: thus on return,
VL is set to include both the elements that did not take the trap *and*
the elements that were predicated (masked) out, and therefore not tested,
up to the point where the trap occurred.
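
The VL-truncation rule above can be sketched in Python.  This is an
illustrative model only, not part of the specification: `ffirst_trap_vl`,
the `would_trap` callback and the `ScalarTrap` exception are hypothetical
names, and the sketch assumes that "first element" means element index 0.

```python
class ScalarTrap(Exception):
    """Raised when element 0 traps, exactly as a scalar op would."""


def ffirst_trap_vl(vl, predicate, would_trap):
    """Return the new VL after a fail-on-first (trap) hardware loop.

    vl         -- current vector length
    predicate  -- integer bitmask; bit i set means element i is active
    would_trap -- hypothetical callback: would_trap(i) is True if
                  processing element i would raise a trap
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # masked out: not tested, but still counted
        if would_trap(i):
            if i == 0:
                raise ScalarTrap(i)  # first element traps as normal
            return i  # elements 0..i-1 are kept, i onwards are ignored
    return vl


# element 3 would trap; element 1 is masked out but is still counted
new_vl = ffirst_trap_vl(8, 0b11111101, lambda i: i == 3)
```

Here `new_vl` covers elements 0, 1 and 2, even though element 1 was never
tested.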
  38
  39 If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
  40 will cause a trap as normal (as if ffirst is not set); subsequently, the
  41 trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
  42 be modified.  Traps must analyse (x)eSTATE (subvl offset indices) to
  43 determine the element that caused the trap.
  44
  45 Given that predication bits apply to SUBVL groups, the same rules apply
  46 to predicated-out (masked-out) sub-groups in calculating the value that
  47 VL is set to.
  48
  49 ## Fail-on-first conditional tests
  50
  51 ffirst stops sequential (or sequentially-appearing in the case of
  52 out-of-order designs) element conditional testing on the first element
  53 result being zero (or other "fail" condition).  VL is set to the number
  54 of elements that were (sequentially) processed before the fail-condition
  55 was encountered.
  56
Note that just as with traps, if SUBVL!=1, the first fail-condition in the
*sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is
still (entirely) excluded from the count (from setting VL).  i.e. VL is
set to the total number of *sub-groups* that had no fail-condition up
until execution was stopped.  However, again: SUBVL must not be modified:
traps must analyse (x)eSTATE (subvl offset indices) to determine the
element that caused the trap.
  65
  66 Note again that, just as with traps, predicated-out (masked-out) elements
  67 are included in the (sequential) count leading up to the fail-condition,
  68 even though they were not tested.
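
A minimal Python sketch of the counting rule, including the SUBVL case, is
given below.  The function name `ffirst_cond_vl` and its arguments are
hypothetical, and "fail" is modelled simply as a zero result.

```python
def ffirst_cond_vl(results, predicate, vl, subvl=1):
    """Return the new VL after a fail-on-first conditional test.

    results   -- list of vl * subvl element results (zero means "fail")
    predicate -- integer bitmask, one bit per sub-group
    """
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue  # masked-out sub-group: untested but counted
        group = results[i * subvl:(i + 1) * subvl]
        if any(r == 0 for r in group):
            return i  # the failing sub-group is excluded entirely
    return vl


results = [5, 7, 0, 9, 4, 4]
vl_flat = ffirst_cond_vl(results, 0b111111, vl=6)           # SUBVL=1
vl_pairs = ffirst_cond_vl(results, 0b111, vl=3, subvl=2)    # SUBVL=2
vl_masked = ffirst_cond_vl(results, 0b101, vl=3, subvl=2)   # group 1 masked
```

With SUBVL=2 the second pair contains a zero, so only one sub-group is
counted, even though the other element of that pair passed the test.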
  69
  70 # Instructions <a name="instructions" />
  71
Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, no new instructions are needed.
Compared to RVV: *all* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field).  Despite the removal of all operations, and
with the exception of CLIP and VSELECT.X,
*all instructions from RVV Base are topologically re-mapped and retain their
complete functionality, intact*.  Note that if RV64G ever had
a MV.X added as well as FCLIP, the full functionality of RVV-Base would
be obtained in SV.
  82
  83 Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
  84 equivalents, so are left out of Simple-V.  VSELECT could be included if
  85 there existed a MV.X instruction in RV (MV.X is a hypothetical
  86 non-immediate variant of MV that would allow another register to
  87 specify which register was to be copied).  Note that if any of these three
  88 instructions are added to any given RV extension, their functionality
  89 will be inherently parallelised.
  90
  91 With some exceptions, where it does not make sense or is simply too
  92 challenging, all RV-Base instructions are parallelised:
  93
* CSR instructions: whilst a case could be made for fast-polling of
  a CSR into multiple registers, or for being able to copy multiple
  contiguously addressed CSRs into contiguous registers, and so on,
  CSRs are the fundamental core basis of SV.  If parallelised, extreme
  care would need to be taken.  Additionally, CSR reads are done
  using x0, and it is *really* inadvisable to tag x0.
 100 * LUI, C.J, C.JR, WFI, AUIPC are not suitable for parallelising so are
 101   left as scalar.
 102 * LR/SC could hypothetically be parallelised however their purpose is
 103   single (complex) atomic memory operations where the LR must be followed
 104   up by a matching SC.  A sequence of parallel LR instructions followed
 105   by a sequence of parallel SC instructions therefore is guaranteed to
 106   not be useful. Not least: the guarantees of a Multi-LR/SC
 107   would be impossible to provide if emulated in a trap.
 108 * EBREAK, NOP, FENCE and others do not use registers so are not inherently
 109   paralleliseable anyway.
 110
 111 All other operations using registers are automatically parallelised.
 112 This includes AMOMAX, AMOSWAP and so on, where particular care and
 113 attention must be paid.
 114
 115 Example pseudo-code for an integer ADD operation (including scalar
 116 operations).  Floating-point uses the FP Register Table.
 117
 118 [[!inline raw="yes" pages="simple_v_extension/simple_add_example" ]]
 119
 120 Note that for simplicity there is quite a lot missing from the above
 121 pseudo-code: PCVBLK, element widths, zeroing on predication, dimensional
 122 reshaping and offsets and so on.  However it demonstrates the basic
 123 principle.  Augmentations that produce the full pseudo-code are covered in
 124 other sections.
 125
 126 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
 127
 128 Adding in support for SUBVL is a matter of adding in an extra inner
 129 for-loop, where register src and dest are still incremented inside the
 130 inner part. Note that the predication is still taken from the VL index.
 131
 132 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
 133 indexed by "(i)"
 134
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, s, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs, indexed by i
            # actual add is here (at last)
            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
            if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          # VL * SUBVL elements are processed in total
          if (id == VL*SUBVL or irs1 == VL*SUBVL or irs2 == VL*SUBVL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
 158
 159
 160 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
 161 elwidth handling etc. all left out.
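
The indexing rule (elements by "i * SUBVL + s", predicate bits by "i") may
be easier to see in a small executable model.  The following Python sketch
covers only the vector-vector case; `subvl_add` and the flat `regs` list
are hypothetical stand-ins for the hardware loop and the integer register
file.

```python
def subvl_add(regs, rd, rs1, rs2, vl, subvl, predicate):
    """Vector-vector add over vl sub-groups of subvl elements each.

    regs is a flat list standing in for the integer register file;
    rd, rs1 and rs2 are the (already redirected) base register numbers.
    """
    for i in range(vl):
        for s in range(subvl):
            if (predicate >> i) & 1:     # predicate bit indexed by i only
                idx = i * subvl + s      # element indexed by i*SUBVL+s
                regs[rd + idx] = regs[rs1 + idx] + regs[rs2 + idx]


regs = list(range(32))  # register n initially holds the value n
subvl_add(regs, rd=24, rs1=8, rs2=16, vl=3, subvl=2, predicate=0b101)
```

With predicate bit 1 clear, both elements of the second sub-group
(registers 26 and 27) are left untouched.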
 162
 163 ## Instruction Format
 164
 165 It is critical to appreciate that there are
 166 **no operations added to SV, at all**.
 167
 168 Instead, by using CSRs to tag registers as an indication of "changed
 169 behaviour", SV *overloads* pre-existing branch operations into predicated
 170 variants, and implicitly overloads arithmetic operations, MV, FCVT, and
 171 LOAD/STORE depending on CSR configurations for bitwidth and predication.
 172 **Everything** becomes parallelised.  *This includes Compressed
 173 instructions* as well as any future instructions and Custom Extensions.
 174
Note: using CSR tags to change the behaviour of instructions is nothing new,
including in RISC-V.  UXL, SXL and MXL change the behaviour so that
XLEN=32/64/128.  FRM changes the behaviour of the floating-point unit, to
alter the rounding mode.  Other architectures change the LOAD/STORE
byte-order from big-endian to little-endian on a per-instruction basis.
SV is just a little more... comprehensive in its effect on instructions.
 181
 182 ## Branch Instructions
 183
Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table).  As such, "ffirst" - fail-on-first - condition mode can be enabled.
See ffirst mode in the Predication Table section.
 189
 190 ### Standard Branch <a name="standard_branch"></a>
 191
 192 Branch operations use standard RV opcodes that are reinterpreted to
 193 be "predicate variants" in the instance where either of the two src
 194 registers are marked as vectors (active=1, vector=1).
 195
 196 Note that the predication register to use (if one is enabled) is taken from
 197 the *first* src register, and that this is used, just as with predicated
 198 arithmetic operations, to mask whether the comparison operations take
 199 place or not.  The target (destination) predication register
 200 to use (if one is enabled) is taken from the *second* src register.
 201
 202 If either of src1 or src2 are scalars (whether by there being no
 203 CSR register entry or whether by the CSR entry specifically marking
 204 the register as "scalar") the comparison goes ahead as vector-scalar
 205 or scalar-vector.
 206
In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
 212
Note that when zero-predication is enabled (from source rs1),
a cleared bit in the predicate indicates that the result
of the compare is set to "false", i.e. that the corresponding
destination bit (or result) be set to zero.  Contrast this with
when zeroing is not set: bits in the destination predicate are
only *set*; they are **not** cleared.  This is important to appreciate,
as there may be an expectation that, going into the hardware-loop,
the destination predicate is always set to zero:
this is **not** the case.  The destination predicate is only set
to zero if **zeroing** is enabled.
 223
Note that just as with the standard (scalar, non-predicated) branch
operations, BLE, BGT, BLEU and BGTU may be synthesised by swapping
src1 and src2.
 227
 228 In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
 229 for predicated compare operations of function "cmp":
 230
    for (int i=0; i<vl; ++i)
      if ([!]preg[p][i])
         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                           s2 ? vreg[rs2][i] : sreg[rs2]);
 235
 236 With associated predication, vector-length adjustments and so on,
 237 and temporarily ignoring bitwidth (which makes the comparisons more
 238 complex), this becomes:
 239
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
      if not (ps & (1<<i))
        # element is predicated out: no comparison takes place
        if (zeroing)
           result &= ~(1<<i);
      else
        # element is active: perform the comparison
        if (cmp(s1 ? reg[src1+i]:reg[src1],
                s2 ? reg[src2+i]:reg[src2]))
            result |= 1<<i;
        else
            result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
 277
 278 Notes:
 279
 280 * Predicated SIMD comparisons would break src1 and src2 further down
 281   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 282   Reordering") setting Vector-Length times (number of SIMD elements) bits
 283   in Predicate Register rd, as opposed to just Vector-Length bits.
 284 * The execution of "parallelised" instructions **must** be implemented
 285   as "re-entrant" (to use a term from software).  If an exception (trap)
 286   occurs during the middle of a vectorised
 287   Branch (now a SV predicated compare) operation, the partial results
 288   of any comparisons must be written out to the destination
 289   register before the trap is permitted to begin.  If however there
 290   is no predicate, the **entire** set of comparisons must be **restarted**,
 291   with the offset loop indices set back to zero.  This is because
 292   there is no place to store the temporary result during the handling
 293   of traps.
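
As a cross-check of the pseudocode above, here is a small Python model of
the vectorised compare-and-branch.  It is a sketch under stated
assumptions, not a reference implementation: `sv_branch` and its arguments
are hypothetical names, bitwidth and re-entrancy are ignored, and the
branch decision is masked by the active predicate bits (within VL) so that
stale destination bits cannot affect it.

```python
import operator


def sv_branch(cmp, regs, src1, src2, s1, s2, vl, ps,
              dest_pred=None, zeroing=False):
    """Return (branch_taken, result_mask) for a vectorised compare-branch.

    s1, s2    -- True if src1 / src2 are tagged as vectors
    ps        -- source predicate bitmask
    dest_pred -- existing destination predicate value, or None if
                 no destination predicate register is configured
    """
    if not s1 and not s2:
        return cmp(regs[src1], regs[src2]), None  # plain scalar branch
    result = 0 if (dest_pred is None or zeroing) else dest_pred
    for i in range(vl):
        bit = 1 << i
        if not ps & bit:
            if zeroing:
                result &= ~bit  # masked out: zero the result bit
            continue
        a = regs[src1 + i] if s1 else regs[src1]
        b = regs[src2 + i] if s2 else regs[src2]
        if cmp(a, b):
            result |= bit
        else:
            result &= ~bit
    active = ps & ((1 << vl) - 1)
    return (result & active) == active, result


regs = [0] * 16
regs[1:5] = [5, 6, 7, 8]
regs[8:12] = [5, 6, 0, 8]
all_tested = sv_branch(operator.eq, regs, 1, 8, True, True, 4, 0b1111)
one_masked = sv_branch(operator.eq, regs, 1, 8, True, True, 4, 0b1011)
```

In the first call element 2 fails the comparison, so the branch is not
taken; in the second call element 2 is predicated out and the branch goes
ahead.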
 294
 295 TODO: predication now taken from src2.  also branch goes ahead
 296 if all compares are successful.
 297
 298 Note also that where normally, predication requires that there must
 299 also be a CSR register entry for the register being used in order
 300 for the **predication** CSR register entry to also be active,
 301 for branches this is **not** the case.  src2 does **not** have
 302 to have its CSR register entry marked as active in order for
 303 predication on src2 to be active.
 304
 305 Also note: SV Branch operations are **not** twin-predicated
 306 (see Twin Predication section).  This would require three
 307 element offsets: one to track src1, one to track src2 and a third
 308 to track where to store the accumulation of the results.  Given
 309 that the element offsets need to be exposed via CSRs so that
 310 the parallel hardware looping may be made re-entrant on traps
 311 and exceptions, the decision was made not to make SV Branches
 312 twin-predicated.
 313
 314 ### Floating-point Comparisons
 315
There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
 320
 321 In RV (scalar) Base, a branch on a floating-point compare is
 322 done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
 323 This does extend to SV, as long as x1 (in the example sequence given)
 324 is vectorised.  When that is the case, x1..x(1+VL-1) will also be
 325 set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
 326 The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
 327 so on.  Consequently, unlike integer-branch, FP Compare needs no
 328 modification in its behaviour.
 329
In addition, it is noted that an entry "FNE" (the opposite of FEQ) is
missing.  In ordinary branch code this is fine, because the
standard RVF compare can always be followed up with an integer BEQ or
a BNE (or a compressed comparison to zero or non-zero); in predication
terms, however, the absence has more of an impact.  To deal with this,
SV's predication has had "invert" added to it.
 336
 337 Also: note that FP Compare may be predicated, using the destination
 338 integer register (rd) to determine the predicate.  FP Compare is **not**
 339 a twin-predication operation, as, again, just as with SV Branches,
 340 there are three registers involved: FP src1, FP src2 and INT rd.
 341
 342 Also: note that ffirst (fail first mode) applies directly to this operation.
 343
 344 ### Compressed Branch Instruction
 345
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries.  As however there is only the one source register,
given that c.beqz a10 is equivalent to beq a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
 352
 353 The specific required use of x0 is, with a little thought, quite obvious,
 354 but is counterintuitive.  Clearly it is **not** recommended to redirect
 355 x0 with a CSR register entry, however as a means to opaquely obtain
 356 a predication target it is the only sensible option that does not involve
 357 additional special CSRs (or, worse, additional special opcodes).
 358
 359 Note also that, just as with standard branches, the 2nd source
 360 (in this case x0 rather than src2) does **not** have to have its CSR
 361 register table marked as "active" in order for predication to work.
 362
 363 ## Vectorised Dual-operand instructions
 364
 365 There is a series of 2-operand instructions involving copying (and
 366 sometimes alteration):
 367
 368 * C.MV
 369 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 370 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 371 * LOAD(-FP) and STORE(-FP)
 372
 373 All of these operations follow the same two-operand pattern, so it is
 374 *both* the source *and* destination predication masks that are taken into
 375 account.  This is different from
 376 the three-operand arithmetic instructions, where the predication mask
 377 is taken from the *destination* register, and applied uniformly to the
 378 elements of the source register(s), element-for-element.
 379
 380 The pseudo-code pattern for twin-predicated operations is as
 381 follows:
 382
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip masked-out elements, without running past VL
        if (int_csr[rs].isvec) while (i < VL && !(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (j < VL && !(pd & 1<<j)) j++;
        if (i == VL || j == VL) break; # no active elements remain
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
 396
 397 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 398 and vector-vector, and predicated variants of all of those.
 399 Zeroing is not presently included (TODO).  As such, when compared
 400 to RVV, the twin-predicated variants of C.MV and FMV cover
 401 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 402 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
 403
 404 Note that:
 405
 406 * elwidth (SIMD) is not covered in the pseudo-code above
 407 * ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
 408   not covered
 409 * zero predication is also not shown (TODO).
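
To make the scalar/vector and predication combinations concrete, the
twin-predicated pattern can be modelled in Python.  This is a sketch only:
`twin_pred_op`, the flat `regs` list and the use of -1 as an "all enabled"
mask are assumptions made here, and zeroing and elwidth are again left out.

```python
def twin_pred_op(regs, rd, rs, vl, rs_isvec, rd_isvec,
                 ps=-1, pd=-1, op=lambda x: x):
    """Twin-predicated two-operand operation on a flat register list.

    ps / pd are the source / destination predicate bitmasks
    (-1 means "all elements enabled").
    """
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1  # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1  # skip masked-out destination elements
        if i == vl or j == vl:
            break  # no active elements remain
        regs[rd + j] = op(regs[rs + i])
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        else:
            break  # scalar destination: a single write only


regs = list(range(100, 132))  # register n initially holds 100 + n
# scalar src, vector dest, no predication: VSPLAT
twin_pred_op(regs, rd=8, rs=1, vl=4, rs_isvec=False, rd_isvec=True)
# vector src with predicate, vector dest: gather
twin_pred_op(regs, rd=24, rs=16, vl=4, rs_isvec=True, rd_isvec=True,
             ps=0b1010)
# vector src with 1-bit predicate, scalar dest: VEXTRACT
twin_pred_op(regs, rd=5, rs=16, vl=4, rs_isvec=True, rd_isvec=False,
             ps=0b0100)
```

The same function body produces a splat, a gather and an extract purely
from how the two registers are tagged and predicated.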
 410
 411 ### C.MV Instruction <a name="c_mv"></a>
 412
 413 There is no MV instruction in RV however there is a C.MV instruction.
 414 It is used for copying integer-to-integer registers (vectorised FMV
 415 is used for copying floating-point).
 416
 417 If either the source or the destination register are marked as vectors
 418 C.MV is reinterpreted to be a vectorised (multi-register) predicated
 419 move operation.  The actual instruction's format does not change:
 420
[[!table  data="""
15  12 | 11   7 | 6  2 | 1  0 |
funct4 | rd     | rs   | op   |
4      | 5      | 5    | 2    |
C.MV   | dest   | src  | C2   |
"""]]
 427
 428 A simplified version of the pseudocode for this operation is as follows:
 429
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip masked-out elements, without running past VL
        if (int_csr[rs].isvec) while (i < VL && !(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (j < VL && !(pd & 1<<j)) j++;
        if (i == VL || j == VL) break; # no active elements remain
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        ireg[rd+j] <= ireg[rs+i];
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
 443
 444 There are several different instructions from RVV that are covered by
 445 this one opcode:
 446
 447 [[!table  data="""
 448 src    | dest    | predication   | op             |
 449 scalar | vector  | none          | VSPLAT         |
 450 scalar | vector  | destination   | sparse VSPLAT  |
 451 scalar | vector  | 1-bit dest    | VINSERT        |
 452 vector | scalar  | 1-bit? src    | VEXTRACT       |
 453 vector | vector  | none          | VCOPY          |
 454 vector | vector  | src           | Vector Gather  |
 455 vector | vector  | dest          | Vector Scatter |
 456 vector | vector  | src & dest    | Gather/Scatter |
 457 vector | vector  | src == dest   | sparse VCOPY   |
 458 """]]
 459
 460 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 461 operations with zeroing off, and inversion on the src and dest predication
 462 for one of the two C.MV operations.  The non-inverted C.MV will place
 463 one set of registers into the destination, and the inverted one the other
 464 set.  With predicate-inversion, copying and inversion of the predicate mask
 465 need not be done as a separate (scalar) instruction.
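
The VMERGE construction can be illustrated with a short Python sketch.
The helper names `sparse_copy` and `vmerge` are hypothetical; each
`sparse_copy` call stands in for one C.MV with the same predicate applied
to both source and destination (the "sparse VCOPY" row of the table above).

```python
def sparse_copy(regs, rd, rs, vl, pred):
    """Model of C.MV with the same predicate on source and destination."""
    for i in range(vl):
        if (pred >> i) & 1:
            regs[rd + i] = regs[rs + i]


def vmerge(regs, rd, rs_a, rs_b, vl, pred):
    """Merge two vectors using two back-to-back predicated moves."""
    sparse_copy(regs, rd, rs_a, vl, pred)    # non-inverted C.MV
    sparse_copy(regs, rd, rs_b, vl, ~pred)   # inverted-predicate C.MV


regs = [0] * 32
regs[8:12] = [1, 2, 3, 4]
regs[16:20] = [10, 20, 30, 40]
vmerge(regs, rd=24, rs_a=8, rs_b=16, vl=4, pred=0b0101)
```

Elements whose predicate bit is set come from the first source, and the
remainder come from the second.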
 466
Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
 472
 473 ### FMV, FNEG and FABS Instructions
 474
These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.
 481
 482 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
 483
### FCVT Instructions
 485
 486 These are again identical in form to C.MV, except that they cover
 487 floating-point to integer and integer to floating-point.  When element
 488 width in each vector is set to default, the instructions behave exactly
 489 as they are defined for standard RV (scalar) operations, except vectorised
 490 in exactly the same fashion as outlined in C.MV.
 491
 492 However when the source or destination element width is not set to default,
 493 the opcode's explicit element widths are *over-ridden* to new definitions,
 494 and the opcode's element width is taken as indicative of the SIMD width
 495 (if applicable i.e. if packed SIMD is requested) instead.
 496
For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 32-bit (single-precision) floating-point
number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element and the higher 32-bits *also* converted to floating-point
and stored in the second.  The 32 bit size comes from the fact that
FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
divide that by two it means that rs1 element width is to be taken as 32.
 506
 507 Similar rules apply to the destination register.
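
The packed-SIMD example can be tried out numerically with the following
Python sketch.  It assumes little-endian element ordering, signed 32-bit
source elements, and that both single-precision results are packed into
one 64-bit destination value; `fcvt_packed_32` is a hypothetical helper,
not an SV-defined operation.

```python
import struct


def fcvt_packed_32(rs1_value):
    """Convert the two 32-bit signed halves of a 64-bit register value
    to single-precision floats, packed back into one 64-bit value."""
    raw = struct.pack("<Q", rs1_value)
    lo, hi = struct.unpack("<ii", raw)          # two signed 32-bit ints
    packed = struct.pack("<ff", float(lo), float(hi))
    return struct.unpack("<Q", packed)[0]


rd_value = fcvt_packed_32((7 << 32) | 3)        # low half 3, high half 7
first, second = struct.unpack("<ff", struct.pack("<Q", rd_value))
```

The low 32 bits (3) become the first destination element and the high
32 bits (7) become the second.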
 508
 509 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
 510
 511 An earlier draft of SV modified the behaviour of LOAD/STORE (modified
 512 the interpretation of the instruction fields).  This
 513 actually undermined the fundamental principle of SV, namely that there
 514 be no modifications to the scalar behaviour (except where absolutely
 515 necessary), in order to simplify an implementor's task if considering
 516 converting a pre-existing scalar design to support parallelism.
 517
 518 So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
 519 do not change in SV, however just as with C.MV it is important to note
 520 that dual-predication is possible.
 521
 522 In vectorised architectures there are usually at least two different modes
 523 for LOAD/STORE:
 524
 525 * Read (or write for STORE) from sequential locations, where one
 526   register specifies the address, and the one address is incremented
 527   by a fixed amount.  This is usually known as "Unit Stride" mode.
 528 * Read (or write) from multiple indirected addresses, where the
 529   vector elements each specify separate and distinct addresses.
 530
 531 To support these different addressing modes, the CSR Register "isvector"
 532 bit is used.  So, for a LOAD, when the src register is set to
 533 scalar, the LOADs are sequentially incremented by the src register
 534 element width, and when the src register is set to "vector", the
 535 elements are treated as indirection addresses.  Simplified
 536 pseudo-code would look like this:
 537
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip masked-out elements, without running past VL
        if (int_csr[rs].isvec) while (i < VL && !(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (j < VL && !(pd & 1<<j)) j++;
        if (i == VL || j == VL) break; # no active elements remain
        if (int_csr[rs].isvec)
          # indirect mode (multi mode): vectorised src holds addresses
          srcbase = ireg[rsv+i];
        else
          # unit stride mode: i never advances for a scalar src,
          # so the dest element index j provides the offset
          srcbase = ireg[rsv] + j * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 557
 558 Notes:
 559
* For simplicity, zeroing and elwidth are not included in the above:
  the key focus here is the decision-making for srcbase; vectorised
  rs means use sequentially-numbered registers as the indirection
  address, and scalar rs is "offset" mode.
* The test towards the end for whether both source and destination are
  scalar is what makes the above pseudo-code provide the "standard" RV
  Base behaviour for LD operations.
* The offset in bytes (XLEN/8) changes depending on whether the
  operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
  (8 bytes), and also whether the element width is over-ridden
  (see special element width section).
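
The difference between the two addressing modes can be sketched in Python
as follows.  Predication is omitted for brevity, memory is modelled as a
dict with one value per element address, and the unit-stride offset is
taken from the destination element index; all of these, along with the
name `sv_load`, are assumptions made for illustration.

```python
def sv_load(regs, mem, rd, rs, vl, rs_isvec, rd_isvec,
            imm=0, elbytes=8):
    """Model of a vectorised LD without predication.

    mem is a dict mapping byte address -> value (one value per element).
    """
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            base = regs[rs + i]               # multi-indirection
        else:
            base = regs[rs] + j * elbytes     # unit stride
        regs[rd + j] = mem[base + imm]
        if not rs_isvec and not rd_isvec:
            break                             # scalar-scalar LD
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1


mem = {0x1000 + 8 * k: 100 + k for k in range(8)}
regs = [0] * 32
regs[5] = 0x1000
sv_load(regs, mem, rd=8, rs=5, vl=4, rs_isvec=False, rd_isvec=True)

regs[16:20] = [0x1038, 0x1000, 0x1010, 0x1008]
sv_load(regs, mem, rd=24, rs=16, vl=4, rs_isvec=True, rd_isvec=True)
```

The first call reads four consecutive doublewords starting at the address
in x5; the second reads from four unrelated addresses held in x16 to x19.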
 571
 572 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
 573
 574 C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
 575 where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
 576 It is therefore possible to use predicated C.LWSP to efficiently
 577 pop registers off the stack (by predicating x2 as the source), cherry-picking
 578 which registers to store to (by predicating the destination).  Likewise
 579 for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
 580
 581 The two modes ("unit stride" and multi-indirection) are still supported,
 582 as with standard LD/ST.  Essentially, the only difference is that the
 583 use of x2 is hard-coded into the instruction.
 584
 585 **Note**: it is still possible to redirect x2 to an alternative target
 586 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
 587 general-purpose LOAD/STORE operations.
 588
 589 ## Compressed LOAD / STORE Instructions
 590
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE.  Again: setting scalar or vector mode
on the src for LOAD and dest for STORE switches mode from "Unit Stride"
to "Multi-indirection", respectively.
 596
 597 # Element bitwidth polymorphism <a name="elwidth"></a>
 598
 599 Element bitwidth is best covered as its own special section, as it
 600 is quite involved and applies uniformly across-the-board.  SV restricts
 601 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
 602
 603 The effect of setting an element bitwidth is to re-cast each entry
 604 in the register table, and for all memory operations involving
 605 load/stores of certain specific sizes, to a completely different width.
Thus in C-style terms, on an RV64 architecture, effectively each register
now looks like this:
 608
    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];
 618
 619 where the CSR Register table entry (not the instruction alone) determines
 620 which of those union entries is to be used on each operation, and the
 621 VL element offset in the hardware-loop specifies the index into each array.
 622
 623 However a naive interpretation of the data structure above masks the
 624 fact that setting VL greater than 8, for example, when the bitwidth is 8,
 625 accessing one specific register "spills over" to the following parts of
 626 the register file in a sequential fashion.  So a much more accurate way
 627 to reflect this would be:
 628
    typedef union {
        uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t  b[0]; // array of type uint8_t
        uint16_t s[0];
        uint32_t i[0];
        uint64_t l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];
 639
where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
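
The "overspill" behaviour can also be modelled without relying on zero-length
C arrays.  The following is a minimal Python sketch, not part of the
specification: it assumes RV64 (8 bytes per register), little-endian element
packing, and treats the whole register file as one flat byte array.  The names
`read_element` and `write_element` are hypothetical helpers introduced purely
for illustration, and `IndexError` stands in for whatever exception an
implementation actually raises.

```python
XLEN_BYTES = 8   # assumption: RV64, so 8 "real" bytes per register
NUM_REGS = 128   # maximum SV 7-bit regfile size

# The whole integer register file as one flat, byte-addressable array.
regfile = bytearray(NUM_REGS * XLEN_BYTES)


def _span(reg, elwidth_bits, idx):
    """Return (start, end) byte offsets of element idx, counted from reg."""
    nbytes = elwidth_bits // 8
    start = reg * XLEN_BYTES + idx * nbytes
    end = start + nbytes
    if start < 0 or end > len(regfile):
        # stands in for the exception the implementor is required to throw
        raise IndexError("element access beyond the end of the register file")
    return start, end


def read_element(reg, elwidth_bits, idx):
    """Read element idx of width elwidth_bits, starting at register reg."""
    start, end = _span(reg, elwidth_bits, idx)
    return int.from_bytes(regfile[start:end], "little")


def write_element(reg, elwidth_bits, idx, value):
    """Write element idx; bytes outside the element are left untouched."""
    start, end = _span(reg, elwidth_bits, idx)
    mask = (1 << elwidth_bits) - 1
    regfile[start:end] = (value & mask).to_bytes(end - start, "little")
```

With this model, writing 8-bit element 9 of x5 lands in byte 1 of x6, whilst
reading 8-bit element 8 starting from x127 raises the exception.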
 649
Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size; this pseudo-code is otherwise
identical to its non-polymorphic versions (above):
 653
 654     function op_add(rd, rs1, rs2) # add not VADD!
 655       ...
 656       ...
 657       for (i = 0; i < VL; i++)
 658            ...
 659            ...
 660            // TODO, calculate if over-run occurs, for each elwidth
           if (elwidth == 8) {
               int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                        int_regfile[rs2].b[irs2];
           } else if (elwidth == 16) {
               int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                        int_regfile[rs2].s[irs2];
           } else if (elwidth == 32) {
               int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                        int_regfile[rs2].i[irs2];
           } else { // elwidth == 64
               int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                        int_regfile[rs2].l[irs2];
           }
 674            ...
 675            ...
 676
 677 So here we can see clearly: for 8-bit entries rd, rs1 and rs2 (and registers
 678 following sequentially on respectively from the same) are "type-cast"
 679 to 8-bit; for 16-bit entries likewise and so on.
 680
 681 However that only covers the case where the element widths are the same.
 682 Where the element widths are different, the following algorithm applies:
 683
 684 * Analyse the bitwidth of all source operands and work out the
 685   maximum.  Record this as "maxsrcbitwidth"
* If any given source operand requires sign-extension or zero-extension
  (lb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
  sign-extension / zero-extension or whatever is specified in the standard
  RV specification, **change** that to sign-extending (or zero-extending)
  from the respective individual source operand's bitwidth, as given in the
  CSR table, out to "maxsrcbitwidth" (previously calculated).
* Following separate and distinct (optional) sign/zero-extension of all
  source operands as specifically required for that operation, carry out the
  operation at "maxsrcbitwidth".  (Note that in the case of LOAD/STORE or MV
  this may be a "null" (copy) operation, and that with FCVT, the changes
  to the source and destination bitwidths may also turn FCVT effectively
  into a copy).
* If the destination operand requires sign-extension or zero-extension,
  instead of a mandatory fixed size (typically 32-bit for arithmetic,
  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
  etc.), overload the RV specification with the bitwidth from the
  destination register's elwidth entry.
 703 * Finally, store the (optionally) sign/zero-extended value into its
 704   destination: memory for sb/sw etc., or an offset section of the register
 705   file for an arithmetic operation.
 706
 707 In this way, polymorphic bitwidths are achieved without requiring a
 708 massive 64-way permutation of calculations **per opcode**, for example
 709 (4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
 710 rd bitwidths).  The pseudo-code is therefore as follows:
 711
 712     typedef union {
 713         uint8_t  b;
 714         uint16_t s;
 715         uint32_t i;
 716         uint64_t l;
 717     } el_reg_t;
 718
 719     bw(elwidth):
 720         if elwidth == 0: return xlen
 721         if elwidth == 1: return 8
 722         if elwidth == 2: return 16
 723         // elwidth == 3:
 724         return 32
 725
 726     get_max_elwidth(rs1, rs2):
 727         return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
 728                    bw(int_csr[rs2].elwidth)) # again XLEN if no entry
 729
    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
 742
 743     set_polymorphed_reg(reg, bitwidth, offset, val):
 744         if (!int_csr[reg].isvec):
 745             # sign/zero-extend depending on opcode requirements, from
 746             # the reg's bitwidth out to the full bitwidth of the regfile
 747             val = sign_or_zero_extend(val, bitwidth, xlen)
 748             int_regfile[reg].l[0] = val
 749         elif bitwidth == 8:
 750             int_regfile[reg].b[offset] = val
 751         elif bitwidth == 16:
 752             int_regfile[reg].s[offset] = val
 753         elif bitwidth == 32:
 754             int_regfile[reg].i[offset] = val
 755         elif bitwidth == 64:
 756             int_regfile[reg].l[offset] = val
 757
      maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
      destwid = bw(int_csr[rd].elwidth)      # destination element width
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
           // TODO, calculate if over-run occurs, for each elwidth
           src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
           // TODO, sign/zero-extend src1 and src2 as operation requires
           if (op_requires_sign_extend_src1)
              src1 = sign_extend(src1, maxsrcwid)
           src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
           result = src1 + src2 # actual add here
           // TODO, sign/zero-extend result, as operation requires
           if (op_requires_sign_extend_dest)
              result = sign_extend(result, maxsrcwid)
           set_polymorphed_reg(rd, destwid, id, result)
           if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
 777
Whilst specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear that:
 781
 782 * the source operands are extended out to the maximum bitwidth of all
 783   source operands
 784 * the operation takes place at that maximum source bitwidth (the
 785   destination bitwidth is not involved at this point, at all)
 786 * the result is extended (or potentially even, truncated) before being
 787   stored in the destination.  i.e. truncation (if required) to the
 788   destination width occurs **after** the operation **not** before.
 789 * when the destination is not marked as "vectorised", the **full**
 790   (standard, scalar) register file entry is taken up, i.e. the
 791   element is either sign-extended or zero-extended to cover the
 792   full register bitwidth (XLEN) if it is not already XLEN bits long.
 793
Implementors are entirely free to optimise the above, particularly
if it is specifically known that any given operation will complete
accurately in fewer bits, as long as the results produced are
directly equivalent and equal, for all inputs and all outputs,
to those produced by the above algorithm.
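
As an illustration of the steps listed above, here is a minimal Python sketch
operating on plain integers rather than on a register file.  `extend` and
`polymorphic_add` are hypothetical names, and the `signed` flag stands in for
"whatever extension the opcode requires"; predication, the VL loop and scalar
(non-vectorised) destinations are deliberately left out.

```python
def extend(value, from_bits, to_bits, signed):
    """Sign- or zero-extend (or truncate) value from from_bits to to_bits."""
    value &= (1 << from_bits) - 1
    if signed and value & (1 << (from_bits - 1)):
        value -= 1 << from_bits            # reinterpret as a negative number
    return value & ((1 << to_bits) - 1)    # wrap into to_bits (also truncates)


def polymorphic_add(src1, wid1, src2, wid2, destwid, signed=False):
    """Add two elements of possibly different widths, SV-style."""
    maxsrcwid = max(wid1, wid2)                  # step 1: maximum source width
    a = extend(src1, wid1, maxsrcwid, signed)    # step 2: extend each source
    b = extend(src2, wid2, maxsrcwid, signed)
    result = (a + b) & ((1 << maxsrcwid) - 1)    # step 3: operate at maxsrcwid
    # steps 4 and 5: extend or truncate to the destination width
    return extend(result, maxsrcwid, destwid, signed)
```

For example, adding 8-bit 0xFF to 8-bit 0xFF with a 16-bit destination gives
0x00FE, not 0x01FE: the carry is lost because the add takes place at the 8-bit
source width, and only then is the result extended to the destination width.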
 799
 800 ## Polymorphic floating-point operation exceptions and error-handling
 801
For floating-point operations, conversion takes place without raising any
kind of exception.  Exactly as specified in the standard RV specification,
NaN (or other appropriate value) is stored if the result is beyond the
range of the destination and, exactly as with scalar operations in the
standard RV specification, the corresponding floating-point flag is raised
(FCSR).  Again just as with scalar operations, it is software's
responsibility to check this flag.  Given that the FCSR flags are
"accrued", the fact that multiple element operations could have occurred
is not a problem.
 811
 812 Note that it is perfectly legitimate for floating-point bitwidths of
 813 only 8 to be specified.  However whilst it is possible to apply IEEE 754
 814 principles, no actual standard yet exists.  Implementors wishing to
 815 provide hardware-level 8-bit support rather than throw a trap to emulate
 816 in software should contact the author of this specification before
 817 proceeding.
 818
 819 ## Polymorphic shift operators
 820
 821 A special note is needed for changing the element width of left and
 822 right shift operators, particularly right-shift.  Even for standard RV
 823 base, in order for correct results to be returned, the second operand
 824 RS2 must be truncated to be within the range of RS1's bitwidth.
 825 spike's implementation of sll for example is as follows:
 826
 827     WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));
 828
 829 which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
 830 range 0..31 so that RS1 will only be left-shifted by the amount that
 831 is possible to fit into a 32-bit register.  Whilst this appears not
 832 to matter for hardware, it matters greatly in software implementations,
 833 and it also matters where an RV64 system is set to "RV32" mode, such
 834 that the underlying registers RS1 and RS2 comprise 64 hardware bits
 835 each.
 836
 837 For SV, where each operand's element bitwidth may be over-ridden, the
 838 rule about determining the operation's bitwidth *still applies*, being
 839 defined as the maximum bitwidth of RS1 and RS2.  *However*, this rule
 840 **also applies to the truncation of RS2**.  In other words, *after*
 841 determining the maximum bitwidth, RS2's range must **also be truncated**
 842 to ensure a correct answer.  Example:
 843
 844 * RS1 is over-ridden to a 16-bit width
 845 * RS2 is over-ridden to an 8-bit width
 846 * RD is over-ridden to a 64-bit width
 847 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
 848 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
 849
 850 Pseudocode (in spike) for this example would therefore be:
 851
 852     WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
 853
 854 This example illustrates that considerable care therefore needs to be
 855 taken to ensure that left and right shift operations are implemented
correctly.  The key points are:

* The operation bitwidth is determined by the maximum bitwidth
  of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
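
The two key points can be sketched in Python as follows.  This is only one
reading of the rules, under stated assumptions: the function name
`polymorphic_sll` is hypothetical, element widths are powers of two, and the
shifted value is truncated to the operation width *before* being sign-extended
to the destination width (the spike one-liner above instead applies sext\_xlen
directly to the shifted value).

```python
def polymorphic_sll(rs1, rs1_bits, rs2, rs2_bits, rd_bits):
    """Shift-left with per-operand element widths (illustrative only)."""
    opwid = max(rs1_bits, rs2_bits)        # operation width: max of the sources
    value = rs1 & ((1 << rs1_bits) - 1)    # zero-extend RS1 from its own width
    shamt = rs2 & ((1 << rs2_bits) - 1)    # RS2 at its own width ...
    shamt &= opwid - 1                     # ... truncated to 0..opwid-1
    result = (value << shamt) & ((1 << opwid) - 1)
    if result & (1 << (opwid - 1)):        # sign-extend from the operation width
        result -= 1 << opwid
    return result & ((1 << rd_bits) - 1)   # wrap (or truncate) to the destination
```

With RS1 at 16-bit and RS2 at 8-bit, a shift amount of 17 (0x11) is reduced to
17 & 15 = 1, exactly as in the example above.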
 861
 862 ## Polymorphic MULH/MULHU/MULHSU
 863
 864 MULH is designed to take the top half MSBs of a multiply that
 865 does not fit within the range of the source operands, such that
 866 smaller width operations may produce a full double-width multiply
 867 in two cycles.  The issue is: SV allows the source operands to
 868 have variable bitwidth.
 869
 870 Here again special attention has to be paid to the rules regarding
 871 bitwidth, which, again, are that the operation is performed at
 872 the maximum bitwidth of the **source** registers.  Therefore:
 873
 874 * An 8-bit x 8-bit multiply will create a 16-bit result that must
 875   be shifted down by 8 bits
 876 * A 16-bit x 8-bit multiply will create a 24-bit result that must
 877   be shifted down by 16 bits (top 8 bits being zero)
 878 * A 16-bit x 16-bit multiply will create a 32-bit result that must
 879   be shifted down by 16 bits
 880 * A 32-bit x 16-bit multiply will create a 48-bit result that must
 881   be shifted down by 32 bits
 882 * A 32-bit x 8-bit multiply will create a 40-bit result that must
 883   be shifted down by 32 bits
 884
 885 So again, just as with shift-left and shift-right, the result
 886 is shifted down by the maximum of the two source register bitwidths.
 887 And, exactly again, truncation or sign-extension is performed on the
 888 result.  If sign-extension is to be carried out, it is performed
 889 from the same maximum of the two source register bitwidths out
 890 to the result element's bitwidth.
 891
 892 If truncation occurs, i.e. the top MSBs of the result are lost,
 893 this is "Officially Not Our Problem", i.e. it is assumed that the
 894 programmer actually desires the result to be truncated.  i.e. if the
 895 programmer wanted all of the bits, they would have set the destination
 896 elwidth to accommodate them.
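
A minimal Python sketch of the unsigned case (MULHU) may help to make the
"shift down by the maximum source width" rule concrete.  The function name
`polymorphic_mulhu` is hypothetical, and the signed variants (MULH, MULHSU),
which would additionally sign-extend one or both operands and the result, are
not covered.

```python
def polymorphic_mulhu(rs1, rs1_bits, rs2, rs2_bits, rd_bits):
    """Upper part of an unsigned multiply with per-operand element widths."""
    opwid = max(rs1_bits, rs2_bits)        # operation width: max of the sources
    a = rs1 & ((1 << rs1_bits) - 1)        # zero-extend each operand
    b = rs2 & ((1 << rs2_bits) - 1)
    high = (a * b) >> opwid                # shift down by the operation width
    return high & ((1 << rd_bits) - 1)     # truncate to the destination width
```

For instance, a 16-bit 0xFFFF times an 8-bit 0xFF produces the 24-bit product
0xFEFF01, which shifted down by 16 bits leaves 0xFE.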
 897
 898 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
 899
Polymorphic element widths in vectorised form mean that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
 904
 905 This makes for a slightly more complex algorithm when using indirection
 906 on the "addressed" register (source for LOAD and destination for STORE),
 907 particularly given that the LOAD/STORE instruction provides important
 908 information about the width of the data to be reinterpreted.
 909
 910 Let's illustrate the "load" part, where the pseudo-code for elwidth=default
 911 was as follows, and i is the loop from 0 to VL-1:
 912
 913     srcbase = ireg[rs+i];
 914     return mem[srcbase + imm]; // returns XLEN bits
 915
 916 Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
 917 chunks are taken from the source memory location addressed by the current
 918 indexed source address register, and only when a full 32-bits-worth
 919 are taken will the index be moved on to the next contiguous source
 920 address register:
 921
 922     bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
 923     elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
 924     srcbase = ireg[rs+i/(elsperblock)]; // integer divide
 925     offs = i % elsperblock;             // modulo
 926     return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.
 927
 928 Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
 929 and 128 for LQ.
 930
 931 The principle is basically exactly the same as if the srcbase were pointing
 932 at the memory of the *register* file: memory is re-interpreted as containing
 933 groups of elwidth-wide discrete elements.
 934
When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*.  Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth.  Also, the storing follows exactly the same algorithm as
above: in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) used above.
 942
 943 One issue remains: when the source element width is **greater** than
 944 the width of the operation, it is obvious that a single LB for example
 945 cannot possibly obtain 16-bit-wide data.  This condition may be detected
 946 where, when using integer divide, elsperblock (the width of the LOAD
 947 divided by the bitwidth of the element) is zero.
 948
The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)
 952
 953 The elements, if the element bitwidth is larger than the LD operation's
 954 size, will then be sign/zero-extended to the full LD operation size, as
 955 specified by the LOAD (LDU instead of LD, LBU instead of LB), before
 956 being passed on to the second phase.
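
The index arithmetic can be checked with a few lines of Python.  This is an
illustrative sketch only: `element_source` is a hypothetical helper that
returns, for element i, how many registers past rs the address register is,
and which element slot within the loaded block is used.  elsperblock is
clamped so that it is never less than 1.

```python
def element_source(i, opwidth, element_bitwidth):
    """Map element index i to (address-register offset, slot within block)."""
    # integer divide, clamped so that at least one element is taken per block
    elsperblock = max(1, opwidth // element_bitwidth)
    return i // elsperblock, i % elsperblock
```

For an LD (64-bit) with 16-bit elements, elements 0 to 3 use the first address
register and elements 4 onwards use the next one.  For an LB (8-bit) with
16-bit elements the integer divide gives zero, the clamp raises it to 1, and
every element therefore uses its own address register.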
 957
 958 As LOAD/STORE may be twin-predicated, it is important to note that
 959 the rules on twin predication still apply, except where in previous
 960 pseudo-code (elwidth=default for both source and target) it was
 961 the *registers* that the predication was applied to, it is now the
 962 **elements** that the predication is applied to.
 963
 964 Thus the full pseudocode for all LD operations may be written out
 965 as follows:
 966
 967     function LBU(rd, rs):
 968         load_elwidthed(rd, rs, 8, true)
 969     function LB(rd, rs):
 970         load_elwidthed(rd, rs, 8, false)
 971     function LH(rd, rs):
 972         load_elwidthed(rd, rs, 16, false)
 973     ...
 974     ...
 975     function LQ(rd, rs):
 976         load_elwidthed(rd, rs, 128, false)
 977
 978     # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
 979     function load_memory(rs, imm, i, opwidth):
 980         elwidth = int_csr[rs].elwidth
 981         bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
 983         srcbase = ireg[rs+i/(elsperblock)];
 984         offs = i % elsperblock;
 985         return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
 986
    function load_elwidthed(rd, rs, opwidth, unsigned):
      bitwidth = bw(int_csr[rs].elwidth) # source element width
      destwid = bw(int_csr[rd].elwidth)  # destination element width
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        val = load_memory(rs, imm, i, opwidth)
        if unsigned:
            val = zero_extend(val, min(opwidth, bitwidth))
        else:
            val = sign_extend(val, min(opwidth, bitwidth))
        set_polymorphed_reg(rd, destwid, j, val)
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;
1004
1005 Note:
1006
1007 * when comparing against for example the twin-predicated c.mv
1008   pseudo-code, the pattern of independent incrementing of rd and rs
1009   is preserved unchanged.
1010 * just as with the c.mv pseudocode, zeroing is not included and must be
1011   taken into account (TODO).
1012 * that due to the use of a twin-predication algorithm, LOAD/STORE also
1013   take on the same VSPLAT, VINSERT, VREDUCE, VEXTRACT, VGATHER and
1014   VSCATTER characteristics.
1015 * that due to the use of the same set\_polymorphed\_reg pseudocode,
1016   a destination that is not vectorised (marked as scalar) will
1017   result in the element being fully sign-extended or zero-extended
1018   out to the full register file bitwidth (XLEN).  When the source
1019   is also marked as scalar, this is how the compatibility with
1020   standard RV LOAD/STORE is preserved by this algorithm.
1021
1022 ### Example Tables showing LOAD elements
1023
1024 This section contains examples of vectorised LOAD operations, showing
1025 how the two stage process works (three if zero/sign-extension is included).
1026
1027
1028 #### Example: LD x8, x5(0), x8 CSR-elwidth=32, x5 CSR-elwidth=16, VL=7
1029
1030 This is:
1031
1032 * a 64-bit load, with an offset of zero
1033 * with a source-address elwidth of 16-bit
1034 * into a destination-register with an elwidth of 32-bit
1035 * where VL=7
1036 * from register x5 (actually x5-x6) to x8 (actually x8 to half of x11)
1037 * RV64, where XLEN=64 is assumed.
1038
First, the memory table.  Because the element width is 16 and the
operation is LD (64), the 64 bits loaded from memory are subdivided
into groups of **four** elements.  With VL being 7 (deliberately chosen
to illustrate that this is reasonable and possible), the first four are
sourced from the offset addresses pointed to by x5, and the next three
from the offset addresses pointed to by the next contiguous register, x6:
1045
1046 [[!table  data="""
1047 addr | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
1048 @x5  | elem 0         || elem 1         || elem 2         || elem 3         ||
1049 @x6  | elem 4         || elem 5         || elem 6         || not loaded     ||
1050 """]]
1051
1052 Next, the elements are zero-extended from 16-bit to 32-bit, as whilst
1053 the elwidth CSR entry for x5 is 16-bit, the destination elwidth on x8 is 32.
1054
1055 [[!table  data="""
1056 byte 3 | byte 2 | byte 1 | byte 0 |
1057 0x0    | 0x0    | elem0          ||
1058 0x0    | 0x0    | elem1          ||
1059 0x0    | 0x0    | elem2          ||
1060 0x0    | 0x0    | elem3          ||
1061 0x0    | 0x0    | elem4          ||
1062 0x0    | 0x0    | elem5          ||
0x0    | 0x0    | elem6          ||
1065 """]]
1066
1067 Lastly, the elements are stored in contiguous blocks, as if x8 was also
1068 byte-addressable "memory".  That "memory" happens to cover registers
1069 x8, x9, x10 and x11, with the last 32 "bits" of x11 being **UNMODIFIED**:
1070
1071 [[!table  data="""
1072 reg# | byte 7 | byte 6 | byte 5 | byte 4 | byte 3 | byte 2 | byte 1 | byte 0 |
1073 x8   | 0x0    | 0x0    | elem 1         || 0x0    | 0x0    | elem 0         ||
1074 x9   | 0x0    | 0x0    | elem 3         || 0x0    | 0x0    | elem 2         ||
1075 x10  | 0x0    | 0x0    | elem 5         || 0x0    | 0x0    | elem 4         ||
1076 x11  | **UNMODIFIED**                 |||| 0x0    | 0x0    | elem 6         ||
1077 """]]
1078
Thus we have data that is loaded from the **addresses** pointed to by
x5 and x6, zero-extended from 16-bit to 32-bit, and stored in the
**registers** x8 through to half of x11.
The end result is that elements 0 and 1 end up in x8, with element 1 being
shifted up 32 bits, and so on, until finally element 6 is in the
LSBs of x11.
1085
Note that whilst the memory addressing table is shown in left-to-right byte
order, the registers are shown in right-to-left (MSB first) order.  This does
**not** imply that bit or byte-reversal is carried out: it's just easier to
visualise memory as being contiguous bytes, and it emphasises that registers
are not actually "memory" as such.
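
The three tables can be reproduced with a small Python simulation.  This is a
sketch under several assumptions that are not part of the specification: the
register file and memory are flat little-endian byte arrays, the immediate
offset is zero, predication is ignored, and elements are always zero-extended.
The name `vector_ld` and its parameters are hypothetical.

```python
XLEN_BYTES = 8  # assumption: RV64


def vector_ld(regfile, mem, rd, rs, vl, src_bits, dest_bits, op_bits=64):
    """Load vl elements of src_bits each, packing them at dest_bits into rd."""
    src_bytes, dest_bytes = src_bits // 8, dest_bits // 8
    elsperblock = max(1, op_bits // src_bits)
    for i in range(vl):
        # stage 1: pick the address register, then the element within its block
        areg = rs + i // elsperblock
        base = int.from_bytes(
            regfile[areg * XLEN_BYTES:(areg + 1) * XLEN_BYTES], "little")
        addr = base + (i % elsperblock) * src_bytes
        elem = int.from_bytes(mem[addr:addr + src_bytes], "little")
        # stage 2: zero-extend (or truncate) to the destination element width
        elem &= (1 << dest_bits) - 1
        # stage 3: store contiguously, as if rd were byte-addressable memory
        start = rd * XLEN_BYTES + i * dest_bytes
        regfile[start:start + dest_bytes] = elem.to_bytes(dest_bytes, "little")
```

Running this with rd=8, rs=5, vl=7, src\_bits=16 and dest\_bits=32 fills x8,
x9 and x10 completely and only the lower 32 bits of x11, leaving the upper
32 bits of x11 exactly as they were.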
1091
1092 ## Why SV bitwidth specification is restricted to 4 entries
1093
The four entries for SV element bitwidths only allow three over-rides:

* 8 bit
* 16 bit
* 32 bit

This would seem inadequate: surely it would be better to have 3 bits or
more and allow 64, 128 and some other options besides.  The answer here
is that it gets too complex, no RV128 implementation yet exists, and RV64's
default is 64 bit, so the 4 major element widths are covered anyway.
1104
There is an absolutely crucial aspect of SV here that explicitly
needs spelling out: whether the "vectorised" bit is set in
the Register's CSR entry.
1108
1109 If "vectorised" is clear (not set), this indicates that the operation
1110 is "scalar".  Under these circumstances, when set on a destination (RD),
1111 then sign-extension and zero-extension, whilst changed to match the
1112 override bitwidth (if set), will erase the **full** register entry
1113 (64-bit if RV64).
1114
1115 When vectorised is *set*, this indicates that the operation now treats
1116 **elements** as if they were independent registers, so regardless of
1117 the length, any parts of a given actual register that are not involved
1118 in the operation are **NOT** modified, but are **PRESERVED**.
1119
1120 For example:
1121
1122 * when the vector bit is clear and elwidth set to 16 on the destination
1123   register, operations are truncated to 16 bit and then sign or zero
1124   extended to the *FULL* XLEN register width.
1125 * when the vector bit is set, elwidth is 16 and VL=1 (or other value where
1126   groups of elwidth sized elements do not fill an entire XLEN register),
1127   the "top" bits of the destination register do *NOT* get modified, zero'd
1128   or otherwise overwritten.
1129
SIMD micro-architectures may implement this by using predication on
any elements in a given actual register that are beyond the end of
the multi-element operation.
1133
1134 Other microarchitectures may choose to provide byte-level write-enable
1135 lines on the register file, such that each 64 bit register in an RV64
1136 system requires 8 WE lines.  Scalar RV64 operations would require
1137 activation of all 8 lines, where SV elwidth based operations would
1138 activate the required subset of those byte-level write lines.
1139
1140 Example:
1141
1142 * rs1, rs2 and rd are all set to 8-bit
1143 * VL is set to 3
1144 * RV64 architecture is set (UXL=64)
1145 * add operation is carried out
1146 * bits 0-23 of RD are modified to be rs1[23..16] + rs2[23..16]
1147   concatenated with similar add operations on bits 15..8 and 7..0
1148 * bits 24 through 63 **remain as they originally were**.
1149
1150 Example SIMD micro-architectural implementation:
1151
1152 * SIMD architecture works out the nearest round number of elements
1153   that would fit into a full RV64 register (in this case: 8)
1154 * SIMD architecture creates a hidden predicate, binary 0b00000111
1155   i.e. the bottom 3 bits set (VL=3) and the top 5 bits clear
1156 * SIMD architecture goes ahead with the add operation as if it
1157   was a full 8-wide batch of 8 adds
* SIMD architecture passes the top 5 elements through the adders
  (which are "disabled" due to zero-bit predication)
* SIMD architecture gets the top 5 8-bit elements back unmodified
  and stores them in rd.
1162
1163 This requires a read on rd, however this is required anyway in order
1164 to support non-zeroing mode.
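
The hidden-predicate approach can be sketched in Python as below.  This models
behaviour only, not any particular microarchitecture: `simd_add8` is a
hypothetical name, registers are plain 64-bit integers, and the element width
is fixed at 8 bits.

```python
def simd_add8(rd_val, rs1_val, rs2_val, vl):
    """Eight 8-bit lanes in a 64-bit register; only the first vl are active."""
    lanes = 8
    hidden_pred = (1 << min(vl, lanes)) - 1      # e.g. 0b00000111 for VL=3
    result = 0
    for lane in range(lanes):
        shift = lane * 8
        if hidden_pred & (1 << lane):
            # active lane: 8-bit add, with no carry into the next lane
            byte = ((rs1_val >> shift) + (rs2_val >> shift)) & 0xFF
        else:
            # "disabled" lane: the old contents of rd pass through unmodified
            byte = (rd_val >> shift) & 0xFF
        result |= byte << shift
    return result
```

With VL=3, only bits 0-23 of the returned value differ from the original rd;
bits 24 through 63 are whatever was read from rd beforehand, which is why the
read on rd is needed.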
1165
1166 ## Polymorphic floating-point
1167
1168 Standard scalar RV integer operations base the register width on XLEN,
1169 which may be changed (UXL in USTATUS, and the corresponding MXL and
1170 SXL in MSTATUS and SSTATUS respectively).  Integer LOAD, STORE and
1171 arithmetic operations are therefore restricted to an active XLEN bits,
1172 with sign or zero extension to pad out the upper bits when XLEN has
1173 been dynamically set to less than the actual register size.
1174
1175 For scalar floating-point, the active (used / changed) bits are
1176 specified exclusively by the operation: ADD.S specifies an active
1177 32-bits, with the upper bits of the source registers needing to
1178 be all 1s ("NaN-boxed"), and the destination upper bits being
1179 *set* to all 1s (including on LOAD/STOREs).
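
For readers unfamiliar with NaN-boxing, the following minimal Python sketch
shows the scalar behaviour described above.  The helper names (`nan_box`,
`is_nan_boxed`, `float32_bits`) are hypothetical, and register values are
modelled as plain integers of FLEN bits.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def nan_box(value_bits, active_bits, flen=64):
    """Place value_bits in the low active_bits and set all upper bits to 1."""
    low_mask = (1 << active_bits) - 1
    upper = ((1 << flen) - 1) ^ low_mask
    return upper | (value_bits & low_mask)


def is_nan_boxed(reg, active_bits, flen=64):
    """True if every bit above active_bits is 1."""
    upper = ((1 << flen) - 1) ^ ((1 << active_bits) - 1)
    return (reg & upper) == upper
```

For example, single-precision 1.0 (0x3F800000) held in a 64-bit register
becomes 0xFFFFFFFF3F800000.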
1180
1181 Where elwidth is set to default (on any source or the destination)
1182 it is obvious that this NaN-boxing behaviour can and should be
1183 preserved.  When elwidth is non-default things are less obvious,
1184 so need to be thought through.  Here is a normal (scalar) sequence,
1185 assuming an RV64 which supports Quad (128-bit) FLEN:
1186
1187 * FLD loads 64-bit wide from memory.  Top 64 MSBs are set to all 1s
1188 * ADD.D performs a 64-bit-wide add.  Top 64 MSBs of destination set to 1s.
1189 * FSD stores lowest 64-bits from the 128-bit-wide register to memory:
1190   top 64 MSBs ignored.
1191
1192 Therefore it makes sense to mirror this behaviour when, for example,
1193 elwidth is set to 32.  Assume elwidth set to 32 on all source and
1194 destination registers:
1195
1196 * FLD loads 64-bit wide from memory as **two** 32-bit single-precision
1197   floating-point numbers.
1198 * ADD.D performs **two** 32-bit-wide adds, storing one of the adds
1199   in bits 0-31 and the second in bits 32-63.
1200 * FSD stores lowest 64-bits from the 128-bit-wide register to memory
1201
1202 Here's the thing: it does not make sense to overwrite the top 64 MSBs
1203 of the registers either during the FLD **or** the ADD.D.  The reason
1204 is that, effectively, the top 64 MSBs actually represent a completely
1205 independent 64-bit register, so overwriting it is not only gratuitous
1206 but may actually be harmful for a future extension to SV which may
1207 have a way to directly access those top 64 bits.
1208
The decision is therefore **not** to touch the upper parts of floating-point
registers wherever elwidth is set to non-default values, including
when "isvec" is false in a given register's CSR entry.  Only when the
elwidth is set to default **and** isvec is false will the standard
RV behaviour be followed, namely that the upper bits be modified.
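
To make the decision concrete, here is a minimal Python sketch of ADD.D with
elwidth set to 32 on all operands, assuming a 128-bit FLEN as in the sequence
above.  The function names are hypothetical, registers are modelled as plain
128-bit integers, and rounding modes, FCSR flags and overflow handling are all
omitted.

```python
import struct


def f32_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fadd_d_elwidth32(rd, rs1, rs2):
    """ADD.D with elwidth=32: two 32-bit adds, top 64 bits of rd untouched."""
    low64 = (1 << 64) - 1
    result = rd & (((1 << 128) - 1) ^ low64)   # preserve bits 64..127 of rd
    for lane in range(2):
        shift = 32 * lane
        a = bits_to_f32((rs1 >> shift) & 0xFFFFFFFF)
        b = bits_to_f32((rs2 >> shift) & 0xFFFFFFFF)
        result |= f32_to_bits(a + b) << shift  # bits 0-31, then bits 32-63
    return result
```

Note that no NaN-boxing takes place: the upper 64 bits of the destination come
back exactly as they went in.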
1214
1215 Ultimately if elwidth is default and isvec false on *all* source
1216 and destination registers, a SimpleV instruction defaults completely
1217 to standard RV scalar behaviour (this holds true for **all** operations,
1218 right across the board).
1219
The nice thing here is that ADD.S, ADD.D and ADD.Q, when elwidth is set
to a non-default value, are effectively all the same: they all still perform
multiple ADD operations, just at different widths.  A future extension
to SimpleV may actually allow ADD.S to access the upper bits of the
register, effectively breaking down a 128-bit register into a bank
of 4 independently-accessible 32-bit registers.
1226
In the meantime, although when e.g. setting VL to 8 it would technically
make no difference to the ALU whether ADD.S, ADD.D or ADD.Q is used,
using ADD.Q may be an easy way to signal to the microarchitecture that
it is to receive a higher VL value.  On a superscalar OoO architecture
there may be absolutely no difference.  Simpler SIMD-style
microarchitectures, however, may not necessarily have the infrastructure
in place to know the difference: when VL=8 and an ADD.D instruction
is issued, it may complete in 2 cycles (or more) rather than one, whereas
an ADD.Q issued on the same microarchitecture would complete in one.
1237
1238 ## Specific instruction walk-throughs
1239
1240 This section covers walk-throughs of the above-outlined procedure
1241 for converting standard RISC-V scalar arithmetic operations to
1242 polymorphic widths, to ensure that it is correct.
1243
1244 ### add
1245
1246 Standard Scalar RV32/RV64 (xlen):
1247
1248 * RS1 @ xlen bits
1249 * RS2 @ xlen bits
1250 * add @ xlen bits
1251 * RD @ xlen bits
1252
1253 Polymorphic variant:
1254
1255 * RS1 @ rs1 bits, zero-extended to max(rs1, rs2) bits
1256 * RS2 @ rs2 bits, zero-extended to max(rs1, rs2) bits
1257 * add @ max(rs1, rs2) bits
1258 * RD @ rd bits.  zero-extend to rd if rd > max(rs1, rs2) otherwise truncate
1259
1260 Note here that polymorphic add zero-extends its source operands,
1261 where addw sign-extends.
1262
1263 ### addw
1264
1265 The RV Specification specifically states that "W" variants of arithmetic
1266 operations always produce 32-bit signed values.  In a polymorphic
1267 environment it is reasonable to assume that the signed aspect is
1268 preserved, where it is the length of the operands and the result
1269 that may be changed.
1270
1271 Standard Scalar RV64 (xlen):
1272
1273 * RS1 @ xlen bits
1274 * RS2 @ xlen bits
1275 * add @ xlen bits
1276 * RD @ xlen bits, truncate add to 32-bit and sign-extend to xlen.
1277
1278 Polymorphic variant:
1279
1280 * RS1 @ rs1 bits, sign-extended to max(rs1, rs2) bits
1281 * RS2 @ rs2 bits, sign-extended to max(rs1, rs2) bits
1282 * add @ max(rs1, rs2) bits
1283 * RD @ rd bits. sign-extend to rd if rd > max(rs1, rs2) otherwise truncate
1284
1285 Note here that polymorphic addw sign-extends its source operands,
1286 where add zero-extends.
1287
This requires a little more in-depth analysis.  Where the bitwidth of
rs1 equals the bitwidth of rs2, no sign-extending will occur.  It is
only where the bitwidths of rs1 and rs2 differ that the
lesser-width operand will be sign-extended.
1292
1293 Effectively however, both rs1 and rs2 are being sign-extended (or
1294 truncated), where for add they are both zero-extended.  This holds true
1295 for all arithmetic operations ending with "W".
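
As a sketch of the sign-extending behaviour described above (the names `sign_extend` and `poly_addw` are hypothetical, and wrapping the add at max(rs1, rs2) bits is an assumption of this model), the polymorphic addw could be expressed as:

```python
def sign_extend(value, from_bits):
    """Interpret the low from_bits bits as two's complement; return a signed int."""
    value &= (1 << from_bits) - 1
    sign = 1 << (from_bits - 1)
    return (value ^ sign) - sign


def poly_addw(src1, rs1_bits, src2, rs2_bits, rd_bits):
    """Polymorphic addw: operands sign-extended to the wider source width."""
    op_bits = max(rs1_bits, rs2_bits)
    a = sign_extend(src1, rs1_bits)
    b = sign_extend(src2, rs2_bits)
    # add @ max(rs1, rs2) bits, wrapping as two's complement (assumption)
    total = sign_extend(a + b, op_bits)
    # masking a signed Python int yields its two's complement bit pattern,
    # which covers both sign-extension (wider rd) and truncation (narrower rd)
    return total & ((1 << rd_bits) - 1)
```

Here `poly_addw(0xFF, 8, 0x0001, 16, 32)` treats the 8-bit operand as -1 and so produces zero, where a zero-extending add of the same bit patterns would produce 0x100.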
1296
1297 ### addiw
1298
1299 Standard Scalar RV64I:
1300
* RS1 @ xlen bits, truncated to 32-bit
* immed @ 12 bits, sign-extended to 32-bit
* add @ 32 bits
* RD @ xlen bits, add result sign-extended from 32-bit to xlen.
1305
1306 Polymorphic variant:
1307
1308 * RS1 @ rs1 bits
1309 * immed @ 12 bits, sign-extend to max(rs1, 12) bits
1310 * add @ max(rs1, 12) bits
1311 * RD @ rd bits. sign-extend to rd if rd > max(rs1, 12) otherwise truncate
1312
1313 # Predication Element Zeroing
1314
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, to save power by avoiding a register read on elements
that are passed en-masse through the ALU.  Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to avoid sending element-based
operations to the ALUs entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
1325
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
1332
1333 ## Single-predication (based on destination register)
1334
1335 Zeroing on predication for arithmetic operations is taken from
1336 the destination register's predicate.  i.e. the predication *and*
1337 zeroing settings to be applied to the whole operation come from the
1338 CSR Predication table entry for the destination register.
1339 Thus when zeroing is set on predication of a destination element,
1340 if the predication bit is clear, then the destination element is *set*
1341 to zero (twin-predication is slightly different, and will be covered
1342 next).
1343
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:

    for (i = 0; i < VL; i++)
      if not zeroing: # an optimisation
         # skip masked-out elements without touching the destination
         while (i < VL && !(predval & 1<<i))
           if (int_vec[rd ].isvector)  { ird += 1; }
           if (int_vec[rs1].isvector)  { irs1 += 1; }
           if (int_vec[rs2].isvector)  { irs2 += 1; }
           i += 1
         if i == VL:
           return
      if (predval & 1<<i)
         src1 = ....
         src2 = ...
         result = src1 + src2 # actual add (or other op) here
         set_polymorphed_reg(rd, destwid, ird, result)
         if int_vec[rd].ffirst and result == 0:
            VL = i # result was zero, end loop early, return VL
            return
         if (!int_vec[rd].isvector) return
      else if zeroing:
         result = 0
         set_polymorphed_reg(rd, destwid, ird, result)
      if (int_vec[rd ].isvector)  { ird += 1; }
      else if (predval & 1<<i) return
      if (int_vec[rs1].isvector)  { irs1 += 1; }
      if (int_vec[rs2].isvector)  { irs2 += 1; }
      if (ird == VL or irs1 == VL or irs2 == VL): return
1373
1374 The optimisation to skip elements entirely is only possible for certain
1375 micro-architectures when zeroing is not set.  However for lane-based
1376 micro-architectures this optimisation may not be practical, as it
1377 implies that elements end up in different "lanes".  Under these
1378 circumstances it is perfectly fine to simply have the lanes
1379 "inactive" for predicated elements, even though it results in
1380 less than 100% ALU utilisation.
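
To make the difference between zeroing and non-zeroing concrete, the single-predication loop in this section can be reduced to the following Python sketch. It is deliberately simplified and rests on several assumptions: all three registers are vectors, element widths are ignored, fail-on-first is left out, and the register file is modelled as a plain list. The function name `predicated_add` is hypothetical.

```python
def predicated_add(regs, rd, rs1, rs2, vl, predval, zeroing):
    """Element-wise add of vl elements under the destination predicate.

    regs    : list modelling the integer register file (assumption)
    predval : predicate mask, bit i controls element i
    zeroing : if True, masked-out destination elements are set to zero;
              if False, they are left untouched
    """
    for i in range(vl):
        if predval & (1 << i):
            regs[rd + i] = regs[rs1 + i] + regs[rs2 + i]
        elif zeroing:
            regs[rd + i] = 0
        # non-zeroing: the element is skipped entirely
    return regs
```

With a predicate of `0b0101`, elements 1 and 3 of the destination keep their previous contents when zeroing is clear, and become zero when zeroing is set.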
1381
1382 ## Twin-predication (based on source and destination register)
1383
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1388
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1395
1396 When zeroing is set on the destination and not the source, then just
1397 as with single-predicated operations, a zero is stored into the destination
1398 element (or target memory address for a STORE).
1399
Zeroing on both source and destination effectively results in a bitwise
AND of the source and destination predicates: where either the source
predicate OR the destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1404
1405 However: this may not necessarily be the case for all operations;
1406 implementors, particularly of custom instructions, clearly need to
1407 think through the implications in each and every case.
1408
1409 Here is pseudo-code for a twin zero-predicated operation:
1410
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
            if ((ps & 1<<i)) # source element enabled: copy it
                sourcedata = ireg[rs+i];
            else             # source element masked out: pass a zero
                sourcedata = 0
            ireg[rd+j] <= sourcedata
        else if (zerodst)
            ireg[rd+j] <= 0
        if (int_csr[rs].isvec)
            i++;
        if (int_csr[rd].isvec)
            j++;
        else
            if ((pd & 1<<j))
                break;
1434
1435 Note that in the instance where the destination is a scalar, the hardware
1436 loop is ended the moment a value *or a zero* is placed into the destination
1437 register/element.  Also note that, for clarity, variable element widths
1438 have been left out of the above.
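
The `op_mv` pseudo-code above can be exercised with a small Python model. This is only a sketch under stated assumptions: both rs and rd are vectors (the scalar-destination early exit is not modelled), element widths are ignored, the register redirection through `int_csr` is skipped, and the skip loops are bounded by VL so that the model always terminates. The function name `twin_pred_mv` is hypothetical.

```python
def twin_pred_mv(ireg, rd, rs, vl, ps, pd, zerosrc, zerodst):
    """Twin-predicated MV, vector source to vector destination.

    ps, pd           : source and destination predicate masks
    zerosrc, zerodst : independent zeroing flags for source and destination
    """
    i = j = 0
    while i < vl and j < vl:
        # without zeroing, masked-out elements are skipped entirely
        if not zerosrc:
            while i < vl and not (ps & (1 << i)):
                i += 1
        if not zerodst:
            while j < vl and not (pd & (1 << j)):
                j += 1
        if i >= vl or j >= vl:
            break
        if pd & (1 << j):
            if ps & (1 << i):
                ireg[rd + j] = ireg[rs + i]
            else:
                ireg[rd + j] = 0   # source element masked out: pass a zero
        elif zerodst:
            ireg[rd + j] = 0       # destination element masked out: zero it
        i += 1
        j += 1
    return ireg


regs = [9, 9, 9, 9, 0, 0, 0, 0, 1, 2, 3, 4]
# zeroing on both sides: data lands only where both predicate bits are set
print(twin_pred_mv(list(regs), 0, 8, 4, 0b1101, 0b0111, True, True)[0:4])
# no zeroing: enabled source elements are packed into enabled destinations
print(twin_pred_mv(list(regs), 0, 8, 4, 0b1101, 0b0111, False, False)[0:4])
```

With zeroing on both sides the source and destination indices advance in lockstep, so every destination element where either predicate bit is clear receives a zero. With zeroing off on both sides, the same masks instead behave like a combined compress and expand.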
1439
1440 # Subsets of RV functionality
1441
1442 This section describes the differences when SV is implemented on top of
1443 different subsets of RV.
1444
1445 ## Common options
1446
1447 It is permitted to only implement SVprefix and not the VBLOCK instruction
1448 format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
1449 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1450 traps may emulate the format.
1451
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details).  Again, UNIX Platforms
**MUST** raise illegal instruction on implementations that do not support
VL or SUBVL.

It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However, going
below the mandatory limits set in the RV standard will result in
non-compliance with the SV Specification.
1461
1462 ## RV32 / RV32F
1463
1464 When RV32 or RV32F is implemented, XLEN is set to 32, and thus the
1465 maximum limit for predication is also restricted to 32 bits.  Whilst not
1466 actually specifically an "option" it is worth noting.
1467
1468 ## RV32G
1469
Normally in standard RV32 it does not make much sense to have
RV32G: the critical instructions that are missing in standard RV32
are those for moving data between the double-width floating-point
registers and the integer ones, as well as the FCVT routines.
1474
1475 In an earlier draft of SV, it was possible to specify an elwidth
1476 of double the standard register size: this had to be dropped,
1477 and may be reintroduced in future revisions.
1478
1479 ## RV32 (not RV32F / RV32G) and RV64 (not RV64F / RV64G)
1480
1481 When floating-point is not implemented, the size of the User Register and
1482 Predication CSR tables may be halved, to only 4 2x16-bit CSRs (8 entries
1483 per table).
1484
1485 ## RV32E
1486
1487 In embedded scenarios the User Register and Predication CSRs may be
1488 dropped entirely, or optionally limited to 1 CSR, such that the combined
1489 number of entries from the M-Mode CSR Register table plus U-Mode
1490 CSR Register table is either 4 16-bit entries or (if the U-Mode is
1491 zero) only 2 16-bit entries (M-Mode CSR table only).  Likewise for
1492 the Predication CSR tables.
1493
1494 RV32E is the most likely candidate for simply detecting that registers
1495 are marked as "vectorised", and generating an appropriate exception
1496 for the VL loop to be implemented in software.
1497
1498 ## RV128
1499
RV128 has not been especially considered here; however, it has some
extremely large possibilities: double the element width implies
256-bit operands, spanning 2 128-bit registers each, and predication
of total length 128 bits given that XLEN is now 128.
1504
1505 # Example usage
1506
1507 TODO evaluate strncpy and strlen
1508 <https://groups.google.com/forum/m/#!msg/comp.arch/bGBeaNjAKvc/_vbqyxTUAQAJ>
1509
1510 ## strncpy
1511
RVV version: <a name="strncpy"></a>
1513
1514     strncpy:
1515         mv a3, a0               # Copy dst
1516     loop:
1517         setvli x0, a2, vint8    # Vectors of bytes.
1518         vlbff.v v1, (a1)        # Get src bytes
1519         vseq.vi v0, v1, 0       # Flag zero bytes
1520         vmfirst a4, v0          # Zero found?
1521         vmsif.v v0, v0          # Set mask up to and including zero byte.
1522         vsb.v v1, (a3), v0.t    # Write out bytes
1523         bgez a4, exit           # Done
1524         csrr t1, vl             # Get number of bytes fetched
1525         add a1, a1, t1          # Bump src pointer
1526         sub a2, a2, t1          # Decrement count.
1527         add a3, a3, t1          # Bump dst pointer
1528         bnez a2, loop           # Anymore?
1529
1530     exit:
1531         ret
1532
1533 SV version (WIP):
1534
1535     strncpy:
1536         mv a3, a0
1537         SETMVLI 8 # set max vector to 8
1538         RegCSR[a3] = 8bit, a3, scalar
1539         RegCSR[a1] = 8bit, a1, scalar
1540         RegCSR[t0] = 8bit, t0, vector
1541         PredTb[t0] = ffirst, x0, inv
1542     loop:
1543         SETVLI a2, t4 # t4 and VL now 1..8
1544         ldb t0, (a1) # t0 fail first mode
1545         bne t0, x0, allnonzero # still ff
1546         # VL points to last nonzero
1547         GETVL t4       # from bne tests
1548         addi t4, t4, 1 # include zero
1549         SETVL t4       # set exactly to t4
1550         stb t0, (a3)   # store incl zero
1551         ret            # end subroutine
1552     allnonzero:
1553         stb t0, (a3)    # VL legal range
1554         GETVL t4        # from bne tests
1555         add a1, a1, t4  # Bump src pointer
1556         sub a2, a2, t4  # Decrement count.
1557         add a3, a3, t4  # Bump dst pointer
1558         bnez a2, loop   # Anymore?
1559     exit:
1560         ret
1561
1562 Notes:
1563
1564 * Setting MVL to 8 is just an example. If enough registers are spare it
1565   may be set to XLEN which will require a bank of 8 scalar registers for
1566   a1, a3 and t0.
1567 * obviously if that is done, t0 is not separated by 8 full registers, and
1568   would overwrite t1 thru t7. x80 would work well, as an example, instead.
1569 * with the exception of the GETVL (a pseudo code alias for csrr), every
1570   single instruction above may use RVC.
1571 * RVC C.BNEZ can be used because rs1' may be extended to the full 128
1572   registers through redirection
1573 * RVC C.LW and C.SW may be used because the W format may be overridden by
1574   the 8 bit format. All of t0, a3 and a1 are overridden to make that work.
1575 * with the exception of the GETVL, all Vector Context may be done in
1576   VBLOCK form.
1577 * setting predication to x0 (zero) and invert on t0 is a trick to enable
1578   just ffirst on t0
1579 * ldb and bne are both using t0, both in ffirst mode
1580 * t0 vectorised, a1 scalar, both elwidth 8 bit: ldb enters "unit stride,
1581   vectorised, no (un)sign-extension or truncation" mode.
1582 * ldb will end on illegal mem, reduce VL, but copied all sorts of stuff
1583   into t0 (could contain zeros).
1584 * bne t0 x0 tests up to the NEW VL for nonzero, vector t0 against
1585   scalar x0
* however as t0 is in ffirst mode, the first fail will ALSO stop the
  compares, and reduce VL as well
* the branch only goes to allnonzero if all tests succeed
* if it did not, we can safely increment VL by 1 (using t4) to include
  the zero.
1591 * SETVL sets *exactly* the requested amount into VL.
1592 * the SETVL just after allnonzero label is needed in case the ldb ffirst
1593   activates but the bne allzeros does not.
1594 * this would cause the stb to copy up to the end of the legal memory
1595 * of course, on the next loop the ldb would throw a trap, as a1 now
1596   points to the first illegal mem location.
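
The control flow of the SV strncpy above, as described in these notes, can be modelled in Python to check the reasoning. This is a behavioural sketch only, not an instruction-accurate simulation: memory is a `bytearray`, `legal_end` is a hypothetical parameter marking the first address that is illegal to read, and the `Trap` exception stands in for the trap that a fail-on-first load takes when its *first* element is illegal.

```python
class Trap(Exception):
    """Stands in for the trap taken by the first element of an ffirst load."""


def sv_strncpy(mem, dst, src, n, legal_end, mvl=8):
    """Copy up to n bytes from src to dst, stopping after a zero byte."""
    while n != 0:
        vl = min(n, mvl)                    # SETVLI a2, t4
        # ldb in ffirst mode: element 0 traps as normal, any later
        # illegal element just reduces VL
        if src >= legal_end:
            raise Trap("load from illegal address")
        vl = min(vl, legal_end - src)
        t0 = mem[src:src + vl]
        # bne in ffirst mode: the first zero byte stops the compares,
        # leaving VL at the number of nonzero bytes before it
        zero_at = t0.find(0)
        if zero_at != -1:
            vl = zero_at + 1                # GETVL, addi, SETVL: include zero
            mem[dst:dst + vl] = t0[:vl]     # stb, including the zero
            return
        mem[dst:dst + vl] = t0              # stb over the legal VL range
        src += vl                           # bump src pointer
        n -= vl                             # decrement count
        dst += vl                           # bump dst pointer


mem = bytearray(64)
mem[0:6] = b"hello\0"
sv_strncpy(mem, 32, 0, 16, legal_end=64, mvl=4)
print(bytes(mem[32:38]))
```

If the source string runs up against `legal_end` without a terminating zero, the model copies the legal bytes on one iteration and raises `Trap` on the next, which mirrors the last two notes above.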
1597
## strlen
1599
1600 RVV version:
1601
1602         mv a3, a0             # Save start
1603     loop:
1604         setvli a1, x0, vint8  # byte vec, x0 (Zero reg) => use max hardware len
1605         vldbff.v v1, (a3)     # Get bytes
1606         csrr a1, vl           # Get bytes actually read e.g. if fault
1607         vseq.vi v0, v1, 0     # Set v0[i] where v1[i] = 0
1608         add a3, a3, a1        # Bump pointer
1609         vmfirst a2, v0        # Find first set bit in mask, returns -1 if none
1610         bltz a2, loop         # Not found?
1611         add a0, a0, a1        # Sum start + bump
1612         add a3, a3, a2        # Add index of zero byte
1613         sub a0, a3, a0        # Subtract start address+bump
1614         ret
1615
1616 ## DAXPY
1617
1618 [[!inline raw="yes" pages="simple_v_extension/daxpy_example" ]]