   1 # Simple-V (Parallelism Extension Proposal) Specification (Abridged)
   2
   3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
   4 * Status: DRAFTv0.6
   5 * Last edited: 25 jun 2019
   6
   7 [[!toc ]]
   8
   9 # Introduction
  10
  11 Simple-V is a uniform parallelism API for RISC-V hardware that allows
  12 the Program Counter to enter "sub-contexts" in which, ultimately, standard
  13 RISC-V scalar opcodes are executed.
  14
  15 The sub-context execution is "nested" in "re-entrant" form, in the
  16 following order:
  17
  18 * Main standard RISC-V Program Counter (PC)
  19 * VBLOCK sub-execution context (PCVBLK increments whilst PC is paused)
  20 * VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause)
  21 * SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses)
  22
  23 Note: **there are *no* new opcodes**. The scheme works *entirely*
on hidden context that augments *scalar* RISC-V instructions.
  25
  26 # CSRs <a name="csrs"></a>
  27
  28 There are five CSRs, available in any privilege level:
  29
  30 * MVL (the Maximum Vector Length)
  31 * VL (which has different characteristics from standard CSRs)
  32 * SUBVL (effectively a kind of SIMD)
  33 * STATE (containing copies of MVL, VL and SUBVL as well as context information)
  34 * PCVBLK (the current operation being executed within a VBLOCK Group)
  35
  36 For Privilege Levels (trap handling) there are the following CSRs,
  37 where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor
  38 Modes respectively:
  39
  40 * (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative
  41   to the start of the current VBLOCK Group, set on a trap).
* (x)eSTATE (useful for saving and restoring during context switch,
  and for providing fast transitions)
  44
  45 The u/m/s CSRs are treated and handled exactly like their (x)epc
  46 equivalents.  On entry to or exit from a privilege level, the contents
  47 of its (x)eSTATE are swapped with STATE.
  48
(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See the VBLOCK section for details.
  51
  52 ## MAXVECTORLENGTH (MVL) <a name="mvl" />
  53
  54 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
  55 is variable length and may be dynamically set.  MVL is
  56 however limited to the regfile bitwidth XLEN (1-32 for RV32,
  57 1-64 for RV64 and so on).
  58
  59 ## Vector Length (VL) <a name="vl" />
  60
VSETVL is slightly different from RVV.  Similar to RVV, VL is set to be within
the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN):

    VL = rd = MIN(vlen, MVL)
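The clamping rule can be sketched in a few lines of Python. This is only an illustration: `set_vl` is a hypothetical helper name, and clamping requests below 1 up to 1 is an assumption drawn from the stated range 1 <= VL <= MVL.

```python
def set_vl(vlen: int, mvl: int, xlen: int = 64) -> int:
    """Model VL = rd = MIN(vlen, MVL), with 1 <= MVL <= XLEN."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must satisfy 1 <= MVL <= XLEN")
    # Assumption: a request below 1 is raised to 1 so that VL stays in range.
    return max(1, min(vlen, mvl))


# Requesting 100 elements with MVL=16 yields VL=16; requesting 5 yields VL=5.
vl_large = set_vl(100, 16)
vl_small = set_vl(5, 16)
```

A loop over a long array would call such a function once per pass, processing `VL` elements each time until none remain.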
  67
  68 ## SUBVL - Sub Vector Length
  69
This is a "group by quantity" that asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1
operation issued, SUBVL operations are issued.
  74
  75 The main effect of SUBVL is that predication bits are applied per
  76 **group**, rather than by individual element.
  77
  78 ## STATE
  79
  80 This is a standard CSR that contains sufficient information for a
  81 full context save/restore.  It contains (and permits setting of):
  82
  83 * MVL
  84 * VL
  85 * destoffs - the destination element offset of the current parallel
  86   instruction being executed
  87 * srcoffs - for twin-predication, the source element offset as well.
  88 * SUBVL
  89 * svdestoffs - the subvector destination element offset of the current
  90   parallel instruction being executed
  91 * svsrcoffs - for twin-predication, the subvector source element offset
  92   as well.
  93
  94 The format of the STATE CSR is as follows:
  95
| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
  97 | ------- | -------- | -------- | -------- | -------- | ------- | ------- |
  98 | dsvoffs | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl   |
  99
 100 Notes:
 101
 102 * The entries are truncated to be within range.  Attempts to set VL to
 103   greater than MAXVL will truncate VL.
 104 * Both VL and MAXVL are stored offset by one.  0b000000 represents VL=1,
 105   0b000001 represents VL=2.  This allows the full range 1 to XLEN instead
 106   of 0 to only 63.
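As a rough illustration of the layout and notes above, the following Python sketch packs and unpacks the STATE fields. The helper names `pack_state` and `unpack_state` are hypothetical. Only VL and MAXVL are stored offset by one, as stated; storing SUBVL and the offsets as raw values is an assumption, since the text above does not say otherwise.

```python
# name: (lowest bit, width), taken from the STATE format table
FIELDS = {
    "maxvl": (0, 6), "vl": (6, 6), "srcoffs": (12, 6), "destoffs": (18, 6),
    "subvl": (24, 2), "ssvoffs": (26, 2), "dsvoffs": (28, 2),
}
OFFSET_BY_ONE = {"maxvl", "vl"}  # 0b000000 represents a value of 1


def pack_state(**values):
    """Build a STATE word; unspecified fields default to their minimum."""
    vals = {name: (1 if name in OFFSET_BY_ONE else 0) for name in FIELDS}
    vals.update(values)
    vals["vl"] = min(vals["vl"], vals["maxvl"])  # VL is truncated to MAXVL
    word = 0
    for name, (low, width) in FIELDS.items():
        raw = vals[name] - 1 if name in OFFSET_BY_ONE else vals[name]
        word |= (raw & ((1 << width) - 1)) << low  # truncate to field width
    return word


def unpack_state(word):
    """Split a STATE word back into its named fields."""
    out = {}
    for name, (low, width) in FIELDS.items():
        raw = (word >> low) & ((1 << width) - 1)
        out[name] = raw + 1 if name in OFFSET_BY_ONE else raw
    return out
```

For example, `pack_state(maxvl=64, vl=1)` places 0b111111 in bits 5..0 and 0b000000 in bits 11..6.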
 107
 108 ## VL, MVL and SUBVL instruction aliases
 109
 110 This table contains pseudo-assembly instruction aliases. Note the
 111 subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
 112 reduced range of the 5 bit immediate.
 113
 114 | alias           | CSR                  |
 115 | -               | -                    |
 116 | SETVL rd, rs    | CSRRW  VL, rd, rs    |
 117 | SETVLi rd, #n   | CSRRWI VL, rd, #n-1  |
 118 | GETVL rd        | CSRRW  VL, rd, x0    |
 119 | SETMVL rd, rs   | CSRRW  MVL, rd, rs   |
 120 | SETMVLi rd, #n  | CSRRWI MVL,rd, #n-1  |
 121 | GETMVL rd       | CSRRW  MVL, rd, x0   |
 122
Note: CSRRC and other bit-setting instructions may still be used; they are, however, not particularly useful (very obscure).
 124
 125 ## Register key-value (CAM) table <a name="regcsrtable" />
 126
 127 The purpose of the Register table is to mark which registers change behaviour
 128 if used in a "Standard" (normally scalar) opcode.
 129
 130 16 bit format:
 131
 132 | RegCAM | 15       | (14..8)  | 7   | (6..5) | (4..0)  |
 133 | ------ | -        | -        | -   | ------ | ------- |
 134 | 0      | isvec0   | regidx0  | i/f | vew0   | regkey  |
 135 | 1      | isvec1   | regidx1  | i/f | vew1   | regkey  |
 136 | 2      | isvec2   | regidx2  | i/f | vew2   | regkey  |
 137 | 3      | isvec3   | regidx3  | i/f | vew3   | regkey  |
 138
 139 8 bit format:
 140
 141 | RegCAM | | 7   | (6..5) | (4..0)  |
 142 | ------ | | -   | ------ | ------- |
 143 | 0      | | i/f | vew0   | regnum  |
 144
 145 Mapping the 8-bit to 16-bit format:
 146
 147 | RegCAM | 15      | (14..8)    | 7   | (6..5) | (4..0)  |
 148 | ------ | -       | -          | -   | ------ | ------- |
 149 | 0      | isvec=1 | regnum0<<2 | i/f | vew0   | regnum0 |
 150 | 1      | isvec=1 | regnum1<<2 | i/f | vew1   | regnum1 |
 151 | 2      | isvec=1 | regnum2<<2 | i/f | vew2   | regnum2 |
| 3      | isvec=1 | regnum3<<2 | i/f | vew3   | regnum3 |
 153
 154 Fields:
 155
 156 * i/f is set to "1" to indicate that the redirection/tag entry is to
 157   be applied to integer registers; 0 indicates that it is relevant to
 158   floating-point registers.
* isvec indicates that the register (whether a src or dest) is to progress
  incrementally forward on each loop iteration.  This gives the "effect"
  of vectorisation.  isvec set to zero indicates "do not progress", giving
  the "effect" of that register being scalar.
* vew overrides the operation's default width.  See the table below.
 164 * regkey is the register which, if encountered in an op (as src or dest)
 165   is to be "redirected"
 166 * in the 16-bit format, regidx is the *actual* register to be used
 167   for the operation (note that it is 7 bits wide)
 168
 169 | vew | bitwidth            |
 170 | --- | ------------------- |
 171 | 00  | default (XLEN/FLEN) |
 172 | 01  | 8 bit               |
 173 | 10  | 16 bit              |
 174 | 11  | 32 bit              |
 175
As the above table is a CAM (key-value store), it may be appropriate
(faster, fewer gates, implementation-wise) to expand it as follows:

    struct vectorised {
        bool isvector:1;
        int  vew:2;
        bool enabled:1;
        int  regidx:7;   // actual register to use
    }

    struct vectorised fp_vec[32], int_vec[32];

    for (i = 0; i < len; i++) // from VBLOCK Format
       tb = int_vec if CSRvec[i].type == 0 else fp_vec
       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
       tb[idx].vew      = CSRvec[i].vew
       tb[idx].regidx   = CSRvec[i].regidx  // indirection
       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
       tb[idx].enabled  = true;
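The 16-bit and 8-bit entry layouts above can be decoded with plain shifts and masks. The Python below is a minimal sketch of that bit-slicing. The function names and the `VEW_BITS` table are illustrative, not part of the specification.

```python
# vew encoding from the table above; None stands for "default (XLEN/FLEN)"
VEW_BITS = {0b00: None, 0b01: 8, 0b10: 16, 0b11: 32}


def decode_reg_entry16(entry):
    """Split a 16-bit register table entry into its fields."""
    return {
        "isvec":  (entry >> 15) & 0x1,
        "regidx": (entry >> 8) & 0x7F,   # 7 bits: actual register to use
        "i_f":    (entry >> 7) & 0x1,
        "vew":    (entry >> 5) & 0x3,
        "regkey": entry & 0x1F,          # register named in the opcode
    }


def expand_reg_entry8(entry8):
    """Map an 8-bit entry to 16-bit form: isvec=1, regidx=regnum<<2."""
    i_f = (entry8 >> 7) & 0x1
    vew = (entry8 >> 5) & 0x3
    regnum = entry8 & 0x1F
    return (1 << 15) | ((regnum << 2) << 8) | (i_f << 7) | (vew << 5) | regnum
```

For instance, an 8-bit entry with i/f=1, vew=0b10 and regnum=5 expands to a 16-bit entry whose regkey is 5 and whose regidx is 20.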
 197
 198 ## Predication Table <a name="predication_csr_table"></a>
 199
 200 The Predication Table is a key-value store indicating whether, if a
 201 given destination register (integer or floating-point) is referred to
 202 in an instruction, it is to be predicated. Like the Register table, it
 203 is an indirect lookup that allows the RV opcodes to not need modification.
 204
* regidx, in combination with the i/f flag, is the register that triggers
  the lookup: if that integer or floating-point register is referred to in a
  (standard RV) instruction, the lookup table is referenced
  to find the predication mask to use for this operation.
 209 * predidx is the *actual* (full, 7 bit) register to be used for the
 210   predication mask.
* inv indicates that the predication mask bits are to be inverted
  prior to use *without* actually modifying the contents of the
  register from which those bits originated.
 214 * zeroing is either 1 or 0, and if set to 1, the operation must
 215   place zeros in any element position where the predication mask is
 216   set to zero.  If zeroing is set to 0, unpredicated elements *must*
 217   be left alone.  Some microarchitectures may choose to interpret
 218   this as skipping the operation entirely.  Others which wish to
 219   stick more closely to a SIMD architecture may choose instead to
 220   interpret unpredicated elements as an internal "copy element"
 221   operation (which would be necessary in SIMD microarchitectures
  that perform register-renaming).
* ffirst is a special mode that stops sequential element processing when
  a data-dependent condition occurs, whether a trap or a conditional test.
  The handling of each (trap or conditional test) is slightly different:
  see the Instruction sections for further details.
 227
 228 16 bit format:
 229
 230 | PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
 231 | ----- | -        | -      | -     | -   | ------- | ------- |
 232 | 0     | predidx  | zero0  | inv0  | i/f | regidx  | ffirst0 |
 233 | 1     | predidx  | zero1  | inv1  | i/f | regidx  | ffirst1 |
 234 | 2     | predidx  | zero2  | inv2  | i/f | regidx  | ffirst2 |
 235 | 3     | predidx  | zero3  | inv3  | i/f | regidx  | ffirst3 |
 236
 237 Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding.  Its use must
 238 generate an illegal instruction trap.
 239
 240 8 bit format:
 241
 242 | PrCSR | 7     | 6     | 5   | (4..0)  |
 243 | ----- | -     | -     | -   | ------- |
 244 | 0     | zero0 | inv0  | i/f | regnum  |
 245
 246 Mapping from 8 to 16 bit format, the table becomes:
 247
 248 | PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
 249 | ----- | -        | -      | -     | -   | ------- | ------- |
 250 | 0     | x9       | zero0  | inv0  | i/f | regnum  | ff=0    |
 251 | 1     | x10      | zero1  | inv1  | i/f | regnum  | ff=0    |
 252 | 2     | x11      | zero2  | inv2  | i/f | regnum  | ff=0    |
 253 | 3     | x12      | zero3  | inv3  | i/f | regnum  | ff=0    |
 254
 255 Pseudocode for predication:
 256
 257     struct pred {
 258         bool zero;    // zeroing
 259         bool inv;     // register at predidx is inverted
 260         bool ffirst;  // fail-on-first
 261         bool enabled; // use this to tell if the table-entry is active
 262         int predidx;  // redirection: actual int register to use
 263     }
 264
 265     struct pred fp_pred_reg[32];
 266     struct pred int_pred_reg[32];
 267
    for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
      idx = CSRpred[i].regidx
      tb[idx].zero     = CSRpred[i].zero
      tb[idx].inv      = CSRpred[i].inv
      tb[idx].ffirst   = CSRpred[i].ffirst
      tb[idx].predidx  = CSRpred[i].predidx
      tb[idx].enabled  = true

    def get_pred_val(bool is_fp_op, int reg):
       tb = fp_vec if is_fp_op else int_vec
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       tb = fp_pred_reg if is_fp_op else int_pred_reg
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       predidx = tb[reg].predidx   // redirection occurs HERE
       predicate = intreg[predidx] // actual predicate HERE
       if (tb[reg].inv):
          predicate = ~predicate   // invert ALL bits
       return predicate, tb[reg].zero
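For readers who prefer something executable, here is a small Python model of the predicate lookup. It is a sketch under assumptions: the dataclass names are invented, `XLEN` is fixed at 64, and the caller passes in the integer or floating-point tables directly instead of an `is_fp_op` flag.

```python
from dataclasses import dataclass

XLEN = 64
ALL_ONES = (1 << XLEN) - 1


@dataclass
class RegEntry:          # hypothetical stand-in for a register table entry
    enabled: bool = False


@dataclass
class PredEntry:         # hypothetical stand-in for a predication table entry
    enabled: bool = False
    zero: bool = False
    inv: bool = False
    predidx: int = 0


def get_pred_val(reg, reg_table, pred_table, intreg):
    """Return (predicate mask, zeroing flag) for register `reg`."""
    # Both the register entry and the predicate entry must be active.
    if not reg_table[reg].enabled or not pred_table[reg].enabled:
        return ALL_ONES, False           # all elements enabled; no zeroing
    entry = pred_table[reg]
    predicate = intreg[entry.predidx]    # redirection to the mask register
    if entry.inv:
        predicate = ~predicate & ALL_ONES  # invert without touching intreg
    return predicate, entry.zero
```

With `inv` set and `predidx` pointing at x0 (which reads as zero), the returned mask is all ones.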
 289
 290 ## Fail-on-First Mode <a name="ffirst-mode"></a>
 291
 292 ffirst is a special data-dependent predicate mode.  There are two
 293 variants: one is for faults: typically for LOAD/STORE operations,
 294 which may encounter end of page faults during a series of operations.
 295 The other variant is comparisons such as FEQ (or the augmented behaviour
 296 of Branch), and any operation that returns a result of zero (whether
 297 integer or floating-point).  In the FP case, this includes negative-zero.
 298
 299 Note that the execution order must "appear" to be sequential for ffirst
 300 mode to work correctly.  An in-order architecture must execute the element
 301 operations in sequence, whilst an out-of-order architecture must *commit*
 302 the element operations in sequence (giving the appearance of in-order
 303 execution).
 304
 305 Note also, that if ffirst mode is needed without predication, a special
 306 "always-on" Predicate Table Entry may be constructed by setting
 307 inverse-on and using x0 as the predicate register.  This
 308 will have the effect of creating a mask of all ones, allowing ffirst
 309 to be set.
 310
 311 ### Fail-on-first traps
 312
Except for the first element, ffirst stops sequential element processing
 314 when a trap occurs.  The first element is treated normally (as if ffirst
 315 is clear).  Should any subsequent element instruction require a trap,
 316 instead it and subsequent indexed elements are ignored (or cancelled in
 317 out-of-order designs), and VL is set to the *last* instruction that did
 318 not take the trap.
 319
Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur).  However, note that
the loop still had to test the predicate bit: thus on return,
VL is set to include elements that did not take the trap *and* the
elements that were predicated (masked) out, and therefore not tested,
up to the point where the trap occurred.
 326
 327 If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
 328 will cause a trap as normal (as if ffirst is not set); subsequently,
 329 the trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
 330 be modified.
 331
 332 Given that predication bits apply to SUBVL groups, the same rules apply
 333 to predicated-out (masked-out) sub-groups in calculating the value that VL
 334 is set to.
 335
 336 ### Fail-on-first conditional tests
 337
ffirst stops sequential element conditional testing on the first element result
 339 being zero.  VL is set to the number of elements that were processed before
 340 the fail-condition was encountered.
 341
 342 Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
 343 will cause the processing to end, and, even if there were elements within
 344 the *sub-group* that passed the test, that sub-group is still (entirely)
 345 excluded from the count (from setting VL).  i.e. VL is set to the total
 346 number of *sub-groups* that had no fail-condition up until execution was
 347 stopped.
 348
 349 Note again that, just as with traps, predicated-out (masked-out) elements
 350 are included in the count leading up to the fail-condition, even though they
 351 were not tested.
 352
 353 The pseudo-code for Predication makes this clearer and simpler than it is
 354 in words (the loop ends, VL is set to the current element index, "i").
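A minimal Python sketch of the conditional-test variant (with SUBVL=1) is shown here. The function name `ffirst_compare` and its arguments are illustrative only. It returns the current element index as the new VL, so a failure on the very first element yields 0 in this sketch; how that case is actually handled is not covered by the text above.

```python
def ffirst_compare(values, predicate, vl, test=lambda v: v != 0):
    """Return the new VL after a fail-on-first conditional test."""
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue      # masked out: not tested, but still counted
        if not test(values[i]):
            return i      # fail-condition: VL becomes the current index
    return vl             # no element failed: VL is unchanged


# Element 1 is zero but masked out, so the first tested failure is element 3.
new_vl = ffirst_compare([3, 0, 5, 0], 0b1101, 4)
```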
 355
 356 # Instructions <a name="instructions" />
 357
 358 To illustrate how Scalar operations are turned "vector" and "predicated",
 359 simplified example pseudo-code for an integer ADD operation is shown below.
 360 Floating-point would use the FP Register Table.
 361
 362     function op_add(rd, rs1, rs2) # add not VADD!
 363       int i, id=0, irs1=0, irs2=0;
 364       predval = get_pred_val(FALSE, rd);
 365       rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
 366       rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
 367       rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
 368       for (i = 0; i < VL; i++)
 369         xSTATE.srcoffs = i # save context
 370         if (predval & 1<<i) # predication uses intregs
 371            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
 372            if (!int_vec[rd ].isvector) break;
 373         if (int_vec[rd ].isvector)  { id += 1; }
 374         if (int_vec[rs1].isvector)  { irs1 += 1; }
 375         if (int_vec[rs2].isvector)  { irs2 += 1; }
 376
 377 Note that for simplicity there is quite a lot missing from the above
 378 pseudo-code.
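The same loop can be modelled in runnable form. The Python below is a sketch, not a reference implementation: the register table is a plain dictionary with assumed keys `regidx` and `isvector`, an absent entry is taken to mean "scalar, not redirected", and zeroing, elwidth and context save are all left out.

```python
def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval):
    """Element-wise integer add over up to `vl` elements, in place on `ireg`."""
    def resolve(r):
        # Assumption: an absent entry means "not redirected, scalar".
        entry = int_vec.get(r)
        if entry is None:
            return r, False
        return entry["regidx"], entry["isvector"]

    rd_base, rd_vec = resolve(rd)
    rs1_base, rs1_vec = resolve(rs1)
    rs2_base, rs2_vec = resolve(rs2)
    d = s1 = s2 = 0
    for i in range(vl):
        if predval & (1 << i):
            ireg[rd_base + d] = ireg[rs1_base + s1] + ireg[rs2_base + s2]
            if not rd_vec:
                break     # scalar destination: only one result is written
        if rd_vec:
            d += 1
        if rs1_vec:
            s1 += 1
        if rs2_vec:
            s2 += 1
    return ireg
```

With all three registers tagged as vectors and a predicate of 0b1011, elements 0, 1 and 3 are added and element 2 of the destination is left untouched.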
 379
 380 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
 381
 382 Adding in support for SUBVL is a matter of adding in an extra inner
 383 for-loop, where register src and dest are still incremented inside the
inner part. Note that the predication is still taken from the VL index.
 385
 386 So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
 387 indexed by "(i)"
 388
 389     function op_add(rd, rs1, rs2) # add not VADD!
 390       int i, id=0, irs1=0, irs2=0;
 391       predval = get_pred_val(FALSE, rd);
 392       rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
 393       rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
 394       rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
 395       for (i = 0; i < VL; i++)
 396        xSTATE.srcoffs = i # save context
 397        for (s = 0; s < SUBVL; s++)
 398         xSTATE.ssvoffs = s # save context
 399         if (predval & 1<<i) # predication uses intregs
 400            # actual add is here (at last)
 401            ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
 402            if (!int_vec[rd ].isvector) break;
 403         if (int_vec[rd ].isvector)  { id += 1; }
 404         if (int_vec[rs1].isvector)  { irs1 += 1; }
 405         if (int_vec[rs2].isvector)  { irs2 += 1; }
 406         if (id == VL or irs1 == VL or irs2 == VL) {
 407           # end VL hardware loop
 408           xSTATE.srcoffs = 0; # reset
 409           xSTATE.ssvoffs = 0; # reset
 410           return;
 411         }
 412
 413 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
 414 elwidth handling etc. all left out.
 415
 416 ## Instruction Format
 417
 418 It is critical to appreciate that there are
 419 **no operations added to SV, at all**.
 420
 421 Examples are given below where "standard" RV scalar behaviour is augmented.
 422
 423 ## Branch Instructions
 424
 425 Branch operations are augmented slightly to be a little more like FP
 426 Compares (FEQ, FNE etc.), by permitting the cumulation (and storage)
 427 of multiple comparisons into a register (taken indirectly from the predicate
 428 table).  As such, "ffirst" - fail-on-first - condition mode can be enabled.
 429 See ffirst mode in the Predication Table section.
 430
 431 ### Standard Branch <a name="standard_branch"></a>
 432
 433 Branch operations use standard RV opcodes that are reinterpreted to
 434 be "predicate variants" in the instance where either of the two src
 435 registers are marked as vectors (active=1, vector=1).
 436
 437 Note that the predication register to use (if one is enabled) is taken from
 438 the *first* src register, and that this is used, just as with predicated
 439 arithmetic operations, to mask whether the comparison operations take
 440 place or not.  If the second register is also marked as predicated,
 441 that (scalar) predicate register is used as a **destination** to store
 442 the results of all the comparisons.
 443
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
 449
 450 Pseudo-code for branch:
 451
 452     s1 = reg_is_vectorised(src1);
 453     s2 = reg_is_vectorised(src2);
 454
 455     if not s1 && not s2
 456         if cmp(rs1, rs2) # scalar compare
 457             goto branch
 458         return
 459
 460     preg = int_pred_reg[rd]
 461     reg = int_regfile
 462
 463     ps = get_pred_val(I/F==INT, rs1);
 464     rd = get_pred_val(I/F==INT, rs2); # this may not exist
 465
 466     if not exists(rd) or zeroing:
 467         result = 0
 468     else
 469         result = preg[rd]
 470
 471     for (int i = 0; i < VL; ++i)
 472       if (zeroing)
 473         if not (ps & (1<<i))
 474            result &= ~(1<<i);
 475       else if (ps & (1<<i))
          if (cmp(s1 ? reg[src1+i]:reg[src1],
                  s2 ? reg[src2+i]:reg[src2]))
 478               result |= 1<<i;
 479           else
 480               result &= ~(1<<i);
 481
 482      if not exists(rd)
 483         if result == ps
 484             goto branch
 485      else
 486         preg[rd] = result # store in destination
 487         if preg[rd] == ps
 488             goto branch
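The non-zeroing path with no destination predicate can be reduced to a short Python sketch. The function name and arguments are illustrative. Masking the source predicate to the low VL bits before the final comparison is an assumption made here so that an all-ones predicate compares equal to a VL-bit result.

```python
def vector_branch(reg, src1, src2, vl, ps, s1_vec, s2_vec, cmp):
    """Return (result bits, branch taken) for a vectorised compare."""
    result = 0
    for i in range(vl):
        if not (ps >> i) & 1:
            continue                      # predicated out: no comparison
        a = reg[src1 + i] if s1_vec else reg[src1]
        b = reg[src2 + i] if s2_vec else reg[src2]
        if cmp(a, b):
            result |= 1 << i
    active = ps & ((1 << vl) - 1)         # assumption: only VL bits count
    return result, result == active       # taken only if all tests succeed
```

A single failing element that is not predicated out is enough to prevent the branch from being taken.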
 489
 490 Notes:
 491
 492 * Predicated SIMD comparisons would break src1 and src2 further down
 493   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 494   Reordering") setting Vector-Length times (number of SIMD elements) bits
 495   in Predicate Register rd, as opposed to just Vector-Length bits.
 496 * The execution of "parallelised" instructions **must** be implemented
 497   as "re-entrant" (to use a term from software).  If an exception (trap)
 498   occurs during the middle of a vectorised
 499   Branch (now a SV predicated compare) operation, the partial results
 500   of any comparisons must be written out to the destination
 501   register before the trap is permitted to begin.  If however there
 502   is no predicate, the **entire** set of comparisons must be **restarted**,
 503   with the offset loop indices set back to zero.  This is because
 504   there is no place to store the temporary result during the handling
 505   of traps.
 506
 507 Note also that where normally, predication requires that there must
 508 also be a CSR register entry for the register being used in order
 509 for the **predication** CSR register entry to also be active,
 510 for branches this is **not** the case.  src2 does **not** have
 511 to have its CSR register entry marked as active in order for
 512 predication on src2 to be active.
 513
 514 ### Floating-point Comparisons
 515
There are no floating-point branch operations, only compares.
Interestingly, no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
 520
 521 As RV Scalar does not have "FNE", predication inversion must be used.
 522 Also: note that FP Compare may be predicated, using the destination
 523 integer register (rd) to determine the predicate.  FP Compare is **not**
 524 a twin-predication operation, as, again, just as with SV Branches,
 525 there are three registers involved: FP src1, FP src2 and INT rd.
 526
 527 Also: note that ffirst (fail first mode) applies directly to this operation.
 528
 529 ### Compressed Branch Instruction
 530
 531 Compressed Branch instructions are, just like standard Branch instructions,
 532 reinterpreted to be vectorised and predicated based on the source register
 533 (rs1s) CSR entries.  As however there is only the one source register,
 534 given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
 536 table entries for **x0**.
 537
 538 The specific required use of x0 is, with a little thought, quite obvious,
 539 but is counterintuitive.  Clearly it is **not** recommended to redirect
 540 x0 with a CSR register entry, however as a means to opaquely obtain
 541 a predication target it is the only sensible option that does not involve
 542 additional special CSRs (or, worse, additional special opcodes).
 543
 544 Note also that, just as with standard branches, the 2nd source
 545 (in this case x0 rather than src2) does **not** have to have its CSR
 546 register table marked as "active" in order for predication to work.
 547
 548 ## Vectorised Dual-operand instructions
 549
 550 There is a series of 2-operand instructions involving copying (and
 551 sometimes alteration):
 552
 553 * C.MV
 554 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 555 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 556 * LOAD(-FP) and STORE(-FP)
 557
 558 All of these operations follow the same two-operand pattern, so it is
 559 *both* the source *and* destination predication masks that are taken into
 560 account.  This is different from
 561 the three-operand arithmetic instructions, where the predication mask
 562 is taken from the *destination* register, and applied uniformly to the
 563 elements of the source register(s), element-for-element.
 564
 565 The pseudo-code pattern for twin-predicated operations is as
 566 follows:
 567
 568     function op(rd, rs):
 569       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
 570       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
 571       ps = get_pred_val(FALSE, rs); # predication on src
 572       pd = get_pred_val(FALSE, rd); # ... AND on dest
 573       for (int i = 0, int j = 0; i < VL && j < VL;):
 574         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
 575         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
 576         xSTATE.srcoffs = i # save context
 577         xSTATE.destoffs = j # save context
 578         reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
 579         if (int_csr[rs].isvec) i++;
 580         if (int_csr[rd].isvec) j++; else break
 581
 582 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 583 and vector-vector, and predicated variants of all of those.
 584 Zeroing is not presently included (TODO).  As such, when compared
 585 to RVV, the twin-predicated variants of C.MV and FMV cover
 586 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 587 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
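To see how one pattern yields several vector operations, the twin-predicated loop can be modelled in Python for a plain copy. This is a sketch with invented names. It passes the already-redirected register numbers and the isvec flags in directly, and it adds bounds checks to the skip loops so that an all-zero mask cannot run past VL.

```python
def twin_pred_mv(reg, rd, rs, vl, ps, pd, rs_isvec, rd_isvec):
    """Copy elements from rs to rd, skipping masked-out src and dest slots."""
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not (ps >> i) & 1:
                i += 1                    # skip masked-out source elements
        if rd_isvec:
            while j < vl and not (pd >> j) & 1:
                j += 1                    # skip masked-out dest elements
        if i >= vl or j >= vl:
            break
        reg[rd + j] = reg[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        else:
            break                         # scalar destination: one copy only
    return reg
```

A scalar source with a vector destination behaves like a splat, a source-only mask packs the selected elements together (gather), and a vector source with a scalar destination extracts the first unmasked element.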
 588
 589 ### C.MV Instruction <a name="c_mv"></a>
 590
There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change.
 598
 599 There are several different instructions from RVV that are covered by
 600 this one opcode:
 601
 602 [[!table  data="""
 603 src    | dest    | predication   | op             |
 604 scalar | vector  | none          | VSPLAT         |
 605 scalar | vector  | destination   | sparse VSPLAT  |
 606 scalar | vector  | 1-bit dest    | VINSERT        |
 607 vector | scalar  | 1-bit? src    | VEXTRACT       |
 608 vector | vector  | none          | VCOPY          |
 609 vector | vector  | src           | Vector Gather  |
 610 vector | vector  | dest          | Vector Scatter |
 611 vector | vector  | src & dest    | Gather/Scatter |
 612 vector | vector  | src == dest   | sparse VCOPY   |
 613 """]]
 614
 615 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 616 operations with inversion on the src and dest predication for one of the
 617 two C.MV operations.
 618
 619 ### FMV, FNEG and FABS Instructions
 620
 621 These are identical in form to C.MV, except covering floating-point
 622 register copying.  The same double-predication rules also apply.
However, when elwidth is not set to default, the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
 625 operation of the appropriate size covering the source and destination
 626 register bitwidths.
 627
 628 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
 629
### FCVT Instructions
 631
 632 These are again identical in form to C.MV, except that they cover
 633 floating-point to integer and integer to floating-point.  When element
 634 width in each vector is set to default, the instructions behave exactly
 635 as they are defined for standard RV (scalar) operations, except vectorised
 636 in exactly the same fashion as outlined in C.MV.
 637
 638 However when the source or destination element width is not set to default,
 639 the opcode's explicit element widths are *over-ridden* to new definitions,
 640 and the opcode's element width is taken as indicative of the SIMD width
 641 (if applicable i.e. if packed SIMD is requested) instead.
 642
 643 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
 644
 645 In vectorised architectures there are usually at least two different modes
 646 for LOAD/STORE:
 647
 648 * Read (or write for STORE) from sequential locations, where one
 649   register specifies the address, and the one address is incremented
 650   by a fixed amount.  This is usually known as "Unit Stride" mode.
 651 * Read (or write) from multiple indirected addresses, where the
 652   vector elements each specify separate and distinct addresses.
 653
 654 To support these different addressing modes, the CSR Register "isvector"
 655 bit is used.  So, for a LOAD, when the src register is set to
 656 scalar, the LOADs are sequentially incremented by the src register
 657 element width, and when the src register is set to "vector", the
 658 elements are treated as indirection addresses.  Simplified
 659 pseudo-code would look like this:
 660
 661     function op_ld(rd, rs) # LD not VLD!
 662       rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
 663       rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
 664       ps = get_pred_val(FALSE, rs); # predication on src
 665       pd = get_pred_val(FALSE, rd); # ... AND on dest
 666       for (int i = 0, int j = 0; i < VL && j < VL;):
 667         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
 668         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
 670           # indirect mode (multi mode)
 671           srcbase = ireg[rsv+i];
 672         else
 673           # unit stride mode
 674           srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
 675         ireg[rdv+j] <= mem[srcbase + imm_offs];
 676         if (!int_csr[rs].isvec &&
 677             !int_csr[rd].isvec) break # scalar-scalar LD
 678         if (int_csr[rs].isvec) i++;
 679         if (int_csr[rd].isvec) j++;
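The difference between the two addressing modes comes down to how each element's address is formed. The Python sketch below computes only the addresses, ignoring predication. `load_addresses` is a hypothetical helper, and the unit-stride step of XLEN/8 bytes follows the pseudo-code above.

```python
def load_addresses(ireg, rs, vl, rs_isvec, imm_offs=0, xlen=64):
    """Return the memory address used for each of `vl` element loads."""
    if rs_isvec:
        # Indirect (multi) mode: each source element holds its own address.
        return [ireg[rs + i] + imm_offs for i in range(vl)]
    # Unit stride mode: one base address, stepped by XLEN/8 bytes.
    return [ireg[rs] + i * (xlen // 8) + imm_offs for i in range(vl)]
```

With a scalar source holding 0x1000 on RV64, three elements are loaded from 0x1000, 0x1008 and 0x1010. With a vector source, the three addresses are whatever the three consecutive source registers contain.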
 680
 681 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
 682
C.LWSP / C.SWSP and their floating-point equivalents are also source-dest
twin-predicated, where it is implicit in C.LWSP/FLWSP etc. that x2 is the
source register.  It is therefore possible to use predicated C.LWSP to
efficiently pop registers off the stack (by predicating x2 as the source),
cherry-picking which registers to load into (by predicating the destination).
Likewise for C.SWSP.  In this way, LOAD/STORE-Multiple is efficiently achieved.
 689
 690 **Note**: it is still possible to redirect x2 to an alternative target
 691 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
 692 general-purpose LOAD/STORE operations.
 693
 694 ## Compressed LOAD / STORE Instructions
 695
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for non-compressed
LOAD/STORE.  Again: setting scalar or vector mode on the src for LOAD,
or on the dest for STORE, selects "Unit Stride" or "Multi-indirection"
mode respectively.
 701
 702 # Element bitwidth polymorphism <a name="elwidth"></a>
 703
 704 Element bitwidth is best covered as its own special section, as it
 705 is quite involved and applies uniformly across-the-board.  SV restricts
 706 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
 707
The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
now looks like this:
 713
 714     typedef union {
 715         uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
 716         uint8_t  b[0]; // array of type uint8_t
 717         uint16_t s[0];
 718         uint32_t i[0];
 719         uint64_t l[0];
 720         uint128_t d[0];
 721     } reg_t;
 722
 723     reg_t int_regfile[128];
 724
where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" to consecutive register file entries in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if an access beyond the "real" register
bytes is ever attempted.
 734
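The overspill behaviour described above can be modelled in Python as one flat byte array covering the whole register file. This is only a sketch under stated assumptions: the class name `RegFile`, the little-endian packing of elements, and the use of `IndexError` to stand in for the required exception are all illustrative choices, not part of the specification.

```python
XLEN_BYTES = 8   # RV64: 8 "real" bytes per register (assumption for this sketch)
NUM_REGS = 128

class RegFile:
    """Register file modelled as one contiguous byte array."""

    def __init__(self):
        self.mem = bytearray(XLEN_BYTES * NUM_REGS)

    def _span(self, reg, width_bits, offset):
        nbytes = width_bits // 8
        # element 'offset' counts elwidth-sized items from the start of 'reg',
        # so a large offset naturally lands in a following register
        start = reg * XLEN_BYTES + offset * nbytes
        if start + nbytes > len(self.mem):
            raise IndexError("element lies beyond the end of the register file")
        return start, nbytes

    def get(self, reg, width_bits, offset):
        start, nbytes = self._span(reg, width_bits, offset)
        return int.from_bytes(self.mem[start:start + nbytes], "little")

    def set(self, reg, width_bits, offset, val):
        start, nbytes = self._span(reg, width_bits, offset)
        mask = (1 << width_bits) - 1
        self.mem[start:start + nbytes] = (val & mask).to_bytes(nbytes, "little")

rf = RegFile()
rf.set(3, 16, 5, 0xBEEF)          # 16-bit element 5 of x3: four fit per register
print(hex(rf.get(4, 16, 1)))      # the same bytes, seen as element 1 of x4
```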
 735 The pseudo-code is as follows, to demonstrate how the sign-extending
 736 and width-extending works:
 737
    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0:
            return xlen
        if elwidth == 1:
            return xlen / 2
        if elwidth == 2:
            return xlen * 2
        // elwidth == 3:
        return 8

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

      maxsrcwid =  get_max_elwidth(rs1, rs2) # source element width(s)
      destwid = bw(int_csr[rd].elwidth)      # destination element width
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
           // TODO, calculate if over-run occurs, for each elwidth
           src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
           // TODO, sign/zero-extend src1 and src2 as operation requires
           if (op_requires_sign_extend_src1)
              src1 = sign_extend(src1, maxsrcwid)
           src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
           result = src1 + src2 # actual add here
           // TODO, sign/zero-extend result, as operation requires
           if (op_requires_sign_extend_dest)
              result = sign_extend(result, maxsrcwid)
           set_polymorphed_reg(rd, destwid, ird, result)
           if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { ird += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
 806
 807 ## Polymorphic floating-point operation exceptions and error-handling
 808
For floating-point operations, conversion takes place without
raising any kind of exception.  Exactly as specified in the standard
RV specification, NAN (or appropriate) is stored if the result
is beyond the range of the destination and, just as with scalar
operations, the relevant floating-point flag is raised in FCSR.  Again
as with scalar operations, it is software's responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
 818
 819 Note that it is perfectly legitimate for floating-point bitwidths of
 820 only 8 to be specified.  However whilst it is possible to apply IEEE 754
 821 principles, no actual standard yet exists.  Implementors wishing to
 822 provide hardware-level 8-bit support rather than throw a trap to emulate
 823 in software should contact the author of this specification before
 824 proceeding.
 825
 826 ## Polymorphic shift operators
 827
 828 A special note is needed for changing the element width of left and right
 829 shift operators, particularly right-shift.
 830
 831 For SV, where each operand's element bitwidth may be over-ridden, the
 832 rule about determining the operation's bitwidth *still applies*, being
 833 defined as the maximum bitwidth of RS1 and RS2.  *However*, this rule
 834 **also applies to the truncation of RS2**.  In other words, *after*
 835 determining the maximum bitwidth, RS2's range must **also be truncated**
 836 to ensure a correct answer.  Example:
 837
 838 * RS1 is over-ridden to a 16-bit width
 839 * RS2 is over-ridden to an 8-bit width
 840 * RD is over-ridden to a 64-bit width
 841 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
 842 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
 843
 844 Pseudocode (in spike) for this example would therefore be:
 845
 846     WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
 847
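The same rule can be written as a small Python sketch that mirrors the spike line above, parameterised by the two source widths. The function names `zext`, `sext` and `poly_sll` are hypothetical, element widths are assumed to be powers of two, and XLEN is assumed to be 64.

```python
XLEN = 64  # assumed register width in bits

def zext(val, bits):
    """Zero-extend: keep only the low 'bits' bits."""
    return val & ((1 << bits) - 1)

def sext(val, bits):
    """Sign-extend the low 'bits' bits of val to a Python integer."""
    val &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (val ^ sign) - sign

def poly_sll(rs1, rs2, rs1_wid, rs2_wid):
    """Shift-left with overridden source widths (hypothetical helper)."""
    opwid = max(rs1_wid, rs2_wid)   # operation width: maximum of the sources
    shamt = rs2 & (opwid - 1)       # RS2 truncated to the range 0..opwid-1
    return sext(zext(rs1, opwid) << shamt, XLEN)

# RS1 is 16-bit, RS2 is 8-bit: a shift amount of 0x11 is truncated to 1
print(poly_sll(0x0001, 0x11, 16, 8))
```

Without the truncation step the shift amount would be 17, which exceeds the 16-bit operation width determined by the sources.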
 848 ## Polymorphic MULH/MULHU/MULHSU
 849
 850 MULH is designed to take the top half MSBs of a multiply that
 851 does not fit within the range of the source operands, such that
 852 smaller width operations may produce a full double-width multiply
 853 in two cycles.  The issue is: SV allows the source operands to
 854 have variable bitwidth.
 855
 856 Here again special attention has to be paid to the rules regarding
 857 bitwidth, which, again, are that the operation is performed at
 858 the maximum bitwidth of the **source** registers.  Therefore:
 859
 860 * An 8-bit x 8-bit multiply will create a 16-bit result that must
 861   be shifted down by 8 bits
 862 * A 16-bit x 8-bit multiply will create a 24-bit result that must
 863   be shifted down by 16 bits (top 8 bits being zero)
 864 * A 16-bit x 16-bit multiply will create a 32-bit result that must
 865   be shifted down by 16 bits
 866 * A 32-bit x 16-bit multiply will create a 48-bit result that must
 867   be shifted down by 32 bits
 868 * A 32-bit x 8-bit multiply will create a 40-bit result that must
 869   be shifted down by 32 bits
 870
So again, just as with shift-left and shift-right, the result
is shifted down by the maximum of the two source register bitwidths.
And, again exactly as before, truncation or sign-extension is performed
on the result.  If sign-extension is to be carried out, it is performed
from the same maximum of the two source register bitwidths out
to the result element's bitwidth.
 877
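The shift-down rule can be sketched in Python for the unsigned case (MULHU) only. The name `poly_mulhu` and its parameters are hypothetical; the signed variants would additionally need sign-extension of the operands and result, which is not shown here.

```python
def zext(val, bits):
    """Zero-extend: keep only the low 'bits' bits."""
    return val & ((1 << bits) - 1)

def poly_mulhu(rs1, rs2, rs1_wid, rs2_wid, rd_wid):
    """Unsigned high-half multiply with overridden element widths."""
    opwid = max(rs1_wid, rs2_wid)            # operation width from the sources
    product = zext(rs1, rs1_wid) * zext(rs2, rs2_wid)
    high = product >> opwid                  # shift down by the maximum width
    return zext(high, rd_wid)                # truncate to the destination width

# 16-bit x 8-bit: 24-bit product, shifted down by 16 bits
print(hex(poly_mulhu(0xFFFF, 0xFF, 16, 8, 16)))
```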
If truncation occurs, i.e. the top MSBs of the result are lost,
this is "Officially Not Our Problem": it is assumed that the
programmer actually desires the result to be truncated.  If the
programmer wanted all of the bits, they would have set the destination
elwidth to accommodate them.
 883
 884 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
 885
Polymorphic element widths in vectorised form mean that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.
 890
 891 This makes for a slightly more complex algorithm when using indirection
 892 on the "addressed" register (source for LOAD and destination for STORE),
 893 particularly given that the LOAD/STORE instruction provides important
 894 information about the width of the data to be reinterpreted.
 895
 896 As LOAD/STORE may be twin-predicated, it is important to note that
 897 the rules on twin predication still apply.  Where in previous
 898 pseudo-code (elwidth=default for both source and target) it was
 899 the *registers* that the predication was applied to, it is now the
 900 **elements** that the predication is applied to.
 901
 902 The full pseudocode for all LD operations may be written out
 903 as follows:
 904
    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        # at least one element per block, so that the divide is always valid
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
      bitwidth = bw(int_csr[rd].elwidth) # destination element width, in bits
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        val = load_memory(rs, imm, i, opwidth)
        if unsigned:
            val = zero_extend(val, min(opwidth, bitwidth))
        else:
            val = sign_extend(val, min(opwidth, bitwidth))
        set_polymorphed_reg(rd, bitwidth, j, val)
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;
 942
 943 # Predication Element Zeroing
 944
The introduction of zeroing on traditional vector predication is usually
intended as an optimisation for lane-based microarchitectures with register
renaming, allowing them to save power by avoiding a register read on elements
that are passed en-masse through the ALU.  Simpler microarchitectures
do not have this issue: they simply do not pass the element through to
the ALU at all, and therefore do not store it back in the destination.
More complex non-lane-based micro-architectures can, when zeroing is
not set, use the predication bits to simply avoid sending element-based
operations to the ALUs, entirely: thus, over the long term, potentially
keeping all ALUs 100% occupied even when elements are predicated out.
 955
SimpleV's design principle is not based on or influenced by
microarchitectural design factors: it is a hardware-level API.
Therefore, looking purely at whether zeroing is *useful* or not
(whether fewer instructions are needed for certain scenarios),
and given that a case can be made for zeroing *and* non-zeroing, the
decision was taken to add support for both.
 962
 963 ## Single-predication (based on destination register)
 964
 965 Zeroing on predication for arithmetic operations is taken from
 966 the destination register's predicate.  i.e. the predication *and*
 967 zeroing settings to be applied to the whole operation come from the
 968 CSR Predication table entry for the destination register.
 969 Thus when zeroing is set on predication of a destination element,
 970 if the predication bit is clear, then the destination element is *set*
 971 to zero (twin-predication is slightly different, and will be covered
 972 next).
 973
Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
 976
      for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation
           while (i < VL && !(predval & 1<<i))
             # skip masked-out elements entirely
             i += 1
             if (int_vec[rd ].isvector)  { ird += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
           if i == VL:
             return
        if (predval & 1<<i)
           src1 = ....
           src2 = ...
           result = src1 + src2 # actual add (or other op) here
           set_polymorphed_reg(rd, destwid, ird, result)
           if int_vec[rd].ffirst and result == 0:
              VL = i # result was zero, end loop early, return VL
              return
           if (!int_vec[rd].isvector) return
        else if zeroing:
           result = 0
           set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { ird += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
1003
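The difference between zeroing and non-zeroing can be seen in a reduced Python sketch of the loop above. It assumes that all three operands are vectors at the default element width and leaves out fail-first and register redirection; the function name `predicated_add` is hypothetical.

```python
def predicated_add(dest, src1, src2, predval, vl, zeroing):
    """Return the destination vector after a predicated element-wise add."""
    out = list(dest)
    for i in range(vl):
        if predval & (1 << i):
            out[i] = src1[i] + src2[i]   # active element: perform the operation
        elif zeroing:
            out[i] = 0                   # masked-out element is set to zero
        # otherwise the masked-out element keeps its previous value
    return out

dest = [9, 9, 9, 9]
src1 = [1, 2, 3, 4]
src2 = [10, 20, 30, 40]

print(predicated_add(dest, src1, src2, 0b0101, 4, zeroing=False))
print(predicated_add(dest, src1, src2, 0b0101, 4, zeroing=True))
```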
1004 The optimisation to skip elements entirely is only possible for certain
1005 micro-architectures when zeroing is not set.  However for lane-based
1006 micro-architectures this optimisation may not be practical, as it
1007 implies that elements end up in different "lanes".  Under these
1008 circumstances it is perfectly fine to simply have the lanes
1009 "inactive" for predicated elements, even though it results in
1010 less than 100% ALU utilisation.
1011
1012 ## Twin-predication (based on source and destination register)
1013
Twin-predication is not that much different, except that
the source is independently zero-predicated from the destination.
This means that the source may be zero-predicated *or* the
destination zero-predicated *or both*, or neither.
1018
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation (the exception being:
if the source data element is to be treated as an address - a LOAD -
then the data returned *from* the LOAD is zero, rather than looking up an
*address* of zero).
1025
1026 When zeroing is set on the destination and not the source, then just
1027 as with single-predicated operations, a zero is stored into the destination
1028 element (or target memory address for a STORE).
1029
Zeroing on both source and destination effectively results in a bitwise
AND of the source and destination predicates: data passes through only
where both bits are set, and where either source predicate OR destination
predicate is set to 0, a zero element will ultimately end up in the
destination register.
1034
1035 However: this may not necessarily be the case for all operations;
1036 implementors, particularly of custom instructions, clearly need to
1037 think through the implications in each and every case.
1038
1039 Here is pseudo-code for a twin zero-predicated operation:
1040
    function op_mv(rd, rs) # MV not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
            if ((ps & 1<<i))
                sourcedata = ireg[rs+i];
            else
                sourcedata = 0 # zeroed source element
            ireg[rd+j] <= sourcedata
        else if (zerodst)
            ireg[rd+j] <= 0
        if (int_csr[rs].isvec)
            i++;
        if (int_csr[rd].isvec)
            j++;
        else
            if ((pd & 1<<j))
                break;
1064
1065 Note that in the instance where the destination is a scalar, the hardware
1066 loop is ended the moment a value *or a zero* is placed into the destination
1067 register/element.  Also note that, for clarity, variable element widths
1068 have been left out of the above.
1069
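As a rough illustration, the twin zero-predicated MV can be written as a Python sketch. It assumes that both source and destination are vectors at the default element width, adds explicit bounds checks to the predicate-skipping loops, and uses the hypothetical name `twin_pred_mv`; the scalar-destination early exit is not modelled.

```python
def twin_pred_mv(regs, rd, rs, vl, ps, pd, zerosrc=False, zerodst=False):
    """Twin-predicated vector MV from regs[rs..] to regs[rd..], in place."""
    i = j = 0
    while i < vl and j < vl:
        if not zerosrc:                       # skip masked-out source elements
            while i < vl and not (ps >> i) & 1:
                i += 1
        if not zerodst:                       # skip masked-out dest elements
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        if (pd >> j) & 1:
            # an inactive source element is only reached when zerosrc is set
            regs[rd + j] = regs[rs + i] if (ps >> i) & 1 else 0
        elif zerodst:
            regs[rd + j] = 0
        i += 1
        j += 1
    return regs

regs = twin_pred_mv(list(range(32)), 16, 8, 4, ps=0b1010, pd=0b0110)
print(regs[16:20])   # no zeroing: active sources packed into active dests

regs = twin_pred_mv(list(range(32)), 16, 8, 4, ps=0b1010, pd=0b0110,
                    zerosrc=True, zerodst=True)
print(regs[16:20])   # zeroing on both: zero wherever either bit is clear
```

With zeroing on both predicates the source and destination indices advance in lock-step, so data survives only in positions where both predicate bits are set.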
1070 # Exceptions
1071
1072 TODO: expand.
1073
1074 # Hints
1075
1076 With Simple-V being capable of issuing *parallel* instructions where
1077 rd=x0, the space for possible HINTs is expanded considerably.  VL
1078 could be used to indicate different hints.  In addition, if predication
1079 is set, the predication register itself could hypothetically be passed
1080 in as a *parameter* to the HINT operation.
1081
No specific hints are yet defined in Simple-V.
1083
1084 # Vector Block Format <a name="vliw-format"></a>
1085
1086 See ancillary resource: [[vblock_format]]
1087
1088 # Subsets of RV functionality
1089
1090 It is permitted to only implement SVprefix and not the VBLOCK instruction
1091 format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
1092 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1093 traps may emulate the format.
1094
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details).  Again, UNIX Platforms
**MUST** raise illegal instruction on implementations that do not support
VL or SUBVL.
1099
It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However, reducing
either register file below the mandatory limits set in the RV standard will
result in non-compliance with the SV Specification.
1104