   1 # Simple-V (Parallelism Extension Proposal) Specification (Abridged)
   2
   3 * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
   4 * Status: DRAFTv0.6
   5 * Last edited: 25 jun 2019
   6
   7 [[!toc ]]
   8
   9 # Introduction
  10
  11 Simple-V is a uniform parallelism API for RISC-V hardware that allows
  12 the Program Counter to enter "sub-contexts" in which, ultimately, standard
  13 RISC-V scalar opcodes are executed.
  14
  15 The sub-context execution is "nested" in "re-entrant" form, in the
  16 following order:
  17
  18 * Main standard RISC-V Program Counter (PC)
  19 * VBLOCK sub-execution context (PCVBLK increments whilst PC is paused).
  20 * VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause).
  21   Predication bits may be individually applied per element.
  22 * SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses).
  23   Individual predicate bits from VL loops apply to the *group* of SUBVL
  24   elements.
  25
  26 An ancillary "SVPrefix" Format (P48/P64) [[sv_prefix_proposal]] may
  27 run its own VL/SUBVL "loops" and specifies its own Register and Predication
  28 format on the 32-bit RV scalar opcode embedded within it.
  29
  30 The [[vblock_format]] specifies how VBLOCK sub-execution contexts
  31 operate.
  32
SV is never actually switched "off".  VL or SUBVL may be equal to 1, and
the Register or Predicate over-ride tables may be empty: under such
circumstances the behaviour becomes effectively identical to standard RV
execution.
  37
  38 Note: **there are *no* new opcodes**. The scheme works *entirely*
  39 on hidden context that augments *scalar* RISC-V instructions.  Thus it
  40 may cover existing, future and custom scalar extensions, turning all
  41 existing, all future and all custom scalar operations parallel, without
  42 requiring any special opcodes to do so.
  43
  44 # CSRs <a name="csrs"></a>
  45
  46 There are five CSRs, available in any privilege level:
  47
  48 * MVL (the Maximum Vector Length)
  49 * VL (which has different characteristics from standard CSRs)
  50 * SUBVL (effectively a kind of SIMD)
  51 * STATE (containing copies of MVL, VL and SUBVL as well as context information)
  52 * PCVBLK (the current operation being executed within a VBLOCK Group)
  53
  54 For Privilege Levels (trap handling) there are the following CSRs,
  55 where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor
  56 Modes respectively:
  57
  58 * (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative
  59   to the start of the current VBLOCK Group, set on a trap).
* (x)eSTATE (useful for saving and restoring during context switch,
  and for providing fast transitions)
  62
  63 The u/m/s CSRs are treated and handled exactly like their (x)epc
  64 equivalents.  On entry to or exit from a privilege level, the contents
  65 of its (x)eSTATE are swapped with STATE.
  66
(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
equivalents. See VBLOCK section for details.
  69
  70 ## MAXVECTORLENGTH (MVL) <a name="mvl" />
  71
  72 MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
  73 is variable length and may be dynamically set.  MVL is
  74 however limited to the regfile bitwidth XLEN (1-32 for RV32,
  75 1-64 for RV64 and so on).
  76
  77 ## Vector Length (VL) <a name="vl" />
  78
VSETVL is slightly different from RVV.  As in RVV, VL is set to be within
the range 1 <= VL <= MVL, where MVL is in turn limited to 1 <= MVL <= XLEN:

    VL = rd = MIN(vlen, MVL)
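
The rule above can be sketched in Python as follows. This is only an illustration: the names `setvl` and `strip_mine` are hypothetical, and clamping a request of zero up to 1 is an assumption made here to stay within the stated range 1 <= VL <= MVL.

```python
def setvl(requested: int, mvl: int, xlen: int = 64) -> int:
    """Return the VL a SETVL-style write would produce (illustrative only)."""
    if not 1 <= mvl <= xlen:
        raise ValueError("MVL must satisfy 1 <= MVL <= XLEN")
    # VL = MIN(vlen, MVL); the lower clamp to 1 is an assumption of this sketch.
    return max(1, min(requested, mvl))


def strip_mine(total: int, mvl: int):
    """Yield (start, vl) chunks that together cover `total` elements."""
    done = 0
    while done < total:
        vl = setvl(total - done, mvl)
        yield done, vl
        done += vl
```

For example, `list(strip_mine(10, 4))` walks ten elements in chunks of 4, 4 and 2, which is the usual way a software loop consumes the value returned in rd.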
  85
  86 ## SUBVL - Sub Vector Length
  87
This is a "group by quantity" that effectively asks each iteration
of the hardware loop to load SUBVL elements of width elwidth at a
time. In effect, SUBVL is like a SIMD multiplier: instead of just one
operation being issued, SUBVL operations are issued.
  92
  93 The main effect of SUBVL is that predication bits are applied per
  94 **group**, rather than by individual element.
  95
  96 ## STATE
  97
  98 This is a standard CSR that contains sufficient information for a
  99 full context save/restore.  It contains (and permits setting of):
 100
 101 * MVL
 102 * VL
 103 * destoffs - the destination element offset of the current parallel
 104   instruction being executed
 105 * srcoffs - for twin-predication, the source element offset as well.
 106 * SUBVL
 107 * svdestoffs - the subvector destination element offset of the current
 108   parallel instruction being executed
 109 * svsrcoffs - for twin-predication, the subvector source element offset
 110   as well.
 111
 112 The format of the STATE CSR is as follows:
 113
| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
| -------- | -------- | -------- | -------- | -------- | ------- | ------ |
| dsvoffs  | ssvoffs  | subvl    | destoffs | srcoffs  | vl      | maxvl  |
 117
 118 Notes:
 119
 120 * The entries are truncated to be within range.  Attempts to set VL to
 121   greater than MAXVL will truncate VL.
 122 * Both VL and MAXVL are stored offset by one.  0b000000 represents VL=1,
 123   0b000001 represents VL=2.  This allows the full range 1 to XLEN instead
 124   of 0 to only 63.
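
As an illustration of the layout and the offset-by-one storage, here is a small Python sketch that packs and unpacks a STATE word. The helper names `pack_state` and `unpack_state` are hypothetical. The subvl and offset fields are treated as raw values, since only VL and MAXVL are stated above to be stored offset by one.

```python
FIELDS = (  # (name, lowest bit, width in bits), taken from the table above
    ("maxvl", 0, 6), ("vl", 6, 6), ("srcoffs", 12, 6), ("destoffs", 18, 6),
    ("subvl", 24, 2), ("ssvoffs", 26, 2), ("dsvoffs", 28, 2),
)
OFFSET_BY_ONE = {"maxvl", "vl"}  # stored as (value - 1)


def pack_state(**values: int) -> int:
    """Build a STATE word; fields not given default to 1 (VL, MAXVL) or 0."""
    vals = {n: values.get(n, 1 if n in OFFSET_BY_ONE else 0) for n, _, _ in FIELDS}
    vals["vl"] = min(vals["vl"], vals["maxvl"])  # VL above MAXVL is truncated
    word = 0
    for name, low, width in FIELDS:
        raw = vals[name] - 1 if name in OFFSET_BY_ONE else vals[name]
        if not 0 <= raw < (1 << width):
            raise ValueError(f"{name} out of range")
        word |= raw << low
    return word


def unpack_state(word: int) -> dict:
    """Split a STATE word back into its named fields."""
    out = {}
    for name, low, width in FIELDS:
        raw = (word >> low) & ((1 << width) - 1)
        out[name] = raw + 1 if name in OFFSET_BY_ONE else raw
    return out
```

With this encoding `pack_state(maxvl=1, vl=1)` is the all-zeros word, and a 6-bit field can express VL=64 on RV64.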
 125
 126 ## VL, MVL and SUBVL instruction aliases
 127
 128 This table contains pseudo-assembly instruction aliases. Note the
 129 subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
 130 reduced range of the 5 bit immediate.
 131
| alias           | CSR                  |
| -               | -                    |
| SETVL rd, rs    | CSRRW  VL, rd, rs    |
| SETVLi rd, #n   | CSRRWI VL, rd, #n-1  |
| GETVL rd        | CSRRW  VL, rd, x0    |
| SETMVL rd, rs   | CSRRW  MVL, rd, rs   |
| SETMVLi rd, #n  | CSRRWI MVL, rd, #n-1 |
| GETMVL rd       | CSRRW  MVL, rd, x0   |
 140
Note: CSRRC and the other bit-setting CSR operations may still be used; they are, however, not particularly useful (very obscure).
 142
 143 ## Register key-value (CAM) table <a name="regcsrtable" />
 144
 145 The purpose of the Register table is to mark which registers change behaviour
 146 if used in a "Standard" (normally scalar) opcode.
 147
 148 16 bit format:
 149
 150 | RegCAM | 15       | (14..8)  | 7   | (6..5) | (4..0)  |
 151 | ------ | -        | -        | -   | ------ | ------- |
 152 | 0      | isvec0   | regidx0  | i/f | vew0   | regkey  |
 153 | 1      | isvec1   | regidx1  | i/f | vew1   | regkey  |
 154 | 2      | isvec2   | regidx2  | i/f | vew2   | regkey  |
 155 | 3      | isvec3   | regidx3  | i/f | vew3   | regkey  |
 156
 157 8 bit format:
 158
 159 | RegCAM | | 7   | (6..5) | (4..0)  |
 160 | ------ | | -   | ------ | ------- |
 161 | 0      | | i/f | vew0   | regnum  |
 162
 163 Mapping the 8-bit to 16-bit format:
 164
| RegCAM | 15      | (14..8)    | 7   | (6..5) | (4..0)  |
| ------ | -       | -          | -   | ------ | ------- |
| 0      | isvec=1 | regnum0<<2 | i/f | vew0   | regnum0 |
| 1      | isvec=1 | regnum1<<2 | i/f | vew1   | regnum1 |
| 2      | isvec=1 | regnum2<<2 | i/f | vew2   | regnum2 |
| 3      | isvec=1 | regnum3<<2 | i/f | vew3   | regnum3 |
 171
 172 Fields:
 173
 174 * i/f is set to "1" to indicate that the redirection/tag entry is to
 175   be applied to integer registers; 0 indicates that it is relevant to
 176   floating-point registers.
* isvec indicates that the register (whether a src or dest) is to progress
  incrementally forward on each loop iteration.  This gives the "effect"
  of vectorisation.  isvec set to zero indicates "do not progress", giving
  the "effect" of that register being scalar.
* vew overrides the operation's default width.  See table below.
 182 * regkey is the register which, if encountered in an op (as src or dest)
 183   is to be "redirected"
 184 * in the 16-bit format, regidx is the *actual* register to be used
 185   for the operation (note that it is 7 bits wide)
 186
 187 | vew | bitwidth            |
 188 | --- | ------------------- |
 189 | 00  | default (XLEN/FLEN) |
 190 | 01  | 8 bit               |
 191 | 10  | 16 bit              |
 192 | 11  | 32 bit              |
 193
As the above table is a CAM (key-value store) it may be appropriate
(faster, fewer gates, implementation-wise) to expand it as follows:

    struct vectorised {
        bool isvector:1;
        int  vew:2;
        bool enabled:1;
        int  regidx:7;
    }

    struct vectorised fp_vec[32], int_vec[32];

    for (i = 0; i < len; i++) // from VBLOCK Format
       tb = int_vec if CSRvec[i].type == 0 else fp_vec
       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
       tb[idx].vew      = CSRvec[i].vew
       tb[idx].regidx   = CSRvec[i].regidx  // indirection
       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
       tb[idx].enabled  = true;
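
To make the bit layouts concrete, the following Python sketch decodes a 16-bit entry and expands an 8-bit entry according to the mapping table above. The names `RegEntry`, `decode16` and `expand8` are hypothetical and used only for illustration.

```python
from typing import NamedTuple


class RegEntry(NamedTuple):
    isvec: bool   # bit 15
    regidx: int   # bits 14..8, the actual (7-bit) register
    i_f: int      # bit 7
    vew: int      # bits 6..5
    regkey: int   # bits 4..0, the register named in the opcode


def decode16(entry: int) -> RegEntry:
    """Split a 16-bit Register table entry into its fields."""
    return RegEntry(
        isvec=bool((entry >> 15) & 1),
        regidx=(entry >> 8) & 0x7F,
        i_f=(entry >> 7) & 1,
        vew=(entry >> 5) & 0b11,
        regkey=entry & 0x1F,
    )


def expand8(entry: int) -> int:
    """Map an 8-bit entry to 16 bits: isvec=1 and regidx = regnum << 2."""
    regnum = entry & 0x1F
    regidx = (regnum << 2) & 0x7F
    # Bits 7..0 (i/f, vew, regnum) carry over unchanged as i/f, vew, regkey.
    return (1 << 15) | (regidx << 8) | (entry & 0xFF)
```

For instance, an 8-bit entry with i/f=1, vew=01 and regnum=3 expands to a vector entry that redirects register 3 to register 12.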
 213
 214 ## Predication Table <a name="predication_csr_table"></a>
 215
 216 The Predication Table is a key-value store indicating whether, if a
 217 given destination register (integer or floating-point) is referred to
 218 in an instruction, it is to be predicated. Like the Register table, it
 219 is an indirect lookup that allows the RV opcodes to not need modification.
 220
* regidx is the register key: if that integer or floating-point register
  (selected in combination with the i/f flag) is referred to in a
  (standard RV) instruction, the lookup table is referenced
  to find the predication mask to use for this operation.
 225 * predidx is the *actual* (full, 7 bit) register to be used for the
 226   predication mask.
 227 * inv indicates that the predication mask bits are to be inverted
 228   prior to use *without* actually modifying the contents of the
 229   register from which those bits originated.
 230 * zeroing is either 1 or 0, and if set to 1, the operation must
 231   place zeros in any element position where the predication mask is
 232   set to zero.  If zeroing is set to 0, unpredicated elements *must*
 233   be left alone (unaltered), even when elwidth != default.
* ffirst is a special mode that stops sequential element processing when
  a data-dependent condition occurs, whether a trap or a conditional test.
  The handling of each (trap or conditional test) is slightly different:
  see the Instruction sections for further details.
 238
 239 16 bit format:
 240
 241 | PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
 242 | ----- | -        | -      | -     | -   | ------- | ------- |
 243 | 0     | predidx  | zero0  | inv0  | i/f | regidx  | ffirst0 |
 244 | 1     | predidx  | zero1  | inv1  | i/f | regidx  | ffirst1 |
 245 | 2     | predidx  | zero2  | inv2  | i/f | regidx  | ffirst2 |
 246 | 3     | predidx  | zero3  | inv3  | i/f | regidx  | ffirst3 |
 247
 248 Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding.  Its use must
 249 generate an illegal instruction trap.
 250
 251 8 bit format:
 252
 253 | PrCSR | 7     | 6     | 5   | (4..0)  |
 254 | ----- | -     | -     | -   | ------- |
 255 | 0     | zero0 | inv0  | i/f | regnum  |
 256
 257 Mapping from 8 to 16 bit format, the table becomes:
 258
 259 | PrCSR | (15..11) | 10     | 9     | 8   | (7..1)  | 0       |
 260 | ----- | -        | -      | -     | -   | ------- | ------- |
 261 | 0     | x9       | zero0  | inv0  | i/f | regnum  | ff=0    |
 262 | 1     | x10      | zero1  | inv1  | i/f | regnum  | ff=0    |
 263 | 2     | x11      | zero2  | inv2  | i/f | regnum  | ff=0    |
 264 | 3     | x12      | zero3  | inv3  | i/f | regnum  | ff=0    |
 265
 266 Pseudocode for predication:
 267
 268     struct pred {
 269         bool zero;    // zeroing
 270         bool inv;     // register at predidx is inverted
 271         bool ffirst;  // fail-on-first
 272         bool enabled; // use this to tell if the table-entry is active
 273         int predidx;  // redirection: actual int register to use
 274     }
 275
 276     struct pred fp_pred_reg[32];
 277     struct pred int_pred_reg[32];
 278
    for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
      idx = CSRpred[i].regidx
      tb[idx].zero     = CSRpred[i].zero
      tb[idx].inv      = CSRpred[i].inv
      tb[idx].ffirst   = CSRpred[i].ffirst
      tb[idx].predidx  = CSRpred[i].predidx
      tb[idx].enabled  = true
 287
    def get_pred_val(bool is_fp_op, int reg):
       tb = fp_vec if is_fp_op else int_vec
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       tb = fp_pred_reg if is_fp_op else int_pred_reg
       if (!tb[reg].enabled):
          return ~0x0, False       // all enabled; no zeroing
       predidx = tb[reg].predidx   // redirection occurs HERE
       predicate = intreg[predidx] // actual predicate HERE
       if (tb[reg].inv):
          predicate = ~predicate   // invert ALL bits
       return predicate, tb[reg].zero
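
A runnable Python rendering of the lookup may help. It is a sketch for the integer case only, under the assumptions that XLEN is 64 and that the Register table's "enabled" flags are supplied as a plain list (`reg_enabled`); the class name `Pred` and the argument order are hypothetical.

```python
from dataclasses import dataclass

XLEN = 64                      # assumption: RV64
ALL_ONES = (1 << XLEN) - 1


@dataclass
class Pred:
    enabled: bool = False
    zero: bool = False         # zeroing
    inv: bool = False          # invert the mask before use
    predidx: int = 0           # integer register holding the mask


def get_pred_val(reg, reg_enabled, pred_table, intreg):
    """Return (mask, zeroing) for the register `reg` named in an opcode."""
    entry = pred_table[reg]
    if not reg_enabled[reg] or not entry.enabled:
        return ALL_ONES, False          # all elements enabled; no zeroing
    mask = intreg[entry.predidx]        # redirection happens here
    if entry.inv:
        mask ^= ALL_ONES                # the register itself is not modified
    return mask, entry.zero


# Example setup: x5 is predicated by the inverse of the mask held in x9.
intreg = [0] * 128
intreg[9] = 0b0101
reg_enabled = [False] * 32
pred_table = [Pred() for _ in range(32)]
reg_enabled[5] = True
pred_table[5] = Pred(enabled=True, inv=True, predidx=9)
```

Note that an enabled entry with `inv` set and `predidx` pointing at x0 yields a mask of all ones, because x0 always reads as zero.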
 300
 301 ## Fail-on-First Mode <a name="ffirst-mode"></a>
 302
 303 ffirst is a special data-dependent predicate mode.  There are two
 304 variants: one is for faults: typically for LOAD/STORE operations,
 305 which may encounter end of page faults during a series of operations.
 306 The other variant is comparisons such as FEQ (or the augmented behaviour
 307 of Branch), and any operation that returns a result of zero (whether
 308 integer or floating-point).  In the FP case, this includes negative-zero.
 309
 310 Note that the execution order must "appear" to be sequential for ffirst
 311 mode to work correctly.  An in-order architecture must execute the element
 312 operations in sequence, whilst an out-of-order architecture must *commit*
 313 the element operations in sequence (giving the appearance of in-order
 314 execution).
 315
 316 Note also, that if ffirst mode is needed without predication, a special
 317 "always-on" Predicate Table Entry may be constructed by setting
 318 inverse-on and using x0 as the predicate register.  This
 319 will have the effect of creating a mask of all ones, allowing ffirst
 320 to be set.
 321
 322 ### Fail-on-first traps
 323
Except for the first element, ffirst stops sequential element processing
when a trap occurs.  The first element is treated normally (as if ffirst
is clear).  Should any subsequent element instruction require a trap,
instead it and subsequent indexed elements are ignored (or cancelled in
out-of-order designs), and VL is set to the *last* element that did
not take the trap.

Note that predicated-out elements (where the predicate mask bit is zero)
are clearly excluded (i.e. the trap will not occur).  However, note that
the loop still had to test the predicate bit: thus on return,
VL is set to include elements that did not take the trap *and* the
elements that were predicated (masked) out (and so not tested), up to the
point where the trap occurred.
 337
 338 If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements
 339 will cause a trap as normal (as if ffirst is not set); subsequently,
 340 the trap must not occur in the *sub-group* of elements.  SUBVL will **NOT**
 341 be modified.
 342
 343 Given that predication bits apply to SUBVL groups, the same rules apply
 344 to predicated-out (masked-out) sub-groups in calculating the value that VL
 345 is set to.
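
The trap rules above can be sketched as a small Python simulation (SUBVL=1 only). The names `ElementTrap` and `ffirst_load` are hypothetical, `faults[i]` stands in for "element i would trap", and taking "the first element" to mean element index 0 is an assumption of this sketch.

```python
class ElementTrap(Exception):
    """The first element faulted: handled as an ordinary trap."""


def ffirst_load(faults, pred_mask: int, vl: int) -> int:
    """Return the VL left behind by a fail-on-first operation."""
    for i in range(vl):
        if not (pred_mask >> i) & 1:
            continue                # masked out: cannot trap, still counted
        if faults[i]:
            if i == 0:
                raise ElementTrap(i)    # first element traps as normal
            return i                # elements 0..i-1 stand; VL is truncated
    return vl                       # nothing faulted: VL is unchanged
```

With `faults = [False, False, True, False]` and every element enabled, the call returns 2. If element 2 is masked out instead, no trap condition is seen and VL stays at 4.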
 346
 347 ### Fail-on-first conditional tests
 348
ffirst stops sequential element conditional testing on the first element result
being zero.  VL is set to the number of elements that were processed before
the fail-condition was encountered.

Note that just as with traps, if SUBVL!=1, a fail-condition on any element
of a *sub-group* will cause the processing to end, and, even if there were
elements within the *sub-group* that passed the test, that sub-group is still
(entirely) excluded from the count (from setting VL).  i.e. VL is set to the
total number of *sub-groups* that had no fail-condition up until execution was
stopped.
 359
 360 Note again that, just as with traps, predicated-out (masked-out) elements
 361 are included in the count leading up to the fail-condition, even though they
 362 were not tested.
 363
 364 The pseudo-code for Predication makes this clearer and simpler than it is
 365 in words (the loop ends, VL is set to the current element index, "i").
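
As an illustration, the counting rule can be written as a short Python function. The name `ffirst_test` is hypothetical and `results` is assumed to hold the per-element results, laid out group by group when SUBVL is greater than 1. How a count of zero (failure on the very first element) is represented in VL is not covered by this sketch.

```python
def ffirst_test(results, pred_mask: int, vl: int, subvl: int = 1) -> int:
    """Return the new VL after a fail-on-first conditional test."""
    for i in range(vl):
        if not (pred_mask >> i) & 1:
            continue                    # masked out: not tested, but counted
        group = results[i * subvl:(i + 1) * subvl]
        if any(r == 0 for r in group):  # -0.0 also compares equal to zero
            return i                    # the whole sub-group is excluded
    return vl
```

For example, `ffirst_test([1, 0, 0, 1], 0b1101, 4)` returns 2: element 1 is masked out and therefore counted without being tested, and element 2 is the first tested element to produce zero.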
 366
 367 # Instructions <a name="instructions" />
 368
 369 To illustrate how Scalar operations are turned "vector" and "predicated",
 370 simplified example pseudo-code for an integer ADD operation is shown below.
 371 Floating-point would use the FP Register Table.
 372
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd ].isvector) break;
        if (int_vec[rd ].isvector)  { id += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
 387
 388 Note that for simplicity there is quite a lot missing from the above
 389 pseudo-code.
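
For experimentation, here is a runnable Python sketch of the same loop. It makes several simplifying assumptions: the predicate mask is passed in directly rather than looked up, the register file is a flat list of 128 integers, results are not wrapped to XLEN bits, and the class name `Vectorised` and the argument order are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vectorised:
    enabled: bool = False
    isvector: bool = False
    regidx: int = 0


def op_add(ireg, int_vec, predval, vl, rd, rs1, rs2):
    """Scalar ADD opcode executed under SV context (illustrative only)."""
    d, s1, s2 = int_vec[rd], int_vec[rs1], int_vec[rs2]
    rd = d.regidx if d.isvector else rd        # redirection
    rs1 = s1.regidx if s1.isvector else rs1
    rs2 = s2.regidx if s2.isvector else rs2
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predval & (1 << i):
            ireg[rd + i_d] = ireg[rs1 + i_s1] + ireg[rs2 + i_s2]
            if not d.isvector:
                break                          # scalar dest: one result only
        i_d += d.isvector                      # only vectors progress
        i_s1 += s1.isvector
        i_s2 += s2.isvector


# x8 and x16 are tagged as vectors (redirected to 32 and 40); x5 is scalar.
int_vec = [Vectorised() for _ in range(32)]
int_vec[8] = Vectorised(True, True, 32)
int_vec[16] = Vectorised(True, True, 40)
ireg = [0] * 128
ireg[40:44] = [1, 2, 3, 4]
ireg[5] = 10
op_add(ireg, int_vec, 0b1011, 4, rd=8, rs1=16, rs2=5)
```

After the call, registers 32, 33 and 35 hold 11, 12 and 14, while register 34 is untouched because predicate bit 2 is clear.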
 390
 391 ## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
 392
Adding in support for SUBVL is a matter of adding in an extra inner
for-loop, where register src and dest are still incremented inside the
inner part. Note that the predication is still taken from the VL index.

So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
indexed by "(i)".
 399
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
      for (i = 0; i < VL; i++)
        xSTATE.srcoffs = i # save context
        for (s = 0; s < SUBVL; s++)
          xSTATE.ssvoffs = s # save context
          if (predval & 1<<i) # predication uses intregs
             # actual add is here (at last)
             ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
             if (!int_vec[rd ].isvector) break;
          if (int_vec[rd ].isvector)  { id += 1; }
          if (int_vec[rs1].isvector)  { irs1 += 1; }
          if (int_vec[rs2].isvector)  { irs2 += 1; }
          if (id == VL or irs1 == VL or irs2 == VL) {
            # end VL hardware loop
            xSTATE.srcoffs = 0; # reset
            xSTATE.ssvoffs = 0; # reset
            return;
          }
 423
 424 NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
 425 elwidth handling etc. all left out.
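
The indexing rule (elements by `i * SUBVL + s`, predicate bits by `i`) can be seen in isolation in the following Python sketch. It covers only the vector-vector case on plain lists, and the function name `subvl_add` is hypothetical.

```python
def subvl_add(dest, src1, src2, predval: int, vl: int, subvl: int):
    """Element-wise add where one predicate bit covers a group of SUBVL."""
    for i in range(vl):
        if not predval & (1 << i):
            continue                  # the whole sub-group is masked out
        for s in range(subvl):
            e = i * subvl + s         # element index
            dest[e] = src1[e] + src2[e]
    return dest
```

With VL=2, SUBVL=3 and a predicate of `0b10`, only elements 3, 4 and 5 are written: the single set bit enables the entire second group.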
 426
 427 ## Instruction Format
 428
 429 It is critical to appreciate that there are
 430 **no operations added to SV, at all**.
 431
 432 Examples are given below where "standard" RV scalar behaviour is augmented.
 433
 434 ## Branch Instructions
 435
Branch operations are augmented slightly to be a little more like FP
Compares (FEQ, FLT etc.), by permitting the accumulation (and storage)
of multiple comparisons into a register (taken indirectly from the predicate
table).  As such, "ffirst" (fail-on-first) condition mode can be enabled.
See ffirst mode in the Predication Table section.
 441
 442 ### Standard Branch <a name="standard_branch"></a>
 443
Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers is marked as a vector (active=1, vector=1).
 447
 448 Note that the predication register to use (if one is enabled) is taken from
 449 the *first* src register, and that this is used, just as with predicated
 450 arithmetic operations, to mask whether the comparison operations take
 451 place or not.  If the second register is also marked as predicated,
 452 that (scalar) predicate register is used as a **destination** to store
 453 the results of all the comparisons.
 454
In instances where no vectorisation is detected on either src register,
the operation is treated as an absolutely standard scalar branch operation.
Where vectorisation is present on either or both src registers, the
branch may still go ahead if and only if *all* tests succeed (i.e. excluding
those tests that are predicated out).
 460
 461 Pseudo-code for branch:
 462
    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==FP, rs1);
    rd = get_pred_val(I/F==FP, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
      if (zeroing)
        if not (ps & (1<<i))
           result &= ~(1<<i);
      else if (ps & (1<<i))
          if (cmp(s1 ? reg[src1+i]:reg[src1],
                  s2 ? reg[src2+i]:reg[src2]))
              result |= 1<<i;
          else
              result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch
 500
 501 Notes:
 502
 503 * Predicated SIMD comparisons would break src1 and src2 further down
 504   into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
 505   Reordering") setting Vector-Length times (number of SIMD elements) bits
 506   in Predicate Register rd, as opposed to just Vector-Length bits.
 507 * The execution of "parallelised" instructions **must** be implemented
 508   as "re-entrant" (to use a term from software).  If an exception (trap)
 509   occurs during the middle of a vectorised
 510   Branch (now a SV predicated compare) operation, the partial results
 511   of any comparisons must be written out to the destination
 512   register before the trap is permitted to begin.  If however there
 513   is no predicate, the **entire** set of comparisons must be **restarted**,
 514   with the offset loop indices set back to zero.  This is because
 515   there is no place to store the temporary result during the handling
 516   of traps.
 517
 518 Note also that where normally, predication requires that there must
 519 also be a CSR register entry for the register being used in order
 520 for the **predication** CSR register entry to also be active,
 521 for branches this is **not** the case.  src2 does **not** have
 522 to have its CSR register entry marked as active in order for
 523 predication on src2 to be active.
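
The core of the vectorised branch (accumulate one comparison bit per element, then branch only if every enabled test passed) can be sketched in Python as below. This is a simplified illustration: it omits zeroing and the optional destination predicate, the name `sv_branch` is hypothetical, and masking the predicate down to VL bits before the final comparison is an assumption of the sketch.

```python
import operator


def sv_branch(cmp, reg, src1, src2, s1_vec, s2_vec, ps: int, vl: int):
    """Return (result_mask, taken) for a vectorised compare-and-branch."""
    result = 0
    for i in range(vl):
        if not ps & (1 << i):
            continue                              # predicated out: no test
        a = reg[src1 + i] if s1_vec else reg[src1]
        b = reg[src2 + i] if s2_vec else reg[src2]
        if cmp(a, b):
            result |= 1 << i
    active = ps & ((1 << vl) - 1)                 # assumption: only VL bits
    return result, result == active


reg = [0] * 16
reg[4:8] = [1, 2, 3, 4]
reg[8:12] = [1, 2, 9, 4]
```

Here `sv_branch(operator.eq, reg, 4, 8, True, True, 0b1111, 4)` gives a mask of `0b1011` and the branch is not taken, because element 2 differs. Masking that element out with `ps = 0b1011` lets the branch go ahead.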
 524
 525 ### Floating-point Comparisons
 526
There are no floating-point branch operations in RV, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
 531
 532 As RV Scalar does not have "FNE", predication inversion must be used.
 533 Also: note that FP Compare may be predicated, using the destination
 534 integer register (rd) to determine the predicate.  FP Compare is **not**
 535 a twin-predication operation, as, again, just as with SV Branches,
 536 there are three registers involved: FP src1, FP src2 and INT rd.
 537
 538 Also: note that ffirst (fail first mode) applies directly to this operation.
 539
 540 ### Compressed Branch Instruction
 541
Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries.  As however there is only the one source register,
given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
to store the results of the comparisons is taken from CSR predication
table entries for **x0**.
 548
 549 The specific required use of x0 is, with a little thought, quite obvious,
 550 but is counterintuitive.  Clearly it is **not** recommended to redirect
 551 x0 with a CSR register entry, however as a means to opaquely obtain
 552 a predication target it is the only sensible option that does not involve
 553 additional special CSRs (or, worse, additional special opcodes).
 554
 555 Note also that, just as with standard branches, the 2nd source
 556 (in this case x0 rather than src2) does **not** have to have its CSR
 557 register table marked as "active" in order for predication to work.
 558
 559 ## Vectorised Dual-operand instructions
 560
 561 There is a series of 2-operand instructions involving copying (and
 562 sometimes alteration):
 563
 564 * C.MV
 565 * FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
 566 * C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
 567 * LOAD(-FP) and STORE(-FP)
 568
 569 All of these operations follow the same two-operand pattern, so it is
 570 *both* the source *and* destination predication masks that are taken into
 571 account.  This is different from
 572 the three-operand arithmetic instructions, where the predication mask
 573 is taken from the *destination* register, and applied uniformly to the
 574 elements of the source register(s), element-for-element.
 575
 576 The pseudo-code pattern for twin-predicated operations is as
 577 follows:
 578
    function op(rd, rs):
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        xSTATE.srcoffs = i # save context
        xSTATE.destoffs = j # save context
        reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break
 592
 593 This pattern covers scalar-scalar, scalar-vector, vector-scalar
 594 and vector-vector, and predicated variants of all of those.
 595 Zeroing is not presently included (TODO).  As such, when compared
 596 to RVV, the twin-predicated variants of C.MV and FMV cover
 597 **all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
 598 VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
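
To see how one pattern yields splat, gather and insert, here is a runnable Python sketch of the twin-predicated loop. The name `twin_pred_op` is hypothetical, the masks are passed in directly, and the explicit bounds checks while skipping masked-out elements are an addition of this sketch so that it terminates cleanly on sparse masks.

```python
def twin_pred_op(reg, rd, rs, rd_vec, rs_vec, ps, pd, vl, op=lambda x: x):
    """Twin-predicated 2-operand operation (illustrative only)."""
    i = j = 0
    while i < vl and j < vl:
        if rs_vec:
            while i < vl and not ps & (1 << i):
                i += 1                  # skip masked-out source elements
        if rd_vec:
            while j < vl and not pd & (1 << j):
                j += 1                  # skip masked-out dest elements
        if i >= vl or j >= vl:
            break
        reg[rd + j] = op(reg[rs + i])
        if rs_vec:
            i += 1
        if rd_vec:
            j += 1
        else:
            break                       # scalar destination: single result


ALL = ~0  # every predicate bit set
```

A scalar source with a vector destination and no predication behaves as VSPLAT. A vector source with `ps = 0b1010` packs source elements 1 and 3 into destination elements 0 and 1 (gather). A scalar source with a one-bit destination mask behaves as VINSERT.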
 599
 600 ### C.MV Instruction <a name="c_mv"></a>
 601
There is no MV instruction in RV; however, there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation.  The actual instruction's format does not change.
 609
 610 There are several different instructions from RVV that are covered by
 611 this one opcode:
 612
 613 [[!table  data="""
 614 src    | dest    | predication   | op             |
 615 scalar | vector  | none          | VSPLAT         |
 616 scalar | vector  | destination   | sparse VSPLAT  |
 617 scalar | vector  | 1-bit dest    | VINSERT        |
 618 vector | scalar  | 1-bit? src    | VEXTRACT       |
 619 vector | vector  | none          | VCOPY          |
 620 vector | vector  | src           | Vector Gather  |
 621 vector | vector  | dest          | Vector Scatter |
 622 vector | vector  | src & dest    | Gather/Scatter |
 623 vector | vector  | src == dest   | sparse VCOPY   |
 624 """]]
 625
 626 Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
 627 operations with inversion on the src and dest predication for one of the
 628 two C.MV operations.
 629
 630 ### FMV, FNEG and FABS Instructions
 631
These are identical in form to C.MV, except covering floating-point
register copying.  The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.
 638
 639 (Note that FMV, FNEG and FABS are all actually pseudo-instructions)
 640
### FCVT Instructions
 642
 643 These are again identical in form to C.MV, except that they cover
 644 floating-point to integer and integer to floating-point.  When element
 645 width in each vector is set to default, the instructions behave exactly
 646 as they are defined for standard RV (scalar) operations, except vectorised
 647 in exactly the same fashion as outlined in C.MV.
 648
 649 However when the source or destination element width is not set to default,
 650 the opcode's explicit element widths are *over-ridden* to new definitions,
 651 and the opcode's element width is taken as indicative of the SIMD width
 652 (if applicable i.e. if packed SIMD is requested) instead.
 653
 654 ## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
 655
 656 In vectorised architectures there are usually at least two different modes
 657 for LOAD/STORE:
 658
 659 * Read (or write for STORE) from sequential locations, where one
 660   register specifies the address, and the one address is incremented
 661   by a fixed amount.  This is usually known as "Unit Stride" mode.
 662 * Read (or write) from multiple indirected addresses, where the
 663   vector elements each specify separate and distinct addresses.
 664
 665 To support these different addressing modes, the CSR Register "isvector"
 666 bit is used.  So, for a LOAD, when the src register is set to
 667 scalar, the LOADs are sequentially incremented by the src register
 668 element width, and when the src register is set to "vector", the
 669 elements are treated as indirection addresses.  Simplified
 670 pseudo-code would look like this:
 671
    function op_ld(rd, rs) # LD not VLD!
      rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        if (int_csr[rs].isvec)
          # indirect mode (multi mode): each src element is an address
          srcbase = ireg[rsv+i];
        else
          # unit stride mode
          srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
        ireg[rdv+j] <= mem[srcbase + imm_offs];
        if (!int_csr[rs].isvec &&
            !int_csr[rd].isvec) break # scalar-scalar LD
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
 691
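As an illustration only, the two addressing modes can be sketched in Python. This is a simplified model rather than part of the specification: it assumes all predicate bits are set, a vector destination and XLEN=64. `ld_addresses` is a hypothetical helper, which steps the unit-stride offset by the element index.

```python
XLEN = 64  # assumed register width in bits (RV64)

def ld_addresses(rs_is_vec, rs_values, vl, imm=0):
    """Return the list of memory addresses a vectorised LD would read.

    rs_values holds the contents of the source register(s), starting at rs.
    """
    addrs = []
    for i in range(vl):
        if rs_is_vec:
            # indirect (multi) mode: each source element is its own address
            base = rs_values[i]
        else:
            # unit stride mode: one base address, stepped by XLEN/8 bytes
            base = rs_values[0] + i * XLEN // 8
        addrs.append(base + imm)
    return addrs
```

With a scalar source holding 0x1000 and VL=4, the sketch reads from 0x1000, 0x1008, 0x1010 and 0x1018. With a vector source, each address is taken from its own element.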
 692 ## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
 693
C.LWSP / C.SWSP and their floating-point equivalents (C.FLWSP etc.) are
also source-dest twin-predicated, where it is implicit in C.LWSP/FLWSP etc.
that x2 is the source register.  It is therefore possible to use predicated
C.LWSP to efficiently pop registers off the stack (by predicating x2 as the
source), cherry-picking which registers to load into (by predicating the
destination).  Likewise for C.SWSP.  In this way, LOAD/STORE-Multiple is
efficiently achieved.
 700
 701 **Note**: it is still possible to redirect x2 to an alternative target
 702 register.  With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
 703 general-purpose LOAD/STORE operations.
 704
 705 ## Compressed LOAD / STORE Instructions
 706
Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
the same rules and the same pseudo-code apply as for non-compressed
LOAD/STORE.  Again: setting scalar or vector mode on the src for LOAD and
dest for STORE selects "Unit Stride" or "Multi-indirection" mode,
respectively.
 712
 713 # Element bitwidth polymorphism <a name="elwidth"></a>
 714
 715 Element bitwidth is best covered as its own special section, as it
 716 is quite involved and applies uniformly across-the-board.  SV restricts
 717 bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
 718
 719 The effect of setting an element bitwidth is to re-cast each entry
 720 in the register table, and for all memory operations involving
 721 load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, effectively each register
now looks like this:
 724
 725     typedef union {
 726         uint8_t  actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
 727         uint8_t  b[0]; // array of type uint8_t
 728         uint16_t s[0];
 729         uint32_t i[0];
 730         uint64_t l[0];
 731         uint128_t d[0];
 732     } reg_t;
 733
 734     reg_t int_regfile[128];
 735
 736 Implementors must ensure that over-runs of the register file throw
 737 an exception.
 738
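The re-casting described above can be sketched in Python as a flat byte array that is reinterpreted at different widths. This is only an illustrative model: RV64, little-endian layout and 128 registers are assumed, and `get_el` / `set_el` are hypothetical helper names, not part of the specification.

```python
import struct

XLEN_BYTES = 8    # RV64 assumed: 8 bytes per register
NUM_REGS = 128
regfile = bytearray(XLEN_BYTES * NUM_REGS)  # whole register file as raw bytes

# struct format codes for little-endian unsigned elements
FMT = {8: "<B", 16: "<H", 32: "<I", 64: "<Q"}

def _locate(reg, bitwidth, offset):
    """Byte position of element `offset`, counted from the start of `reg`."""
    nbytes = bitwidth // 8
    start = reg * XLEN_BYTES + offset * nbytes
    if start + nbytes > len(regfile):
        raise IndexError("register file over-run")
    return start

def get_el(reg, bitwidth, offset):
    return struct.unpack_from(FMT[bitwidth], regfile, _locate(reg, bitwidth, offset))[0]

def set_el(reg, bitwidth, offset, val):
    mask = (1 << bitwidth) - 1
    struct.pack_into(FMT[bitwidth], regfile, _locate(reg, bitwidth, offset), val & mask)
```

In this model, `set_el(5, 16, 4, 0xBEEF)` lands in the lowest 16 bits of register 6, because four 16-bit elements fill register 5. An access that runs past the last register raises an exception.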
The pseudo-code is as follows, to demonstrate how the sign-extending
and width-extending work:
 741
 742     typedef union {
 743         uint8_t  b;
 744         uint16_t s;
 745         uint32_t i;
 746         uint64_t l;
 747     } el_reg_t;
 748
 749     bw(elwidth):
 750         if elwidth == 0:
 751             return xlen
 752         if elwidth == 1:
 753             return xlen / 2
 754         if elwidth == 2:
 755             return xlen * 2
 756         // elwidth == 3:
 757         return 8
 758
 759     get_max_elwidth(rs1, rs2):
 760         return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
 761                    bw(int_csr[rs2].elwidth)) # again XLEN if no entry
 762
    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res
 775
 776     set_polymorphed_reg(reg, bitwidth, offset, val):
 777         if (!int_csr[reg].isvec):
 778             # sign/zero-extend depending on opcode requirements, from
 779             # the reg's bitwidth out to the full bitwidth of the regfile
 780             val = sign_or_zero_extend(val, bitwidth, xlen)
 781             int_regfile[reg].l[0] = val
 782         elif bitwidth == 8:
 783             int_regfile[reg].b[offset] = val
 784         elif bitwidth == 16:
 785             int_regfile[reg].s[offset] = val
 786         elif bitwidth == 32:
 787             int_regfile[reg].i[offset] = val
 788         elif bitwidth == 64:
 789             int_regfile[reg].l[offset] = val
 790
      maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
      destwid = bw(int_csr[rd].elwidth)      # destination element width
      for (i = 0; i < VL; i++)
        if (predval & 1<<i) # predication uses intregs
           // TODO, calculate if over-run occurs, for each elwidth
           src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
           // TODO, sign/zero-extend src1 and src2 as operation requires
           if (op_requires_sign_extend_src1)
              src1 = sign_extend(src1, maxsrcwid)
           src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
           result = src1 + src2 # actual add here
           // TODO, sign/zero-extend result, as operation requires
           if (op_requires_sign_extend_dest)
              result = sign_extend(result, maxsrcwid)
           set_polymorphed_reg(rd, destwid, ird, result)
           if (!int_vec[rd].isvector) break
        if (int_vec[rd ].isvector)  { ird += 1; }
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
 810
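As a worked illustration of the width rules in the pseudo-code above, here is a minimal Python sketch of a single polymorphic element ADD. It assumes XLEN=64 and the `bw` mapping from the pseudo-code. `poly_add` and its parameters are hypothetical names chosen for this example.

```python
def bw(elwidth, xlen=64):
    """Element width in bits for a 2-bit elwidth field (as in the pseudo-code)."""
    return {0: xlen, 1: xlen // 2, 2: xlen * 2, 3: 8}[elwidth]

def sign_extend(val, width):
    """Interpret the low `width` bits of val as a two's-complement number."""
    val &= (1 << width) - 1
    return val - (1 << width) if val & (1 << (width - 1)) else val

def poly_add(src1, src2, elw1, elw2, elwd, signed=True, xlen=64):
    opwid = max(bw(elw1, xlen), bw(elw2, xlen))  # operate at max source width
    destwid = bw(elwd, xlen)
    result = (src1 + src2) & ((1 << opwid) - 1)  # the add, wrapped at opwid
    if signed:
        result = sign_extend(result, opwid)      # extend from the source width
    return result & ((1 << destwid) - 1)         # fit into the destination element
```

For example, adding two 8-bit elements 0x7F and 0x01 into a default-width (64-bit) destination gives 0x80 at 8 bits, which is sign-extended to 0xFFFFFFFFFFFFFF80. Adding an 8-bit 0xFF to a 32-bit 0x01 with an 8-bit destination truncates 0x100 to zero.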
 811 ## Polymorphic floating-point operation exceptions and error-handling
 812
For floating-point operations, conversion takes place without
raising any kind of exception.  Exactly as in the standard RV
specification for scalar operations, NaN (or appropriate) is stored if
the result is beyond the range of the destination, and the corresponding
floating-point flag is raised (FCSR).  Again, just as with scalar
operations, it is software's responsibility to check this flag.
Given that the FCSR flags are "accrued", the fact that multiple element
operations could have occurred is not a problem.
 822
 823 Note that it is perfectly legitimate for floating-point bitwidths of
 824 only 8 to be specified.  However whilst it is possible to apply IEEE 754
 825 principles, no actual standard yet exists.  Implementors wishing to
 826 provide hardware-level 8-bit support rather than throw a trap to emulate
 827 in software should contact the author of this specification before
 828 proceeding.
 829
 830 ## Polymorphic shift operators
 831
 832 A special note is needed for changing the element width of left and right
 833 shift operators, particularly right-shift.
 834
 835 For SV, where each operand's element bitwidth may be over-ridden, the
 836 rule about determining the operation's bitwidth *still applies*, being
 837 defined as the maximum bitwidth of RS1 and RS2.  *However*, this rule
 838 **also applies to the truncation of RS2**.  In other words, *after*
 839 determining the maximum bitwidth, RS2's range must **also be truncated**
 840 to ensure a correct answer.  Example:
 841
 842 * RS1 is over-ridden to a 16-bit width
 843 * RS2 is over-ridden to an 8-bit width
 844 * RD is over-ridden to a 64-bit width
 845 * the maximum bitwidth is thus determined to be 16-bit - max(8,16)
 846 * RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
 847
 848 Pseudocode (in spike) for this example would therefore be:
 849
 850     WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
 851
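The same rule can be written as a small Python sketch, purely for illustration. It assumes power-of-two element widths and XLEN=64. `poly_sll` is a hypothetical name and the result is returned as an unsigned XLEN-bit value.

```python
def poly_sll(rs1, rs2, rs1_wid, rs2_wid, xlen=64):
    opwid = max(rs1_wid, rs2_wid)              # operation width: max of the sources
    shamt = rs2 & (opwid - 1)                  # truncate the shift amount to 0..opwid-1
    val = (rs1 & ((1 << opwid) - 1)) << shamt  # zero-extend RS1 at opwid, then shift
    return val & ((1 << xlen) - 1)             # keep the low XLEN bits for RD
```

With a 16-bit RS1 and an 8-bit RS2, a shift amount of 0x13 is truncated to 3, so `poly_sll(0x0001, 0x13, 16, 8)` gives 8.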
 852 ## Polymorphic MULH/MULHU/MULHSU
 853
 854 MULH is designed to take the top half MSBs of a multiply that
 855 does not fit within the range of the source operands, such that
 856 smaller width operations may produce a full double-width multiply
 857 in two cycles.  The issue is: SV allows the source operands to
 858 have variable bitwidth.
 859
 860 Here again special attention has to be paid to the rules regarding
 861 bitwidth, which, again, are that the operation is performed at
 862 the maximum bitwidth of the **source** registers.  Therefore:
 863
 864 * An 8-bit x 8-bit multiply will create a 16-bit result that must
 865   be shifted down by 8 bits
 866 * A 16-bit x 8-bit multiply will create a 24-bit result that must
 867   be shifted down by 16 bits (top 8 bits being zero)
 868 * A 16-bit x 16-bit multiply will create a 32-bit result that must
 869   be shifted down by 16 bits
 870 * A 32-bit x 16-bit multiply will create a 48-bit result that must
 871   be shifted down by 32 bits
 872 * A 32-bit x 8-bit multiply will create a 40-bit result that must
 873   be shifted down by 32 bits
 874
So again, just as with shift-left and shift-right, the result
is shifted down by the maximum of the two source register bitwidths.
And, again, truncation or sign-extension is then performed on the
result.  If sign-extension is to be carried out, it is performed
from the same maximum of the two source register bitwidths out
to the result element's bitwidth.

If truncation occurs, i.e. the top MSBs of the result are lost,
this is "Officially Not Our Problem": it is assumed that the
programmer actually desires the result to be truncated.  If the
programmer wanted all of the bits, they would have set the destination
elwidth to accommodate them.
 887
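The shift amounts listed above can be checked with a minimal Python sketch of the unsigned case (MULHU) only. `poly_mulhu` and its parameters are hypothetical names for this illustration, and the signed variants are not modelled.

```python
def poly_mulhu(src1, src2, wid1, wid2, destwid):
    opwid = max(wid1, wid2)            # operation width: max of the source widths
    a = src1 & ((1 << wid1) - 1)       # zero-extend each source from its own width
    b = src2 & ((1 << wid2) - 1)
    product = a * b                    # up to wid1 + wid2 bits
    high = product >> opwid            # shift down by the maximum source width
    return high & ((1 << destwid) - 1) # truncate to the destination element width
```

For a 16-bit x 8-bit multiply of 0xFFFF by 0xFF, the 24-bit product is 0xFEFF01. Shifting down by 16 leaves 0xFE, with the top 8 bits of the 16-bit high half being zero.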
 888 ## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
 889
 890 Polymorphic element widths in vectorised form means that the data
 891 being loaded (or stored) across multiple registers needs to be treated
 892 (reinterpreted) as a contiguous stream of elwidth-wide items, where
 893 the source register's element width is **independent** from the destination's.
 894
 895 This makes for a slightly more complex algorithm when using indirection
 896 on the "addressed" register (source for LOAD and destination for STORE),
 897 particularly given that the LOAD/STORE instruction provides important
 898 information about the width of the data to be reinterpreted.
 899
 900 As LOAD/STORE may be twin-predicated, it is important to note that
 901 the rules on twin predication still apply.  Where in previous
 902 pseudo-code (elwidth=default for both source and target) it was
 903 the *registers* that the predication was applied to, it is now the
 904 **elements** that the predication is applied to.
 905
 906 The pseudocode for all LD operations may be written out
 907 as follows:
 908
 909     function LBU(rd, rs):
 910         load_elwidthed(rd, rs, 8, true)
 911     function LB(rd, rs):
 912         load_elwidthed(rd, rs, 8, false)
 913     function LH(rd, rs):
 914         load_elwidthed(rd, rs, 16, false)
 915     ...
 916     ...
 917     function LQ(rd, rs):
 918         load_elwidthed(rd, rs, 128, false)
 919
    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        # never fewer than one element per block (avoids divide-by-zero)
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
 928
    function load_elwidthed(rd, rs, opwidth, unsigned):
      bitwidth = bw(int_csr[rd].elwidth) # destination element width, in bits
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
        val = load_memory(rs, imm, i, opwidth)
        if unsigned:
            val = zero_extend(val, min(opwidth, bitwidth))
        else:
            val = sign_extend(val, min(opwidth, bitwidth))
        set_polymorphed_reg(rd, bitwidth, j, val)
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++; else break;
 946
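The extension step in `load_elwidthed` can be illustrated with a short Python sketch. It models only how one loaded value is fitted into a destination element, not the address calculation. `extend_loaded` is a hypothetical helper name.

```python
def extend_loaded(raw, opwidth, destwidth, unsigned):
    """Fit a loaded value of opwidth bits into a destwidth-bit element."""
    width = min(opwidth, destwidth)        # extend from the narrower of the two
    val = raw & ((1 << width) - 1)
    if not unsigned and val & (1 << (width - 1)):
        val -= 1 << width                  # sign-extend from `width` bits
    return val & ((1 << destwidth) - 1)    # two's-complement value in the element
```

For instance, an LB of the byte 0x80 into a 16-bit destination element gives 0xFF80, whereas LBU gives 0x0080. An LH of 0x1234 into an 8-bit destination element keeps only 0x34.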
 947 # Predication Element Zeroing
 948
 949 The decision to add the *option* to zero unpredicated (masked-out)
 950 elements was based on whether it would be useful, rather than on
 951 how the microarchitecture is implemented (or optimised).  Therefore,
 952 both zeroing and non-zeroing are mandatory.
 953
 954 ## Single-predication (based on destination register)
 955
 956 Zeroing on predication for arithmetic operations is taken from
 957 the destination register's predicate.  i.e. the predication *and*
 958 zeroing settings to be applied to the whole operation come from the
 959 CSR Predication table entry for the destination register.
 960
Thus when zeroing is set on predication of a destination element,
if the predication bit is clear, then the destination element is *set*
to zero (twin-predication is slightly different, and is covered below).

Thus the pseudo-code loop for a predicated arithmetic operation
is modified as follows:
 967
      for (i = 0; i < VL; i++)
        if not zeroing: # an optimisation: skip masked-out elements
           while (i < VL && !(predval & 1<<i))
             i += 1
             if (int_vec[rd ].isvector)  { ird += 1; }
             if (int_vec[rs1].isvector)  { irs1 += 1; }
             if (int_vec[rs2].isvector)  { irs2 += 1; }
           if i == VL:
             return
        if (predval & 1<<i)
           src1 = ....
           src2 = ...
           result = src1 + src2 # actual add (or other op) here
           set_polymorphed_reg(rd, destwid, ird, result)
           if int_vec[rd].ffirst and result == 0:
              VL = i # result was zero, end loop early, return VL
              return
           if (!int_vec[rd].isvector) return
        else if zeroing:
           result = 0
           set_polymorphed_reg(rd, destwid, ird, result)
        if (int_vec[rd ].isvector)  { ird += 1; }
        else if (predval & 1<<i) return
        if (int_vec[rs1].isvector)  { irs1 += 1; }
        if (int_vec[rs2].isvector)  { irs2 += 1; }
        if (ird == VL or irs1 == VL or irs2 == VL): return
 994
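The difference between zeroing and non-zeroing can be seen in a minimal Python sketch. It assumes all three registers are vectors and leaves out element widths and fail-first. `predicated_add` is a hypothetical name.

```python
def predicated_add(dest, src1, src2, predval, vl, zeroing):
    """Return the destination vector after a single-predicated ADD."""
    out = list(dest)
    for i in range(vl):
        if predval & (1 << i):
            out[i] = src1[i] + src2[i]   # predicate bit set: perform the operation
        elif zeroing:
            out[i] = 0                   # masked out, zeroing: store zero
        # masked out, non-zeroing: destination element is left unchanged
    return out
```

With a predicate of 0b0101, elements 1 and 3 are masked out. They are set to zero when zeroing is enabled and keep their previous contents otherwise.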
 995 ## Twin-predication (based on source and destination register)
 996
 997 In twin-predication, the source is independently zero-predicated from
 998 the destination.  This means that the source may be zero-predicated *or*
 999 the destination zero-predicated *or both*, or neither.
1000
When, with twin-predication, zeroing is set on the source and not
the destination, a clear predicate bit indicates that a zero
data element is passed through the operation.  The exception is where
the source data element is to be treated as an address (a LOAD): in
that case the data returned *from* the LOAD is zero, rather than looking
up an *address* of zero.
1007
1008 When zeroing is set on the destination and not the source, then just
1009 as with single-predicated operations, a zero is stored into the destination
1010 element (or target memory address for a STORE).
1011
Zeroing on both source and destination effectively results in the source
and destination predicates being combined with a bitwise AND: the result is
that where either source predicate OR destination predicate is set to 0,
a zero element will ultimately end up in the destination register.
1016
1017 However: this may not necessarily be the case for all operations;
1018 implementors, particularly of custom instructions, clearly need to
1019 think through the implications in each and every case.
1020
1021 Here is (simplified) pseudo-code for a twin zero-predicated MV operation:
1022
    function op_mv(rd, rs) # MV, not VMV!
      rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
      rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
      ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
      pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
        if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
        if ((pd & 1<<j))
            # source element, or zero if the source element is masked out
            ireg[rd+j] <= (ps & 1<<i) ? ireg[rs+i] : 0
        else if (zerodst)
            ireg[rd+j] <= 0
        if (int_csr[rs].isvec) i++;
        if (int_csr[rd].isvec) j++;
        else if ((pd & 1<<j)) break;
1038
1039 Note that in the instance where the destination is a scalar, the hardware
1040 loop is ended the moment a value *or a zero* is placed into the destination
1041 register/element.  Also note that, for clarity, variable element widths
1042 have been left out of the above.
1043
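A runnable Python sketch of the twin zero-predicated MV may help to check individual cases. It assumes both source and destination are vectors and ignores element widths. `twin_pred_mv` is a hypothetical name, and the bounds checks inside the skip loops are an addition for safety.

```python
def twin_pred_mv(dest, src, ps, pd, vl, zerosrc, zerodst):
    """Return the destination vector after a twin-predicated MV."""
    out = list(dest)
    i = j = 0
    while i < vl and j < vl:
        if not zerosrc:                      # non-zeroing: skip masked-out sources
            while i < vl and not (ps >> i) & 1:
                i += 1
        if not zerodst:                      # non-zeroing: skip masked-out dests
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        if (pd >> j) & 1:
            # a masked-out source can only be reached here when zerosrc is set
            out[j] = src[i] if (ps >> i) & 1 else 0
        elif zerodst:
            out[j] = 0
        i += 1
        j += 1
    return out
```

Without zeroing, the enabled source elements are packed into the enabled destination elements in order. With zeroing on both sides, only positions where both predicate bits are set receive data, and every other destination element becomes zero.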
1044 # Exceptions
1045
1046 TODO: expand.
1047
1048 # Hints
1049
1050 With Simple-V being capable of issuing *parallel* instructions where
1051 rd=x0, the space for possible HINTs is expanded considerably.  VL
1052 could be used to indicate different hints.  In addition, if predication
1053 is set, the predication register itself could hypothetically be passed
1054 in as a *parameter* to the HINT operation.
1055
No specific hints are yet defined in Simple-V.
1057
1058 # Vector Block Format <a name="vliw-format"></a>
1059
1060 See ancillary resource: [[vblock_format]]
1061
1062 # Subsets of RV functionality
1063
1064 It is permitted to only implement SVprefix and not the VBLOCK instruction
1065 format option, and vice-versa.  UNIX Platforms **MUST** raise illegal
1066 instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
1067 traps may emulate the format.
1068
It is permitted in SVprefix to either not implement VL or not implement
SUBVL (see [[sv_prefix_proposal]] for full details).  Again, UNIX Platforms
**MUST** raise illegal instruction on implementations that do not support
VL or SUBVL.

It is permitted to limit the size of either (or both) the register files
down to the original size of the standard RV architecture.  However,
reducing them below the mandatory limits set in the RV standard will result
in non-compliance with the SV Specification.
1078