--- /dev/null
+# Simple-V (Parallelism Extension Proposal) Specification (Abridged)
+
+* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
+* Status: DRAFTv0.6
+* Last edited: 25 jun 2019
+
+[[!toc ]]
+
+# Introduction
+
+Simple-V is a uniform parallelism API for RISC-V hardware that allows
+the Program Counter to enter "sub-contexts" in which, ultimately, standard
+RISC-V scalar opcodes are executed.
+
+The sub-context execution is "nested" in "re-entrant" form, in the
+following order:
+
+* Main standard RISC-V Program Counter (PC)
+* VBLOCK sub-execution context (PCVBLK increments whilst PC is paused)
+* VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause)
+* SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses)
+
+Note: **there are *no* new opcodes**. The scheme works *entirely*
+on hidden context that augments *scalar* RISCV instructions.
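+
+The nesting order above can be sketched as plain loops. This is an
+illustrative model only (all names and structure are invented for the
+sketch, not taken from the specification):

```python
def execute_program(instructions, VL, SUBVL):
    """Sketch of the nested sub-contexts: PCVBLK advances inside PC,
    the VL element loop runs inside PCVBLK, and SUBVL is innermost."""
    trace = []
    for pc, vblock in enumerate(instructions):       # standard RISC-V PC
        for pcvblk, op in enumerate(vblock):         # VBLOCK sub-context
            for element in range(VL):                # VL element loop
                for sub in range(SUBVL):             # SUBVL element loop
                    trace.append((pc, pcvblk, element, sub))
    return trace
```

+Each scalar opcode therefore expands to VL*SUBVL element operations
+before the Program Counter moves on.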
+
+# CSRs <a name="csrs"></a>
+
+There are five CSRs, available in any privilege level:
+
+* MVL (the Maximum Vector Length)
+* VL (which has different characteristics from standard CSRs)
+* SUBVL (effectively a kind of SIMD)
+* STATE (containing copies of MVL, VL and SUBVL as well as context information)
+* PCVBLK (the current operation being executed within a VBLOCK Group)
+
+For Privilege Levels (trap handling) there are the following CSRs,
+where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor
+Modes respectively:
+
+* (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative
+ to the start of the current VBLOCK Group, set on a trap).
+* (x)eSTATE (useful for saving and restoring during context switch,
+ and for providing fast transitions)
+
+The u/m/s CSRs are treated and handled exactly like their (x)epc
+equivalents. On entry to or exit from a privilege level, the contents
+of its (x)eSTATE are swapped with STATE.
+
+(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc
+equivalents. See VBLOCK section for details.
+
+## MAXVECTORLENGTH (MVL) <a name="mvl" />
+
+MAXVECTORLENGTH is the same concept as MVL in RVV, except that it
+is variable length and may be dynamically set. MVL is
+however limited to the regfile bitwidth XLEN (1-32 for RV32,
+1-64 for RV64 and so on).
+
+## Vector Length (VL) <a name="vl" />
+
+VSETVL is slightly different from RVV. As in RVV, VL is set to be within
+the range 1 <= VL <= MVL (where MVL is in turn limited to 1 <= MVL <= XLEN):
+
+ VL = rd = MIN(vlen, MVL)
+
+where 1 <= MVL <= XLEN
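+
+As a sketch (function and variable names are illustrative, not part of
+the specification), the SETVL rule above is:

```python
XLEN = 64  # assumed RV64 for the sketch

def setvl(vlen, mvl):
    """VL = rd = MIN(vlen, MVL), with MVL clamped to 1..XLEN."""
    mvl = min(max(mvl, 1), XLEN)   # 1 <= MVL <= XLEN
    return min(vlen, mvl)          # 1 <= VL <= MVL
```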
+
+## SUBVL - Sub Vector Length
+
+This is a "group by quantity" that effectively asks each iteration
+of the hardware loop to load SUBVL elements of width elwidth at a
+time. SUBVL thus acts as a SIMD multiplier: instead of just 1
+operation issued, SUBVL operations are issued.
+
+The main effect of SUBVL is that predication bits are applied per
+**group**, rather than by individual element.
+
+## STATE
+
+This is a standard CSR that contains sufficient information for a
+full context save/restore. It contains (and permits setting of):
+
+* MVL
+* VL
+* destoffs - the destination element offset of the current parallel
+ instruction being executed
+* srcoffs - for twin-predication, the source element offset as well.
+* SUBVL
+* svdestoffs - the subvector destination element offset of the current
+ parallel instruction being executed
+* svsrcoffs - for twin-predication, the subvector source element offset
+ as well.
+
+The format of the STATE CSR is as follows:
+
+| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) |
+| ------- | -------- | -------- | -------- | -------- | ------- | ------- |
+| dsvoffs | ssvoffs | subvl | destoffs | srcoffs | vl | maxvl |
+
+Notes:
+
+* The entries are truncated to be within range. Attempts to set VL to
+ greater than MAXVL will truncate VL to MAXVL.
+* Both VL and MAXVL are stored offset by one: 0b000000 represents VL=1,
+ 0b000001 represents VL=2. This allows the full range 1 to XLEN
+ (e.g. 1 to 64 on RV64) instead of 0 to 63.
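+
+A hedged sketch of packing and unpacking the layout above (function and
+field names are invented for illustration; only the field positions and
+the offset-by-one rule for vl and maxvl are taken from the notes):

```python
def pack_state(maxvl, vl, srcoffs=0, destoffs=0):
    """Pack the low 24 bits of STATE; maxvl/vl are stored offset-by-one."""
    return (((maxvl - 1) & 0x3f)
            | (((vl - 1) & 0x3f) << 6)
            | ((srcoffs & 0x3f) << 12)
            | ((destoffs & 0x3f) << 18))

def unpack_state(state):
    return {
        "maxvl":    (state & 0x3f) + 1,
        "vl":       ((state >> 6) & 0x3f) + 1,
        "srcoffs":  (state >> 12) & 0x3f,
        "destoffs": (state >> 18) & 0x3f,
    }
```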
+
+## VL, MVL and SUBVL instruction aliases
+
+This table contains pseudo-assembly instruction aliases. Note the
+subtraction of 1 from the CSRRWI pseudo variants, to compensate for the
+reduced range of the 5 bit immediate.
+
+| alias | CSR |
+| - | - |
+| SETVL rd, rs | CSRRW VL, rd, rs |
+| SETVLi rd, #n | CSRRWI VL, rd, #n-1 |
+| GETVL rd | CSRRS VL, rd, x0 |
+| SETMVL rd, rs | CSRRW MVL, rd, rs |
+| SETMVLi rd, #n | CSRRWI MVL,rd, #n-1 |
+| GETMVL rd | CSRRS MVL, rd, x0 |
+
+Note: CSRRC and other bit-setting variants may still be used; they are
+however not particularly useful (very obscure).
+
+## Register key-value (CAM) table <a name="regcsrtable" />
+
+The purpose of the Register table is to mark which registers change behaviour
+if used in a "Standard" (normally scalar) opcode.
+
+16 bit format:
+
+| RegCAM | 15 | (14..8) | 7 | (6..5) | (4..0) |
+| ------ | - | - | - | ------ | ------- |
+| 0 | isvec0 | regidx0 | i/f | vew0 | regkey |
+| 1 | isvec1 | regidx1 | i/f | vew1 | regkey |
+| 2 | isvec2 | regidx2 | i/f | vew2 | regkey |
+| 3 | isvec3 | regidx3 | i/f | vew3 | regkey |
+
+8 bit format:
+
+| RegCAM | | 7 | (6..5) | (4..0) |
+| ------ | | - | ------ | ------- |
+| 0 | | i/f | vew0 | regnum |
+
+Mapping the 8-bit to 16-bit format:
+
+| RegCAM | 15 | (14..8) | 7 | (6..5) | (4..0) |
+| ------ | - | - | - | ------ | ------- |
+| 0 | isvec=1 | regnum0<<2 | i/f | vew0 | regnum0 |
+| 1 | isvec=1 | regnum1<<2 | i/f | vew1 | regnum1 |
+| 2 | isvec=1 | regnum2<<2 | i/f | vew2 | regnum2 |
+| 3 | isvec=1 | regnum3<<2 | i/f | vew3 | regnum3 |
+
+Fields:
+
+* i/f is set to "1" to indicate that the redirection/tag entry is to
+ be applied to integer registers; 0 indicates that it is relevant to
+ floating-point registers.
+* isvec indicates that the register (whether a src or dest) is to progress
+ incrementally forward on each loop iteration. This gives the "effect"
+ of vectorisation. isvec set to zero indicates "do not progress", giving
+ the "effect" of that register being scalar.
+* vew overrides the operation's default width. See table below
+* regkey is the register which, if encountered in an op (as src or dest)
+ is to be "redirected"
+* in the 16-bit format, regidx is the *actual* register to be used
+ for the operation (note that it is 7 bits wide)
+
+| vew | bitwidth |
+| --- | ------------------- |
+| 00 | default (XLEN/FLEN) |
+| 01 | 8 bit |
+| 10 | 16 bit |
+| 11 | 32 bit |
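+
+For illustration, a decoder for one 16-bit Register-table entry (a
+sketch only: bit 0 is assumed to be the least significant bit, and all
+names are invented):

```python
def decode_regcam16(entry):
    """Split a 16-bit Register (CAM) table entry into its fields."""
    return {
        "regkey": entry & 0x1f,          # (4..0)  register in the opcode
        "vew":    (entry >> 5) & 0x3,    # (6..5)  element-width override
        "i_f":    (entry >> 7) & 0x1,    # 7       1=integer, 0=FP
        "regidx": (entry >> 8) & 0x7f,   # (14..8) actual register used
        "isvec":  (entry >> 15) & 0x1,   # 15      1=vector, 0=scalar
    }

# vew encoding from the table above
VEW_BITS = {0b00: "default", 0b01: 8, 0b10: 16, 0b11: 32}
```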
+
+As the above table is a CAM (key-value store) it may be appropriate
+(faster, less gates, implementation-wise) to expand it as follows:
+
+ struct vectorised {
+ bool isvector:1;
+ int elwidth:2;
+ bool enabled:1;
+ int regidx:7;
+ }
+
+ struct vectorised fp_vec[32], int_vec[32];
+
+ for (i = 0; i < len; i++) // from VBLOCK Format
+ tb = int_vec if CSRvec[i].type == 0 else fp_vec
+ idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
+ tb[idx].elwidth = CSRvec[i].elwidth
+ tb[idx].regidx = CSRvec[i].regidx // indirection
+ tb[idx].isvector = CSRvec[i].isvector // 0=scalar
+ tb[idx].enabled = true;
+
+## Predication Table <a name="predication_csr_table"></a>
+
+The Predication Table is a key-value store indicating whether, if a
+given destination register (integer or floating-point) is referred to
+in an instruction, it is to be predicated. Like the Register table, it
+is an indirect lookup that allows the RV opcodes to not need modification.
+
+* regidx is the register that in combination with the
+ i/f flag, if that integer or floating-point register is referred to in a
+ (standard RV) instruction results in the lookup table being referenced
+ to find the predication mask to use for this operation.
+* predidx is the *actual* (full, 7 bit) register to be used for the
+ predication mask.
+* inv indicates that the predication mask bits are to be inverted
+ prior to use *without* actually modifying the contents of the
+ register from which those bits originated.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+ place zeros in any element position where the predication mask is
+ set to zero. If zeroing is set to 0, unpredicated elements *must*
+ be left alone. Some microarchitectures may choose to interpret
+ this as skipping the operation entirely. Others which wish to
+ stick more closely to a SIMD architecture may choose instead to
+ interpret unpredicated elements as an internal "copy element"
+ operation (which would be necessary in SIMD microarchitectures
+ that perform register-renaming)
+* ffirst is a special mode that stops sequential element processing when
+ a data-dependent condition occurs, whether a trap or a conditional test.
+ The handling of each (trap or conditional test) is slightly different:
+ see Instruction sections for further details
+
+16 bit format:
+
+| PrCSR | (15..11) | 10 | 9 | 8 | (7..1) | 0 |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | predidx | zero0 | inv0 | i/f | regidx | ffirst0 |
+| 1 | predidx | zero1 | inv1 | i/f | regidx | ffirst1 |
+| 2 | predidx | zero2 | inv2 | i/f | regidx | ffirst2 |
+| 3 | predidx | zero3 | inv3 | i/f | regidx | ffirst3 |
+
+Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding. Its use must
+generate an illegal instruction trap.
+
+8 bit format:
+
+| PrCSR | 7 | 6 | 5 | (4..0) |
+| ----- | - | - | - | ------- |
+| 0 | zero0 | inv0 | i/f | regnum |
+
+Mapping from 8 to 16 bit format, the table becomes:
+
+| PrCSR | (15..11) | 10 | 9 | 8 | (7..1) | 0 |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | x9 | zero0 | inv0 | i/f | regnum | ff=0 |
+| 1 | x10 | zero1 | inv1 | i/f | regnum | ff=0 |
+| 2 | x11 | zero2 | inv2 | i/f | regnum | ff=0 |
+| 3 | x12 | zero3 | inv3 | i/f | regnum | ff=0 |
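+
+Again for illustration, a decoder for one 16-bit Predication-table
+entry (a sketch matching the layout above; all names invented):

```python
def decode_pred16(entry):
    """Split a 16-bit Predication table entry into its fields."""
    return {
        "ffirst":  entry & 0x1,           # 0        fail-on-first mode
        "regidx":  (entry >> 1) & 0x7f,   # (7..1)   register being predicated
        "i_f":     (entry >> 8) & 0x1,    # 8        1=integer, 0=FP
        "inv":     (entry >> 9) & 0x1,    # 9        invert the mask bits
        "zero":    (entry >> 10) & 0x1,   # 10       zero masked-out elements
        "predidx": (entry >> 11) & 0x1f,  # (15..11) register holding the mask
    }
```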
+
+Pseudocode for predication:
+
+ struct pred {
+ bool zero; // zeroing
+ bool inv; // register at predidx is inverted
+ bool ffirst; // fail-on-first
+ bool enabled; // use this to tell if the table-entry is active
+ int predidx; // redirection: actual int register to use
+ }
+
+ struct pred fp_pred_reg[32];
+ struct pred int_pred_reg[32];
+
+ for (i = 0; i < len; i++) // number of Predication entries in VBLOCK
+ tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+ idx = CSRpred[i].regidx
+ tb[idx].zero = CSRpred[i].zero
+ tb[idx].inv = CSRpred[i].inv
+ tb[idx].ffirst = CSRpred[i].ffirst
+ tb[idx].predidx = CSRpred[i].predidx
+ tb[idx].enabled = true
+
+ def get_pred_val(bool is_fp_op, int reg):
+ tb = fp_vec if is_fp_op else int_vec // Register (CAM) table
+ if (!tb[reg].enabled):
+ return ~0x0, False // all enabled; no zeroing
+ tb = fp_pred_reg if is_fp_op else int_pred_reg
+ if (!tb[reg].enabled):
+ return ~0x0, False // all enabled; no zeroing
+ predidx = tb[reg].predidx // redirection occurs HERE
+ predicate = intreg[predidx] // actual predicate HERE
+ if (tb[reg].inv):
+ predicate = ~predicate // invert ALL bits
+ return predicate, tb[reg].zero
+
+## Fail-on-First Mode <a name="ffirst-mode"></a>
+
+ffirst is a special data-dependent predicate mode. There are two
+variants: one is for faults: typically for LOAD/STORE operations,
+which may encounter end of page faults during a series of operations.
+The other variant is comparisons such as FEQ (or the augmented behaviour
+of Branch), and any operation that returns a result of zero (whether
+integer or floating-point). In the FP case, this includes negative-zero.
+
+Note that the execution order must "appear" to be sequential for ffirst
+mode to work correctly. An in-order architecture must execute the element
+operations in sequence, whilst an out-of-order architecture must *commit*
+the element operations in sequence (giving the appearance of in-order
+execution).
+
+Note also that, if ffirst mode is needed without predication, a special
+"always-on" Predicate Table Entry may be constructed by setting
+inv=1 and using x0 as the predicate register. This
+will have the effect of creating a mask of all ones, allowing ffirst
+to be set.
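+
+A small sketch of why this works: x0 always reads as zero, so an
+inverted x0 predicate yields a mask of all ones (names illustrative):

```python
XLEN = 64
intreg = [0] * 32              # x0 (index 0) is hardwired to zero

def pred_mask(predidx, inv):
    """Read the predicate register and optionally invert all bits."""
    mask = intreg[predidx] & ((1 << XLEN) - 1)
    if inv:
        mask = ~mask & ((1 << XLEN) - 1)
    return mask
```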
+
+### Fail-on-first traps
+
+Except for the first element, ffirst stops sequential element processing
+when a trap occurs. The first element is treated normally (as if ffirst
+were clear). Should any subsequent element require a trap, it and all
+later elements are instead ignored (or cancelled in out-of-order
+designs), and VL is truncated to cover only the elements *before* the
+one that attempted to trap.
+
+Note that predicated-out elements (where the predicate mask bit is zero)
+are clearly excluded (i.e. the trap will not occur). However, note that
+the loop still had to test the predicate bit: thus on return,
+VL is set to include elements that did not take the trap *and* includes
+the elements that were predicated (masked) out (not tested up to the
+point where the trap occurred).
+
+If SUBVL is being used (SUBVL!=1), a trap in the first *sub-group* of
+elements is taken as normal (as if ffirst were not set); in subsequent
+sub-groups the trap is suppressed and processing stops instead. The
+SUBVL CSR itself will **NOT** be modified: only VL is truncated.
+
+Given that predication bits apply to SUBVL groups, the same rules apply
+to predicated-out (masked-out) sub-groups in calculating the value that VL
+is set to.
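+
+A hedged model of the resulting VL (illustrative only; treat each index
+below as one element, or as one SUBVL sub-group when SUBVL!=1):

```python
def ffirst_trap_vl(VL, predmask, traps):
    """traps: element indices whose operation would trap.  Returns the
    new (truncated) VL, or None when the trap is actually delivered
    (i.e. it occurred on the first element)."""
    for i in range(VL):
        if not (predmask >> i) & 1:
            continue        # predicated-out: cannot trap, still counted
        if i in traps:
            if i == 0:
                return None  # first element traps as normal
            return i         # VL covers the elements before the trap
    return VL
```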
+
+### Fail-on-first conditional tests
+
+ffirst stops sequential element conditional testing on the first element
+result being zero. VL is set to the number of elements that were
+processed before the fail-condition was encountered.
+
+Note that just as with traps, if SUBVL!=1, the first of any of the *sub-group*
+will cause the processing to end, and, even if there were elements within
+the *sub-group* that passed the test, that sub-group is still (entirely)
+excluded from the count (from setting VL). i.e. VL is set to the total
+number of *sub-groups* that had no fail-condition up until execution was
+stopped.
+
+Note again that, just as with traps, predicated-out (masked-out) elements
+are included in the count leading up to the fail-condition, even though they
+were not tested.
+
+The pseudo-code for Predication makes this clearer and simpler than it is
+in words (the loop ends, VL is set to the current element index, "i").
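+
+That loop can be sketched as follows (names are illustrative; "results"
+stands in for the per-element comparison outcomes):

```python
def ffirst_cmp_vl(VL, predmask, results):
    """Return the truncated VL for fail-on-first conditional tests."""
    for i in range(VL):
        if not (predmask >> i) & 1:
            continue            # masked-out: untested, but counted
        if results[i] == 0:     # fail condition: result of zero
            return i            # loop ends: VL = current index "i"
    return VL
```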
+
+# Instructions <a name="instructions" />
+
+To illustrate how Scalar operations are turned "vector" and "predicated",
+simplified example pseudo-code for an integer ADD operation is shown below.
+Floating-point would use the FP Register Table.
+
+ function op_add(rd, rs1, rs2) # add not VADD!
+ int i, id=0, irs1=0, irs2=0;
+ predval = get_pred_val(FALSE, rd);
+ rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+ rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+ rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+ for (i = 0; i < VL; i++)
+ xSTATE.srcoffs = i # save context
+ if (predval & 1<<i) # predication uses intregs
+ ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+ if (!int_vec[rd ].isvector) break;
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
+
+Note that for simplicity there is quite a lot missing from the above
+pseudo-code.
+
+## SUBVL Pseudocode <a name="subvl-pseudocode"></a>
+
+Adding in support for SUBVL is a matter of adding an extra inner
+for-loop, where register src and dest are still incremented inside the
+inner part. Note that the predication is still taken from the VL index.
+
+So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are
+indexed by "(i)".
+
+ function op_add(rd, rs1, rs2) # add not VADD!
+ int i, id=0, irs1=0, irs2=0;
+ predval = get_pred_val(FALSE, rd);
+ rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+ rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+ rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+ for (i = 0; i < VL; i++)
+ xSTATE.srcoffs = i # save context
+ for (s = 0; s < SUBVL; s++)
+ xSTATE.ssvoffs = s # save context
+ if (predval & 1<<i) # predication uses intregs
+ # actual add is here (at last)
+ ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+ if (!int_vec[rd ].isvector) break;
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
+ if (id == VL or irs1 == VL or irs2 == VL) {
+ # end VL hardware loop
+ xSTATE.srcoffs = 0; # reset
+ xSTATE.ssvoffs = 0; # reset
+ return;
+ }
+
+NOTE: pseudocode simplified greatly: zeroing, proper predicate handling,
+elwidth handling etc. all left out.
+
+## Instruction Format
+
+It is critical to appreciate that there are
+**no operations added to SV, at all**.
+
+Examples are given below where "standard" RV scalar behaviour is augmented.
+
+## Branch Instructions
+
+Branch operations are augmented slightly to be a little more like FP
+Compares (FEQ, FNE etc.), by permitting the accumulation (and storage)
+of multiple comparisons into a register (taken indirectly from the predicate
+table). As such, "ffirst" - fail-on-first - condition mode can be enabled.
+See ffirst mode in the Predication Table section.
+
+### Standard Branch <a name="standard_branch"></a>
+
+Branch operations use standard RV opcodes that are reinterpreted to
+be "predicate variants" in the instance where either of the two src
+registers are marked as vectors (active=1, vector=1).
+
+Note that the predication register to use (if one is enabled) is taken from
+the *first* src register, and that this is used, just as with predicated
+arithmetic operations, to mask whether the comparison operations take
+place or not. If the second register is also marked as predicated,
+that (scalar) predicate register is used as a **destination** to store
+the results of all the comparisons.
+
+In instances where no vectorisation is detected on either src register,
+the operation is treated as an absolutely standard scalar branch operation.
+Where vectorisation is present on either or both src registers, the
+branch may still go ahead if and only if *all* tests succeed (i.e. excluding
+those tests that are predicated out).
+
+Pseudo-code for branch:
+
+ s1 = reg_is_vectorised(src1);
+ s2 = reg_is_vectorised(src2);
+
+ if not s1 && not s2
+ if cmp(src1, src2) # scalar compare
+ goto branch
+ return
+
+ preg = int_pred_reg[rd]
+ reg = int_regfile
+
+ ps = get_pred_val(I/F==INT, rs1);
+ rd = get_pred_val(I/F==INT, rs2); # this may not exist
+
+ if not exists(rd) or zeroing:
+ result = 0
+ else
+ result = preg[rd]
+
+ for (int i = 0; i < VL; ++i)
+ if (zeroing)
+ if not (ps & (1<<i))
+ result &= ~(1<<i);
+ else if (ps & (1<<i))
+ if (cmp(s1 ? reg[src1+i]:reg[src1],
+ s2 ? reg[src2+i]:reg[src2]))
+ result |= 1<<i;
+ else
+ result &= ~(1<<i);
+
+ if not exists(rd)
+ if result == ps
+ goto branch
+ else
+ preg[rd] = result # store in destination
+ if preg[rd] == ps
+ goto branch
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering") setting Vector-Length times (number of SIMD elements) bits
+ in Predicate Register rd, as opposed to just Vector-Length bits.
+* The execution of "parallelised" instructions **must** be implemented
+ as "re-entrant" (to use a term from software). If an exception (trap)
+ occurs during the middle of a vectorised
+ Branch (now a SV predicated compare) operation, the partial results
+ of any comparisons must be written out to the destination
+ register before the trap is permitted to begin. If however there
+ is no predicate, the **entire** set of comparisons must be **restarted**,
+ with the offset loop indices set back to zero. This is because
+ there is no place to store the temporary result during the handling
+ of traps.
+
+Note also that where normally, predication requires that there must
+also be a CSR register entry for the register being used in order
+for the **predication** CSR register entry to also be active,
+for branches this is **not** the case. src2 does **not** have
+to have its CSR register entry marked as active in order for
+predication on src2 to be active.
+
+### Floating-point Comparisons
+
+There are no floating-point branch operations, only compares.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a zero in its "rd" integer register
+target, i.e. it's not actually a Branch at all: it's a compare.
+
+As RV Scalar does not have "FNE", predication inversion must be used.
+Also: note that FP Compare may be predicated, using the destination
+integer register (rd) to determine the predicate. FP Compare is **not**
+a twin-predication operation, as, again, just as with SV Branches,
+there are three registers involved: FP src1, FP src2 and INT rd.
+
+Also: note that ffirst (fail first mode) applies directly to this operation.
+
+### Compressed Branch Instruction
+
+Compressed Branch instructions are, just like standard Branch instructions,
+reinterpreted to be vectorised and predicated based on the source register
+(rs1s) CSR entries. As however there is only the one source register,
+given that c.beqz a10 is equivalent to beqz a10,x0, the optional target
+to store the results of the comparisons is taken from CSR predication
+table entries for **x0**.
+
+The specific required use of x0, though counterintuitive at first,
+becomes clear with a little thought. It is **not** recommended to redirect
+x0 with a CSR register entry, however as a means to opaquely obtain
+a predication target it is the only sensible option that does not involve
+additional special CSRs (or, worse, additional special opcodes).
+
+Note also that, just as with standard branches, the 2nd source
+(in this case x0 rather than src2) does **not** have to have its CSR
+register table marked as "active" in order for predication to work.
+
+## Vectorised Dual-operand instructions
+
+There is a series of 2-operand instructions involving copying (and
+sometimes alteration):
+
+* C.MV
+* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
+* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
+* LOAD(-FP) and STORE(-FP)
+
+All of these operations follow the same two-operand pattern, so it is
+*both* the source *and* destination predication masks that are taken into
+account. This is different from
+the three-operand arithmetic instructions, where the predication mask
+is taken from the *destination* register, and applied uniformly to the
+elements of the source register(s), element-for-element.
+
+The pseudo-code pattern for twin-predicated operations is as
+follows:
+
+ function op(rd, rs):
+ rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+ rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+ ps = get_pred_val(FALSE, rs); # predication on src
+ pd = get_pred_val(FALSE, rd); # ... AND on dest
+ for (int i = 0, int j = 0; i < VL && j < VL;):
+ if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+ if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+ xSTATE.srcoffs = i # save context
+ xSTATE.destoffs = j # save context
+ reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
+ if (int_csr[rs].isvec) i++;
+ if (int_csr[rd].isvec) j++; else break
+
+This pattern covers scalar-scalar, scalar-vector, vector-scalar
+and vector-vector, and predicated variants of all of those.
+Zeroing is not presently included (TODO). As such, when compared
+to RVV, the twin-predicated variants of C.MV and FMV cover
+**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
+VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
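+
+A runnable sketch of the twin-predicated pattern above, showing how the
+scalar/vector marking alone selects the operation (all names invented;
+real SV applies the marking via the CSR tables, not function arguments):

```python
def twin_mv(reg, rd, rs, rd_isvec, rs_isvec, VL, ps=~0, pd=~0):
    """Twin-predicated register move, per the pseudo-code pattern."""
    i = j = 0
    while i < VL and j < VL:
        if rs_isvec:
            while not (ps >> i) & 1:
                i += 1              # skip masked-out source elements
        if rd_isvec:
            while not (pd >> j) & 1:
                j += 1              # skip masked-out dest elements
        reg[rd + j] = reg[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        else:
            break                   # scalar dest: single move only

# scalar src + vector dest, no predication: VSPLAT
regs = [0] * 32
regs[1] = 99
twin_mv(regs, rd=8, rs=1, rd_isvec=True, rs_isvec=False, VL=4)
```

+Marking the source as vector and the destination as scalar instead
+gives VEXTRACT; sparse masks on either side give the scatter/gather
+behaviours listed above.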
+
+### C.MV Instruction <a name="c_mv"></a>
+
+There is no MV instruction in RV however there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register are marked as vectors
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change.
+
+There are several different instructions from RVV that are covered by
+this one opcode:
+
+[[!table data="""
+src | dest | predication | op |
+scalar | vector | none | VSPLAT |
+scalar | vector | destination | sparse VSPLAT |
+scalar | vector | 1-bit dest | VINSERT |
+vector | scalar | 1-bit? src | VEXTRACT |
+vector | vector | none | VCOPY |
+vector | vector | src | Vector Gather |
+vector | vector | dest | Vector Scatter |
+vector | vector | src & dest | Gather/Scatter |
+vector | vector | src == dest | sparse VCOPY |
+"""]]
+
+Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
+operations with inversion on the src and dest predication for one of the
+two C.MV operations.
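+
+As a sketch, the macro-op-fused VMERGE behaviour (illustrative names;
+each loop stands for one predicated C.MV):

```python
def vmerge(dest, src_a, src_b, mask, VL):
    """Merge: mask selects from src_a, inverted mask from src_b."""
    for i in range(VL):              # first C.MV: mask as-is
        if (mask >> i) & 1:
            dest[i] = src_a[i]
    for i in range(VL):              # second C.MV: mask inverted
        if not (mask >> i) & 1:
            dest[i] = src_b[i]
    return dest
```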
+
+### FMV, FNEG and FABS Instructions
+
+These are identical in form to C.MV, except covering floating-point
+register copying. The same double-predication rules also apply.
+However, when elwidth is not set to default, the instruction is implicitly
+and automatically converted to a (vectorised) floating-point type conversion
+operation of the appropriate size, covering the source and destination
+register bitwidths.
+
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
+
+### FCVT Instructions
+
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point. When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.
+
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable i.e. if packed SIMD is requested) instead.
+
+## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
+
+In vectorised architectures there are usually at least two different modes
+for LOAD/STORE:
+
+* Read (or write for STORE) from sequential locations, where one
+ register specifies the address, and the one address is incremented
+ by a fixed amount. This is usually known as "Unit Stride" mode.
+* Read (or write) from multiple indirected addresses, where the
+ vector elements each specify separate and distinct addresses.
+
+To support these different addressing modes, the CSR Register "isvector"
+bit is used. So, for a LOAD, when the src register is set to
+scalar, the LOADs are sequentially incremented by the src register
+element width, and when the src register is set to "vector", the
+elements are treated as indirection addresses. Simplified
+pseudo-code would look like this:
+
+ function op_ld(rd, rs) # LD not VLD!
+ rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
+ rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
+ ps = get_pred_val(FALSE, rs); # predication on src
+ pd = get_pred_val(FALSE, rd); # ... AND on dest
+ for (int i = 0, int j = 0; i < VL && j < VL;):
+ if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+ if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+ if (int_csr[rd].isvec)
+ # indirect mode (multi mode)
+ srcbase = ireg[rsv+i];
+ else
+ # unit stride mode
+ srcbase = ireg[rsv] + i * XLEN/8; # offset in bytes
+ ireg[rdv+j] <= mem[srcbase + imm_offs];
+ if (!int_csr[rs].isvec &&
+ !int_csr[rd].isvec) break # scalar-scalar LD
+ if (int_csr[rs].isvec) i++;
+ if (int_csr[rd].isvec) j++;
+
+## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
+
+C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
+where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
+It is therefore possible to use predicated C.LWSP to efficiently
+pop registers off the stack (by predicating x2 as the source), cherry-picking
+which registers to store to (by predicating the destination). Likewise
+for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.
+
+**Note**: it is still possible to redirect x2 to an alternative target
+register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
+general-purpose LOAD/STORE operations.
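+
+A loose sketch of the "pop multiple" idea (all names invented; the real
+mechanism is the twin-predication pseudo-code shown earlier, with x2 as
+the implicit source base):

```python
def pop_multiple(mem, sp, reg, dest_mask, count, word=4):
    """Load `count` consecutive stack words, writing only to registers
    whose bit is set in dest_mask (the destination predicate)."""
    j = 0
    for slot in range(count):
        while j < 32 and not (dest_mask >> j) & 1:
            j += 1              # dest predicate: skip unselected regs
        if j >= 32:
            break
        reg[j] = mem[sp + slot * word]
        j += 1
    return reg
```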
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
+where the same rules and the same pseudo-code apply as for
+non-compressed LOAD/STORE. Again: setting scalar or vector mode
+on the src for LOAD and dest for STORE switches mode from "Unit Stride"
+to "Multi-indirection", respectively.
+
+# Element bitwidth polymorphism <a name="elwidth"></a>
+
+Element bitwidth is best covered as its own special section, as it
+is quite involved and applies uniformly across-the-board. SV restricts
+bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.
+
+The effect of setting an element bitwidth is to re-cast each entry
+in the register table, and for all memory operations involving
+load/stores of certain specific sizes, to a completely different width.
+Thus, in C-style terms, on an RV64 architecture, effectively each register
+now looks like this:
+
+ typedef union {
+ uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
+ uint8_t b[0]; // array of type uint8_t
+ uint16_t s[0];
+ uint32_t i[0];
+ uint64_t l[0];
+ uint128_t d[0];
+ } reg_t;
+
+ reg_t int_regfile[128];
+
+where, when accessing any individual regfile[n].b entry, it is permitted
+(in C) to arbitrarily over-run the *declared* length of the array (zero),
+and thus "overspill" into consecutive register file entries in a fashion
+that is completely transparent to a greatly-simplified software / pseudo-code
+representation.
+It is however critical to note that it is the responsibility of the
+implementor to ensure that, towards the end of the register file, an
+exception is thrown if an access beyond the "real" register bytes is
+ever attempted.
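+
+The same union view can be modelled with one flat byte array backing the
+whole register file (a sketch: 16-bit elements, little-endian, RV64
+register width, names invented):

```python
import struct

REGS, XLEN_BYTES = 32, 8
regfile = bytearray(REGS * XLEN_BYTES)   # one flat backing store

def write_elem16(reg, offset, val):
    """Write 16-bit element `offset` of a vector starting at `reg`."""
    byte = reg * XLEN_BYTES + offset * 2
    struct.pack_into("<H", regfile, byte, val)

def read_elem16(reg, offset):
    byte = reg * XLEN_BYTES + offset * 2
    return struct.unpack_from("<H", regfile, byte)[0]
```

+Element 5 of a vector starting at register 4 lands in the bytes of
+register 5: the "overspill" is just flat byte addressing.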
+
+The pseudo-code is as follows, to demonstrate how the sign-extending
+and width-extending works:
+
+ typedef union {
+ uint8_t b;
+ uint16_t s;
+ uint32_t i;
+ uint64_t l;
+ } el_reg_t;
+
+ bw(elwidth):
+ if elwidth == 0:
+ return xlen
+ if elwidth == 1:
+ return 8
+ if elwidth == 2:
+ return 16
+ // elwidth == 3:
+ return 32
+
+ get_max_elwidth(rs1, rs2):
+ return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
+ bw(int_csr[rs2].elwidth)) # again XLEN if no entry
+
+ get_polymorphed_reg(reg, bitwidth, offset):
+ el_reg_t res;
+ res.l = 0; // TODO: going to need sign-extending / zero-extending
+ if bitwidth == 8:
+ res.b = int_regfile[reg].b[offset]
+ elif bitwidth == 16:
+ res.s = int_regfile[reg].s[offset]
+ elif bitwidth == 32:
+ res.i = int_regfile[reg].i[offset]
+ elif bitwidth == 64:
+ res.l = int_regfile[reg].l[offset]
+ return res
+
+ set_polymorphed_reg(reg, bitwidth, offset, val):
+ if (!int_csr[reg].isvec):
+ # sign/zero-extend depending on opcode requirements, from
+ # the reg's bitwidth out to the full bitwidth of the regfile
+ val = sign_or_zero_extend(val, bitwidth, xlen)
+ int_regfile[reg].l[0] = val
+ elif bitwidth == 8:
+ int_regfile[reg].b[offset] = val
+ elif bitwidth == 16:
+ int_regfile[reg].s[offset] = val
+ elif bitwidth == 32:
+ int_regfile[reg].i[offset] = val
+ elif bitwidth == 64:
+ int_regfile[reg].l[offset] = val
+
+ maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
+ destwid = int_csr[rd].elwidth # destination element width
+ for (i = 0; i < VL; i++)
+ if (predval & 1<<i) # predication uses intregs
+ // TODO, calculate if over-run occurs, for each elwidth
+ src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
+ // TODO, sign/zero-extend src1 and src2 as operation requires
+ if (op_requires_sign_extend_src1)
+ src1 = sign_extend(src1, maxsrcwid)
+ src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
+ result = src1 + src2 # actual add here
+ // TODO, sign/zero-extend result, as operation requires
+ if (op_requires_sign_extend_dest)
+ result = sign_extend(result, maxsrcwid)
+ set_polymorphed_reg(rd, destwid, ird, result)
+ if (!int_vec[rd].isvector) break
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
+
+## Polymorphic floating-point operation exceptions and error-handling
+
+For floating-point operations, conversion takes place without
+raising any kind of exception. Exactly as specified in the standard
+RV specification, NaN (or the appropriate value) is stored if the result
+is beyond the range of the destination, and, exactly as with scalar
+operations, the relevant floating-point flag is raised in FCSR. Again
+just as with scalar operations, it is software's responsibility to check
+this flag. Given that the FCSR flags are "accrued", the fact that multiple
+element operations could have occurred is not a problem.
+
+Note that it is perfectly legitimate for a floating-point bitwidth of
+only 8 to be specified. However, whilst it is possible to apply IEEE 754
+principles, no actual standard yet exists. Implementors wishing to
+provide hardware-level 8-bit support rather than throw a trap to emulate
+in software should contact the author of this specification before
+proceeding.
+
+## Polymorphic shift operators
+
+A special note is needed for changing the element width of left and right
+shift operators, particularly right-shift.
+
+For SV, where each operand's element bitwidth may be over-ridden, the
+rule about determining the operation's bitwidth *still applies*, being
+defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
+**also applies to the truncation of RS2**. In other words, *after*
+determining the maximum bitwidth, RS2's range must **also be truncated**
+to ensure a correct answer. Example:
+
+* RS1 is over-ridden to a 16-bit width
+* RS2 is over-ridden to an 8-bit width
+* RD is over-ridden to a 64-bit width
+* the maximum bitwidth is thus determined to be 16-bit: max(8, 16)
+* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)
+
+Pseudocode (in spike) for this example would therefore be:
+
+    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));
+
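The rule can also be expressed as an executable (non-normative) sketch. The function name and the final fit-to-destination masking are assumptions for illustration; extension out to the destination elwidth is in practice opcode-dependent:

```python
# Non-normative sketch of the elwidth shift rule: the operating width is
# max(src1 elwidth, src2 elwidth), and RS2 is masked to that width's
# shift-amount range *after* the maximum has been determined.
def sv_shift_left(rs1_val, rs1_elwidth, rs2_val, rs2_elwidth, rd_elwidth):
    opwidth = max(rs1_elwidth, rs2_elwidth)   # e.g. max(16, 8) == 16
    shamt = rs2_val & (opwidth - 1)           # RS2 truncated to 0..opwidth-1
    result = (rs1_val << shamt) & ((1 << opwidth) - 1)
    return result & ((1 << rd_elwidth) - 1)   # fit to destination elwidth

# RS1 elwidth 16, RS2 elwidth 8, RD elwidth 64 (the example above):
# a shift amount of 0x11 (17) truncates to 1, not 17.
print(hex(sv_shift_left(0x0101, 16, 0x11, 8, 64)))  # 0x202
```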
+## Polymorphic MULH/MULHU/MULHSU
+
+MULH is designed to take the top (most significant) half of a multiply
+whose result does not fit within the range of the source operands, such
+that smaller width operations may produce a full double-width multiply
+in two cycles. The issue is: SV allows the source operands to
+have variable bitwidth.
+
+Here again special attention has to be paid to the rules regarding
+bitwidth, which, again, are that the operation is performed at
+the maximum bitwidth of the **source** registers. Therefore:
+
+* An 8-bit x 8-bit multiply will create a 16-bit result that must
+ be shifted down by 8 bits
+* A 16-bit x 8-bit multiply will create a 24-bit result that must
+ be shifted down by 16 bits (top 8 bits being zero)
+* A 16-bit x 16-bit multiply will create a 32-bit result that must
+ be shifted down by 16 bits
+* A 32-bit x 16-bit multiply will create a 48-bit result that must
+ be shifted down by 32 bits
+* A 32-bit x 8-bit multiply will create a 40-bit result that must
+ be shifted down by 32 bits
+
+So again, just as with shift-left and shift-right, the result
+is shifted down by the maximum of the two source register bitwidths.
+And, exactly again, truncation or sign-extension is performed on the
+result. If sign-extension is to be carried out, it is performed
+from the same maximum of the two source register bitwidths out
+to the result element's bitwidth.
+
+If truncation occurs, i.e. the top MSBs of the result are lost,
+this is "Officially Not Our Problem": it is assumed that the
+programmer actually desires the result to be truncated. Had the
+programmer wanted all of the bits, they would have set the destination
+elwidth to accommodate them.
+
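The bitwidth rule above can be sketched executably (non-normative; the unsigned-only helper name is invented for the example):

```python
# Non-normative model of polymorphic MULHU: the multiply is performed at
# the maximum of the two source elwidths, and the result is shifted down
# by that same maximum before being fitted to the destination elwidth.
def sv_mulhu(src1, src1_elwidth, src2, src2_elwidth, dest_elwidth):
    opwidth = max(src1_elwidth, src2_elwidth)
    full = ((src1 & ((1 << src1_elwidth) - 1)) *
            (src2 & ((1 << src2_elwidth) - 1)))
    high = full >> opwidth                     # "top half" at the op width
    return high & ((1 << dest_elwidth) - 1)    # truncate to dest elwidth

# 16-bit x 8-bit: a 24-bit product, shifted down by max(16, 8) == 16,
# leaving the top of the product (the upper 8 of the 24 bits are zero).
print(hex(sv_mulhu(0xFFFF, 16, 0xFF, 8, 16)))  # 0xfe (full = 0xFEFF01)
```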
+## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>
+
+Polymorphic element widths in vectorised form mean that the data
+being loaded (or stored) across multiple registers needs to be treated
+(reinterpreted) as a contiguous stream of elwidth-wide items, where
+the source register's element width is **independent** of the destination's.
+
+This makes for a slightly more complex algorithm when using indirection
+on the "addressed" register (source for LOAD and destination for STORE),
+particularly given that the LOAD/STORE instruction provides important
+information about the width of the data to be reinterpreted.
+
+As LOAD/STORE may be twin-predicated, it is important to note that
+the rules on twin predication still apply. Where in previous
+pseudo-code (elwidth=default for both source and target) it was
+the *registers* that the predication was applied to, it is now the
+**elements** that the predication is applied to.
+
+The full pseudocode for all LD operations may be written out
+as follows:
+
+    function LBU(rd, rs):
+        load_elwidthed(rd, rs, 8, true)
+    function LB(rd, rs):
+        load_elwidthed(rd, rs, 8, false)
+    function LH(rd, rs):
+        load_elwidthed(rd, rs, 16, false)
+    ...
+    ...
+    function LQ(rd, rs):
+        load_elwidthed(rd, rs, 128, false)
+
+    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
+    function load_memory(rs, imm, i, opwidth):
+        elwidth = int_csr[rs].elwidth
+        bitwidth = bw(elwidth);
+        elsperblock = max(1, opwidth / bitwidth)
+        srcbase = ireg[rs+i/(elsperblock)];
+        offs = i % elsperblock;
+        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes
+
+    function load_elwidthed(rd, rs, opwidth, unsigned):
+        destwid = int_csr[rd].elwidth # destination element width
+        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+        ps = get_pred_val(FALSE, rs); # predication on src
+        pd = get_pred_val(FALSE, rd); # ... AND on dest
+        for (int i = 0, int j = 0; i < VL && j < VL;):
+            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
+            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
+            val = load_memory(rs, imm, i, opwidth)
+            if unsigned:
+                val = zero_extend(val, min(opwidth, destwid))
+            else:
+                val = sign_extend(val, min(opwidth, destwid))
+            set_polymorphed_reg(rd, destwid, j, val)
+            if (int_csr[rs].isvec) i++;
+            if (int_csr[rd].isvec) j++; else break;
+
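For illustration only (not part of the specification), the "contiguous stream of elwidth-wide items" view of an elwidth-overridden LB can be modelled as follows; the function name and the dictionary-based memory are assumptions of the example, and predication is omitted:

```python
# Illustrative (non-normative) model of an elwidth-overridden LB: memory
# is read as a contiguous stream of 8-bit items, each sign-extended and
# written as a 16-bit element of the destination vector.
def sign_extend(val, bits):
    mask = 1 << (bits - 1)
    return (val ^ mask) - mask

def vec_lb_elwidthed(mem, base, vl, dest_elwidth=16, opwidth=8):
    dest = []
    for i in range(vl):                      # element loop, 0..VL-1
        val = mem[base + i]                  # one byte per element (opwidth=8)
        val = sign_extend(val, opwidth)      # LB is a signed load
        dest.append(val & ((1 << dest_elwidth) - 1))  # store at dest elwidth
    return dest

mem = {100: 0x7F, 101: 0x80, 102: 0x01, 103: 0xFF}
print([hex(x) for x in vec_lb_elwidthed(mem, 100, 4)])
# ['0x7f', '0xff80', '0x1', '0xffff']
```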
+# Predication Element Zeroing
+
+The introduction of zeroing on traditional vector predication is usually
+intended as an optimisation for lane-based microarchitectures with register
+renaming, allowing them to save power by avoiding a register read on
+elements that are passed en-masse through the ALU. Simpler
+microarchitectures do not have this issue: they simply do not pass the
+element through to the ALU at all, and therefore do not store it back in
+the destination. More complex non-lane-based micro-architectures can,
+when zeroing is not set, use the predication bits to avoid sending
+element-based operations to the ALUs entirely: thus, over the long term,
+potentially keeping all ALUs 100% occupied even when elements are
+predicated out.
+
+SimpleV's design principle is not based on or influenced by
+microarchitectural design factors: it is a hardware-level API.
+Therefore, looking purely at whether zeroing is *useful* or not
+(whether fewer instructions are needed for certain scenarios), and
+given that a case can be made for zeroing *and* non-zeroing, the
+decision was taken to add support for both.
+
+## Single-predication (based on destination register)
+
+Zeroing on predication for arithmetic operations is taken from
+the destination register's predicate. i.e. the predication *and*
+zeroing settings to be applied to the whole operation come from the
+CSR Predication table entry for the destination register.
+Thus when zeroing is set on predication of a destination element,
+if the predication bit is clear, then the destination element is *set*
+to zero (twin-predication is slightly different, and will be covered
+next).
+
+Thus the pseudo-code loop for a predicated arithmetic operation
+is modified as follows:
+
+    for (i = 0; i < VL; i++)
+        if not zeroing: # an optimisation
+            while (!(predval & 1<<i) && i < VL)
+                if (int_vec[rd ].isvector) { id += 1; }
+                if (int_vec[rs1].isvector) { irs1 += 1; }
+                if (int_vec[rs2].isvector) { irs2 += 1; }
+            if i == VL:
+                return
+        if (predval & 1<<i)
+            src1 = ....
+            src2 = ...
+            result = src1 + src2 # actual add (or other op) here
+            set_polymorphed_reg(rd, destwid, id, result)
+            if int_vec[rd].ffirst and result == 0:
+                VL = i # result was zero, end loop early, return VL
+                return
+            if (!int_vec[rd].isvector) return
+        else if zeroing:
+            result = 0
+            set_polymorphed_reg(rd, destwid, id, result)
+            if (!int_vec[rd].isvector) return
+        if (int_vec[rd ].isvector) { id += 1; }
+        if (int_vec[rs1].isvector) { irs1 += 1; }
+        if (int_vec[rs2].isvector) { irs2 += 1; }
+
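The difference between the two behaviours can be demonstrated with a short (non-normative) sketch; element widths, element-skipping and fail-first are deliberately left out, and the function name is invented for the example:

```python
# Non-normative model of a single-predicated add, with and without
# zeroing: with zeroing, masked-out destination elements are set to
# zero; without it, they are left untouched (skipped).
def pred_add(dest, src1, src2, predval, vl, zeroing):
    for i in range(vl):
        if predval & (1 << i):
            dest[i] = (src1[i] + src2[i]) & 0xFFFFFFFFFFFFFFFF
        elif zeroing:
            dest[i] = 0          # predicate bit clear: zero the element
        # else: element skipped entirely, dest[i] unchanged

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
d1 = [99, 99, 99, 99]; pred_add(d1, a, b, 0b0101, 4, zeroing=False)
d2 = [99, 99, 99, 99]; pred_add(d2, a, b, 0b0101, 4, zeroing=True)
print(d1)  # [11, 99, 33, 99]
print(d2)  # [11, 0, 33, 0]
```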
+The optimisation to skip elements entirely is only possible for certain
+micro-architectures when zeroing is not set. However for lane-based
+micro-architectures this optimisation may not be practical, as it
+implies that elements end up in different "lanes". Under these
+circumstances it is perfectly fine to simply have the lanes
+"inactive" for predicated elements, even though it results in
+less than 100% ALU utilisation.
+
+## Twin-predication (based on source and destination register)
+
+Twin-predication is not that much different, except that
+the source is zero-predicated independently of the destination.
+This means that the source may be zero-predicated *or* the
+destination zero-predicated, *or both*, or neither.
+
+When, with twin-predication, zeroing is set on the source and not
+the destination, a *clear* predicate bit indicates that a zero
+data element is passed through the operation (the exception being:
+if the source data element is to be treated as an address - a LOAD -
+then the data returned *from* the LOAD is zero, rather than looking up an
+*address* of zero).
+
+When zeroing is set on the destination and not the source, then just
+as with single-predicated operations, a zero is stored into the destination
+element (or target memory address for a STORE).
+
+Zeroing on both source and destination effectively results in a bitwise
+AND of the source and destination predicates: wherever either the source
+predicate OR the destination predicate has a bit set to 0,
+a zero element will ultimately end up in the destination register.
+
+However: this may not necessarily be the case for all operations;
+implementors, particularly of custom instructions, clearly need to
+think through the implications in each and every case.
+
+Here is pseudo-code for a twin zero-predicated operation:
+
+    function op_mv(rd, rs) # MV not VMV!
+        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
+        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
+        ps, zerosrc = get_pred_val(FALSE, rs); # predication on src
+        pd, zerodst = get_pred_val(FALSE, rd); # ... AND on dest
+        for (int i = 0, int j = 0; i < VL && j < VL):
+            if (int_csr[rs].isvec && !zerosrc) while (!(ps & 1<<i)) i++;
+            if (int_csr[rd].isvec && !zerodst) while (!(pd & 1<<j)) j++;
+            if ((pd & 1<<j))
+                if ((ps & 1<<i))
+                    sourcedata = ireg[rs+i];
+                else
+                    sourcedata = 0
+                ireg[rd+j] <= sourcedata
+            else if (zerodst)
+                ireg[rd+j] <= 0
+            if (int_csr[rs].isvec)
+                i++;
+            if (int_csr[rd].isvec)
+                j++;
+            else
+                if ((pd & 1<<j))
+                    break;
+
+Note that in the instance where the destination is a scalar, the hardware
+loop is ended the moment a value *or a zero* is placed into the destination
+register/element. Also note that, for clarity, variable element widths
+have been left out of the above.
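The zeroing-on-both case described above can be illustrated with a short (non-normative) sketch. It models only vector-to-vector MV with zeroing set on *both* predicates (so the source and destination indices advance in lockstep, since no skipping occurs); the function name is invented for the example:

```python
# Non-normative model of twin zero-predication on MV: with zeroing on
# both source and destination, actual data moves only where *both*
# predicate bits are set; everywhere else a zero lands in the destination.
def twin_zpred_mv(src, ps, pd, vl):
    dest = [None] * vl
    for i in range(vl):
        if pd & (1 << i):
            # source zeroing: a zero data element where the ps bit is clear
            dest[i] = src[i] if (ps & (1 << i)) else 0
        else:
            dest[i] = 0          # destination zeroing: store zero
    return dest

src = [11, 22, 33, 44]
print(twin_zpred_mv(src, ps=0b0011, pd=0b0101, vl=4))  # [11, 0, 0, 0]
```

Only element 0, where both predicate bits are set, receives real data; every other destination element receives zero.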
+
+# Exceptions
+
+TODO: expand.
+
+# Hints
+
+With Simple-V being capable of issuing *parallel* instructions where
+rd=x0, the space for possible HINTs is expanded considerably. VL
+could be used to indicate different hints. In addition, if predication
+is set, the predication register itself could hypothetically be passed
+in as a *parameter* to the HINT operation.
+
+No specific hints are yet defined in Simple-V.
+
+# Vector Block Format <a name="vliw-format"></a>
+
+See ancillary resource: [[vblock_format]]
+
+# Subsets of RV functionality
+
+It is permitted to only implement SVprefix and not the VBLOCK instruction
+format option, and vice-versa. UNIX Platforms **MUST** raise illegal
+instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that
+traps may emulate the format.
+
+It is permitted in SVprefix to either not implement VL or not implement
+SUBVL (see [[sv_prefix_proposal]] for full details). Again, UNIX Platforms
+*MUST* raise illegal instruction on implementations that do not support
+VL or SUBVL.
+
+It is permitted to limit the size of either (or both) the register files
+down to the original size of the standard RV architecture. However,
+reducing them below the mandatory limits set in the RV standard will
+result in non-compliance with the SV Specification.
+
* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
* Status: DRAFTv0.6
* Last edited: 21 jun 2019
-* Ancillary resource: [[opcodes]] [[sv_prefix_proposal]]
+* Ancillary resource: [[opcodes]]
+* Ancillary resource: [[sv_prefix_proposal]]
+* Ancillary resource: [[abridged_spec]]
+* Ancillary resource: [[vblock_format]]
With thanks to:
# Vector Block Format <a name="vliw-format"></a>
-One issue with a former revision of SV was the setup and teardown
-time of the CSRs. The cost of the use of a full CSRRW (requiring LI)
-to set up registers and predicates was quite high. A VLIW-like format
-therefore makes sense (named VBLOCK), and is conceptually reminiscent of
-the ARM Thumb2 "IT" instruction.
-
-The format is:
-
-* the standard RISC-V 80 to 192 bit encoding sequence, with bits
- defining the options to follow within the block
-* An optional VL Block (16-bit)
-* Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
-* Optional register entries (8/16-bit blocks: see Register Table, above)
-* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.
-
-Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used
-as follows:
-
-| base+4 ... base+2 | base | number of bits |
-| ------ ----------------- | ---------------- | -------------------------- |
-| ..xxxx xxxxxxxxxxxxxxxx | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 |
-| {ops}{Pred}{Reg}{VL Block} | SV Prefix | |
-
-A suitable prefix, which fits the Expanded Instruction-Length encoding
-for "(80 + 16 times instruction-length)", as defined in Section 1.5
-of the RISC-V ISA, is as follows:
-
-| 15 | 14:12 | 11:10 | 9:8 | 7 | 6:0 |
-| - | ----- | ----- | ----- | --- | ------- |
-| vlset | 16xil | pplen | rplen | mode | 1111111 |
-
-The VL/MAXVL/SubVL Block format:
-
-| 31-30 | 29:28 | 27:22 | 21:17 - 16 |
-| - | ----- | ------ | ------ - - |
-| 0 | SubVL | VLdest | VLEN vlt |
-| 1 | SubVL | VLdest | VLEN |
-
-Note: this format is very similar to that used in [[sv_prefix_proposal]]
-
-If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e
-a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1,
-it specifies the scalar register from which VL is set by this VBLOCK
-instruction group. VL, whether set from the register or the immediate,
-is then modified (truncated) to be MIN(VL, MAXVL), and the result stored
-in the scalar register specified in VLdest. If VLdest is zero, no store
-in the regfile occurs (however VL is still set).
-
-This option will typically be used to start vectorised loops, where
-the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL"
-sequence (in compact form).
-
-When bit 15 is set to 1, MAXVL and VL are both set to the immediate,
-VLEN (again, offset by one), which is 6 bits in length, and the same
-value stored in scalar register VLdest (if that register is nonzero).
-A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will
-set MAXVL=VL= 2 and so on.
-
-This option will typically not be used so much for loops as it will be
-for one-off instructions such as saving the entire register file to the
-stack with a single one-off Vectorised and predicated LD/ST, or as a way
-to save or restore registers in a function call with a single instruction.
-
-CSRs needed:
-
-* mepcvliw
-* sepcvliw
-* uepcvliw
-* hepcvliw
-
-Notes:
-
-* Bit 7 specifies if the prefix block format is the full 16 bit format
- (1) or the compact less expressive format (0). In the 8 bit format,
- pplen is multiplied by 2.
-* 8 bit format predicate numbering is implicit and begins from x9. Thus
- it is critical to put blocks in the correct order as required.
-* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit
- (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number
- of entries are needed the last may be set to 0x00, indicating "unused".
-* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block
- immediately follows the VBLOCK instruction Prefix
-* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 15 is 1,
- otherwise 0 to 6) follow the (optional) VL Block.
-* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1,
- otherwise 0 to 6) follow the (optional) RegCam entries
-* Bits 14 to 12 (IL) define the actual length of the instruction: total
- number of bits is 80 + 16 times IL. Standard RV32, RVC and also
- SVPrefix (P48/64-\*-Type) instructions fit into this space, after the
- (optional) VL / RegCam / PredCam entries
-* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed
- format *MUST* have the RegCam and PredCam entries applied to the
- operation (and the Vectorisation loop activated)
-* P48 and P64 opcodes do **not** take their Register or predication
- context from the VBLOCK tables: they do however have VL or SUBVL
- applied (unless VLtyp or svlen are set).
-* At the end of the VBLOCK Group, the RegCam and PredCam entries
- *no longer apply*. VL, MAXVL and SUBVL on the other hand remain at
- the values set by the last instruction (whether a CSRRW or the VL
- Block header).
-* Although an inefficient use of resources, it is fine to set the MAXVL,
- VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.
-
-All this would greatly reduce the amount of space utilised by Vectorised
-instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes:
-the CSR itself, a LI, and the setting up of the value into the RS
-register of the CSR, which, again, requires a LI / LUI to get the 32
-bit data into the CSR. To get 64-bit data into the register in order
-to put it into the CSR(s), LOAD operations from memory are needed!
-
-Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM
-entries), that's potentially 6 to eight 32-bit instructions, just to
-establish the Vector State!
-
-Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more
-bits if VL needs to be set to greater than 32). Bear in mind that in SV,
-both MAXVL and VL need to be set.
-
-By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is
-only 16 bits, and as long as not too many predicates and register vector
-qualifiers are specified, several 32-bit and 16-bit opcodes can fit into
-the format. If the full flexibility of the 16 bit block formats are not
-needed, more space is saved by using the 8 bit formats.
-
-In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries
-into a VBLOCK format makes a lot of sense.
-
-Bear in mind the warning in an earlier section that use of VLtyp or svlen
-in a P48 or P64 opcode within a VBLOCK Group will result in corruption
-(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To
-avoid this situation, the STATE CSR may be copied into a temp register
-and restored afterwards.
-
-Open Questions:
-
-* Is it necessary to stick to the RISC-V 1.5 format? Why not go with
- using the 15th bit to allow 80 + 16\*0bnnnn bits? Perhaps to be sane,
- limit to 256 bits (16 times 0-11).
-* Could a "hint" be used to set which operations are parallel and which
- are sequential?
-* Could a new sub-instruction opcode format be used, one that does not
- conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes?
- no need for byte or bit-alignment
-* Could a hardware compression algorithm be deployed? Quite likely,
- because of the sub-execution context (sub-VBLOCK PC)
-
-## Limitations on instructions.
-
-To greatly simplify implementations, it is required to treat the VBLOCK
-group as a separate sub-program with its own separate PC. The sub-pc
-advances separately whilst the main PC remains pointing at the beginning
-of the VBLOCK instruction (not to be confused with how VL works, which
-is exactly the same principle, except it is VStart in the STATE CSR
-that increments).
-
-This has implications, namely that a new set of CSRs identical to xepc
-(mepc, srpc, hepc and uepc) must be created and managed and respected
-as being a sub extension of the xepc set of CSRs. Thus, xepcvliw CSRs
-must be context switched and saved / restored in traps.
-
-The srcoffs and destoffs indices in the STATE CSR may be similarly
-regarded as another sub-execution context, giving in effect two sets of
-nested sub-levels of the RISCV Program Counter (actually, three including
-SUBVL and ssvoffs).
-
-In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK,
-branches MUST be restricted to within (relative to) the block,
-i.e. addressing is now restricted to the start (and very short) length
-of the block.
-
-Also: calling subroutines is inadviseable, unless they can be entirely
-accomplished within a block.
-
-A normal jump, normal branch and a normal function call may only be taken
-by letting the VBLOCK group end, returning to "normal" standard RV mode,
-and then using standard RVC, 32 bit or P48/64-\*-type opcodes.
-
-## Links
-
-* <https://groups.google.com/d/msg/comp.arch/yIFmee-Cx-c/jRcf0evSAAAJ>
+See ancillary resource: [[vblock_format]]
# Subsets of RV functionality