# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
implementations, or SIMD, or anything else, would then benefit from
the uniformity of a consistent API.
[[!toc ]]
* To mark integer and floating-point registers as requiring "redirection"
if it is ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table
  (where the 6th bit - bank - is always set to 0 for now).
* To indicate whether, after redirection through the lookup table, the
register is a vector (or remains a scalar).
* To over-ride the implicit or explicit bitwidth that the operation would
tb[idx].packed = CSRvec[i].packed // SIMD or not
tb[idx].bank = CSRvec[i].bank // 0 (1=rsvd)
TODO: move elsewhere
# TODO: use elsewhere (retire for now)
# Instructions

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, the only instructions needed are VSETVL
and VGETVL. *All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
*all instructions from RVV are topologically re-mapped and retain their
complete functionality, intact*.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.

## Instruction Format
compare operations, *any* arithmetic, floating point or *any*
memory instructions.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV,
FCVT, and LOAD/STORE depending on CSR configurations for bitwidth
and predication. **Everything** becomes parallelised. *This includes
Compressed instructions* as well as any
future instructions and Custom Extensions.
* For analysis of RVV see [[v_comparative_analysis]] which begins to
down to the fact that predication bits fit into a single register of length
XLEN bits.
The second change is that when VSETVL is requested to be stored
into x0, it is *ignored* silently (e.g. VSETVL x0, x5, #4).

The third change is that there is an additional immediate added to VSETVL,
to which VL is set after first going through MIN-filtering.
So when using the "vsetl rs1, rs2, #vlen" instruction, it becomes:

    VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)

where RegfileLen <= MAXVECTORDEPTH < XLEN
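
The MIN-filtering rule above can be sketched as a few lines of Python.
This is a non-normative illustration; MAXVECTORDEPTH is
implementation-defined (16 is an assumed value for the example), and
`vsetvl` here models only the VL calculation, not the CSR side-effects.

```python
MAXVECTORDEPTH = 16  # assumption: RegfileLen <= MAXVECTORDEPTH < XLEN

def vsetvl(rs2_value, vlen_imm):
    """Return the Vector Length actually set: the requested #vlen
    immediate, MIN-filtered against MAXVECTORDEPTH and the loop
    counter held in rs2."""
    return min(min(vlen_imm, MAXVECTORDEPTH), rs2_value)

# A loop asking for 8 elements with 20 iterations remaining gets 8;
# with only 3 iterations remaining it gets 3.
```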
This has implications for the microarchitecture, as VL is required to be
set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
requested in the #immediate parameter. RVV has the option to set VL
to an arbitrary value that suits the conditions and the micro-architecture:
SV does *not* permit that.

The reason is so that if SV is to be used for a context-switch or as a
substitute for LOAD/STORE-Multiple, the operation can be done with only
2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
single LD/ST operation). If VL does *not* get set to the register file
length when VSETVL is called, then a software-loop would be needed.
To avoid this need, VL *must* be set to exactly what is requested
(limits notwithstanding).

Therefore, in turn, unlike RVV, implementors *must* provide
pseudo-parallelism (using sequential loops in hardware) if actual
hardware-parallelism in the ALUs is not deployed. A hybrid is also
permitted (as used in Broadcom's VideoCore-IV) however this must be
*entirely* transparent to the ISA.
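
The pseudo-parallelism requirement amounts to a strictly sequential
element loop. The following is a minimal sketch (not normative: the
register-file model, 32-bit wrap-around and the vector ADD chosen as
the example are all assumptions for illustration):

```python
def pseudo_parallel_add(regfile, rd, rs1, rs2, VL):
    """Vector ADD done as a sequential hardware-style loop over the
    standard register file: one element per step, no parallel ALUs."""
    for i in range(VL):
        regfile[rd + i] = (regfile[rs1 + i] + regfile[rs2 + i]) & 0xFFFFFFFF
    return regfile

# regs[10..13] receive regs[2..5] + regs[20..23], element-for-element
regs = list(range(32))
pseudo_parallel_add(regs, 10, 2, 20, 4)
```

An actually-parallel or hybrid implementation must produce exactly the
same architectural result as this loop.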
## Branch Instruction:

Branch operations use standard RV opcodes that are reinterpreted to
be "predicate variants" in the instance where either of the two src
registers are marked as vectors (isvector=1). When this reinterpretation
is enabled the "immediate" field of the branch operation is taken to be a
predication target register, rs3. The predicate target register rs3 is
to be treated as a bitfield (up to a maximum of XLEN bits corresponding
to a maximum of XLEN elements).

If either of src1 or src2 are scalars (isvector=0) the comparison
goes ahead as vector-scalar or scalar-vector. Implementors should note that
and whilst in ordinary branch code this is fine because the standard
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms that
becomes more of an impact. To deal with this, SV's predication has
had "invert" added to it.

[[!table data="""
31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
5 | 2 | 5 | 5 | 3 | 4 | 7 |
10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | rsvd |
10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
"""]]
complex), this becomes:
if I/F == INT: # integer type cmp
preg = int_pred_reg[rd]
reg = int_regfile
else:
preg = fp_pred_reg[rd]
reg = fp_regfile
  s1 = reg_is_vectorised(src1);
  s2 = reg_is_vectorised(src2);
  if (!s2 && !s1) goto branch;
  for (int i = 0; i < VL; ++i)
     if (cmp(s1 ? reg[src1+i] : reg[src1],
             s2 ? reg[src2+i] : reg[src2]))
        preg[rs3] |= 1<<i; # bitfield not vector
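
The comparison loop can be exercised with a small executable model.
This is illustrative only: `predicated_cmp`, the flat register-file
list and the lambda comparator are assumptions, not part of the spec.

```python
def predicated_cmp(reg, src1, src2, s1_vec, s2_vec, VL, cmp):
    """Build the rs3 predicate bitfield: one result bit per element,
    broadcasting whichever source is marked scalar."""
    rs3 = 0
    for i in range(VL):
        a = reg[src1 + i] if s1_vec else reg[src1]
        b = reg[src2 + i] if s2_vec else reg[src2]
        if cmp(a, b):
            rs3 |= 1 << i    # a bitfield is built, not a result vector
    return rs3

reg = [0, 5, 7, 5, 9, 5]
# vector reg[1..4] compared equal against scalar reg[5] (value 5):
mask = predicated_cmp(reg, 1, 5, True, False, 4, lambda a, b: a == b)
# elements 0 and 2 match, so bits 0 and 2 of the mask are set
```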
Notes:
Counter, so all of bits 25 through 30 in every case are not needed.
* There are plenty of reserved opcodes for which bits 25 through 30 could
be put to good use if there is a suitable use-case.
FLT and FLE may be inverted to FGT and FGE if needed by swapping
src1 and src2 (likewise the integer counterparts).
## Compressed Branch Instruction:

Compressed Branch instructions are likewise re-interpreted as predicated
2-register operations, with the result going into rs3. All the bits of
the immediate are re-interpreted for different purposes, to extend the
number of comparator operations to beyond the original specification,
but also to cater for floating-point comparisons as well as integer ones.

[[!table data="""
15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
funct3 | imm | rs10 | imm | | op | |
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
preg = int_pred_reg[rd]
for (int i=0; i<vl; ++i)
    if ([!]preg[rd] & 1<<i)
for (int j=0; j<seglen+1; j++)
{
if CSRvectorised[rs2])
      offs = vreg[rs2+i]
else
offs = i*(seglen+1)*stride;
vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
in advance, accordingly: other strategies are explored in the Appendix
Section "Virtual Memory Page Faults".
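
The strided-load loop above can be modelled simply. This sketch is
non-normative and deliberately simplified: seglen=0 (no segments), a
flat byte-addressed "memory" list, and a dict standing in for the
scalar register file are all assumptions for illustration.

```python
def strided_load(mem, vreg, sreg, base, rd, stride, pred, VL):
    """Predicated constant-strided load: masked-out elements are
    skipped entirely (vreg left untouched at those positions)."""
    for i in range(VL):
        if pred & (1 << i):
            vreg[rd + i] = mem[sreg[base] + i * stride]
    return vreg

mem = list(range(100, 200))
sreg = {0: 10}                 # base address 10
vreg = [0] * 8
strided_load(mem, vreg, sreg, 0, 0, 2, 0b1011, 4)
# elements 0, 1 and 3 load from mem[10], mem[12], mem[16];
# element 2 is masked out and left untouched
```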

## Vectorised Copy/Move (and conversion) instructions

There is a series of 2-operand instructions involving copying (and
alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
follow the same pattern, as it is *both* the source *and* destination
predication masks that are taken into account. This is different from
the three-operand arithmetic instructions, where the predication mask
is taken from the *destination* register, and applied uniformly to the
elements of the source register(s), element-for-element.

### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register are marked as vectors,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change:

[[!table data="""
15 12 | 11 7 | 6 2 | 1 0 |
funct4 | rd | rs | op |
4 | 5 | 5 | 2 |
C.MV | dest | src | C0 |
"""]]

A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
      ps = get_pred_val(FALSE, rs); # predication on src
      pd = get_pred_val(FALSE, rd); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;)
        if (int_vec[rs].isvector) while (!(ps & 1<<i)) i++;
        if (int_vec[rd].isvector) while (!(pd & 1<<j)) j++;
        ireg[rd+j] <= ireg[rs+i];
        if (int_vec[rs].isvector) i++;
        if (int_vec[rd].isvector) j++;

Note that:

* elwidth (SIMD) is not covered above
* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
  not covered

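
The twin-predication loop can be run directly as a Python model. This
is a non-normative sketch: the flat register list, boolean isvector
flags, explicit mask arguments, and the scalar-scalar early exit
(needed to make the model terminate) are assumptions for illustration.

```python
def op_mv(ireg, rd, rs, rd_isvec, rs_isvec, pd, ps, VL):
    """Twin-predicated move: src mask ps skips source elements,
    dest mask pd skips destination elements, independently."""
    i = j = 0
    while i < VL and j < VL:
        if rs_isvec:
            while not (ps & (1 << i)): i += 1   # skip masked-out src
        if rd_isvec:
            while not (pd & (1 << j)): j += 1   # skip masked-out dest
        ireg[rd + j] = ireg[rs + i]
        if rs_isvec: i += 1
        if rd_isvec: j += 1
        if not (rs_isvec or rd_isvec):
            break   # scalar-scalar: a single move, then stop
    return ireg

# VSPLAT: scalar src (x1) broadcast into a 4-element vector at x8
regs = [0] * 16
regs[1] = 42
op_mv(regs, 8, 1, True, False, 0b1111, 0b1111, 4)
```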
There are several different instructions from RVV that are covered by
this one opcode:

[[!table data="""
src | dest | predication | op |
scalar | vector | none | VSPLAT |
scalar | vector | destination | sparse VSPLAT |
scalar | vector | 1-bit dest | VINSERT |
vector | scalar | 1-bit? src | VEXTRACT |
vector | vector | none | VCOPY |
vector | vector | src | Vector Gather |
vector | vector | dest | Vector Scatter |
vector | vector | src & dest | Gather/Scatter |
vector | vector | src == dest | sparse VCOPY |
"""]]

Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
operations with inversion on the src and dest predication for one of the
two C.MV operations.

Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].

### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type
conversion operation of the appropriate size covering the source and
destination register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions)

### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point. When element
width in each vector is set to default, the instructions behave exactly
as they are defined for standard RV (scalar) operations, except vectorised
in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to new definitions,
and the opcode's element width is taken as indicative of the SIMD width
(if applicable i.e. if packed SIMD is requested) instead.

For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a 64-bit floating-point number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element and the higher 32-bits *also* converted to floating-point
and stored in the second. The 32 bit size comes from the fact that
FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
divide that by two it means that rs1 element width is to be taken as 32.

Similar rules apply to the destination register.

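
The FCVT.S.L packed-SIMD example can be sketched in Python. This is a
non-normative model assuming two's-complement packing, elwidth set to
default/2, and packed SIMD enabled; the helper names are illustrative.

```python
def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & (1 << 31) else x

def fcvt_s_l_packed(rs1):
    """Treat the 64-bit rs1 as two packed signed 32-bit elements and
    convert each to floating-point: rd element 0, then element 1."""
    lo = to_signed32(rs1 & 0xFFFFFFFF)
    hi = to_signed32((rs1 >> 32) & 0xFFFFFFFF)
    return [float(lo), float(hi)]

result = fcvt_s_l_packed((5 << 32) | 3)   # low element 3, high element 5
```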
# Exceptions
> What does an ADD of two different-sized vectors do in simple-V?
* Throw an exception. Whether that actually results in spawning threads
as part of the trap-handling remains to be seen.

# Under consideration <a name="issues"></a>

From the Chennai 2018 slides the following issues were raised.
Efforts to analyse and answer these questions are below.

* Should the future extra bank be included now?
* How many Register and Predication CSRs should there be?
  (and how many in RV32E)
* How many in M-Mode (for doing context-switch)?
* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
* Can CLIP be done as a CSR (mode, like elwidth)?
* SIMD saturation (etc.) also set as a mode?
* Include src1/src2 predication on Comparison Ops?
  (same arrangement as C.MV, with same flexibility/power)
* For 8/16-bit ops, is it worthwhile adding a "start offset"?
  (a bit like misaligned addressing... for registers)
  or just use predication to skip the start?

## Should the future (extra) bank be included (made mandatory)?

Expanding the *standard* register file from 32 entries per bank to
64 per bank is quite an extensive architectural change. It also has
implications for context-switching.

Therefore, on balance, it is not recommended and certainly should
not be made a *mandatory* requirement for the use of SV. SV's design
ethos is to be minimally-disruptive for implementors to shoe-horn
into an existing design.

## How large should the Register and Predication CSR key-value stores be?

This is something that definitely needs actual evaluation and for
code to be run and the results analysed. At the time of writing
(12jul2018) that is too early to tell. An approximate best-guess
however would be 16 entries.

RV32E however is a special case, given that it is highly unlikely
(but not outside the realm of possibility) that it would be used
for performance reasons, but instead for reducing instruction count.
The number of CSR entries therefore has to be considered extremely
carefully.

## How many CSR entries in M-Mode or S-Mode (for context-switching)?

The minimum required CSR entries would be 1 for each register-bank:
one for integer and one for floating-point. However, as shown
in the "Context Switch Example" section, for optimal efficiency
(minimal instructions in a low-latency situation) the CSRs for
the context-switch should be set up *and left alone*.

This means that it is not really a good idea to touch the CSRs
used for context-switching in the M-Mode (or S-Mode) trap, so
if there is ever demonstrated a need for vectors then there would
need to be *at least* one more free. However just one does not make
much sense (as one on its own only covers scalar-vector ops) so it is
more likely that at least two extra would be needed.

This applies *in addition* - in the RV32E case - if an RV32E
implementation happens also to support U/S/M modes. This would be
considered quite rare but not outside of the realm of possibility.

Conclusion: all of this needs careful analysis and future work.

## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?

On balance it's a neat idea, however it does seem to be one where the
benefits are not really clear. It would however obviate the need for
an exception to be raised if the VL runs out of registers to put
things in (gets to x31, tries a non-existent x32 and fails), however
the "fly in the ointment" is that x0 is hard-coded to "zero". The
increment therefore would need to be double-stepped to skip over x0.
Some microarchitectures could run into difficulties (SIMD-like ones
in particular) so it needs a lot more thought.
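
The double-stepped increment can be illustrated in a couple of lines.
This is purely a toy model of an idea that is under consideration, not
specified behaviour:

```python
def next_reg(x):
    """Advance a register index modulo 32, skipping x0 (hard-wired
    to zero) by double-stepping over it."""
    nxt = (x + 1) % 32
    return 1 if nxt == 0 else nxt

# walking four registers starting at x30 wraps as x30 x31 x1 x2
seq = []
r = 30
for _ in range(4):
    seq.append(r)
    r = next_reg(r)
```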

## Can CLIP be done as a CSR (mode, like elwidth)?

RVV appears to be going this way. At the time of writing (12jun2018)
it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
clip by way of exactly this method: setting a "clip mode" in a CSR.

No details are given, however the most sensible thing to have would be
to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
extra bits specifying the type of clipping to be carried out, on
a per-register basis. Other bits may be used for other purposes
(see SIMD saturation below).

## SIMD saturation (etc.) also set as a mode?

Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
may also need extra details (what the saturation maximum is, for example).

## Include src1/src2 predication on Comparison Ops?

In the C.MV (and other ops - see "C.MV Instruction"), the decision
was taken, unlike in ADD (etc.) which are 3-operand ops, to use
*both* the src *and* dest predication masks to give an extremely
powerful and flexible instruction that covers a huge number of
"traditional" vector opcodes.

The natural question therefore to ask is: where else could this
flexibility be deployed? What about comparison operations?

Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
predicated comparison operations are actually a *three* operand
instruction:

    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i

Therefore at first glance it does not make sense to use src1 and src2
predication masks, as it breaks the rule of 3-operand instructions
to use the *destination* predication register.

In this case however, the destination *is* a predication register
as opposed to being a predication mask that is applied *to* the
(vectorised) operation, element-at-a-time on src1 and src2.

Thus the question is directly inter-related to whether the modification
of the predication mask should *itself* be predicated.

It is quite complex, in other words, and needs careful consideration.

## For 8/16-bit ops, is it worthwhile adding a "start offset"?

The idea here is to make it possible, particularly in a "Packed SIMD"
case, to be able to avoid doing unaligned Load/Store operations
by specifying that operations, instead of being carried out
element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
element Packed SIMD cases.

For example, rather than take 2 32-bit registers divided into 4 8-bit
elements and have them ADDed element-for-element as follows:

    r3[0] = add r4[0], r6[0]
    r3[1] = add r4[1], r6[1]
    r3[2] = add r4[2], r6[2]
    r3[3] = add r4[3], r6[3]

an offset of 1 would result in four operations as follows, instead:

    r3[0] = add r4[1], r6[0]
    r3[1] = add r4[2], r6[1]
    r3[2] = add r4[3], r6[2]
    r3[3] = add r5[0], r6[3]

In non-packed-SIMD mode there is no benefit at all, as a vector may
be created using a different CSR that has the offset built-in. So this
leaves just the packed-SIMD case to consider.
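
The offset example above can be modelled by treating the 8-bit lanes
as one flat pool spanning consecutive registers (r4[0..3] followed by
r5[0..3], etc.). This is a non-normative sketch of the proposed
indexing, not of any specified instruction:

```python
def packed_add_offset(lanes_a, lanes_b, offset):
    """8-bit packed-SIMD ADD where src1 is read starting 'offset'
    lanes into its flat pool, exactly as in the r4[1]..r5[0] example."""
    return [(lanes_a[i + offset] + lanes_b[i]) & 0xFF
            for i in range(len(lanes_b))]

a = [10, 11, 12, 13, 20]   # r4[0..3] followed by r5[0]
b = [1, 2, 3, 4]           # r6[0..3]
result = packed_add_offset(a, b, 1)
# result pairs a[1..4] with b[0..3]
```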

Two ways in which this could be implemented / emulated (without special
hardware):

* bit-manipulation that shuffles the data along by one byte (or one word)
  either prior to or as part of the operation requiring the offset.
* just use an unaligned Load/Store sequence, even if there are performance
  penalties for doing so.

The question then is whether the performance hit is worth the extra hardware
involving byte-shuffling/shifting the data by an arbitrary offset. On
balance, given that there are two reasonable instruction-based options, the
hardware-offset option should be left out for the initial version of SV,
with the option to consider it in an "advanced" version of the specification.

# Implementing V on top of Simple-V
With Simple-V converting the original RVV draft concept-for-concept
<https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/g3feFnAoKIM>

## Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>

**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
length on all regs**. This section kept for historical reasons.

So the issue is as follows:

* CSRs are used to set the "span" of a vector (how many of the standard
  register file to contiguously use)
* VSETVL in RVV works as follows: it sets the vector length (a copy of which
  is placed in a dest register), and if the "required" length is longer
  than the *available* length, the dest reg is set to the MIN of those
  two.
* **HOWEVER**... in SV, *EVERY* vector register has its own separate
  length and thus there is no way (at the time that VSETVL is called) to
  know what to set the vector length *to*.
* At first glance it seems that it would be perfectly fine to just limit
  the vector operation to the length specified in the destination
  register's CSR, at the time that each instruction is issued...
  except that that cannot possibly be guaranteed to match
  with the value *already loaded into the target register from VSETVL*.

Therefore a different approach is needed.

Possible options include:

* Removing the CSR "Vector Length" and always using the value from
  VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
  destreg equal to MIN(counterreg, lenimmed), with register-based
  variant "VSETVL destreg, counterreg, lenreg" doing the same.
* Keeping the CSR "Vector Length" and having the lenreg version have
  a "twist": "if lenreg is vectorised, read the length from the CSR"
* Other (TBD)

The first option (of the ones brainstormed so far) is a lot simpler.
It does however mean that the length set in VSETVL will apply across-the-board
to all src1, src2 and dest vectorised registers until it is otherwise changed
(by another VSETVL call). This is probably desirable behaviour.

## Implementation Paradigms <a name="implementation_paradigms"></a>
TODO: assess various implementation paradigms. These are listed roughly