+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose anything from an
+absolute bare minimum level of compliance with the "API" (software traps
+when vectorisation is detected) all the way up to full supercomputing-level,
+all-hardware parallelism. Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as the vsetl value, but are an integral
+  part of the vsetl calculation.
+* Actual vector length is *multiplied* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. The first entry is whether predication
+is enabled. The second entry is whether the register index refers to a
+floating-point or an integer register. The third entry is the index
+of that register which is to be predicated (if referred to). The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6 | 5 | (4..0) | (4..0) |
+| ----- | - | - | ------- | ------- |
+| r0 | pren0 | i/f | regidx | predidx |
+| r1 | pren1 | i/f | regidx | predidx |
+| .. | pren.. | i/f | regidx | predidx |
+| r15 | pren15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    int_pred_enabled = [0] * 32  # one flag per integer register
+    fp_pred_enabled  = [0] * 32  # one flag per floating-point register
+    for i in range(16):
+        if CSRpred[i].pren:
+            idx = CSRpred[i].regidx
+            predidx = CSRpred[i].predidx
+            if CSRpred[i].type == 0: # integer
+                int_pred_enabled[idx] = 1
+                int_pred_reg[idx] = predidx
+            else:
+                fp_pred_enabled[idx] = 1
+                fp_pred_reg[idx] = predidx
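+
+As a cross-check, the inversion above can be written as runnable Python.
+This is a sketch only: `CSRpred` entries are modelled as plain dicts with
+`pren`, `type`, `regidx` and `predidx` fields, not as a real CSR encoding.
+
```python
# Invert the 16-entry Predication CSR key-value table into per-register
# lookup state, as in the pseudo-code above. All names are illustrative.

def invert_pred_csr(CSRpred):
    int_pred_enabled = [0] * 32
    fp_pred_enabled = [0] * 32
    int_pred_reg = [0] * 32
    fp_pred_reg = [0] * 32
    for ent in CSRpred:
        if not ent["pren"]:
            continue  # entry disabled: contributes no state
        idx, predidx = ent["regidx"], ent["predidx"]
        if ent["type"] == 0:  # integer destination register
            int_pred_enabled[idx] = 1
            int_pred_reg[idx] = predidx
        else:                 # floating-point destination register
            fp_pred_enabled[idx] = 1
            fp_pred_reg[idx] = predidx
    return int_pred_enabled, int_pred_reg, fp_pred_enabled, fp_pred_reg

# one entry: integer r5 is predicated, mask taken from integer r9
table = [{"pren": 1, "type": 0, "regidx": 5, "predidx": 9}]
ien, ireg, fen, freg = invert_pred_csr(table)
```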
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ (d ? vreg[rd][i] : sreg[rd]) =
+ iop(s1 ? vreg[rs1][i] : sreg[rs1],
+ s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store:
+
+ if type(iop) == INT:
+ pred_enabled = int_pred_enabled
+ preg = int_pred_reg[rd]
+ else:
+ pred_enabled = fp_pred_enabled
+ preg = fp_pred_reg[rd]
+
+ for (int i=0; i<vl; ++i)
+      if (pred_enabled[rd] && [!]preg[i])
+ (d ? vreg[rd][i] : sreg[rd]) =
+ iop(s1 ? vreg[rs1][i] : sreg[rs1],
+ s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+## MAXVECTORDEPTH
+
+MAXVECTORDEPTH is the same concept as MVL in RVV. However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers. This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Vector-length CSRs
+
+Vector lengths are interpreted as meaning "any instruction referring to
+r(N) generates implicit identical instructions referring to registers
+r(N+1) through r(N+M-1), where M is the Vector Length". Vector Lengths
+may be set to use up to 16 registers in the register file.
+
+One separate CSR table is needed for each of the integer and floating-point
+register files:
+
+| RegNo | (3..0) |
+| ----- | ------ |
+| r0 | vlen0 |
+| r1 | vlen1 |
+| .. | vlen.. |
+| r31 | vlen31 |
+
+An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
+whether a register is, when referred to in any standard instruction,
+implicitly to be treated as a vector (and, if so, of what length).
+
+Note:
+
+* A vector length of 1 indicates that the register is to be treated as
+  a scalar. Bitwidths (on the same register) are still interpreted and
+  meaningful.
+* A vector length of 0 indicates that the parallelism is to be switched
+ off for this register (treated as a scalar). When length is 0,
+ the bitwidth CSR for the register is *ignored*.
+
+Internally, implementations may choose to use the non-zero vector length
+to set a bit-field per register, to be used in the instruction decode phase.
+In this way any standard (current or future) operation involving
+register operands may detect if the operation is to be vector-vector,
+vector-scalar or scalar-scalar (standard) simply through a single
+bit test.
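+
+A sketch of that single-bit decode test, under the assumption that a
+register's bit is set when its CSR vector length exceeds 1:
+
```python
# Maintain a per-regfile bitfield of "vectorised" registers so that decode
# can classify an operation with two single-bit tests.
# Assumption: a bit is set when the CSR vector length is greater than 1.

def update_vec_bitfield(vlen_csr):
    bits = 0
    for regno, vl in enumerate(vlen_csr):
        if vl > 1:
            bits |= 1 << regno
    return bits

def classify(bits, rs1, rs2):
    v1 = bool(bits & (1 << rs1))
    v2 = bool(bits & (1 << rs2))
    if v1 and v2:
        return "vector-vector"
    if v1 or v2:
        return "vector-scalar"
    return "scalar-scalar"

vlen_csr = [1] * 32
vlen_csr[4] = 8                    # only r4 is marked as a vector
bits = update_vec_bitfield(vlen_csr)
```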
+
+Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
+bitwidth is specifically not set) it becomes:
+
+ CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+
+This is in contrast to RVV:
+
+ CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
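+
+The difference can be illustrated with a short sketch (MAXVECTORDEPTH and
+the register values are arbitrary illustrations):
+
```python
# Contrast the two vsetl semantics above. In Simple-V the *CSR vector
# length* of register rs1 participates; in RVV it is the *value held
# in* rs1.
MAXVECTORDEPTH = 64

def simplev_vsetl(CSRvectorlen, rs1, rs2_value):
    return min(min(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2_value)

def rvv_vsetl(rs1_value, rs2_value):
    return min(min(rs1_value, MAXVECTORDEPTH), rs2_value)

CSRvectorlen = [0] * 32
CSRvectorlen[3] = 10               # r3 configured as a 10-long vector
```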
+
+## Element (SIMD) bitwidth CSRs
+
+Element bitwidths may be specified with a per-register CSR, and indicate
+how a register (integer or floating-point) is to be subdivided.
+
+| RegNo | (2..0) |
+| ----- | ------ |
+| r0 | vew0 |
+| r1 | vew1 |
+| .. | vew.. |
+| r31 | vew31 |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default |
+| 001 | 8 |
+| 010 | 16 |
+| 011 | 32 |
+| 100 | 64 |
+| 101 | 128 |
+| 110 | rsvd |
+| 111 | rsvd |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
+
+Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
+into account, it becomes:
+
+ vew = CSRbitwidth[rs1]
+ if (vew == 0)
+ bytesperreg = (XLEN/8) # or FLEN as appropriate
+ else:
+ bytesperreg = bytestable[vew] # 1 2 4 8 16
+ simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+ vlen = CSRvectorlen[rs1] * simdmult
+ CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
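+
+A runnable rendering of the pseudo-code above (RV64 assumed, so XLEN=64;
+MAXVECTORDEPTH is set to XLEN per the earlier section):
+
```python
# Runnable rendering of the vsetl-with-bitwidth pseudo-code above.
XLEN = 64
MAXVECTORDEPTH = XLEN
bytestable = {0b001: 1, 0b010: 2, 0b011: 4, 0b100: 8, 0b101: 16}

def vsetl(CSRbitwidth, CSRvectorlen, rs1, rs2_value):
    vew = CSRbitwidth[rs1]
    if vew == 0:
        bytesperreg = XLEN // 8    # "default": one element per register
    else:
        bytesperreg = bytestable[vew]
    simdmult = (XLEN // 8) // bytesperreg
    vlen = CSRvectorlen[rs1] * simdmult
    return min(min(vlen, MAXVECTORDEPTH), rs2_value)

bw = [0] * 32
vl = [0] * 32
bw[5] = 0b001                      # r5 subdivided into 8-bit elements
vl[5] = 4                          # vector length: 4 registers
```
+
+With 8-bit elements in 64-bit registers, simdmult is 8, so a 4-register
+vector yields 32 individually-predicatable elements.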
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
+# Instructions
+
+By being a topological remap of RVV concepts, the following RVV instructions
+remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT, VMERGE, VSELECT,
+VSLIDE, VCLASS and VPOPC. Two instructions, VCLIP and VCLIPI, do not
+have RV Standard equivalents, so are left out of Simple-V.
+All other instructions from RVV are topologically re-mapped and retain
+their complete functionality, intact.
+
+## Instruction Format
+
+The instruction format for Simple-V does not actually add *any* explicit
+compare, arithmetic, floating-point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on CSR configurations for vector length, bitwidth and
+predication. *This includes Compressed instructions* as well as any
+future instructions and Custom Extensions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+ outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism may therefore effectively end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction),
+with some opportunities to use Compressed instructions bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit (when combined with C)
+and having the benefit of being explicit.*
+
+## Branch Instruction:
+
+Branch operations use standard RV opcodes that are reinterpreted to be
+"predicate variants" in the instance where either of the two src registers
+have their corresponding CSRvectorlen[src] entry as non-zero. When this
+reinterpretation is enabled the predicate target register rs3 is to be
+treated as a bitfield (up to a maximum of XLEN bits corresponding to a
+maximum of XLEN elements).
+
+If either src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector. Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register cacheing
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src registers
+the operation is treated as an absolutely standard scalar branch operation.
+
+This is the overloaded table for Integer-base Branch operations. Opcode
+(bits 6..0) is set in all cases to 1100011.
+
+[[!table data="""
+31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+7 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 100 | predicate rs3 || BLE |
+reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+"""]]
+
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLT, BGT, BLEU and BGTU may be synthesised by swapping
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a 0 in its "rd" integer register
+target: i.e. it's not actually a Branch at all, it's a compare.
+The target simply needs to change to be a predication bitfield (done
+implicitly).
+
+As with Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
+Likewise Single-precision (fmt, bits 26..25) is still set to 00.
+Double-precision is still set to 01, whilst Quad-precision
+appears not to have a definition in V2.3-Draft (but should be unaffected).
+
+It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
+and whilst in ordinary branch code this is fine because the standard
+RVF compare can always be followed up with an integer BEQ or a BNE (or
+a compressed comparison to zero or non-zero), in predication terms that
+becomes more of an impact as an explicit (scalar) instruction is needed
+to invert the predicate bitmask. An additional encoding funct3=011 is
+therefore proposed to cater for this.
+
+[[!table data="""
+31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
+funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
+5 | 2 | 5 | 5 | 3 | 5 | 7 |
+10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
+10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | FNE |
+10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
+10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
+"""]]
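+
+The saving can be illustrated with a sketch: without FNE, the FEQ
+bitfield has to be inverted (and masked to vl bits) as an extra scalar
+step.
+
```python
# Why an FNE encoding helps: without it, the FEQ result bitfield must be
# inverted by a separate scalar instruction, masked to vl bits.

def feq_mask(a, b, vl):
    m = 0
    for i in range(vl):
        if a[i] == b[i]:
            m |= 1 << i
    return m

def fne_mask(a, b, vl):
    m = 0
    for i in range(vl):
        if a[i] != b[i]:
            m |= 1 << i
    return m

a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 9.0, 3.0, 8.0]
vl = 4
# the extra scalar step that FNE avoids:
inverted = ~feq_mask(a, b, vl) & ((1 << vl) - 1)
```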
+
+Note (**TBD**): floating-point exceptions will need to be extended
+to cater for multiple exceptions (and statuses of the same). The
+usual approach is to have an array of status codes and bit-fields,
+and one exception, rather than throw separate exceptions for each
+Vector element.
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+ s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+ if I/F == INT: # integer type cmp
+ pred_enabled = int_pred_enabled # TODO: exception if not set!
+ preg = int_pred_reg[rd]
+ reg = int_regfile
+ else:
+ pred_enabled = fp_pred_enabled # TODO: exception if not set!
+ preg = fp_pred_reg[rd]
+ reg = fp_regfile
+
+ s1 = CSRvectorlen[src1] > 1;
+ s2 = CSRvectorlen[src2] > 1;
+ for (int i=0; i<vl; ++i)
+ preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+ s2 ? reg[src2+i] : reg[src2]);
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering") setting Vector-Length times (number of SIMD elements) bits
+ in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+ Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+ be put to good use if there is a suitable use-case.
+* FEQ and FNE (and BEQ and BNE) are included in order to save one
+ instruction having to invert the resultant predicate bitfield.
+ FLT and FLE may be inverted to FGT and FGE if needed by swapping
+ src1 and src2 (likewise the integer counterparts).
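+
+A runnable sketch of the adapted compare pseudo-code above (integer
+variant only, SIMD bitwidth ignored, and the predicate destination rs3
+written directly into the integer regfile as a bitfield):
+
```python
# Predicated compare: each element comparison sets one bit of the
# predicate register rs3. Scalar operands do not advance per-element.

def vcompare(reg, CSRvectorlen, rs3, src1, src2, vl, cmp):
    s1 = CSRvectorlen[src1] > 1    # src1 vectorised?
    s2 = CSRvectorlen[src2] > 1    # src2 vectorised?
    result = 0
    for i in range(vl):
        a = reg[src1 + i] if s1 else reg[src1]
        b = reg[src2 + i] if s2 else reg[src2]
        if cmp(a, b):
            result |= 1 << i
    reg[rs3] = result              # predicate bitfield lands in rs3

reg = list(range(32))              # r0..r31 hold 0..31
vlen = [1] * 32
vlen[8] = 4                        # src1 = r8..r11 (values 8,9,10,11)
vcompare(reg, vlen, 1, 8, 10, 4, lambda a, b: a < b)  # scalar src2 = r10
```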
+
+## Compressed Branch Instruction:
+
+[[!table data="""
+15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
+funct3 | imm | rs10 | imm | rs20 | op | |
+3 | 3 | 3 | 2 | 3 | 2 | |
+C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
+110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
+111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
+110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
+111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5, 13, 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
+* In both floating-point and integer cases there are four predication
+ comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+ src1 and src2).
+
+## LOAD / STORE Instructions
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] |||| rs1 | funct3 | rd | opcode |
+1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format. Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+ (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction, it's
+ implicit based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+ is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
+
+ if (unit-strided) stride = elsize;
+    else stride = sreg[rs2]; // constant-strided
+
+ pred_enabled = int_pred_enabled
+ preg = int_pred_reg[rd]
+
+ for (int i=0; i<vl; ++i)
+      if (pred_enabled[rd] && [!]preg[i])
+ for (int j=0; j<seglen+1; j++)
+ {
+          if (CSRvectorised[rs2])
+ offs = vreg[rs2][i]
+ else
+ offs = i*(seglen+1)*stride;
+ vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+ }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
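+
+A runnable sketch of the three addressing modes (seglen and predication
+omitted for brevity; memory is modelled as a word-addressed list, so
+elsize is one cell):
+
```python
# Vectorised LOAD sketch: unit-stride (LD), constant-stride (LD.S) and
# indexed (LD.X, offsets taken from a vector). Illustrative only.

def vload(mem, base_addr, vl, stride=1, index_vec=None):
    out = []
    for i in range(vl):
        if index_vec is not None:  # LD.X: offsets come from a vector
            offs = index_vec[i]
        else:                      # LD / LD.S: unit or constant stride
            offs = i * stride
        out.append(mem[base_addr + offs])
    return out

mem = [m * 10 for m in range(32)]  # mem[a] == a*10
unit    = vload(mem, 4, 4)                    # mem[4..7]
strided = vload(mem, 0, 3, stride=4)          # mem[0], mem[4], mem[8]
indexed = vload(mem, 0, 3, index_vec=[7, 1, 5])
```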
+
+A similar instruction exists for STORE, with identical topological
+translation of all features. **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, where bits 2-4 are
+a src register instead of dest:
+
+[[!table data="""
+15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
+funct3 | imm | rs10 | imm | rd0 | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled. In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
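+
+The implicit widths above can be captured in a small table; for a
+hypothetical unit-strided vectorised variant, the stride would then
+simply equal the width:
+
```python
# Implicit operand widths (in bytes) for Compressed LOADs, as stated
# above. "unit_stride" is a hypothetical helper, not a real encoding.
C_LOAD_WIDTH = {
    "C.LW": 4, "C.LD": 8, "C.LQ": 16,   # integer compressed loads
    "C.FLW": 4, "C.FLD": 8,             # floating-point compressed loads
}

def unit_stride(mnemonic):
    return C_LOAD_WIDTH[mnemonic]
```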
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+In Section 12.3 of the riscv-isa manual V2.3-draft, the comments on
+page 76 note that "For virtual memory systems some data accesses could
+be resident in physical memory and some not". The interesting question
+then arises: how does RVV deal with the exact same scenario?
+Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
+of detecting early page / segmentation faults and adjusting the TLB
+in advance, accordingly: other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
+
+# Exceptions
+
+> What does an ADD of two different-sized vectors do in simple-V?
+
+* if the two source operands are not the same, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+ than the destination, throw an exception.
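+
+A sketch of the two rules (modelled as returned status strings rather
+than real traps; names hypothetical):
+
```python
# Length-mismatch rules for a vector ADD, per the two bullets above.
# In hardware these would raise traps; here they return status strings.

def vadd_length_check(len_dst, len_src1, len_src2, dst_is_vector):
    if len_src1 != len_src2:
        return "exception: source vector lengths differ"
    if dst_is_vector and len_src1 > len_dst:
        return "exception: source longer than vector destination"
    return "ok"
```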
+
+> And what about instructions like JALR?
+> What does jumping to a vector do?
+
+* Throw an exception. Whether that actually results in spawning threads
+ as part of the trap-handling remains to be seen.
+
+# Implementing V on top of Simple-V
+
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential. This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+ implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+ opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+ of Simple-V as, again, there is no corresponding RV Standard Extension
+ that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+ complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+ of fitting on top of the integer and floating-point files. An
+ "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+ registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+ register file size is 5 bits (32 registers), whilst the "Extended"
+ variant of RVV specifies 8 bits (256 registers) and has yet to
+ be published.
+* One big difference: in Sections 17.12 and 17.17 there are only two
+  possible predication registers in RVV "Base". Through the "indirect" method,
+ Simple-V provides a key-value CSR table that allows (arbitrarily)
+ up to 16 (TBD) of either the floating-point or integer registers to
+ be marked as "predicated" (key), and if so, which integer register to
+ use as the predication mask (value).
+
+**TODO**
+
+# Implementing P (renamed to DSP) on top of Simple-V
+
+* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
+ (caveat: anything not specified drops through to software-emulation / traps)
+* TODO
+
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
+
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
+### [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes to multiple independent registers) meaning
+  that SRAM or 2R1W can be used for the entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is
+  specified much more explicitly and directly in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+ need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+ not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+ but with the down-side that they're an all-or-nothing part of the Extension.
+ No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+ parallelisation can be carried out, followed by further parallel Lane-based
+ work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+ is to drop data into memory and immediately back in again (like MMX).
+
+### Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware. It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity (if full parallelism is to be exploited)
+ of having to use register renames, OoO, VLIW, register file cacheing,
+ all of which has been done before but is a pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+ saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+ parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+ a CSR register to indicate vector length, a separate one to indicate
+ that it is a predicate register and so on) means a little more setup
+ time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+ approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+ operations not suited to parallelisation may be carried out interleaved
+ between parallelised instructions *without* requiring data to be dropped
+ down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+ files means that huge parallel workloads would use up considerable
+ chunks of the register file. However in the case of RV64 and 32-bit
+ operations, that effectively means 64 slots are available for parallel
+ operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+ be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel). Consistent "API" regardless of actual internal parallelism:
+  even an in-order single-issue implementation with a single ALU would still
+  appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area). At worst it
+ would be "no worse" than existing register renaming, OoO, VLIW and register
+ file cacheing schemes.
+
+### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+ streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+ (similar to Alt-RVP) may be used as an implementation detail,
+ using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no feasible reuse of RTL
+  (ALUs, pipeline stages).
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+ do not gain parallelism, resulting in prolific duplication of functionality
+ inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+ using the standard integer or FP regfile) an entire vector must be
+ transferred out to memory, into standard regfiles, then back to memory,
+ then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is. May
+ be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+ vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+ implementation time and die area, meaning that adoption is likely only
+ to be in high-performance specialist supercomputing (where it will
+ be absolutely superb).
+
+### Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance. Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+ at once. Parallelism is inherent at the ALU, making the addition of
+ SIMD-style parallelism an easy decision that has zero significant impact
+ on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+ therefore result in superb throughput, easily achieved even with a very
+ simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+ increase instruction count on what would otherwise be a "simple loop",
+ should the number of elements in an array not happen to exactly match
+ the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+ are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+ are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+ dimension and parallelism (width): an at least O(N^2) and quite probably
+ O(N^3) ISA proliferation that often results in several thousand
+ separate instructions. all requiring separate and distinct corner-case
+ algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+ 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+ For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+ four separate and distinct instructions: one for (r1:low r2:high),
+ one for (r1:high r2:low), one for (r1:high r2:high) and one for
+ (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+ between operand and result bit-widths. In combination with high/low
+ proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+ that allow control over individual elements within the SIMD block.
+
+## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+### [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+ a SIMD architecture where the ALU becomes responsible for the parallelism,
+ Alt-RVP ALUs would likewise be so responsible... with *additional*
+ (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
+ at least one dimension are avoided (architectural upgrades introducing
+ 128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+ SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+ of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+ be able to subdivide the bits of each register lane (columns) down into
+ arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+ 16-bit or even 8-bit, effectively dividing the registerfile into
+ Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
+ "swapping" instructions were then introduced, some of the disadvantages
+ of SIMD could be mitigated.
+
+### RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+ parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+ DSPs with a focus on Multimedia (Audio, Video and Image processing),
+ RVV's primary focus appears to be on Supercomputing: optimisation of
+ mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+ into a SIMD instruction requires an equivalent to be added to the
+ RVV Extension, if one does not exist. Given the specialist nature of
+ some SIMD instructions (8-bit or 16-bit saturated or halving add),
+ this possibility seems extremely unlikely to occur, even if the
+ implementation overhead of RVV were acceptable (compared to
+ normal SIMD/DSP-style single-issue in-order simplicity).
+
+### Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+ topologically transplant every single instruction from RVV (as
+ designed) into Simple-V equivalents, with *zero loss of functionality
+ or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+  Extension containing the basic primitives (non-parallelised
+  8, 16 or 32-bit SIMD operations) inherently *becomes* parallel,
+  automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
+  four 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+  with RVV the corner-cases and ISA proliferation of SIMD are avoided.
+* Whilst not entirely finalised, registers are expected to be
+ capable of being subdivided down to an implementor-chosen bitwidth
+ in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+ and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+ choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+ ALUs that perform twin 8-bit operations as they see fit, or anything
+ else including no subdivisions at all.
+* Even though implementors have that choice (up to and including full
+  64-bit SIMD with RV64), they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+ of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+ can be done by applying *Parallelised* Bit-manipulation operations
+ followed by parallelised *straight* versions of element-to-element
+ arithmetic operations, even if the bit-manipulation operations require
+ changing the bitwidth of the "vectors" to do so. Predication can
+ be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD (massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit) is avoided with Simple-V, through
+ vector-style parallelism being dropped on top of 8-bit or 16-bit
+ operations, all the while keeping a consistent ISA-level "API" irrespective
+ of implementor design choices (or indeed actual implementations).
+
+### Example Instruction translation: <a name="example_translation"></a>
+
+The instruction "ADD r2 r4 r4" (with r4 marked as a vector of length 3
+and r2 as a scalar) would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
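+The expansion above can be sketched as a decode-phase loop. This is an
+illustrative model only (the function name and argument layout are
+assumptions, not part of the proposal): the source register is tagged as
+a vector of length 3, the destination as a scalar, and only vectorised
+registers have their index advanced per element.

```python
def expand(opcode, rd, rs1, rs2, vlen, rd_is_vec, rs1_is_vec, rs2_is_vec):
    """Expand one instruction into the scalar micro-ops pushed to the FIFO."""
    fifo = []
    for i in range(vlen):
        # a register's index only advances if it is tagged as a vector
        fifo.append((opcode,
                     rd  + (i if rd_is_vec  else 0),
                     rs1 + (i if rs1_is_vec else 0),
                     rs2 + (i if rs2_is_vec else 0)))
    return fifo
```

+With rd=r2 scalar and rs1=rs2=r4 a vector of length 3, this reproduces the
+three micro-ops listed above.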
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+ register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+ register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+ register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+ register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+       # note that this is ADD, not PADD
+       int i, id, irs1, irs2;
+       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+       # also destination makes no sense as a scalar but what the hell...
+       for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+       {
+          if (CSRpredicate[predr][i]) # skip element if predicate bit clear
+             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+          # now increment the idxs (only vectorised registers advance)
+          if (CSRreg_is_vectorised[rd])  # bitfield check rd, scalar/vector?
+             id += 1;
+          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+             irs1 += 1;
+          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+             irs2 += 1;
+       }
+    }
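+The pseudocode above can be re-stated as directly executable Python, as
+a sketch only: the CSR and register-file names are illustrative, and the
+same simplifications apply (no length cross-checks, destination assumed
+sensible).

```python
XLEN = 32
x = [0] * 32                 # integer register file
CSR_vlen = [1] * 32          # per-register vector length
CSR_pred = [0xFFFF] * 32     # per-register predicate bitmasks (all-ones)
CSR_is_vec = [False] * 32    # scalar/vector tag per register

def op_add(rd, rs1, rs2, predr):
    # ADD, not PADD: each register's offset advances per element only
    # when that register is tagged as a vector
    id_, irs1, irs2 = 0, 0, 0
    for i in range(CSR_vlen[rd]):
        if (CSR_pred[predr] >> i) & 1:    # skip element if bit clear
            x[rd + id_] = x[rs1 + irs1] + x[rs2 + irs2]
        if CSR_is_vec[rd]:
            id_ += 1
        if CSR_is_vec[rs1]:
            irs1 += 1
        if CSR_is_vec[rs2]:
            irs2 += 1
```

+A vector/vector/scalar mix then behaves as described: a vector rd and rs1
+advance element by element, while a scalar rs2 is re-read each iteration.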
+
+## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
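+A sketch of that reinterpretation (the names and single-register
+predicate layout are assumptions): when the destination is tagged as a
+predication register, only the **bit** at the current vector element
+index is written, rather than the whole register.

```python
def write_compare_result(pred, i, condition_holds, is_predication_reg):
    # normal FEQ/FLT/FLE behaviour: write 1 or 0 to the whole register
    if not is_predication_reg:
        return 1 if condition_holds else 0
    # tagged as a predication register: set or clear only bit i,
    # where i is the current vector element index
    if condition_holds:
        return pred | (1 << i)
    return pred & ~(1 << i)
```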
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
+imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+"""]]
+
+This would become:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is
+presently the "immediate offset" field to add in a bit field
+distinguishing floating-point compare from integer compare, to add in
+a second source register, and also to use some of the bits as a
+predication target.
+
+[[!table data="""
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | op |
+3 | 3 | 3 | 5 | 2 |
+C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+"""]]
+
+The retro-fitted version uses the CS format:
+
+[[!table data="""
+15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators. In both the floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping src1 and src2).
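+The synthesis of the missing comparisons can be illustrated as plain
+operand swapping (a sketch, not a definitive encoding):

```python
def compare(op, a, b):
    # only EQ/NEQ/LT/LE need distinct encodings; GT and GE are
    # obtained by swapping the source operands of LT and LE
    if op == "GT":
        return compare("LT", b, a)
    if op == "GE":
        return compare("LE", b, a)
    return {"EQ": a == b, "NEQ": a != b, "LT": a < b, "LE": a <= b}[op]
```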
+
+## Register reordering <a name="register_reordering"></a>
+
+### Register File
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (31..0) |
+| r1 | (31..0) |
+| r2 | (31..0) |
+| r3 | (31..0) |
+| r4 | (31..0) |
+| r5 | (31..0) |
+| r6 | (31..0) |
+| r7 | (31..0) |
+| .. | (31..0) |
+| r31 | (31..0) |
+
+### Vectorised CSR
+
+This may not be an actual CSR: it may be generated from the Vector Length
+CSR, as a single bit is less burdensome on the instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
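+One reading consistent with the bit pattern shown (only r0 and r4 set,
+matching the Vector Length CSR table below) is that the single bit is
+generated as "vector length greater than 1". A sketch, assuming that
+interpretation:

```python
def vectorised_bits(vlen_csr):
    # one bit per register, set when that register holds a genuine
    # vector (length > 1); length 0 or 1 is treated as scalar here
    bits = 0
    for regnum, vlen in enumerate(vlen_csr):
        if vlen > 1:
            bits |= 1 << regnum
    return bits
```

+Feeding in the r0..r7 lengths (2, 0, 1, 1, 3, 0, 0, 1) reproduces the
+bit pattern in the table above.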
+
+### Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
+
+### Virtual Register Reordering
+
+This example assumes the above Vector Length CSR table.
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (31..0) | (31..0) | |
+| r2 | (31..0) | | |
+| r3 | (31..0) | | |
+| r4 | (31..0) | (31..0) | (31..0) |
+| r7 | (31..0) | | |
+
+### Bitwidth Virtual Register Reordering
+
+This example goes a little further and illustrates the effect that a
+bitwidth CSR has been set on a register. Preconditions:
+
+* RV32 assumed
+* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
+* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
+* vsetl rs1, 5 # set the vector length to 5
+
+This is interpreted as follows:
+
+* Given that the context is RV32, ELEN=32.
+* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
+* Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length 5 therefore the last "element" is skipped
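+The interpretation above can be sketched numerically (a model only; the
+function name is illustrative):

```python
def effective_vlen(xlen_bits, elem_bits, csr_vlen, vsetl):
    # number of SIMD elements packed into one XLEN-wide register
    simd_per_reg = xlen_bits // elem_bits     # 32 / 16 = 2
    # the Vector Length CSR counts registers, so multiply by the
    # packing factor to get the maximum element count
    max_elems = csr_vlen * simd_per_reg       # 3 * 2 = 6
    # vsetl caps how many elements are actually processed
    return min(max_elems, vsetl)
```

+For the preconditions above this gives min(3 * 2, 5) = 5 elements, with
+the sixth (r4 bits 31..16) skipped.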
+
+So when using an operation that uses r2 as a source (or destination)
+the operation is carried out as follows:
+
+* 16-bit operation on r2(15..0) - vector element index 0
+* 16-bit operation on r2(31..16) - vector element index 1
+* 16-bit operation on r3(15..0) - vector element index 2
+* 16-bit operation on r3(31..16) - vector element index 3
+* 16-bit operation on r4(15..0) - vector element index 4
+* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
+
+Predication has been left out of the above example for simplicity; in
+practice the predicate mask is ANDed with the length-derived cut-off
+(the skipping that occurs when vsetl is less than the maximum capacity).
+
+Note also that it is entirely an implementor's choice as to whether to have
+actual separate ALUs down to the minimum bitwidth, or whether to have something
+more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
+operations carried out 32-bits at a time is perfectly acceptable, as is
+8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
+Regardless of the internal parallelism choice, *predication must
+still be respected*, making Simple-V in effect the "consistent public API".
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)  # integer-div rounded down
+        elemidx = i % simdmult        # element index within the register
+        elembits = bytesperreg * 8    # element bitwidth from bytestable
+        return rnum + regidx,                 # actual real register
+               elemidx * elembits,            # low bit
+               elemidx * elembits + elembits - 1, # high bit
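+A directly executable version of the indexing sketch, with the low/high
+bit positions derived from the element bitwidth (an assumption consistent
+with the r2 walkthrough earlier, where element index 1 occupies bits
+31..16):

```python
def regoffs(rnum, i, elem_bits=16, xlen=32):
    # number of elements packed per XLEN-wide register
    simdmult = xlen // elem_bits
    regidx = i // simdmult        # which real register holds element i
    elemidx = i % simdmult        # which slot within that register
    lo = elemidx * elem_bits
    return rnum + regidx, lo, lo + elem_bits - 1
```

+Applying it to the 16-bit r2 example reproduces the five element
+positions listed in the walkthrough.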
+
+### Insights
+
+SIMD register file splitting is still to be considered. For RV64, the
+benefits of doubling (quadrupling in the case of Half-Precision IEEE754
+FP) the apparent size of the floating-point register file to 64 (128 in
+the case of HP) seem pretty clear and worth the complexity.
+
+With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
+are done on 64-bit registers, it is not so conceptually difficult. It may
+even be achieved by *actually* splitting the regfile into 64 virtual
+32-bit registers, such that a 64-bit FP scalar operation is dropped into
+(r0.H r0.L) tuples. The implementation is therefore hidden through
+register renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches). Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
+
+It could indeed have been logically deduced (or expected) that there
+would be additional decode latency in this proposal, because in
+overloading the opcodes to have different meanings there is guaranteed
+to be some state, somewhere, directly related to registers.
+
+There are several cases:
+
+* All operands vector-length=1 (scalars), all operands
+ packed-bitwidth="default": instructions are passed through direct as if
+ Simple-V did not exist. Simple-V is, in effect, completely disabled.
+* At least one operand vector-length > 1, all operands
+ packed-bitwidth="default": any parallel vector ALUs placed on "alert",
+ virtual parallelism looping may be activated.
+* All operands vector-length=1 (scalars), at least one
+ operand packed-bitwidth != default: degenerate case of SIMD,
+ implementation-specific complexity here (packed decode before ALUs or
+ *IN* ALUs)
+* At least one operand vector-length > 1, at least one operand
+  packed-bitwidth != default: parallel vector ALUs (if any)
+  placed on "alert", virtual parallelism looping may be activated,
+  implementation-specific SIMD complexity kicks in (packed decode before
+  ALUs or *IN* ALUs).
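+The four cases can be sketched as a decode-stage classifier (illustrative
+only; the names are assumptions):

```python
def decode_class(vlens, bitwidths, default=0):
    # vlens / bitwidths: per-operand vector length and packed bitwidth
    any_vec = any(vl > 1 for vl in vlens)
    any_pack = any(bw != default for bw in bitwidths)
    if not any_vec and not any_pack:
        return "scalar passthrough"   # Simple-V effectively disabled
    if any_vec and not any_pack:
        return "vector looping"       # parallel ALUs on "alert"
    if not any_vec and any_pack:
        return "degenerate SIMD"      # packed decode before or in ALUs
    return "vector looping + SIMD decode"
```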
+
+Bear in mind that the proposal includes that the decision whether
+to parallelise in hardware or whether to virtual-parallelise (to
+dramatically simplify compilers and also not to run into the SIMD
+instruction proliferation nightmare) *or* a transparent combination
+of both, be done on a *per-operand basis*, so that implementors can
+specifically choose to create an application-optimised implementation
+that they believe (or know) will sell extremely well, without having
+"Extra Standards-Mandated Baggage" that would otherwise blow their area
+or power budget completely out the window.
+
+Additionally, two possible CSR schemes have been proposed, in order to
+greatly reduce CSR space:
+
+* per-register CSRs (vector-length and packed-bitwidth)
+* a smaller number of CSRs with the same information but with an *INDEX*
+ specifying WHICH register in one of three regfiles (vector, fp, int)
+ the length and bitwidth applies to.
+
+(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
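+An illustrative data-layout comparison of the two schemes (field names
+and entry values are assumptions, not proposed encodings): the
+per-register scheme always carries one entry per register, whereas the
+indexed scheme stores entries only for registers that are actually
+vectorised, so empty slots need never exist.

```python
# scheme 1: one CSR per register (simple decode, 2 * 32 entries of state)
per_reg_vlen = [0] * 32
per_reg_bitwidth = [0] * 32

# scheme 2: a small indexed table; each entry names the regfile and the
# register it applies to (regfile, regnum, vlen, bitwidth), so only the
# registers in active use consume any state at all
indexed = [
    ("int", 2, 3, 16),
    ("fp",  4, 8, 32),
]
```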
+
+In addition, LOAD/STORE has its own associated proposed CSRs that
+mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
+V (and Hwacha).
+
+Also bear in mind that, for reasons of simplicity for implementors,
+I was coming round to the idea of permitting implementors to choose
+exactly which bitwidths they would like to support in hardware and which
+to allow to fall through to software-trap emulation.
+
+So the question boils down to:
+
+* whether either (or both) of those two CSR schemes have significant
+ latency that could even potentially require an extra pipeline decode stage
+* whether there are implementations that can be thought of which do *not*
+ introduce significant latency
+* whether it is possible to explicitly (through quite simply
+ disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
+ all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
+ the extreme of skipping an entire pipeline stage (if one is needed)
+* whether packed bitwidth and associated regfile splitting is so complex
+ that it should definitely, definitely be made mandatory that implementors
+ move regfile splitting into the ALU, and what are the implications of that
+* whether even if that *is* made mandatory, is software-trapped
+ "unsupported bitwidths" still desirable, on the basis that SIMD is such
+ a complete nightmare that *even* having a software implementation is
+ better, making Simple-V have more in common with a software API than
+ anything else.
+
+Whilst the above may seem to be severe minuses, there are some strong
+pluses:
+
+* Significant reduction of V's opcode space: over 95%.
+* Smaller reduction of P's opcode space: around 10%.
+* The potential to use Compressed instructions in both Vector and SIMD
+ due to the overloading of register meaning (implicit vectorisation,
+ implicit packing)
+* Not only present but also future extensions automatically gain parallelism.
+* Already mentioned but worth emphasising: the simplification to compiler
+ writers and assembly-level writers of having the same consistent ISA
+ regardless of whether the internal level of parallelism (number of
+ parallel ALUs) is only equal to one ("virtual" parallelism), or is
+ greater than one, should not be underestimated.
+
+## Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy; note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower; and at the top of page 5 they
+propose a scheme where you look ahead by only 2 instructions to determine
+which registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+## Overflow registers in combination with predication