-# SIMD / Simple-V Extension Proposal
+# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
+
+[[!toc ]]
+
+# Summary
+
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+# Introduction
This proposal exists so as to be able to satisfy several disparate
-requirements: area-conscious designs and performance-conscious designs.
-Also, the existing P (SIMD) proposal and the V (Vector) proposals,
+requirements: power-conscious, area-conscious, and performance-conscious
+designs all pull an ISA and its implementation in different conflicting
+directions, as do the specific intended uses for any given implementation.
+
+Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
-* Both contain duplication of pre-existing RISC-V instructions
+* Clearly independent in their origins (AndeStar V3 and Cray respectively)
+ so need work to adapt to the RISC-V ethos and paradigm
+* Are sufficiently large so as to make adoption (and exploration for
+ analysis and review purposes) prohibitively expensive
+* Both contain partial duplication of pre-existing RISC-V instructions
+ (an undesirable characteristic)
* Both have independent and disparate methods for introducing parallelism
at the instruction level.
* Both require that their respective parallelism paradigm be implemented
- along-side their respective functionality *or not at all*.
-* Both independently have methods for introducing parallelism that could,
- if separated, benefit *other areas of RISC-V not just DSP and Floating-point*.
+  alongside and integral to their respective functionality *or not at all*.
+* Both independently have methods for introducing parallelism that
+ could, if separated, benefit
+ *other areas of RISC-V not just DSP or Floating-point respectively*.
Therefore it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
implementors with the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power
-and area. If that can be offered even on a per-operation basis that
-would provide even more flexibility.
+and/or area (and if that can be offered even on a per-operation basis that
+would provide even more flexibility).
+
+Additionally it makes sense to *split out* the parallelism inherent within
+each of P and V, and to see if each of P and V then, in *combination* with
+a "best-of-both" parallelism extension, could be added on *on top* of
+this proposal, to topologically provide the exact same functionality of
+each of P and V.
+
+Furthermore, an additional goal of this proposal is to reduce the number
+of opcodes utilised by each of P and V as they currently stand, leveraging
+existing RISC-V opcodes where possible, and also potentially allowing
+P and V to make use of Compressed Instructions as a result.
+
+**TODO**: reword this to better suit this document:
+
+Having looked at both P and V as they stand, they're _both_ very much
+"separate engines" that, despite both their respective merits and
+extremely powerful features, don't really cleanly fit into the RV design
+ethos (or the flexible extensibility) and, as such, are both in danger
+of not being widely adopted. I'm inclined towards recommending:
+
+* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
+* splitting out the polymorphism, esoteric data types (GF, complex
+ numbers) and unusual operations of V to create a single-issue "Esoteric
+ Floating-Point" extension
+* splitting out the loop-aspects, vector aspects and data-width aspects
+ of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
+ apply across *all* Extensions, whether those be DSP, M, Base, V, P -
+ everything.
+
+**TODO**: propose overflow registers be actually one of the integer regs
+(flowing to multiple regs).
+
+**TODO**: propose "mask" (predication) registers likewise. combination with
+standard RV instructions and overflow registers extremely powerful
+
+## CSRs marking registers as Vector
+
+A 32-bit CSR would be needed (1 bit per integer register) to indicate
+whether a register was, if referred to, implicitly to be treated as
+a vector.
+
+A second 32-bit CSR would be needed (1 bit per floating-point register)
+to indicate whether a floating-point register was to be treated as a
+vector.
+
+In this way any standard (current or future) operation involving
+register operands may detect if the operation is to be vector-vector,
+vector-scalar or scalar-scalar (standard) simply through a single
+bit test.
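+
+A minimal sketch of that decode-phase test, in C (names hypothetical:
+neither CSRvectorised nor reg_is_vector appear in any spec):
+
+    #include <stdint.h>
+
+    uint32_t CSRvectorised; /* proposed CSR: bit N set => reg N is a vector */
+
+    static int reg_is_vector(int regnum) {
+        return (CSRvectorised >> regnum) & 1;
+    }
+
+    /* vector-vector, vector-scalar or scalar-scalar then falls out of
+       testing reg_is_vector(rs1) and reg_is_vector(rs2) independently */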
+
+## CSR vector-length and CSR SIMD packed-bitwidth
+
+**TODO** analyse each of these:
+
+* splitting out the loop-aspects, vector aspects and data-width aspects
+* integer reg 0 *and* fp reg0 share CSR vlen 0 *and* CSR packed-bitwidth 0
+* integer reg 1 *and* fp reg1 share CSR vlen 1 *and* CSR packed-bitwidth 1
+* ....
+* ....
+
+instead:
+
+* CSR vlen 0 *and* CSR packed-bitwidth 0 register contain extra bits
+ specifying an *INDEX* of WHICH int/fp register they refer to
+* CSR vlen 1 *and* CSR packed-bitwidth 1 register contain extra bits
+ specifying an *INDEX* of WHICH int/fp register they refer to
+* ...
+* ...
+
+Have to be very *very* careful about not implementing too few of those
+(or too many). Assess implementation impact on decode latency. Is it
+worth it?
+
+Implementation of the latter:
+
+Operation involving (referring to) register M:
+
+ bitwidth = default # default for opcode?
+ vectorlen = 1 # scalar
+
+    for (o = 0; o < 2; o++)
+ if (CSR-Vector_registernum[o] == M)
+ bitwidth = CSR-Vector_bitwidth[o]
+ vectorlen = CSR-Vector_len[o]
+ break
+
+and for the former it would simply be:
+
+ bitwidth = CSR-Vector_bitwidth[M]
+ vectorlen = CSR-Vector_len[M]
+
+Alternatives:
+
+* One single "global" vector-length CSR
+
+## Stride
+
+**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular
+register as being "if you use this reg in LOAD/STORE, use the offset
+amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous".
+can be used for matrix spanning.
+
+> For LOAD/STORE, could a better option be to interpret the offset in the
+> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is
+> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12),
+> t5 = *(t2+24), t6 = *(t2+36)? Perhaps include a bit in the
+> vector-control CSRs to select between offset-as-stride and unit-stride
+> memory accesses?
+
+So there would be an instruction like this:
+
+| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
+| ------ | ----- | ----------------- | ----------------- | ---------------- |
+| opcode | 5 bit | 1 bit             | 1 bit             | 5 bit, OFFn=XLEN |
+
+
+which would mean:
+
+* CSR-Offset register n <= (float|int) register number N
+* CSR-Offset Stride-mode = offset or unit
+* CSR-Offset amount register n = contents of register M
+
+LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
+
+ offs = 0
+ stride = 1
+ vector-len = CSR-Vector-length register N
+
+    for (o = 0; o < 2; o++)
+ if (CSR-Offset register o == M)
+ offs = CSR-Offset amount register o
+ if CSR-Offset Stride-mode == offset:
+ stride = ldoffs
+ break
+
+    for (i = 0; i < vector-len; i++)
+ r[N+i] = mem[(offs*i + r[M+i])*stride]
# Analysis and discussion of Vector vs SIMD
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times.
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
+
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However when looking at much more comprehensive schemes such as
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, tending to suggest that ZOLC may be
+better off being proposed as an entirely separate Extension.
+
+## Mask and Tagging (Predication)
+
+Tagging (aka Masks aka Predication) is a pseudo-method of implementing
+simplistic branching in a parallel fashion, by allowing execution on
+elements of a vector to be switched on or off depending on the results
+of prior operations in the same array position.
+
+The reason for considering this is simple: by *definition* it
+is not possible to perform individual parallel branches in a SIMD
+(Single-Instruction, **Multiple**-Data) context. Branches (modifying
+of the Program Counter) will result in *all* parallel data having
+a different instruction executed on it: that's just the definition of
+SIMD, and it is simply unavoidable.
+
+So these are the ways in which conditional execution may be implemented:
+
+* explicit compare and branch: BNE x, y -> offs would jump offs
+ instructions if x was not equal to y
+* explicit store of tag condition: CMP x, y -> tagbit
+* implicit (condition-code) ADD results in a carry, carry bit implicitly
+ (or sometimes explicitly) goes into a "tag" (mask) register
+
+The first of these is a "normal" branch method, which is flat-out impossible
+to parallelise without look-ahead and effectively rewriting instructions.
+This would defeat the purpose of RISC.
+
+The latter two are where parallelism becomes easy to do without complexity:
+every operation is modified to be "conditionally executed" (in an explicit
+way directly in the instruction format *or* implicitly).
+
+RVV (Vector-Extension) proposes to have *explicit* storing of the compare
+in a tag/mask register, and to *explicitly* have every vector operation
+*require* that its operation be "predicated" on the bits within an
+explicitly-named tag/mask register.
+
+SIMD (P-Extension) has not yet published precise documentation on what its
+schema is to be: there is however verbal indication at the time of writing
+that:
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
+> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
+> be executed using the same compare ALU logic for the base ISA with some
+> minor modifications to handle smaller data types. The function will not
+> be duplicated.
-## Mask and Tagging
+This is an *implicit* form of predication as the base RV ISA does not have
+condition-codes or predication. By adding a CSR it becomes possible
+to also tag certain registers as "predicated if referenced as a destination".
+Example:
-*TODO: research masks as they can be superb and extremely powerful.
-If B-Extension is implemented and provides Bit-Gather-Scatter it
-becomes really cool and easy to switch out certain indexed values
-from an array of data, but actually BGS **on its own** might be
-sufficient. Bottom line, this is complex, and needs a proper analysis.
-The other sections are pretty straightforward.*
+ // in future operations if r0 is the destination use r5 as
+ // the PREDICATION register
+    IMPLICITCSRPREDICATE r0, r5
+ // store the compares in r5 as the PREDICATION register
+ CMPEQ8 r5, r1, r2
+ // r0 is used here. ah ha! that means it's predicated using r5!
+ ADD8 r0, r1, r3
+
+With enough registers (and there are enough registers) some fairly
+complex predication can be set up and yet still execute without significant
+stalling, even in a simple non-superscalar architecture.
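+
+A sketch of the execution semantics the ADD8 above would then take on
+(eliding the 8-bit packing, and assuming purely for illustration one
+predication bit per vector element):
+
+    // x[] = integer regfile; r0 vectorised with length vlen,
+    // r5 tagged as r0's predication register
+    for (int i = 0; i < vlen; i++) {
+        if ((x[5] >> i) & 1)                 // bit i of predication reg r5
+            x[0 + i] = x[1 + i] + x[3 + i];  // element i executes
+        // else: element i is skipped, left unmodified
+    }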
+
+### Retro-fitting Predication into branch-explicit ISA
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
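+
+A sketch of that reinterpretation (reg_is_predication is a hypothetical
+helper testing the proposed CSR tag bit):
+
+    for (int i = 0; i < vlen; i++) {
+        int cond = (f[rs1 + i] == f[rs2 + i]);   // FEQ on element i
+        if (reg_is_predication(rd))              // rd tagged as predication
+            x[rd] = (x[rd] & ~(1u << i)) | ((uint32_t)cond << i); // set BIT i
+        else
+            x[rd + i] = cond;                    // standard FEQ: 0/1 result
+    }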
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
+imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+"""]]
+
+This would become:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
+imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret that to add in not only a bit
+field to distinguish between floating-point compare and integer compare,
+not only to add in a second source register, but also use some of the bits as
+a predication target as well.
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+ funct3 | imm | rs10 | imm | op |
+ 3 | 3 | 3 | 5 | 2 |
+ C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+"""]]
+
+Now uses the CS format:
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+ funct3 | imm | rs10 | imm | | op |
+ 3 | 3 | 3 | 2 | 3 | 2 |
+ C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct comparison operators.
## Conclusions
-In the above sections the four different ways where parallel instruction
+In the above sections the five different ways in which parallel instruction
execution has closely and loosely inter-related implications for the ISA and
for implementors were outlined. The pluses and minuses came out as
follows:
* Fixed vs variable parallelism: <b>variable</b>
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>TODO</b>
+* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Tag or no-tag: <b>Complex but highly beneficial</b>
+
+In particular:
+
+* variable-length vectors came out on top because of the high setup and
+  teardown costs, and the corner-cases, associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+ former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simplify implementations
+ by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+ potentially 100% occupied in a single-issue in-order implementation
+ i.e. *without* requiring a super-scalar or out-of-order architecture,
+ but doing a proper, full job (ZOLC) is an entirely different matter.
+
+Constructing a SIMD/Simple-Vector proposal based around four of these five
+requirements would therefore seem to be a logical thing to do.
+
+# Instruction Format
+
+**TODO** *basically borrow from both P and V, which should be quite simple
+to do, with the exception of Tag/no-tag, which needs a bit more
+thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
+gather-scatterer, and, if implemented, could actually be a really useful
+way to span 8-bit up to 64-bit groups of data, where BGS as it stands
+and described by Clifford does **bits** of up to 16 width. Lots to
+look at and investigate!*
+
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance. In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
+Parallelism". They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD) implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler. Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+straightforward. All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
+
+The "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
+
+# Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+ register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+ register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+ register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+ register x[32][XLEN];
+
+ function op_add(rd, rs1, rs2, predr)
+ {
+ /* note that this is ADD, not PADD */
+ int i, id, irs1, irs2;
+ # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+ # also destination makes no sense as a scalar but what the hell...
+      for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+      {
+         if (CSRpredicate[predr][i]) # i *think* this is right...
+            x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+         # now increment the idxs
+         if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
+            id += 1;
+         if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+            irs1 += 1;
+         if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+            irs2 += 1;
+      }
+ }
+
+# V-Extension to Simple-V Comparative Analysis
+
+This section covers the ways in which Simple-V is comparable
+to, or more flexible than, V-Extension (V2.3-draft). Also covered is
+one major weak-point (register files are fixed size, where V is
+arbitrary length), and how best to deal with that, should V be adapted
+to be on top of Simple-V.
+
+The first stages of this section go over each of the sections of V2.3-draft V
+where appropriate.
+
+## 17.3 Shape Encoding
+
+Simple-V's proposed means of expressing whether a register (from the
+standard integer or the standard floating-point file) is a scalar or
+a vector is its vector length: a vector length of 1 is a scalar. The
+instruction setting the vector-length would however have to specify
+which register file (integer or FP) that the vector-length was to be
+applied to.
+
+Extended shapes (2-D etc) would not be part of Simple-V at all.
+
+## 17.4 Representation Encoding
+
+Simple-V would not have representation-encoding. This is part of
+polymorphism, which is considered too complex to implement (TODO: confirm?).
+
+## 17.5 Element Bitwidth
+
+This is directly equivalent to Simple-V's "Packed", and implies that
+integer (or floating-point) registers are divided down into vector-indexable
+chunks of size Bitwidth.
+
+In this way it becomes possible to have ADD effectively and implicitly
+turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where
+vector-length has been set to greater than 1, it becomes a "Packed"
+(SIMD) instruction.
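+
+A sketch of that implicit packed interpretation for an integer ADD on
+an XLEN=64 implementation (CSRbitwidth is a hypothetical per-register
+CSR; x[] is the integer regfile):
+
+    #include <stdint.h>
+
+    void packed_add(uint64_t x[], const int CSRbitwidth[],
+                    int rd, int rs1, int rs2) {
+        int bw = CSRbitwidth[rd];             // 8, 16, 32 or 64
+        int nelems = 64 / bw;                 // packed chunks per register
+        uint64_t mask = (bw == 64) ? ~0ULL : ((1ULL << bw) - 1);
+        uint64_t result = 0;
+        for (int i = 0; i < nelems; i++) {
+            uint64_t a = (x[rs1] >> (i * bw)) & mask;
+            uint64_t b = (x[rs2] >> (i * bw)) & mask;
+            result |= ((a + b) & mask) << (i * bw); // carries stay in-chunk
+        }
+        x[rd] = result;
+    }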
+
+It remains to be decided what should be done when RV32 / RV64 ADD (sized)
+opcodes are used. One useful idea would be, on an RV64 system where
+a 32-bit-sized ADD was performed, to simply use the least significant
+32-bits of the register (exactly as is currently done) but at the same
+time to *respect the packed bitwidth as well*.
+
+The extended encoding (Table 17.6) would not be part of Simple-V.
+
+## 17.6 Base Vector Extension Supported Types
+
+TODO: analyse. probably exactly the same.
+
+## 17.7 Maximum Vector Element Width
+
+No equivalent in Simple-V
+
+## 17.8 Vector Configuration Registers
+
+TODO: analyse.
+
+## 17.9 Legal Vector Unit Configurations
+
+TODO: analyse
+
+## 17.10 Vector Unit CSRs
+
+TODO: analyse
+
+> Ok so this is an aspect of Simple-V that I hadn't thought through,
+> yet (proposal / idea only a few days old!). in V2.3-Draft ISA Section
+> 17.10 the CSRs are listed. I note that there's some general-purpose
+> CSRs (including a global/active vector-length) and 16 vcfgN CSRs. i
+> don't precisely know what those are for.
+
+> In the Simple-V proposal, *every* register in both the integer
+> register-file *and* the floating-point register-file would have at
+> least a 2-bit "data-width" CSR and probably something like an 8-bit
+> "vector-length" CSR (less in RV32E, by exactly one bit).
+
+> What I *don't* know is whether that would be considered perfectly
+> reasonable or completely insane. If it turns out that the proposed
+> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
+> adding somewhere in the region of 10 bits per register would be... okay?
+> I really don't honestly know.
+
+> Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
+> be multi-ported? No I don't believe they would.
+
+## 17.11 Maximum Vector Length (MVL)
+
+Basically, implicitly, this is set to the number of registers in the
+register file multiplied by the number of 8-bit packed ints that can
+fit into a register (4 for RV32, 8 for RV64 and 16 for RV128).
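+
+i.e. as a one-liner (assuming the standard 32-entry register file):
+
+    MVL = 32 * (XLEN / 8)   // RV32: 128, RV64: 256, RV128: 512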
+
+## 17.12 Vector Instruction Formats
+
+No equivalent in Simple-V because *all* instructions of *all* Extensions
+are implicitly parallelised (and packed).
+
+## 17.13 Polymorphic Vector Instructions
+
+Polymorphism (implicit type-casting) is deliberately not supported
+in Simple-V.
+
+## 17.14 Rapid Configuration Instructions
+
+TODO: analyse if this is useful to have an equivalent in Simple-V
+
+## 17.15 Vector-Type-Change Instructions
+
+TODO: analyse if this is useful to have an equivalent in Simple-V
+
+## 17.16 Vector Length
+
+Has a direct corresponding equivalent.
+
+## 17.17 Predicated Execution
+
+Predicated Execution is another name for "masking" or "tagging". Masked
+(or tagged) implies that there is a bit field which is indexed, and each
+bit associated with the corresponding indexed offset register within
+the "Vector". If the tag / mask bit is 1, when a parallel operation is
+issued, the indexed element of the vector has the operation carried out.
+However if the tag / mask bit is *zero*, that particular indexed element
+of the vector does *not* have the requested operation carried out.
+
+In V2.3-draft V, there is a significant (not recommended) difference:
+the zero-tagged elements are *set to zero*. This loses a *significant*
+advantage of mask / tagging, particularly if the entire mask register
+is itself a general-purpose register, as that general-purpose register
+can be inverted, shifted, and'ed, or'ed and so on. In other words
+it becomes possible, especially if Carry/Overflow from each vector
+operation is also accessible, to do conditional (step-by-step) vector
+operations including things like turn vectors into 1024-bit or greater
+operands with very few instructions, by treating the "carry" from
+one instruction as a way to do "Conditional add of 1 to the register
+next door". If V2.3-draft V sets zero-tagged elements to zero, such
+extremely powerful techniques are simply not possible.
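+
+As an illustrative sketch of the technique (hypothetical mnemonics:
+SETPRED and the carry-bits-into-mask behaviour are assumptions, and
+cascading carry ripple is glossed over):
+
+    # r8-r11 and r16-r19 each hold a 256-bit integer (vector-length 4)
+    ADD     r8, r8, r16   # vectorised: 4x 64-bit adds, carries into mask t0
+    SLLI    t0, t0, 1     # carry out of element i now targets element i+1
+    SETPRED r8, t0        # hypothetical: next op on r8 predicated by t0
+    ADDI    r8, r8, 1     # "conditional add of 1 to the register next door"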
+
+It is noted that there is no mention of an equivalent to BEXT (element
+skipping) which would be particularly fascinating and powerful to have.
+In this mode, the "mask" would skip elements where its mask bit was zero
+in either the source or the destination operand.
+
+Lots to be discussed.
+
+## 17.18 Vector Load/Store Instructions
+
+The Vector Load/Store instructions as proposed in V are extremely powerful
+and can be used for reordering and regular restructuring.
+
+Vector Load:
+
+ if (unit-strided) stride = elsize;
+ else stride = areg[as2]; // constant-strided
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ for (int j=0; j<seglen+1; j++)
+ vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];
+
+Store:
+
+ if (unit-strided) stride = elsize;
+ else stride = areg[as2]; // constant-strided
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ for (int j=0; j<seglen+1; j++)
+ mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];
+
+Indexed Load:
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ for (int j=0; j<seglen+1; j++)
+ vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];
+
+Indexed Store:
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ for (int j=0; j<seglen+1; j++)
+ mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];
+
+Keeping these instructions as-is for Simple-V is highly recommended.
+However: one of the goals of this Extension is to retro-fit (re-use)
+existing RV Load/Store:
+
+[[!table data="""
+31 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:0] | rs1 | funct3 | rd | opcode |
+ 12 | 5 | 3 | 5 | 7 |
+ offset[11:0] | base | width | dest | LOAD |
+"""]]
+
+[[!table data="""
+31 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
+ 7 | 5 | 5 | 3 | 5 | 7 |
+ offset[11:5] | src | base | width | offset[4:0] | STORE |
+"""]]
+
+The RV32 instruction opcodes as follows:
+
+[[!table data="""
+31 27 | 26 25 | 24 20 | 19 15 | 14 | 13 12 | 11 7 | 6 0 | op |
+imm[4:0] | 00 | 00000 | rs1 | 1 | m | vd | 0000111 | VLD |
+imm[4:0] | 01 | rs2 | rs1 | 1 | m | vd | 0000111 | VLDS |
+imm[4:0] | 11 | vs2 | rs1 | 1 | m | vd | 0000111 | VLDX |
+vs3 | 00 | 00000 | rs1 | 1 | m | imm[4:0] | 0100111 | VST |
+vs3 | 01 | rs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTS |
+vs3 | 11 | vs2 | rs1 | 1 | m | imm[4:0] | 0100111 | VSTX |
+"""]]
+
+Conversion on LOAD as follows:
+
+* rd or rs1 are CSR-vectorised indicating "Vector Mode"
+* rd equivalent to vd
+* rs1 equivalent to rs1
+* imm[4:0] from RV format (11..7) is the same
+* imm[9:5] from RV format (29..25) is rs2 (rs2=00000 for VLD)
+* imm[11:10] from RV format (31..30) is the opcode (VLD, VLDS, VLDX)
+* width from RV format (14..12) is the same (width and zero/sign extend)
+
+[[!table data="""
+31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] ||| rs1 | funct3 | rd | opcode |
+2 | 5 | 5 | 5 | 3 | 5 | 7 |
+00 | 00000 | imm[4:0] | base | width | dest | LOAD |
+01 | rs2 | imm[4:0] | base | width | dest | LOAD.S |
+11 | rs2 | imm[4:0] | base | width | dest | LOAD.X |
+"""]]
+
+Similar conversion on STORE as follows:
+
+[[!table data="""
+31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] ||| rs1 | funct3 | rd | opcode |
+2 | 5 | 5 | 5 | 3 | 5 | 7 |
+00 | 00000 | src | base | width | offs[4:0] | STORE |
+01 | rs3 | src | base | width | offs[4:0] | STORE.S |
+11 | rs3 | src | base | width | offs[4:0] | STORE.X |
+"""]]
+
+Notes:
+
+* Predication CSR-marking register is not explicitly shown in instruction
+* In both LOAD and STORE, it is now possible to mark rs2 (or rs3) as a vector.
+* That in turn means that Indexed Load need not have an explicit opcode
+* That in turn means that bit 30 may indicate "stride" and bit 31 is free
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:0] |||| rs1 | funct3 | rd | opcode |
+ 1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+ ? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+Where in turn the pseudo-code may now combine the two:
+
+ if (unit-strided) stride = elsize;
+ else stride = areg[as2]; // constant-strided
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ for (int j=0; j<seglen+1; j++)
+ {
+	    if (CSRvectorised[rs2])
+ offs = vreg[rs2][i]
+ else
+ offs = i*(seglen+1)*stride;
+ vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
+ }
+
+Notes:
+
+* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
+* There may be more sophisticated variants involving the 31st bit, however
+ it would be nice to reserve that bit for post-increment of address registers
+
+## 17.19 Vector Register Gather
+
+TODO
+
+## TODO, sort
+
+> However, there are also several features that go beyond simply attaching VL
+> to a scalar operation and are crucial to being able to vectorize a lot of
+> code. To name a few:
+> - Conditional execution (i.e., predicated operations)
+> - Inter-lane data movement (e.g. SLIDE, SELECT)
+> - Reductions (e.g., VADD with a scalar destination)
+
+ Ok so the Conditional and also the Reductions is one of the reasons
+ why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
+ of a decent name) i proposed that it be implemented as "if you say r0
+ is to be a vector / SIMD that means operations actually take place on
+ r0,r1,r2... r(N-1)".
+
+ Consequently any parallel operation could be paused (or... more
+ specifically: vectors disabled by resetting it back to a default /
+ scalar / vector-length=1) yet the results would actually be in the
+ *main register file* (integer or float) and so anything that wasn't
+ possible to easily do in "simple" parallel terms could be done *out*
+ of parallel "mode" instead.
+
+ I do appreciate that the above does imply that there is a limit to the
+ length that SimpleV (whatever) can be parallelised, namely that you
+ run out of registers! my thought there was, "leave space for the main
+ V-Ext proposal to extend it to the length that V currently supports".
+ Honestly i had not thought through precisely how that would work.
+
+ Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that,
+ it reminds me of the discussion with Clifford on bit-manipulation
+ (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
+ applied "globally and outside of V and P" SLIDE and SELECT might become
+  an extremely powerful way to do fast memory copy and reordering [2].
+
+ However I haven't quite got my head round how that would work: i am
+ used to the concept of register "tags" (the modern term is "masks")
+ and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
+ STORE you would get the exact same thing as SELECT.
+
+ SLIDE you could do simply by setting say r0 vector-length to say 16
+ (meaning that if referred to in any operation it would be an implicit
+ parallel operation on *all* registers r0 through r15), and temporarily
+ set say.... r7 vector-length to say... 5. Do a LOAD on r7 and it would
+ implicitly mean "load from memory into r7 through r11". Then you go
+ back and do an operation on r0 and ta-daa, you're actually doing an
+  operation on a SLID (SLIDed?) vector.
+
+ The advantage of Simple-V (whatever) over V would be that you could
+ actually do *operations* in the middle of vectors (not just SLIDEs)
+ simply by (as above) setting r0 vector-length to 16 and r7 vector-length
+ to 5. There would be nothing preventing you from doing an ADD on r0
+ (which meant do an ADD on r0 through r15) followed *immediately in the
+ next instruction with no setup cost* a MUL on r7 (which actually meant
+ "do a parallel MUL on r7 through r11").
+
+ btw it's worth mentioning that you'd get scalar-vector and vector-scalar
+ implicitly by having one of the source register be vector-length 1
+ (the default) and one being N > 1. but without having special opcodes
+ to do it. i *believe* (or more like "logically infer or deduce" as
+ i haven't got access to the spec) that that would result in a further
+ opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.
+
+ Also, Reduction *might* be possible by specifying that the destination be
+ a scalar (vector-length=1) whilst the source be a vector. However... it
+ would be an awful lot of work to go through *every single instruction*
+ in *every* Extension, working out which ones could be parallelised (ADD,
+ MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth
+ the effort? maybe. Would it result in huge complexity? probably.
+ Could an implementor just go "I ain't doing *that* as parallel!
+ let's make it virtual-parallelism (sequential reduction) instead"?
+ absolutely. So, now that I think it through, Simple-V (whatever)
+ covers Reduction as well. huh, that's a surprise.
+
+
+> - Vector-length speculation (making it possible to vectorize some loops with
+> unknown trip count) - I don't think this part of the proposal is written
+> down yet.
+
+ Now that _is_ an interesting concept. A little scary, i imagine, with
+ the possibility of putting a processor into a hard infinite execution
+ loop... :)
+
+
+> Also, note the vector ISA consumes relatively little opcode space (all the
+> arithmetic fits in 7/8ths of a major opcode). This is mainly because data
+> type and size is a function of runtime configuration, rather than of opcode.
+
+ yes. i love that aspect of V, i am a huge fan of polymorphism [1]
+ which is why i am keen to advocate that the same runtime principle be
+ extended to the rest of the RISC-V ISA [3]
+
+ Yikes that's a lot. I'm going to need to pull this into the wiki to
+ make sure it's not lost.
+
+[1] inherent data type conversion: 25 years ago i designed a hypothetical
+hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
+(escape-extended) opcodes and 2-bit (escape-extended) operands that
+only required a fixed 8-bit instruction length. that relied heavily
+on polymorphism and runtime size configurations as well. At the time
+I thought it would have meant one HELL of a lot of CSRs... but then I
+met RISC-V and was cured instantly of that delusion^Wmisapprehension :)
+
+[2] Interestingly if you then also add in the other aspect of Simple-V
+(the data-size, which is effectively functionally orthogonal / identical
+to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
+operations become byte / half-word / word augmenters of B-Ext's proposed
+"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
+LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it
+would get really REALLY interesting would be masked-packed-vectored
+B-Ext BGS instructions. I can't even get my head fully round that,
+which is a good sign that the combination would be *really* powerful :)
+
+[3] ok sadly maybe not the polymorphism, it's too complicated and I
+think would be much too hard for implementors to easily "slide in" to an
+existing non-Simple-V implementation. i say that despite really *really*
+wanting IEEE 754 FP Half-precision to end up somewhere in RISC-V in some
+fashion, for optimising 3D Graphics. *sigh*.
+
+## TODO: analyse, auto-increment on unit-stride and constant-stride
+
+so i thought about that for a day or so, and wondered if it would be
+possible to propose a variant of zero-overhead loop that included
+auto-incrementing the two address registers a2 and a3, as well as
+providing a means to interact between the zero-overhead loop and the
+vsetvl instruction. a sort-of pseudo-assembly of that would look like:
+
+ # a2 to be auto-incremented by t0 times 4
+ zero-overhead-set-auto-increment a2, t0, 4
+    # a3 to be auto-incremented by t0 times 4
+ zero-overhead-set-auto-increment a3, t0, 4
+ zero-overhead-set-loop-terminator-condition a0 zero
+ zero-overhead-set-start-end stripmine, stripmine+endoffset
+ stripmine:
+ vsetvl t0,a0
+ vlw v0, a2
+ vlw v1, a3
+ vfma v1, a1, v0, v1
+ vsw v1, a3
+ sub a0, a0, t0
+ stripmine+endoffset:
+
+the question is: would something like this even be desirable? it's a
+variant of auto-increment [1]. last time i saw any hint of auto-increment
+register opcodes was in the 1980s... 68000 if i recall correctly... yep
+see [1]
+
+[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html
+
+Reply:
+
+Another option for auto-increment is for vector-memory-access instructions
+to support post-increment addressing for unit-stride and constant-stride
+modes. This can be implemented by the scalar unit passing the operation
+to the vector unit while itself executing an appropriate multiply-and-add
+to produce the incremented address. This does *not* require additional
+ports on the scalar register file, unlike scalar post-increment addressing
+modes.
+
+## TODO: instructions (based on Hwacha) V-Ext duplication analysis
+
+This is partly speculative due to lack of access to an up-to-date
+V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing). However
+basing an analysis instead on Hwacha, a cursory examination shows over
+an **85%** duplication of V-Ext operand-related instructions when
+compared to Simple-V on a standard RV64G base. Even Vector Fetch
+is analogous to "zero-overhead loop".
+
+Exceptions are:
+
+* Vector Indexed Memory Instructions (non-contiguous)
+* Vector Atomic Memory Instructions.
+* Some of the Vector Misc ops: VEIDX, VFIRST, VCLASS, VPOPC
+ and potentially more.
+* Consensual Jump
+
+Table of RV32V Instructions
+
+| RV32V | RV Equivalent (FP) | RV Equivalent (Int) | Notes |
+| ----- | ------------------ | ------------------- | ----- |
+| VADD | FADD | ADD | |
+| VSUB | FSUB | SUB | |
+| VSL | | SLL | |
+| VSR | | SRL | |
+| VAND | | AND | |
+| VOR | | OR | |
+| VXOR | | XOR | |
+| VSEQ | FEQ | BEQ | {1} |
+| VSNE | !FEQ | BNE | {1} |
+| VSLT | FLT | BLT | {1} |
+| VSGE | !FLE | BGE | {1} |
+| VCLIP | | | |
+| VCVT | FCVT | | |
+| VMPOP | | | |
+| VMFIRST | | | |
+| VEXTRACT | | | |
+| VINSERT | | | |
+| VMERGE | | | |
+| VSELECT | | | |
+| VSLIDE | | | |
+| VDIV | FDIV | DIV | |
+| VREM | | REM | |
+| VMUL | FMUL | MUL | |
+| VMULH | | | |
+| VMIN | FMIN | | |
+| VMAX | FMAX | | |
+| VSGNJ | FSGNJ | | |
+| VSGNJN | FSGNJN | | |
+| VSGNJX | FSGNJX | | |
+| VSQRT | FSQRT | | |
+| VCLASS | | | |
+| VPOPC | | | |
+| VADDI | | ADDI | |
+| VSLI | | SLLI | |
+| VSRI | | SRLI | |
+| VANDI | | ANDI | |
+| VORI | | ORI | |
+| VXORI | | XORI | |
+| VCLIPI | | | |
+| VMADD | FMADD | | |
+| VMSUB | FMSUB | | |
+| VNMADD | FNMSUB | | |
+| VNMSUB | FNMADD | | |
+| VLD | FLD | LD | |
+| VLDS | | LW | |
+| VLDX | | LWU | |
+| VST | FSD | SD | |
+| VSTS | | | |
+| VSTX | | | |
+| VAMOSWAP | | AMOSWAP | |
+| VAMOADD | | AMOADD | |
+| VAMOAND | | AMOAND | |
+| VAMOOR | | AMOOR | |
+| VAMOXOR | | AMOXOR | |
+| VAMOMIN | | AMOMIN | |
+| VAMOMAX | | AMOMAX | |
+
+Notes:
+
+* {1} retro-fit predication variants into branch instructions (base and C),
+ decoding triggered by CSR bit marking register as "Vector type".
+
+## TODO: sort
+
+> I suspect that the "hardware loop" in question is actually a zero-overhead
+> loop unit that diverts execution from address X to address Y if a certain
+> condition is met.
+
+ not quite. The zero-overhead loop unit interestingly would be at
+an [independent] level above vector-length. The distinctions are
+as follows:
+
+* Vector-length issues *virtual* instructions where the register
+ operands are *specifically* altered (to cover a range of registers),
+ whereas zero-overhead loops *specifically* do *NOT* alter the operands
+ in *ANY* way.
+
+* Vector-length-driven "virtual" instructions are driven by *one*
+ and *only* one instruction (whether it be a LOAD, STORE, or pure
+ one/two/three-operand opcode) whereas zero-overhead loop units
+ specifically apply to *multiple* instructions.
+
+Where vector-length-driven "virtual" instructions might get conceptually
+blurred with zero-overhead loops is LOAD / STORE. In the case of LOAD /
+STORE, to actually be useful, vector-length-driven LOAD / STORE should
+increment the LOAD / STORE memory address to correspondingly match the
+increment in the register bank. Example:
+
+* set vector-length for r0 to 4
+* issue RV32 LOAD from addr 0x1230 to r0
+
+translates effectively to:
+
+* RV32 LOAD from addr 0x1230 to r0
+* ...
+* ...
+* RV32 LOAD from addr 0x123C to r3
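+
+i.e. a sketch (assuming a 32-bit element width, hence a 4-byte address
+increment per element; mem32 is a hypothetical helper):
+
+    for (int i = 0; i < 4; i++)              // vector-length of r0 is 4
+        x[0 + i] = mem32(0x1230 + i * 4);    // r0..r3 from 0x1230..0x123C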
+
+# P-Ext ISA
+
+## 16-bit Arithmetic
+
+| Mnemonic | 16-bit Instruction | Simple-V Equivalent |
+| ------------------ | ------------------------- | ------------------- |
+| ADD16 rt, ra, rb | add | RV ADD (bitwidth=16) |
+| RADD16 rt, ra, rb | Signed Halving add | |
+| URADD16 rt, ra, rb | Unsigned Halving add | |
+| KADD16 rt, ra, rb | Signed Saturating add | |
+| UKADD16 rt, ra, rb | Unsigned Saturating add | |
+| SUB16 rt, ra, rb | sub | RV SUB (bitwidth=16) |
+| RSUB16 rt, ra, rb | Signed Halving sub | |
+| URSUB16 rt, ra, rb | Unsigned Halving sub | |
+| KSUB16 rt, ra, rb | Signed Saturating sub | |
+| UKSUB16 rt, ra, rb | Unsigned Saturating sub | |
+| CRAS16 rt, ra, rb | Cross Add & Sub | |
+| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub | |
+| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub | |
+| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub | |
+| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | |
+| CRSA16 rt, ra, rb | Cross Sub & Add | |
+| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add | |
+| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add | |
+| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add | |
+| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | |
+
+## 8-bit Arithmetic
+
+| Mnemonic           | 8-bit Instruction         | Simple-V Equivalent |
+| ------------------ | ------------------------- | ------------------- |
+| ADD8 rt, ra, rb | add | RV ADD (bitwidth=8)|
+| RADD8 rt, ra, rb | Signed Halving add | |
+| URADD8 rt, ra, rb | Unsigned Halving add | |
+| KADD8 rt, ra, rb | Signed Saturating add | |
+| UKADD8 rt, ra, rb | Unsigned Saturating add | |
+| SUB8 rt, ra, rb | sub | RV SUB (bitwidth=8)|
+| RSUB8 rt, ra, rb | Signed Halving sub | |
+| URSUB8 rt, ra, rb | Unsigned Halving sub | |
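+
+For reference, a sketch in C of what one of the ops with no direct
+Simple-V equivalent computes per packed element: signed saturating
+add (KADD8):
+
+    #include <stdint.h>
+
+    int8_t kadd8_element(int8_t a, int8_t b) {
+        int16_t sum = (int16_t)a + (int16_t)b;  // widen to avoid overflow
+        if (sum > INT8_MAX) sum = INT8_MAX;     // clamp high
+        if (sum < INT8_MIN) sum = INT8_MIN;     // clamp low
+        return (int8_t)sum;
+    }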
+
+# Exceptions
+
+> What does an ADD of two different-sized vectors do in simple-V?
+
+* if the two source operands are not the same length, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+ than the destination, throw an exception.
+
+> And what about instructions like JALR?
+> What does jumping to a vector do?
+
+* Throw an exception. Whether that actually results in spawning threads
+ as part of the trap-handling remains to be seen.
+
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes to multiple independent registers) meaning
+  that SRAM or 2R1W can be used for entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is much
+ more explicitly directly specified in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+ need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+ not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+ but with the down-side that they're an all-or-nothing part of the Extension.
+ No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+ parallelisation can be carried out, followed by further parallel Lane-based
+ work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+ is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware. It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity of having to use register renames, OoO, VLIW,
+  register file caching, all of which has been done before but is a
+  pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+ saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+ parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+ a CSR register to indicate vector length, a separate one to indicate
+ that it is a predicate register and so on) means a little more setup
+ time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+ approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+ operations not suited to parallelisation may be carried out interleaved
+ between parallelised instructions *without* requiring data to be dropped
+ down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+ files means that huge parallel workloads would use up considerable
+ chunks of the register file. However in the case of RV64 and 32-bit
+ operations, that effectively means 64 slots are available for parallel
+ operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+ be added, yet the instruction opcodes remain unchanged (and still appear
+ to be parallel). consistent "API" regardless of actual internal parallelism:
+  even an in-order single-issue implementation with a single ALU would still
+  appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area). At worst it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file caching schemes.
+
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+ streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+ (similar to Alt-RVP) may be used as an implementation detail,
+ using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine where no RTL (ALUs, Pipeline stages)
+  reuse is really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+ do not gain parallelism, resulting in prolific duplication of functionality
+ inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+ using the standard integer or FP regfile) an entire vector must be
+ transferred out to memory, into standard regfiles, then back to memory,
+ then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is; may
+  be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+ vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+ implementation time and die area, meaning that adoption is likely only
+ to be in high-performance specialist supercomputing (where it will
+ be absolutely superb).
+
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance. Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+ at once. Parallelism is inherent at the ALU, making the addition of
+ SIMD-style parallelism an easy decision that has zero significant impact
+ on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+ therefore result in superb throughput, easily achieved even with a very
+ simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+ increase instruction count on what would otherwise be a "simple loop",
+ should the number of elements in an array not happen to exactly match
+ the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+ are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+ are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+ dimension and parallelism (width): an at least O(N^2) and quite probably
+ O(N^3) ISA proliferation that often results in several thousand
+ separate instructions. all requiring separate and distinct corner-case
+ algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+ 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+ For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+ four separate and distinct instructions: one for (r1:low r2:high),
+ one for (r1:high r2:low), one for (r1:high r2:high) and one for
+ (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+ between operand and result bit-widths. In combination with high/low
+ proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+ that allow control over individual elements within the SIMD block.
+
+# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be?"
+
+## [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+ a SIMD architecture where the ALU becomes responsible for the parallelism,
+ Alt-RVP ALUs would likewise be so responsible... with *additional*
+ (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation are
+  avoided (the dimension of architectural upgrades introducing 128-bit then
+  256-bit then 512-bit variants of the exact same 64-bit
+  SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+ of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+ be able to subdivide the bits of each register lane (columns) down into
+ arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+ 16-bit or even 8-bit, effectively dividing the registerfile into
+ Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
+ "swapping" instructions were then introduced, some of the disadvantages
+ of SIMD could be mitigated.
+
+## RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+ parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+ DSPs with a focus on Multimedia (Audio, Video and Image processing),
+ RVV's primary focus appears to be on Supercomputing: optimisation of
+ mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+ into a SIMD instruction requires an equivalent to be added to the
+ RVV Extension, if one does not exist. Given the specialist nature of
+ some SIMD instructions (8-bit or 16-bit saturated or halving add),
+ this possibility seems extremely unlikely to occur, even if the
+ implementation overhead of RVV were acceptable (compared to
+ normal SIMD/DSP-style single-issue in-order simplicity).
+
+## Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+ topologically transplant every single instruction from RVV (as
+ designed) into Simple-V equivalents, with *zero loss of functionality
+ or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+ Extension which contained the basic primitives (non-parallelised
+ 8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+ automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+ to have special SIMD-parallel opcodes added need no longer have *any*
+ of the length-dependent variants (2of 32-bit ADDs in a 64-bit register,
+ 4of 32-bit ADDs in a 128-bit register) because Simple-V takes the
+ *standard* RV opcodes (present and future) and automatically parallelises
+ them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+ with RVV the corner-cases and ISA proliferation of SIMD is avoided.
+* Whilst not entirely finalised, registers are expected to be
+ capable of being subdivided down to an implementor-chosen bitwidth
+ in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+ and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+ choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+ ALUs that perform twin 8-bit operations as they see fit, or anything
+ else including no subdivisions at all.
+* Even though implementors have that choice even to have full 64-bit
+ (with RV64) SIMD, they *must* provide predication that transparently
+ switches off appropriate units on the last loop, thus neatly fitting
+ underlying SIMD ALU implementations *into* the arbitrary vector-length
+ RVV paradigm, keeping the uniform consistent API that is a key strategic
+ feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+ of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+ can be done by applying *Parallelised* Bit-manipulation operations
+ followed by parallelised *straight* versions of element-to-element
+ arithmetic operations, even if the bit-manipulation operations require
+ changing the bitwidth of the "vectors" to do so. Predication can
+ be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD (massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit) is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
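+
+A minimal sketch of the sub-register subdivision mentioned above, assuming
+a packed bitwidth of 8: one 32-bit register treated as four independent
+8-bit elements. Whether an implementor realises this as four 8-bit ALUs,
+two dual-SIMD 16-bit ALUs, or a loop over a single ALU, the ISA-level
+result is identical:
+
+    #include <stdint.h>
+
+    /* illustrative only: per-element modulo add, with no carry
+       crossing an 8-bit element boundary */
+    static uint32_t packed_add8(uint32_t r1, uint32_t r2) {
+        uint32_t out = 0;
+        for (int sh = 0; sh < 32; sh += 8) {
+            uint32_t a = (r1 >> sh) & 0xff;
+            uint32_t b = (r2 >> sh) & 0xff;
+            out |= ((a + b) & 0xff) << sh;
+        }
+        return out;
+    }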
+
+# Implementing V on top of Simple-V
+
+* Number of Offset CSRs extends from 2
+* Extra register file: vector-file
+* Setup of the Vector Length and bitwidth CSRs can now specify the
+  vector-file as well as the integer or float file.
+* TODO
+
+# Implementing P (renamed to DSP) on top of Simple-V
+
+* Implementors indicate chosen bitwidth support in the Vector-bitwidth CSR
+  (caveat: anything not specified drops through to software-emulation /
+  traps; a sketch of that fall-through follows this list)
+* TODO
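+
+A minimal sketch of that fall-through, assuming a hypothetical
+supported_bitwidths bitmask CSR and helper names invented purely for
+illustration:
+
+    #include <stdint.h>
+
+    extern uint32_t supported_bitwidths;  /* bitmask: one bit per bitwidth */
+    extern void raise_illegal_instruction(void);
+
+    /* decode-time check: bitwidths the implementor did not opt into
+       trap, and the software handler emulates the packed operation */
+    void check_bitwidth(unsigned bw_code) {
+        if (!(supported_bitwidths & (1u << bw_code)))
+            raise_illegal_instruction();
+    }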
+
+# Register reordering <a name="register_reordering"></a>
+
+## Register File
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (31..0) |
+| r1 | (31..0) |
+| r2 | (31..0) |
+| r3 | (31..0) |
+| r4 | (31..0) |
+| r5 | (31..0) |
+| r6 | (31..0) |
+| r7 | (31..0) |
+
+## Vectorised CSR
+
+This may not be an actual CSR: it may instead be generated from the
+Vector Length CSR, as a single bit is less burdensome on the instruction
+decode phase (a derivation sketch follows the Vector Length CSR table
+below).
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+## Vector Length CSR
+
+| Reg Num | Vector Length (3..0) |
+| ------- | -------------------- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
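+
+A minimal sketch of that derivation, using the two tables above (the
+array names are hypothetical):
+
+    #include <stdint.h>
+    #include <stdbool.h>
+
+    uint8_t vlen[8] = {2, 0, 1, 1, 3, 0, 0, 1};  /* Vector Length CSR table */
+    bool vectorised[8];
+
+    void derive_vectorised(void) {
+        for (int n = 0; n < 8; n++)
+            vectorised[n] = (vlen[n] > 1);  /* sets only r0 and r4: 00010001 */
+    }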
+
+## Virtual Register Reordering
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (31..0) | (31..0) | |
+| r2 | (31..0) | | |
+| r3 | (31..0) | | |
+| r4 | (31..0) | (31..0) | (31..0) |
+| r7 | (31..0) | | |
+
+## Example Instruction translation <a name="example_translation"></a>
+
+The instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FILO (a decode sketch follows this list):
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
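+
+A minimal sketch of how that expansion might be generated at decode time
+(the helper names are hypothetical): an operand only steps through
+consecutive registers if its vectorised bit is set, which is why the
+scalar destination r2 stays fixed while the source advances through r4,
+r5 and r6.
+
+    #include <stdint.h>
+    #include <stdbool.h>
+
+    extern bool    vectorised[8];  /* from the Vectorised CSR above */
+    extern uint8_t vlen[8];        /* from the Vector Length CSR above */
+    extern void    queue_push(int op, int rd, int rs1, int rs2);  /* FILO */
+
+    void sv_expand(int op, int rd, int rs1, int rs2) {
+        unsigned n = 1;  /* loop count: longest vectorised operand */
+        if (vectorised[rd]  && vlen[rd]  > n) n = vlen[rd];
+        if (vectorised[rs1] && vlen[rs1] > n) n = vlen[rs1];
+        if (vectorised[rs2] && vlen[rs2] > n) n = vlen[rs2];
+        for (unsigned i = 0; i < n; i++)
+            queue_push(op,
+                       vectorised[rd]  ? rd  + (int)i : rd,
+                       vectorised[rs1] ? rs1 + (int)i : rs1,
+                       vectorised[rs2] ? rs2 + (int)i : rs2);
+    }
+
+With the tables above, a call equivalent to sv_expand(ADD, 2, 4, 4) pushes
+exactly the three instructions listed.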
+
+## Insights
+
+SIMD register file splitting still to consider. For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
+are done on 64-bit registers, it is not so conceptually difficult. It may
+even be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers, such that a 64-bit FP scalar operation is dropped into (r0.H,
+r0.L) tuples. The implementation is therefore hidden through register
+renaming, as in the union sketch below.
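+
+A minimal union sketch of that (r0.H, r0.L) view (illustrative only:
+endianness and NaN-boxing details are glossed over):
+
+    #include <stdint.h>
+
+    /* one physical 64-bit FP register, viewed either as a scalar
+       64-bit operand or as two virtual 32-bit registers via renaming */
+    typedef union {
+        uint64_t f64;
+        struct { uint32_t L; uint32_t H; } pair;  /* assumes little-endian */
+    } fp_reg_t;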
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated more quickly (or in a more compact fashion that is less heavy
+on caches). Interestingly, we observe that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
+
+It could indeed have been logically deduced (or expected) that there
+would be additional decode latency in this proposal: when opcodes are
+overloaded to have different meanings, there is guaranteed to be some
+state, somewhere, directly related to the registers.
+
+There are several cases:
+
+* All operands vector-length=1 (scalars), all operands
+ packed-bitwidth="default": instructions are passed through direct as if
+ Simple-V did not exist. Simple-V is, in effect, completely disabled.
+* At least one operand vector-length > 1, all operands
+ packed-bitwidth="default": any parallel vector ALUs placed on "alert",
+ virtual parallelism looping may be activated.
+* All operands vector-length=1 (scalars), at least one
+ operand packed-bitwidth != default: degenerate case of SIMD,
+ implementation-specific complexity here (packed decode before ALUs or
+ *IN* ALUs)
+* At least one operand vector-length > 1, at least one operand
+  packed-bitwidth != default: parallel vector ALUs (if any)
+  placed on "alert", virtual parallelism looping may be activated, and
+  implementation-specific SIMD complexity kicks in (packed decode before
+  ALUs or *IN* ALUs). A classification sketch follows this list.
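+
+A minimal classification sketch of the four cases above (all names,
+including the per-register vlen[] and bitwidth[] arrays, are hypothetical):
+
+    #include <stdint.h>
+    #include <stdbool.h>
+
+    #define BW_DEFAULT 0           /* 0 = "default" (XLEN) in this sketch */
+    extern uint8_t vlen[32];       /* per-register vector-length CSRs */
+    extern uint8_t bitwidth[32];   /* per-register packed-bitwidth CSRs */
+
+    typedef enum {
+        PASS_THROUGH,              /* Simple-V effectively disabled */
+        VEC_LOOP,                  /* parallel vector ALUs on "alert" */
+        SIMD_DEGENERATE,           /* packed decode only */
+        VEC_LOOP_PACKED            /* both mechanisms active */
+    } sv_mode_t;
+
+    sv_mode_t classify(int rd, int rs1, int rs2) {
+        bool any_vec  = vlen[rd] > 1 || vlen[rs1] > 1 || vlen[rs2] > 1;
+        bool any_pack = bitwidth[rd]  != BW_DEFAULT ||
+                        bitwidth[rs1] != BW_DEFAULT ||
+                        bitwidth[rs2] != BW_DEFAULT;
+        if (!any_vec && !any_pack) return PASS_THROUGH;
+        if ( any_vec && !any_pack) return VEC_LOOP;
+        if (!any_vec &&  any_pack) return SIMD_DEGENERATE;
+        return VEC_LOOP_PACKED;
+    }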
+
+Bear in mind that the proposal includes the requirement that the decision
+whether to parallelise in hardware or whether to virtual-parallelise (to
+dramatically simplify compilers and also not to run into the SIMD
+instruction proliferation nightmare) *or* a transparent combination
+of both, be made on a *per-operand basis*, so that implementors can
+specifically choose to create an application-optimised implementation
+that they believe (or know) will sell extremely well, without having
+"Extra Standards-Mandated Baggage" that would otherwise blow their area
+or power budget completely out the window.
+
+Additionally, two possible CSR schemes have been proposed, in order to
+greatly reduce CSR space:
+
+* per-register CSRs (vector-length and packed-bitwidth)
+* a smaller number of CSRs with the same information but with an *INDEX*
+ specifying WHICH register in one of three regfiles (vector, fp, int)
+ the length and bitwidth applies to.
+
+(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
+
+In addition, LOAD/STORE has its own associated proposed CSRs that
+mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
+V (and Hwacha).
+
+Also bear in mind that, for reasons of simplicity, I was coming round
+to the idea of permitting implementors to choose exactly which bitwidths
+they would like to support in hardware and which to allow to fall through
+to software-trap emulation.
+
+So the question boils down to:
+
+* whether either (or both) of those two CSR schemes have significant
+ latency that could even potentially require an extra pipeline decode stage
+* whether there are implementations that can be thought of which do *not*
+ introduce significant latency
+* whether it is possible to explicitly (by quite simply
+  disabling Simple-V-Ext) or implicitly (by detecting the case all-vlens=1,
+  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
+  the extreme of skipping an entire pipeline stage (if one is needed)
+* whether packed bitwidth and associated regfile splitting is so complex
+  that it should definitely, definitely be made mandatory that implementors
+  move regfile splitting into the ALU, and what the implications of that
+  would be
+* whether, even if that *is* made mandatory, software-trapped
+  "unsupported bitwidths" are still desirable, on the basis that SIMD is
+  such a complete nightmare that *even* having a software implementation
+  is better, making Simple-V have more in common with a software API than
+  anything else.
+
+Whilst the above may seem to be severe minuses, there are some strong
+pluses:
+
+* Significant reduction of V's opcode space: over 85%.
+* Smaller reduction of P's opcode space: around 10%.
+* The potential to use Compressed instructions in both Vector and SIMD
+ due to the overloading of register meaning (implicit vectorisation,
+ implicit packing)
+* Not only present but also future extensions automatically gain parallelism.
+* Already mentioned but worth emphasising: the simplification to compiler
+ writers and assembly-level writers of having the same consistent ISA
+ regardless of whether the internal level of parallelism (number of
+ parallel ALUs) is only equal to one ("virtual" parallelism), or is
+ greater than one, should not be underestimated.
+
+# Appendix
+
+## Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy, noting that registers are typically only read once,
+that values are never written back from the upper to the lower cache
+level but always travel in a cycle (lower -> upper -> ALU -> lower),
+and, at the top of page 5, a scheme where looking ahead by only 2
+instructions is enough to determine which registers to bring into the
+cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD. Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
# References
"implicit program-counter" <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/vYVi95gF2Mo/SHz6a4_lAgAJ>
* Re-continuing P-Extension proposal <https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/IkLkQn3HvXQ/SEMyC9IlAgAJ>
* First Draft P-SIMD (DSP) proposal <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/vYVi95gF2Mo>
-
+* B-Extension discussion <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/zi_7B15kj6s>
+* Broadcom VideoCore-IV <https://docs.broadcom.com/docs/12358545>
+ Figure 2 P17 and Section 3 on P16.
+* Hwacha <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-262.html>
+* Hwacha <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html>
+* Vector Workshop <http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>
+* Predication <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/XoP4BfYSLXA>
+* Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
+* Life of Triangles (3D) <https://jbush001.github.io/2016/02/27/life-of-triangle.html>
+* VideoCore-IV 3D Graphics Pipeline <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
+* Discussion proposing CSRs that change ISA definition
+ <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>