clarify

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index a448b487d4cd7ad3e08d91e7d095c74c073a78da..afa1bd1acbdea400b1b8548e35dd819ff8fef407 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -5,7 +5,7 @@ a consistent "API" to parallelisation of existing *and future* operations.
  *Actual* internal hardware-level parallelism is *not* required, such
  that Simple-V may be viewed as providing a "compact" or "consolidated"
  means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FILO), pending execution.
+instruction queue (FIFO), pending execution.
  
  *Actual* parallelism, if added independently of Simple-V in the form
  of Out-of-order restructuring (including parallel ALU lanes) or VLIW
@@ -21,7 +21,7 @@ requirements: power-conscious, area-conscious, and performance-conscious
  designs all pull an ISA and its implementation in different conflicting
  directions, as do the specific intended uses for any given implementation.
  
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
  whilst each extremely powerful in their own right and clearly desirable,
  are also:
  
@@ -31,8 +31,8 @@ are also:
    analysis and review purposes) prohibitively expensive
  * Both contain partial duplication of pre-existing RISC-V instructions
    (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
  * Both require that their respective parallelism paradigm be implemented
    along-side and integral to their respective functionality *or not at all*.
  * Both independently have methods for introducing parallelism that
@@ -52,10 +52,13 @@ details outlined in the Appendix), the key points being:
  * Vectorisation typically includes much more comprehensive memory load
    and store schemes (unit stride, constant-stride and indexed), which
    in turn have ramifications: virtual memory misses (TLB cache misses)
-  and even multiple page-faults... all caused by a *single instruction*.
+  and even multiple page-faults... all caused by a *single instruction*,
+  yet with a clear benefit that the regularisation of LOAD/STOREs can
+  be optimised for minimal impact on caches and maximised throughput.
  * By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
    to pages), and these load/stores have absolutely nothing to do with the
-  SIMD / ALU engine, no matter how wide the operand.
+  SIMD / ALU engine, no matter how wide the operand.  Simplicity but with
+  more impact on instruction and data caches.
  
  Overall it makes a huge amount of sense to have a means and method
  of introducing instruction parallelism in a flexible way that provides
@@ -135,7 +138,7 @@ burdensome to implementations, given that instruction decode already has
  to direct the operation to a correctly-sized width ALU engine, anyway.
  
  Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
  implicit bit-width allows the meaning of certain operations to be
  type-overloaded *without* pollution or alteration of frozen and immutable
  instructions, in a fully backwards-compatible fashion.
@@ -208,10 +211,10 @@ Interestingly, none of this complexity is faced in SIMD architectures...
  but then they do not get the opportunity to optimise for highly-streamlined
  memory accesses either.
  
-With the "bang-per-buck" ratio being so high and the direct improvement
-in L1 Instruction Cache usage, as well as the opportunity to optimise
-L1 and L2 cache usage, the case for including Vector LOAD/STORE is
-compelling.
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
  
  ## Mask and Tagging (Predication)
  
@@ -232,8 +235,8 @@ So these are the ways in which conditional execution may be implemented:
  * explicit compare and branch: BNE x, y -> offs would jump offs
    instructions if x was not equal to y
  * explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
-  (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code) such as ADD results in a carry, carry bit
+  implicitly (or sometimes explicitly) goes into a "tag" (mask) register
  
  The first of these is a "normal" branch method, which is flat-out impossible
  to parallelise without look-ahead and effectively rewriting instructions.
@@ -304,17 +307,269 @@ In particular:
    i.e. *without* requiring a super-scalar or out-of-order architecture,
    but doing a proper, full job (ZOLC) is an entirely different matter.
  
-Constructing a SIMD/Simple-Vector proposal based around four of these five
+Constructing a SIMD/Simple-Vector proposal based around four of these six
  requirements would therefore seem to be a logical thing to do.
  
-# Instruction Format
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors desire to focus on power, area or performance.  In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
+Parallelism".  They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (varible-width SIMD) implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler.  Whilst
+a Vector (varible-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+straightforward.  All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
+
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism.  Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+  marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+  of vsetl.
+* Actual vector length is *multipled* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+  as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+  state information.
+* TODO: assess whether the same technique could be applied to the other
+  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+  needed for context-switches (empty slots need never be stored).
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated.  The first entry is whether predication
+is enabled.  The second entry is whether the register index refers to a
+floating-point or an integer register.  The third entry is the index
+of that register which is to be predicated (if referred to).  The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6      | 5   | (4..0)  | (4..0)  |
+| ----- | -      | -   | ------- | ------- |
+| r0    | pren0  | i/f | regidx  | predidx |
+| r1    | pren1  | i/f | regidx  | predidx |
+| ..    | pren.. | i/f | regidx  | predidx |
+| r15   | pren15 | i/f | regidx  | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    fp_pred_enabled[32];
+    int_pred_enabled[32];
+    for (i = 0; i < 16; i++)
+       if CSRpred[i].pren:
+          idx = CSRpred[i].regidx
+          predidx = CSRpred[i].predidx
+          if CSRpred[i].type == 0: # integer
+            int_pred_enabled[idx] = 1
+            int_pred_reg[idx] = predidx
+          else:
+            fp_pred_enabled[idx] = 1
+            fp_pred_reg[idx] = predidx
+
+So when an operation is to be predicated, it is the internal state that
+is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store:
+
+    if type(iop) == INT:
+        pred_enabled = int_pred_enabled
+        preg = int_pred_reg[rd]
+    else:
+        pred_enabled = fp_pred_enabled
+        preg = fp_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+        if (preg_enabled[rd] && [!]preg[i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+## MAXVECTORDEPTH
+
+MAXVECTORDEPTH is the same concept as MVL in RVV.  However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers.  This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Vector-length CSRs
+
+Vector lengths are interpreted as meaning "any instruction referring to
+r(N) generates implicit identical instructions referring to registers
+r(N+M-1) where M is the Vector Length".  Vector Lengths may be set to
+use up to 16 registers in the register file.
+
+One separate CSR table is needed for each of the integer and floating-point
+register files:
+
+| RegNo | (3..0) |
+| ----- | ------ |
+| r0    | vlen0  |
+| r1    | vlen1  |
+| ..    | vlen.. |
+| r31   | vlen31 |
+
+An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
+whether a register was, if referred to in any standard instructions,
+implicitly to be treated as a vector.
+
+Note:
+
+* A vector length of 1 indicates that it is to be treated as a scalar.
+  Bitwidths (on the same register) are interpreted and meaningful.
+* A vector length of 0 indicates that the parallelism is to be switched
+  off for this register (treated as a scalar).  When length is 0,
+  the bitwidth CSR for the register is *ignored*.
+
+Internally, implementations may choose to use the non-zero vector length
+to set a bit-field per register, to be used in the instruction decode phase.
+In this way any standard (current or future) operation involving
+register operands may detect if the operation is to be vector-vector,
+vector-scalar or scalar-scalar (standard) simply through a single
+bit test.
+
+Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
+bitwidth is specifically not set) it becomes:
+
+    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+
+This is in contrast to RVV:
+
+    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+
+## Element (SIMD) bitwidth CSRs
+
+Element bitwidths may be specified with a per-register CSR, and indicate
+how a register (integer or floating-point) is to be subdivided.
+
+| RegNo | (2..0) |
+| ----- | ------ |
+| r0    | vew0   |
+| r1    | vew1   |
+| ..    | vew..  |
+| r31   | vew31  |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default  |
+| 001 | 8        |
+| 010 | 16       |
+| 011 | 32       |
+| 100 | 64       |
+| 101 | 128      |
+| 110 | rsvd     |
+| 111 | rsvd     |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
+
+Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
+into account, it becomes:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
+# Instructions
+
+By being a topological remap of RVV concepts, the following RVV instructions
+remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT, VMERGE, VSELECT,
+VSLIDE, VCLASS and VPOPC.  Two instructions, VCLIP and VCLIPI, do not
+have RV Standard equivalents, so are left out of Simple-V.
+All other instructions from RVV are topologically re-mapped and retain
+their complete functionality, intact.
+
+## Instruction Format
  
  The instruction format for Simple-V does not actually have *any* explicit
-compare operations, *any* arithmetic, floating point or memory instructions.
+compare operations, *any* arithmetic, floating point or *any*
+memory instructions.
  Instead it *overloads* pre-existing branch operations into predicated
  variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth.  *This includes Compressed instructions* as well as future ones.
+depending on CSR configurations for vector length, bitwidth and
+predication.  *This includes Compressed instructions* as well as any
+future instructions and Custom Extensions.
  
  * For analysis of RVV see [[v_comparative_analysis]] which begins to
    outline topologically-equivalent mappings of instructions
@@ -333,12 +588,29 @@ and having the benefit of being explicit.*
  
  ## Branch Instruction:
  
+Branch operations use standard RV opcodes that are reinterpreted to be
+"predicate variants" in the instance where either of the two src registers
+have their corresponding CSRvectorlen[src] entry as non-zero.  When this
+reinterpretation is enabled the predicate target register rs3 is to be
+treated as a bitfield (up to a maximum of XLEN bits corresponding to a
+maximum of XLEN elements).
+
+If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector.  Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register cacheing
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src registers
+the operation is treated as an absolutely standard scalar branch operation.
+
  This is the overloaded table for Integer-base Branch operations.  Opcode
  (bits 6..0) is set in all cases to 1100011.
  
  [[!table  data="""
  31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12|10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
  7           | 5        | 5     | 3      | 4             | 1  | 7       |
  reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
  reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
@@ -351,11 +623,16 @@ reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
  reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
  """]]
  
-This is the overloaded table for Floating-point Predication operations.
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
  Interestingly no change is needed to the instruction format because
  FP Compare already stores a 1 or a zero in its "rd" integer register
  target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
  
  As with
  Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
@@ -368,15 +645,15 @@ and whilst in ordinary branch code this is fine because the standard
  RVF compare can always be followed up with an integer BEQ or a BNE (or
  a compressed comparison to zero or non-zero), in predication terms that
  becomes more of an impact as an explicit (scalar) instruction is needed
-to invert the predicate.  An additional encoding funct3=011 is therefore
-proposed to cater for this.
+to invert the predicate bitmask.  An additional encoding funct3=011 is
+therefore proposed to cater for this.
  
  [[!table  data="""
  31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
  funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
  5       | 2        | 5        | 5     | 3      | 4        | 7       |
  10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
-10100   | 00/01/11 | src2     | src1  | *011*  | pred rs3 | FNE     |
+10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | FNE     |
  10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
  10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
  """]]
@@ -402,9 +679,11 @@ complex), this becomes:
      if I/F == INT: # integer type cmp
          pred_enabled = int_pred_enabled # TODO: exception if not set!
          preg = int_pred_reg[rd]
+        reg = int_regfile
      else:
          pred_enabled = fp_pred_enabled # TODO: exception if not set!
          preg = fp_pred_reg[rd]
+        reg = fp_regfile
  
      s1 = CSRvectorlen[src1] > 1;
      s2 = CSRvectorlen[src2] > 1;
@@ -416,7 +695,7 @@ Notes:
  
  * Predicated SIMD comparisons would break src1 and src2 further down
    into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
-  Reordering") setting Vector-Length * (number of SIMD elements) bits
+  Reordering") setting Vector-Length times (number of SIMD elements) bits
    in Predicate Register rs3 as opposed to just Vector-Length bits.
  * Predicated Branches do not actually have an adjustment to the Program
    Counter, so all of bits 25 through 30 in every case are not needed.
@@ -443,6 +722,7 @@ C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
  Notes:
  
  * Bits 5 13 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
  * In both floating-point and integer cases there are four predication
    comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
    src1 and src2).
@@ -451,7 +731,8 @@ Notes:
  
  For full analysis of topological adaptation of RVV LOAD/STORE
  see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
  
  Revised LOAD:
  
@@ -481,7 +762,7 @@ Notes:
  * **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
  * **TODO**: clarify where width maps to elsize
  
-Pseudo-code (excludes CSR SIMD bitwidth):
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
  
      if (unit-strided) stride = elsize;
      else stride = areg[as2]; // constant-strided
@@ -554,226 +835,6 @@ of detecting early page / segmentation faults and adjusting the TLB
  in advance, accordingly: other strategies are explored in the Appendix
  Section "Virtual Memory Page Faults".
  
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors desire to focus on power, area or performance.  In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism".  They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (varible-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler.  Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward.  All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
-
-# CSRs <a name="csrs"></a>
-
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
-
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated.  The first entry is whether predication
-is enabled.  The second entry is whether the register index refers to a
-floating-point or an integer register.  The third entry is the index
-of that register which is to be predicated (if referred to).  The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6      | 5   | (4..0)  | (4..0)  |
-| ----- | -      | -   | ------- | ------- |
-| r0    | pren0  | i/f | regidx  | predidx |
-| r1    | pren1  | i/f | regidx  | predidx |
-| ..    | pren.. | i/f | regidx  | predidx |
-| r15   | pren15 | i/f | regidx  | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    fp_pred_enabled[32];
-    int_pred_enabled[32];
-    for (i = 0; i < 16; i++)
-       if CSRpred[i].pren:
-          idx = CSRpred[i].regidx
-          predidx = CSRpred[i].predidx
-          if CSRpred[i].type == 0: # integer
-            int_pred_enabled[idx] = 1
-            int_pred_reg[idx] = predidx
-          else:
-            fp_pred_enabled[idx] = 1
-            fp_pred_reg[idx] = predidx
-
-So when an operation is to be predicated, it is the internal state that
-is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i<vl; ++i)
-        if ([!]preg[p][i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store:
-
-    if type(iop) == INT:
-        pred_enabled = int_pred_enabled
-        preg = int_pred_reg[rd]
-    else:
-        pred_enabled = fp_pred_enabled
-        preg = fp_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-        if (preg_enabled[rd] && [!]preg[i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-## MAXVECTORDEPTH
-
-MAXVECTORDEPTH is the same concept as MVL in RVV.  However in Simple-V,
-given that its primary (base, unextended) purpose is for 3D, Video and
-other purposes (not requiring supercomputing capability), it makes sense
-to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
-and so on).
-
-The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers.  This keeps the implementation a little simpler.
-Note that RVV on top of Simple-V may choose to over-ride this decision.
-
-## Vector-length CSRs
-
-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N+M-1) where M is the Vector Length".  Vector Lengths may be set to
-use up to 16 registers in the register file.
-
-One separate CSR table is needed for each of the integer and floating-point
-register files:
-
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0    | vlen0  |
-| r1    | vlen1  |
-| ..    | vlen.. |
-| r31   | vlen31 |
-
-An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
-whether a register was, if referred to in any standard instructions,
-implicitly to be treated as a vector.  A vector length of 1 indicates
-that it is to be treated as a scalar.  Vector lengths of 0 are reserved.
-
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
-
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth is specifically not set) it becomes:
-
-    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
-
-This is in contrast to RVV:
-
-    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
-
-## Element (SIMD) bitwidth CSRs
-
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
-
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0    | vew0   |
-| r1    | vew1   |
-| ..    | vew..  |
-| r31   | vew31  |
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
-
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
-
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
-
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
-
  # Exceptions
  
  > What does an ADD of two different-sized vectors do in simple-V?
@@ -790,12 +851,38 @@ is given in the section "Bitwidth Virtual Register Reordering".
  
  # Impementing V on top of Simple-V
  
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
-  as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential.  This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+  implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+  opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+  of Simple-V as, again, there is no corresponding RV Standard Extension
+  that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+  complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+  of fitting on top of the integer and floating-point files.  An
+  "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+  registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+  register file size is 5 bits (32 registers), whilst the "Extended"
+  variant of RVV specifies 8 bits (256 registers) and has yet to
+  be published.
+* One big difference: Sections 17.12 and 17.17, there are only two possible
+  predication registers in RVV "Base".  Through the "indirect" method,
+  Simple-V provides a key-value CSR table that allows (arbitrarily)
+  up to 16 (TBD) of either the floating-point or integer registers to
+  be marked as "predicated" (key), and if so, which integer register to
+  use as the predication mask (value).
+
+**TODO**
  
  # Implementing P (renamed to DSP) on top of Simple-V
  
@@ -849,9 +936,9 @@ from actual (internal) parallel hardware.  It's an API in effect that's
  designed to be slotted in to an existing implementation (just after
  instruction decode) with minimum disruption and effort.
  
-* minus: the complexity of having to use register renames, OoO, VLIW,
-  register file cacheing, all of which has been done before but is a
-  pain
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file cacheing,
+  all of which has been done before but is a pain
  * plus: transparent re-use of existing opcodes as-is just indirectly
    saying "this register's now a vector" which
  * plus: means that future instructions also get to be inherently
@@ -960,7 +1047,7 @@ the question is asked "How can each of the proposals effectively implement
    a SIMD architecture where the ALU becomes responsible for the parallelism,
    Alt-RVP ALUs would likewise be so responsible... with *additional*
    (lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
    at least one dimension are avoided (architectural upgrades introducing
    128-bit then 256-bit then 512-bit variants of the exact same 64-bit
    SIMD block)
@@ -1036,6 +1123,15 @@ the question is asked "How can each of the proposals effectively implement
    operations, all the while keeping a consistent ISA-level "API" irrespective
    of implementor design choices (or indeed actual implementations).
  
+### Example Instruction translation: <a name="example_translation"></a>
+
+Instructions "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
+
  ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
  
      register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
@@ -1087,10 +1183,10 @@ There is, in the standard Conditional Branch instruction, more than
  adequate space to interpret it in a similar fashion:
  
  [[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   offset[12,10:5]  ||    src2  |    src1   |  BEQ         |    offset[11,4:1]      || BRANCH      |
+31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7       | 6....0 |
+imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
+ 1      | 6          | 5    | 5    | 3      | 4        | 1       |   7    |
+   offset[12,10:5]  || src2 | src1 | BEQ    | offset[11,4:1]    || BRANCH |
  """]]
  
  This would become:
@@ -1110,19 +1206,19 @@ not only to add in a second source register, but also use some of the bits as
  a predication target as well.
  
  [[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |         imm           |   op   |
-      3      |         3          |    3     |           5           |   2    |
-   C.BEQZ    |   offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
+funct3 | imm           | rs10  | imm               | op     |
+3      | 3             | 3     | 5                 | 2      |
+C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
  """]]
  
  Now uses the CS format:
  
  [[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |  imm   |              |   op   |
-      3      |         3          |    3     |  2     |  3           |   2    |
-   C.BEQZ    |   predicate rs3    |   src1   |  I/F B | src2         |   C1   |
+15..13 | 12 .  10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm      | rs10   | imm    |      | op     |
+3      | 3        | 3      | 2      | 3    | 2      |
+C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
  """]]
  
  Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1225,16 +1321,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
  
  vew may be one of the following (giving a table "bytestable", used below):
  
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
  
  Pseudocode for vector length taking CSR SIMD-bitwidth into account:
  
@@ -1255,15 +1351,6 @@ To index an element in a register rnum where the vector element index is i:
                 byteidx * 8,           # low
                 byteidx * 8 + (vew-1), # high
  
-### Example Instruction translation: <a name="example_translation"></a>
-
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
  ### Insights
  
  SIMD register file splitting still to consider.  For RV64, benefits of doubling
@@ -1361,7 +1448,7 @@ So the question boils down to:
  Whilst the above may seem to be severe minuses, there are some strong
  pluses:
  
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
  * Smaller reduction of P's opcode space: around 10%.
  * The potential to use Compressed instructions in both Vector and SIMD
    due to the overloading of register meaning (implicit vectorisation,
@@ -1411,7 +1498,10 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
  makes proposing it quite challenging given that the relevant (Base) RV
  sections are frozen.  Consequently it makes sense to forgo this feature.
  
-## Virtual Memory page-faults
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
  
  > I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
  > riscv-isa-manual in order to work out how to re-map RVV onto the standard
@@ -1515,6 +1605,78 @@ Am still thinking through the implications as any dependent operations
  (particularly ones already decoded and moved into the execution FIFO)
  would still be there (and stalled).  hmmm.
  
+----
+
+    > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+    > > VSETL r0, 8
+    > > FADD x1, x2, x3
+    >
+    > > x3[0]: ok
+    > > x3[1]: exception
+    > > x3[2]: ok
+    > > ...
+    > > ...
+    > > x3[7]: ok
+    >
+    > > what happens to result elements 2-7?  those may be *big* results
+    > > (RV128)
+    > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+    >
+    >  (you replied:)
+    >
+    > Thrown away.
+
+discussion then led to the question of OoO architectures
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it.  It's hard to debug.  You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility.  (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
+
+TODO: assess various implementation paradigms.  These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way to high-performance systems.
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+  required, however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comphrensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targetted
+at SIMD / DSP.  High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements?  8?  Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone".  Detection of
+ADD r1, rs1, rs0 cases result in nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance to avoid.
+
+TBD: floating-point compare and other exception handling
+
  # References
  
  * SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
@@ -1544,3 +1706,4 @@ would still be there (and stalled).  hmmm.
  * Discussion on RVV "re-entrant" capabilities allowing operations to be
    restarted if an exception occurs (VM page-table miss)
    <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
+* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>