(no commit message)

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 65d835d95c732f2cd65d9239d482b18a4df2ff8e..7aff13daf9a43aecfb01b46f9f6183b6ce440da7 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -31,8 +31,8 @@ are also:
    analysis and review purposes) prohibitively expensive
  * Both contain partial duplication of pre-existing RISC-V instructions
    (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
  * Both require that their respective parallelism paradigm be implemented
    along-side and integral to their respective functionality *or not at all*.
  * Both independently have methods for introducing parallelism that
@@ -52,10 +52,13 @@ details outlined in the Appendix), the key points being:
  * Vectorisation typically includes much more comprehensive memory load
    and store schemes (unit stride, constant-stride and indexed), which
    in turn have ramifications: virtual memory misses (TLB cache misses)
-  and even multiple page-faults... all caused by a *single instruction*.
+  and even multiple page-faults... all caused by a *single instruction*,
+  yet with a clear benefit that the regularisation of LOAD/STOREs can
+  be optimised for minimal impact on caches and maximised throughput.
  * By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
    to pages), and these load/stores have absolutely nothing to do with the
-  SIMD / ALU engine, no matter how wide the operand.
+  SIMD / ALU engine, no matter how wide the operand.  This is simpler, but
+  has more impact on instruction and data caches.
  
  Overall it makes a huge amount of sense to have a means and method
  of introducing instruction parallelism in a flexible way that provides
@@ -135,7 +138,7 @@ burdensome to implementations, given that instruction decode already has
  to direct the operation to a correctly-sized width ALU engine, anyway.
  
  Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
  implicit bit-width allows the meaning of certain operations to be
  type-overloaded *without* pollution or alteration of frozen and immutable
  instructions, in a fully backwards-compatible fashion.
@@ -208,10 +211,10 @@ Interestingly, none of this complexity is faced in SIMD architectures...
  but then they do not get the opportunity to optimise for highly-streamlined
  memory accesses either.
  
-With the "bang-per-buck" ratio being so high and the direct improvement
-in L1 Instruction Cache usage, as well as the opportunity to optimise
-L1 and L2 cache usage, the case for including Vector LOAD/STORE is
-compelling.
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
  
  ## Mask and Tagging (Predication)
  
@@ -232,8 +235,8 @@ So these are the ways in which conditional execution may be implemented:
  * explicit compare and branch: BNE x, y -> offs would jump offs
    instructions if x was not equal to y
  * explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
-  (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code): for example, an ADD results in a carry, and the
+  carry bit implicitly (or sometimes explicitly) goes into a "tag" (mask) register
  
  The first of these is a "normal" branch method, which is flat-out impossible
  to parallelise without look-ahead and effectively rewriting instructions.
@@ -304,9 +307,250 @@ In particular:
    i.e. *without* requiring a super-scalar or out-of-order architecture,
    but doing a proper, full job (ZOLC) is an entirely different matter.
  
-Constructing a SIMD/Simple-Vector proposal based around four of these five
+Constructing a SIMD/Simple-Vector proposal based around four of these six
  requirements would therefore seem to be a logical thing to do.
  
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance.  In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
+Parallelism".  They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
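The hybrid scheme described above can be illustrated with a minimal sketch. It is not a model of the Videocore-IV itself: the function name `run_vector_op` and the `lanes` parameter are hypothetical, chosen only to show how a fixed number of ALUs plus a transparent loop covers a longer vector, with unused lanes simply idling on the final pass.

```python
def run_vector_op(op, a, b, lanes=4):
    """Apply op element-wise using `lanes` ALUs and as many passes as needed.

    Hypothetical illustration: `lanes` stands in for the number of
    physical ALUs; the caller only ever sees the full-length result.
    """
    vl = len(a)
    result = [None] * vl
    passes = 0
    for base in range(0, vl, lanes):
        # One "pass": each lane handles one element.  Lanes whose element
        # index falls beyond the vector length stay idle.
        for lane in range(lanes):
            i = base + lane
            if i < vl:
                result[i] = op(a[i], b[i])
        passes += 1
    return result, passes
```

With `lanes=4` a 16-element operation takes four passes, and with `lanes=1` (a single ALU) it takes sixteen. The returned data is identical in both cases, which is the transparency property the proposal asks for.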
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD) implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler.  Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+straightforward.  All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
+
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+The "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose any level of support, from
+an absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected) all the way up to full supercomputing
+level all-hardware parallelism.  Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+  marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+  of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+  "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+  as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+  state information.
+* TODO: assess whether the same technique could be applied to the other
+  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+  needed for context-switches (empty slots need never be stored).
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated.  The first entry is whether predication
+is enabled.  The second entry is whether the register index refers to a
+floating-point or an integer register.  The third entry is the index
+of that register which is to be predicated (if referred to).  The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6      | 5   | (4..0)  | (4..0)  |
+| ----- | -      | -   | ------- | ------- |
+| r0    | pren0  | i/f | regidx  | predidx |
+| r1    | pren1  | i/f | regidx  | predidx |
+| ..    | pren.. | i/f | regidx  | predidx |
+| r15   | pren15 | i/f | regidx  | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    fp_pred_enabled[32];
+    int_pred_enabled[32];
+    for (i = 0; i < 16; i++)
+       if CSRpred[i].pren:
+          idx = CSRpred[i].regidx
+          predidx = CSRpred[i].predidx
+          if CSRpred[i].type == 0: # integer
+            int_pred_enabled[idx] = 1
+            int_pred_reg[idx] = predidx
+          else:
+            fp_pred_enabled[idx] = 1
+            fp_pred_reg[idx] = predidx
+
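As a runnable rendering of the table inversion above, the following is a minimal sketch. The `PredEntry` class and the `invert_pred_table` function are hypothetical names, and the `is_fp` flag stands in for the i/f bit, on the assumption (taken from the pseudocode) that a value of 0 means integer.

```python
from dataclasses import dataclass

@dataclass
class PredEntry:
    pren: bool     # predication enabled for this key-value entry
    is_fp: bool    # i/f bit: False = integer register, True = floating-point
    regidx: int    # register to be predicated (0..31)
    predidx: int   # integer register holding the predicate bitfield

def invert_pred_table(csr_pred):
    """Turn the 16-entry key-value store into per-register lookup tables."""
    int_pred_enabled = [False] * 32
    fp_pred_enabled = [False] * 32
    int_pred_reg = [0] * 32
    fp_pred_reg = [0] * 32
    for entry in csr_pred:
        if not entry.pren:
            continue  # disabled entries contribute no state
        if entry.is_fp:
            fp_pred_enabled[entry.regidx] = True
            fp_pred_reg[entry.regidx] = entry.predidx
        else:
            int_pred_enabled[entry.regidx] = True
            int_pred_reg[entry.regidx] = entry.predidx
    return int_pred_enabled, int_pred_reg, fp_pred_enabled, fp_pred_reg
```

The inverted form trades a search through the entries for a direct index by register number, which is what makes it attractive at the instruction decode phase.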
+So when an operation is to be predicated, it is the internal state that
+is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store:
+
+    if type(iop) == INT:
+        pred_enabled = int_pred_enabled
+        preg = int_pred_reg[rd]
+    else:
+        pred_enabled = fp_pred_enabled
+        preg = fp_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+        if (pred_enabled[rd] && [!]preg[i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
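The indirect predication loop can be sketched in runnable form as follows. Several things here are assumptions for illustration only: the register file is modelled as a flat list of integers, all three operands are treated as vectors occupying consecutive registers, and a destination with predication *not* enabled is treated as fully active (the pseudocode above does not spell that case out). The name `predicated_op` is hypothetical.

```python
def predicated_op(iop, regfile, rd, rs1, rs2, vl,
                  pred_enabled, pred_reg, invert=False):
    """Element-wise iop over vl elements, masked via the internal tables."""
    # Indirect lookup: the predicate register is found through rd, it is
    # not named in the instruction.  Assumption: no predication enabled
    # means every element is active.
    mask = regfile[pred_reg[rd]] if pred_enabled[rd] else None
    for i in range(vl):
        if mask is not None:
            bit = (mask >> i) & 1   # bitfield indexed by element number
            if invert:
                bit ^= 1            # the optional "[!]" in the pseudocode
            if not bit:
                continue            # element masked out: destination untouched
        regfile[rd + i] = iop(regfile[rs1 + i], regfile[rs2 + i])
```

Masked-out destination elements are left unchanged in this sketch. Whether they should instead be zeroed is a design choice the text above does not settle.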
+## MAXVECTORDEPTH
+
+MAXVECTORDEPTH is the same concept as MVL in RVV.  However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers.  This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Vector-length CSRs
+
+Vector lengths are interpreted as meaning "any instruction referring to
+r(N) generates implicit identical instructions referring to registers
+r(N+1)..r(N+M-1) where M is the Vector Length".  Vector Lengths may be set to
+use up to 16 registers in the register file.
+
+One separate CSR table is needed for each of the integer and floating-point
+register files:
+
+| RegNo | (3..0) |
+| ----- | ------ |
+| r0    | vlen0  |
+| r1    | vlen1  |
+| ..    | vlen.. |
+| r31   | vlen31 |
+
+An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
+whether a register, if referred to in any standard instruction,
+is implicitly to be treated as a vector.
+
+Note:
+
+* A vector length of 1 indicates that it is to be treated as a scalar.
+  Bitwidths (on the same register) are interpreted and meaningful.
+* A vector length of 0 indicates that the parallelism is to be switched
+  off for this register (treated as a scalar).  When length is 0,
+  the bitwidth CSR for the register is *ignored*.
+
+Internally, implementations may choose to use the non-zero vector length
+to set a bit-field per register, to be used in the instruction decode phase.
+In this way any standard (current or future) operation involving
+register operands may detect if the operation is to be vector-vector,
+vector-scalar or scalar-scalar (standard) simply through a single
+bit test.
+
+Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
+bitwidth is specifically not set) it becomes:
+
+    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+
+This is in contrast to RVV:
+
+    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+
+## Element (SIMD) bitwidth CSRs
+
+Element bitwidths may be specified with a per-register CSR, and indicate
+how a register (integer or floating-point) is to be subdivided.
+
+| RegNo | (2..0) |
+| ----- | ------ |
+| r0    | vew0   |
+| r1    | vew1   |
+| ..    | vew..  |
+| r31   | vew31  |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default  |
+| 001 | 8        |
+| 010 | 16       |
+| 011 | 32       |
+| 100 | 64       |
+| 101 | 128      |
+| 110 | rsvd     |
+| 111 | rsvd     |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
+
+Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
+into account, it becomes:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+
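The bitwidth-aware vsetl calculation can be checked with a short sketch. The function name `vsetl`, the `requested` parameter (standing in for the value held in rs2) and the default of MAXVECTORDEPTH equal to XLEN are assumptions made for illustration. With vew set to default the result reduces to the simpler formula given under "Vector-length CSRs".

```python
# vew encoding -> bytes per element, following the "bytestable" above
BYTESTABLE = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}

def vsetl(csr_vectorlen, csr_bitwidth, rs1, requested,
          xlen=64, max_vector_depth=None):
    """Return CSRvlength for register rs1 given a requested length."""
    if max_vector_depth is None:
        max_vector_depth = xlen          # assumption: limit is regfile width
    reg_bytes = xlen // 8
    vew = csr_bitwidth[rs1]
    bytes_per_elem = reg_bytes if vew == 0 else BYTESTABLE[vew]
    # Number of SIMD elements packed into one register.  Note that an
    # element wider than the register gives 0 here, as in the pseudocode.
    simdmult = reg_bytes // bytes_per_elem
    vlen = csr_vectorlen[rs1] * simdmult
    return min(vlen, max_vector_depth, requested)
```

For example, on RV64 a register marked with vector length 4 and 16-bit elements yields 4 elements per register, so up to 16 elements in total, each of which may be individually predicated.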
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
  # Instructions
  
  By being a topological remap of RVV concepts, the following RVV instructions
@@ -323,9 +567,9 @@ compare operations, *any* arithmetic, floating point or *any*
  memory instructions.
  Instead it *overloads* pre-existing branch operations into predicated
  variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth.  *This includes Compressed instructions* as well as any
-future ones, *including* future Extensions.
+depending on CSR configurations for vector length, bitwidth and
+predication.  *This includes Compressed instructions* as well as any
+future instructions and Custom Extensions.
  
  * For analysis of RVV see [[v_comparative_analysis]] which begins to
    outline topologically-equivalent mappings of instructions
@@ -344,12 +588,29 @@ and having the benefit of being explicit.*
  
  ## Branch Instruction:
  
+Branch operations use standard RV opcodes that are reinterpreted to be
+"predicate variants" in the instance where either of the two src registers
+has its corresponding CSRvectorlen[src] entry set to non-zero.  When this
+reinterpretation is enabled the predicate target register rs3 is to be
+treated as a bitfield (up to a maximum of XLEN bits corresponding to a
+maximum of XLEN elements).
+
+If either of src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector.  Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register caching
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src register
+the operation is treated as an absolutely standard scalar branch operation.
+
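A predicated comparison of the kind described above can be sketched as follows. This is an illustration under assumptions, not the proposal's definitive semantics: the register file is a flat list, the function name `predicated_compare` is hypothetical, and the returned integer represents the bitfield that would be written to the predicate target register rs3. The fully scalar case, which remains a standard branch, is not covered.

```python
def predicated_compare(cmp, regfile, src1, src2, csr_vectorlen, vl):
    """Compare element-wise and return the predicate bitfield."""
    # An operand is a vector only if its CSR vector length exceeds 1;
    # otherwise the single scalar value is reused for every element.
    s1 = csr_vectorlen[src1] > 1
    s2 = csr_vectorlen[src2] > 1
    bits = 0
    for i in range(vl):
        a = regfile[src1 + i] if s1 else regfile[src1]
        b = regfile[src2 + i] if s2 else regfile[src2]
        if cmp(a, b):
            bits |= 1 << i   # bit i records the result for element i
    return bits
```

The scalar operand is read once per element in this sketch. A hardware implementation would more likely latch it, which is one place where the register file porting concern mentioned above arises.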
  This is the overloaded table for Integer-base Branch operations.  Opcode
  (bits 6..0) is set in all cases to 1100011.
  
  [[!table  data="""
  31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12|10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
  7           | 5        | 5     | 3      | 4             | 1  | 7       |
  reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
  reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
@@ -362,11 +623,16 @@ reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
  reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
  """]]
  
-This is the overloaded table for Floating-point Predication operations.
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BGT, BLE, BGTU and BLEU may be synthesised by swapping
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
  Interestingly no change is needed to the instruction format because
  FP Compare already stores a 1 or a zero in its "rd" integer register
  target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
  
  As with
  Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
@@ -379,15 +645,15 @@ and whilst in ordinary branch code this is fine because the standard
  RVF compare can always be followed up with an integer BEQ or a BNE (or
  a compressed comparison to zero or non-zero), in predication terms that
  becomes more of an impact as an explicit (scalar) instruction is needed
-to invert the predicate.  An additional encoding funct3=011 is therefore
-proposed to cater for this.
+to invert the predicate bitmask.  An additional encoding funct3=011 is
+therefore proposed to cater for this.
  
  [[!table  data="""
  31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
  funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
  5       | 2        | 5        | 5     | 3      | 5        | 7       |
  10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
-10100   | 00/01/11 | src2     | src1  | *011*  | pred rs3 | FNE     |
+10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | FNE     |
  10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
  10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
  """]]
@@ -413,9 +679,11 @@ complex), this becomes:
      if I/F == INT: # integer type cmp
          pred_enabled = int_pred_enabled # TODO: exception if not set!
          preg = int_pred_reg[rd]
+        reg = int_regfile
      else:
          pred_enabled = fp_pred_enabled # TODO: exception if not set!
          preg = fp_pred_reg[rd]
+        reg = fp_regfile
  
      s1 = CSRvectorlen[src1] > 1;
      s2 = CSRvectorlen[src2] > 1;
@@ -427,7 +695,7 @@ Notes:
  
  * Predicated SIMD comparisons would break src1 and src2 further down
    into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
-  Reordering") setting Vector-Length * (number of SIMD elements) bits
+  Reordering") setting Vector-Length times (number of SIMD elements) bits
    in Predicate Register rs3 as opposed to just Vector-Length bits.
  * Predicated Branches do not actually have an adjustment to the Program
    Counter, so all of bits 25 through 30 in every case are not needed.
@@ -454,6 +722,7 @@ C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
  Notes:
  
  * Bits 5 13 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
  * In both floating-point and integer cases there are four predication
    comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
    src1 and src2).
@@ -462,7 +731,8 @@ Notes:
  
  For full analysis of topological adaptation of RVV LOAD/STORE
  see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
  
  Revised LOAD:
  
@@ -565,226 +835,6 @@ of detecting early page / segmentation faults and adjusting the TLB
  in advance, accordingly: other strategies are explored in the Appendix
  Section "Virtual Memory Page Faults".
  
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors desire to focus on power, area or performance.  In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism".  They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (varible-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler.  Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward.  All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
-
-# CSRs <a name="csrs"></a>
-
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
-
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated.  The first entry is whether predication
-is enabled.  The second entry is whether the register index refers to a
-floating-point or an integer register.  The third entry is the index
-of that register which is to be predicated (if referred to).  The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6      | 5   | (4..0)  | (4..0)  |
-| ----- | -      | -   | ------- | ------- |
-| r0    | pren0  | i/f | regidx  | predidx |
-| r1    | pren1  | i/f | regidx  | predidx |
-| ..    | pren.. | i/f | regidx  | predidx |
-| r15   | pren15 | i/f | regidx  | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    fp_pred_enabled[32];
-    int_pred_enabled[32];
-    for (i = 0; i < 16; i++)
-       if CSRpred[i].pren:
-          idx = CSRpred[i].regidx
-          predidx = CSRpred[i].predidx
-          if CSRpred[i].type == 0: # integer
-            int_pred_enabled[idx] = 1
-            int_pred_reg[idx] = predidx
-          else:
-            fp_pred_enabled[idx] = 1
-            fp_pred_reg[idx] = predidx
-
-So when an operation is to be predicated, it is the internal state that
-is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i<vl; ++i)
-        if ([!]preg[p][i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store:
-
-    if type(iop) == INT:
-        pred_enabled = int_pred_enabled
-        preg = int_pred_reg[rd]
-    else:
-        pred_enabled = fp_pred_enabled
-        preg = fp_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-        if (preg_enabled[rd] && [!]preg[i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-## MAXVECTORDEPTH
-
-MAXVECTORDEPTH is the same concept as MVL in RVV.  However in Simple-V,
-given that its primary (base, unextended) purpose is for 3D, Video and
-other purposes (not requiring supercomputing capability), it makes sense
-to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
-and so on).
-
-The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers.  This keeps the implementation a little simpler.
-Note that RVV on top of Simple-V may choose to over-ride this decision.
-
-## Vector-length CSRs
-
-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N+1) through r(N+M-1), where M is the Vector Length".  Vector Lengths
-may be set to use up to 16 registers in the register file.
-
-One separate CSR table is needed for each of the integer and floating-point
-register files:
-
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0    | vlen0  |
-| r1    | vlen1  |
-| ..    | vlen.. |
-| r31   | vlen31 |
-
-An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
-whether a register, when referred to in any standard instruction, is
-implicitly to be treated as a vector.  A vector length of 1 indicates
-that it is to be treated as a scalar.  Vector lengths of 0 are reserved.
-
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
-
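As a rough illustration of the single-bit test, the following Python sketch derives a one-bit-per-register field from the vector-length table and uses it to classify a two-operand instruction. The function names are hypothetical, and it assumes a register counts as a vector only when its length is greater than 1, since a length of 1 denotes a scalar.

```python
def vector_bitfield(csr_vectorlen):
    """Build a bit-field with bit N set when register N is marked as a vector."""
    bits = 0
    for regno, vlen in enumerate(csr_vectorlen):
        if vlen > 1:  # assumption: length 1 means scalar, so it sets no bit
            bits |= 1 << regno
    return bits


def classify(bits, rs1, rs2):
    """Classify an operation from one bit test per source register."""
    v1 = (bits >> rs1) & 1
    v2 = (bits >> rs2) & 1
    if v1 and v2:
        return "vector-vector"
    if v1 or v2:
        return "vector-scalar"
    return "scalar-scalar"
```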
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth is specifically not set) it becomes:
-
-    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
-
-This is in contrast to RVV:
-
-    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
-
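The difference between the two formulas can be sketched in Python. The function names are illustrative only, and the sketch assumes that `rs2` supplies a requested length as a value, while `rs1` is a register number in Simple-V and a value in RVV.

```python
def vsetl_simple_v(csr_vectorlen, rs1, rs2_value, maxvectordepth):
    """Simple-V: rs1 is a register number used to index the vector-length CSRs."""
    return min(min(csr_vectorlen[rs1], maxvectordepth), rs2_value)


def vsetl_rvv(rs1_value, rs2_value, maxvectordepth):
    """RVV-style: the first operand is used directly as a length."""
    return min(min(rs1_value, maxvectordepth), rs2_value)
```

In the Simple-V form the length configured for the register acts as an additional cap, alongside MAXVECTORDEPTH and the requested length.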
-## Element (SIMD) bitwidth CSRs
-
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
-
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0    | vew0   |
-| r1    | vew1   |
-| ..    | vew..  |
-| r31   | vew31  |
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
-
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
-
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0):
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
-
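A hedged Python sketch of the bitwidth-aware calculation is shown here. The `BYTESTABLE` dictionary mirrors the vew table above; the function name, the `xlen` parameter and the choice to raise an error for reserved encodings or elements wider than the register are assumptions made for the example.

```python
# Bytes per element for each non-default vew encoding (from the vew table).
BYTESTABLE = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def vsetl_with_bitwidth(csr_bitwidth, csr_vectorlen, rs1, rs2_value,
                        maxvectordepth, xlen=64):
    """Compute the vector length, counting each SIMD element separately."""
    reg_bytes = xlen // 8  # would be FLEN // 8 for the floating-point file
    vew = csr_bitwidth[rs1]
    if vew == 0:
        bytes_per_elem = reg_bytes  # default: one element per register
    elif vew in BYTESTABLE:
        bytes_per_elem = BYTESTABLE[vew]
    else:
        raise ValueError("reserved vew encoding")
    if bytes_per_elem > reg_bytes:
        # Assumption: elements wider than a register are not handled here.
        raise ValueError("element wider than register")
    simdmult = reg_bytes // bytes_per_elem  # SIMD elements per register
    vlen = csr_vectorlen[rs1] * simdmult
    return min(min(vlen, maxvectordepth), rs2_value)
```

For instance, on a 64-bit register file a register marked as 4 registers long with 16-bit elements yields 4 x 4 = 16 individually predicable elements, before the MAXVECTORDEPTH and requested-length caps apply.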
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
-
  # Exceptions
  
  > What does an ADD of two different-sized vectors do in simple-V?
@@ -831,7 +881,7 @@ levels: Base and reserved future functionality.
    up to 16 (TBD) of either the floating-point or integer registers to
    be marked as "predicated" (key), and if so, which integer register to
    use as the predication mask (value).
-  
+
  **TODO**
  
  # Implementing P (renamed to DSP) on top of Simple-V
@@ -886,9 +936,9 @@ from actual (internal) parallel hardware.  It's an API in effect that's
  designed to be slotted in to an existing implementation (just after
  instruction decode) with minimum disruption and effort.
  
-* minus: the complexity of having to use register renames, OoO, VLIW,
-  register file cacheing, all of which has been done before but is a
-  pain
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file cacheing,
+  all of which has been done before but is a pain
  * plus: transparent re-use of existing opcodes as-is just indirectly
    saying "this register's now a vector" which
  * plus: means that future instructions also get to be inherently
@@ -997,7 +1047,7 @@ the question is asked "How can each of the proposals effectively implement
    a SIMD architecture where the ALU becomes responsible for the parallelism,
    Alt-RVP ALUs would likewise be so responsible... with *additional*
    (lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
    at least one dimension are avoided (architectural upgrades introducing
    128-bit then 256-bit then 512-bit variants of the exact same 64-bit
    SIMD block)
@@ -1133,10 +1183,10 @@ There is, in the standard Conditional Branch instruction, more than
  adequate space to interpret it in a similar fashion:
  
  [[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   offset[12,10:5]  ||    src2  |    src1   |  BEQ         |    offset[11,4:1]      || BRANCH      |
+31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7       | 6....0 |
+imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
+ 1      | 6          | 5    | 5    | 3      | 4        | 1       |   7    |
+   offset[12,10:5]  || src2 | src1 | BEQ    | offset[11,4:1]    || BRANCH |
  """]]
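To make the field layout in the table concrete, here is a small Python sketch that splits a 32-bit conditional-branch instruction word into the fields shown. The function name and the dictionary keys are illustrative; the bit positions are those of the table.

```python
def decode_branch(insn):
    """Split a 32-bit branch instruction word into its fields."""
    opcode = insn & 0x7F            # bits 6..0
    imm11 = (insn >> 7) & 0x1       # bit 7
    imm4_1 = (insn >> 8) & 0xF      # bits 11..8
    funct3 = (insn >> 12) & 0x7     # bits 14..12
    rs1 = (insn >> 15) & 0x1F       # bits 19..15
    rs2 = (insn >> 20) & 0x1F       # bits 24..20
    imm10_5 = (insn >> 25) & 0x3F   # bits 30..25
    imm12 = (insn >> 31) & 0x1      # bit 31

    # Reassemble the 13-bit offset (bit 0 is implicitly zero) and sign-extend.
    offset = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1)
    if imm12:
        offset -= 1 << 13
    return {"opcode": opcode, "funct3": funct3,
            "rs1": rs1, "rs2": rs2, "offset": offset}
```

Any reinterpretation of these bits has to carve its new fields out of the same positions.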
  
  This would become:
@@ -1156,19 +1206,19 @@ not only to add in a second source register, but also use some of the bits as
  a predication target as well.
  
  [[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |         imm           |   op   |
-      3      |         3          |    3     |           5           |   2    |
-   C.BEQZ    |   offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
+funct3 | imm           | rs10  | imm               | op     |
+3      | 3             | 3     | 5                 | 2      |
+C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
  """]]
  
  Now uses the CS format:
  
  [[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |  imm   |              |   op   |
-      3      |         3          |    3     |  2     |  3           |   2    |
-   C.BEQZ    |   predicate rs3    |   src1   |  I/F B | src2         |   C1   |
+15..13 | 12 .  10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm      | rs10   | imm    |      | op     |
+3      | 3        | 3      | 2      | 3    | 2      |
+C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
  """]]
  
  Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1271,16 +1321,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
  
  vew may be one of the following (giving a table "bytestable", used below):
  
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
  
  Pseudocode for vector length taking CSR SIMD-bitwidth into account:
  
@@ -1398,7 +1448,7 @@ So the question boils down to:
  Whilst the above may seem to be severe minuses, there are some strong
  pluses:
  
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
  * Smaller reduction of P's opcode space: around 10%.
  * The potential to use Compressed instructions in both Vector and SIMD
    due to the overloading of register meaning (implicit vectorisation,
@@ -1576,9 +1626,9 @@ would still be there (and stalled).  hmmm.
      >
      > Thrown away.
  
-discussion then led to the question of OoO architectures 
+discussion then led to the question of OoO architectures
  
  
-> The costs of the imprecise-exception model are greater than the benefit. 
+> The costs of the imprecise-exception model are greater than the benefit.
  > Software doesn't want to cope with it.  It's hard to debug.  You can't
  > migrate state between different microarchitectures--unless you force all
  > implementations to support the same imprecise-exception model, which would
@@ -1625,6 +1675,8 @@ ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
  Destruction of destination indices requires a copy of the entire vector
  in advance to avoid.
  
+TBD: floating-point compare and other exception handling
+
  # References
  
  * SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>