(no commit message)

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 629f1895e94dd0aefaf361fee162c237c84f9193..1f479badad0209ee49ebd85864e2376ee9613fb4 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,9 +1,5 @@
  # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
  
-[[!toc ]]
-
-# Summary
-
  Key insight: Simple-V is intended as an abstraction layer to provide
  a consistent "API" to parallelisation of existing *and future* operations.
  *Actual* internal hardware-level parallelism is *not* required, such
@@ -16,6 +12,8 @@ of Out-of-order restructuring (including parallel ALU lanes) or VLIW
  implementations, or SIMD, or anything else, would then benefit *if*
  Simple-V was added on top.
  
+[[!toc ]]
+
  # Introduction
  
  This proposal exists so as to be able to satisfy several disparate
@@ -52,35 +50,21 @@ Additionally it makes sense to *split out* the parallelism inherent within
  each of P and V, and to see if each of P and V then, in *combination* with
  a "best-of-both" parallelism extension, could be added on *on top* of
  this proposal, to topologically provide the exact same functionality of
-each of P and V.
+each of P and V.  Each of P and V then can focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
  
  Furthermore, an additional goal of this proposal is to reduce the number
  of opcodes utilised by each of P and V as they currently stand, leveraging
  existing RISC-V opcodes where possible, and also potentially allowing
  P and V to make use of Compressed Instructions as a result.
  
-**TODO**: reword this to better suit this document:
-
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted.  I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
-  numbers) and unusual operations of V to create a single-issue "Esoteric
-  Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
-  of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
-  apply across *all* Extensions, whether those be DSP, M, Base, V, P -
-  everything.
-
  **TODO**: propose overflow registers be actually one of the integer regs
  (flowing to multiple regs).
  
  **TODO**: propose "mask" (predication) registers likewise.  combination with
-standard RV instructions and overflow registers extremely powerful
+standard RV instructions and overflow registers extremely powerful, see
+Aspex ASP.
  
  # Analysis and discussion of Vector vs SIMD
  
@@ -109,6 +93,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable
  for general-purpose computation, and in the context of developing a
  general-purpose ISA, is never going to satisfy 100 percent of implementors.
  
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
  That basically leaves "variable-length vector" as the clear *general-purpose*
  winner, at least in terms of greatly simplifying the instruction set,
  reducing the number of instructions required for any given task, and thus
@@ -144,13 +135,16 @@ integer (and floating point) of various sizes is automatically inferred
  due to "type tagging" that is set with a special instruction.  A register
  will be *specifically* marked as "16-bit Floating-Point" and, if added
  to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
  to be explicitly done with its own separate instruction.
  
  However, implicit type-conversion is not only quite burdensome to
  implement (explosion of inferred type-to-type conversion) but also is
  never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration.  Each new type results in
+an increased O(N^2) conversion space that, as anyone who has examined
+python's source code (which has built-in polymorphic type-conversion),
+knows that the task is more complex than it first seems.
  
  Overall, type-conversion is generally best to leave to explicit
  type-conversion instructions, or in definite specific use-cases left to
@@ -228,87 +222,20 @@ condition-codes or predication.  By adding a CSR it becomes possible
  to also tag certain registers as "predicated if referenced as a destination".
  Example:
  
-    // in future operations if r0 is the destination use r5 as
+    // in future operations from now on, if r0 is the destination use r5 as
      // the PREDICATION register
-    IMPLICICSRPREDICATE r0, r5
+    SET_IMPLICIT_CSRPREDICATE r0, r5
      // store the compares in r5 as the PREDICATION register
      CMPEQ8 r5, r1, r2
      // r0 is used here.  ah ha!  that means it's predicated using r5!
      ADD8 r0, r1, r3
  
-With enough registers (and there are enough registers) some fairly
+With enough registers (and in RISC-V there are enough registers) some fairly
  complex predication can be set up and yet still execute without significant
  stalling, even in a simple non-superscalar architecture.
  
-### Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication.  However, with the base ISA having been designed explictly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-bcomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set?  This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLT would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*!  The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not.  All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   offset[12,10:5]  ||    src2  |    src1   |  BEQ         |    offset[11,4:1]      || BRANCH      |
-"""]]
-
-This would become:
-
-[[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   reserved         ||    src2  |    src1   |  BEQ         |   predicate rs3        || BRANCH      |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is presently
-the "immediate offset" field to reinterpret that to add in not only a bit
-field to distinguish between floating-point compare and integer compare,
-not only to add in a second source register, but also use some of the bits as
-a predication target as well.
-
-[[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |         imm           |   op   |
-      3      |         3          |    3     |           5           |   2    |
-   C.BEQZ    |   offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
-"""]]
-
-Now uses the CS format:
-
-[[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |  imm   |              |   op   |
-      3      |         3          |    3     |  2     |  3           |   2    |
-   C.BEQZ    |   predicate rs3    |   src1   |  I/F B | src2         |   C1   |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct comparison operators.
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see Appendix)
  
  ## Conclusions
  
@@ -341,13 +268,215 @@ requirements would therefore seem to be a logical thing to do.
  
  # Instruction Format
  
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought.  V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width.  Lots to
-look at and investigate!*
+The instruction format for Simple-V does not actually have *any* compare
+operations, *any* arithmetic, floating point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on implicit CSR configurations for both vector length and
+bitwidth.  This includes Compressed instructions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+  outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+  for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead.  Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities for to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit and having the benefit of
+being explicit.*
+
+## Branch Instruction:
+
+[[!table  data="""
+31      | 30 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
+imm[12] | imm[10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+1       | 6        | 5        | 5     | 3      | 4             | 1  | 7       |
+I/F     | reserved | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
+0       | reserved | src2     | src1  | 000    | predicate rs3     || BEQ     |
+0       | reserved | src2     | src1  | 001    | predicate rs3     || BNE     |
+0       | reserved | src2     | src1  | 010    | predicate rs3     || rsvd    |
+0       | reserved | src2     | src1  | 011    | predicate rs3     || rsvd    |
+0       | reserved | src2     | src1  | 100    | predicate rs3     || BLE     |
+0       | reserved | src2     | src1  | 101    | predicate rs3     || BGE     |
+0       | reserved | src2     | src1  | 110    | predicate rs3     || BLTU    |
+0       | reserved | src2     | src1  | 111    | predicate rs3     || BGEU    |
+1       | reserved | src2     | src1  | 000    | predicate rs3     || FEQ     |
+1       | reserved | src2     | src1  | 001    | predicate rs3     || FNE     |
+1       | reserved | src2     | src1  | 010    | predicate rs3     || rsvd    |
+1       | reserved | src2     | src1  | 011    | predicate rs3     || rsvd    |
+1       | reserved | src2     | src1  | 100    | predicate rs3     || FLT     |
+1       | reserved | src2     | src1  | 101    | predicate rs3     || FLE     |
+1       | reserved | src2     | src1  | 110    | predicate rs3     || rsvd    |
+1       | reserved | src2     | src1  | 111    | predicate rs3     || rsvd    |
+"""]]
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+                           s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+    if I/F == INT: # integer type cmp
+        pred_enabled = int_pred_enabled # TODO: exception if not set!
+        preg = int_pred_reg[rd]
+    else:
+        pred_enabled = fp_pred_enabled # TODO: exception if not set!
+        preg = fp_pred_reg[rd]
+
+    s1 = CSRvectorlen[src1] > 1;
+    s2 = CSRvectorlen[src2] > 1;
+    for (int i=0; i<vl; ++i)
+       preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+                          s2 ? reg[src2+i] : reg[src2]);
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+  Reordering") setting Vector-Length * (number of SIMD elements) bits
+  in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+  Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+  be put to good use if there is a suitable use-case.
+* FEQ and FNE (and BEQ and BNE) are included in order to save one
+  instruction having to invert the resultant predicate bitfield.
+  FLT and FLE may be inverted to FGT and FGE if needed by swapping
+  src1 and src2 (likewise the integer counterparts).
+
+## Compressed Branch Instruction:
+
+[[!table  data="""
+15..13 | 12...10  | 9..7 | 6..5  | 4..2 | 1..0 | name |
+funct3 | imm      | rs10 | imm   |      | op   |      |
+3      | 3        | 3    | 2     |  3   | 2    |      |
+C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
+110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
+111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
+110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
+111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5 13 14 and 15 make up the comparator type
+* In both floating-point and integer cases there are four predication
+  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+  src1 and src2).
+
+## LOAD / STORE Instructions
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction.
+
+Revised LOAD:
+
+[[!table  data="""
+31 | 30 | 29 25 | 24    20 | 19 15 | 14   12 | 11 7 | 6    0 |
+imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
+1  | 1  |  5    | 5        | 5     | 3       | 5    | 7      |
+?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format.  Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+  (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction, it's
+  implicit based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+  is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth):
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+
+    pred_enabled = int_pred_enabled
+    preg = int_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+      if (preg_enabled[rd] && [!]preg[i])
+        for (int j=0; j<seglen+1; j++)
+        {
+          if CSRvectorised[rs2])
+             offs = vreg[rs2][i]
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
+
+A similar instruction exists for STORE, with identical topological
+translation of all features.  **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, where bits 2-4 are
+a src register instead of dest:
+
+[[!table  data="""
+15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
+funct3 | imm         | rs10   | imm         | rd0  | op   |
+3      | 3           | 3      | 2           | 3    | 2    |
+C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled.  In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
+page 76, "For virtual memory systems some data accesses could be resident
+in physical memory and some not".  The interesting question then arises:
+how does RVV deal with the exact same scenario?
  
  # Note on implementation of parallelism
  
@@ -388,14 +517,15 @@ implementation efforts, without "extra baggage".
  # CSRs <a name="csrs"></a>
  
  There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has precedent
-in the setting of MISA to enable / disable extensions).
+decode phase to re-interpret standard RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
  
  * Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
  * Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
  * Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
  * Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (key-value store)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
  
  Notes:
  
@@ -481,6 +611,7 @@ and so on).
  The reason for setting this limit is so that predication registers, when
  marked as such, may fit into a single register as opposed to fanning out
  over several registers.  This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
  
  ## Vector-length CSRs
  
@@ -565,40 +696,7 @@ The reason for multiplying the vector length by the number of SIMD elements
  predicated.
  
  An example of how to subdivide the register file when bitwidth != default
-is given in the section "Virtual Register Reordering".
-
-# Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
-          if (CSRpredicate[predr][i]) # i *think* this is right...
-             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
-          # now increment the idxs
-          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
-             id += 1;
-          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
-             irs1 += 1;
-          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
-             irs2 += 1;
-    }
-
-# V-Extension to Simple-V Comparative Analysis
-
-This section has been moved to its own page [[v_comparative_analysis]]
-
-# P-Ext ISA
-
-This section has been moved to its own page [[p_comparative_analysis]]
+is given in the section "Bitwidth Virtual Register Reordering".
  
  # Exceptions
  
@@ -852,9 +950,116 @@ the question is asked "How can each of the proposals effectively implement
    (caveat: anything not specified drops through to software-emulation / traps)
  * TODO
  
-# Register reordering <a name="register_reordering"></a>
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
  
-## Register File
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+    register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+       /* note that this is ADD, not PADD */
+       int i, id, irs1, irs2;
+       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+       # also destination makes no sense as a scalar but what the hell...
+       for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+          if (CSRpredicate[predr][i]) # i *think* this is right...
+             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+          # now increment the idxs
+          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
+             id += 1;
+          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+             irs1 += 1;
+          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+             irs2 += 1;
+    }
+
+## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication.  However, with the base ISA having been designed explictly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+bcomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set?  This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLT would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*!  The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not.  All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table  data="""
+  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
+imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
+ 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
+   offset[12,10:5]  ||    src2  |    src1   |  BEQ         |    offset[11,4:1]      || BRANCH      |
+"""]]
+
+This would become:
+
+[[!table  data="""
+31      | 30 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
+imm[12] | imm[10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
+1       | 6        | 5        | 5     | 3      | 4             | 1  | 7       |
+reserved          || src2     | src1  | BEQ    | predicate rs3     || BRANCH  |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret that to add in not only a bit
+field to distinguish between floating-point compare and integer compare,
+not only to add in a second source register, but also use some of the bits as
+a predication target as well.
+
+[[!table  data="""
+15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+   funct3    |       imm          |   rs10   |         imm           |   op   |
+      3      |         3          |    3     |           5           |   2    |
+   C.BEQZ    |   offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
+"""]]
+
+Now uses the CS format:
+
+[[!table  data="""
+15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+   funct3    |       imm          |   rs10   |  imm   |              |   op   |
+      3      |         3          |    3     |  2     |  3           |   2    |
+   C.BEQZ    |   predicate rs3    |   src1   |  I/F B | src2         |   C1   |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators.  In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+
+## Register reordering <a name="register_reordering"></a>
+
+### Register File
  
  | Reg Num | Bits |
  | ------- | ---- |
@@ -869,7 +1074,7 @@ the question is asked "How can each of the proposals effectively implement
  | .. | (32..0) |
  | r31| (32..0) |
  
-## Vectorised CSR
+### Vectorised CSR
  
  May not be an actual CSR: may be generated from Vector Length CSR:
  single-bit is less burdensome on instruction decode phase.
@@ -878,7 +1083,7 @@ single-bit is less burdensome on instruction decode phase.
  | - | - | - | - | - | - | - | - |
  | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
  
-## Vector Length CSR
+### Vector Length CSR
  
  | Reg Num | (3..0) |
  | ------- | ---- |
@@ -891,7 +1096,7 @@ single-bit is less burdensome on instruction decode phase.
  | r6 | 0 |
  | r7 | 1 |
  
-## Virtual Register Reordering:
+### Virtual Register Reordering
  
  This example assumes the above Vector Length CSR table
  
@@ -903,8 +1108,10 @@ This example assumes the above Vector Length CSR table
  | r4 | (32..0) | (32..0) | (32..0) |
  | r7 | (32..0) |
  
+### Bitwidth Virtual Register Reordering
+
  This example goes a little further and illustrates the effect that a
-bitwidth CSR has been set on a register
+bitwidth CSR has been set on a register.  Preconditions:
  
  * RV32 assumed
  * CSRintbitwidth[2] = 010 # integer r2 is 16-bit
@@ -916,6 +1123,7 @@ This is interpreted as follows:
  * Given that the context is RV32, ELEN=32.
  * With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
  * Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length 5 therefore the last "element" is skipped
  
  So when using an operation that uses r2 as a source (or destination)
  the operation is carried out as follows:
@@ -939,7 +1147,39 @@ operations carried out 32-bits at a time is perfectly acceptable, as is
  Regardless of the internal parallelism choice, *predication must
  still be respected*, making Simple-V in effect the "consistent public API".
  
-## Example Instruction translation: <a name="example_translation"></a>
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default  |
+| 001 | 8        |
+| 010 | 16       |
+| 011 | 32       |
+| 100 | 64       |
+| 101 | 128      |
+| 110 | rsvd     |
+| 111 | rsvd     |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)  # integer-div rounded down
+        byteidx = i % simdmult        # integer-remainder
+        return rnum + regidx,         # actual real register
+               byteidx * 8,           # low
+               byteidx * 8 + (vew-1), # high
+
+### Example Instruction translation: <a name="example_translation"></a>
  
  Instructions "ADD r2 r4 r4" would result in three instructions being
  generated and placed into the FILO:
@@ -948,7 +1188,7 @@ generated and placed into the FILO:
  * ADD r2 r5 r5
  * ADD r2 r6 r6
  
-## Insights
+### Insights
  
  SIMD register file splitting still to consider.  For RV64, benefits of doubling
  (quadrupling in the case of Half-Precision IEEE754 FP) the apparent
@@ -969,7 +1209,7 @@ on caches).  Interestingly we observe then that Simple-V is about
  of underlying hardware is an implementor-choice that could just as
  equally be applied *without* Simple-V even being implemented.
  
-# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
+## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
  
  It could indeed have been logically deduced (or expected), that there
  would be additional decode latency in this proposal, because if
@@ -1057,9 +1297,7 @@ pluses:
    parallel ALUs) is only equal to one ("virtual" parallelism), or is
    greater than one, should not be underestimated.
  
-# Appendix
-
-# Reducing Register Bank porting
+## Reducing Register Bank porting
  
  This looks quite reasonable.
  <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
@@ -1076,8 +1314,6 @@ The nice thing about a vector architecture is that you *know* that
  to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
  by *introducing* deliberate latency into the execution phase.
  
-
-
  # References
  
  * SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>