whitespace cleanup

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index c0a8f8b24c5d459e26a98e0c1a1f2e4a3d2e4e91..5d4e67ab4afb3c193f6383a9b805dc2e1f0aee01 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -2,6 +2,22 @@
  
  [[!toc ]]
  
+# Summary
+
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+# Introduction
+
  This proposal exists so as to be able to satisfy several disparate
  requirements: power-conscious, area-conscious, and performance-conscious
  designs all pull an ISA and its implementation in different conflicting
@@ -11,7 +27,7 @@ Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
  whilst each extremely powerful in their own right and clearly desirable,
  are also:
  
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
    so need work to adapt to the RISC-V ethos and paradigm
  * Are sufficiently large so as to make adoption (and exploration for
    analysis and review purposes) prohibitively expensive
@@ -108,19 +124,19 @@ Implementation of the latter:
  
  Operation involving (referring to) register M:
  
-> bitwidth = default # default for opcode?
-> vectorlen = 1 # scalar
-> 
-> for (o = 0, o < 2, o++)
->   if (CSR-Vector_registernum[o] == M)
->       bitwidth = CSR-Vector_bitwidth[o]
->       vectorlen = CSR-Vector_len[o]
->       break
+    bitwidth = default # default for opcode?
+    vectorlen = 1 # scalar
+    
+    for (o = 0, o < 2, o++)
+      if (CSR-Vector_registernum[o] == M)
+          bitwidth = CSR-Vector_bitwidth[o]
+          vectorlen = CSR-Vector_len[o]
+          break
  
  and for the former it would simply be:
  
-> bitwidth = CSR-Vector_bitwidth[M]
-> vectorlen = CSR-Vector_len[M]
+    bitwidth = CSR-Vector_bitwidth[M]
+    vectorlen = CSR-Vector_len[M]
  
  Alternatives:
  
@@ -154,19 +170,19 @@ which would mean:
  
  LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
  
-> offs = 0
-> stride = 1
-> vector-len = CSR-Vector-length register N
->
-> for (o = 0, o < 2, o++)
->   if (CSR-Offset register o == M)
->       offs = CSR-Offset amount register o
->       if CSR-Offset Stride-mode == offset:
->           stride = ldoffs
->       break
->
-> for (i = 0, i < vector-len; i++)
->   r[N+i] = mem[(offs*i + r[M+i])*stride]
+    offs = 0
+    stride = 1
+    vector-len = CSR-Vector-length register N
+  
+    for (o = 0, o < 2, o++)
+      if (CSR-Offset register o == M)
+          offs = CSR-Offset amount register o
+          if CSR-Offset Stride-mode == offset:
+              stride = ldoffs
+          break
+   
+    for (i = 0, i < vector-len; i++)
+      r[N+i] = mem[(offs*i + r[M+i])*stride]
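
As a rough executable model of the LOAD pseudo-code above, the following Python sketch follows the same address calculation step by step. The data layout (a list for the register file, a list for memory, dictionaries for the Offset CSR entries) is an assumption chosen for readability.

```python
def vector_load(regs, mem, n, m, ldoffs, vector_len, offset_csrs):
    """Model of LOAD rN, ldoffs(rM) with the Offset CSR lookup (sketch only).

    offset_csrs is a hypothetical list of dicts with keys
    'reg', 'amount' and 'stride_mode'.
    """
    offs, stride = 0, 1
    for entry in offset_csrs:
        if entry["reg"] == m:
            offs = entry["amount"]
            if entry["stride_mode"] == "offset":
                stride = ldoffs
            break
    for i in range(vector_len):
        # same expression as the pseudo-code: mem[(offs*i + r[M+i])*stride]
        regs[n + i] = mem[(offs * i + regs[m + i]) * stride]
```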
  
  # Analysis and discussion of Vector vs SIMD
  
@@ -249,14 +265,21 @@ contains an extremely interesting feature: zero-overhead loops.  This
  proposal would basically allow an inner loop of instructions to be
  repeated indefinitely, a fixed number of times.
  
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
  implementation*.  Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
  
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However, when looking at much more comprehensive schemes such as
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, tending to suggest that ZOLC may be
+better off being proposed as an entirely separate Extension.
  
  ## Mask and Tagging (Predication)
  
@@ -307,13 +330,13 @@ condition-codes or predication.  By adding a CSR it becomes possible
  to also tag certain registers as "predicated if referenced as a destination".
  Example:
  
-> // in future operations if r0 is the destination use r5 as 
-> // the PREDICATION register
-> IMPLICICSRPREDICATE r0, r5
-> // store the compares in r5 as the PREDICATION register 
-> CMPEQ8 r5, r1, r2
-> // r0 is used here.  ah ha!  that means it's predicated using r5! 
-> ADD8 r0, r1, r3
+    // in future operations if r0 is the destination use r5 as 
+    // the PREDICATION register
+    IMPLICICSRPREDICATE r0, r5
+    // store the compares in r5 as the PREDICATION register 
+    CMPEQ8 r5, r1, r2
+    // r0 is used here.  ah ha!  that means it's predicated using r5! 
+    ADD8 r0, r1, r3
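
The effect of the three instructions above can be illustrated with a small Python model. Everything here is hypothetical: registers are modelled as lists of 8-bit elements, and the helper names simply mirror the mnemonics in the example.

```python
# Sketch only: maps a destination register to the register holding its predicate.
predicate_for = {}


def implicit_csr_predicate(rd, pred_reg):
    """Tag rd so that later writes to it are predicated by pred_reg."""
    predicate_for[rd] = pred_reg


def cmpeq8(regs, rd, rs1, rs2):
    """Element-wise 8-bit compare; stores 1 where elements match, else 0."""
    regs[rd] = [int(a == b) for a, b in zip(regs[rs1], regs[rs2])]


def add8(regs, rd, rs1, rs2):
    """Element-wise 8-bit add, skipping elements whose predicate bit is 0."""
    pred_reg = predicate_for.get(rd)
    for i, (a, b) in enumerate(zip(regs[rs1], regs[rs2])):
        if pred_reg is None or regs[pred_reg][i]:
            regs[rd][i] = (a + b) & 0xFF
```

Elements of the destination whose predicate bit is clear are left untouched, which is the behaviour the comments in the example imply.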
  
  With enough registers (and there are enough registers) some fairly
  complex predication can be set up and yet still execute without significant
@@ -399,19 +422,24 @@ follows:
  * Fixed vs variable parallelism: <b>variable</b>
  * Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
  * Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>Complex and needs further thought</b>
+* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Tag or no-tag: <b>Complex but highly beneficial</b>
+
+In particular:
  
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD.  Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
+* variable-length vectors came out on top because of the high setup, teardown
+  and corner-cases associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+  former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simplify implementations
+  by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+  potentially 100% occupied in a single-issue in-order implementation
+  i.e. *without* requiring a super-scalar or out-of-order architecture,
+  but doing a proper, full job (ZOLC) is an entirely different matter.
  
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
+Constructing a SIMD/Simple-Vector proposal based around four of these five
+requirements would therefore seem to be a logical thing to do.
  
  # Instruction Format
  
@@ -459,6 +487,31 @@ instructions to deal with corner-cases is thus avoided, and implementors
  get to choose precisely where to focus and target the benefits of their
  implementation efforts, without "extra baggage".
  
+# Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+    register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+       /* note that this is ADD, not PADD */
+       int i, id, irs1, irs2;
+       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+       # also destination makes no sense as a scalar but what the hell...
+       for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+       {
+          if (CSRpredicate[predr][i]) # i *think* this is right...
+             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+          # now increment the idxs
+          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
+             id += 1;
+          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+             irs1 += 1;
+          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+             irs2 += 1;
+       }
+    }
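
For readers who want to experiment, the pseudo-code above translates fairly directly into Python. This is a sketch under stated assumptions: XLEN is fixed at 32, the CSRs are modelled as plain lists and a dictionary, and the vector-length equality checks are ignored just as in the original.

```python
XLEN = 32  # assumed register width for this sketch
MASK = (1 << XLEN) - 1


def op_add(x, rd, rs1, rs2, predr, vectorlen, predicate, is_vectorised):
    """Model of the vectorised ADD: scalar operands keep a fixed index."""
    d = s1 = s2 = 0
    for i in range(vectorlen[rd]):
        if predicate[predr][i]:
            x[rd + d] = (x[rs1 + s1] + x[rs2 + s2]) & MASK
        # only registers tagged as vectors advance to the next element
        if is_vectorised[rd]:
            d += 1
        if is_vectorised[rs1]:
            s1 += 1
        if is_vectorised[rs2]:
            s2 += 1
```

With rd and rs1 tagged as vectors and rs2 left scalar, the same scalar value is added to every enabled element, which is the vector / scalar case named in the heading.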
+
  # V-Extension to Simple-V Comparative Analysis
  
  This section covers the ways in which Simple-V is comparable
@@ -604,10 +657,137 @@ Lots to be discussed.
  
  ## 17.18 Vector Load/Store Instructions
  
-These may not have a direct equivalent in Simple-V, except if mask/tagging
-is to be deployed.
+The Vector Load/Store instructions as proposed in V are extremely powerful
+and can be used for reordering and regular restructuring.
+
+Vector Load:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];
+
+Store:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];
+
+Indexed Load:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];
+
+Indexed Store:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];
+
+Keeping these instructions as-is for Simple-V is highly recommended.
+However: one of the goals of this Extension is to retro-fit (re-use)
+existing RV Load/Store:
+
+[[!table  data="""
+31                  20 | 19      15 | 14    12 | 11           7 | 6         0 |
+       imm[11:0]       |     rs1    |  funct3  |       rd       |    opcode |
+            12         |      5     |    3     |        5       |      7 |
+       offset[11:0]    |    base    |  width   |      dest      |    LOAD |
+"""]]
+
+[[!table  data="""
+31          25 | 24    20 | 19     15 | 14    12 | 11          7 | 6         0 |
+ imm[11:5]     |   rs2    |    rs1    |  funct3  |   imm[4:0]    |    opcode |
+      7        |    5     |     5     |    3     |       5       |      7 |
+ offset[11:5]  |   src    |   base    |  width   |  offset[4:0]  |   STORE |
+"""]]
+
+The RV32 instruction opcodes as follows:
+
+[[!table  data="""
+31 28  27 | 26 25 | 24  20 |19  15  |14| 13 12 | 11   7 | 6     0 | op  |
+imm[4:0]  | 00    | 00000  |    rs1 | 1| m     | vd     | 0000111 | VLD |
+imm[4:0]  | 01    |   rs2  |    rs1 | 1| m     | vd     | 0000111 | VLDS|
+imm[4:0]  | 11    |   vs2  |    rs1 | 1| m     | vd     | 0000111 | VLDX|
+vs3       | 00    | 00000  |    rs1 |1 | m     |imm[4:0]| 0100111 |VST  |
+vs3       | 01    | rs2    |    rs1 |1 | m     |imm[4:0]| 0100111 |VSTS |
+vs3       | 11    | vs2    |    rs1 |1 | m     |imm[4:0]| 0100111 |VSTX |
+"""]]
+
+Conversion on LOAD as follows:
  
-To be discussed.
+* rd or rs1 are CSR-vectorised indicating "Vector Mode"
+* rd equivalent to vd
+* rs1 equivalent to rs1
+* imm[4:0] from RV format (24..20) is same
+* imm[9:5] from RV format (29..25) is rs2 (rs2=00000 for VLD)
+* imm[11:10] from RV format (31..30) is opcode (VLD, VLDS, VLDX)
+* width from RV format (14..12) is same (width and zero/sign extend)
+
+[[!table  data="""
+31 30 | 29 25 | 24    20 | 19 15 | 14  12 | 11      7 | 6    0 |
+imm[11:0]              ||| rs1   | funct3 | rd        | opcode |
+2     | 5     | 5        | 5     | 3      | 5         | 7      |
+00    | 00000 | imm[4:0] | base  | width  | dest      | LOAD   |
+01    | rs2   | imm[4:0] | base  | width  | dest      | LOAD.S |
+11    | rs2   | imm[4:0] | base  | width  | dest      | LOAD.X |
+"""]]
+
+Similar conversion on STORE as follows:
+
+[[!table  data="""
+31 30 | 29  25 | 24   20 | 19 15 | 14  12 | 11      7 | 6    0 |
+imm[11:0]              ||| rs1   | funct3 | rd        | opcode |
+2     | 5      | 5       | 5     | 3      | 5         | 7      |
+00    | 00000  | src     | base  | width  | offs[4:0] | STORE   |
+01    | rs3    | src     | base  | width  | offs[4:0] | STORE.S |
+11    | rs3    | src     | base  | width  | offs[4:0] | STORE.X |
+"""]]
+
+Notes:
+
+* Predication CSR-marking register is not explicitly shown in instruction
+* In both LOAD and STORE, it is now possible to mark rs2 (or rs3) as a vector.
+* That in turn means that Indexed Load need not have an explicit opcode
+* That in turn means that bit 30 may indicate "stride" and bit 31 is free
+
+Revised LOAD:
+
+[[!table  data="""
+31 | 30 | 29  25      |  24   20 | 19    15 | 14    12 | 11   7 | 6     0 |
+   imm[11:0]                  ||||    rs1   |  funct3  |   rd   |  opcode |
+ 1 | 1  |  5          |    5     |     5    |    3     |    5   |     7   |
+ ? | s  |  rs2        | imm[4:0] |   base   |  width   |  dest  |  LOAD   |
+"""]]
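
To make the field boundaries of the revised LOAD layout concrete, the following Python sketch packs and unpacks a 32-bit word using the bit positions from the table. The field names in the dictionary are illustrative, and the opcode value used in any test data is the standard RV LOAD opcode (0000011).

```python
def encode_revised_load(stride, rs2, imm, base, width, dest, opcode, reserved=0):
    """Pack the revised LOAD fields into a 32-bit word (sketch only)."""
    return ((reserved & 0x1) << 31 | (stride & 0x1) << 30 |
            (rs2 & 0x1F) << 25 | (imm & 0x1F) << 20 |
            (base & 0x1F) << 15 | (width & 0x7) << 12 |
            (dest & 0x1F) << 7 | (opcode & 0x7F))


def decode_revised_load(insn):
    """Unpack a 32-bit word into the revised LOAD fields (sketch only)."""
    return {
        "reserved": (insn >> 31) & 0x1,  # bit 31: currently free
        "stride": (insn >> 30) & 0x1,    # bit 30: "s"
        "rs2": (insn >> 25) & 0x1F,      # bits 29..25
        "imm": (insn >> 20) & 0x1F,      # bits 24..20
        "base": (insn >> 15) & 0x1F,     # bits 19..15
        "width": (insn >> 12) & 0x7,     # bits 14..12
        "dest": (insn >> 7) & 0x1F,      # bits 11..7
        "opcode": insn & 0x7F,           # bits 6..0
    }
```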
+
+Where in turn the pseudo-code may now combine the two:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+        {
+          if (CSRvectorised[rs2])
+             offs = vreg[rs2][i]
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Notes:
+
+* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
+* There may be more sophisticated variants involving the 31st bit, however
+  it would be nice to reserve that bit for post-increment of address registers
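
The combined strided / indexed LOAD can be sketched in Python as follows. Assumptions made for illustration: memory is a flat list indexed by address, vector registers are lists of lists, the predicate is passed in as a plain list, and an indexed access is selected by passing `index_vec` (standing in for a vectorised rs2).

```python
def combined_load(mem, vreg, base_addr, vd, vl, seglen, elsize, pred,
                  stride=None, index_vec=None):
    """Model of the combined strided / indexed segment load (sketch only)."""
    if stride is None:
        stride = elsize  # unit-strided
    for i in range(vl):
        if not pred[i]:
            continue
        for j in range(seglen + 1):
            if index_vec is not None:
                offs = index_vec[i]               # rs2 is vectorised
            else:
                offs = i * (seglen + 1) * stride  # constant-strided
            # j is scaled by stride in both cases, as noted above
            vreg[vd + j][i] = mem[base_addr + offs + j * stride]
```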
  
  ## 17.19 Vector Register Gather
  
@@ -739,20 +919,20 @@ auto-incrementing the two address registers a2 and a3, as well as
  providing a means to interact between the zero-overhead loop and the
  vsetvl instruction.  a sort-of pseudo-assembly of that would look like:
  
-> # a2 to be auto-incremented by t0*4
-> zero-overhead-set-auto-increment a2, t0, 4
-> # a2 to be auto-incremented by t0*4
-> zero-overhead-set-auto-increment a3, t0, 4
-> zero-overhead-set-loop-terminator-condition a0 zero
-> zero-overhead-set-start-end stripmine, stripmine+endoffset
-> stripmine:
-> vsetvl t0,a0
-> vlw v0, a2
-> vlw v1, a3
-> vfma v1, a1, v0, v1
-> vsw v1, a3
-> sub a0, a0, t0
->stripmine+endoffset:
+    # a2 to be auto-incremented by t0 times 4
+    zero-overhead-set-auto-increment a2, t0, 4
+    # a3 to be auto-incremented by t0 times 4
+    zero-overhead-set-auto-increment a3, t0, 4
+    zero-overhead-set-loop-terminator-condition a0 zero
+    zero-overhead-set-start-end stripmine, stripmine+endoffset
+    stripmine:
+    vsetvl t0,a0
+    vlw v0, a2
+    vlw v1, a3
+    vfma v1, a1, v0, v1
+    vsw v1, a3
+    sub a0, a0, t0
+    stripmine+endoffset:
  
  the question is: would something like this even be desirable?  it's a
  variant of auto-increment [1].  last time i saw any hint of auto-increment
@@ -944,6 +1124,229 @@ translates effectively to:
  * Throw an exception.  Whether that actually results in spawning threads
    as part of the trap-handling remains to be seen.
  
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes to multiple independent registers) means
+  that SRAM or 2R1W can (potentially) be used for the entire regfile.
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly and directly specified in the instruction.
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit.
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+  not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+  but with the down-side that they're an all-or-nothing part of the Extension.
+  No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+  parallelisation can be carried out, followed by further parallel Lane-based
+  work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+  is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware.  It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity of having to use register renames, OoO, VLIW,
+  register file cacheing, all of which has been done before but is a
+  pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+  saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+  parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+  a CSR register to indicate vector length, a separate one to indicate
+  that it is a predicate register and so on) takes a little more
+  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+  approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+  operations not suited to parallelisation may be carried out interleaved
+  between parallelised instructions *without* requiring data to be dropped
+  down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+  files means that huge parallel workloads would use up considerable
+  chunks of the register file.  However in the case of RV64 and 32-bit
+  operations, that effectively means 64 slots are available for parallel
+  operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+  be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel).  consistent "API" regardless of actual internal parallelism:
+  even an in-order single-issue implementation with a single ALU would still
+  appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area).  At worst it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file cacheing schemes.
+
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+  streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+  (similar to Alt-RVP) may be used as an implementation detail,
+  using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+  really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+  do not gain parallelism, resulting in prolific duplication of functionality
+  inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+  using the standard integer or FP regfile) an entire vector must be
+  transferred out to memory, into standard regfiles, then back to memory,
+  then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is.  May
+  be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+  vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+  implementation time and die area, meaning that adoption is likely only
+  to be in high-performance specialist supercomputing (where it will
+  be absolutely superb).
+
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance.  Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+  at once.  Parallelism is inherent at the ALU, making the addition of
+  SIMD-style parallelism an easy decision that has zero significant impact
+  on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+  therefore result in superb throughput, easily achieved even with a very
+  simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+  increase instruction count on what would otherwise be a "simple loop",
+  should the number of elements in an array not happen to exactly match
+  the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+  are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+  are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width): an at least O(N^2) and quite probably
+  O(N^3) ISA proliferation that often results in several thousand
+  separate instructions.  all requiring separate and distinct corner-case
+  algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+  8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+  four separate and distinct instructions: one for (r1:low r2:high),
+  one for (r1:high r2:low), one for (r1:high r2:high) and one for
+  (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+  between operand and result bit-widths.  In combination with high/low
+  proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+  that allow control over individual elements within the SIMD block.
+
+# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD.  In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+## [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+  a SIMD architecture where the ALU becomes responsible for the parallelism,
+  Alt-RVP ALUs would likewise be so responsible... with *additional*
+  (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+  at least one dimension are avoided (architectural upgrades introducing
+  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+  SIMD block)
+* Unfortunately, Alt-RVP would still suffer the same inherent proliferation
+  of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+  be able to subdivide the bits of each register lane (columns) down into
+  arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+  16-bit or even 8-bit, effectively dividing the registerfile into
+  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further.  If inter-lane
+  "swapping" instructions were then introduced, some of the disadvantages
+  of SIMD could be mitigated.
+
+## RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+  parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+  DSPs with a focus on Multimedia (Audio, Video and Image processing),
+  RVV's primary focus appears to be on Supercomputing: optimisation of
+  mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel) 
+  into a SIMD instruction requires an equivalent to be added to the
+  RVV Extension, if one does not exist.  Given the specialist nature of
+  some SIMD instructions (8-bit or 16-bit saturated or halving add),
+  this possibility seems extremely unlikely to occur, even if the
+  implementation overhead of RVV were acceptable (compared to
+  normal SIMD/DSP-style single-issue in-order simplicity).
+
+## Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+  topologically transplant every single instruction from RVV (as
+  designed) into Simple-V equivalents, with *zero loss of functionality
+   or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+  Extension which contained the basic primitives (non-parallelised
+  8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+  automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
+  four 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, the corner-cases
+  and ISA proliferation of SIMD are avoided, just as with RVV.
+* Whilst not entirely finalised, registers are expected to be
+  capable of being subdivided down to an implementor-chosen bitwidth
+  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+  and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+  ALUs that perform twin 8-bit operations as they see fit, or anything
+  else including no subdivisions at all.
+* Even though implementors have that choice even to have full 64-bit
+  (with RV64) SIMD, they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+  can be done by applying *Parallelised* Bit-manipulation operations
+  followed by parallelised *straight* versions of element-to-element
+  arithmetic operations, even if the bit-manipulation operations require
+  changing the bitwidth of the "vectors" to do so.  Predication can
+  be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
+
  # Implementing V on top of Simple-V
  
  * Number of Offset CSRs extends from 2
@@ -958,9 +1361,84 @@ translates effectively to:
    (caveat: anything not specified drops through to software-emulation / traps)
  * TODO
  
-# Analysis of CSR decoding on latency
-
-<a name="csr_decoding_analysis"></a>
+# Register reordering <a name="register_reordering"></a>
+
+## Register File 
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (31..0) |
+| r1 | (31..0) |
+| r2 | (31..0) |
+| r3 | (31..0) |
+| r4 | (31..0) |
+| r5 | (31..0) |
+| r6 | (31..0) |
+| r7 | (31..0) |
+
+## Vectorised CSR
+
+May not be an actual CSR: may be generated from Vector Length CSR:
+single-bit is less burdensome on instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |  
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+## Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
+
+## Virtual Register Reordering:
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (31..0) | (31..0) | |
+| r2 | (31..0) | | |
+| r3 | (31..0) | | |
+| r4 | (31..0) | (31..0) | (31..0) |
+| r7 | (31..0) | | |
+
+## Example Instruction translation: <a name="example_translation"></a>
+
+The instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
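
The translation above can be reproduced with a short Python sketch driven by the Vector Length CSR table. Two assumptions are made here that the text does not state outright: a register counts as vectorised when its length is greater than 1 (which matches the Vectorised CSR bits shown), and the number of generated instructions is the largest length among the vectorised operands.

```python
# Vector Length CSR contents, copied from the table above.
VECTOR_LEN = {0: 2, 1: 0, 2: 1, 3: 1, 4: 3, 5: 0, 6: 0, 7: 1}


def is_vectorised(reg):
    """Assumption: a length greater than 1 marks the register as a vector."""
    return VECTOR_LEN.get(reg, 1) > 1


def expand(op, rd, rs1, rs2):
    """Expand one instruction into the scalar instructions it stands for."""
    regs = (rd, rs1, rs2)
    count = max([VECTOR_LEN[r] for r in regs if is_vectorised(r)], default=1)
    issued = []
    for i in range(count):
        # vectorised operands step through consecutive registers,
        # scalar operands are repeated unchanged
        issued.append((op,) + tuple(r + i if is_vectorised(r) else r
                                    for r in regs))
    return issued
```

Because r2 is scalar in this example, each generated instruction overwrites the same destination, exactly as in the list above.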
+
+## Insights 
+
+SIMD register file splitting still to consider.  For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
+are already done on 64-bit registers, it's not so conceptually difficult.
+It may even be achieved by *actually* splitting the regfile into 64 virtual
+32-bit registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples.  The implementation is therefore hidden through register renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches).  Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
  
  It could indeed have been logically deduced (or expected), that there
  would be additional decode latency in this proposal, because if
@@ -1048,6 +1526,26 @@ pluses:
    parallel ALUs) is only equal to one ("virtual" parallelism), or is
    greater than one, should not be underestimated.
  
+# Appendix
+
+# Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4.  They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+
  
  # References
  
@@ -1067,3 +1565,9 @@ pluses:
  * Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
  * Life of Triangles (3D) <https://jbush001.github.io/2016/02/27/life-of-triangle.html>
  * Videocore-IV <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
+* Discussion proposing CSRs that change ISA definition
+  <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>