diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 8209cd73e..0642ce926 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,7 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+**Note: this document is out of date and records early ideas and discussions**
+
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -9,8 +11,13 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
+
+* Talk slides:
+* Specification: now moved to its own page: [[specification]]
 
 [[!toc ]]
 
@@ -21,7 +28,7 @@ requirements: power-conscious, area-conscious, and performance-conscious
 designs all pull an ISA and its implementation in different conflicting
 directions, as do the specific intended uses for any given implementation.
 
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
 whilst each extremely powerful in their own right and clearly desirable,
 are also:
 
@@ -31,8 +38,8 @@ are also:
   analysis and review purposes) prohibitively expensive
 * Both contain partial duplication of pre-existing RISC-V instructions
   (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
 * Both require that their respective parallelism paradigm be implemented
   along-side and integral to their respective functionality *or not at all*.
 * Both independently have methods for introducing parallelism that
@@ -52,10 +59,13 @@ details outlined in the Appendix), the key points being:
 * Vectorisation typically includes much more comprehensive memory load
   and store schemes (unit stride, constant-stride and indexed), which
   in turn have ramifications: virtual memory misses (TLB cache misses)
-  and even multiple page-faults... all caused by a *single instruction*.
+  and even multiple page-faults... all caused by a *single instruction*,
+  yet with a clear benefit that the regularisation of LOAD/STOREs can
+  be optimised for minimal impact on caches and maximised throughput.
 * By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
   to pages), and these load/stores have absolutely nothing to do with the
-  SIMD / ALU engine, no matter how wide the operand.
+  SIMD / ALU engine, no matter how wide the operand. Simplicity, but with
+  more impact on instruction and data caches (see the sketch below).
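+
+To illustrate the trade-off, here is a minimal sketch in C-style pseudocode
+(not taken from either proposal: the four-element width, the stride and the
+memory-access helper names are assumptions purely for illustration):
+
+    // Vector: one instruction issues N independent element accesses,
+    // each of which may cross a page boundary and fault separately.
+    for (int i = 0; i < 4; i++)
+        vreg[vd][i] = mem_read64(base + i * stride); // up to 4 TLB misses
+
+    // SIMD: a single ordinary aligned 128-bit load (at most one fault);
+    // the ALU then simply treats the register as 4 packed elements.
+    simd_reg = mem_read128_aligned(base);
+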
Overall it makes a huge amount of sense to have a means and method of introducing instruction parallelism in a flexible way that provides @@ -123,7 +133,8 @@ reducing power consumption for the same. SIMD again has a severe disadvantage here, over Vector: huge proliferation of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and have to then have operations *for each and between each*. It gets very -messy, very quickly. +messy, very quickly: *six* separate dimensions giving an O(N^6) instruction +proliferation profile. The V-Extension on the other hand proposes to set the bit-width of future instructions on a per-register basis, such that subsequent instructions @@ -135,7 +146,7 @@ burdensome to implementations, given that instruction decode already has to direct the operation to a correctly-sized width ALU engine, anyway. Not least: in places where an ISA was previously constrained (due for -whatever reason, including limitations of the available operand spcace), +whatever reason, including limitations of the available operand space), implicit bit-width allows the meaning of certain operations to be type-overloaded *without* pollution or alteration of frozen and immutable instructions, in a fully backwards-compatible fashion. @@ -208,10 +219,10 @@ Interestingly, none of this complexity is faced in SIMD architectures... but then they do not get the opportunity to optimise for highly-streamlined memory accesses either. -With the "bang-per-buck" ratio being so high and the direct improvement -in L1 Instruction Cache usage, as well as the opportunity to optimise -L1 and L2 cache usage, the case for including Vector LOAD/STORE is -compelling. +With the "bang-per-buck" ratio being so high and the indirect improvement +in L1 Instruction Cache usage (reduced instruction count), as well as +the opportunity to optimise L1 and L2 cache usage, the case for including +Vector LOAD/STORE is compelling. ## Mask and Tagging (Predication) @@ -232,8 +243,8 @@ So these are the ways in which conditional execution may be implemented: * explicit compare and branch: BNE x, y -> offs would jump offs instructions if x was not equal to y * explicit store of tag condition: CMP x, y -> tagbit -* implicit (condition-code) ADD results in a carry, carry bit implicitly - (or sometimes explicitly) goes into a "tag" (mask) register +* implicit (condition-code) such as ADD results in a carry, carry bit + implicitly (or sometimes explicitly) goes into a "tag" (mask) register The first of these is a "normal" branch method, which is flat-out impossible to parallelise without look-ahead and effectively rewriting instructions. @@ -304,256 +315,9 @@ In particular: i.e. *without* requiring a super-scalar or out-of-order architecture, but doing a proper, full job (ZOLC) is an entirely different matter. -Constructing a SIMD/Simple-Vector proposal based around four of these five +Constructing a SIMD/Simple-Vector proposal based around four of these six requirements would therefore seem to be a logical thing to do. -# Instruction Format - -The instruction format for Simple-V does not actually have *any* explicit -compare operations, *any* arithmetic, floating point or memory instructions. -Instead it *overloads* pre-existing branch operations into predicated -variants, and implicitly overloads arithmetic operations and LOAD/STORE -depending on implicit CSR configurations for both vector length and -bitwidth. *This includes Compressed instructions* as well as future ones. 
- -* For analysis of RVV see [[v_comparative_analysis]] which begins to - outline topologically-equivalent mappings of instructions -* Also see Appendix "Retro-fitting Predication into branch-explicit ISA" - for format of Branch opcodes. - -**TODO**: *analyse and decide whether the implicit nature of predication -as proposed is or is not a lot of hassle, and if explicit prefixes are -a better idea instead. Parallelism therefore effectively may end up -as always being 64-bit opcodes (32 for the prefix, 32 for the instruction) -with some opportunities for to use Compressed bringing it down to 48. -Also to consider is whether one or both of the last two remaining Compressed -instruction codes in Quadrant 1 could be used as a parallelism prefix, -bringing parallelised opcodes down to 32-bit (when combined with C) -and having the benefit of being explicit.* - -## Branch Instruction: - -This is the overloaded table for Integer-base Branch operations. Opcode -(bits 6..0) is set in all cases to 1100011. - -[[!table data=""" -31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 | -imm[12|10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | -7 | 5 | 5 | 3 | 4 | 1 | 7 | -reserved | src2 | src1 | BPR | predicate rs3 || BRANCH | -reserved | src2 | src1 | 000 | predicate rs3 || BEQ | -reserved | src2 | src1 | 001 | predicate rs3 || BNE | -reserved | src2 | src1 | 010 | predicate rs3 || rsvd | -reserved | src2 | src1 | 011 | predicate rs3 || rsvd | -reserved | src2 | src1 | 100 | predicate rs3 || BLE | -reserved | src2 | src1 | 101 | predicate rs3 || BGE | -reserved | src2 | src1 | 110 | predicate rs3 || BLTU | -reserved | src2 | src1 | 111 | predicate rs3 || BGEU | -"""]] - -This is the overloaded table for Floating-point Predication operations. -Interestingly no change is needed to the instruction format because -FP Compare already stores a 1 or a zero in its "rd" integer register -target, i.e. it's not actually a Branch at all: it's a compare. -The target needs to simply change to be a predication bitfield. - -As with -Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011. -Likewise Single-precision, fmt bits 26..25) is still set to 00. -Double-precision is still set to 01, whilst Quad-precision -appears not to have a definition in V2.3-Draft (but should be unaffected). - -It is however noted that an entry "FNE" (the opposite of FEQ) is missing, -and whilst in ordinary branch code this is fine because the standard -RVF compare can always be followed up with an integer BEQ or a BNE (or -a compressed comparison to zero or non-zero), in predication terms that -becomes more of an impact as an explicit (scalar) instruction is needed -to invert the predicate. An additional encoding funct3=011 is therefore -proposed to cater for this. - -[[!table data=""" -31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 | -funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode | -5 | 2 | 5 | 5 | 3 | 4 | 7 | -10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ | -10100 | 00/01/11 | src2 | src1 | *011* | pred rs3 | FNE | -10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT | -10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE | -"""]] - -Note (**TBD**): floating-point exceptions will need to be extended -to cater for multiple exceptions (and statuses of the same). The -usual approach is to have an array of status codes and bit-fields, -and one exception, rather than throw separate exceptions for each -Vector element. 
- -In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given -for predicated compare operations of function "cmp": - - for (int i=0; i 1; - s2 = CSRvectorlen[src2] > 1; - for (int i=0; i - -There are a number of CSRs needed, which are used at the instruction -decode phase to re-interpret standard RV opcodes (a practice that has -precedent in the setting of MISA to enable / disable extensions). - -* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Integer Register N is a Predication Register (note: a key-value store) -* Vector Length CSR (VSETVL, VGETVL) - -Notes: - -* for the purposes of LOAD / STORE, Integer Registers which are - marked as a Vector will result in a Vector LOAD / STORE. -* Vector Lengths are *not* the same as vsetl but are an integral part - of vsetl. -* Actual vector length is *multipled* by how many blocks of length - "bitwidth" may fit into an XLEN-sized register file. -* Predication is a key-value store due to the implicit referencing, - as opposed to having the predicate register explicitly in the instruction. - -## Predication CSR - -The Predication CSR is a key-value store indicating whether, if a given -destination register (integer or floating-point) is referred to in an -instruction, it is to be predicated. The first entry is whether predication -is enabled. The second entry is whether the register index refers to a -floating-point or an integer register. The third entry is the index -of that register which is to be predicated (if referred to). The fourth entry -is the integer register that is treated as a bitfield, indexable by the -vector element index. - -| RegNo | 6 | 5 | (4..0) | (4..0) | -| ----- | - | - | ------- | ------- | -| r0 | pren0 | i/f | regidx | predidx | -| r1 | pren1 | i/f | regidx | predidx | -| .. | pren.. | i/f | regidx | predidx | -| r15 | pren15 | i/f | regidx | predidx | - -The Predication CSR Table is a key-value store, so implementation-wise -it will be faster to turn the table around (maintain topologically -equivalent state): - - fp_pred_enabled[32]; - int_pred_enabled[32]; - for (i = 0; i < 16; i++) - if CSRpred[i].pren: - idx = CSRpred[i].regidx - predidx = CSRpred[i].predidx - if CSRpred[i].type == 0: # integer - int_pred_enabled[idx] = 1 - int_pred_reg[idx] = predidx - else: - fp_pred_enabled[idx] = 1 - fp_pred_reg[idx] = predidx - -So when an operation is to be predicated, it is the internal state that -is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following -pseudo-code for operations is given, where p is the explicit (direct) -reference to the predication register to be used: - - for (int i=0; i What does an ADD of two different-sized vectors do in simple-V? -Vector lengths are interpreted as meaning "any instruction referring to -r(N) generates implicit identical instructions referring to registers -r(N+M-1) where M is the Vector Length". Vector Lengths may be set to -use up to 16 registers in the register file. +* if the two source operands are not the same, throw an exception. +* if the destination operand is also a vector, and the source is longer + than the destination, throw an exception. -One separate CSR table is needed for each of the integer and floating-point -register files: +> And what about instructions like JALR?  
+> What does jumping to a vector do? -| RegNo | (3..0) | -| ----- | ------ | -| r0 | vlen0 | -| r1 | vlen1 | -| .. | vlen.. | -| r31 | vlen31 | +* Throw an exception. Whether that actually results in spawning threads + as part of the trap-handling remains to be seen. -An array of 32 4-bit CSRs is needed (4 bits per register) to indicate -whether a register was, if referred to in any standard instructions, -implicitly to be treated as a vector. A vector length of 1 indicates -that it is to be treated as a scalar. Vector lengths of 0 are reserved. +# Under consideration -Internally, implementations may choose to use the non-zero vector length -to set a bit-field per register, to be used in the instruction decode phase. -In this way any standard (current or future) operation involving -register operands may detect if the operation is to be vector-vector, -vector-scalar or scalar-scalar (standard) simply through a single -bit test. +From the Chennai 2018 slides the following issues were raised. +Efforts to analyse and answer these questions are below. + +* Should future extra bank be included now? +* How many Register and Predication CSRs should there be? + (and how many in RV32E) +* How many in M-Mode (for doing context-switch)? +* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)? +* Can CLIP be done as a CSR (mode, like elwidth) +* SIMD saturation (etc.) also set as a mode? +* Include src1/src2 predication on Comparison Ops? + (same arrangement as C.MV, with same flexibility/power) +* 8/16-bit ops is it worthwhile adding a "start offset"? + (a bit like misaligned addressing... for registers) + or just use predication to skip start? -Note that when using the "vsetl rs1, rs2" instruction (caveat: when the -bitwidth is specifically not set) it becomes: +## Future (extra) bank be included (made mandatory) - CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2) +The implications of expanding the *standard* register file from +32 entries per bank to 64 per bank is quite an extensive architectural +change. Also it has implications for context-switching. -This is in contrast to RVV: +Therefore, on balance, it is not recommended and certainly should +not be made a *mandatory* requirement for the use of SV. SV's design +ethos is to be minimally-disruptive for implementors to shoe-horn +into an existing design. - CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2) +## How large should the Register and Predication CSR key-value stores be? -## Element (SIMD) bitwidth CSRs +This is something that definitely needs actual evaluation and for +code to be run and the results analysed. At the time of writing +(12jul2018) that is too early to tell. An approximate best-guess +however would be 16 entries. -Element bitwidths may be specified with a per-register CSR, and indicate -how a register (integer or floating-point) is to be subdivided. +RV32E however is a special case, given that it is highly unlikely +(but not outside the realm of possibility) that it would be used +for performance reasons but instead for reducing instruction count. +The number of CSR entries therefore has to be considered extremely +carefully. -| RegNo | (2..0) | -| ----- | ------ | -| r0 | vew0 | -| r1 | vew1 | -| .. | vew.. | -| r31 | vew31 | +## How many CSR entries in M-Mode or S-Mode (for context-switching)? -vew may be one of the following (giving a table "bytestable", used below): +The minimum required CSR entries would be 1 for each register-bank: +one for integer and one for floating-point. 
However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
 
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free. However just one does not make
+much sense (as it only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
 
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
+This applies *in addition* - in the RV32E case - if an RV32E implementation
+happens also to support U/S/M modes. This would be considered quite
+rare but not outside of the realm of possibility.
 
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
+Conclusion: all needs careful analysis and future work.
 
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
 
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
+On balance it's a neat idea, however it does seem to be one where the
+benefits are not really clear. It would however obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails); however
+the "fly in the ointment" is that x0 is hard-coded to "zero". The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
 
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
+## Can CLIP be done as a CSR (mode, like elwidth)
 
-# Exceptions
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in the V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
 
-> What does an ADD of two different-sized vectors do in simple-V?
+No details are given, however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below).
 
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
+## SIMD saturation (etc.) also set as a mode?
 
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is, for example).
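+
+As a rough sketch only (the mode names are invented for illustration;
+the actual CSR encoding is exactly what remains to be decided), the
+writeback stage might consult such per-register mode bits as follows:
+
+    // hypothetical 2-bit per-register mode field; smax/smin/umax are
+    // the bounds implied by the destination's element width (elwidth)
+    switch (CSRreg[rd].satmode) {
+    case SAT_NONE:      // default: modulo wrap-around, as at present
+        break;
+    case SAT_SIGNED:    // clamp into the signed elwidth range
+        result = MAX(MIN(result, smax), smin);
+        break;
+    case SAT_UNSIGNED:  // clamp into the unsigned elwidth range
+        result = MIN(result, umax);
+        break;
+    }
+
+CLIP, as described in the section above, would consult the same extended
+CSR entry in an analogous way.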
 
-* Throw an exception.  Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
 
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
+
+## 8/16-bit ops is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example, rather than take two 32-bit registers, each divided into
+4 8-bit elements, and have them ADDed element-for-element as follows:
+
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+  penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance, given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
 
 # Implementing V on top of Simple-V
 
@@ -820,7 +607,7 @@ levels: Base and reserved future functionality.
 up to 16 (TBD) of either the floating-point or integer registers to be
 marked as "predicated" (key), and if so, which integer register to
 use as the predication mask (value).
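+
+As a sketch only (names invented; the exact table size and encoding are
+the TBD items above), the key-value lookup at instruction-decode time
+amounts to:
+
+    // shape of one key-value entry (fields as per the Predication CSR
+    // table: enabled, int/fp, register index, mask register index)
+    struct { int pren; int fp; int regidx; int predidx; } CSRpred[16];
+
+    // is register "reg" (int or fp) predicated? if so, return the
+    // integer register holding its predication mask, else -1
+    int find_pred(int reg, int is_fp) {
+        for (int i = 0; i < 16; i++)          // 16 (TBD) entries
+            if (CSRpred[i].pren               // entry enabled
+                && CSRpred[i].fp == is_fp     // int/fp register file
+                && CSRpred[i].regidx == reg)  // key: register index
+                return CSRpred[i].predidx;    // value: mask register
+        return -1;                            // not predicated
+    }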
-
+ **TODO**
 
 # Implementing P (renamed to DSP) on top of Simple-V
 
@@ -845,6 +632,11 @@ This section compares the various parallelism proposals as they stand,
 including traditional SIMD, in terms of features, ease of implementation,
 complexity, flexibility, and die area.
 
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
 ### [[alt_rvp]]
 
 Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -875,9 +667,9 @@ from actual (internal) parallel hardware. It's an API in effect that's
 designed to be slotted in to an existing implementation (just after
 instruction decode) with minimum disruption and effort.
 
-* minus: the complexity of having to use register renames, OoO, VLIW,
-  register file cacheing, all of which has been done before but is a
-  pain
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file caching,
+  all of which has been done before but is a pain
 * plus: transparent re-use of existing opcodes as-is just indirectly
   saying "this register's now a vector" which
 * plus: means that future instructions also get to be inherently
@@ -986,7 +778,7 @@ the question is asked "How can each of the proposals effectively implement
 implement a SIMD architecture where the ALU becomes responsible
 for the parallelism, Alt-RVP ALUs would likewise be so responsible...
 with *additional* (lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
  at least one dimension are avoided (architectural upgrades introducing
  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
  SIMD block)
@@ -1064,37 +856,35 @@ the question is asked "How can each of the proposals effectively implement
 
 ### Example Instruction translation:
 
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO. r7 and r4 are marked as "vectorised":
 
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+The instruction "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO. r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
 
 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i
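+
+The remainder of the original pseudocode is truncated in this rendering.
+As an illustrative sketch only (names are assumed, not the original code),
+the element-for-element expansion described above amounts to: step the
+index of each operand only when that register is marked vectorised, so
+that a scalar operand is simply repeated:
+
+    // expand one ADD rd, rs1, rs2 into VL scalar additions
+    for (int i = 0, id = 0, irs1 = 0, irs2 = 0; i < VL; i++) {
+        ireg[rd + id] = ireg[rs1 + irs1] + ireg[rs2 + irs2];
+        if (vectorised[rd])  id++;    // step only if rd is a vector
+        if (vectorised[rs1]) irs1++;  // scalar sources are not stepped,
+        if (vectorised[rs2]) irs2++;  // i.e. they are simply repeated
+    }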
@@ -1122,10 +912,10 @@ There is, in the standard Conditional Branch instruction, more than adequate
 space to interpret it in a similar fashion:
 
 [[!table  data="""
-   31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 |    7    | 6 ....... 0 |
-imm[12]  | imm[10:5]  |   rs2    |    rs1    |    funct3    |   imm[4:1]   | imm[11] |   opcode    |
-   1     |     6      |    5     |     5     |      3       |      4       |    1    |      7      |
- offset[12,10:5]     ||   src2   |   src1    |     BEQ      | offset[11,4:1]        ||   BRANCH    |
+31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 |    7    | 6....0 |
+imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
+ 1      |     6      |  5   |  5   |   3    |    4     |    1    |   7    |
+ offset[12,10:5]    || src2 | src1 |  BEQ   | offset[11,4:1]  || BRANCH   |
"""]]
 
 This would become:
 
@@ -1145,19 +935,19 @@ not only to add in a second source register, but also use some of the
 bits as a predication target as well.
 
 [[!table  data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm         |   rs10   |         imm           |   op   |
-      3      |        3          |    3     |          5            |   2    |
-   C.BEQZ    |  offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
+funct3 | imm           | rs10  | imm               | op     |
+3      | 3             | 3     | 5                 | 2      |
+C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
"""]]
 
 Now uses the CS format:
 
 [[!table  data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm         |   rs10   |  imm   |              |   op   |
-      3      |        3          |    3     |   2    |      3       |   2    |
-   C.BEQZ    |  predicate rs3    |   src1   |  I/F B |     src2     |   C1   |
+15..13 | 12 . 10  | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm      | rs10   | imm    |      | op     |
+3      | 3        | 3      | 2      | 3    | 2      |
+C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
"""]]
 
 Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1260,16 +1050,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
 
 vew may be one of the following (giving a table "bytestable", used below):
 
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
 
 Pseudocode for vector length taking CSR SIMD-bitwidth into account:
 
@@ -1387,7 +1177,7 @@ So the question boils down to:
 
 Whilst the above may seem to be severe minuses, there are some strong
 pluses:
 
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
 * Smaller reduction of P's opcode space: around 10%.
 * The potential to use Compressed instructions in both Vector and SIMD
   due to the overloading of register meaning (implicit vectorisation,
@@ -1437,7 +1227,116 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
 makes proposing it quite challenging given that the relevant (Base) RV
 sections are frozen. Consequently it makes sense to forgo this feature.
 
-## Virtual Memory page-faults
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+    /* Macro for saving task context */
+    .macro portSAVE_CONTEXT
+    .global pxCurrentTCB
+    /* make room in stack */
+    addi sp, sp, -REGBYTES * 32
+
+    /* Save Context */
+    STORE x1, 0x0(sp)
+    STORE x2, 1 * REGBYTES(sp)
+    STORE x3, 2 * REGBYTES(sp)
+    ...
+    ...
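+    /* (REGBYTES is the per-register save size: 4 on RV32, 8 on RV64.
+       x1..x31 land at consecutive REGBYTES offsets - precisely the
+       contiguity that allows the single vectorised STORE shown below
+       to replace this whole sequence.) */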
+    STORE x30, 29 * REGBYTES(sp)
+    STORE x31, 30 * REGBYTES(sp)
+
+    /* Store current stackpointer in task control block (TCB) */
+    LOAD t0, pxCurrentTCB //pointer
+    STORE sp, 0x0(t0)
+    .endm
+
+    /* Saves current exception program counter (EPC) as task program counter */
+    .macro portSAVE_EPC
+    csrr t0, mepc
+    STORE t0, 31 * REGBYTES(sp)
+    .endm
+
+    /* Saves current return address (RA) as task program counter */
+    .macro portSAVE_RA
+    STORE ra, 31 * REGBYTES(sp)
+    .endm
+
+    /* Macro for restoring task context */
+    .macro portRESTORE_CONTEXT
+
+    .global pxCurrentTCB
+    /* Load stack pointer from the current TCB */
+    LOAD sp, pxCurrentTCB
+    LOAD sp, 0x0(sp)
+
+    /* Load task program counter */
+    LOAD t0, 31 * REGBYTES(sp)
+    csrw mepc, t0
+
+    /* Run in machine mode */
+    li t0, MSTATUS_PRV1
+    csrs mstatus, t0
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    LOAD x4, 3 * REGBYTES(sp)
+    LOAD x5, 4 * REGBYTES(sp)
+    ...
+    ...
+    LOAD x30, 29 * REGBYTES(sp)
+    LOAD x31, 30 * REGBYTES(sp)
+
+    addi sp, sp, REGBYTES * 32
+    mret
+    .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+    /* a few things are assumed here: (a) that when switching to
+       M-Mode an entirely different set of CSRs is used from that
+       which is used in U-Mode and (b) that the M-Mode x1 and x4
+       vectors are also not used anywhere else in M-Mode, consequently
+       only need to be set up just the once.
+     */
+    .macro VectorSetup
+    MVECTORCSRx1 = 31, defaultlen
+    MVECTORCSRx4 = 28, defaultlen
+
+    /* Save Context */
+    SETVL x0, x0, 31 /* x0 ignored silently */
+    STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    SETVL x0, x0, 28 /* x0 ignored silently */
+    LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the VectorSetup macro, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
+
 > I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
 > riscv-isa-manual in order to work out how to re-map RVV onto the standard
@@ -1562,9 +1461,9 @@ would still be there (and stalled). hmmm.
 
 > > Thrown away.
 
-discussion then led to the question of OoO architectures
+discussion then led to the question of OoO architectures
 
-> The costs of the imprecise-exception model are greater than the benefit.
+> The costs of the imprecise-exception model are greater than the benefit.
 > Software doesn't want to cope with it.  It's hard to debug. 
You can't > migrate state between different microarchitectures--unless you force all > implementations to support the same imprecise-exception model, which would @@ -1572,11 +1471,151 @@ discussion then led to the question of OoO architectures > relevant, is that the imprecise model increases the size of the context > structure, as the microarchitectural guts have to be spilled to memory.) - -## Implementation Paradigms - -TODO: assess various implementation paradigms: - +## Zero/Non-zero Predication + +>> >  it just occurred to me that there's another reason why the data +>> > should be left instead of zeroed.  if the standard register file is +>> > used, such that vectorised operations are translated to mean "please +>> > insert multiple register-contiguous operations into the instruction +>> > FIFO" and predication is used to *skip* some of those, then if the +>> > next "vector" operation uses the (standard) registers that were masked +>> > *out* of the previous operation it may proceed without blocking. +>> > +>> >  if however zeroing is made mandatory then that optimisation becomes +>> > flat-out impossible to deploy. +>> > +>> >  whilst i haven't fully thought through the full implications, i +>> > suspect RVV might also be able to benefit by being able to fit more +>> > overlapping operations into the available SRAM by doing something +>> > similar. +> +> +> Luke, this is called density time masking. It doesn’t apply to only your +> model with the “standard register file” is used. it applies to any +> architecture that attempts to speed up by skipping computation and writeback +> of masked elements. +> +> That said, the writing of zeros need not be explicit. It is possible to add +> a “zero bit” per element that, when set, forces a zero to be read from the +> vector (although the underlying storage may have old data). In this case, +> there may be a way to implement DTM as well. + + +## Implementation detail for scalar-only op detection + +Note 1: this idea is a pipeline-bypass concept, which may *or may not* be +worthwhile. + +Note 2: this is just one possible implementation. Another implementation +may choose to treat *all* operations as vectorised (including treating +scalars as vectors of length 1), choosing to add an extra pipeline stage +dedicated to *all* instructions. + +This section *specifically* covers the implementor's freedom to choose +that they wish to minimise disruption to an existing design by detecting +"scalar-only operations", bypassing the vectorisation phase (which may +or may not require an additional pipeline stage) + +[[scalardetect.png]] + +>> For scalar ops an implementation may choose to compare 2-3 bits through an +>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr +>> FIFO). + +> Those bits cannot be known until after the registers are decoded from the +> instruction and a lookup in the "vector length table" has completed. +> Considering that one of the reasons RISC-V keeps registers in invariant +> positions across all instructions is to simplify register decoding, I expect +> that inserting an SRAM read would lengthen the critical path in most +> implementations. + +reply: + +> briefly: the trick i mentioned about ANDing bits together to check if +> an op was fully-scalar or not was to be read out of a single 32-bit +> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per +> register indicating "is register vectorised yes no". 3R because you need +> to check src1, src2 and dest simultaneously. 
the entries are *generated* +> from the CSRs and are an optimisation that on slower embedded systems +> would likely not be needed. + +> is there anything unreasonable that anyone can foresee about that? +> what are the down-sides? + +## C.MV predicated src, predicated dest + +> Can this be usefully defined in such a way that it is +> equivalent to vector gather-scatter on each source, followed by a +> non-predicated vector-compare, followed by vector gather-scatter on the +> result? + +## element width conversion: restrict or remove? + +summary: don't restrict / remove. it's fine. + +> > it has virtually no cost/overhead as long as you specify +> > that inputs can only upconvert, and operations are always done at the +> > largest size, and downconversion only happens at the output. +> +> okaaay.  so that's a really good piece of implementation advice. +> algorithms do require data size conversion, so at some point you need to +> introduce the feature of upconverting and downconverting. +> +> > for int and uint, this is dead simple and fits well within the RVV pipeline +> > without any critical path, pipeline depth, or area implications. + + + +## Under review / discussion: remove CSR vector length, use VSETVL + +**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines +length on all regs**. This section kept for historical reasons. + +So the issue is as follows: + +* CSRs are used to set the "span" of a vector (how many of the standard + register file to contiguously use) +* VSETVL in RVV works as follows: it sets the vector length (copy of which + is placed in a dest register), and if the "required" length is longer + than the *available* length, the dest reg is set to the MIN of those + two. +* **HOWEVER**... in SV, *EVERY* vector register has its own separate + length and thus there is no way (at the time that VSETVL is called) to + know what to set the vector length *to*. +* At first glance it seems that it would be perfectly fine to just limit + the vector operation to the length specified in the destination + register's CSR, at the time that each instruction is issued... + except that that cannot possibly be guaranteed to match + with the value *already loaded into the target register from VSETVL*. + +Therefore a different approach is needed. + +Possible options include: + +* Removing the CSR "Vector Length" and always using the value from + VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and* + destreg equal to MIN(counterreg, lenimmed), with register-based + variant "VSETVL destreg, counterreg, lenreg" doing the same. +* Keeping the CSR "Vector Length" and having the lenreg version have + a "twist": "if lengreg is vectorised, read the length from the CSR" +* Other (TBD) + +The first option (of the ones brainstormed so far) is a lot simpler. +It does however mean that the length set in VSETVL will apply across-the-board +to all src1, src2 and dest vectorised registers until it is otherwise changed +(by another VSETVL call). This is probably desirable behaviour. + +## Implementation Paradigms + +TODO: assess various implementation paradigms. These are listed roughly +in order of simplicity (minimum compliance, for ultra-light-weight +embedded systems or to reduce design complexity and the burden of +design implementation and compliance, in non-critical areas), right the +way to high-performance systems. 
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+  required, however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
 * Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
 * In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
 * Out-of-order with instruction FIFOs and aggressive register-renaming
@@ -1588,10 +1627,93 @@ Also to be taken into consideration:
 
 * Comprehensive vectorisation: FIFOs and internal parallelism
 * Hybrid Parallelism
 
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
 # TODO Research
 
 > For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
 
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance in order to avoid it.
+
+TBD: floating-point compare and other exception handling
+
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving in the case of
+the Store buffer).
+
+Also, use the L1 Miss buffers as these already HAVE to be snooped by
+coherence traffic. These are used to monitor that all participating
+cache lines remain interference-free, and amalgamate same into a CPU
+signal accessible via branch or predicate.
+
+The Load buffers manage inbound traffic.
+The Store buffers manage outbound traffic.
+
+Done properly, the participating cache lines can exceed the associativity
+of the L1 cache without architectural harm (may incur additional latency).
+
+
+
+> > >  so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> >
+> > Can you explain what you mean by this paragraph?
+>
+> best put in sequential events, probably.
+>
+>     LR <-- 64-instruction countdown starts here
+>     ... 63
+>     ... 62
+>     LR same address <--- notes that core1 is on 61,
+>                          so pauses for **UP TO** 61 cycles
+>     ... 32
+>     SC <- core1 didn't reach zero, therefore valid, therefore
+>           core2 is now **UNBLOCKED**, is granted the
+>           load-reservation (and begins its **own** 64-cycle
+>           LR instruction countdown)
+>     ... 63
+>     ... 62
+>     ...
+>     ...
+> SC <- also valid + +Looks to me that you could effect the same functionality by simply +holding onto the cache line in core 1 preventing core 2 from + getting past the LR. + +On the other hand, the freeze is similar to how the MP CRAYs did +ATOMIC stuff. + # References * SIMD considered harmful @@ -1622,3 +1744,12 @@ Also to be taken into consideration: restarted if an exception occurs (VM page-table miss) * Dot Product Vector +* RVV slides 2017 +* Wavefront skipping using BRAMS +* Streaming Pipelines +* Barcelona SIMD Presentation +* +* Full Description (last page) of RVV instructions + +* PULP Low-energy Cluster Vector Processor +