(no commit message)

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 33a260919a4279819801641fc4480405733d0583..0642ce926d702469769bae2210773b1035398a95 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,8 +1,6 @@
  # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
  
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 23may2018: impl. detail on scalar-only ops (see appendix)
+**Note: this document is out of date and reflects early ideas and discussions**
  
  Key insight: Simple-V is intended as an abstraction layer to provide
  a consistent "API" to parallelisation of existing *and future* operations.
@@ -13,8 +11,13 @@ instruction queue (FIFO), pending execution.
  
  *Actual* parallelism, if added independently of Simple-V in the form
  of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
+
+* Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
+* Specification: now moved to its own page: [[specification]]
  
  [[!toc ]]
  
@@ -130,7 +133,8 @@ reducing power consumption for the same.
  SIMD again has a severe disadvantage here, over Vector: huge proliferation
  of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
  have to then have operations *for each and between each*.  It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.
  
  The V-Extension on the other hand proposes to set the bit-width of
  future instructions on a per-register basis, such that subsequent instructions
@@ -358,527 +362,218 @@ absolute bare minimum level of compliance with the "API" (software-traps
  when vectorisation is detected), all the way up to full supercomputing
  level all-hardware parallelism.  Options are covered in the Appendix.
  
-# CSRs <a name="csrs"></a>
-
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
-
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
-
-Also (see Appendix, "Context Switch Example") it may turn out to be important
-to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
-Vectorised LOAD / STORE may be used to load and store multiple registers:
-something that is missing from the Base RV ISA.
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-* Whilst the predication CSR is a key-value store it *generates* easier-to-use
-  state information.
-* TODO: assess whether the same technique could be applied to the other
-  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
-  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
-  needed for context-switches (empty slots need never be stored).
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated.  The first entry is whether predication
-is enabled.  The second entry is whether the register index refers to a
-floating-point or an integer register.  The third entry is the index
-of that register which is to be predicated (if referred to).  The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6      | 5   | (4..0)  | (4..0)  |
-| ----- | -      | -   | ------- | ------- |
-| r0    | pren0  | i/f | regidx  | predidx |
-| r1    | pren1  | i/f | regidx  | predidx |
-| ..    | pren.. | i/f | regidx  | predidx |
-| r15   | pren15 | i/f | regidx  | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    fp_pred_enabled[32];
-    int_pred_enabled[32];
-    for (i = 0; i < 16; i++)
-       if CSRpred[i].pren:
-          idx = CSRpred[i].regidx
-          predidx = CSRpred[i].predidx
-          if CSRpred[i].type == 0: # integer
-            int_pred_enabled[idx] = 1
-            int_pred_reg[idx] = predidx
-          else:
-            fp_pred_enabled[idx] = 1
-            fp_pred_reg[idx] = predidx
-
-So when an operation is to be predicated, it is the internal state that
-is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i<vl; ++i)
-        if ([!]preg[p][i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store:
-
-    if type(iop) == INT:
-        pred_enabled = int_pred_enabled
-        preg = int_pred_reg[rd]
-    else:
-        pred_enabled = fp_pred_enabled
-        preg = fp_pred_reg[rd]
  
-    for (int i=0; i<vl; ++i)
-        if (preg_enabled[rd] && [!]preg[i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+### FMV, FNEG and FABS Instructions
  
-## MAXVECTORDEPTH
+These are identical in form to C.MV, except covering floating-point
+register copying.  The same double-predication rules also apply.
+However when elwidth is not set to default the instruction is implicitly
+and automatically converted to a (vectorised) floating-point type conversion
+operation of the appropriate size covering the source and destination
+register bitwidths.
  
-MAXVECTORDEPTH is the same concept as MVL in RVV.  However in Simple-V,
-given that its primary (base, unextended) purpose is for 3D, Video and
-other purposes (not requiring supercomputing capability), it makes sense
-to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
-and so on).
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
  
-The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers.  This keeps the implementation a little simpler.
-Note also (as also described in the VSETVL section) that the *minimum*
-for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
-and 31 for RV32 or RV64).
+### FCVT Instructions
  
-Note that RVV on top of Simple-V may choose to over-ride this decision.
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point.  When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.
  
-## Vector-length CSRs
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable i.e. if packed SIMD is requested) instead.
  
-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N+M-1) where M is the Vector Length".  Vector Lengths may be set to
-use up to 16 registers in the register file.
+For example FCVT.S.L would normally be used to convert a 64-bit
+integer in register rs1 to a 32-bit (single-precision) floating-point number in rd.
+If however the source rs1 is set to be a vector, where elwidth is set to
+default/2 and "packed SIMD" is enabled, then the first 32 bits of
+rs1 are converted to a floating-point number to be stored in rd's
+first element and the higher 32-bits *also* converted to floating-point
+and stored in the second.  The 32 bit size comes from the fact that
+FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
+divide that by two it means that rs1 element width is to be taken as 32.
  
-One separate CSR table is needed for each of the integer and floating-point
-register files:
+Similar rules apply to the destination register.
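The packed-SIMD FCVT.S.L example above can be sketched in Python. This is only an illustrative model, not part of the proposal: the function names (`to_signed32`, `to_float32`, `packed_fcvt_s_l`) are hypothetical, and it assumes signed integer elements, with the low 32 bits of rs1 as element 0.

```python
import struct

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def to_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def packed_fcvt_s_l(rs1_bits):
    """Hypothetical model of FCVT.S.L with rs1 elwidth = default/2.

    The 64-bit source is treated as two 32-bit integer elements, and
    each one is converted separately to a single-precision float.
    """
    lo = to_signed32(rs1_bits)        # element 0: bits 31..0
    hi = to_signed32(rs1_bits >> 32)  # element 1: bits 63..32
    return [to_float32(float(lo)), to_float32(float(hi))]
```

With the low half holding 7 and the high half holding all ones (-1 as a signed 32-bit value), the sketch yields two separate results rather than one conversion of the full 64-bit pattern.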
  
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0    | vlen0  |
-| r1    | vlen1  |
-| ..    | vlen.. |
-| r31   | vlen31 |
+# Exceptions
  
-An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
-whether a register was, if referred to in any standard instructions,
-implicitly to be treated as a vector.
+> What does an ADD of two different-sized vectors do in simple-V?
  
-Note:
+* if the two source operands are not the same length, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+  than the destination, throw an exception.
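A minimal sketch of the two rules above, assuming vector lengths are plain element counts and that a scalar destination is represented by `None`. The names `VectorLengthError` and `check_add_operands` are hypothetical and used only for illustration.

```python
class VectorLengthError(Exception):
    """Stands in for the hardware exception (hypothetical name)."""

def check_add_operands(src1_len, src2_len, dest_len=None):
    """Return the number of elements to process, or raise.

    dest_len is None when the destination is a scalar (an assumption
    of this sketch, not something the text specifies).
    """
    # Rule 1: both source vectors must be the same length.
    if src1_len != src2_len:
        raise VectorLengthError("source vector lengths differ")
    # Rule 2: a vector destination must be able to hold the source.
    if dest_len is not None and src1_len > dest_len:
        raise VectorLengthError("source longer than destination")
    return src1_len
```

For example, `check_add_operands(4, 4, 8)` is accepted, whereas `check_add_operands(4, 3, 8)` and `check_add_operands(4, 4, 2)` both raise.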
  
-* A vector length of 1 indicates that it is to be treated as a scalar.
-  Bitwidths (on the same register) are interpreted and meaningful.
-* A vector length of 0 indicates that the parallelism is to be switched
-  off for this register (treated as a scalar).  When length is 0,
-  the bitwidth CSR for the register is *ignored*.
+> And what about instructions like JALR? 
+> What does jumping to a vector do?
  
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
+* Throw an exception.  Whether that actually results in spawning threads
+  as part of the trap-handling remains to be seen.
  
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth is specifically not set) it becomes:
+# Under consideration <a name="issues"></a>
  
-    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+         (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+  (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops is it worthwhile adding a "start offset"?
+  (a bit like misaligned addressing... for registers)
+  or just use predication to skip start?
  
-This is in contrast to RVV:
+## Future (extra) bank be included (made mandatory)
  
-    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+Expanding the *standard* register file from 32 entries per bank
+to 64 per bank is quite an extensive architectural change.
+It also has implications for context-switching.
  
-## Element (SIMD) bitwidth CSRs
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV.  SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
  
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
+## How large should the Register and Predication CSR key-value stores be?
  
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0    | vew0   |
-| r1    | vew1   |
-| ..    | vew..  |
-| r31   | vew31  |
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed.  At the time of writing
+(12jul2018) that is too early to tell.  An approximate best-guess
+however would be 16 entries.
  
-vew may be one of the following (giving a table "bytestable", used below):
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that it would be used
+for performance reasons but instead for reducing instruction count.
+The number of CSR entries therefore has to be considered extremely
+carefully.
  
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
  
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point.  However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
  
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free.  However just one does not make
+much sense (as one only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
  
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
-
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
-
-# Instructions
-
-By being a topological remap of RVV concepts, the following RVV instructions
-remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT, VMERGE, VSELECT,
-VSLIDE, VCLASS and VPOPC.  Two instructions, VCLIP and VCLIPI, do not
-have RV Standard equivalents, so are left out of Simple-V.
-All other instructions from RVV are topologically re-mapped and retain
-their complete functionality, intact.
-
-## Instruction Format
-
-The instruction format for Simple-V does not actually have *any* explicit
-compare operations, *any* arithmetic, floating point or *any*
-memory instructions.
-Instead it *overloads* pre-existing branch operations into predicated
-variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on CSR configurations for vector length, bitwidth and
-predication.  *This includes Compressed instructions* as well as any
-future instructions and Custom Extensions.
-
-* For analysis of RVV see [[v_comparative_analysis]] which begins to
-  outline topologically-equivalent mappings of instructions
-* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
-  for format of Branch opcodes.
-
-**TODO**: *analyse and decide whether the implicit nature of predication
-as proposed is or is not a lot of hassle, and if explicit prefixes are
-a better idea instead.  Parallelism therefore effectively may end up
-as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
-with some opportunities for to use Compressed bringing it down to 48.
-Also to consider is whether one or both of the last two remaining Compressed
-instruction codes in Quadrant 1 could be used as a parallelism prefix,
-bringing parallelised opcodes down to 32-bit (when combined with C)
-and having the benefit of being explicit.*
-
-## VSETVL
-
-VSETVL is slightly different from RVV in that the minimum vector length
-is required to be at least the number of registers in the register file,
-and no more than XLEN.  This allows vector LOAD/STORE to be used to switch
-the entire bank of registers using a single instruction (see Appendix,
-"Context Switch Example").  The reason for limiting VSETVL to XLEN is
-down to the fact that predication bits fit into a single register of length
-XLEN bits.
-
-The second minor change is that when VSETVL is requested to be stored
-into x0, it is *ignored* silently.
-
-Unlike RVV, implementors *must* provide pseudo-parallelism (using sequential
-loops in hardware) if actual hardware-parallelism in the ALUs is not deployed.
-A hybrid is also permitted (as used in Broadcom's VideoCore-IV) however this
-must be *entirely* transparent to the ISA.
-
-## Branch Instruction:
-
-Branch operations use standard RV opcodes that are reinterpreted to be
-"predicate variants" in the instance where either of the two src registers
-have their corresponding CSRvectorlen[src] entry as non-zero.  When this
-reinterpretation is enabled the predicate target register rs3 is to be
-treated as a bitfield (up to a maximum of XLEN bits corresponding to a
-maximum of XLEN elements).
-
-If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
-goes ahead as vector-scalar or scalar-vector.  Implementors should note that
-this could require considerable multi-porting of the register file in order
-to parallelise properly, so may have to involve the use of register cacheing
-and transparent copying (see Multiple-Banked Register File Architectures
-paper).
-
-In instances where no vectorisation is detected on either src registers
-the operation is treated as an absolutely standard scalar branch operation.
-
-This is the overloaded table for Integer-base Branch operations.  Opcode
-(bits 6..0) is set in all cases to 1100011.
+This applies *in addition* - in the RV32E case - if an RV32E implementation
+happens also to support U/S/M modes.  This would be considered quite
+rare but not outside of the realm of possibility.
  
-[[!table  data="""
-31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
-7           | 5        | 5     | 3      | 4             | 1  | 7       |
-reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
-reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
-reserved    | src2     | src1  | 001    | predicate rs3     || BNE     |
-reserved    | src2     | src1  | 010    | predicate rs3     || rsvd    |
-reserved    | src2     | src1  | 011    | predicate rs3     || rsvd    |
-reserved    | src2     | src1  | 100    | predicate rs3     || BLE     |
-reserved    | src2     | src1  | 101    | predicate rs3     || BGE     |
-reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
-reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
-"""]]
+Conclusion: all needs careful analysis and future work.
  
-Note that just as with the standard (scalar, non-predicated) branch
-operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting
-src1 and src2.
-
-Below is the overloaded table for Floating-point Predication operations.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield (done
-implicitly).
-
-As with
-Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
-Likewise Single-precision, fmt bits 26..25) is still set to 00.
-Double-precision is still set to 01, whilst Quad-precision
-appears not to have a definition in V2.3-Draft (but should be unaffected).
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-becomes more of an impact as an explicit (scalar) instruction is needed
-to invert the predicate bitmask.  An additional encoding funct3=011 is
-therefore proposed to cater for this.
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
  
-[[!table  data="""
-31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
-funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
-5       | 2        | 5        | 5     | 3      | 4        | 7       |
-10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
-10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | FNE     |
-10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
-10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
-"""]]
+On balance it's a neat idea, however it does seem to be one where the
+benefits are not really clear.  It would obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails).  However
+the "fly in the ointment" is that x0 is hard-coded to "zero".  The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
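The "double-step" over x0 can be illustrated with a short sketch. The function name `wrapped_registers` and its parameters are hypothetical; it assumes a 32-entry integer register file and a start register other than x0.

```python
def wrapped_registers(start, vl, nregs=32):
    """List the register numbers a vector of length vl would occupy
    if register use were allowed to wrap past the top of the file.

    x0 is skipped on wrap-around because it is hard-wired to zero.
    """
    assert 1 <= start < nregs, "sketch assumes the vector does not start at x0"
    regs = []
    reg = start
    for _ in range(vl):
        regs.append(reg)
        reg += 1
        if reg == nregs:
            reg = 1  # wrap to x1, stepping over x0
    return regs
```

A four-element vector starting at x30 would therefore occupy x30, x31, x1 and x2, matching the sequence in the heading.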
  
-Note (**TBD**): floating-point exceptions will need to be extended
-to cater for multiple exceptions (and statuses of the same).  The
-usual approach is to have an array of status codes and bit-fields,
-and one exception, rather than throw separate exceptions for each
-Vector element.
+## Can CLIP be done as a CSR (mode, like elwidth)
  
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
+RVV appears to be going this way.  At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
  
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
-                           s2 ? vreg[rs2][i] : sreg[rs2]);
+No details are given, however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis.  Other bits may be used for other purposes
+(see SIMD saturation below).
  
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
+## SIMD saturation (etc.) also set as a mode?
  
-    if I/F == INT: # integer type cmp
-        pred_enabled = int_pred_enabled # TODO: exception if not set!
-        preg = int_pred_reg[rd]
-        reg = int_regfile
-    else:
-        pred_enabled = fp_pred_enabled # TODO: exception if not set!
-        preg = fp_pred_reg[rd]
-        reg = fp_regfile
-
-    s1 = CSRvectorlen[src1] > 1;
-    s2 = CSRvectorlen[src2] > 1;
-    for (int i=0; i<vl; ++i)
-       preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
-                          s2 ? reg[src2+i] : reg[src2]);
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
-  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
-  Reordering") setting Vector-Length times (number of SIMD elements) bits
-  in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually have an adjustment to the Program
-  Counter, so all of bits 25 through 30 in every case are not needed.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
-  be put to good use if there is a suitable use-case.
-* FEQ and FNE (and BEQ and BNE) are included in order to save one
-  instruction having to invert the resultant predicate bitfield.
-  FLT and FLE may be inverted to FGT and FGE if needed by swapping
-  src1 and src2 (likewise the integer counterparts).
-
-## Compressed Branch Instruction:
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
  
-[[!table  data="""
-15..13 | 12...10  | 9..7 | 6..5  | 4..2 | 1..0 | name |
-funct3 | imm      | rs10 | imm   |      | op   |      |
-3      | 3        | 3    | 2     |  3   | 2    |      |
-C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
-110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
-111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
-110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
-111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |
-"""]]
+## Include src1/src2 predication on Comparison Ops?
  
-Notes:
+In C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
  
-* Bits 5 13 14 and 15 make up the comparator type
-* Bit 6 indicates whether to use integer or floating-point comparisons
-* In both floating-point and integer cases there are four predication
-  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
-  src1 and src2).
+The natural question therefore to ask is: where else could this
+flexibility be deployed?  What about comparison operations?
  
-## LOAD / STORE Instructions <a name="load_store"></a>
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i  # i: element index
  
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction,
-and likewise for STORE.
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
  
-Revised LOAD:
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
  
-[[!table  data="""
-31 | 30 | 29 25 | 24    20 | 19 15 | 14   12 | 11 7 | 6    0 |
-imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
-1  | 1  |  5    | 5        | 5     | 3       | 5    | 7      |
-?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
-"""]]
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
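To make the question concrete, here is a hedged sketch of a vectorised compare that writes a predicate bitfield, with an optional mask that predicates the update itself. Everything here is an assumption for illustration: the name `vector_compare`, the register file modelled as a Python list, and the choice to leave masked-out bits untouched.

```python
import operator

def vector_compare(regs, pred, src1, src2, vl, cmp, mask=None):
    """Set bit i of regs[pred] when cmp holds for element i.

    mask, if given, is a bitfield that predicates the update: elements
    whose mask bit is clear are skipped and their predicate bit is
    left as it was (one possible policy, assumed here).
    """
    for i in range(vl):
        if mask is not None and not (mask >> i) & 1:
            continue
        if cmp(regs[src1 + i], regs[src2 + i]):
            regs[pred] |= 1 << i
    return regs[pred]

def demo(mask=None):
    regs = [0] * 32
    regs[8:12] = [1, 5, 3, 9]    # src1 vector in x8..x11
    regs[12:16] = [1, 2, 3, 4]   # src2 vector in x12..x15
    return vector_compare(regs, 5, 8, 12, 4, operator.eq, mask)
```

Without a mask, elements 0 and 2 compare equal, so `demo()` sets bits 0 and 2 of the predicate register. With `mask=0b0011` only elements 0 and 1 are examined, so only bit 0 is set.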
  
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format.  Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
+It is quite complex, in other words, and needs careful consideration.
  
-Notes:
+## 8/16-bit ops is it worthwhile adding a "start offset"?
  
-* LOAD remains functionally (topologically) identical to RVV LOAD
-  (for both integer and floating-point variants).
-* Predication CSR-marking register is not explicitly shown in instruction, it's
-  implicit based on the CSR predicate state for the rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
-  is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
  
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
+For example rather than take 2 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
  
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
  
-    pred_enabled = int_pred_enabled
-    preg = int_pred_reg[rd]
+an offset of 1 would result in four operations as follows, instead:
  
-    for (int i=0; i<vl; ++i)
-      if (preg_enabled[rd] && [!]preg[i])
-        for (int j=0; j<seglen+1; j++)
-        {
-          if CSRvectorised[rs2])
-             offs = vreg[rs2][i]
-          else
-             offs = i*(seglen+1)*stride;
-          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
-        }
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
  
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in.  So this
+leaves just the packed-SIMD case to consider.
  
-A similar instruction exists for STORE, with identical topological
-translation of all features.  **TODO**
+Two ways in which this could be implemented / emulated (without special
+hardware):
  
-## Compressed LOAD / STORE Instructions
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+  penalties for doing so.
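The first emulation option (shuffling the data along by one byte before the operation) can be sketched as follows. This is a model under stated assumptions: 32-bit registers holding four 8-bit elements with element 0 in the low byte, wrap-around (non-saturating) addition, and hypothetical helper names `elements`, `pack`, `offset_source` and `packed_add8`.

```python
def elements(reg):
    """Split a 32-bit register value into four 8-bit elements."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]

def pack(elems):
    """Pack four 8-bit elements into a 32-bit register value."""
    out = 0
    for i, e in enumerate(elems):
        out |= (e & 0xFF) << (8 * i)
    return out

def offset_source(lo, hi, offset):
    """Take four elements starting at element `offset` of the
    register pair hi:lo, by shifting the data along `offset` bytes."""
    combined = (hi << 32) | lo
    return (combined >> (8 * offset)) & 0xFFFFFFFF

def packed_add8(a, b):
    """Element-for-element 8-bit add, wrapping modulo 256."""
    return pack([(x + y) & 0xFF for x, y in zip(elements(a), elements(b))])
```

With r4 = [10, 11, 12, 13], r5 = [14, 15, 16, 17] and r6 = [1, 2, 3, 4], `packed_add8(offset_source(r4, r5, 1), r6)` adds r4[1], r4[2], r4[3] and r5[0] to r6[0..3], reproducing the offset-by-one sequence listed above.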
  
-Compressed LOAD and STORE are of the same format, where bits 2-4 are
-a src register instead of dest:
-
-[[!table  data="""
-15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
-funct3 | imm         | rs10   | imm         | rd0  | op   |
-3      | 3           | 3      | 2           | 3    | 2    |
-C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled.  In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
-page 76, "For virtual memory systems some data accesses could be resident
-in physical memory and some not".  The interesting question then arises:
-how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults and adjusting the TLB
-in advance, accordingly: other strategies are explored in the Appendix
-Section "Virtual Memory Page Faults".
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
-
-> And what about instructions like JALR? 
-> What does jumping to a vector do?
-
-* Throw an exception.  Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
+The question then is whether avoiding that performance hit justifies extra
+hardware to byte-shuffle / shift the data by an arbitrary offset.  On
+balance, given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out of the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
  
  # Implementing V on top of Simple-V
  
@@ -1161,37 +856,35 @@ the question is asked "How can each of the proposals effectively implement
  
  ### Example Instruction translation: <a name="example_translation"></a>
  
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO.  r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+The instruction "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO.  r7 and r1 are marked as "vectorised"
+whilst r4 is not:
  
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
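The two translations above can be sketched as a small expansion routine. This is a minimal model only: the function name `expand`, the `vectorised` set and the explicit `vl` argument are assumptions made for illustration, and a real implementation would take the vector length and the "vectorised" marking from CSR state.

```python
def expand(op, rd, rs1, rs2, vectorised, vl):
    """Expand one instruction into `vl` scalar instructions for the FIFO.

    Registers listed in `vectorised` advance by one per element;
    all other registers are treated as scalars and stay fixed.
    """
    fifo = []
    for i in range(vl):
        regs = [r + i if r in vectorised else r for r in (rd, rs1, rs2)]
        fifo.append("{} r{} r{} r{}".format(op, *regs))
    return fifo
```

With `vl` set to 3, `expand("ADD", 7, 4, 4, {7, 4}, 3)` gives the first list and `expand("ADD", 7, 4, 1, {7, 1}, 3)` gives the second.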
  
  ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
  
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
-          if (CSRpredicate[predr][i]) # i *think* this is right...
-             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
-          # now increment the idxs
-          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
-             id += 1;
-          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
-             irs1 += 1;
-          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
-             irs2 += 1;
-    }
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      predval = get_pred_val(FALSE, rd);
+      for (i = 0; i < VL; i++)
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+        if (int_vec[rd ].isvector)  { id += 1; }
+        if (int_vec[rs1].isvector)  { irs1 += 1; }
+        if (int_vec[rs2].isvector)  { irs2 += 1; }
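For readers who want to experiment, the pseudocode above can be approximated by the following runnable sketch. Several things here are assumptions rather than part of the proposal: `VecEntry` is a hypothetical stand-in for one `int_vec[]` table entry, the predicate mask is passed in directly instead of being fetched with `get_pred_val`, VL is an explicit argument, and the `isvector` flag is captured before the register number is remapped through `regidx`.

```python
from dataclasses import dataclass


@dataclass
class VecEntry:
    """Hypothetical stand-in for one int_vec[] table entry."""
    isvector: bool = False
    regidx: int = 0


def op_add(ireg, int_vec, rd, rs1, rs2, vl, predval):
    """Scalar ADD that becomes an element-wise loop when operands are vectorised."""
    def lookup(r):
        # Remap the register number and remember whether it was a vector.
        entry = int_vec.get(r, VecEntry())
        return (entry.regidx if entry.isvector else r), entry.isvector

    rd, rd_vec = lookup(rd)
    rs1, rs1_vec = lookup(rs1)
    rs2, rs2_vec = lookup(rs2)
    id_ = irs1 = irs2 = 0
    for i in range(vl):
        if predval & (1 << i):  # bit i of the predicate enables element i
            ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
        # Only vectorised operands step on to their next register.
        if rd_vec:
            id_ += 1
        if rs1_vec:
            irs1 += 1
        if rs2_vec:
            irs2 += 1
    return ireg
```

Note that the index increments happen on every iteration, whether or not the element is predicated out, so a masked-out element leaves its destination register untouched.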
  
  ## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
  
@@ -1559,7 +1252,7 @@ To illustrate how this works, here is some example code from FreeRTOS
          ...
          STORE  x30, 29 * REGBYTES(sp)
          STORE  x31, 30 * REGBYTES(sp)
-        
+
          /* Store current stackpointer in task control block (TCB) */
          LOAD   t0, pxCurrentTCB        //pointer
          STORE  sp, 0x0(t0)
@@ -1620,11 +1313,11 @@ bank of registers is to be loaded/saved:
      .macroVectorSetup
          MVECTORCSRx1 = 31, defaultlen
          MVECTORCSRx4 = 28, defaultlen
-    
+
          /* Save Context */
          SETVL x0, x0, 31 /* x0 ignored silently */
-        STORE  x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth 
-        
+        STORE  x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
          /* Restore registers,
             Skip global pointer because that does not change */
          LOAD   x1, 0x0(sp)
@@ -1830,7 +1523,7 @@ or may not require an additional pipeline stage)
  >> FIFO).
  
  > Those bits cannot be known until after the registers are decoded from the
-> instruction and a lookup in the "vector length table" has completed. 
+> instruction and a lookup in the "vector length table" has completed.
  > Considering that one of the reasons RISC-V keeps registers in invariant
  > positions across all instructions is to simplify register decoding, I expect
  > that inserting an SRAM read would lengthen the critical path in most
@@ -1849,6 +1542,68 @@ reply:
  >  is there anything unreasonable that anyone can foresee about that?
  > what are the down-sides?
  
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove.  it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay.  so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+<https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/g3feFnAoKIM>
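A minimal sketch of the quoted advice, for unsigned integers only. It rests on assumptions that the thread does not spell out: that "largest size" means the widest of the two source widths, that upconversion of unsigned values is plain zero-extension, and that downconversion is truncation. The function names are hypothetical.

```python
def mask(width):
    """Bit-mask covering `width` low-order bits."""
    return (1 << width) - 1


def add_mixed_width(a, a_width, b, b_width, out_width):
    """Unsigned add with per-operand element widths.

    Assumption: the operation is carried out at the widest source width,
    and narrowing (truncation) happens only when writing the result.
    """
    op_width = max(a_width, b_width)           # inputs only ever upconvert
    a_ext = a & mask(a_width)                  # zero-extension for unsigned
    b_ext = b & mask(b_width)
    result = (a_ext + b_ext) & mask(op_width)  # operate at the largest size
    return result & mask(out_width)            # downconvert only at the output
```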
+
+## Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**.  This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many registers of the
+  standard register file to use contiguously)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+  is placed in a dest register), and if the "required" length is longer
+  than the *available* length, the dest reg is set to the MIN of those
+  two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+  length and thus there is no way (at the time that VSETVL is called) to
+  know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+  the vector operation to the length specified in the destination
+  register's CSR, at the time that each instruction is issued...
+  except that that cannot possibly be guaranteed to match
+  with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+  VSETVL.  "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+  destreg equal to MIN(counterreg, lenimmed), with register-based
+  variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call).  This is probably desirable behaviour.
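The behaviour of the first option can be modelled in a few lines. This is a sketch under assumptions: `vsetvl` and `strip_mine` are illustrative names, and the loop shows the conventional strip-mining pattern in which the returned length is subtracted from the remaining element count on each pass.

```python
def vsetvl(counter, len_limit):
    """First option: VL and destreg both become MIN(counter, len_limit)."""
    return min(counter, len_limit)


def strip_mine(total, len_limit):
    """Return the VL chosen on each pass over `total` elements."""
    chunks = []
    remaining = total
    while remaining > 0:
        vl = vsetvl(remaining, len_limit)  # applies to all vectorised regs until changed
        chunks.append(vl)
        remaining -= vl
    return chunks
```

For ten elements and a length limit of four, the passes process 4, 4 and then 2 elements.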
  
  ## Implementation Paradigms <a name="implementation_paradigms"></a>
  
@@ -1904,6 +1659,61 @@ in advance to avoid.
  
  TBD: floating-point compare and other exception handling
  
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving in the case of
+Store buffer).
+
+Also, use the L1 Miss buffers as these already HAVE to be snooped by
+coherence traffic. These are used to monitor that all participating
+cache lines remain interference free, and amalgamate same into a CPU
+signal accessible via a branch or predicate.
+
+The Load buffers manage inbound traffic.
+The Store buffers manage outbound traffic.
+
+Done properly, the participating cache lines can exceed the associativity
+of the L1 cache without architectural harm (may incur additional latency).
+
+<https://groups.google.com/d/msg/comp.arch/QVl3c9vVDj0/ol_232-pAQAJ>
+
+> > > so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> > 
+> > Can you explain what you mean by this paragraph?
+> 
+>  best put in sequential events, probably.
+> 
+>  <core1> LR    <-- 64-instruction countdown starts here
+>  <core1> ... 63
+>  <core1> ... 62
+>  <core2> LR same address <--- notes that core1 is on 61,
+>                          so pauses for **UP TO** 61 cycles
+>  <core1> ... 32
+>  <core1> SC <- core1 didn't reach zero, therefore valid, therefore
+>                core2 is now **UNBLOCKED**, is granted the
+>                load-reservation (and begins its **own** 64-cycle
+>                LR instruction countdown)
+>  <core2> ... 63
+>  <core2> ... 62
+>  <core2> ... 
+>  <core2> ...
+>  <core2> SC <- also valid
+
+Looks to me that you could effect the same functionality by simply
+holding onto the cache line in core 1 preventing core 2 from
+<architecturally> getting past the LR.
+
+On the other hand, the freeze is similar to how the MP CRAYs did
+ATOMIC stuff.
+
  # References
  
  * SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
@@ -1935,7 +1745,11 @@ TBD: floating-point compare and other exception handling
    <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
  * Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
  * RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
-* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf> 
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
  * Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>
  * Barcelona SIMD Presentation <https://content.riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf>
  * <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Full Description (last page) of RVV instructions
+  <https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf>
+* PULP Low-energy Cluster Vector Processor
+  <http://iis-projects.ee.ethz.ch/index.php/Low-Energy_Cluster-Coupled_Vector_Coprocessor_for_Special-Purpose_PULP_Acceleration>