X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=0642ce926d702469769bae2210773b1035398a95;hb=fa430b3a7c6a80d54aa5d040bd3dd4add9fcf5d0;hp=723cc07259960db0b00ed61b10873a310b9e3ab8;hpb=77fdb0e132155a3060cad2104ea528b1ab9233c2;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 723cc0725..0642ce926 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,7 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+**Note: this document is out of date and involved early ideas and discussions**
+
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -12,6 +14,11 @@ of Out-of-order restructuring (including parallel ALU lanes) or VLIW
 implementations, or SIMD, or anything else, would then benefit from
 the uniformity of a consistent API.
 
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequentially is suitable for use even with RV32E.
+
+* Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
+* Specification: now move to its own page: [[specification]]
+
 [[!toc ]]
 
 # Introduction
@@ -355,659 +362,6 @@ absolute bare minimum level of compliance with the "API" (software-traps
 when vectorisation is detected), all the way up to full supercomputing
 level all-hardware parallelism.  Options are covered in the Appendix.
 
-# CSRs <a name="csrs"></a>
-
-There are two CSR tables needed to create lookup tables which are used at
-the register decode phase.
-
-* Integer Register N is Vector
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-
-Also (see Appendix, "Context Switch Example") it may turn out to be important
-to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
-Vectorised LOAD / STORE may be used to load and store multiple registers:
-something that is missing from the Base RV ISA.
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-* Whilst the predication CSR is a key-value store it *generates* easier-to-use
-  state information.
-* TODO: assess whether the same technique could be applied to the other
-  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
-  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
-  needed for context-switches (empty slots need never be stored).
-
-## Predication CSR <a name="predication_csr_table"></a>
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated.  However it is important to note
-that the *actual* register is *different* from the one that ends up
-being used, due to the level of indirection through the lookup table.
-This includes (in the future) redirecting to a *second* bank of
-integer registers (as a future option)
-
-* regidx is the actual register that in combination with the
-  i/f flag, if that integer or floating-point register is referred to,
-  results in the lookup table being referenced to find the predication
-  mask to use on the operation in which that (regidx) register has
-  been used
-* predidx (in combination with the bank bit in the future) is the
-  *actual* register to be used for the predication mask.  Note:
-  in effect predidx is actually a 6-bit register address, as the bank
-  bit is the MSB (and is nominally set to zero for now).
-* inv indicates that the predication mask bits are to be inverted
-  prior to use *without* actually modifying the contents of the
-  register itself.
-* zeroing is either 1 or 0, and if set to 1, the operation must
-  place zeros in any element position where the predication mask is
-  set to zero.  If zeroing is set to 1, unpredicated elements *must*
-  be left alone.  Some microarchitectures may choose to interpret
-  this as skipping the operation entirely.  Others which wish to
-  stick more closely to a SIMD architecture may choose instead to
-  interpret unpredicated elements as an internal "copy element"
-  operation (which would be necessary in SIMD microarchitectures
-  that perform register-renaming)
-
-| PrCSR | 13     | 12     | 11    | 10  | (9..5)  | (4..0)  |
-| ----- | -      | -      | -     | -   | ------- | ------- |
-| 0     | bank0  | zero0  | inv0  | i/f | regidx  | predidx |
-| 1     | bank1  | zero1  | inv1  | i/f | regidx  | predidx |
-| ..    | bank.. | zero.. | inv.. | i/f | regidx  | predidx |
-| 15    | bank15 | zero15 | inv15 | i/f | regidx  | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    struct pred {
-        bool zero;
-        bool inv;
-        bool bank;   // 0 for now, 1=rsvd
-        bool enabled;
-        int predidx; // redirection: actual int register to use
-    }
-
-    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
-    struct pred int_pred_reg[32];  // 64 in future (bank=1)
-
-    for (i = 0; i < 16; i++)
-      tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
-      idx = CSRpred[i].regidx
-      tb[idx].zero = CSRpred[i].zero
-      tb[idx].inv  = CSRpred[i].inv
-      tb[idx].bank = CSRpred[i].bank
-      tb[idx].predidx  = CSRpred[i].predidx
-      tb[idx].enabled  = true
-
-So when an operation is to be predicated, it is the internal state that
-is used.  In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i<vl; ++i)
-        if ([!]preg[p][i])
-           (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store, which iwws used
-as follows.
-
-    if type(iop) == INT:
-        preg = int_pred_reg[rd]
-    else:
-        preg = fp_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-        predidx = preg[rd].predidx; // the indirection takes place HERE
-        if (!preg[rd].enabled)
-            predicate = ~0x0; // all parallel ops enabled
-        else:
-            predicate = intregfile[predidx]; // get actual reg contents HERE
-            if (preg[rd].inv) // invert if requested
-                predicate = ~predicate;
-        if (predicate && (1<<i))
-           (d ? regfile[rd+i] : regfile[rd]) =
-            iop(s1 ? regfile[rs1+i] : regfile[rs1],
-                s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
-        else if (preg[rd].zero)
-            // TODO: place zero in dest reg
-
-Note:
-
-* d, s1 and s2 are booleans indicating whether destination,
-  source1 and source2 are vector or scalar
-* key-value CSR-redirection of rd, rs1 and rs2 have NOT been included
-  above, for clarity.  rd, rs1 and rs2 all also must ALSO go through
-  register-level redirection (from the Register CSR table) if they are
-  vectors.
-
-If written as a function, obtaining the predication mask (but not whether
-zeroing takes place) may be done as follows:
-
-    def get_pred_val(bool is_fp_op, int reg):
-       tb = int_pred if is_fp_op else fp_pred
-       if (!tb[reg].enabled):
-          return ~0x0              // all ops enabled
-       predidx = tb[reg].predidx   // redirection occurs HERE
-       predicate = intreg[predidx] // actual predicate HERE
-       if (tb[reg].inv):
-          predicate = ~predicate   // invert ALL bits
-       return predicate
-
-## MAXVECTORLENGTH
-
-MAXVECTORLENGTH is the same concept as MVL in RVV.  However in Simple-V,
-given that its primary (base, unextended) purpose is for 3D, Video and
-other purposes (not requiring supercomputing capability), it makes sense
-to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
-and so on).
-
-The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers.  This keeps the implementation a little simpler.
-Note also (as also described in the VSETVL section) that the *minimum*
-for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
-and 31 for RV32 or RV64).
-
-Note that RVV on top of Simple-V may choose to over-ride this decision.
-
-## Register CSR key-value (CAM) table
-
-The purpose of the Register CSR table is four-fold:
-
-* To mark integer and floating-point registers as requiring "redirection"
-  if it is ever used as a source or destination in any given operation.
-  This involves a level of indirection through a 5-to-6-bit lookup table
-  (where the 6th bit - bank - is always set to 0 for now).
-* To indicate whether, after redirection through the lookup table, the
-  register is a vector (or remains a scalar).
-* To over-ride the implicit or explicit bitwidth that the operation would
-  normally give the register.
-* To indicate if the register is to be interpreted as "packed" (SIMD)
-  i.e. containing multiple contiguous elements of size equal to "bitwidth".
-
-| RgCSR | 15     | 14     | 13       | (12..11) | 10  | (9..5)  | (4..0)  |
-| ----- | -      | -      | -        | -        | -   | ------- | ------- |
-| 0     | simd0  | bank0  | isvec0   | vew0     | i/f | regidx  | predidx |
-| 1     | simd1  | bank1  | isvec1   | vew1     | i/f | regidx  | predidx |
-| ..    | simd.. | bank.. | isvec..  | vew..    | i/f | regidx  | predidx |
-| 15    | simd15 | bank15 | isvec15  | vew15    | i/f | regidx  | predidx |
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth  |
-| --- | --------- |
-| 00  | default   |
-| 01  | default/2 |
-| 10  | 8         |
-| 11  | 16        |
-
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
-
-As the above table is a CAM (key-value store) it may be appropriate
-to expand it as follows:
-
-    struct vectorised fp_vec[32], int_vec[32]; // 64 in future
-
-    for (i = 0; i < 16; i++) // 16 CSRs?
-       tb = int_vec if CSRvec[i].type == 0 else fp_vec
-       idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
-       tb[idx].elwidth  = CSRvec[i].elwidth
-       tb[idx].regidx   = CSRvec[i].regidx  // indirection
-       tb[idx].isvector = CSRvec[i].isvector // 0=scalar
-       tb[idx].packed   = CSRvec[i].packed  // SIMD or not
-       tb[idx].bank     = CSRvec[i].bank    // 0 (1=rsvd)
-
-TODO: move elsewhere
-
-    # TODO: use elsewhere (retire for now)
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    elif (vew == 1)
-        bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 8 or 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
-
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
-
-# Instructions
-
-Despite being a 98% complete and accurate topological remap of RVV
-concepts and functionality, the only instructions needed are VSETVL
-and VGETVL.  *All* RVV instructions can be re-mapped, however xBitManip
-becomes a critical dependency for efficient manipulation of predication
-masks (as a bit-field).  Despite the removal of all but VSETVL and VGETVL,
-*all instructions from RVV are topologically re-mapped and retain their
-complete functionality, intact*.
-
-Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
-equivalents, so are left out of Simple-V.  VSELECT could be included if
-there existed a MV.X instruction in RV (MV.X is a hypothetical
-non-immediate variant of MV that would allow another register to
-specify which register was to be copied).  Note that if any of these three
-instructions are added to any given RV extension, their functionality
-will be inherently parallelised.
-
-## Instruction Format
-
-The instruction format for Simple-V does not actually have *any* explicit
-compare operations, *any* arithmetic, floating point or *any*
-memory instructions.
-Instead it *overloads* pre-existing branch operations into predicated
-variants, and implicitly overloads arithmetic operations, MV,
-FCVT, and LOAD/STORE
-depending on CSR configurations for bitwidth and
-predication.  **Everything** becomes parallelised.  *This includes
-Compressed instructions* as well as any
-future instructions and Custom Extensions.
-
-* For analysis of RVV see [[v_comparative_analysis]] which begins to
-  outline topologically-equivalent mappings of instructions
-* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
-  for format of Branch opcodes.
-
-**TODO**: *analyse and decide whether the implicit nature of predication
-as proposed is or is not a lot of hassle, and if explicit prefixes are
-a better idea instead.  Parallelism therefore effectively may end up
-as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
-with some opportunities for to use Compressed bringing it down to 48.
-Also to consider is whether one or both of the last two remaining Compressed
-instruction codes in Quadrant 1 could be used as a parallelism prefix,
-bringing parallelised opcodes down to 32-bit (when combined with C)
-and having the benefit of being explicit.*
-
-## VSETVL
-
-NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
-with the instruction format remaining the same.
-
-VSETVL is slightly different from RVV in that the minimum vector length
-is required to be at least the number of registers in the register file,
-and no more than XLEN.  This allows vector LOAD/STORE to be used to switch
-the entire bank of registers using a single instruction (see Appendix,
-"Context Switch Example").  The reason for limiting VSETVL to XLEN is
-down to the fact that predication bits fit into a single register of length
-XLEN bits.
-
-The second change is that when VSETVL is requested to be stored
-into x0, it is *ignored* silently (VSETVL x0, x5, #4)
-
-The third change is that there is an additional immediate added to VSETVL,
-to which VL is set after first going through MIN-filtering.
-So When using the "vsetl rs1, rs2, #vlen" instruction, it becomes:
-
-    VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-where RegfileLen <= MAXVECTORDEPTH < XLEN
-
-This has implication for the microarchitecture, as VL is required to be
-set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
-requested in the #immediate parameter.  RVV has the option to set VL
-to an arbitrary value that suits the conditions and the micro-architecture:
-SV does *not* permit that.
-
-The reason is so that if SV is to be used for a context-switch or as a
-substitute for LOAD/STORE-Multiple, the operation can be done with only
-2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
-single LD/ST operation).  If VL does *not* get set to the register file
-length when VSETVL is called, then a software-loop would be needed.
-To avoid this need, VL *must* be set to exactly what is requested
-(limits notwithstanding).
-
-Therefore, in turn, unlike RVV, implementors *must* provide
-pseudo-parallelism (using sequential loops in hardware) if actual
-hardware-parallelism in the ALUs is not deployed.  A hybrid is also
-permitted (as used in Broadcom's VideoCore-IV) however this must be
-*entirely* transparent to the ISA.
-
-## Branch Instruction:
-
-Branch operations use standard RV opcodes that are reinterpreted to
-be "predicate variants" in the instance where either of the two src
-registers are marked as vectors (isvector=1).  When this reinterpretation
-is enabled the "immediate" field of the branch operation is taken to be a
-predication target register, rs3.  The predicate target register rs3 is
-to be treated as a bitfield (up to a maximum of XLEN bits corresponding
-to a maximum of XLEN elements).
-
-If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
-goes ahead as vector-scalar or scalar-vector.  Implementors should note that
-this could require considerable multi-porting of the register file in order
-to parallelise properly, so may have to involve the use of register cacheing
-and transparent copying (see Multiple-Banked Register File Architectures
-paper).
-
-In instances where no vectorisation is detected on either src registers
-the operation is treated as an absolutely standard scalar branch operation.
-
-This is the overloaded table for Integer-base Branch operations.  Opcode
-(bits 6..0) is set in all cases to 1100011.
-
-[[!table  data="""
-31    .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12,10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
-7           | 5        | 5     | 3      | 4             | 1  | 7       |
-reserved    | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
-reserved    | src2     | src1  | 000    | predicate rs3     || BEQ     |
-reserved    | src2     | src1  | 001    | predicate rs3     || BNE     |
-reserved    | src2     | src1  | 010    | predicate rs3     || rsvd    |
-reserved    | src2     | src1  | 011    | predicate rs3     || rsvd    |
-reserved    | src2     | src1  | 100    | predicate rs3     || BLE     |
-reserved    | src2     | src1  | 101    | predicate rs3     || BGE     |
-reserved    | src2     | src1  | 110    | predicate rs3     || BLTU    |
-reserved    | src2     | src1  | 111    | predicate rs3     || BGEU    |
-"""]]
-
-Note that just as with the standard (scalar, non-predicated) branch
-operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting
-src1 and src2.
-
-Below is the overloaded table for Floating-point Predication operations.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield (done
-implicitly).
-
-As with
-Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
-Likewise Single-precision, fmt bits 26..25) is still set to 00.
-Double-precision is still set to 01, whilst Quad-precision
-appears not to have a definition in V2.3-Draft (but should be unaffected).
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-becomes more of an impact.  To deal with this, SV's predication has
-had "invert" added to it.
-
-[[!table  data="""
-31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 .. 7  | 6 ... 0 |
-funct5  | fmt      | rs2      | rs1   | funct3 | rd       | opcode  |
-5       | 2        | 5        | 5     | 3      | 4        | 7       |
-10100   | 00/01/11 | src2     | src1  | 010    | pred rs3 | FEQ     |
-10100   | 00/01/11 | src2     | src1  | **011**| pred rs3 | rsvd    |
-10100   | 00/01/11 | src2     | src1  | 001    | pred rs3 | FLT     |
-10100   | 00/01/11 | src2     | src1  | 000    | pred rs3 | FLE     |
-"""]]
-
-Note (**TBD**): floating-point exceptions will need to be extended
-to cater for multiple exceptions (and statuses of the same).  The
-usual approach is to have an array of status codes and bit-fields,
-and one exception, rather than throw separate exceptions for each
-Vector element.
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-         preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
-                           s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
-    if I/F == INT: # integer type cmp
-        preg = int_pred_reg[rd]
-        reg = int_regfile
-    else:
-        preg = fp_pred_reg[rd]
-        reg = fp_regfile
-
-    s1 = reg_is_vectorised(src1);
-    s2 = reg_is_vectorised(src2);
-    if (!s2 && !s1) goto branch;
-    for (int i = 0; i < VL; ++i)
-      if (cmp(s1 ? reg[src1+i]:reg[src1],
-              s2 ? reg[src2+i]:reg[src2])
-             preg[rs3] |= 1<<i;  # bitfield not vector
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
-  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
-  Reordering") setting Vector-Length times (number of SIMD elements) bits
-  in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually have an adjustment to the Program
-  Counter, so all of bits 25 through 30 in every case are not needed.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
-  be put to good use if there is a suitable use-case.
-  FLT and FLE may be inverted to FGT and FGE if needed by swapping
-  src1 and src2 (likewise the integer counterparts).
-
-## Compressed Branch Instruction:
-
-Compressed Branch instructions are likewise re-interpreted as predicated
-2-register operations, with the result going into rs3.  All the bits of
-the immediate are re-interpreted for different purposes, to extend the
-number of comparator operations to beyond the original specification,
-but also to cater for floating-point comparisons as well as integer ones.
-
-[[!table  data="""
-15..13 | 12...10  | 9..7 | 6..5  | 4..2 | 1..0 | name |
-funct3 | imm      | rs10 | imm   |      | op   |      |
-3      | 3        | 3    | 2     |  3   | 2    |      |
-C.BPR  | pred rs3 | src1 | I/F B | src2 | C1   |      |
-110    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.EQ |
-111    | pred rs3 | src1 | I/F 0 | src2 | C1   | P.NE |
-110    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LT |
-111    | pred rs3 | src1 | I/F 1 | src2 | C1   | P.LE |
-"""]]
-
-Notes:
-
-* Bits 5 13 14 and 15 make up the comparator type
-* Bit 6 indicates whether to use integer or floating-point comparisons
-* In both floating-point and integer cases there are four predication
-  comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
-  src1 and src2).
-
-## LOAD / STORE Instructions <a name="load_store"></a>
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction,
-and likewise for STORE.
-
-Revised LOAD:
-
-[[!table  data="""
-31 | 30 | 29 25 | 24    20 | 19 15 | 14   12 | 11 7 | 6    0 |
-imm[11:0]               |||| rs1   | funct3  | rd   | opcode |
-1  | 1  |  5    | 5        | 5     | 3       | 5    | 7      |
-?  | s  |  rs2  | imm[4:0] | base  | width   | dest | LOAD   |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format.  Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
-  (for both integer and floating-point variants).
-* Predication CSR-marking register is not explicitly shown in instruction, it's
-  implicit based on the CSR predicate state for the rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
-  is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-
-    preg = int_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[rd] & 1<<i)
-        for (int j=0; j<seglen+1; j++)
-        {
-          if CSRvectorised[rs2])
-             offs = vreg[rs2+i]
-          else
-             offs = i*(seglen+1)*stride;
-          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
-        }
-
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
-
-A similar instruction exists for STORE, with identical topological
-translation of all features.  **TODO**
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are of the same format, where bits 2-4 are
-a src register instead of dest:
-
-[[!table  data="""
-15  13 | 12       10 | 9    7 | 6         5 | 4  2 | 1  0 |
-funct3 | imm         | rs10   | imm         | rd0  | op   |
-3      | 3           | 3      | 2           | 3    | 2    |
-C.LW   | offset[5:3] | base   | offset[2|6] | dest | C0   |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled.  In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions.  For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
-page 76, "For virtual memory systems some data accesses could be resident
-in physical memory and some not".  The interesting question then arises:
-how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults and adjusting the TLB
-in advance, accordingly: other strategies are explored in the Appendix
-Section "Virtual Memory Page Faults".
-
-## Vectorised Copy/Move (and conversion) instructions
-
-There is a series of 2-operand instructions involving copying (and
-alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ.  These operations all
-follow the same pattern, as it is *both* the source *and* destination
-predication masks that are taken into account.  This is different from
-the three-operand arithmetic instructions, where the predication mask
-is taken from the *destination* register, and applied uniformly to the
-elements of the source register(s), element-for-element.
-
-### C.MV Instruction <a name="c_mv"></a>
-
-There is no MV instruction in RV however there is a C.MV instruction.
-It is used for copying integer-to-integer registers (vectorised FMV
-is used for copying floating-point).
-
-If either the source or the destination register are marked as vectors
-C.MV is reinterpreted to be a vectorised (multi-register) predicated
-move operation.  The actual instruction's format does not change:
-
-[[!table  data="""
-15  12 | 11   7 | 6  2 | 1  0 |
-funct4 | rd     | rs   | op   |
-4      | 5      | 5    | 2    |
-C.MV   | dest   | src  | C0   |
-"""]]
-
-A simplified version of the pseudocode for this operation is as follows:
-
-    function op_mv(rd, rs) # MV not VMV!
-     Â rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
-     Â rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
-     Â ps = get_pred_val(FALSE, rs); # predication on src
-     Â pd = get_pred_val(FALSE, rd); # ... AND on dest
-     Â for (int i = 0, int j = 0; i < VL && j < VL;):
-        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
-        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
-        ireg[rd+j] <= ireg[rs+i];
-        if (int_vec[rs].isvec) i++;
-        if (int_vec[rd].isvec) j++;
-
-Note that:
-
-* elwidth (SIMD) is not covered above
-* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
-  not covered
-
-There are several different instructions from RVV that are covered by
-this one opcode:
-
-[[!table  data="""
-src    | dest    | predication   | op             |
-scalar | vector  | none          | VSPLAT         |
-scalar | vector  | destination   | sparse VSPLAT  |
-scalar | vector  | 1-bit dest    | VINSERT        |
-vector | scalar  | 1-bit? src    | VEXTRACT       |
-vector | vector  | none          | VCOPY          |
-vector | vector  | src           | Vector Gather  |
-vector | vector  | dest          | Vector Scatter |
-vector | vector  | src & dest    | Gather/Scatter |
-vector | vector  | src == dest   | sparse VCOPY   |
-"""]]
-
-Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
-operations with inversion on the src and dest predication for one of the
-two C.MV operations.
-
-Note that in the instance where the Compressed Extension is not implemented,
-MV may be used, but that is a pseudo-operation mapping to addi rd, x0, rs.
-Note that the behaviour is **different** from C.MV because with addi the
-predication mask to use is taken **only** from rd and is applied against
-all elements: rs[i] = rd[i].
 
 ### FMV, FNEG and FABS Instructions
 
@@ -2305,6 +1659,61 @@ in advance to avoid.
 
 TBD: floating-point compare and other exception handling
 
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving in the case of
+Store buffer).
+
+Also, use the L1 Miss buffers as these already HAVE to be snooped by
+coherence traffic. These are used to monitor that all participating
+cache lines remain interference free, and amalgamate same into a CPU
+signal accessible ia branch or predicate.
+
+The Load buffers manage inbound traffic
+The Store buffers manage outbound traffic.
+
+Done properly, the participating cache lines can exceed the associativity
+of the L1 cache without architectural harm (may incur additional latency).
+
+<https://groups.google.com/d/msg/comp.arch/QVl3c9vVDj0/ol_232-pAQAJ>
+
+> > > so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> > 
+> > Can you explain what you mean by this paragraph?
+> 
+>  best put in sequential events, probably.
+> 
+>  <core1> LR    <-- 64-instruction countdown starts here
+>  <core1> ... 63
+>  <core1> ... 62
+>  <core2> LR same address <--- notes that core1 is on 61,
+>                          so pauses for **UP TO** 61 cycles
+>  <core1> ... 32
+>  <core1> SC <- core1 didn't reach zero, therefore valid, therefore
+>                core2 is now **UNBLOCKED**, is granted the
+>                load-reservation (and begins its **own** 64-cycle
+>                LR instruction countdown)
+>  <core2> ... 63
+>  <core2> ... 62
+>  <core2> ... 
+>  <core2> ...
+>  <core2> SC <- also valid
+
+Looks to me that you could effect the same functionality by simply
+holding onto the cache line in core 1 preventing core 2 from
+<architecturally> getting past the LR.
+
+On the other hand, the freeze is similar to how the MP CRAYs did
+ATOMIC stuff.
+
 # References
 
 * SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>