# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal

Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
that Simple-V may be viewed as providing a "compact" or "consolidated"
means of issuing multiple near-identical arithmetic instructions to an
instruction queue (FIFO), pending execution.

*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
implementations, or SIMD, or anything else, would then benefit from
the uniformity of a consistent API.

[[!toc ]]

# Introduction

This proposal exists so as to be able to satisfy several disparate
requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.

The existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:

* Clearly independent in their origins (AndesStar v3 and Cray respectively)
  so need work to adapt to the RISC-V ethos and paradigm
* Sufficiently large so as to make adoption (and exploration for
  analysis and review purposes) prohibitively expensive
* Both contain partial duplication of pre-existing RISC-V instructions
  (an undesirable characteristic)
* Both have independent, incompatible and disparate methods for introducing
  parallelism at the instruction level
* Both require that their respective parallelism paradigm be implemented
  alongside and integral to their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
  could, if separated, benefit
  *other areas of RISC-V, not just DSP or Floating-point respectively*.

There are also key differences between Vectorisation and SIMD (full
details are outlined in the Appendix), the key points being:

* SIMD has a seductively compelling ease-of-implementation argument:
  each operation is passed to the ALU, which is where the parallelism
  lies. There is *negligible* (if any) impact on the rest of the core
  (with life instead being made hell for compiler writers and application
  writers due to extreme ISA proliferation).
* By contrast, Vectorisation has quite some complexity (in exchange for
  considerable flexibility, a reduction in opcode proliferation and much more).
* Vectorisation typically includes much more comprehensive memory load
  and store schemes (unit-stride, constant-stride and indexed), which
  in turn have ramifications: virtual memory misses (TLB cache misses)
  and even multiple page-faults... all caused by a *single instruction*,
  yet with the clear benefit that the regularisation of LOAD/STOREs can
  be optimised for minimal impact on caches and maximised throughput.
* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
  to pages), and these load/stores have absolutely nothing to do with the
  SIMD / ALU engine, no matter how wide the operand: simplicity, but with
  more impact on instruction and data caches.
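
The three Vector load modes mentioned above can be sketched behaviourally
(an illustrative model only: element width is fixed at one array slot, and
none of the function names are proposed mnemonics):

```python
# Behavioural sketch of the three classic vector LOAD modes.
# "mem" stands in for word-addressable memory.

def vload_unit(mem, base, vl):
    # unit-stride: elements are contiguous from base
    return [mem[base + i] for i in range(vl)]

def vload_strided(mem, base, stride, vl):
    # constant-stride: elements are "stride" apart
    return [mem[base + i * stride] for i in range(vl)]

def vload_indexed(mem, base, idx, vl):
    # indexed (gather): offsets come from another vector
    return [mem[base + idx[i]] for i in range(vl)]

mem = list(range(100))
assert vload_unit(mem, 10, 4) == [10, 11, 12, 13]
assert vload_strided(mem, 10, 3, 4) == [10, 13, 16, 19]
assert vload_indexed(mem, 10, [5, 0, 7, 2], 4) == [15, 10, 17, 12]
```

Any one of these may touch several pages, which is exactly where the TLB-miss
and page-fault complications described above come from.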

Overall it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
implementors with the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power
and/or area (and if that can be offered even on a per-operation basis,
that provides even more flexibility).

Additionally it makes sense to *split out* the parallelism inherent within
each of P and V, and to see if each of P and V, in *combination* with
a "best-of-both" parallelism extension, could be layered back *on top* of
this proposal to topologically provide the exact same functionality of
each of P and V. Each of P and V can then focus on providing the best
operations possible for their respective target areas, without being
hugely concerned about the actual parallelism.

Furthermore, an additional goal of this proposal is to reduce the number
of opcodes utilised by each of P and V as they currently stand, leveraging
existing RISC-V opcodes where possible, and also potentially allowing
P and V to make use of Compressed Instructions as a result.

# Analysis and discussion of Vector vs SIMD

There are six combined areas between the two proposals that help with
parallelism (increased performance, reduced power / area) without
over-burdening the ISA with a huge proliferation of
instructions:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops
* Single-instruction LOAD/STORE
* Masks / tagging (selecting/preventing certain indexed elements from execution)

The pros and cons of each are discussed and analysed below.

## Fixed vs variable parallelism length

In David Patterson and Andrew Waterman's analysis of SIMD and Vector
ISAs, the analysis comes out clearly in favour of (effectively)
variable-length SIMD. As SIMD is a fixed width, typically 4, 8 or in
extreme cases 16 or 32 simultaneous operations, the setup, teardown and
corner-cases of SIMD are extremely burdensome except for applications
whose requirements *specifically* match the *precise and exact* depth
of the SIMD engine.

Thus, SIMD, no matter what width is chosen, is never going to be acceptable
for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.

To explain this further: as performance requirements increase over time
for new target markets, implementors choose to extend the SIMD width
(so as to again avoid mixing parallelism into the instruction issue
phases: the primary "simplicity" benefit of SIMD in the first place),
with the result that the entire opcode space effectively doubles with
each new SIMD width that is added to the ISA.

That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
reducing power consumption for the same.
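
The setup/teardown burden can be made concrete with a sketch: a fixed
4-wide SIMD loop needs a scalar tail for any length that is not a multiple
of the SIMD width, whereas a vsetvl-style variable-length loop handles any
length uniformly. The widths and the operation-trace encoding below are
assumptions for illustration only:

```python
# Sketch: processing N elements with a fixed 4-wide SIMD unit forces a
# scalar tail loop, whereas a variable-length vector ISA folds the tail
# into the same loop via a "set vector length" request (vsetvl-style).

SIMD_WIDTH = 4

def simd_add(a, b):
    ops = []                       # record issued operations
    n, i = len(a), 0
    while i + SIMD_WIDTH <= n:     # full-width SIMD bodies
        ops.append(("simd4", i))
        i += SIMD_WIDTH
    while i < n:                   # scalar tail: the corner-case burden
        ops.append(("scalar", i))
        i += 1
    return ops

def vector_add(a, b, mvl=8):
    ops = []
    n, i = len(a), 0
    while i < n:
        vl = min(mvl, n - i)       # hardware grants up to MVL elements
        ops.append(("vec", i, vl))
        i += vl
    return ops

assert simd_add([0]*7, [0]*7) == [("simd4", 0), ("scalar", 4),
                                  ("scalar", 5), ("scalar", 6)]
assert vector_add([0]*7, [0]*7) == [("vec", 0, 7)]
```

The tail loop is precisely the code that doubles again with every new SIMD
width added to an ISA; the vector version is invariant.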

## Implicit vs fixed instruction bit-width

SIMD again has a severe disadvantage here, over Vector: huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit and 64-bit
data, and then have to have operations *for each and between each*. It
gets very messy, very quickly: *six* separate dimensions giving an O(N^6)
instruction proliferation profile.

The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
involving that register are *implicitly* of that particular bit-width until
otherwise changed or reset.

This has some extremely useful properties, without being particularly
burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized width ALU engine, anyway.

Not least: in places where an ISA was previously constrained (due, for
whatever reason, to limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.

## Implicit and explicit type-conversion

The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
deal with over-population of instructions, such that type-casting from
integer (and floating point) of various sizes is automatically inferred
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer", an implicit
type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.

However, implicit type-conversion is not only quite burdensome to
implement (an explosion of inferred type-to-type conversions) but is also
never really going to be complete. It gets even worse when bit-widths
also have to be taken into consideration. Each new type results in an
O(N^2) increase of the conversion space and, as anyone who has examined
python's source code (which has built-in polymorphic type-conversion)
knows, the task is more complex than it first seems.
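
The quadratic growth is easy to demonstrate: with T base types and W
bitwidths there are T*W distinct representations, and an implicit scheme
must define (or infer) a conversion for every ordered pair of them. A
small counting sketch:

```python
# Sketch: the implicit-conversion space grows quadratically. With T base
# types and W bitwidths there are T*W distinct representations, and an
# implicit scheme needs a rule for every ordered pair of distinct
# representations.

from itertools import permutations

def conversion_pairs(types, widths):
    reps = [(t, w) for t in types for w in widths]
    return list(permutations(reps, 2))

pairs = conversion_pairs(["int", "float"], [8, 16, 32, 64])
# 2 types * 4 widths = 8 representations -> 8 * 7 = 56 ordered pairs
assert len(pairs) == 56
```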

Overall, type-conversion is generally best left to explicit
type-conversion instructions, or in definite specific use-cases made
part of an actual instruction (DSP or FP).

## Zero-overhead loops vs explicit loops

The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times without any explicit branching.

Its specific advantage over explicit loops is that the pipeline in a DSP
can potentially be kept completely full *even in an in-order single-issue
implementation*. Normally, it requires a superscalar architecture and
out-of-order execution capabilities to "pre-process" instructions in
order to keep ALU pipelines 100% occupied.

By bringing that capability in, this proposal could offer a way to increase
pipeline activity even in simpler implementations in the one key area
which really matters: the inner loop.
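
The semantics can be sketched as an interpreter: the hardware repeats a
marked block a fixed number of times, and crucially executes zero branch
instructions while doing so (the "program" encoding here is purely
illustrative, not a proposed mechanism):

```python
# Sketch of zero-overhead loop semantics: a marked instruction block is
# repeated a fixed number of times without any per-iteration branch
# instruction being fetched or executed.

def run_zol(block, count, state):
    branches_executed = 0          # a real ZOL adds none
    for _ in range(count):
        for insn in block:
            insn(state)
    return state, branches_executed

state = {"acc": 0}
inner = [lambda s: s.__setitem__("acc", s["acc"] + 2)]
state, branches = run_zol(inner, 5, state)
assert state["acc"] == 10
assert branches == 0
```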

However, when compared against much more comprehensive schemes such as
"A portable specification of zero-overhead loop control hardware
applied to embedded processors" (ZOLC), optimising only the single
inner loop seems inadequate, tending to suggest that ZOLC may be
better off being proposed as an entirely separate Extension.

## Single-instruction LOAD/STORE

In traditional Vector Architectures there are instructions which
result in multiple register-memory transfer operations resulting
from a single instruction. They are complicated to implement in hardware,
yet the benefit is a hugely consistent regularisation of memory accesses
that can be highly optimised with respect to both actual memory and any
L1, L2 or other caches. In Hwacha EECS-2015-263 the consequences of
getting this architecturally wrong are made explicitly clear:
L2 cache-thrashing at the very least.

Complications arise when Virtual Memory is involved: TLB cache misses
need to be dealt with, as do page faults. Some of the tradeoffs are
discussed in <http://people.eecs.berkeley.edu/~krste/thesis.pdf>, Section
4.6, and an article by Jeff Bush on facing some of these issues
is particularly enlightening:
<https://jbush001.github.io/2015/11/03/lost-in-translation.html>

Interestingly, none of this complexity is faced in SIMD architectures...
but then they do not get the opportunity to optimise for highly-streamlined
memory accesses either.

With the "bang-per-buck" ratio being so high and the indirect improvement
in L1 Instruction Cache usage (reduced instruction count), as well as
the opportunity to optimise L1 and L2 cache usage, the case for including
Vector LOAD/STORE is compelling.

## Mask and Tagging (Predication)

Tagging (aka Masks aka Predication) is a pseudo-method of implementing
simplistic branching in a parallel fashion, by allowing execution on
elements of a vector to be switched on or off depending on the results
of prior operations in the same array position.

The reason for considering this is simple: by *definition* it
is not possible to perform individual parallel branches in a SIMD
(Single-Instruction, **Multiple**-Data) context. Branches (modifying
of the Program Counter) will result in *all* parallel data having
a different instruction executed on it: that's just the definition of
SIMD, and it is simply unavoidable.

So these are the ways in which conditional execution may be implemented:

* explicit compare and branch: BNE x, y -> offs would jump offs
  instructions if x was not equal to y
* explicit store of tag condition: CMP x, y -> tagbit
* implicit (condition-code): e.g. ADD results in a carry, and the carry bit
  implicitly (or sometimes explicitly) goes into a "tag" (mask) register

The first of these is a "normal" branch method, which is flat-out impossible
to parallelise without look-ahead and effectively rewriting instructions.
This would defeat the purpose of RISC.

The latter two are where parallelism becomes easy to do without complexity:
every operation is modified to be "conditionally executed" (in an explicit
way directly in the instruction format *or* implicitly).

RVV (Vector-Extension) proposes to have *explicit* storing of the compare
in a tag/mask register, and to *explicitly* have every vector operation
*require* that its operation be "predicated" on the bits within an
explicitly-named tag/mask register.

SIMD (P-Extension) has not yet published precise documentation on what its
schema is to be: there is however verbal indication at the time of writing
that:

> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
> be executed using the same compare ALU logic for the base ISA with some
> minor modifications to handle smaller data types. The function will not
> be duplicated.

This is an *implicit* form of predication as the base RV ISA does not have
condition-codes or predication. By adding a CSR it becomes possible
to also tag certain registers as "predicated if referenced as a destination".
Example:

    // in future operations from now on, if r0 is the destination use r5 as
    // the PREDICATION register
    SET_IMPLICIT_CSRPREDICATE r0, r5
    // store the compares in r5 as the PREDICATION register
    CMPEQ8 r5, r1, r2
    // r0 is used here. ah ha! that means it's predicated using r5!
    ADD8 r0, r1, r3

With enough registers (and in RISC-V there are enough registers) some fairly
complex predication can be set up and yet still execute without significant
stalling, even in a simple non-superscalar architecture.
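
The example above can be sketched behaviourally. The instruction names are
taken from the example itself; everything else (4-element registers, the
dict-based CSR) is an assumption made purely to keep the model small:

```python
# Behavioural sketch: r5 is registered as the predicate for destination
# r0, CMPEQ8 fills r5 with per-element compare bits, and a subsequent
# ADD8 targeting r0 is implicitly masked by r5.

regs = {i: [0, 0, 0, 0] for i in range(8)}
pred_csr = {}                      # dest-reg -> predicate-reg

def set_implicit_csrpredicate(rd, rp):
    pred_csr[rd] = rp

def cmpeq8(rd, rs1, rs2):
    regs[rd] = [int(a == b) for a, b in zip(regs[rs1], regs[rs2])]

def add8(rd, rs1, rs2):
    mask = regs[pred_csr[rd]] if rd in pred_csr else [1] * 4
    regs[rd] = [a + b if m else old
                for a, b, m, old in zip(regs[rs1], regs[rs2], mask, regs[rd])]

regs[1] = [1, 2, 3, 4]
regs[2] = [1, 0, 3, 0]
regs[3] = [10, 10, 10, 10]
set_implicit_csrpredicate(0, 5)    # r0's predicate lives in r5
cmpeq8(5, 1, 2)                    # r5 = [1, 0, 1, 0]
add8(0, 1, 3)                      # only elements 0 and 2 are written
assert regs[5] == [1, 0, 1, 0]
assert regs[0] == [11, 0, 13, 0]
```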

(For details on how Branch Instructions would be retro-fitted to indirectly
predicated equivalents, see Appendix.)

## Conclusions

In the above sections the six different ways in which parallel instruction
execution has closely and loosely inter-related implications for the ISA and
for implementors were outlined. The pluses and minuses came out as
follows:

* Fixed vs variable parallelism: <b>variable</b>
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
* Single-instruction Vector LOAD/STORE: <b>Complex but highly beneficial</b>
* Tag or no-tag: <b>Complex but highly beneficial</b>

In particular:

* variable-length vectors came out on top because of the high setup, teardown
  and corner-case costs associated with the fixed width of SIMD.
* Implicit bit-width helps to extend the ISA to escape from
  former limitations and restrictions (in a backwards-compatible fashion),
  whilst also leaving implementors free to simplify implementations
  by using actual explicit internal parallelism.
* Implicit (zero-overhead) loops provide a means to keep pipelines
  potentially 100% occupied in a single-issue in-order implementation
  i.e. *without* requiring a super-scalar or out-of-order architecture,
  but doing a proper, full job (ZOLC) is an entirely different matter.

Constructing a SIMD/Simple-Vector proposal based around four of these six
requirements would therefore seem to be a logical thing to do.

# Note on implementation of parallelism

One extremely important aspect of this proposal is to respect and support
implementors' desire to focus on power, area or performance. In that regard,
it is proposed that implementors be free to choose whether to implement
the Vector (or variable-width SIMD) parallelism as sequential operations
with a single ALU, fully parallel (if practical) with multiple ALUs, or
a hybrid combination of both.

In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
Parallelism". They achieve 16-way SIMD at an **instruction** level
by providing a combination of a 4-way parallel ALU *and* an externally
transparent loop that feeds 4 sequential sets of data into each of the
4 ALUs.
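
This "Virtual Parallelism" scheme can be sketched as follows (a model of
the 4-lane, 4-beat arrangement described above; lane count and the
addition operation are illustrative assumptions):

```python
# Sketch of "Virtual Parallelism": a 16-element operation issued as one
# instruction, executed on 4 physical ALU lanes over 4 sequential beats.

LANES = 4

def virtual_parallel_add(a, b):
    assert len(a) == len(b) == 16
    result = [0] * 16
    beats = 0
    for beat in range(len(a) // LANES):    # 4 sequential feeds
        base = beat * LANES
        for lane in range(LANES):          # 4 lanes operate in parallel
            result[base + lane] = a[base + lane] + b[base + lane]
        beats += 1
    return result, beats

res, beats = virtual_parallel_add(list(range(16)), [1] * 16)
assert res == [i + 1 for i in range(16)]
assert beats == 4
```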

Also in the same core, it is worth noting that particularly uncommon
but essential operations (Reciprocal-Square-Root for example) are
*not* part of the 4-way parallel ALU but instead have a *single* ALU.
Under the proposed Vector (variable-width SIMD) scheme, implementors would
be free to do precisely that: i.e. free to choose *on a per-operation
basis* whether and how much "Virtual Parallelism" to deploy.

It is absolutely critical to note that it is proposed that such choices MUST
be **entirely transparent** to the end-user and the compiler. Whilst
a Vector (variable-width SIMD) may not precisely match the width of the
parallelism within the implementation, the end-user **should not care**,
and in this way the performance benefits are gained but the ISA remains
straightforward. All that happens at the end of an instruction run is: some
parallel units (if there are any) would remain offline, completely
transparently to the ISA, the program, and the compiler.

To make that clear: should an implementor choose a particularly wide
SIMD-style ALU, each parallel unit *must* have predication so that
the parallel SIMD ALU may emulate variable-length parallel operations.
The "SIMD considered harmful" trap of having huge complexity and extra
instructions to deal with corner-cases is thus avoided, and implementors
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".

In addition, implementors will be free to choose whether to provide an
absolute bare minimum level of compliance with the "API" (software-traps
when vectorisation is detected), all the way up to full
supercomputing-level all-hardware parallelism. Options are covered in the
Appendix.

# CSRs <a name="csrs"></a>

There are two CSR tables needed to create the lookup tables which are used at
the register decode phase:

* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Integer Register N is a Predication Register (note: a key-value store)

Also (see Appendix, "Context Switch Example") it may turn out to be important
to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
Vectorised LOAD / STORE may be used to load and store multiple registers:
something that is missing from the Base RV ISA.

Notes:

* for the purposes of LOAD / STORE, Integer Registers which are
  marked as a Vector will result in a Vector LOAD / STORE.
* Vector Lengths are *not* the same as vsetl but are an integral part
  of vsetl.
* Actual vector length is *multiplied* by how many blocks of length
  "bitwidth" may fit into an XLEN-sized register file.
* Predication is a key-value store due to the implicit referencing,
  as opposed to having the predicate register explicitly in the instruction.
* Whilst the predication CSR is a key-value store it *generates* easier-to-use
  state information.
* TODO: assess whether the same technique could be applied to the other
  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
  needed for context-switches (empty slots need never be stored).

## Predication CSR <a name="predication_csr_table"></a>

The Predication CSR is a key-value store indicating whether, if a given
destination register (integer or floating-point) is referred to in an
instruction, it is to be predicated. It is important to note
that the *actual* register used for the predicate mask is *different* from
the one named in the CSR entry, due to the level of indirection through
the lookup table. This includes (as a future option) redirecting to a
*second* bank of integer registers.

* regidx is the register which, in combination with the
  i/f flag, acts as the key: if that integer or floating-point register
  is referred to as a destination, the lookup table is referenced to find
  the predication mask to use on that operation.
* predidx (in combination with the bank bit in the future) is the
  *actual* register to be used for the predication mask. Note:
  in effect predidx is actually a 6-bit register address, as the bank
  bit is the MSB (and is nominally set to zero for now).
* inv indicates that the predication mask bits are to be inverted
  prior to use *without* actually modifying the contents of the
  register itself.
* zeroing is either 1 or 0, and if set to 1, the operation must
  place zeros in any element position where the predication mask is
  set to zero. If zeroing is set to 0, unpredicated elements *must*
  be left alone. Some microarchitectures may choose to interpret
  this as skipping the operation entirely. Others which wish to
  stick more closely to a SIMD architecture may choose instead to
  interpret unpredicated elements as an internal "copy element"
  operation (which would be necessary in SIMD microarchitectures
  that perform register-renaming).

| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
| ----- | ------ | ------ | ----- | --- | ------ | ------- |
| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |

The Predication CSR Table is a key-value store, so implementation-wise
it will be faster to turn the table around (maintain topologically
equivalent state):

    struct pred {
        bool zero;    // zeroing mode
        bool inv;     // invert the mask before use
        bool bank;    // 0 for now, 1=rsvd
        bool enabled; // predication is active for this register
        int predidx;  // redirection: actual int register to use
    }

    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
    struct pred int_pred_reg[32];  // 64 in future (bank=1)

    for (i = 0; i < 16; i++) {
        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
        idx = CSRpred[i].regidx
        tb[idx].zero = CSRpred[i].zero
        tb[idx].inv = CSRpred[i].inv
        tb[idx].bank = CSRpred[i].bank
        tb[idx].predidx = CSRpred[i].predidx
        tb[idx].enabled = true
    }

So when an operation is to be predicated, it is the internal state that
is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
pseudo-code for operations is given, where p is the explicit (direct)
reference to the predication register to be used:

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            (d ? vreg[rd][i] : sreg[rd]) =
                iop(s1 ? vreg[rs1][i] : sreg[rs1],
                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs

This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows:

    if type(iop) == INT:
        preg = int_pred_reg
    else:
        preg = fp_pred_reg

    for (int i=0; i<vl; ++i)
        predidx = preg[rd].predidx; // the indirection takes place HERE
        if (!preg[rd].enabled)
            predicate = ~0x0; // all parallel ops enabled
        else {
            predicate = intregfile[predidx]; // get actual reg contents HERE
            if (preg[rd].inv) // invert if requested
                predicate = ~predicate;
        }
        if (predicate & (1<<i))
            (d ? regfile[rd+i] : regfile[rd]) =
                iop(s1 ? regfile[rs1+i] : regfile[rs1],
                    s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (preg[rd].zero)
            regfile[rd+i] = 0; // place zero in the masked-out dest element

Note:

* d, s1 and s2 are booleans indicating whether destination,
  source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity. rd, rs1 and rs2 must ALSO each go through
  register-level redirection (from the Register CSR table) if they are
  vectors.

If written as a function, obtaining the predication mask (but not whether
zeroing takes place) may be done as follows:

    def get_pred_val(bool is_fp_op, int reg):
        tb = fp_pred if is_fp_op else int_pred
        if (!tb[reg].enabled):
            return ~0x0 // all ops enabled
        predidx = tb[reg].predidx // redirection occurs HERE
        predicate = intreg[predidx] // actual predicate HERE
        if (tb[reg].inv):
            predicate = ~predicate // invert ALL bits
        return predicate

## MAXVECTORDEPTH

MAXVECTORDEPTH is the same concept as MVL in RVV. However in Simple-V,
given that its primary (base, unextended) purpose is for 3D, Video and
other purposes (not requiring supercomputing capability), it makes sense
to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
and so on).

The reason for setting this limit is so that predication registers, when
marked as such, may fit into a single register as opposed to fanning out
over several registers. This keeps the implementation a little simpler.
Note also (as described in the VSETVL section) that the *minimum*
for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
and 31 for RV32 or RV64).

Note that RVV on top of Simple-V may choose to over-ride this decision.

## Register CSR key-value (CAM) table

The purpose of the Register CSR table is four-fold:

* To mark integer and floating-point registers as requiring "redirection"
  if ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table
  (where the 6th bit - bank - is always set to 0 for now).
* To indicate whether, after redirection through the lookup table, the
  register is a vector (or remains a scalar).
* To over-ride the implicit or explicit bitwidth that the operation would
  normally give the register.
* To indicate if the register is to be interpreted as "packed" (SIMD)
  i.e. containing multiple contiguous elements of size equal to "bitwidth".

| RgCSR | 15 | 14 | 13 | (12..11) | 10 | (9..5) | (4..0) |
| ----- | ------ | ------ | ------- | ----- | --- | ------ | ------ |
| 0 | simd0 | bank0 | isvec0 | vew0 | i/f | regkey | regidx |
| 1 | simd1 | bank1 | isvec1 | vew1 | i/f | regkey | regidx |
| .. | simd.. | bank.. | isvec.. | vew.. | i/f | regkey | regidx |
| 15 | simd15 | bank15 | isvec15 | vew15 | i/f | regkey | regidx |

vew may be one of the following (giving a table "bytestable", used below):

| vew | bitwidth |
| --- | --------- |
| 00 | default |
| 01 | default/2 |
| 10 | 8 |
| 11 | 16 |

Extending this table (with extra bits) is covered in the section
"Implementing RVV on top of Simple-V".

As the above table is a CAM (key-value store) it may be appropriate
to expand it as follows:

    struct vectorised fp_vec[32], int_vec[32]; // 64 in future

    for (i = 0; i < 16; i++) { // 16 CSRs?
        tb = int_vec if CSRvec[i].type == 0 else fp_vec
        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
        tb[idx].elwidth = CSRvec[i].elwidth
        tb[idx].regidx = CSRvec[i].regidx // indirection
        tb[idx].isvector = CSRvec[i].isvector // 0=scalar
        tb[idx].packed = CSRvec[i].packed // SIMD or not
        tb[idx].bank = CSRvec[i].bank // 0 (1=rsvd)
    }
TODO: move elsewhere

    # TODO: use elsewhere (retire for now)
    vew = CSRbitwidth[rs1]
    if (vew == 0)
        elwidth = XLEN # or FLEN as appropriate
    elif (vew == 1)
        elwidth = XLEN/2 # or FLEN/2 as appropriate
    else:
        elwidth = bytestable[vew] # 8 or 16
    simdmult = XLEN / elwidth # packed elements per register
    vlen = CSRvectorlen[rs1] * simdmult
    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)

The reason for multiplying the vector length by the number of SIMD elements
(in each individual register) is so that each SIMD element may optionally be
predicated.
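
A runnable rendering of the intended calculation, assuming XLEN=64 and the
vew encoding table from this section (the function name and the explicit
width table are illustrative, not part of the proposal):

```python
# Sketch of the effective-vector-length calculation: packed SIMD
# elements multiply the nominal vector length, then the result is
# MIN-filtered against MAXVECTORDEPTH and the rs2 request.

XLEN = 64
MAXVECTORDEPTH = 64
bitwidth_table = {2: 8, 3: 16}     # vew=10 -> 8-bit, vew=11 -> 16-bit

def effective_vl(vectorlen, vew, rs2):
    if vew == 0:
        elwidth = XLEN             # default: full-register elements
    elif vew == 1:
        elwidth = XLEN // 2        # default/2
    else:
        elwidth = bitwidth_table[vew]
    simdmult = XLEN // elwidth     # packed elements per register
    vlen = vectorlen * simdmult
    return min(min(vlen, MAXVECTORDEPTH), rs2)

# 4 registers of packed 8-bit elements -> 32 predicable elements
assert effective_vl(4, 2, 64) == 32
# rs2 caps the request
assert effective_vl(4, 2, 10) == 10
```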

An example of how to subdivide the register file when bitwidth != default
is given in the section "Bitwidth Virtual Register Reordering".

# Instructions

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, the only instructions needed are VSETVL
and VGETVL. *All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
*all instructions from RVV are topologically re-mapped and retain their
complete functionality, intact*.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.

## Instruction Format

The instruction format for Simple-V does not actually have *any* explicit
compare operations, *any* arithmetic, floating point or *any*
memory instructions.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV,
FCVT, and LOAD/STORE
depending on CSR configurations for bitwidth and
predication. **Everything** becomes parallelised. *This includes
Compressed instructions* as well as any
future instructions and Custom Extensions.

* For analysis of RVV see [[v_comparative_analysis]] which begins to
  outline topologically-equivalent mappings of instructions
* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
  for format of Branch opcodes.

**TODO**: *analyse and decide whether the implicit nature of predication
as proposed is or is not a lot of hassle, and if explicit prefixes are
a better idea instead. Parallelism therefore effectively may end up
as always being 64-bit opcodes (32 for the prefix, 32 for the instruction),
with some opportunities to use Compressed bringing it down to 48.
Also to consider is whether one or both of the last two remaining Compressed
instruction codes in Quadrant 1 could be used as a parallelism prefix,
bringing parallelised opcodes down to 32-bit (when combined with C)
and having the benefit of being explicit.*
-
-## VSETVL
-
-NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
-with the instruction format remaining the same.
-
-VSETVL is slightly different from RVV in that the minimum vector length
-is required to be at least the number of registers in the register file,
-and no more than XLEN. This allows vector LOAD/STORE to be used to switch
-the entire bank of registers using a single instruction (see Appendix,
-"Context Switch Example"). The reason for limiting VSETVL to XLEN is
-down to the fact that predication bits fit into a single register of length
-XLEN bits.
-
-The second change is that when VSETVL is requested to be stored
-into x0, it is *ignored* silently (VSETVL x0, x5, #4)
-
-The third change is that there is an additional immediate added to VSETVL,
-to which VL is set after first going through MIN-filtering.
-So When using the "vsetl rs1, rs2, #vlen" instruction, it becomes:
-
- VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-where RegfileLen <= MAXVECTORDEPTH < XLEN
-
-This has implication for the microarchitecture, as VL is required to be
-set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
-requested in the #immediate parameter. RVV has the option to set VL
-to an arbitrary value that suits the conditions and the micro-architecture:
-SV does *not* permit that.
-
-The reason is so that if SV is to be used for a context-switch or as a
-substitute for LOAD/STORE-Multiple, the operation can be done with only
-2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
-single LD/ST operation). If VL does *not* get set to the register file
-length when VSETVL is called, then a software-loop would be needed.
-To avoid this need, VL *must* be set to exactly what is requested
-(limits notwithstanding).
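-
-The VL-setting rule above can be sketched as a small runnable model
-(Python purely for clarity; the `vsetvl` function, `regs` array and
-the MAXVECTORDEPTH value are illustrative assumptions, not part of
-the spec):

```python
# Sketch of SV VSETVL semantics (not RVV): VL is set to exactly the
# requested length after MIN-filtering. MAXVECTORDEPTH is illustrative.
MAXVECTORDEPTH = 32  # RegfileLen <= MAXVECTORDEPTH <= XLEN

def vsetvl(rd, rs2_val, vlen_imm, regs):
    vl = min(min(vlen_imm, MAXVECTORDEPTH), rs2_val)
    if rd != 0:               # writes to x0 are silently ignored
        regs[rd] = vl
    return vl

regs = [0] * 32
vl = vsetvl(0, 16, 31, regs)  # VSETVL x0, x5(=16), #31
# vl == 16: the requested #31 is MIN-filtered against rs2 (16)
```

-Note how, unlike RVV, the implementation has no freedom to pick a
-smaller VL than the MIN-filtered request.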
-
-Therefore, in turn, unlike RVV, implementors *must* provide
-pseudo-parallelism (using sequential loops in hardware) if actual
-hardware-parallelism in the ALUs is not deployed. A hybrid is also
-permitted (as used in Broadcom's VideoCore-IV) however this must be
-*entirely* transparent to the ISA.
-
-## Branch Instructions
-
-Branch operations use standard RV opcodes that are reinterpreted to
-be "predicate variants" in the instance where either of the two src
-registers is marked as a vector (isvector=1). When this reinterpretation
-is enabled the "immediate" field of the branch operation is taken to be a
-predication target register, rs3. The predicate target register rs3 is
-to be treated as a bitfield (up to a maximum of XLEN bits corresponding
-to a maximum of XLEN elements).
-
-If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
-goes ahead as vector-scalar or scalar-vector. Implementors should note that
-this could require considerable multi-porting of the register file in order
-to parallelise properly, so may have to involve the use of register caching
-and transparent copying (see Multiple-Banked Register File Architectures
-paper).
-
-In instances where no vectorisation is detected on either src registers
-the operation is treated as an absolutely standard scalar branch operation.
-
-This is the overloaded table for Integer-based Branch operations. Opcode
-(bits 6..0) is set in all cases to 1100011.
-
-[[!table data="""
-31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-7 | 5 | 5 | 3 | 4 | 1 | 7 |
-reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
-reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
-reserved | src2 | src1 | 001 | predicate rs3 || BNE |
-reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 100 | predicate rs3 || BLE |
-reserved | src2 | src1 | 101 | predicate rs3 || BGE |
-reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
-reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
-"""]]
-
-Note that just as with the standard (scalar, non-predicated) branch
-operations, BLT, BGT, BLEU and BGTU may be synthesised by swapping
-src1 and src2.
-
-Below is the overloaded table for Floating-point Predication operations.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield (done
-implicitly).
-
-As with standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
-Likewise for Single-precision, fmt (bits 26..25) is still set to 00.
-Double-precision is still set to 01, whilst Quad-precision
-appears not to have a definition in V2.3-Draft (but should be unaffected).
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-has more of an impact. To deal with this, SV's predication has
-had "invert" added to it.
-
-[[!table data="""
-31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
-funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
-5 | 2 | 5 | 5 | 3 | 5 | 7 |
-10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
-10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | rsvd |
-10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
-10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
-"""]]
-
-Note (**TBD**): floating-point exceptions will need to be extended
-to cater for multiple exceptions (and statuses of the same). The
-usual approach is to have an array of status codes and bit-fields,
-and one exception, rather than throw separate exceptions for each
-Vector element.
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
- for (int i=0; i<vl; ++i)
- if ([!]preg[p][i])
- preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
-    if I/F == INT: # integer type cmp
-        preg = int_pred_reg
-        reg = int_regfile
-    else:
-        preg = fp_pred_reg
-        reg = fp_regfile
-
-    s1 = reg_is_vectorised(src1);
-    s2 = reg_is_vectorised(src2);
-    if (!s2 && !s1) goto branch;
-    for (int i = 0; i < VL; ++i)
-        if (cmp(s1 ? reg[src1+i] : reg[src1],
-                s2 ? reg[src2+i] : reg[src2]))
-            preg[rs3] |= 1<<i; # bitfield not vector
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
- into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length times (number of SIMD elements) bits
- in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually adjust the Program
-  Counter, so bits 25 through 30 are not needed in any case.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
-  be put to good use if there is a suitable use-case.
-  FLT and FLE may be inverted to FGT and FGE if needed by swapping
-  src1 and src2 (likewise the integer counterparts).
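-
-The predicated-compare loop above can be exercised as a runnable
-sketch (Python purely for testability; the flat register file and the
-function name are illustrative assumptions, and elwidth/SIMD is
-ignored as in the pseudocode):

```python
# Sketch of the predicated-compare loop: the result is a bitfield in
# the rs3 predicate register, one bit per element (names illustrative).
def predicated_cmp(cmp, reg, src1, src2, s1, s2, vl):
    """Return the rs3 predicate bitfield for a vectorised branch-compare."""
    pred = 0
    for i in range(vl):
        a = reg[src1 + i] if s1 else reg[src1]   # vector or scalar src1
        b = reg[src2 + i] if s2 else reg[src2]   # vector or scalar src2
        if cmp(a, b):
            pred |= 1 << i                        # bitfield, not a vector
    return pred

reg = [0, 5, 1, 2, 3, 9]          # x1=5 (scalar), x2..x5 = vector
pred = predicated_cmp(lambda a, b: a < b, reg, 2, 1, True, False, 4)
# pred == 0b0111: elements 1, 2, 3 compare less-than 5; element 9 does not
```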
-
-## Compressed Branch Instructions
-
-Compressed Branch instructions are likewise re-interpreted as predicated
-2-register operations, with the result going into rs3. All the bits of
-the immediate are re-interpreted for different purposes, to extend the
-number of comparator operations to beyond the original specification,
-but also to cater for floating-point comparisons as well as integer ones.
-
-[[!table data="""
-15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
-funct3 | imm | rs10 | imm | rs20 | op | |
-3 | 3 | 3 | 2 | 3 | 2 | |
-C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
-110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
-111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
-110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
-111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
-"""]]
-
-Notes:
-
-* Bits 5, 13, 14 and 15 make up the comparator type
-* Bit 6 indicates whether to use integer or floating-point comparisons
-* In both floating-point and integer cases there are four predication
- comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
- src1 and src2).
-
-## LOAD / STORE Instructions <a name="load_store"></a>
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction,
-and likewise for STORE.
-
-Revised LOAD:
-
-[[!table data="""
-31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
-imm[11:0] |||| rs1 | funct3 | rd | opcode |
-1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
-? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format. Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
- (for both integer and floating-point variants).
-* The predication CSR-marking register is not explicitly shown in the
-  instruction: it is implicit, based on the CSR predicate state for the
-  rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
- is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-
-    preg = int_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-      if ([!](preg & 1<<i))
-        for (int j=0; j<seglen+1; j++)
-        {
-          if (CSRvectorised[rs2])
-            offs = vreg[rs2+i]
-          else
-            offs = i*(seglen+1)*stride;
-          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
-        }
-
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
-
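-A runnable sketch of the LOAD loop above (Python for testability;
-seglen and SIMD bitwidth are omitted for brevity, memory is modelled
-as a dict, and all names are illustrative assumptions):

```python
# Sketch of the vectorised LOAD loop: unit-stride, constant-stride
# and indexed (LD.X) variants, with predication (names illustrative).
def vec_load(mem, vreg, rd, base_addr, rs2, rs2_is_vector, pred, vl,
             elsize=8, unit_strided=True, const_stride=0):
    stride = elsize if unit_strided else const_stride
    for i in range(vl):
        if not (pred >> i) & 1:          # predicated-out element: skip
            continue
        if rs2_is_vector:                # LD.X: offset from index vector
            offs = vreg[rs2 + i]
        else:                            # LD / LD.S: computed offset
            offs = i * stride
        vreg[rd + i] = mem[base_addr + offs]

mem = {0x100 + 8 * i: i * 10 for i in range(4)}   # 4 doublewords
vreg = [0] * 32
vec_load(mem, vreg, rd=8, base_addr=0x100, rs2=0, rs2_is_vector=False,
         pred=0b1011, vl=4)
# vreg[8..11] == [0, 10, 0, 30]: element 2 is masked out
```

-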
-A similar instruction exists for STORE, with identical topological
-translation of all features. **TODO**
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are of the same format, where bits 2-4 are
-a src register instead of dest:
-
-[[!table data="""
-15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
-funct3 | imm | rs10 | imm | rd0 | op |
-3 | 3 | 3 | 2 | 3 | 2 |
-C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled. In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments
-on page 76, "For virtual memory systems some data accesses could be
-resident in physical memory and some not". The interesting question
-then arises: how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults and adjusting the TLB
-in advance, accordingly: other strategies are explored in the Appendix
-Section "Virtual Memory Page Faults".
-
-## Vectorised Copy/Move (and conversion) instructions
-
-There is a series of 2-operand instructions involving copying (and
-alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
-follow the same pattern, as it is *both* the source *and* destination
-predication masks that are taken into account. This is different from
-the three-operand arithmetic instructions, where the predication mask
-is taken from the *destination* register, and applied uniformly to the
-elements of the source register(s), element-for-element.
-
-### C.MV Instruction <a name="c_mv"></a>
-
-There is no MV instruction in RV however there is a C.MV instruction.
-It is used for copying integer-to-integer registers (vectorised FMV
-is used for copying floating-point).
-
-If either the source or the destination register are marked as vectors
-C.MV is reinterpreted to be a vectorised (multi-register) predicated
-move operation. The actual instruction's format does not change:
-
-[[!table data="""
-15 12 | 11 7 | 6 2 | 1 0 |
-funct4 | rd | rs | op |
-4 | 5 | 5 | 2 |
-C.MV | dest | src | C0 |
-"""]]
-
-A simplified version of the pseudocode for this operation is as follows:
-
-    function op_mv(rd, rs) # MV not VMV!
-      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
-      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
-      ps = get_pred_val(FALSE, rs); # predication on src
-      pd = get_pred_val(FALSE, rd); # ... AND on dest
-      for (int i = 0, j = 0; i < VL && j < VL;)
-      {
-        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
-        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
-        ireg[rd+j] <= ireg[rs+i];
-        if (int_vec[rs].isvec) i++;
-        if (int_vec[rd].isvec) j++;
-      }
-
-Note that:
-
-* elwidth (SIMD) is not covered above
-* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
- not covered
-
-There are several different instructions from RVV that are covered by
-this one opcode:
-
-[[!table data="""
-src | dest | predication | op |
-scalar | vector | none | VSPLAT |
-scalar | vector | destination | sparse VSPLAT |
-scalar | vector | 1-bit dest | VINSERT |
-vector | scalar | 1-bit? src | VEXTRACT |
-vector | vector | none | VCOPY |
-vector | vector | src | Vector Gather |
-vector | vector | dest | Vector Scatter |
-vector | vector | src & dest | Gather/Scatter |
-vector | vector | src == dest | sparse VCOPY |
-"""]]
-
-Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
-operations with inversion on the src and dest predication for one of the
-two C.MV operations.
-
-Note that in the instance where the Compressed Extension is not implemented,
-MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
-Note that the behaviour is **different** from C.MV because with addi the
-predication mask to use is taken **only** from rd and is applied against
-all elements: rd[i] = rs[i].
-
-### FMV, FNEG and FABS Instructions
-
-These are identical in form to C.MV, except covering floating-point
-register copying. The same double-predication rules also apply.
-However when elwidth is not set to default the instruction is implicitly
-and automatically converted to a (vectorised) floating-point type conversion
-operation of the appropriate size covering the source and destination
-register bitwidths.
-
-(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
-
-### FCVT Instructions
-
-These are again identical in form to C.MV, except that they cover
-floating-point to integer and integer to floating-point. When element
-width in each vector is set to default, the instructions behave exactly
-as they are defined for standard RV (scalar) operations, except vectorised
-in exactly the same fashion as outlined in C.MV.
-
-However when the source or destination element width is not set to default,
-the opcode's explicit element widths are *overridden* by new definitions,
-and the opcode's element width is taken as indicative of the SIMD width
-(if applicable i.e. if packed SIMD is requested) instead.
-
-For example FCVT.S.L would normally be used to convert a 64-bit integer
-in register rs1 to a single-precision floating-point number in rd.
-If however the source rs1 is set to be a vector, where elwidth is set to
-default/2 and "packed SIMD" is enabled, then the first 32 bits of
-rs1 are converted to a floating-point number to be stored in rd's
-first element and the higher 32-bits *also* converted to floating-point
-and stored in the second. The 32 bit size comes from the fact that
-FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
-divide that by two it means that rs1 element width is to be taken as 32.
-
-Similar rules apply to the destination register.
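-
-The packed-SIMD example above can be sketched as follows (Python;
-the helper names are hypothetical, and single-precision rounding is
-ignored for brevity):

```python
# Sketch of the packed-SIMD FCVT.S.L example: with elwidth on rs1 set
# to default/2, the 64-bit source is treated as two 32-bit signed
# integers, each converted to floating-point (layout illustrative).
def to_signed32(v):
    return v - (1 << 32) if v >= (1 << 31) else v

def fcvt_s_l_packed(rs1_bits):
    """Convert the two 32-bit elements of a 64-bit register to floats."""
    lo = to_signed32(rs1_bits & 0xFFFFFFFF)
    hi = to_signed32(rs1_bits >> 32)
    return [float(lo), float(hi)]   # rd element 0, rd element 1

# 64-bit register holding elements 7 (low 32 bits) and -2 (high 32 bits)
bits = (0xFFFFFFFE << 32) | 7
# fcvt_s_l_packed(bits) == [7.0, -2.0]
```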
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
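-
-The two rules can be expressed as a small sketch (Python; the function
-name and error type are illustrative stand-ins for a hardware trap):

```python
# Sketch of the vector-length legality rules for ADD (names illustrative;
# a real implementation would raise a trap, not a Python exception).
def check_add_lengths(src1_len, src2_len, dest_len, dest_is_vector):
    if src1_len != src2_len:
        raise ValueError("source vector lengths differ")
    if dest_is_vector and src1_len > dest_len:
        raise ValueError("source vector longer than destination")

check_add_lengths(4, 4, 4, True)     # legal: all lengths match
# check_add_lengths(4, 2, 4, True)   # would raise: lengths differ
```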
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
-
-# Under consideration <a name="issues"></a>
-
-From the Chennai 2018 slides the following issues were raised.
-Efforts to analyse and answer these questions are below.
-
-* Should future extra bank be included now?
-* How many Register and Predication CSRs should there be?
- (and how many in RV32E)
-* How many in M-Mode (for doing context-switch)?
-* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-* Can CLIP be done as a CSR (mode, like elwidth)
-* SIMD saturation (etc.) also set as a mode?
-* Include src1/src2 predication on Comparison Ops?
- (same arrangement as C.MV, with same flexibility/power)
-* 8/16-bit ops is it worthwhile adding a "start offset"?
- (a bit like misaligned addressing... for registers)
- or just use predication to skip start?
-
-## Should the future (extra) bank be included (made mandatory)?
-
-Expanding the *standard* register file from 32 entries per bank to
-64 per bank would be quite an extensive architectural change. It also
-has implications for context-switching.
-
-Therefore, on balance, it is not recommended and certainly should
-not be made a *mandatory* requirement for the use of SV. SV's design
-ethos is to be minimally-disruptive for implementors to shoe-horn
-into an existing design.
-
-## How large should the Register and Predication CSR key-value stores be?
-
-This is something that definitely needs actual evaluation and for
-code to be run and the results analysed. At the time of writing
-(12jul2018) that is too early to tell. An approximate best-guess
-however would be 16 entries.
-
-RV32E however is a special case, given that SV would be highly unlikely
-(but not outside the realm of possibility) to be used there for
-performance reasons, but instead for reducing instruction count.
-The number of CSR entries therefore has to be considered extremely
-carefully.
-
-## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-
-The minimum required CSR entries would be 1 for each register-bank:
-one for integer and one for floating-point. However, as shown
-in the "Context Switch Example" section, for optimal efficiency
-(minimal instructions in a low-latency situation) the CSRs for
-the context-switch should be set up *and left alone*.
-
-This means that it is not really a good idea to touch the CSRs
-used for context-switching in the M-Mode (or S-Mode) trap, so
-if there is ever demonstrated a need for vectors then there would
-need to be *at least* one more free. However just one does not make
-much sense (as one alone only covers scalar-vector ops) so it is more
-likely that at least two extra would be needed.
-
-This applies *in addition* in the RV32E case, if an RV32E implementation
-happens also to support U/S/M modes. This would be considered quite
-rare but is not outside the realm of possibility.
-
-Conclusion: all needs careful analysis and future work.
-
-## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-
-On balance it's a neat idea, however the benefits are not really clear.
-It would obviate the need for an exception to be raised if the VL runs
-out of registers to put things in (gets to x31, tries a non-existent
-x32 and fails); however, the "fly in the ointment" is that x0 is
-hard-coded to "zero". The increment therefore would need to be
-double-stepped to skip over x0. Some microarchitectures could run into
-difficulties (SIMD-like ones in particular) so it needs a lot more
-thought.
-
-## Can CLIP be done as a CSR (mode, like elwidth)
-
-RVV appears to be going this way. At the time of writing (12jun2018)
-it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
-clip by way of exactly this method: setting a "clip mode" in a CSR.
-
-No details are given, however the most sensible thing to have would be
-to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
-extra bits specifying the type of clipping to be carried out, on
-a per-register basis. Other bits may be used for other purposes
-(see SIMD saturation below).
-
-## SIMD saturation (etc.) also set as a mode?