fosdem2024_formal: add slides and diagrams

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 761fdc4c6e741f54eefa362d495d5d3b80806ab3..5567b3d8bbee22a321a7887dd866fa1466d58f04 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,1102 +1,12 @@
-# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
-
-[[!toc ]]
-
-This proposal exists so as to be able to satisfy several disparate
-requirements: power-conscious, area-conscious, and performance-conscious
-designs all pull an ISA and its implementation in different conflicting
-directions, as do the specific intended uses for any given implementation.
-
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
-whilst each extremely powerful in their own right and clearly desirable,
-are also:
-
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
-  so need work to adapt to the RISC-V ethos and paradigm
-* Are sufficiently large so as to make adoption (and exploration for
-  analysis and review purposes) prohibitively expensive
-* Both contain partial duplication of pre-existing RISC-V instructions
-  (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
-* Both require that their respective parallelism paradigm be implemented
-  along-side and integral to their respective functionality *or not at all*.
-* Both independently have methods for introducing parallelism that
-  could, if separated, benefit
-  *other areas of RISC-V not just DSP or Floating-point respectively*.
-
-Therefore it makes a huge amount of sense to have a means and method
-of introducing instruction parallelism in a flexible way that provides
-implementors with the option to choose exactly where they wish to offer
-performance improvements and where they wish to optimise for power
-and/or area (and if that can be offered even on a per-operation basis that
-would provide even more flexibility).
-
-Additionally it makes sense to *split out* the parallelism inherent within
-each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, could be added on *on top* of
-this proposal, to topologically provide the exact same functionality of
-each of P and V.
-
-Furthermore, an additional goal of this proposal is to reduce the number
-of opcodes utilised by each of P and V as they currently stand, leveraging
-existing RISC-V opcodes where possible, and also potentially allowing
-P and V to make use of Compressed Instructions as a result.
-
-**TODO**: reword this to better suit this document:
-
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted.  I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
-  numbers) and unusual operations of V to create a single-issue "Esoteric
-  Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
-  of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
-  apply across *all* Extensions, whether those be DSP, M, Base, V, P -
-  everything.
-
-**TODO**: propose overflow registers be actually one of the integer regs
-(flowing to multiple regs).
-
-**TODO**: propose "mask" (predication) registers likewise.  combination with
-standard RV instructions and overflow registers extremely powerful
-
-## CSRs marking registers as Vector
-
-A 32-bit CSR would be needed (1 bit per integer register) to indicate
-whether a register was, if referred to, implicitly to be treated as
-a vector.
-
-A second 32-bit CSR would be needed (1 bit per floating-point register)
-to indicate whether a floating-point register was to be treated as a
-vector.
-
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
-
-## CSR vector-length and CSR SIMD packed-bitwidth
-
-**TODO** analyse each of these:
-
-* splitting out the loop-aspects, vector aspects and data-width aspects
-* integer reg 0 *and* fp reg0 share CSR vlen 0 *and* CSR packed-bitwidth 0
-* integer reg 1 *and* fp reg1 share CSR vlen 1 *and* CSR packed-bitwidth 1
-* ....
-* .... 
-
-instead:
-
-* CSR vlen 0 *and* CSR packed-bitwidth 0 register contain extra bits
-  specifying an *INDEX* of WHICH int/fp register they refer to
-* CSR vlen 1 *and* CSR packed-bitwidth 1 register contain extra bits
-  specifying an *INDEX* of WHICH int/fp register they refer to
-* ...
-* ...
-
-Have to be very *very* careful about not implementing too few of those
-(or too many).  Assess implementation impact on decode latency.  Is it
-worth it?
-
-Implementation of the latter:
-
-Operation involving (referring to) register M:
-
-    bitwidth = default # default for opcode?
-    vectorlen = 1 # scalar
-    
-    for (o = 0, o < 2, o++)
-      if (CSR-Vector_registernum[o] == M)
-          bitwidth = CSR-Vector_bitwidth[o]
-          vectorlen = CSR-Vector_len[o]
-          break
-
-and for the former it would simply be:
-
-    bitwidth = CSR-Vector_bitwidth[M]
-    vectorlen = CSR-Vector_len[M]
-
-Alternatives:
-
-* One single "global" vector-length CSR
-
-## Stride
-
-**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular
-register as being "if you use this reg in LOAD/STORE, use the offset
-amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous".
-can be used for matrix spanning.
-
-> For LOAD/STORE, could a better option be to interpret the offset in the 
-> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is 
-> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12), 
-> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the 
-> vector-control CSRs to select between offset-as-stride and unit-stride 
-> memory accesses? 
-
-So there would be an instruction like this:
-
-| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM |
-| opcode | 5 bit | 1 bit             | 1 bit             | 5 bit, OFFn=XLEN |
-
-
-which would mean:
-
-* CSR-Offset register n <= (float|int) register number N
-* CSR-Offset Stride-mode = offset or unit
-* CSR-Offset amount register n = contents of register M
-
-LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
-
-    offs = 0
-    stride = 1
-    vector-len = CSR-Vector-length register N
-  
-    for (o = 0, o < 2, o++)
-      if (CSR-Offset register o == M)
-          offs = CSR-Offset amount register o
-          if CSR-Offset Stride-mode == offset:
-              stride = ldoffs
-          break
-   
-    for (i = 0, i < vector-len; i++)
-      r[N+i] = mem[(offs*i + r[M+i])*stride]
-
-# Analysis and discussion of Vector vs SIMD
-
-There are four combined areas between the two proposals that help with
-parallelism without over-burdening the ISA with a huge proliferation of
-instructions:
-
-* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
-* Implicit vs fixed instruction bit-width (integral to instruction or not)
-* Implicit vs explicit type-conversion (compounded on bit-width)
-* Implicit vs explicit inner loops.
-* Masks / tagging (selecting/preventing certain indexed elements from execution)
-
-The pros and cons of each are discussed and analysed below.
-
-## Fixed vs variable parallelism length
-
-In David Patterson and Andrew Waterman's analysis of SIMD and Vector
-ISAs, the analysis comes out clearly in favour of (effectively) variable
-length SIMD.  As SIMD is a fixed width, typically 4, 8 or in extreme cases
-16 or 32 simultaneous operations, the setup, teardown and corner-cases of SIMD
-are extremely burdensome except for applications whose requirements
-*specifically* match the *precise and exact* depth of the SIMD engine.
-
-Thus, SIMD, no matter what width is chosen, is never going to be acceptable
-for general-purpose computation, and in the context of developing a
-general-purpose ISA, is never going to satisfy 100 percent of implementors.
-
-That basically leaves "variable-length vector" as the clear *general-purpose*
-winner, at least in terms of greatly simplifying the instruction set,
-reducing the number of instructions required for any given task, and thus
-reducing power consumption for the same.
-
-## Implicit vs fixed instruction bit-width
-
-SIMD again has a severe disadvantage here, over Vector: huge proliferation
-of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
-have to then have operations *for each and between each*.  It gets very
-messy, very quickly.
-
-The V-Extension on the other hand proposes to set the bit-width of
-future instructions on a per-register basis, such that subsequent instructions
-involving that register are *implicitly* of that particular bit-width until
-otherwise changed or reset.
-
-This has some extremely useful properties, without being particularly
-burdensome to implementations, given that instruction decode already has
-to direct the operation to a correctly-sized width ALU engine, anyway.
-
-Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
-implicit bit-width allows the meaning of certain operations to be
-type-overloaded *without* pollution or alteration of frozen and immutable
-instructions, in a fully backwards-compatible fashion.
-
-## Implicit and explicit type-conversion
-
-The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
-deal with over-population of instructions, such that type-casting from
-integer (and floating point) of various sizes is automatically inferred
-due to "type tagging" that is set with a special instruction.  A register
-will be *specifically* marked as "16-bit Floating-Point" and, if added
-to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
-to be explicitly done with its own separate instruction.
-
-However, implicit type-conversion is not only quite burdensome to
-implement (explosion of inferred type-to-type conversion) but also is
-never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
-
-Overall, type-conversion is generally best to leave to explicit
-type-conversion instructions, or in definite specific use-cases left to
-be part of an actual instruction (DSP or FP)
-
-## Zero-overhead loops vs explicit loops
-
-The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
-contains an extremely interesting feature: zero-overhead loops.  This
-proposal would basically allow an inner loop of instructions to be
-repeated indefinitely, a fixed number of times.
-
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
-implementation*.  Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
-
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
-
-## Mask and Tagging (Predication)
-
-Tagging (aka Masks aka Predication) is a pseudo-method of implementing
-simplistic branching in a parallel fashion, by allowing execution on
-elements of a vector to be switched on or off depending on the results
-of prior operations in the same array position.
-
-The reason for considering this is simple: by *definition* it
-is not possible to perform individual parallel branches in a SIMD
-(Single-Instruction, **Multiple**-Data) context.  Branches (modifying
-of the Program Counter) will result in *all* parallel data having
-a different instruction executed on it: that's just the definition of
-SIMD, and it is simply unavoidable.
-
-So these are the ways in which conditional execution may be implemented:
-
-* explicit compare and branch: BNE x, y -> offs would jump offs
-  instructions if x was not equal to y
-* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
-  (or sometimes explicitly) goes into a "tag" (mask) register
-
-The first of these is a "normal" branch method, which is flat-out impossible
-to parallelise without look-ahead and effectively rewriting instructions.
-This would defeat the purpose of RISC.
-
-The latter two are where parallelism becomes easy to do without complexity:
-every operation is modified to be "conditionally executed" (in an explicit
-way directly in the instruction format *or* implicitly).
-
-RVV (Vector-Extension) proposes to have *explicit* storing of the compare
-in a tag/mask register, and to *explicitly* have every vector operation
-*require* that its operation be "predicated" on the bits within an
-explicitly-named tag/mask register.
-
-SIMD (P-Extension) has not yet published precise documentation on what its
-schema is to be: there is however verbal indication at the time of writing
-that:
-
-> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
-> be executed using the same compare ALU logic for the base ISA with some
-> minor modifications to handle smaller data types. The function will not
-> be duplicated.
-
-This is an *implicit* form of predication as the base RV ISA does not have
-condition-codes or predication.  By adding a CSR it becomes possible
-to also tag certain registers as "predicated if referenced as a destination".
-Example:
-
-    // in future operations if r0 is the destination use r5 as 
-    // the PREDICATION register
-    IMPLICICSRPREDICATE r0, r5
-    // store the compares in r5 as the PREDICATION register 
-    CMPEQ8 r5, r1, r2
-    // r0 is used here.  ah ha!  that means it's predicated using r5! 
-    ADD8 r0, r1, r3
-
-With enough registers (and there are enough registers) some fairly
-complex predication can be set up and yet still execute without significant
-stalling, even in a simple non-superscalar architecture.
-
-### Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication.  However, with the base ISA having been designed explictly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-bcomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set?  This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLT would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*!  The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not.  All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   offset[12,10:5]  ||    src2  |    src1   |  BEQ         |    offset[11,4:1]      || BRANCH      |
-"""]]
-
-This would become:
-
-[[!table  data="""
-  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......  8 |      7  | 6 ....... 0 |
-imm[12] | imm[10:5]  |        rs2 |     rs1 |       funct3 |      imm[4:1] | imm[11] |    opcode   |
- 1      |        6   |      5   |      5    |       3      |     4         |  1      |   7         |
-   reserved         ||    src2  |    src1   |  BEQ         |   predicate rs3        || BRANCH      |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is presently
-the "immediate offset" field to reinterpret that to add in not only a bit
-field to distinguish between floating-point compare and integer compare,
-not only to add in a second source register, but also use some of the bits as
-a predication target as well.
-
-[[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |         imm           |   op   |
-      3      |         3          |    3     |           5           |   2    |
-   C.BEQZ    |   offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
-"""]]
-
-Now uses the CS format:
-
-[[!table  data="""
-15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm          |   rs10   |  imm   |              |   op   |
-      3      |         3          |    3     |  2     |  3           |   2    |
-   C.BEQZ    |   predicate rs3    |   src1   |  I/F B | src2         |   C1   |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct comparison operators.
-
-## Conclusions
-
-In the above sections the five different ways where parallel instruction
-execution has closely and loosely inter-related implications for the ISA and
-for implementors, were outlined.  The pluses and minuses came out as
-follows:
-
-* Fixed vs variable parallelism: <b>variable</b>
-* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
-* Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>Complex and needs further thought</b>
-
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD.  Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
-
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
-
-# Instruction Format
-
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought.  V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width.  Lots to
-look at and investigate!*
-
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors desire to focus on power, area or performance.  In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism".  They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (varible-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler.  Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward.  All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
-
-# V-Extension to Simple-V Comparative Analysis
-
-This section covers the ways in which Simple-V is comparable
-to, or more flexible than, V-Extension (V2.3-draft).  Also covered is
-one major weak-point (register files are fixed size, where V is
-arbitrary length), and how best to deal with that, should V be adapted
-to be on top of Simple-V.
-
-The first stages of this section go over each of the sections of V2.3-draft V
-where appropriate
-
-## 17.3 Shape Encoding
-
-Simple-V's proposed means of expressing whether a register (from the
-standard integer or the standard floating-point file) is a scalar or
-a vector is to simply set the vector length to 1.  The instruction
-would however have to specify which register file (integer or FP) that
-the vector-length was to be applied to.
-
-Extended shapes (2-D etc) would not be part of Simple-V at all.
-
-## 17.4 Representation Encoding
-
-Simple-V would not have representation-encoding.  This is part of
-polymorphism, which is considered too complex to implement (TODO: confirm?)
-
-## 17.5 Element Bitwidth
-
-This is directly equivalent to Simple-V's "Packed", and implies that
-integer (or floating-point) are divided down into vector-indexable
-chunks of size Bitwidth.
-
-In this way it becomes possible to have ADD effectively and implicitly
-turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where
-vector-length has been set to greater than 1, it becomes a "Packed"
-(SIMD) instruction.
-
-It remains to be decided what should be done when RV32 / RV64 ADD (sized)
-opcodes are used.  One useful idea would be, on an RV64 system where
-a 32-bit-sized ADD was performed, to simply use the least significant
-32-bits of the register (exactly as is currently done) but at the same
-time to *respect the packed bitwidth as well*.
-
-The extended encoding (Table 17.6) would not be part of Simple-V.
-
-## 17.6 Base Vector Extension Supported Types
-
-TODO: analyse.  probably exactly the same.
-
-## 17.7 Maximum Vector Element Width
+[[!tag oldstandards]]
  
-No equivalent in Simple-V
-
-## 17.8 Vector Configuration Registers
-
-TODO: analyse.
-
-## 17.9 Legal Vector Unit Configurations
-
-TODO: analyse
-
-## 17.10 Vector Unit CSRs
-
-TODO: analyse
-
-> Ok so this is an aspect of Simple-V that I hadn't thought through,
-> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
-> 17.10 the CSRs are listed.  I note that there's some general-purpose
-> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
-> don't precisely know what those are for.
-
->  In the Simple-V proposal, *every* register in both the integer
-> register-file *and* the floating-point register-file would have at
-> least a 2-bit "data-width" CSR and probably something like an 8-bit
-> "vector-length" CSR (less in RV32E, by exactly one bit).
-
->  What I *don't* know is whether that would be considered perfectly
-> reasonable or completely insane.  If it turns out that the proposed
-> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
-> adding somewhere in the region of 10 bits per register would be... okay? 
-> I really don't honestly know.
-
->  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
-> be multi-ported? No I don't believe they would.
-
-## 17.11 Maximum Vector Length (MVL)
-
-Basically implicitly this is set to the maximum size of the register
-file multiplied by the number of 8-bit packed ints that can fit into
-a register (4 for RV32, 8 for RV64 and 16 for RV128).
-
-## !7.12 Vector Instruction Formats
-
-No equivalent in Simple-V because *all* instructions of *all* Extensions
-are implicitly parallelised (and packed).
-
-## 17.13 Polymorphic Vector Instructions
-
-Polymorphism (implicit type-casting) is deliberately not supported
-in Simple-V.
-
-## 17.14 Rapid Configuration Instructions
-
-TODO: analyse if this is useful to have an equivalent in Simple-V
-
-## 17.15 Vector-Type-Change Instructions
-
-TODO: analyse if this is useful to have an equivalent in Simple-V
-
-## 17.16 Vector Length
-
-Has a direct corresponding equivalent.
-
-## 17.17 Predicated Execution
-
-Predicated Execution is another name for "masking" or "tagging".  Masked
-(or tagged) implies that there is a bit field which is indexed, and each
-bit associated with the corresponding indexed offset register within
-the "Vector".  If the tag / mask bit is 1, when a parallel operation is
-issued, the indexed element of the vector has the operation carried out.
-However if the tag / mask bit is *zero*, that particular indexed element
-of the vector does *not* have the requested operation carried out.
-
-In V2.3-draft V, there is a significant (not recommended) difference:
-the zero-tagged elements are *set to zero*.  This loses a *significant*
-advantage of mask / tagging, particularly if the entire mask register
-is itself a general-purpose register, as that general-purpose register
-can be inverted, shifted, and'ed, or'ed and so on.  In other words
-it becomes possible, especially if Carry/Overflow from each vector
-operation is also accessible, to do conditional (step-by-step) vector
-operations including things like turn vectors into 1024-bit or greater
-operands with very few instructions, by treating the "carry" from
-one instruction as a way to do "Conditional add of 1 to the register
-next door".  If V2.3-draft V sets zero-tagged elements to zero, such
-extremely powerful techniques are simply not possible.
-
-It is noted that there is no mention of an equivalent to BEXT (element
-skipping) which would be particularly fascinating and powerful to have.
-In this mode, the "mask" would skip elements where its mask bit was zero
-in either the source or the destination operand.
-
-Lots to be discussed.
-
-## 17.18 Vector Load/Store Instructions
-
-The Vector Load/Store instructions as proposed in V are extremely powerful
-and can be used for reordering and regular restructuring.
-
-Vector Load:
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-        for (int j=0; j<seglen+1; j++)
-          vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];
-
-Store:
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-        for (int j=0; j<seglen+1; j++)
-          mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];
-
-Indexed Load:
-
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-        for (int j=0; j<seglen+1; j++)
-          vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];
-
-Indexed Store:
-
-    for (int i=0; i<vl; ++i)
-    if ([!]preg[p][i])
-      for (int j=0; j<seglen+1; j++)
-        mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];
-
-Keeping these instructions as-is for Simple-V is highly recommended.
-However: one of the goals of this Extension is to retro-fit (re-use)
-existing RV Load/Store:
-
-[[!table  data="""
-31                  20 | 19      15 | 14    12 | 11           7 | 6         0 |
-       imm[11:0]       |     rs1    |  funct3  |       rd       |    opcode |
-            12         |      5     |    3     |        5       |      7 |
-       offset[11:0]    |    base    |  width   |      dest      |    LOAD |
-"""]]
-
-[[!table  data="""
-31          25 | 24    20 | 19     15 | 14    12 | 11          7 | 6         0 |
- imm[11:5]     |   rs2    |    rs1    |  funct3  |   imm[4:0]    |    opcode |
-      7        |    5     |     5     |    3     |       5       |      7 |
- offset[11:5]  |   src    |   base    |  width   |  offset[4:0]  |   STORE |
-"""]]
-
-
-## 17.19 Vector Register Gather
-
-TODO
-
-## TODO, sort
-
-> However, there are also several features that go beyond simply attaching VL
-> to a scalar operation and are crucial to being able to vectorize a lot of
-> code.  To name a few:
-> - Conditional execution (i.e., predicated operations)
-> - Inter-lane data movement (e.g. SLIDE, SELECT)
-> - Reductions (e.g., VADD with a scalar destination)
-
- Ok so the Conditional and also the Reductions is one of the reasons
- why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
- of a decent name) i proposed that it be implemented as "if you say r0
- is to be a vector / SIMD that means operations actually take place on
- r0,r1,r2... r(N-1)".
-
- Consequently any parallel operation could be paused (or... more
- specifically: vectors disabled by resetting it back to a default /
- scalar / vector-length=1) yet the results would actually be in the
- *main register file* (integer or float) and so anything that wasn't
- possible to easily do in "simple" parallel terms could be done *out*
- of parallel "mode" instead.
-
- I do appreciate that the above does imply that there is a limit to the
- length that SimpleV (whatever) can be parallelised, namely that you
- run out of registers!  my thought there was, "leave space for the main
- V-Ext proposal to extend it to the length that V currently supports".
- Honestly i had not thought through precisely how that would work.
-
- Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that,
- it reminds me of the discussion with Clifford on bit-manipulation
- (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
- applied "globally and outside of V and P" SLIDE and SELECT might become
- an extremely powerful way to do fast memory copy and reordering [2[.
-
- However I haven't quite got my head round how that would work: i am
- used to the concept of register "tags" (the modern term is "masks")
- and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
- STORE you would get the exact same thing as SELECT.
-
- SLIDE you could do simply by setting say r0 vector-length to say 16
- (meaning that if referred to in any operation it would be an implicit
- parallel operation on *all* registers r0 through r15), and temporarily
- set say.... r7 vector-length to say... 5.  Do a LOAD on r7 and it would
- implicitly mean "load from memory into r7 through r11".  Then you go
- back and do an operation on r0 and ta-daa, you're actually doing an
- operation on a SLID {SLIDED?) vector.
-
- The advantage of Simple-V (whatever) over V would be that you could
- actually do *operations* in the middle of vectors (not just SLIDEs)
- simply by (as above) setting r0 vector-length to 16 and r7 vector-length
- to 5.  There would be nothing preventing you from doing an ADD on r0
- (which meant do an ADD on r0 through r15) followed *immediately in the
- next instruction with no setup cost* a MUL on r7 (which actually meant
- "do a parallel MUL on r7 through r11").
-
- btw it's worth mentioning that you'd get scalar-vector and vector-scalar
- implicitly by having one of the source register be vector-length 1
- (the default) and one being N > 1.  but without having special opcodes
- to do it.  i *believe* (or more like "logically infer or deduce" as
- i haven't got access to the spec) that that would result in a further
- opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.
-
- Also, Reduction *might* be possible by specifying that the destination be
- a scalar (vector-length=1) whilst the source be a vector.  However... it
- would be an awful lot of work to go through *every single instruction*
- in *every* Extension, working out which ones could be parallelised (ADD,
- MUL, XOR) and those that definitely could not (DIV, SUB).  Is that worth
- the effort?  maybe.  Would it result in huge complexity? probably.
- Could an implementor just go "I ain't doing *that* as parallel!
- let's make it virtual-parallelism (sequential reduction) instead"?
- absolutely.  So, now that I think it through, Simple-V (whatever)
- covers Reduction as well.  huh, that's a surprise.
-
-
-> - Vector-length speculation (making it possible to vectorize some loops with
-> unknown trip count) - I don't think this part of the proposal is written
-> down yet.
-
- Now that _is_ an interesting concept.  A little scary, i imagine, with
- the possibility of putting a processor into a hard infinite execution
- loop... :)
-
-
-> Also, note the vector ISA consumes relatively little opcode space (all the
-> arithmetic fits in 7/8ths of a major opcode).  This is mainly because data
-> type and size is a function of runtime configuration, rather than of opcode.
-
- yes.  i love that aspect of V, i am a huge fan of polymorphism [1]
- which is why i am keen to advocate that the same runtime principle be
- extended to the rest of the RISC-V ISA [3]
-
- Yikes that's a lot.  I'm going to need to pull this into the wiki to
- make sure it's not lost.
-
-[1] inherent data type conversion: 25 years ago i designed a hypothetical
-hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
-(escape-extended) opcodes and 2-bit (escape-extended) operands that
-only required a fixed 8-bit instruction length.  that relied heavily
-on polymorphism and runtime size configurations as well.  At the time
-I thought it would have meant one HELL of a lot of CSRs... but then I
-met RISC-V and was cured instantly of that delusion^Wmisapprehension :)
-
-[2] Interestingly if you then also add in the other aspect of Simple-V
-(the data-size, which is effectively functionally orthogonal / identical
-to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
-operations become byte / half-word / word augmenters of B-Ext's proposed
-"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
-LOAD / STORE would deal with 8 / 16 / 32 bits at a time.  Where it
-would get really REALLY interesting would be masked-packed-vectored
-B-Ext BGS instructions.  I can't even get my head fully round that,
-which is a good sign that the combination would be *really* powerful :)
-
-[3] ok sadly maybe not the polymorphism, it's too complicated and I
-think would be much too hard for implementors to easily "slide in" to an
-existing non-Simple-V implementation.  i say that despite really *really*
-wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some
-fashion, for optimising 3D Graphics.  *sigh*.
-
-## TODO: analyse, auto-increment on unit-stride and constant-stride
-
-so i thought about that for a day or so, and wondered if it would be
-possible to propose a variant of zero-overhead loop that included
-auto-incrementing the two address registers a2 and a3, as well as
-providing a means to interact between the zero-overhead loop and the
-vsetvl instruction.  a sort-of pseudo-assembly of that would look like:
-
-    # a2 to be auto-incremented by t0 times 4
-    zero-overhead-set-auto-increment a2, t0, 4
-    # a2 to be auto-incremented by t0 times 4
-    zero-overhead-set-auto-increment a3, t0, 4
-    zero-overhead-set-loop-terminator-condition a0 zero
-    zero-overhead-set-start-end stripmine, stripmine+endoffset
-    stripmine:
-    vsetvl t0,a0
-    vlw v0, a2
-    vlw v1, a3
-    vfma v1, a1, v0, v1
-    vsw v1, a3
-    sub a0, a0, t0
-    stripmine+endoffset:
-
-the question is: would something like this even be desirable?  it's a
-variant of auto-increment [1].  last time i saw any hint of auto-increment
-register opcodes was in the 1980s... 68000 if i recall correctly... yep
-see [1]
-
-[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html
-
-Reply:
-
-Another option for auto-increment is for vector-memory-access instructions
-to support post-increment addressing for unit-stride and constant-stride
-modes.  This can be implemented by the scalar unit passing the operation
-to the vector unit while itself executing an appropriate multiply-and-add
-to produce the incremented address.  This does *not* require additional
-ports on the scalar register file, unlike scalar post-increment addressing
-modes.
-
-## TODO: instructions (based on Hwacha) V-Ext duplication analysis
-
-This is partly speculative due to lack of access to an up-to-date
-V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing).  However
-basin an analysis instead on Hwacha, a cursory examination shows over
-an **85%** duplication of V-Ext operand-related instructions when
-compared to Simple-V on a standard RG64G base.   Even Vector Fetch
-is analogous to "zero-overhead loop".
-
-Exceptions are:
-
-* Vector Indexed Memory Instructions (non-contiguous)
-* Vector Atomic Memory Instructions.
-* Some of the Vector Misc ops: VEIDX, VFIRST, VCLASS, VPOPC
-  and potentially more.
-* Consensual Jump
-
-Table of RV32V Instructions
-
-| RV32V      | RV Equivalent (FP)   | RV Equivalent (Int) | Notes |
-| -----      | --- | |   |
-| VADD       | FADD    | ADD |   |
-| VSUB       | FSUB    | SUB |   |
-| VSL        |     | SLL |   |
-| VSR        |     | SRL |   |
-| VAND       |     | AND |   |
-| VOR        |     | OR |   |
-| VXOR       |     | XOR |   |
-| VSEQ       | FEQ | BEQ | {1} |
-| VSNE       | !FEQ | BNE | {1} |
-| VSLT       | FLT    | BLT | {1} |
-| VSGE       | !FLE | BGE | {1} |
-| VCLIP      |     | |   |
-| VCVT       | FCVT    | |   |
-| VMPOP      |     | |   |
-| VMFIRST    |     | |   |
-| VEXTRACT   |     | |   |
-| VINSERT    |     | |   |
-| VMERGE     |     | |   |
-| VSELECT    |     | |   |
-| VSLIDE     |     | |   |
-| VDIV       | FDIV    | DIV |   |
-| VREM       |     | REM |   |
-| VMUL       | FMUL    | MUL |   |
-| VMULH      |     | |   |
-| VMIN       | FMIN    | |   |
-| VMAX       | FMUX    | |   |
-| VSGNJ      | FSGNJ    | |   |
-| VSGNJN     | FSGNJN    | |   |
-| VSGNJX     | FSNGJX    | |   |
-| VSQRT      | FSQRT    | |   |
-| VCLASS     |     | |   |
-| VPOPC      |     | |   |
-| VADDI      |     | ADDI |   |
-| VSLI       |     | SLI |   |
-| VSRI       |     | SRI |   |
-| VANDI      |     | ANDI |   |
-| VORI       |     | ORI |   |
-| VXORI      |     | XORI |   |
-| VCLIPI     |     | |   |
-| VMADD      | FMADD    | |   |
-| VMSUB      | FMSUB    | |   |
-| VNMADD     | FNMSUB    | |   |
-| VNMSUB     | FNMADD    | |   |
-| VLD        | FLD    | LD |   |
-| VLDS       |     | LW |   |
-| VLDX       |     | LWU |   |
-| VST        | FST    | ST |   |
-| VSTS       |     | |   |
-| VSTX       |     | |   |
-| VAMOSWAP   |     | AMOSWAP |   |
-| VAMOADD    |     | AMOADD |   |
-| VAMOAND    |     | AMOAND |   |
-| VAMOOR     |     | AMOOR |   |
-| VAMOXOR    |     | AMOXOR |   |
-| VAMOMIN    |     | AMOMIN |   |
-| VAMOMAX    |     | AMOMAX |   |
-
-Notes:
-
-* {1} retro-fit predication variants into branch instructions (base and C),
-  decoding triggered by CSR bit marking register as "Vector type".
-
-## TODO: sort
-
-> I suspect that the "hardware loop" in question is actually a zero-overhead
-> loop unit that diverts execution from address X to address Y if a certain
-> condition is met.
-
- not quite.  The zero-overhead loop unit interestingly would be at
-an [independent] level above vector-length.  The distinctions are
-as follows:
-
-* Vector-length issues *virtual* instructions where the register
-  operands are *specifically* altered (to cover a range of registers),
-  whereas zero-overhead loops *specifically* do *NOT* alter the operands
-  in *ANY* way.
-
-* Vector-length-driven "virtual" instructions are driven by *one*
- and *only* one instruction (whether it be a LOAD, STORE, or pure
- one/two/three-operand opcode) whereas zero-overhead loop units
- specifically apply to *multiple* instructions.
-
-Where vector-length-driven "virtual" instructions might get conceptually
-blurred with zero-overhead loops is LOAD / STORE.  In the case of LOAD /
-STORE, to actually be useful, vector-length-driven LOAD / STORE should
-increment the LOAD / STORE memory address to correspondingly match the
-increment in the register bank.  example:
-
-* set vector-length for r0 to 4
-* issue RV32 LOAD from addr 0x1230 to r0
-
-translates effectively to:
-
-* RV32 LOAD from addr 0x1230 to r0
-* ...
-* ...
-* RV32 LOAD from addr 0x123B to r3
-
-# P-Ext ISA
-
-## 16-bit Arithmetic
-
-| Mnemonic           | 16-bit Instruction        | Simple-V Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD16 rt, ra, rb   | add                       | RV ADD (bitwidth=16) |
-| RADD16 rt, ra, rb  | Signed Halving add        | |
-| URADD16 rt, ra, rb | Unsigned Halving add      | |
-| KADD16 rt, ra, rb  | Signed Saturating add     | |
-| UKADD16 rt, ra, rb | Unsigned Saturating add   | |
-| SUB16 rt, ra, rb   | sub                       | RV SUB (bitwidth=16) |
-| RSUB16 rt, ra, rb  | Signed Halving sub        | |
-| URSUB16 rt, ra, rb | Unsigned Halving sub                | |
-| KSUB16 rt, ra, rb  | Signed Saturating sub               | |
-| UKSUB16 rt, ra, rb | Unsigned Saturating sub             | |
-| CRAS16 rt, ra, rb  | Cross Add & Sub                     | |
-| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub      | |
-| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub    | |
-| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub   | |
-| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | |
-| CRSA16 rt, ra, rb  | Cross Sub & Add                     | |
-| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add      | |
-| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add    | |
-| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add   | |
-| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | |
-
-## 8-bit Arithmetic
-
-| Mnemonic           | 16-bit Instruction        | Simple-V Equivalent |
-| ------------------ | ------------------------- | ------------------- |
-| ADD8 rt, ra, rb    | add                       | RV ADD (bitwidth=8)|
-| RADD8 rt, ra, rb   | Signed Halving add        | |
-| URADD8 rt, ra, rb  | Unsigned Halving add      | |
-| KADD8 rt, ra, rb   | Signed Saturating add     | |
-| UKADD8 rt, ra, rb  | Unsigned Saturating add   | |
-| SUB8 rt, ra, rb    | sub                       | RV SUB (bitwidth=8)|
-| RSUB8 rt, ra, rb   | Signed Halving sub        | |
-| URSUB8 rt, ra, rb  | Unsigned Halving sub      | |
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
-
-> And what about instructions like JALR? 
-> What does jumping to a vector do? 
-
-* Throw an exception.  Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
-
-# Impementing V on top of Simple-V
-
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
-  as well as integer or float file.
-* TODO
-
-# Implementing P (renamed to DSP) on top of Simple-V
-
-* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
-  (caveat: anything not specified drops through to software-emulation / traps)
-* TODO
-
-# Analysis of CSR decoding on latency
-
-<a name="csr_decoding_analysis"></a>
-
-It could indeed have been logically deduced (or expected), that there
-would be additional decode latency in this proposal, because if
-overloading the opcodes to have different meanings, there is guaranteed
-to be some state, some-where, directly related to registers.
-
-There are several cases:
-
-* All operands vector-length=1 (scalars), all operands
-  packed-bitwidth="default": instructions are passed through direct as if
-  Simple-V did not exist.  Simple-V is, in effect, completely disabled.
-* At least one operand vector-length > 1, all operands
-  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
-  virtual parallelism looping may be activated.
-* All operands vector-length=1 (scalars), at least one
-  operand packed-bitwidth != default: degenerate case of SIMD,
-  implementation-specific complexity here (packed decode before ALUs or
-  *IN* ALUs)
-* At least one operand vector-length > 1, at least one operand
-  packed-bitwidth != default: parallel vector ALUs (if any)
-  placed on "alert", virtual parallelsim looping may be activated,
-  implementation-specific SIMD complexity kicks in (packed decode before
-  ALUs or *IN* ALUs).
-
-Bear in mind that the proposal includes that the decision whether
-to parallelise in hardware or whether to virtual-parallelise (to
-dramatically simplify compilers and also not to run into the SIMD
-instruction proliferation nightmare) *or* a transprent combination
-of both, be done on a *per-operand basis*, so that implementors can
-specifically choose to create an application-optimised implementation
-that they believe (or know) will sell extremely well, without having
-"Extra Standards-Mandated Baggage" that would otherwise blow their area
-or power budget completely out the window.
-
-Additionally, two possible CSR schemes have been proposed, in order to
-greatly reduce CSR space:
-
-* per-register CSRs (vector-length and packed-bitwidth)
-* a smaller number of CSRs with the same information but with an *INDEX*
-  specifying WHICH register in one of three regfiles (vector, fp, int)
-  the length and bitwidth applies to.
-
-(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
-
-In addition, LOAD/STORE has its own associated proposed CSRs that
-mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
-V (and Hwacha).
-
-Also bear in mind that, for reasons of simplicity for implementors,
-I was coming round to the idea of permitting implementors to choose
-exactly which bitwidths they would like to support in hardware and which
-to allow to fall through to software-trap emulation.
-
-So the question boils down to:
-
-* whether either (or both) of those two CSR schemes have significant
-  latency that could even potentially require an extra pipeline decode stage
-* whether there are implementations that can be thought of which do *not*
-  introduce significant latency
-* whether it is possible to explicitly (through quite simply
-  disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
-  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
-  the extreme of skipping an entire pipeline stage (if one is needed)
-* whether packed bitwidth and associated regfile splitting is so complex
-  that it should definitely, definitely be made mandatory that implementors
-  move regfile splitting into the ALU, and what are the implications of that
-* whether even if that *is* made mandatory, is software-trapped
-  "unsupported bitwidths" still desirable, on the basis that SIMD is such
-  a complete nightmare that *even* having a software implementation is
-  better, making Simple-V have more in common with a software API than
-  anything else.
+# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
  
-Whilst the above may seem to be severe minuses, there are some strong
-pluses:
+**OBSOLETE. This document is out of date and involved early ideas and discussions. [Go to the up-to-date document](https://libre-soc.org/openpower/sv/)**
  
-* Significant reduction of V's opcode space: over 85%.
-* Smaller reduction of P's opcode space: around 10%.
-* The potential to use Compressed instructions in both Vector and SIMD
-  due to the overloading of register meaning (implicit vectorisation,
-  implicit packing)
-* Not only present but also future extensions automatically gain parallelism.
-* Already mentioned but worth emphasising: the simplification to compiler
-  writers and assembly-level writers of having the same consistent ISA
-  regardless of whether the internal level of parallelism (number of
-  parallel ALUs) is only equal to one ("virtual" parallelism), or is
-  greater than one, should not be underestimated.
+This list is auto-generated from a page tag "oldstandards":
  
+[[!inline pages="tagged(oldstandards)" actions="no" archive="yes" quick="yes"]]
  
  # References
  
@@ -1118,3 +28,22 @@ pluses:
  * Videocore-IV <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
  * Discussion proposing CSRs that change ISA definition
    <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+* Expired Patent on Vector Virtual Memory solutions
+  <https://patentimages.storage.googleapis.com/fc/f6/e2/2cbee92fcd8743/US5895501.pdf>
+* Discussion on RVV "re-entrant" capabilities allowing operations to be
+  restarted if an exception occurs (VM page-table miss)
+  <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
+* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
+* RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>
+* Barcelona SIMD Presentation <https://content.riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf>
+* <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Full Description (last page) of RVV instructions
+  <https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf>
+* PULP Low-energy Cluster Vector Processor
+  <http://iis-projects.ee.ethz.ch/index.php/Low-Energy_Cluster-Coupled_Vector_Coprocessor_for_Special-Purpose_PULP_Acceleration>