X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=4cf0937edde233618bbd012a4a8ad571c9acd473;hb=1df960ad2fec1a7f252ce30ce808dff52f2a072d;hp=fbfe4f3d815e419509fc8f8e6002466d1dff4cb9;hpb=0ac718a479bb386fb3375dc909136180eb4ac4ac;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index fbfe4f3d8..4cf0937ed 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -5,12 +5,12 @@ a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
that Simple-V may be viewed as providing a "compact" or "consolidated"
means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FILO), pending execution.
+instruction queue (FIFO), pending execution.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
[[!toc ]]
@@ -21,7 +21,7 @@ requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:
@@ -31,8 +31,8 @@ are also:
analysis and review purposes) prohibitively expensive
* Both contain partial duplication of pre-existing RISC-V instructions
(an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
- at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+ parallelism at the instruction level
* Both require that their respective parallelism paradigm be implemented
along-side and integral to their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
@@ -52,10 +52,13 @@ details outlined in the Appendix), the key points being:
* Vectorisation typically includes much more comprehensive memory load
and store schemes (unit stride, constant-stride and indexed), which
in turn have ramifications: virtual memory misses (TLB cache misses)
- and even multiple page-faults... all caused by a *single instruction*.
+ and even multiple page-faults... all caused by a *single instruction*,
+ yet with a clear benefit that the regularisation of LOAD/STOREs can
+ be optimised for minimal impact on caches and maximised throughput.
* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
to pages), and these load/stores have absolutely nothing to do with the
- SIMD / ALU engine, no matter how wide the operand.
+ SIMD / ALU engine, no matter how wide the operand. Simplicity but with
+ more impact on instruction and data caches.
Overall it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
@@ -123,7 +126,8 @@ reducing power consumption for the same.
SIMD again has a severe disadvantage here, over Vector: huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
have to then have operations *for each and between each*. It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.
The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
@@ -135,7 +139,7 @@ burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized width ALU engine, anyway.
Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.
@@ -208,10 +212,10 @@ Interestingly, none of this complexity is faced in SIMD architectures...
but then they do not get the opportunity to optimise for highly-streamlined
memory accesses either.
-With the "bang-per-buck" ratio being so high and the direct improvement
-in L1 Instruction Cache usage, as well as the opportunity to optimise
-L1 and L2 cache usage, the case for including Vector LOAD/STORE is
-compelling.
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
## Mask and Tagging (Predication)
@@ -232,8 +236,8 @@ So these are the ways in which conditional execution may be implemented:
* explicit compare and branch: BNE x, y -> offs would jump offs
instructions if x was not equal to y
* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
- (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code) such as ADD results in a carry, carry bit
+ implicitly (or sometimes explicitly) goes into a "tag" (mask) register
The first of these is a "normal" branch method, which is flat-out impossible
to parallelise without look-ahead and effectively rewriting instructions.
@@ -304,17 +308,322 @@ In particular:
i.e. *without* requiring a super-scalar or out-of-order architecture,
but doing a proper, full job (ZOLC) is an entirely different matter.
-Constructing a SIMD/Simple-Vector proposal based around four of these five
+Constructing a SIMD/Simple-Vector proposal based around four of these six
requirements would therefore seem to be a logical thing to do.
-# Instruction Format
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance. In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
+Parallelism". They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD) implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler. Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+straightforward. All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
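
The hybrid ("Virtual Parallelism") arrangement described above can be sketched as follows. This is a minimal illustration rather than anything from the proposal itself: `run_vector_op` and its `lanes` parameter are hypothetical names, and the inner loop stands in for what hardware would do simultaneously. The point it shows is that the result is identical whether the implementation has one ALU or several, and that lanes beyond the vector length simply stay idle on the last pass.

```python
def run_vector_op(op, src1, src2, vl, lanes=4):
    """Run an element-wise op of length vl on a hypothetical ALU that is
    `lanes` wide, looping transparently until all elements are done."""
    dest = [None] * vl
    passes = 0
    for base in range(0, vl, lanes):
        passes += 1
        # Lanes whose element index would fall beyond vl stay offline.
        for i in range(base, min(base + lanes, vl)):
            dest[i] = op(src1[i], src2[i])  # conceptually simultaneous
    return dest, passes


def add(a, b):
    return a + b


a = list(range(10))
b = [10] * 10
wide, wide_passes = run_vector_op(add, a, b, vl=10, lanes=4)
narrow, narrow_passes = run_vector_op(add, a, b, vl=10, lanes=1)
```

Both calls produce the same destination vector. Only the number of passes differs, which is exactly the detail that is meant to be invisible to the program and the compiler.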
+
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+The "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism. Options are covered in the Appendix.
+
+# CSRs
+
+There are two CSR tables needed to create lookup tables which are used at
+the register decode phase.
+
+* Integer Register N is Vector
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+ of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+ "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
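
The note on vector length being multiplied by the number of elements per register can be illustrated with a small sketch. The function name and the use of `None` for the "default" bitwidth are assumptions made for this example only. The arithmetic follows the note above: a register of XLEN bits holds XLEN / bitwidth elements, and each of those counts towards the actual vector length.

```python
def effective_vector_length(vectorlen, elwidth_bits, xlen=64):
    """Number of elements covered by `vectorlen` registers.

    elwidth_bits=None stands for the "default" bitwidth, i.e. one
    element per register (an assumption of this sketch).
    """
    if elwidth_bits is None:
        return vectorlen
    if elwidth_bits > xlen or xlen % elwidth_bits != 0:
        raise ValueError("element width must divide XLEN")
    return vectorlen * (xlen // elwidth_bits)
```

For example, four RV64 registers marked as 8-bit would cover 4 * 8 = 32 elements, each of which may then be individually predicated.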
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. However it is important to note
+that the *actual* register is *different* from the one that ends up
+being used, due to the level of indirection through the lookup table.
+This includes (as a future option) redirecting to a *second* bank of
+integer registers.
+
+* regidx is the actual register that in combination with the
+ i/f flag, if that integer or floating-point register is referred to,
+ results in the lookup table being referenced to find the predication
+ mask to use on the operation in which that (regidx) register has
+ been used
+* predidx (in combination with the bank bit in the future) is the
+ *actual* register to be used for the predication mask. Note:
+ in effect predidx is actually a 6-bit register address, as the bank
+ bit is the MSB (and is nominally set to zero for now).
+* inv indicates that the predication mask bits are to be inverted
+ prior to use *without* actually modifying the contents of the
+ register itself.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+ place zeros in any element position where the predication mask is
+  set to zero. If zeroing is set to 0, unpredicated elements *must*
+ be left alone. Some microarchitectures may choose to interpret
+ this as skipping the operation entirely. Others which wish to
+ stick more closely to a SIMD architecture may choose instead to
+ interpret unpredicated elements as an internal "copy element"
+ operation (which would be necessary in SIMD microarchitectures
+ that perform register-renaming)
+
+| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
+| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
+| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
+| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+ struct pred {
+ bool zero;
+ bool inv;
+ bool bank; // 0 for now, 1=rsvd
+ bool enabled;
+ int predidx; // redirection: actual int register to use
+ }
+
+ struct pred fp_pred_reg[32]; // 64 in future (bank=1)
+ struct pred int_pred_reg[32]; // 64 in future (bank=1)
+
+ for (i = 0; i < 16; i++)
+ tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+ idx = CSRpred[i].regidx
+ tb[idx].zero = CSRpred[i].zero
+ tb[idx].inv = CSRpred[i].inv
+ tb[idx].bank = CSRpred[i].bank
+ tb[idx].predidx = CSRpred[i].predidx
+ tb[idx].enabled = true
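
The pseudo-code above can be made concrete with a short runnable sketch. It assumes the bit layout from the PrCSR table (bank in bit 13, zero in bit 12, inv in bit 11, i/f in bit 10, regidx in bits 9..5, predidx in bits 4..0) and, following the pseudo-code, that an i/f value of 0 selects the integer table. The names `Pred`, `decode_entry` and `build_tables` are hypothetical, and the sketch only processes entries the caller passes in, since the table as shown has no explicit "valid" bit.

```python
from dataclasses import dataclass


@dataclass
class Pred:
    zero: bool = False
    inv: bool = False
    bank: bool = False      # 0 for now, 1 = reserved
    enabled: bool = False
    predidx: int = 0        # actual integer register holding the mask


def decode_entry(raw):
    """Split one 14-bit PrCSR entry into its fields."""
    return {
        "bank": bool((raw >> 13) & 1),
        "zero": bool((raw >> 12) & 1),
        "inv": bool((raw >> 11) & 1),
        "is_fp": bool((raw >> 10) & 1),   # assumed: 0 = integer, 1 = FP
        "regidx": (raw >> 5) & 0x1F,
        "predidx": raw & 0x1F,
    }


def build_tables(csr_entries):
    """Turn the key-value CSR entries into per-register lookup tables."""
    int_pred = [Pred() for _ in range(32)]
    fp_pred = [Pred() for _ in range(32)]
    for raw in csr_entries:
        e = decode_entry(raw)
        table = fp_pred if e["is_fp"] else int_pred
        table[e["regidx"]] = Pred(e["zero"], e["inv"], e["bank"],
                                  True, e["predidx"])
    return int_pred, fp_pred


# x5 predicated by the inverted contents of x9; f3 predicated by x4, zeroing
entries = [(1 << 11) | (5 << 5) | 9,
           (1 << 12) | (1 << 10) | (3 << 5) | 4]
int_pred, fp_pred = build_tables(entries)
```

With the tables turned around like this, register decode needs only a single indexed lookup per operand rather than a search through all 16 entries.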
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+ for (int i=0; i 1;
- s2 = CSRvectorlen[src2] > 1;
- for (int i=0; i
For full analysis of topological adaptation of RVV LOAD/STORE
see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
Revised LOAD:
@@ -481,20 +864,19 @@ Notes:
* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
* **TODO**: clarify where width maps to elsize
-Pseudo-code (excludes CSR SIMD bitwidth):
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
- pred_enabled = int_pred_enabled
preg = int_pred_reg[rd]
for (int i=0; i
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (varible-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
+There is no MV instruction in RV; however, there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward. All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
+If either the source or the destination register is marked as a vector,
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change:
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
+[[!table data="""
+15 12 | 11 7 | 6 2 | 1 0 |
+funct4 | rd | rs | op |
+4 | 5 | 5 | 2 |
+C.MV | dest | src | C0 |
+"""]]
-# CSRs
+A simplified version of the pseudocode for this operation is as follows:
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
+ function op_mv(rd, rs) # MV not VMV!
+   rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
+   rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
+   ps = get_pred_val(FALSE, rs); # predication on src
+   pd = get_pred_val(FALSE, rd); # ... AND on dest
+   for (int i = 0, int j = 0; i < VL && j < VL;):
+ if (int_vec[rs].isvec) while (!(ps & 1< r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
+Note that:
-Notes:
+* elwidth (SIMD) is not covered above
+* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
+ not covered
-* for the purposes of LOAD / STORE, Integer Registers which are
- marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
- of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
- "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
- as opposed to having the predicate register explicitly in the instruction.
+There are several different instructions from RVV that are covered by
+this one opcode:
-## Predication CSR
+[[!table data="""
+src | dest | predication | op |
+scalar | vector | none | VSPLAT |
+scalar | vector | destination | sparse VSPLAT |
+scalar | vector | 1-bit dest | VINSERT |
+vector | scalar | 1-bit? src | VEXTRACT |
+vector | vector | none | VCOPY |
+vector | vector | src | Vector Gather |
+vector | vector | dest | Vector Scatter |
+vector | vector | src & dest | Gather/Scatter |
+vector | vector | src == dest | sparse VCOPY |
+"""]]
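
How one opcode covers several rows of the table can be seen in a small simulation. This is a sketch under stated assumptions, not the specification's pseudo-code: it assumes that masked-out elements are skipped independently on the source and destination sides, that a scalar operand's index never advances, and that an absent predicate means "all elements enabled". `twin_pred_move` and its parameters are hypothetical names; element width and the early-exit cases for VINSERT and VEXTRACT are not modelled.

```python
def twin_pred_move(regs, rd, rs, vl, rd_is_vec, rs_is_vec, pd=None, ps=None):
    """Copy regs[rs..] to regs[rd..] with separate src/dest predicates."""
    all_ones = (1 << vl) - 1
    pd = all_ones if pd is None else pd
    ps = all_ones if ps is None else ps
    i = j = 0
    while i < vl and j < vl:
        if rs_is_vec:                      # skip masked-out source elements
            while i < vl and not (ps >> i) & 1:
                i += 1
        if rd_is_vec:                      # skip masked-out dest elements
            while j < vl and not (pd >> j) & 1:
                j += 1
        if i >= vl or j >= vl:
            break
        regs[rd + j] = regs[rs + i]
        if rs_is_vec:
            i += 1
        if rd_is_vec:
            j += 1
        if not rs_is_vec and not rd_is_vec:
            break                          # plain scalar move
    return regs


# scalar -> vector, no predication: behaves like VSPLAT
splat = twin_pred_move(list(range(32)), rd=8, rs=5, vl=4,
                       rd_is_vec=True, rs_is_vec=False)
# vector -> vector, source predicate 0b1010: behaves like a gather
gather = twin_pred_move(list(range(32)), rd=8, rs=16, vl=4,
                        rd_is_vec=True, rs_is_vec=True, ps=0b1010)
# vector -> vector, destination predicate 0b0101: behaves like a scatter
scatter = twin_pred_move(list(range(32)), rd=8, rs=16, vl=4,
                         rd_is_vec=True, rs_is_vec=True, pd=0b0101)
```

In the gather case the two enabled source elements are packed into consecutive destination registers. In the scatter case consecutive source elements land only in the enabled destination slots.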
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. The first entry is whether predication
-is enabled. The second entry is whether the register index refers to a
-floating-point or an integer register. The third entry is the index
-of that register which is to be predicated (if referred to). The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6 | 5 | (4..0) | (4..0) |
-| ----- | - | - | ------- | ------- |
-| r0 | pren0 | i/f | regidx | predidx |
-| r1 | pren1 | i/f | regidx | predidx |
-| .. | pren.. | i/f | regidx | predidx |
-| r15 | pren15 | i/f | regidx | predidx |
+Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
+operations with inversion on the src and dest predication for one of the
+two C.MV operations.
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
+Where the Compressed Extension is not implemented, MV may be used
+instead, but that is a pseudo-operation mapping to addi rd, rs, 0.
+Note that the behaviour is **different** from C.MV because with addi the
+predication mask to use is taken **only** from rd and is applied against
+all elements: rd[i] = rs[i].
- fp_pred_enabled[32];
- int_pred_enabled[32];
- for (i = 0; i < 16; i++)
- if CSRpred[i].pren:
- idx = CSRpred[i].regidx
- predidx = CSRpred[i].predidx
- if CSRpred[i].type == 0: # integer
- int_pred_enabled[idx] = 1
- int_pred_reg[idx] = predidx
- else:
- fp_pred_enabled[idx] = 1
- fp_pred_reg[idx] = predidx
+### FMV, FNEG and FABS Instructions
-So when an operation is to be predicated, it is the internal state that
-is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
+These are identical in form to C.MV, except covering floating-point
+register copying. The same double-predication rules also apply.
+However when elwidth is not set to default the instruction is implicitly
+and automatically converted to a (vectorised) floating-point type conversion
+operation of the appropriate size covering the source and destination
+register bitwidths.
- for (int i=0; i What does an ADD of two different-sized vectors do in simple-V?
-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N+M-1) where M is the Vector Length". Vector Lengths may be set to
-use up to 16 registers in the register file.
+* if the two source operands are not the same, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+ than the destination, throw an exception.
-One separate CSR table is needed for each of the integer and floating-point
-register files:
+> And what about instructions like JALR?
+> What does jumping to a vector do?
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0 | vlen0 |
-| r1 | vlen1 |
-| .. | vlen.. |
-| r31 | vlen31 |
+* Throw an exception. Whether that actually results in spawning threads
+ as part of the trap-handling remains to be seen.
-An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
-whether a register was, if referred to in any standard instructions,
-implicitly to be treated as a vector. A vector length of 1 indicates
-that it is to be treated as a scalar. Vector lengths of 0 are reserved.
+# Under consideration
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+ (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+ (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops is it worthwhile adding a "start offset"?
+ (a bit like misaligned addressing... for registers)
+ or just use predication to skip start?
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth is specifically not set) it becomes:
+## Future (extra) bank be included (made mandatory)
- CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+Expanding the *standard* register file from 32 entries per bank to
+64 per bank is quite an extensive architectural change, and it also
+has implications for context-switching.
-This is in contrast to RVV:
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV. SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
- CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+## How large should the Register and Predication CSR key-value stores be?
-## Element (SIMD) bitwidth CSRs
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed. At the time of writing
+(12jul2018) that is too early to tell. An approximate best-guess
+however would be 16 entries.
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that it would be used
+for performance reasons but instead for reducing instruction count.
+The number of CSR entries therefore has to be considered extremely
+carefully.
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0 | vew0 |
-| r1 | vew1 |
-| .. | vew.. |
-| r31 | vew31 |
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-vew may be one of the following (giving a table "bytestable", used below):
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point. However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free. However just one does not make
+much sense (as one only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
+This applies *in addition* in the RV32E case, if an RV32E implementation
+happens also to support U/S/M modes. This would be considered quite
+rare but not outside of the realm of possibility.
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
+Conclusion: all needs careful analysis and future work.
- vew = CSRbitwidth[rs1]
- if (vew == 0)
- bytesperreg = (XLEN/8) # or FLEN as appropriate
- else:
- bytesperreg = bytestable[vew] # 1 2 4 8 16
- simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
- vlen = CSRvectorlen[rs1] * simdmult
- CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
+On balance it's a neat idea, but one where the benefits are not
+really clear. It would obviate the need for an exception to be raised
+if the VL runs out of registers to put things in (gets to x31, tries
+a non-existent x32 and fails). The "fly in the ointment", however, is
+that x0 is hard-coded to "zero". The increment would therefore need
+to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
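
The "double-step" over x0 is easy to state in code. The sketch below is only an illustration of the wrap rule being discussed; the helper names are hypothetical and nothing here is part of the proposal.

```python
def next_reg(r, nregs=32):
    """Register following r, wrapping from x31 to x1 (x0 is skipped)."""
    nxt = (r + 1) % nregs
    return 1 if nxt == 0 else nxt


def vector_regs(start, vl):
    """Registers touched by a vector of length vl starting at x<start>."""
    regs = []
    r = start
    for _ in range(vl):
        regs.append(r)
        r = next_reg(r)
    return regs
```

A vector of length 4 starting at x30 would then occupy x30, x31, x1 and x2. The registers are no longer contiguous, which is where SIMD-like microarchitectures could run into difficulty.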
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
+## Can CLIP be done as a CSR (mode, like elwidth)
-# Exceptions
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
-> What does an ADD of two different-sized vectors do in simple-V?
+No details are given; however, the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below).
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
+## SIMD saturation (etc.) also set as a mode?
-> And what about instructions like JALR?
-> What does jumping to a vector do?
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+ regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 1 : 0)
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
+
+## 8/16-bit ops is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example rather than take 2 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
+
+ r3[0] = add r4[0], r6[0]
+ r3[1] = add r4[1], r6[1]
+ r3[2] = add r4[2], r6[2]
+ r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+ r3[0] = add r4[1], r6[0]
+ r3[1] = add r4[2], r6[1]
+ r3[2] = add r4[3], r6[2]
+ r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+ either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+ penalties for doing so.
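
The first of these options (bit-manipulation) can be sketched for the offset-by-one example given earlier. The sketch makes several assumptions that the text does not state: element 0 is the least-significant byte of a 32-bit register, the byte-wise add wraps modulo 256, and the helper names `shift_in` and `packed_add8` are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def shift_in(lo, hi, offset):
    """Four 8-bit elements starting `offset` elements into lo,
    continuing into hi (element 0 assumed to be the low byte)."""
    combined = (hi << 32) | lo
    return (combined >> (8 * offset)) & MASK32


def packed_add8(a, b):
    """Element-wise add of four 8-bit lanes, wrapping modulo 256."""
    result = 0
    for lane in range(4):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        result |= ((x + y) & 0xFF) << (8 * lane)
    return result


r4 = 0x04030201   # elements 1, 2, 3, 4
r5 = 0x08070605   # elements 5, 6, 7, 8
r6 = 0x40302010   # elements 0x10, 0x20, 0x30, 0x40

# r3[n] = r4/r5 element (n + 1) + r6[n], as in the offset-by-one example
r3 = packed_add8(shift_in(r4, r5, 1), r6)
```

The extra cost over a plain packed add is one double-width shift per source operand, which is the overhead that has to be weighed against dedicated offset hardware.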
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
# Implementing V on top of Simple-V
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
- as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential. This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+ implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+ opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+ of Simple-V as, again, there is no corresponding RV Standard Extension
+ that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+ complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+ of fitting on top of the integer and floating-point files. An
+ "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+ registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+ register file size is 5 bits (32 registers), whilst the "Extended"
+ variant of RVV specifies 8 bits (256 registers) and has yet to
+ be published.
+* One big difference: according to Sections 17.12 and 17.17, there are only
+ two possible predication registers in RVV "Base". Through the "indirect"
+ method, Simple-V provides a key-value CSR table that allows (arbitrarily)
+ up to 16 (TBD) of either the floating-point or integer registers to
+ be marked as "predicated" (key), and if so, which integer register to
+ use as the predication mask (value).
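
The key-value predication table described in the last bullet might be modelled as follows. This is a sketch only: the table contents, the `("int", 4)` style keys and the helper names are all hypothetical, and the CSR encoding itself is not shown.

```python
# Key: (register file, register number); value: integer register holding the mask.
# The entries below are illustrative only.
PREDICATION_TABLE = {
    ("int", 4): 10,
    ("fp", 2): 11,
}
MAX_ENTRIES = 16  # "up to 16 (TBD)" in the text


def predicate_mask(regfile, regnum, int_regs, vl):
    """Look up the mask for a register; unpredicated means all elements active."""
    all_active = (1 << vl) - 1
    key = (regfile, regnum)
    if key not in PREDICATION_TABLE:
        return all_active
    return int_regs[PREDICATION_TABLE[key]] & all_active


def active_elements(mask, vl):
    """Return the element indices whose predicate bit is set."""
    return [i for i in range(vl) if (mask >> i) & 1]


int_regs = [0] * 32
int_regs[10] = 0b0101
print(active_elements(predicate_mask("int", 4, int_regs, 4), 4))
```

The point of the indirection is that any register may be given a predicate, rather than being limited to the two predication registers of RVV "Base".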
+
+**TODO**
# Implementing P (renamed to DSP) on top of Simple-V
@@ -817,6 +1278,11 @@ This section compares the various parallelism proposals as they stand,
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
### [[alt_rvp]]
Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -847,9 +1313,9 @@ from actual (internal) parallel hardware. It's an API in effect that's
designed to be slotted in to an existing implementation (just after
instruction decode) with minimum disruption and effort.
-* minus: the complexity of having to use register renames, OoO, VLIW,
- register file cacheing, all of which has been done before but is a
- pain
+* minus: the complexity (if full parallelism is to be exploited)
+ of having to use register renames, OoO, VLIW, register file cacheing,
+ all of which has been done before but is a pain
* plus: transparent re-use of existing opcodes as-is just indirectly
saying "this register's now a vector" which
* plus: means that future instructions also get to be inherently
@@ -958,7 +1424,7 @@ the question is asked "How can each of the proposals effectively implement
a SIMD architecture where the ALU becomes responsible for the parallelism,
Alt-RVP ALUs would likewise be so responsible... with *additional*
(lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
at least one dimension are avoided (architectural upgrades introducing
128-bit then 256-bit then 512-bit variants of the exact same 64-bit
SIMD block)
@@ -1034,6 +1500,15 @@ the question is asked "How can each of the proposals effectively implement
operations, all the while keeping a consistent ISA-level "API" irrespective
of implementor design choices (or indeed actual implementations).
+### Example Instruction translation:
+
+The instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
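
The translation can be sketched as a small expansion function. It assumes (this is an interpretation, not stated above) that r4 is marked as a vector of length 3 whilst r2 is left as a scalar; `expand` and the `vectorised` set are hypothetical names.

```python
def expand(opcode, rd, rs1, rs2, vl, vectorised):
    """Expand one instruction into the scalar operations queued in the FIFO.

    `vectorised` is the set of register numbers marked as vectors; any
    register not in it is treated as a scalar and is not incremented.
    """
    regs = (rd, rs1, rs2)
    if not any(r in vectorised for r in regs):
        # fully scalar: passes through unchanged
        return [f"{opcode} r{rd} r{rs1} r{rs2}"]
    fifo = []
    for i in range(vl):
        d, s1, s2 = (r + i if r in vectorised else r for r in regs)
        fifo.append(f"{opcode} r{d} r{s1} r{s2}")
    return fifo


for op in expand("ADD", 2, 4, 4, vl=3, vectorised={4}):
    print(op)
```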
+
## Example of vector / vector, vector / scalar, scalar / scalar => vector add
register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
@@ -1085,10 +1560,10 @@ There is, in the standard Conditional Branch instruction, more than
adequate space to interpret it in a similar fashion:
[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
+imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
This would become:
@@ -1108,19 +1583,19 @@ not only to add in a second source register, but also use some of the bits as
a predication target as well.
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | op |
+3 | 3 | 3 | 5 | 2 |
+C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]
Now uses the CS format:
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
"""]]
Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1223,16 +1698,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
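
As an illustration, the table can be expressed as a lookup. This sketch assumes XLEN is 64 and interprets an element's high bit as its low bit plus the element width in bits minus one; both are assumptions, and the helper names are hypothetical.

```python
XLEN = 64  # assumed; the "default" entry resolves to XLEN/8 bytes

BYTESTABLE = {
    0b000: XLEN // 8,
    0b001: 1,
    0b010: 2,
    0b011: 4,
    0b100: 8,
    0b101: 16,
}


def element_bytes(vew):
    """Return the element size in bytes, rejecting the reserved encodings."""
    if vew not in BYTESTABLE:
        raise ValueError(f"vew {vew:03b} is reserved")
    return BYTESTABLE[vew]


def element_bit_range(vew, byteidx):
    """Return (low, high) bit positions of the element starting at byteidx."""
    width_bits = element_bytes(vew) * 8
    low = byteidx * 8
    return low, low + width_bits - 1
```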
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
@@ -1253,15 +1728,6 @@ To index an element in a register rnum where the vector element index is i:
byteidx * 8, # low
byteidx * 8 + (vew-1), # high
-### Example Instruction translation:
-
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
### Insights
SIMD register file splitting still to consider. For RV64, benefits of doubling
@@ -1359,7 +1825,7 @@ So the question boils down to:
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
@@ -1409,7 +1875,116 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
-## Virtual Memory page-faults
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+ /* Macro for saving task context */
+ .macro portSAVE_CONTEXT
+ .global pxCurrentTCB
+ /* make room in stack */
+ addi sp, sp, -REGBYTES * 32
+
+ /* Save Context */
+ STORE x1, 0x0(sp)
+ STORE x2, 1 * REGBYTES(sp)
+ STORE x3, 2 * REGBYTES(sp)
+ ...
+ ...
+ STORE x30, 29 * REGBYTES(sp)
+ STORE x31, 30 * REGBYTES(sp)
+
+ /* Store current stackpointer in task control block (TCB) */
+ LOAD t0, pxCurrentTCB //pointer
+ STORE sp, 0x0(t0)
+ .endm
+
+ /* Saves current error program counter (EPC) as task program counter */
+ .macro portSAVE_EPC
+ csrr t0, mepc
+ STORE t0, 31 * REGBYTES(sp)
+ .endm
+
+ /* Saves current return address (RA) as task program counter */
+ .macro portSAVE_RA
+ STORE ra, 31 * REGBYTES(sp)
+ .endm
+
+ /* Macro for restoring task context */
+ .macro portRESTORE_CONTEXT
+
+ .global pxCurrentTCB
+ /* Load stack pointer from the current TCB */
+ LOAD sp, pxCurrentTCB
+ LOAD sp, 0x0(sp)
+
+ /* Load task program counter */
+ LOAD t0, 31 * REGBYTES(sp)
+ csrw mepc, t0
+
+ /* Run in machine mode */
+ li t0, MSTATUS_PRV1
+ csrs mstatus, t0
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ LOAD x4, 3 * REGBYTES(sp)
+ LOAD x5, 4 * REGBYTES(sp)
+ ...
+ ...
+ LOAD x30, 29 * REGBYTES(sp)
+ LOAD x31, 30 * REGBYTES(sp)
+
+ addi sp, sp, REGBYTES * 32
+ mret
+ .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+ /* a few things are assumed here: (a) that when switching to
+ M-Mode an entirely different set of CSRs is used from that
+ which is used in U-Mode and (b) that the M-Mode x1 and x4
+ vectors are also not used anywhere else in M-Mode, consequently
+ only need to be set up just the once.
+ */
+ .macroVectorSetup
+ MVECTORCSRx1 = 31, defaultlen
+ MVECTORCSRx4 = 28, defaultlen
+
+ /* Save Context */
+ SETVL x0, x0, 31 /* x0 ignored silently */
+ STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ SETVL x0, x0, 28 /* x0 ignored silently */
+ LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that x2 and x3 do not appear to be restored, although this may just
+be a bug in portasm.S. If it is a bug and they *do* need to be
+restored, then the SETVL call may be moved *outside* the Save / Restore
+Context assembly code, into the macroVectorSetup, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
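
To check that the single vectorised instruction covers the same registers and stack slots as the original macro, the expansion can be modelled as below. The sketch assumes RV64 (REGBYTES of 8) and that consecutive elements go to consecutive REGBYTES-sized slots; `expand_mem_op` is a hypothetical name.

```python
REGBYTES = 8  # assumed: RV64


def expand_mem_op(op, base_reg, first_offset, vl):
    """Expand a vectorised LOAD/STORE into per-register scalar accesses."""
    return [
        (op, f"x{base_reg + i}", first_offset + i * REGBYTES)
        for i in range(vl)
    ]


# STORE x1, 0x0(sp) with VL=31, and LOAD x4, 3 * REGBYTES(sp) with VL=28
save = expand_mem_op("STORE", 1, 0, 31)
restore = expand_mem_op("LOAD", 4, 3 * REGBYTES, 28)
print(save[-1], restore[0], restore[-1])
```

Both expansions end at x31 with an offset of 30 * REGBYTES, matching the last STORE and LOAD of the FreeRTOS macros.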
+
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
> riscv-isa-manual in order to work out how to re-map RVV onto the standard
@@ -1513,6 +2088,225 @@ Am still thinking through the implications as any dependent operations
(particularly ones already decoded and moved into the execution FIFO)
would still be there (and stalled). hmmm.
+----
+
+ > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+ > > VSETL r0, 8
+ > > FADD x1, x2, x3
+ >
+ > > x3[0]: ok
+ > > x3[1]: exception
+ > > x3[2]: ok
+ > > ...
+ > > ...
+ > > x3[7]: ok
+ >
+ > > what happens to result elements 2-7? those may be *big* results
+ > > (RV128)
+ > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+ >
+ > (you replied:)
+ >
+ > Thrown away.
+
+discussion then led to the question of OoO architectures
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it. It's hard to debug. You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility. (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn't apply to only your
+> model with the "standard register file" is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a "zero bit" per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
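
The difference between the two predication behaviours under discussion can be shown with a small model. This is a behavioural sketch only (no timing, no register renaming), and `predicated_add` is a hypothetical name.

```python
def predicated_add(dest, src1, src2, mask, zeroing):
    """Element-wise add under a predicate mask; returns the new dest list."""
    result = list(dest)
    for i in range(len(dest)):
        if (mask >> i) & 1:
            result[i] = src1[i] + src2[i]
        elif zeroing:
            result[i] = 0
        # otherwise the old destination value is left untouched
    return result


old = [9, 9, 9, 9]
print(predicated_add(old, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, zeroing=False))
print(predicated_add(old, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, zeroing=True))
```

In the non-zeroing case the masked-out elements are never written, so a following operation that reads them has no dependency on this one.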
+
+
+## Implementation detail for scalar-only op detection
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
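
The check described in the reply can be sketched as a bitmask test. This models only the logic, not the 3R1W SRAM or its timing, and assumes a single 32-entry register file; the function names are hypothetical.

```python
def build_vector_bits(vectorised_regs):
    """Pack "is register vectorised" flags into one 32-bit word.

    In hardware this word would be regenerated whenever the CSRs change.
    """
    bits = 0
    for r in vectorised_regs:
        bits |= 1 << r
    return bits


def is_scalar_only(vector_bits, rd, rs1, rs2):
    """True when none of dest, src1, src2 is marked as vectorised."""
    touched = (1 << rd) | (1 << rs1) | (1 << rs2)
    return (vector_bits & touched) == 0


bits = build_vector_bits({4, 5, 6})
print(is_scalar_only(bits, 1, 2, 3), is_scalar_only(bits, 2, 4, 4))
```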
+
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove. it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay. so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
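
A minimal sketch of that advice for unsigned integers follows. It assumes that "largest size" includes the destination width, which is an interpretation rather than something stated in the discussion; `convert_add` is a hypothetical name.

```python
def convert_add(a, a_bits, b, b_bits, dest_bits):
    """Unsigned add: upconvert inputs, compute at the widest size, then
    downconvert (truncate) only when writing the result."""
    work_bits = max(a_bits, b_bits, dest_bits)
    a &= (1 << a_bits) - 1          # inputs as stored at their own width
    b &= (1 << b_bits) - 1
    total = (a + b) & ((1 << work_bits) - 1)
    return total & ((1 << dest_bits) - 1)


print(convert_add(200, 8, 300, 16, 8))    # truncated on output
print(convert_add(200, 8, 100, 8, 16))    # no truncation needed
```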
+
+
+
+## Under review / discussion: remove CSR vector length, use VSETVL
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**. This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+ a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
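
The first option behaves like a conventional strip-mining loop, sketched below. Only the MIN semantics come from the text above; the loop structure and the names `vsetvl` and `strip_mine` are illustrative assumptions.

```python
def vsetvl(requested, max_len):
    """Return the vector length granted: MIN(requested, max_len)."""
    return min(requested, max_len)


def strip_mine(total_elements, max_len):
    """Yield the VL used on each pass of a loop over total_elements."""
    remaining = total_elements
    while remaining > 0:
        vl = vsetvl(remaining, max_len)
        yield vl
        remaining -= vl


print(list(strip_mine(10, 4)))
```

Each pass applies the same VL to every vectorised register it touches, which is the across-the-board behaviour noted above.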
+
+## Implementation Paradigms
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way to high-performance systems.
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+ required, however when a register is used that is detected (in hardware)
+ to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comprehensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
+# TODO Research
+
+> For great floating point DSPs check TI's C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases result in nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance to avoid.
+
+TBD: floating-point compare and other exception handling
+
# References
* SIMD considered harmful
@@ -1542,3 +2336,11 @@ would still be there (and stalled). hmmm.
* Discussion on RVV "re-entrant" capabilities allowing operations to be
restarted if an exception occurs (VM page-table miss)
+* Dot Product Vector
+* RVV slides 2017
+* Wavefront skipping using BRAMS
+* Streaming Pipelines
+* Barcelona SIMD Presentation
+*
+* Full Description (last page) of RVV instructions
+