X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=4cf0937edde233618bbd012a4a8ad571c9acd473;hb=1df960ad2fec1a7f252ce30ce808dff52f2a072d;hp=6805549905920f3e14fc60d90426960c494c43f3;hpb=08521a40f097fc96cc9890625fef491e42964aeb;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 680554990..4cf0937ed 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -9,8 +9,8 @@ instruction queue (FIFO), pending execution.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
[[!toc ]]
@@ -126,7 +126,8 @@ reducing power consumption for the same.
SIMD again has a severe disadvantage here, over Vector: huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
have to then have operations *for each and between each*. It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.
The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
@@ -356,16 +357,19 @@ level all-hardware parallelism. Options are covered in the Appendix.
# CSRs
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
+There are two CSR tables needed to create lookup tables which are used at
+the register decode phase.
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is Vector
* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
+
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
Notes:
@@ -384,40 +388,68 @@ Notes:
V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
needed for context-switches (empty slots need never be stored).
-## Predication CSR
+## Predication CSR
The Predication CSR is a key-value store indicating whether, if a given
destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. The first entry is whether predication
-is enabled. The second entry is whether the register index refers to a
-floating-point or an integer register. The third entry is the index
-of that register which is to be predicated (if referred to). The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6 | 5 | (4..0) | (4..0) |
-| ----- | - | - | ------- | ------- |
-| r0 | pren0 | i/f | regidx | predidx |
-| r1 | pren1 | i/f | regidx | predidx |
-| .. | pren.. | i/f | regidx | predidx |
-| r15 | pren15 | i/f | regidx | predidx |
+instruction, it is to be predicated. However it is important to note
+that the *actual* register is *different* from the one that ends up
+being used, due to the level of indirection through the lookup table.
+This includes redirecting to a *second* bank of
+integer registers (a future option).
+
+* regidx is the actual register that in combination with the
+ i/f flag, if that integer or floating-point register is referred to,
+ results in the lookup table being referenced to find the predication
+ mask to use on the operation in which that (regidx) register has
+ been used
+* predidx (in combination with the bank bit in the future) is the
+ *actual* register to be used for the predication mask. Note:
+ in effect predidx is actually a 6-bit register address, as the bank
+ bit is the MSB (and is nominally set to zero for now).
+* inv indicates that the predication mask bits are to be inverted
+ prior to use *without* actually modifying the contents of the
+ register itself.
+* zeroing is either 1 or 0, and if set to 1, the operation must
+  place zeros in any element position where the predication mask is
+  set to zero. If zeroing is set to 0, unpredicated elements *must*
+  be left alone (unaltered). Some microarchitectures may choose to interpret
+ this as skipping the operation entirely. Others which wish to
+ stick more closely to a SIMD architecture may choose instead to
+ interpret unpredicated elements as an internal "copy element"
+ operation (which would be necessary in SIMD microarchitectures
+ that perform register-renaming)
+
+| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
+| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
+| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
+| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |
The Predication CSR Table is a key-value store, so implementation-wise
it will be faster to turn the table around (maintain topologically
equivalent state):
- fp_pred_enabled[32];
- int_pred_enabled[32];
+ struct pred {
+ bool zero;
+ bool inv;
+ bool bank; // 0 for now, 1=rsvd
+ bool enabled;
+ int predidx; // redirection: actual int register to use
+ }
+
+ struct pred fp_pred_reg[32]; // 64 in future (bank=1)
+ struct pred int_pred_reg[32]; // 64 in future (bank=1)
+
for (i = 0; i < 16; i++)
- if CSRpred[i].pren:
- idx = CSRpred[i].regidx
- predidx = CSRpred[i].predidx
- if CSRpred[i].type == 0: # integer
- int_pred_enabled[idx] = 1
- int_pred_reg[idx] = predidx
- else:
- fp_pred_enabled[idx] = 1
- fp_pred_reg[idx] = predidx
+ tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+ idx = CSRpred[i].regidx
+ tb[idx].zero = CSRpred[i].zero
+ tb[idx].inv = CSRpred[i].inv
+ tb[idx].bank = CSRpred[i].bank
+ tb[idx].predidx = CSRpred[i].predidx
+ tb[idx].enabled = true
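To make the inv and zeroing semantics concrete, here is a minimal,
non-normative Python model (the function and argument names are invented
for illustration, not part of the spec):

```python
# Illustrative model of SV predication flags: "inv" inverts each mask
# bit as it is read (the predicate register itself is never modified);
# "zeroing" writes zeros into masked-out destination elements, whilst
# non-zeroing leaves them untouched.
def predicated_add(dest, src1, src2, pmask, inv=False, zeroing=False):
    for i in range(len(dest)):
        bit = (pmask >> i) & 1
        if inv:
            bit ^= 1
        if bit:
            dest[i] = src1[i] + src2[i]
        elif zeroing:
            dest[i] = 0   # zeroing mode: masked-out elements become zero
        # else: element left alone (non-zeroing mode)

d = [9, 9, 9, 9]
predicated_add(d, [1, 2, 3, 4], [10, 20, 30, 40], pmask=0b0101)
assert d == [11, 9, 33, 9]   # elements 1 and 3 untouched
```
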
So when an operation is to be predicated, it is the internal state that
is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
@@ -431,24 +463,54 @@ reference to the predication register to be used:
s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store:
+table generated from the Predication CSR key-value store, which is used
+as follows:
if type(iop) == INT:
- pred_enabled = int_pred_enabled
preg = int_pred_reg[rd]
else:
- pred_enabled = fp_pred_enabled
preg = fp_pred_reg[rd]
-        s1 = CSRvectorlen[src1] > 1;
-        s2 = CSRvectorlen[src2] > 1;
-        for (int i=0; i<vl; ++i)
For full analysis of topological adaptation of RVV LOAD/STORE
see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
@@ -767,15 +869,14 @@ Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
- pred_enabled = int_pred_enabled
preg = int_pred_reg[rd]
    for (int i=0; i<vl; ++i)

+## C.MV Instruction
+
+There is no MV instruction in RV however there is a C.MV instruction.
+It is used for copying integer-to-integer registers (vectorised FMV
+is used for copying floating-point).
+
+If either the source or the destination register are marked as vectors
+C.MV is reinterpreted to be a vectorised (multi-register) predicated
+move operation. The actual instruction's format does not change:
+
+[[!table data="""
+15 12 | 11 7 | 6 2 | 1 0 |
+funct4 | rd | rs | op |
+4 | 5 | 5 | 2 |
+C.MV | dest | src | C0 |
+"""]]
+
+A simplified version of the pseudocode for this operation is as follows:
+
+    function op_mv(rd, rs) # MV not VMV!
+      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
+      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
+      ps = get_pred_val(FALSE, rs); # predication on src
+      pd = get_pred_val(FALSE, rd); # ... AND on dest
+      for (int i = 0, int j = 0; i < VL && j < VL;):
+        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
+        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
+        ireg[rd+j] <= ireg[rs+i];
+        if (int_vec[rs].isvec) i++;
+        if (int_vec[rd].isvec) j++;

> What does an ADD of two different-sized vectors do in simple-V?
@@ -849,6 +1059,168 @@ Section "Virtual Memory Page Faults".
* Throw an exception. Whether that actually results in spawning threads
as part of the trap-handling remains to be seen.
+# Under consideration
+
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+ (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+ (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops is it worthwhile adding a "start offset"?
+ (a bit like misaligned addressing... for registers)
+ or just use predication to skip start?
+
+## Should the future (extra) bank be included (made mandatory)?
+
+Expanding the *standard* register file from 32 entries per bank to
+64 per bank is quite an extensive architectural change. Also it has
+implications for context-switching.
+
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV. SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
+
+## How large should the Register and Predication CSR key-value stores be?
+
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed. At the time of writing
+(12jul2018) that is too early to tell. An approximate best-guess
+however would be 16 entries.
+
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that SV would be used
+for performance reasons: the motivation would instead be reduced
+instruction count.
+The number of CSR entries therefore has to be considered extremely
+carefully.
+
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
+
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point. However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
+
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free. However just one does not make
+much sense (as one would only cover scalar-vector ops) so it is more
+likely that at least two extra would be needed.
+
+This applies *in addition*, in the RV32E case, if an RV32E implementation
+happens also to support U/S/M modes. This would be considered quite
+rare but not outside of the realm of possibility.
+
+Conclusion: all needs careful analysis and future work.
+
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+
+On balance it is a neat idea, but the benefits are not really clear.
+It would however obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails), however
+the "fly in the ointment" is that x0 is hard-coded to "zero". The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
+
+## Can CLIP be done as a CSR (mode, like elwidth)
+
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
+
+No details are given however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below).
+
+## SIMD saturation (etc.) also set as a mode?
+
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
+
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
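As a minimal sketch of why a predicated comparison is effectively a
three-operand instruction, here is a Python model (names invented; it
assumes one result bit per element is written into the destination
predicate register):

```python
# Hypothetical model of a vectorised compare: the *destination* is a
# predicate register, receiving one result bit per element, i.e.
# regs[pred] |= (cmp ? 1 : 0) << i for each element i.
def vec_cmp_lt(src1, src2, vl):
    pred = 0
    for i in range(vl):
        if src1[i] < src2[i]:
            pred |= 1 << i
    return pred

assert vec_cmp_lt([1, 5, 2, 9], [3, 3, 3, 3], 4) == 0b0101
```
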
+
+## 8/16-bit ops is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example, rather than take two 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
+
+ r3[0] = add r4[0], r6[0]
+ r3[1] = add r4[1], r6[1]
+ r3[2] = add r4[2], r6[2]
+ r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+ r3[0] = add r4[1], r6[0]
+ r3[1] = add r4[2], r6[1]
+ r3[2] = add r4[3], r6[2]
+ r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+ either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+ penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
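The offset indexing in the example above can be modelled as follows (a
sketch only; register and element numbering follow the r3/r4/r5/r6
example, with a hypothetical helper name):

```python
# Packed-SIMD element addressing with a "start offset": four 8-bit
# elements per 32-bit register, so element k of the offset source is
# found at register base + (k + offset) // 4, sub-element (k + offset) % 4.
def offset_elem(base_reg, k, offset, elems_per_reg=4):
    idx = k + offset
    return (base_reg + idx // elems_per_reg, idx % elems_per_reg)

# With offset=1 the last ADD reads r5[0], exactly as in the example:
assert offset_elem(4, 3, 1) == (5, 0)
assert offset_elem(4, 0, 1) == (4, 1)
```
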
+
# Implementing V on top of Simple-V
With Simple-V converting the original RVV draft concept-for-concept
@@ -906,6 +1278,11 @@ This section compares the various parallelism proposals as they stand,
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
### [[alt_rvp]]
Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -1321,16 +1698,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
@@ -1448,7 +1825,7 @@ So the question boils down to:
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
@@ -1498,6 +1875,112 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+ /* Macro for saving task context */
+ .macro portSAVE_CONTEXT
+ .global pxCurrentTCB
+ /* make room in stack */
+ addi sp, sp, -REGBYTES * 32
+
+ /* Save Context */
+ STORE x1, 0x0(sp)
+ STORE x2, 1 * REGBYTES(sp)
+ STORE x3, 2 * REGBYTES(sp)
+ ...
+ ...
+ STORE x30, 29 * REGBYTES(sp)
+ STORE x31, 30 * REGBYTES(sp)
+
+ /* Store current stackpointer in task control block (TCB) */
+ LOAD t0, pxCurrentTCB //pointer
+ STORE sp, 0x0(t0)
+ .endm
+
+ /* Saves current error program counter (EPC) as task program counter */
+ .macro portSAVE_EPC
+ csrr t0, mepc
+ STORE t0, 31 * REGBYTES(sp)
+ .endm
+
+ /* Saves current return adress (RA) as task program counter */
+ .macro portSAVE_RA
+ STORE ra, 31 * REGBYTES(sp)
+ .endm
+
+ /* Macro for restoring task context */
+ .macro portRESTORE_CONTEXT
+
+ .global pxCurrentTCB
+ /* Load stack pointer from the current TCB */
+ LOAD sp, pxCurrentTCB
+ LOAD sp, 0x0(sp)
+
+ /* Load task program counter */
+ LOAD t0, 31 * REGBYTES(sp)
+ csrw mepc, t0
+
+ /* Run in machine mode */
+ li t0, MSTATUS_PRV1
+ csrs mstatus, t0
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ LOAD x4, 3 * REGBYTES(sp)
+ LOAD x5, 4 * REGBYTES(sp)
+ ...
+ ...
+ LOAD x30, 29 * REGBYTES(sp)
+ LOAD x31, 30 * REGBYTES(sp)
+
+ addi sp, sp, REGBYTES * 32
+ mret
+ .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+ /* a few things are assumed here: (a) that when switching to
+ M-Mode an entirely different set of CSRs is used from that
+ which is used in U-Mode and (b) that the M-Mode x1 and x4
+ vectors are also not used anywhere else in M-Mode, consequently
+ only need to be set up just the once.
+ */
+ .macro VectorSetup
+ MVECTORCSRx1 = 31, defaultlen
+ MVECTORCSRx4 = 28, defaultlen
+
+ /* Save Context */
+ SETVL x0, x0, 31 /* x0 ignored silently */
+ STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ SETVL x0, x0, 28 /* x0 ignored silently */
+ LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the VectorSetup macro, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
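A sketch (a hypothetical model, not the ISA definition) of how one
vectorised STORE covers the whole bank: with x1 marked as a 31-long
vector and VL set to 31, the single store expands to 31 contiguous
register/offset pairs:

```python
# Expand a vectorised "STORE xN, offset(sp)" into its element stores:
# one (register, byte-offset) pair per element, REGBYTES apart.
def expand_vectorised_store(base_reg, offset, vl, regbytes=8):
    return [(base_reg + i, offset + i * regbytes) for i in range(vl)]

stores = expand_vectorised_store(base_reg=1, offset=0, vl=31)
assert stores[0] == (1, 0)       # STORE x1, 0x0(sp)
assert stores[30] == (31, 240)   # STORE x31, 30 * REGBYTES(sp)
```
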
+
## Virtual Memory page-faults on LOAD/STORE
@@ -1636,8 +2119,141 @@ discussion then led to the question of OoO architectures
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)
+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn't apply only to your
+> model where the "standard register file" is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a "zero bit" per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
+
+
+## Implementation detail for scalar-only op detection
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
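The AND-gate check being debated can be sketched as a one-bit-per-register
bitmap (generated from the CSRs), with a fully-scalar op detected by
testing the src1, src2 and dest bits together. This is a model of the
idea, not a hardware description:

```python
# One bit per register: "is this register vectorised?"  An op is fully
# scalar -- and may bypass the vectorisation phase -- only if the bits
# for src1, src2 and dest are all clear.
def is_fully_scalar(vec_bitmap, rd, rs1, rs2):
    mask = (1 << rd) | (1 << rs1) | (1 << rs2)
    return (vec_bitmap & mask) == 0

bitmap = 1 << 5   # only x5 marked vectorised
assert is_fully_scalar(bitmap, rd=1, rs1=2, rs2=3)
assert not is_fully_scalar(bitmap, rd=1, rs1=5, rs2=3)
```
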
+
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
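One way to explore that question is via a software model of the
twin-predicated move (a sketch under the assumption that the source
predicate gathers and the destination predicate scatters; the helper
name and data layout are invented):

```python
# Model of a twin-predicated move: ps gathers from the source vector,
# pd scatters into the destination vector.  A scalar operand
# (isvec=False) stays at its single register, giving VSPLAT-like
# behaviour for free.
def twin_pred_mv(regs, rd, rs, vl, ps, pd, rs_isvec=True, rd_isvec=True):
    i = j = 0
    while i < vl and j < vl:
        if rs_isvec:
            while i < vl and not ((ps >> i) & 1):
                i += 1
        if rd_isvec:
            while j < vl and not ((pd >> j) & 1):
                j += 1
        if i >= vl or j >= vl:
            break
        regs[rd + j] = regs[rs + i]
        if rs_isvec:
            i += 1
        if rd_isvec:
            j += 1
        if not (rs_isvec or rd_isvec):
            break   # scalar-to-scalar: a plain MV, copy once
    return regs

# VSPLAT: scalar x5 broadcast into x8..x11
r = list(range(32))
twin_pred_mv(r, rd=8, rs=5, vl=4, ps=0b1111, pd=0b1111, rs_isvec=False)
assert r[8:12] == [5, 5, 5, 5]
```
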
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove. it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay. so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+
+
+## Under review / discussion: remove CSR vector length, use VSETVL
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**. This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
-## Implementation Paradigms
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
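Under the first option, a sketch of the intended semantics (hypothetical,
since the encoding was not finalised at the time of writing):

```python
# "VSETVL destreg, counterreg, #lenimmed": set the global VL *and* the
# destination register to MIN(counterreg, lenimmed).
def vsetvl(regs, destreg, counterreg, length):
    vl = min(regs[counterreg], length)
    regs[destreg] = vl
    return vl

regs = {10: 100, 11: 0}
assert vsetvl(regs, destreg=11, counterreg=10, length=16) == 16
assert regs[11] == 16
regs[10] = 5   # fewer elements remain than the requested length
assert vsetvl(regs, 11, 10, 16) == 5
```
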
+
+## Implementation Paradigms
TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
@@ -1659,6 +2275,20 @@ Also to be taken into consideration:
* Comprehensive vectorisation: FIFOs and internal parallelism
* Hybrid Parallelism
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
# TODO Research
> For great floating point DSPs check TI's C3x, C4X, and C6xx DSPs
@@ -1707,3 +2337,10 @@ TBD: floating-point compare and other exception handling
restarted if an exception occurs (VM page-table miss)
* Dot Product Vector
+* RVV slides 2017
+* Wavefront skipping using BRAMS
+* Streaming Pipelines
+* Barcelona SIMD Presentation
+*
+* Full Description (last page) of RVV instructions
+