+## Overflow registers in combination with predication
+
+**TODO**: propose overflow registers be actually one of the integer regs
+(flowing to multiple regs).
+
+**TODO**: propose "mask" (predication) registers likewise. The combination
+with standard RV instructions and overflow registers is extremely powerful;
+see the Aspex ASP.
+
+When integer overflow is stored in an easily-accessible bit (or another
+register), parallelisation turns this into a group of bits which can
+potentially interact with predication in interesting and powerful
+ways. For example, by taking the integer-overflow result as a predication
+field and shifting it by one, a predicated vectorised "add one" can emulate
+"carry" on arbitrary (unlimited) length addition.
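As a sanity check on the idea, here is a minimal Python model of the technique. It is purely illustrative: the 32-bit lane width, the `bignum_add` name and the list-of-words representation are all assumptions, not part of any spec.

```python
MASK = (1 << 32) - 1  # assumed 32-bit lanes

def bignum_add(a, b):
    """Add two equal-length little-endian word vectors, emulating carry
    via the overflow bit-field used as a shifted predication mask."""
    # vectorised add: each lane produces a result and an overflow bit
    res = [(x + y) & MASK for x, y in zip(a, b)]
    ovf = [int(x + y > MASK) for x, y in zip(a, b)]
    # shift the overflow bits up one lane and use them as the predicate
    # for a vectorised "add one"; repeat while carries keep rippling
    while any(ovf):
        pred = [0] + ovf[:-1]          # the shifted-by-one predication field
        ovf = [int(r + p > MASK) for r, p in zip(res, pred)]
        res = [(r + p) & MASK for r, p in zip(res, pred)]
    return res
```

Note that a carry out of the top lane is simply dropped, exactly as in a fixed-length register group.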
+
+However, despite RVV having made room for floating-point exceptions,
+neither RVV nor base RV has taken integer overflow (carry) into account,
+which makes proposing it quite challenging given that the relevant (Base)
+RV sections are frozen. Consequently it makes sense to forgo this feature.
+
+## Context Switch Example <a name="context_switch"></a>
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+ /* Macro for saving task context */
+ .macro portSAVE_CONTEXT
+ .global pxCurrentTCB
+ /* make room in stack */
+ addi sp, sp, -REGBYTES * 32
+
+ /* Save Context */
+ STORE x1, 0x0(sp)
+ STORE x2, 1 * REGBYTES(sp)
+ STORE x3, 2 * REGBYTES(sp)
+ ...
+ ...
+ STORE x30, 29 * REGBYTES(sp)
+ STORE x31, 30 * REGBYTES(sp)
+
+        /* Store current stack pointer in task control block (TCB) */
+ LOAD t0, pxCurrentTCB //pointer
+ STORE sp, 0x0(t0)
+ .endm
+
+ /* Saves current error program counter (EPC) as task program counter */
+ .macro portSAVE_EPC
+ csrr t0, mepc
+ STORE t0, 31 * REGBYTES(sp)
+ .endm
+
+    /* Saves current return address (RA) as task program counter */
+ .macro portSAVE_RA
+ STORE ra, 31 * REGBYTES(sp)
+ .endm
+
+ /* Macro for restoring task context */
+ .macro portRESTORE_CONTEXT
+
+ .global pxCurrentTCB
+ /* Load stack pointer from the current TCB */
+ LOAD sp, pxCurrentTCB
+ LOAD sp, 0x0(sp)
+
+ /* Load task program counter */
+ LOAD t0, 31 * REGBYTES(sp)
+ csrw mepc, t0
+
+ /* Run in machine mode */
+ li t0, MSTATUS_PRV1
+ csrs mstatus, t0
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ LOAD x4, 3 * REGBYTES(sp)
+ LOAD x5, 4 * REGBYTES(sp)
+ ...
+ ...
+ LOAD x30, 29 * REGBYTES(sp)
+ LOAD x31, 30 * REGBYTES(sp)
+
+ addi sp, sp, REGBYTES * 32
+ mret
+ .endm
+
+The important bits are the Load / Save context sequences, which may be
+replaced by firstly setting up the vectors and secondly using a *single*
+STORE (or LOAD), including optionally C.ST or C.LD, to indicate that the
+entire bank of registers is to be saved (or loaded):
+
+ /* a few things are assumed here: (a) that when switching to
+ M-Mode an entirely different set of CSRs is used from that
+ which is used in U-Mode and (b) that the M-Mode x1 and x4
+ vectors are also not used anywhere else in M-Mode, consequently
+ only need to be set up just the once.
+ */
+    .macro VectorSetup
+ MVECTORCSRx1 = 31, defaultlen
+ MVECTORCSRx4 = 28, defaultlen
+
+ /* Save Context */
+ SETVL x0, x0, 31 /* x0 ignored silently */
+ STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ SETVL x0, x0, 28 /* x0 ignored silently */
+ LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved *outside* the Save / Restore
+Context assembly code, into the VectorSetup macro, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
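To make the load/store-multiple effect concrete, here is a small Python model of how one vectorised STORE (and the matching LOAD) expands into contiguous per-register memory accesses. The function names and the dict-based register file and memory are illustrative assumptions only:

```python
REGBYTES = 8  # assumed XLEN/8 for RV64

def vectorised_store(regfile, vl, start_reg, mem, sp):
    """One STORE of a register marked as a VL-long vector expands
    into VL contiguous stores of regs start_reg..start_reg+vl-1."""
    for i in range(vl):
        mem[sp + i * REGBYTES] = regfile[start_reg + i]

def vectorised_load(regfile, vl, start_reg, mem, sp):
    """The matching LOAD refills the same contiguous register block."""
    for i in range(vl):
        regfile[start_reg + i] = mem[sp + i * REGBYTES]
```

Setting VL to 31 with x1 as the start register reproduces the portSAVE_CONTEXT store sequence in a single instruction.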
+
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
+
+> I was going through the C.LOAD / C.STORE section 12.3 of the V2.3-Draft
+> riscv-isa-manual in order to work out how to re-map RVV onto the standard
+> ISA, and came across an interesting comment at the bottom of pages 75
+> and 76:
+
+> " A common mechanism used in other ISAs to further reduce save/restore
+> code size is load-multiple and store-multiple instructions. "
+
+> Fascinatingly, due to Simple-V proposing to use the *standard* register
+> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
+> that: load-multiple and store-multiple instructions. Which brings us
+> on to this comment:
+
+> "For virtual memory systems, some data accesses could be resident
+> in physical memory and some could not, which requires a new restart
+> mechanism for partially executed instructions."
+
+> Which then of course brings us to the interesting question: how does RVV
+> cope with the scenario when, particularly with LD.X (Indexed / indirect
+> loads), a page fault occurs part-way through the load?
+
+> Has this been noted or discussed before?
+
+For applications-class platforms, the RVV exception model is
+element-precise (that is, if an exception occurs on element j of a
+vector instruction, elements 0..j-1 have completed execution and elements
+j+1..vl-1 have not executed).
+
+Certain classes of embedded platforms where exceptions are always fatal
+might choose to offer resumable/swappable interrupts but not precise
+exceptions.
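The element-precise model can be illustrated with a short Python sketch (the function name and the `faulty` set are illustrative; real hardware would report the element index via a CSR):

```python
def element_precise_op(dest, src1, src2, op, faulty=frozenset()):
    """If element j raises an exception, elements 0..j-1 have committed
    to dest and elements j..vl-1 are untouched, so a trap handler can
    fix the cause and resume execution at element j."""
    for j, (a, b) in enumerate(zip(src1, src2)):
        if j in faulty:
            return j            # trap, reporting the faulting element
        dest[j] = op(a, b)
    return None                 # completed with no exception
```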
+
+
+> Is RVV designed in any way to be re-entrant?
+
+Yes.
+
+
+> What would the implications be for instructions that were in a FIFO at
+> the time, in out-of-order and VLIW implementations, where partial decode
+> had taken place?
+
+The usual bag of tricks for maintaining precise exceptions applies to
+vector machines as well. Register renaming makes the job easier, and
+it's relatively cheaper for vectors, since the control cost is amortized
+over longer registers.
+
+
+> Would it be reasonable at least to say *bypass* (and freeze) the
+> instruction FIFO (drop down to a single-issue execution model temporarily)
+> for the purposes of executing the instructions in the interrupt (whilst
+> setting up the VM page), then re-continue the instruction with all
+> state intact?
+
+This approach has been done successfully, but it's desirable to be
+able to swap out the vector unit state to support context switches on
+exceptions that result in long-latency I/O.
+
+
+> Or would it be better to switch to an entirely separate secondary
+> hyperthread context?
+
+> Does anyone have any ideas or know if there is any academic literature
+> on solutions to this problem?
+
+The Vector VAX offered imprecise but restartable and swappable exceptions:
+http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf
+
+Sec. 4.6 of Krste's dissertation assesses some of
+the tradeoffs and references a bunch of related work:
+http://people.eecs.berkeley.edu/~krste/thesis.pdf
+
+
+----
+
+Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 F.P.
+exceptions" and thought, "hmmm that could go into a CSR, must re-read
+the section on FP state CSRs in RVV 0.4-Draft again", then i suddenly
+thought, "ah ha! what if the memory exceptions were, instead of having
+an immediate exception thrown, simply stored in a type of predication
+bit-field with a flag 'error: this element failed'?"
+
+Then, *after* the vector load (or store, or even operation) was
+performed, you could *then* raise an exception, at which point it
+would be possible (yes in software... I know....) to go "hmmm, these
+indexed operations didn't work, let's get them into memory by triggering
+page-loads", then *re-run the entire instruction* but this time with a
+"memory-predication CSR" that stops the already-performed operations
+(whether they be loads, stores or an arithmetic / FP operation) from
+being carried out a second time.
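A throwaway Python sketch of the idea (all names hypothetical: `done_mask` stands in for the imagined "memory-predication CSR"):

```python
def vector_load_with_fault_mask(mem, addrs, dest, done_mask):
    """Instead of trapping on the first missing page, record a per-element
    fault bit-field; a re-run after the page-loads skips (does not repeat)
    the elements that already completed."""
    faults = [False] * len(addrs)
    for i, a in enumerate(addrs):
        if done_mask[i]:
            continue                  # already performed on a prior run
        if a in mem:                  # "page resident"
            dest[i] = mem[a]
            done_mask[i] = True
        else:
            faults[i] = True          # software will trigger a page-load
    return faults
```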
+
+This theoretically could end up being done multiple times in an SMP
+environment, and also for LD.X there would be the remote (and annoying)
+outside possibility that the indexed memory address could end up being
+modified in the meantime.
+
+The advantage would be that the order of execution need not be
+sequential, which potentially could have some big advantages.
+Am still thinking through the implications as any dependent operations
+(particularly ones already decoded and moved into the execution FIFO)
+would still be there (and stalled). hmmm.
+
+----
+
+ > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+ > > VSETL r0, 8
+ > > FADD x1, x2, x3
+ >
+ > > x3[0]: ok
+ > > x3[1]: exception
+ > > x3[2]: ok
+ > > ...
+ > > ...
+ > > x3[7]: ok
+ >
+ > > what happens to result elements 2-7? those may be *big* results
+ > > (RV128)
+ > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+ >
+ > (you replied:)
+ >
+ > Thrown away.
+
+discussion then led to the question of OoO architectures
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it. It's hard to debug. You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility. (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn't apply only to your
+> model where the “standard register file” is used: it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a “zero bit” per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
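The difference between the two behaviours in this discussion can be sketched in Python (illustrative only; the function name and flag are assumptions):

```python
def predicated_add(dest, src1, src2, mask, zeroing=False):
    """Predicated vector add.  Masked-out elements are either left
    with their old destination value (merging, which enables the
    skip-and-overlap optimisation above) or forced to zero (zeroing)."""
    return [a + b if m else (0 if zeroing else d)
            for d, a, b, m in zip(dest, src1, src2, mask)]
```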
+
+
+## Implementation detail for scalar-only op detection <a name="scalar_detection"></a>
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+to minimise disruption to an existing design by detecting
+"scalar-only operations" and bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage).
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
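A behavioural Python model of the proposed lookup (the class name, and the convention that a set bit means "is vectorised" so fully-scalar detection becomes a NOR of the three read ports, are assumptions for illustration):

```python
class VectorisedBitmap:
    """Models the 32-entry, 1-bit-wide, 3R1W SRAM generated from the
    CSRs: one bit per register meaning 'this register is vectorised'."""
    def __init__(self, nregs=32):
        self.bits = [0] * nregs
    def set_vectorised(self, reg, flag=True):
        self.bits[reg] = int(flag)           # the 1W port
    def is_scalar_op(self, rd, rs1, rs2):
        # the 3R ports: read all three bits simultaneously; the op is
        # fully scalar (bypass-eligible) only if none of them is set
        return not (self.bits[rd] | self.bits[rs1] | self.bits[rs2])
```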
+
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove. it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay. so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+<https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/g3feFnAoKIM>
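The rule in the quote can be sketched in a few lines of Python (unsigned and truncating only; the function name and explicit width parameters are illustrative, not part of any spec):

```python
def elwidth_op(a, b, op, w1, w2, wd):
    """Inputs only upconvert, the operation runs at the largest of the
    three element widths, and downconversion happens only at the output."""
    wide = max(w1, w2, wd)
    x = a & ((1 << w1) - 1)            # each input at its declared width
    y = b & ((1 << w2) - 1)
    r = op(x, y) & ((1 << wide) - 1)   # operate at the largest size
    return r & ((1 << wd) - 1)         # downconvert only at the output
```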
+
+## Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**. This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
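The first option's semantics can be captured in a few lines of Python (the `MAX_HW_VL` clamp is an assumption about a hardware-imposed maximum vector length, not part of the decision above):

```python
MAX_HW_VL = 8   # assumed hardware limit on vector length

def vsetvl(counterreg, lenimmed):
    """VSETVL destreg, counterreg, #lenimmed: VL and destreg both
    become MIN(counterreg, lenimmed), clamped by the hardware."""
    return min(counterreg, lenimmed, MAX_HW_VL)   # written to VL and destreg

def strip_mined_loop(n):
    """Classic use of the MIN rule: process n elements in VL-sized chunks."""
    done = 0
    while done < n:
        vl = vsetvl(n - done, MAX_HW_VL)
        done += vl   # ... vectorised work on 'vl' elements would go here ...
    return done
```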
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems, or to reduce design complexity and the burden of
+design implementation and compliance in non-critical areas), right the
+way up to high-performance systems.
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+ required, however when a register is used that is detected (in hardware)
+ to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comprehensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
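The detect-and-trap split can be sketched as follows (all function names hypothetical):

```python
def issue(op, regs, vec_bits, hw_execute, sw_emulate):
    """Minimal compliance: if any source or destination register is
    marked vectorised, trap to the software emulator; otherwise the
    instruction proceeds in hardware as a plain scalar op."""
    if any(vec_bits.get(r, 0) for r in regs):
        return sw_emulate(op, regs)   # trap taken, emulate in software
    return hw_execute(op, regs)       # ordinary scalar hardware path
```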
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte/low-byte swapping, high-word/low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed shuffle, which requires
+a separate instruction cycle.
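A Python model of the cheap swap being proposed (the name `butterfly_swap` and the stride parameter are illustrative; stride 1 gives the byte-pair swap, stride 2 the word swap):

```python
def butterfly_swap(vec, stride):
    """Exchange element i with element i+stride inside each group of
    2*stride elements -- much cheaper in hardware than a full indexed
    shuffle, since the routing is fixed rather than data-dependent."""
    out = list(vec)
    for base in range(0, len(vec), 2 * stride):
        for i in range(base, base + stride):
            out[i], out[i + stride] = out[i + stride], out[i]
    return out
```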
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from rs2 into r0.
+Avoiding destruction of destination indices requires a copy of the entire
+vector in advance.
+
+TBD: floating-point compare and other exception handling
+