+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn’t apply to only your
+> model with the “standard register file” is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a “zero bit” per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
+
+
+## Implementation detail for scalar-only op detection <a name="scalar_detection"></a>
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
+## C.MV predicated src, predicated dest
+
+> Can this be usefully defined in such a way that it is
+> equivalent to vector gather-scatter on each source, followed by a
+> non-predicated vector-compare, followed by vector gather-scatter on the
+> result?
+
+## element width conversion: restrict or remove?
+
+summary: don't restrict / remove. it's fine.
+
+> > it has virtually no cost/overhead as long as you specify
+> > that inputs can only upconvert, and operations are always done at the
+> > largest size, and downconversion only happens at the output.
+>
+> okaaay. so that's a really good piece of implementation advice.
+> algorithms do require data size conversion, so at some point you need to
+> introduce the feature of upconverting and downconverting.
+>
+> > for int and uint, this is dead simple and fits well within the RVV pipeline
+> > without any critical path, pipeline depth, or area implications.
+
+<https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/g3feFnAoKIM>
+
+## Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
+length on all regs**. This section kept for historical reasons.
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+ a "twist": "if lengreg is vectorised, read the length from the CSR"
+* Other (TBD)