# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
+* TODO 23may2018: CSR-CAM-ify regfile tables
+* TODO 23may2018: zero-mark predication CSR
+* TODO 28may2018: sort out VSETVL: CSR length to be removed?
+
Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
* Integer Register N is a Predication Register (note: a key-value store)
* Vector Length CSR (VSETVL, VGETVL)
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
Notes:
* for the purposes of LOAD / STORE, Integer Registers which are
The reason for setting this limit is so that predication registers, when
marked as such, may fit into a single register as opposed to fanning out
over several registers. This keeps the implementation a little simpler.
+Note also (as also described in the VSETVL section) that the *minimum*
+for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
+and 31 for RV32 or RV64).
+
Note that RVV on top of Simple-V may choose to over-ride this decision.
## Vector-length CSRs
bringing parallelised opcodes down to 32-bit (when combined with C)
and having the benefit of being explicit.*
+## VSETVL
+
+NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
+with the instruction format remaining the same.
+
+VSETVL is slightly different from RVV in that the minimum vector length
+is required to be at least the number of registers in the register file,
+and no more than XLEN. This allows vector LOAD/STORE to be used to switch
+the entire bank of registers using a single instruction (see Appendix,
+"Context Switch Example"). The reason for limiting VSETVL to XLEN is
+down to the fact that predication bits fit into a single register of length
+XLEN bits.
+
+The second minor change is that when VSETVL is requested to be stored
+into x0, it is *ignored* silently.
+
+Unlike RVV, implementors *must* provide pseudo-parallelism (using sequential
+loops in hardware) if actual hardware-parallelism in the ALUs is not deployed.
+A hybrid is also permitted (as used in Broadcom's VideoCore-IV) however this
+must be *entirely* transparent to the ISA.
+
+### Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+ register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+ is placed in a dest register), and if the "required" length is longer
+ than the *available* length, the dest reg is set to the MIN of those
+ two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+ length and thus there is no way (at the time that VSETVL is called) to
+ know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+ the vector operation to the length specified in the destination
+ register's CSR, at the time that each instruction is issued...
+ except that that cannot possibly be guaranteed to match
+ with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+ VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+ destreg equal to MIN(counterreg, lenimmed), with register-based
+ variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+ a "twist": "if lengreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
+
## Branch Instruction:
Branch operations use standard RV opcodes that are reinterpreted to be
comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
src1 and src2).
-## LOAD / STORE Instructions
+## LOAD / STORE Instructions <a name="load_store"></a>
For full analysis of topological adaptation of RVV LOAD/STORE
see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
### [[alt_rvp]]
Primary benefit of Alt-RVP is the simplicity with which parallelism
[[!table data="""
31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
+## Context Switch Example <a name="context_switch"></a>
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+ /* Macro for saving task context */
+ .macro portSAVE_CONTEXT
+ .global pxCurrentTCB
+ /* make room in stack */
+ addi sp, sp, -REGBYTES * 32
+
+ /* Save Context */
+ STORE x1, 0x0(sp)
+ STORE x2, 1 * REGBYTES(sp)
+ STORE x3, 2 * REGBYTES(sp)
+ ...
+ ...
+ STORE x30, 29 * REGBYTES(sp)
+ STORE x31, 30 * REGBYTES(sp)
+
+ /* Store current stackpointer in task control block (TCB) */
+ LOAD t0, pxCurrentTCB //pointer
+ STORE sp, 0x0(t0)
+ .endm
+
+ /* Saves current error program counter (EPC) as task program counter */
+ .macro portSAVE_EPC
+ csrr t0, mepc
+ STORE t0, 31 * REGBYTES(sp)
+ .endm
+
+ /* Saves current return adress (RA) as task program counter */
+ .macro portSAVE_RA
+ STORE ra, 31 * REGBYTES(sp)
+ .endm
+
+ /* Macro for restoring task context */
+ .macro portRESTORE_CONTEXT
+
+ .global pxCurrentTCB
+ /* Load stack pointer from the current TCB */
+ LOAD sp, pxCurrentTCB
+ LOAD sp, 0x0(sp)
+
+ /* Load task program counter */
+ LOAD t0, 31 * REGBYTES(sp)
+ csrw mepc, t0
+
+ /* Run in machine mode */
+ li t0, MSTATUS_PRV1
+ csrs mstatus, t0
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ LOAD x4, 3 * REGBYTES(sp)
+ LOAD x5, 4 * REGBYTES(sp)
+ ...
+ ...
+ LOAD x30, 29 * REGBYTES(sp)
+ LOAD x31, 30 * REGBYTES(sp)
+
+ addi sp, sp, REGBYTES * 32
+ mret
+ .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+ /* a few things are assumed here: (a) that when switching to
+ M-Mode an entirely different set of CSRs is used from that
+ which is used in U-Mode and (b) that the M-Mode x1 and x4
+ vectors are also not used anywhere else in M-Mode, consequently
+ only need to be set up just the once.
+ */
+ .macroVectorSetup
+ MVECTORCSRx1 = 31, defaultlen
+ MVECTORCSRx4 = 28, defaultlen
+
+ /* Save Context */
+ SETVL x0, x0, 31 /* x0 ignored silently */
+ STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+ /* Restore registers,
+ Skip global pointer because that does not change */
+ LOAD x1, 0x0(sp)
+ SETVL x0, x0, 28 /* x0 ignored silently */
+ LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be being restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the macroVectorSetup, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
## Virtual Memory page-faults on LOAD/STORE
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)
-
-## Implementation Paradigms
+## Zero/Non-zero Predication
+
+>> > it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed. if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> > if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> > whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn’t apply to only your
+> model with the “standard register file” is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a “zero bit” per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
+
+
+## Implementation detail for scalar-only op detection <a name="scalar_detection"></a>
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation. Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed.
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no". 3R because you need
+> to check src1, src2 and dest simultaneously. the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+> is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
* Comphrensive vectorisation: FIFOs and internal parallelism
* Hybrid Parallelism
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
# TODO Research
> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
restarted if an exception occurs (VM page-table miss)
<https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
+* RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>
+* Barcelona SIMD Presentation <https://content.riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf>
+* <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>