add images

[libreriscv.git] / simple_v_extension.mdwn
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn

index 6d1c9e5a52cb3f80cdc93f8ec2612080965c3911..e57f69c7f9c6f3a49d1ea54ccbac194ea6a45b8f 100644 (file)
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,9 @@
  # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
  
+* TODO 23may2018: CSR-CAM-ify regfile tables
+* TODO 23may2018: zero-mark predication CSR
+* TODO 28may2018: sort out VSETVL: CSR length to be removed?
+
  Key insight: Simple-V is intended as an abstraction layer to provide
  a consistent "API" to parallelisation of existing *and future* operations.
  *Actual* internal hardware-level parallelism is *not* required, such
@@ -367,6 +371,11 @@ precedent in the setting of MISA to enable / disable extensions).
  * Integer Register N is a Predication Register (note: a key-value store)
  * Vector Length CSR (VSETVL, VGETVL)
  
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
  Notes:
  
  * for the purposes of LOAD / STORE, Integer Registers which are
@@ -457,6 +466,10 @@ and so on).
  The reason for setting this limit is so that predication registers, when
  marked as such, may fit into a single register as opposed to fanning out
  over several registers.  This keeps the implementation a little simpler.
+Note also (as also described in the VSETVL section) that the *minimum*
+for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
+and 31 for RV32 or RV64).
+
  Note that RVV on top of Simple-V may choose to over-ride this decision.
  
  ## Vector-length CSRs
@@ -586,6 +599,63 @@ instruction codes in Quadrant 1 could be used as a parallelism prefix,
  bringing parallelised opcodes down to 32-bit (when combined with C)
  and having the benefit of being explicit.*
  
+## VSETVL
+
+NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
+with the instruction format remaining the same.
+
+VSETVL is slightly different from RVV in that the minimum vector length
+is required to be at least the number of registers in the register file,
+and no more than XLEN.  This allows vector LOAD/STORE to be used to switch
+the entire bank of registers using a single instruction (see Appendix,
+"Context Switch Example").  The reason for limiting VSETVL to XLEN is
+down to the fact that predication bits fit into a single register of length
+XLEN bits.
+
+The second minor change is that when VSETVL is requested to be stored
+into x0, it is *ignored* silently.
+
+Unlike RVV, implementors *must* provide pseudo-parallelism (using sequential
+loops in hardware) if actual hardware-parallelism in the ALUs is not deployed.
+A hybrid is also permitted (as used in Broadcom's VideoCore-IV) however this
+must be *entirely* transparent to the ISA.
+
+### Under review / discussion: remove CSR vector length, use VSETVL <a name="vsetvl"></a>
+
+So the issue is as follows:
+
+* CSRs are used to set the "span" of a vector (how many of the standard
+  register file to contiguously use)
+* VSETVL in RVV works as follows: it sets the vector length (copy of which
+  is placed in a dest register), and if the "required" length is longer
+  than the *available* length, the dest reg is set to the MIN of those
+  two.
+* **HOWEVER**... in SV, *EVERY* vector register has its own separate
+  length and thus there is no way (at the time that VSETVL is called) to
+  know what to set the vector length *to*.
+* At first glance it seems that it would be perfectly fine to just limit
+  the vector operation to the length specified in the destination
+  register's CSR, at the time that each instruction is issued...
+  except that that cannot possibly be guaranteed to match
+  with the value *already loaded into the target register from VSETVL*.
+
+Therefore a different approach is needed.
+
+Possible options include:
+
+* Removing the CSR "Vector Length" and always using the value from
+  VSETVL.  "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
+  destreg equal to MIN(counterreg, lenimmed), with register-based
+  variant "VSETVL destreg, counterreg, lenreg" doing the same.
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lengreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call).  This is probably desirable behaviour.
+
  ## Branch Instruction:
  
  Branch operations use standard RV opcodes that are reinterpreted to be
@@ -727,7 +797,7 @@ Notes:
    comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
    src1 and src2).
  
-## LOAD / STORE Instructions
+## LOAD / STORE Instructions <a name="load_store"></a>
  
  For full analysis of topological adaptation of RVV LOAD/STORE
  see [[v_comparative_analysis]].  All three types (LD, LD.S and LD.X)
@@ -906,6 +976,11 @@ This section compares the various parallelism proposals as they stand,
  including traditional SIMD, in terms of features, ease of implementation,
  complexity, flexibility, and die area.
  
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
  ### [[alt_rvp]]
  
  Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -1185,7 +1260,7 @@ adequate space to interpret it in a similar fashion:
  [[!table  data="""
  31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7       | 6....0 |
  imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
- 1      |        6   | 5    | 5    | 3      | 4        | 1       |   7    |
+ 1      | 6          | 5    | 5    | 3      | 4        | 1       |   7    |
     offset[12,10:5]  || src2 | src1 | BEQ    | offset[11,4:1]    || BRANCH |
  """]]
  
@@ -1321,16 +1396,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
  
  vew may be one of the following (giving a table "bytestable", used below):
  
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
  
  Pseudocode for vector length taking CSR SIMD-bitwidth into account:
  
@@ -1448,7 +1523,7 @@ So the question boils down to:
  Whilst the above may seem to be severe minuses, there are some strong
  pluses:
  
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
  * Smaller reduction of P's opcode space: around 10%.
  * The potential to use Compressed instructions in both Vector and SIMD
    due to the overloading of register meaning (implicit vectorisation,
@@ -1498,6 +1573,112 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
  makes proposing it quite challenging given that the relevant (Base) RV
  sections are frozen.  Consequently it makes sense to forgo this feature.
  
+## Context Switch Example <a name="context_switch"></a>
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction).  An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+    /* Macro for saving task context */
+    .macro portSAVE_CONTEXT
+        .global        pxCurrentTCB
+        /* make room in stack */
+        addi   sp, sp, -REGBYTES * 32
+
+        /* Save Context */
+        STORE  x1, 0x0(sp)
+        STORE  x2, 1 * REGBYTES(sp)
+        STORE  x3, 2 * REGBYTES(sp)
+        ...
+        ...
+        STORE  x30, 29 * REGBYTES(sp)
+        STORE  x31, 30 * REGBYTES(sp)
+        
+        /* Store current stackpointer in task control block (TCB) */
+        LOAD   t0, pxCurrentTCB        //pointer
+        STORE  sp, 0x0(t0)
+        .endm
+
+    /* Saves current error program counter (EPC) as task program counter */
+    .macro portSAVE_EPC
+        csrr   t0, mepc
+        STORE  t0, 31 * REGBYTES(sp)
+        .endm
+
+    /* Saves current return adress (RA) as task program counter */
+    .macro portSAVE_RA
+        STORE  ra, 31 * REGBYTES(sp)
+        .endm
+
+    /* Macro for restoring task context */
+    .macro portRESTORE_CONTEXT
+
+        .global        pxCurrentTCB
+        /* Load stack pointer from the current TCB */
+        LOAD   sp, pxCurrentTCB
+        LOAD   sp, 0x0(sp)
+
+        /* Load task program counter */
+        LOAD   t0, 31 * REGBYTES(sp)
+        csrw   mepc, t0
+
+        /* Run in machine mode */
+        li             t0, MSTATUS_PRV1
+        csrs   mstatus, t0
+
+        /* Restore registers,
+           Skip global pointer because that does not change */
+        LOAD   x1, 0x0(sp)
+        LOAD   x4, 3 * REGBYTES(sp)
+        LOAD   x5, 4 * REGBYTES(sp)
+        ...
+        ...
+        LOAD   x30, 29 * REGBYTES(sp)
+        LOAD   x31, 30 * REGBYTES(sp)
+
+        addi   sp, sp, REGBYTES * 32
+        mret
+        .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+    /* a few things are assumed here: (a) that when switching to
+       M-Mode an entirely different set of CSRs is used from that
+       which is used in U-Mode and (b) that the M-Mode x1 and x4
+       vectors are also not used anywhere else in M-Mode, consequently
+       only need to be set up just the once.
+     */
+    .macroVectorSetup
+        MVECTORCSRx1 = 31, defaultlen
+        MVECTORCSRx4 = 28, defaultlen
+    
+        /* Save Context */
+        SETVL x0, x0, 31 /* x0 ignored silently */
+        STORE  x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth 
+        
+        /* Restore registers,
+           Skip global pointer because that does not change */
+        LOAD   x1, 0x0(sp)
+        SETVL x0, x0, 28 /* x0 ignored silently */
+        LOAD   x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be being restored.  If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the macroVectorSetup, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
  ## Virtual Memory page-faults on LOAD/STORE
  
  
@@ -1636,8 +1817,79 @@ discussion then led to the question of OoO architectures
  > relevant, is that the imprecise model increases the size of the context
  > structure, as the microarchitectural guts have to be spilled to memory.)
  
-
-## Implementation Paradigms
+## Zero/Non-zero Predication
+
+>> >  it just occurred to me that there's another reason why the data
+>> > should be left instead of zeroed.  if the standard register file is
+>> > used, such that vectorised operations are translated to mean "please
+>> > insert multiple register-contiguous operations into the instruction
+>> > FIFO" and predication is used to *skip* some of those, then if the
+>> > next "vector" operation uses the (standard) registers that were masked
+>> > *out* of the previous operation it may proceed without blocking.
+>> >
+>> >  if however zeroing is made mandatory then that optimisation becomes
+>> > flat-out impossible to deploy.
+>> >
+>> >  whilst i haven't fully thought through the full implications, i
+>> > suspect RVV might also be able to benefit by being able to fit more
+>> > overlapping operations into the available SRAM by doing something
+>> > similar.
+>
+>
+> Luke, this is called density time masking. It doesn’t apply to only your
+> model with the “standard register file” is used. it applies to any
+> architecture that attempts to speed up by skipping computation and writeback
+> of masked elements.
+>
+> That said, the writing of zeros need not be explicit. It is possible to add
+> a “zero bit” per element that, when set, forces a zero to be read from the
+> vector (although the underlying storage may have old data). In this case,
+> there may be a way to implement DTM as well.
+
+
+## Implementation detail for scalar-only op detection <a name="scalar_detection"></a>
+
+Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
+worthwhile.
+
+Note 2: this is just one possible implementation.  Another implementation
+may choose to treat *all* operations as vectorised (including treating
+scalars as vectors of length 1), choosing to add an extra pipeline stage
+dedicated to *all* instructions.
+
+This section *specifically* covers the implementor's freedom to choose
+that they wish to minimise disruption to an existing design by detecting
+"scalar-only operations", bypassing the vectorisation phase (which may
+or may not require an additional pipeline stage)
+
+[[scalardetect.png]]
+
+>> For scalar ops an implementation may choose to compare 2-3 bits through an
+>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr
+>> FIFO).
+
+> Those bits cannot be known until after the registers are decoded from the
+> instruction and a lookup in the "vector length table" has completed. 
+> Considering that one of the reasons RISC-V keeps registers in invariant
+> positions across all instructions is to simplify register decoding, I expect
+> that inserting an SRAM read would lengthen the critical path in most
+> implementations.
+
+reply:
+
+> briefly: the trick i mentioned about ANDing bits together to check if
+> an op was fully-scalar or not was to be read out of a single 32-bit
+> 3R1W SRAM (64-bit if FPU exists).  the 32/64-bit SRAM contains 1 bit per
+> register indicating "is register vectorised yes no".  3R because you need
+> to check src1, src2 and dest simultaneously.  the entries are *generated*
+> from the CSRs and are an optimisation that on slower embedded systems
+> would likely not be needed.
+
+>  is there anything unreasonable that anyone can foresee about that?
+> what are the down-sides?
+
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
  
  TODO: assess various implementation paradigms.  These are listed roughly
  in order of simplicity (minimum compliance, for ultra-light-weight
@@ -1659,6 +1911,20 @@ Also to be taken into consideration:
  * Comphrensive vectorisation: FIFOs and internal parallelism
  * Hybrid Parallelism
  
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised.  On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support.  In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
  # TODO Research
  
  > For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
@@ -1707,3 +1973,8 @@ TBD: floating-point compare and other exception handling
    restarted if an exception occurs (VM page-table miss)
    <https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
  * Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
+* RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf> 
+* Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>
+* Barcelona SIMD Presentation <https://content.riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf>
+* <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>