# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
-* TODO 23may2018: CSR-CAM-ify regfile tables
-* TODO 23may2018: zero-mark predication CSR
-* TODO 28may2018: sort out VSETVL: CSR length to be removed?
-* TODO 09jun2018: Chennai Presentation more up-to-date
-* TODO 09jun2019: elwidth only 4 values (dflt, dflt/2, 8, 16)
-* TODO 09jun2019: extra register banks (future option)
-* TODO 09jun2019: new Reg CSR table (incl. packed=Y/N)
-
-
Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
[[!toc ]]
* Throw an exception. Whether that actually results in spawning threads
as part of the trap-handling remains to be seen.
+# Under consideration <a name="issues"></a>
+
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+ (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+ (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops is it worthwhile adding a "start offset"?
+ (a bit like misaligned addressing... for registers)
+ or just use predication to skip start?
+
+## Future (extra) bank be included (made mandatory)
+
+The implications of expanding the *standard* register file from
+32 entries per bank to 64 per bank is quite an extensive architectural
+change. Also it has implications for context-switching.
+
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV. SV's design
+ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.
+
+## How large should the Register and Predication CSR key-value stores be?
+
+This is something that definitely needs actual evaluation and for
+code to be run and the results analysed. At the time of writing
+(12jul2018) that is too early to tell. An approximate best-guess
+however would be 16 entries.
+
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that it would be used
+for performance reasons but instead for reducing instruction count.
+The number of CSR entries therefore has to be considered extremely
+carefully.
+
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?
+
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point. However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
+
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free. However just one does not make
+much sense (as it one only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
+
+This *in addition* - in the RV32E case - if an RV32E implementation
+happens also to support U/S/M modes. This would be considered quite
+rare but not outside of the realm of possibility.
+
+Conclusion: all needs careful analysis and future work.
+
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+
+On balance it's a neat idea however it does seem to be one where the
+benefits are not really clear. It would however obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails), however
+the "fly in the ointment" is that x0 is hard-coded to "zero". The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
+
+## Can CLIP be done as a CSR (mode, like elwidth)
+
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
+
+No details are given however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below)
+
+## SIMD saturation (etc.) also set as a mode?
+
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is for example).
+
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+ regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 1 : 0)
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
+
+## 8/16-bit ops is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example rather than take 2 32-bit registers divided into 4 8-bit
+elements and have them ADDed element-for-element as follows:
+
+ r3[0] = add r4[0], r6[0]
+ r3[1] = add r4[1], r6[1]
+ r3[2] = add r4[2], r6[2]
+ r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+ r3[0] = add r4[1], r6[0]
+ r3[1] = add r4[2], r6[1]
+ r3[2] = add r4[3], r6[2]
+ r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+ either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+ penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
+
# Impementing V on top of Simple-V
With Simple-V converting the original RVV draft concept-for-concept
### Example Instruction translation: <a name="example_translation"></a>
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+Instructions "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO. r7 and r4 are marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+Instructions "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO. r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
## Example of vector / vector, vector / scalar, scalar / scalar => vector add
- register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
- register CSRpredicate[XLEN][4]; # 2^4 is max vector length
- register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
- register x[32][XLEN];
-
- function op_add(rd, rs1, rs2, predr)
- {
- /* note that this is ADD, not PADD */
- int i, id, irs1, irs2;
- # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
- # also destination makes no sense as a scalar but what the hell...
- for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
- if (CSRpredicate[predr][i]) # i *think* this is right...
- x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
- # now increment the idxs
- if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
- id += 1;
- if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
- irs1 += 1;
- if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
- irs2 += 1;
- }
+ function op_add(rd, rs1, rs2) # add not VADD!
+ int i, id=0, irs1=0, irs2=0;
+ rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+ rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+ rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+ predval = get_pred_val(FALSE, rd);
+ for (i = 0; i < VL; i++)
+ if (predval & 1<<i) # predication uses intregs
+ ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+ if (int_vec[rd ].isvector) { id += 1; }
+ if (int_vec[rs1].isvector) { irs1 += 1; }
+ if (int_vec[rs2].isvector) { irs2 += 1; }
## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
* <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
* Full Description (last page) of RVV instructions
<https://inst.eecs.berkeley.edu/~cs152/sp18/handouts/lab4-1.0.pdf>
+* PULP Low-energy Cluster Vector Processor
+ <http://iis-projects.ee.ethz.ch/index.php/Low-Energy_Cluster-Coupled_Vector_Coprocessor_for_Special-Purpose_PULP_Acceleration>