diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 8209cd73e..0642ce926 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,7 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+**Note: this document is out of date and records early ideas and discussions**
+
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -9,8 +11,13 @@ instruction queue (FIFO), pending execution.
 
 *Actual* parallelism, if added independently of Simple-V in the form
 of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
+
+* Talk slides:
+* Specification: now moved to its own page: [[specification]]
 
 [[!toc ]]
 
@@ -21,7 +28,7 @@ requirements: power-conscious, area-conscious, and performance-conscious
 designs all pull an ISA and its implementation in different conflicting
 directions, as do the specific intended uses for any given implementation.
 
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
 whilst each extremely powerful in their own right and clearly desirable,
 are also:
 
@@ -31,8 +38,8 @@ are also:
   analysis and review purposes) prohibitively expensive
 * Both contain partial duplication of pre-existing RISC-V instructions
   (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
 * Both require that their respective parallelism paradigm be implemented
   along-side and integral to their respective functionality *or not at all*.
 * Both independently have methods for introducing parallelism that
@@ -52,10 +59,13 @@ details outlined in the Appendix), the key points being:
 * Vectorisation typically includes much more comprehensive memory load
   and store schemes (unit stride, constant-stride and indexed), which
   in turn have ramifications: virtual memory misses (TLB cache misses)
-  and even multiple page-faults... all caused by a *single instruction*.
+  and even multiple page-faults... all caused by a *single instruction*,
+  yet with a clear benefit that the regularisation of LOAD/STOREs can
+  be optimised for minimal impact on caches and maximised throughput.
 * By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
   to pages), and these load/stores have absolutely nothing to do with the
-  SIMD / ALU engine, no matter how wide the operand.
+  SIMD / ALU engine, no matter how wide the operand. Simplicity, but with
+  more impact on instruction and data caches (see the sketch below).
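+
+To illustrate the trade-off, here is a minimal sketch in C-style pseudocode
+(not taken from either proposal: the four-element width, the stride and the
+memory-access helper names are assumptions purely for illustration):
+
+    // Vector: one instruction issues N independent element accesses,
+    // each of which may cross a page boundary and fault separately.
+    for (int i = 0; i < 4; i++)
+        vreg[vd][i] = mem_read64(base + i * stride); // up to 4 TLB misses
+
+    // SIMD: a single ordinary aligned 128-bit load (at most one fault);
+    // the ALU then simply treats the register as 4 packed elements.
+    simd_reg = mem_read128_aligned(base);
+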
Overall it makes a huge amount of sense to have a means and method of introducing instruction parallelism in a flexible way that provides @@ -123,7 +133,8 @@ reducing power consumption for the same. SIMD again has a severe disadvantage here, over Vector: huge proliferation of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and have to then have operations *for each and between each*. It gets very -messy, very quickly. +messy, very quickly: *six* separate dimensions giving an O(N^6) instruction +proliferation profile. The V-Extension on the other hand proposes to set the bit-width of future instructions on a per-register basis, such that subsequent instructions @@ -135,7 +146,7 @@ burdensome to implementations, given that instruction decode already has to direct the operation to a correctly-sized width ALU engine, anyway. Not least: in places where an ISA was previously constrained (due for -whatever reason, including limitations of the available operand spcace), +whatever reason, including limitations of the available operand space), implicit bit-width allows the meaning of certain operations to be type-overloaded *without* pollution or alteration of frozen and immutable instructions, in a fully backwards-compatible fashion. @@ -208,10 +219,10 @@ Interestingly, none of this complexity is faced in SIMD architectures... but then they do not get the opportunity to optimise for highly-streamlined memory accesses either. -With the "bang-per-buck" ratio being so high and the direct improvement -in L1 Instruction Cache usage, as well as the opportunity to optimise -L1 and L2 cache usage, the case for including Vector LOAD/STORE is -compelling. +With the "bang-per-buck" ratio being so high and the indirect improvement +in L1 Instruction Cache usage (reduced instruction count), as well as +the opportunity to optimise L1 and L2 cache usage, the case for including +Vector LOAD/STORE is compelling. ## Mask and Tagging (Predication) @@ -232,8 +243,8 @@ So these are the ways in which conditional execution may be implemented: * explicit compare and branch: BNE x, y -> offs would jump offs instructions if x was not equal to y * explicit store of tag condition: CMP x, y -> tagbit -* implicit (condition-code) ADD results in a carry, carry bit implicitly - (or sometimes explicitly) goes into a "tag" (mask) register +* implicit (condition-code) such as ADD results in a carry, carry bit + implicitly (or sometimes explicitly) goes into a "tag" (mask) register The first of these is a "normal" branch method, which is flat-out impossible to parallelise without look-ahead and effectively rewriting instructions. @@ -304,256 +315,9 @@ In particular: i.e. *without* requiring a super-scalar or out-of-order architecture, but doing a proper, full job (ZOLC) is an entirely different matter. -Constructing a SIMD/Simple-Vector proposal based around four of these five +Constructing a SIMD/Simple-Vector proposal based around four of these six requirements would therefore seem to be a logical thing to do. -# Instruction Format - -The instruction format for Simple-V does not actually have *any* explicit -compare operations, *any* arithmetic, floating point or memory instructions. -Instead it *overloads* pre-existing branch operations into predicated -variants, and implicitly overloads arithmetic operations and LOAD/STORE -depending on implicit CSR configurations for both vector length and -bitwidth. *This includes Compressed instructions* as well as future ones. 
- -* For analysis of RVV see [[v_comparative_analysis]] which begins to - outline topologically-equivalent mappings of instructions -* Also see Appendix "Retro-fitting Predication into branch-explicit ISA" - for format of Branch opcodes. - -**TODO**: *analyse and decide whether the implicit nature of predication -as proposed is or is not a lot of hassle, and if explicit prefixes are -a better idea instead. Parallelism therefore effectively may end up -as always being 64-bit opcodes (32 for the prefix, 32 for the instruction) -with some opportunities for to use Compressed bringing it down to 48. -Also to consider is whether one or both of the last two remaining Compressed -instruction codes in Quadrant 1 could be used as a parallelism prefix, -bringing parallelised opcodes down to 32-bit (when combined with C) -and having the benefit of being explicit.* - -## Branch Instruction: - -This is the overloaded table for Integer-base Branch operations. Opcode -(bits 6..0) is set in all cases to 1100011. - -[[!table data=""" -31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 | -imm[12|10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | -7 | 5 | 5 | 3 | 4 | 1 | 7 | -reserved | src2 | src1 | BPR | predicate rs3 || BRANCH | -reserved | src2 | src1 | 000 | predicate rs3 || BEQ | -reserved | src2 | src1 | 001 | predicate rs3 || BNE | -reserved | src2 | src1 | 010 | predicate rs3 || rsvd | -reserved | src2 | src1 | 011 | predicate rs3 || rsvd | -reserved | src2 | src1 | 100 | predicate rs3 || BLE | -reserved | src2 | src1 | 101 | predicate rs3 || BGE | -reserved | src2 | src1 | 110 | predicate rs3 || BLTU | -reserved | src2 | src1 | 111 | predicate rs3 || BGEU | -"""]] - -This is the overloaded table for Floating-point Predication operations. -Interestingly no change is needed to the instruction format because -FP Compare already stores a 1 or a zero in its "rd" integer register -target, i.e. it's not actually a Branch at all: it's a compare. -The target needs to simply change to be a predication bitfield. - -As with -Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011. -Likewise Single-precision, fmt bits 26..25) is still set to 00. -Double-precision is still set to 01, whilst Quad-precision -appears not to have a definition in V2.3-Draft (but should be unaffected). - -It is however noted that an entry "FNE" (the opposite of FEQ) is missing, -and whilst in ordinary branch code this is fine because the standard -RVF compare can always be followed up with an integer BEQ or a BNE (or -a compressed comparison to zero or non-zero), in predication terms that -becomes more of an impact as an explicit (scalar) instruction is needed -to invert the predicate. An additional encoding funct3=011 is therefore -proposed to cater for this. - -[[!table data=""" -31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 | -funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode | -5 | 2 | 5 | 5 | 3 | 4 | 7 | -10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ | -10100 | 00/01/11 | src2 | src1 | *011* | pred rs3 | FNE | -10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT | -10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE | -"""]] - -Note (**TBD**): floating-point exceptions will need to be extended -to cater for multiple exceptions (and statuses of the same). The -usual approach is to have an array of status codes and bit-fields, -and one exception, rather than throw separate exceptions for each -Vector element. 
- -In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given -for predicated compare operations of function "cmp": - - for (int i=0; i 1; - s2 = CSRvectorlen[src2] > 1; - for (int i=0; i - -There are a number of CSRs needed, which are used at the instruction -decode phase to re-interpret standard RV opcodes (a practice that has -precedent in the setting of MISA to enable / disable extensions). - -* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Integer Register N is a Predication Register (note: a key-value store) -* Vector Length CSR (VSETVL, VGETVL) - -Notes: - -* for the purposes of LOAD / STORE, Integer Registers which are - marked as a Vector will result in a Vector LOAD / STORE. -* Vector Lengths are *not* the same as vsetl but are an integral part - of vsetl. -* Actual vector length is *multipled* by how many blocks of length - "bitwidth" may fit into an XLEN-sized register file. -* Predication is a key-value store due to the implicit referencing, - as opposed to having the predicate register explicitly in the instruction. - -## Predication CSR - -The Predication CSR is a key-value store indicating whether, if a given -destination register (integer or floating-point) is referred to in an -instruction, it is to be predicated. The first entry is whether predication -is enabled. The second entry is whether the register index refers to a -floating-point or an integer register. The third entry is the index -of that register which is to be predicated (if referred to). The fourth entry -is the integer register that is treated as a bitfield, indexable by the -vector element index. - -| RegNo | 6 | 5 | (4..0) | (4..0) | -| ----- | - | - | ------- | ------- | -| r0 | pren0 | i/f | regidx | predidx | -| r1 | pren1 | i/f | regidx | predidx | -| .. | pren.. | i/f | regidx | predidx | -| r15 | pren15 | i/f | regidx | predidx | - -The Predication CSR Table is a key-value store, so implementation-wise -it will be faster to turn the table around (maintain topologically -equivalent state): - - fp_pred_enabled[32]; - int_pred_enabled[32]; - for (i = 0; i < 16; i++) - if CSRpred[i].pren: - idx = CSRpred[i].regidx - predidx = CSRpred[i].predidx - if CSRpred[i].type == 0: # integer - int_pred_enabled[idx] = 1 - int_pred_reg[idx] = predidx - else: - fp_pred_enabled[idx] = 1 - fp_pred_reg[idx] = predidx - -So when an operation is to be predicated, it is the internal state that -is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following -pseudo-code for operations is given, where p is the explicit (direct) -reference to the predication register to be used: - - for (int i=0; i What does an ADD of two different-sized vectors do in simple-V? -Vector lengths are interpreted as meaning "any instruction referring to -r(N) generates implicit identical instructions referring to registers -r(N+M-1) where M is the Vector Length". Vector Lengths may be set to -use up to 16 registers in the register file. +* if the two source operands are not the same, throw an exception. +* if the destination operand is also a vector, and the source is longer + than the destination, throw an exception. -One separate CSR table is needed for each of the integer and floating-point -register files: +> And what about instructions like JALR?  
+> What does jumping to a vector do? -| RegNo | (3..0) | -| ----- | ------ | -| r0 | vlen0 | -| r1 | vlen1 | -| .. | vlen.. | -| r31 | vlen31 | +* Throw an exception. Whether that actually results in spawning threads + as part of the trap-handling remains to be seen. -An array of 32 4-bit CSRs is needed (4 bits per register) to indicate -whether a register was, if referred to in any standard instructions, -implicitly to be treated as a vector. A vector length of 1 indicates -that it is to be treated as a scalar. Vector lengths of 0 are reserved. +# Under consideration -Internally, implementations may choose to use the non-zero vector length -to set a bit-field per register, to be used in the instruction decode phase. -In this way any standard (current or future) operation involving -register operands may detect if the operation is to be vector-vector, -vector-scalar or scalar-scalar (standard) simply through a single -bit test. +From the Chennai 2018 slides the following issues were raised. +Efforts to analyse and answer these questions are below. + +* Should future extra bank be included now? +* How many Register and Predication CSRs should there be? + (and how many in RV32E) +* How many in M-Mode (for doing context-switch)? +* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)? +* Can CLIP be done as a CSR (mode, like elwidth) +* SIMD saturation (etc.) also set as a mode? +* Include src1/src2 predication on Comparison Ops? + (same arrangement as C.MV, with same flexibility/power) +* 8/16-bit ops is it worthwhile adding a "start offset"? + (a bit like misaligned addressing... for registers) + or just use predication to skip start? -Note that when using the "vsetl rs1, rs2" instruction (caveat: when the -bitwidth is specifically not set) it becomes: +## Future (extra) bank be included (made mandatory) - CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2) +The implications of expanding the *standard* register file from +32 entries per bank to 64 per bank is quite an extensive architectural +change. Also it has implications for context-switching. -This is in contrast to RVV: +Therefore, on balance, it is not recommended and certainly should +not be made a *mandatory* requirement for the use of SV. SV's design +ethos is to be minimally-disruptive for implementors to shoe-horn +into an existing design. - CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2) +## How large should the Register and Predication CSR key-value stores be? -## Element (SIMD) bitwidth CSRs +This is something that definitely needs actual evaluation and for +code to be run and the results analysed. At the time of writing +(12jul2018) that is too early to tell. An approximate best-guess +however would be 16 entries. -Element bitwidths may be specified with a per-register CSR, and indicate -how a register (integer or floating-point) is to be subdivided. +RV32E however is a special case, given that it is highly unlikely +(but not outside the realm of possibility) that it would be used +for performance reasons but instead for reducing instruction count. +The number of CSR entries therefore has to be considered extremely +carefully. -| RegNo | (2..0) | -| ----- | ------ | -| r0 | vew0 | -| r1 | vew1 | -| .. | vew.. | -| r31 | vew31 | +## How many CSR entries in M-Mode or S-Mode (for context-switching)? -vew may be one of the following (giving a table "bytestable", used below): +The minimum required CSR entries would be 1 for each register-bank: +one for integer and one for floating-point. 
However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.
 
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free. However just one does not make
+much sense (as it only covers scalar-vector ops) so it is more
+likely that at least two extra would be needed.
 
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
+This applies *in addition* - in the RV32E case - if an RV32E implementation
+happens also to support U/S/M modes. This would be considered quite
+rare but not outside of the realm of possibility.
 
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
+Conclusion: all needs careful analysis and future work.
 
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
 
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
+On balance it's a neat idea, however it does seem to be one where the
+benefits are not really clear. It would however obviate the need for
+an exception to be raised if the VL runs out of registers to put
+things in (gets to x31, tries a non-existent x32 and fails); however
+the "fly in the ointment" is that x0 is hard-coded to "zero". The
+increment therefore would need to be double-stepped to skip over x0.
+Some microarchitectures could run into difficulties (SIMD-like ones
+in particular) so it needs a lot more thought.
 
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
+## Can CLIP be done as a CSR (mode, like elwidth)
 
-# Exceptions
+RVV appears to be going this way. At the time of writing (12jun2018)
+it's noted that in the V2.3-Draft V0.4 RVV Chapter, RVV intends to do
+clip by way of exactly this method: setting a "clip mode" in a CSR.
 
-> What does an ADD of two different-sized vectors do in simple-V?
+No details are given, however the most sensible thing to have would be
+to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
+extra bits specifying the type of clipping to be carried out, on
+a per-register basis. Other bits may be used for other purposes
+(see SIMD saturation below).
 
-* if the two source operands are not the same, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
+## SIMD saturation (etc.) also set as a mode?
 
+Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
+may also need extra details (what the saturation maximum is, for example).
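+
+As a rough sketch only (the mode names are invented for illustration;
+the actual CSR encoding is exactly what remains to be decided), the
+writeback stage might consult such per-register mode bits as follows:
+
+    // hypothetical 2-bit per-register mode field; smax/smin/umax are
+    // the bounds implied by the destination's element width (elwidth)
+    switch (CSRreg[rd].satmode) {
+    case SAT_NONE:      // default: modulo wrap-around, as at present
+        break;
+    case SAT_SIGNED:    // clamp into the signed elwidth range
+        result = MAX(MIN(result, smax), smin);
+        break;
+    case SAT_UNSIGNED:  // clamp into the unsigned elwidth range
+        result = MIN(result, umax);
+        break;
+    }
+
+CLIP, as described in the section above, would consult the same extended
+CSR entry in an analogous way.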
 
-* Throw an exception.  Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
 
+## Include src1/src2 predication on Comparison Ops?
+
+In the C.MV (and other ops - see "C.MV Instruction"), the decision
+was taken, unlike in ADD (etc.) which are 3-operand ops, to use
+*both* the src *and* dest predication masks to give an extremely
+powerful and flexible instruction that covers a huge number of
+"traditional" vector opcodes.
+
+The natural question therefore to ask is: where else could this
+flexibility be deployed? What about comparison operations?
+
+Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
+predicated comparison operations are actually a *three* operand
+instruction:
+
+    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i
+
+Therefore at first glance it does not make sense to use src1 and src2
+predication masks, as it breaks the rule of 3-operand instructions
+to use the *destination* predication register.
+
+In this case however, the destination *is* a predication register
+as opposed to being a predication mask that is applied *to* the
+(vectorised) operation, element-at-a-time on src1 and src2.
+
+Thus the question is directly inter-related to whether the modification
+of the predication mask should *itself* be predicated.
+
+It is quite complex, in other words, and needs careful consideration.
+
+## 8/16-bit ops is it worthwhile adding a "start offset"?
+
+The idea here is to make it possible, particularly in a "Packed SIMD"
+case, to be able to avoid doing unaligned Load/Store operations
+by specifying that operations, instead of being carried out
+element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
+element Packed SIMD cases.
+
+For example, rather than take two 32-bit registers, each divided into
+4 8-bit elements, and have them ADDed element-for-element as follows:
+
+    r3[0] = add r4[0], r6[0]
+    r3[1] = add r4[1], r6[1]
+    r3[2] = add r4[2], r6[2]
+    r3[3] = add r4[3], r6[3]
+
+an offset of 1 would result in four operations as follows, instead:
+
+    r3[0] = add r4[1], r6[0]
+    r3[1] = add r4[2], r6[1]
+    r3[2] = add r4[3], r6[2]
+    r3[3] = add r5[0], r6[3]
+
+In non-packed-SIMD mode there is no benefit at all, as a vector may
+be created using a different CSR that has the offset built-in. So this
+leaves just the packed-SIMD case to consider.
+
+Two ways in which this could be implemented / emulated (without special
+hardware):
+
+* bit-manipulation that shuffles the data along by one byte (or one word)
+  either prior to or as part of the operation requiring the offset.
+* just use an unaligned Load/Store sequence, even if there are performance
+  penalties for doing so.
+
+The question then is whether the performance hit is worth the extra hardware
+involving byte-shuffling/shifting the data by an arbitrary offset. On
+balance, given that there are two reasonable instruction-based options, the
+hardware-offset option should be left out for the initial version of SV,
+with the option to consider it in an "advanced" version of the specification.
 
 # Implementing V on top of Simple-V
 
@@ -820,7 +607,7 @@ levels: Base and reserved future functionality.
 up to 16 (TBD) of either the floating-point or integer registers to be
 marked as "predicated" (key), and if so, which integer register to
 use as the predication mask (value).
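+
+As a sketch only (names invented; the exact table size and encoding are
+the TBD items above), the key-value lookup at instruction-decode time
+amounts to:
+
+    // shape of one key-value entry (fields as per the Predication CSR
+    // table: enabled, int/fp, register index, mask register index)
+    struct { int pren; int fp; int regidx; int predidx; } CSRpred[16];
+
+    // is register "reg" (int or fp) predicated? if so, return the
+    // integer register holding its predication mask, else -1
+    int find_pred(int reg, int is_fp) {
+        for (int i = 0; i < 16; i++)          // 16 (TBD) entries
+            if (CSRpred[i].pren               // entry enabled
+                && CSRpred[i].fp == is_fp     // int/fp register file
+                && CSRpred[i].regidx == reg)  // key: register index
+                return CSRpred[i].predidx;    // value: mask register
+        return -1;                            // not predicated
+    }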
-
+ **TODO**
 
 # Implementing P (renamed to DSP) on top of Simple-V
 
@@ -845,6 +632,11 @@ This section compares the various parallelism proposals as they stand,
 including traditional SIMD, in terms of features, ease of implementation,
 complexity, flexibility, and die area.
 
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
 ### [[alt_rvp]]
 
 Primary benefit of Alt-RVP is the simplicity with which parallelism
@@ -875,9 +667,9 @@ from actual (internal) parallel hardware. It's an API in effect that's
 designed to be slotted in to an existing implementation (just after
 instruction decode) with minimum disruption and effort.
 
-* minus: the complexity of having to use register renames, OoO, VLIW,
-  register file cacheing, all of which has been done before but is a
-  pain
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file caching,
+  all of which has been done before but is a pain
 * plus: transparent re-use of existing opcodes as-is just indirectly
   saying "this register's now a vector" which
 * plus: means that future instructions also get to be inherently
@@ -986,7 +778,7 @@ the question is asked "How can each of the proposals effectively implement
 implement a SIMD architecture where the ALU becomes responsible
 for the parallelism, Alt-RVP ALUs would likewise be so responsible...
 with *additional* (lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
  at least one dimension are avoided (architectural upgrades introducing
  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
  SIMD block)
@@ -1064,37 +856,35 @@ the question is asked "How can each of the proposals effectively implement
 
 ### Example Instruction translation:
 
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO. r7 and r4 are marked as "vectorised":
 
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+The instruction "ADD r7 r4 r1" would result in three instructions being
+generated and placed into the FIFO. r7 and r1 are marked as "vectorised"
+whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
 
 ## Example of vector / vector, vector / scalar, scalar / scalar => vector add
 
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i
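+
+The remainder of the original pseudocode is truncated in this rendering.
+As an illustrative sketch only (names are assumed, not the original code),
+the element-for-element expansion described above amounts to: step the
+index of each operand only when that register is marked vectorised, so
+that a scalar operand is simply repeated:
+
+    // expand one ADD rd, rs1, rs2 into VL scalar additions
+    for (int i = 0, id = 0, irs1 = 0, irs2 = 0; i < VL; i++) {
+        ireg[rd + id] = ireg[rs1 + irs1] + ireg[rs2 + irs2];
+        if (vectorised[rd])  id++;    // step only if rd is a vector
+        if (vectorised[rs1]) irs1++;  // scalar sources are not stepped,
+        if (vectorised[rs2]) irs2++;  // i.e. they are simply repeated
+    }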
@@ -1122,10 +912,10 @@ There is, in the standard Conditional Branch instruction, more than adequate
 space to interpret it in a similar fashion:
 
 [[!table  data="""
-   31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 |    7    | 6 ....... 0 |
-imm[12]  | imm[10:5]  |   rs2    |    rs1    |    funct3    |   imm[4:1]   | imm[11] |   opcode    |
-   1     |     6      |    5     |     5     |      3       |      4       |    1    |      7      |
- offset[12,10:5]     ||   src2   |   src1    |     BEQ      | offset[11,4:1]        ||   BRANCH    |
+31      |30 ..... 25 |24..20|19..15| 14...12| 11.....8 |    7    | 6....0 |
+imm[12] | imm[10:5]  |rs2   | rs1  | funct3 | imm[4:1] | imm[11] | opcode |
+ 1      |     6      |  5   |  5   |   3    |    4     |    1    |   7    |
+ offset[12,10:5]    || src2 | src1 |  BEQ   | offset[11,4:1]  || BRANCH   |
"""]]
 
 This would become:
 
@@ -1145,19 +935,19 @@ not only to add in a second source register, but also use some of the
 bits as a predication target as well.
 
 [[!table  data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
-   funct3    |       imm         |   rs10   |         imm           |   op   |
-      3      |        3          |    3     |          5            |   2    |
-   C.BEQZ    |  offset[8,4:3]    |   src    |   offset[7:6,2:1,5]   |   C1   |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2     | 1 .. 0 |
+funct3 | imm           | rs10  | imm               | op     |
+3      | 3             | 3     | 5                 | 2      |
+C.BEQZ | offset[8,4:3] | src   | offset[7:6,2:1,5] | C1     |
"""]]
 
 Now uses the CS format:
 
 [[!table  data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    |       imm         |   rs10   |  imm   |              |   op   |
-      3      |        3          |    3     |   2    |      3       |   2    |
-   C.BEQZ    |  predicate rs3    |   src1   |  I/F B |     src2     |   C1   |
+15..13 | 12 . 10  | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm      | rs10   | imm    |      | op     |
+3      | 3        | 3      | 2      | 3    | 2      |
+C.BEQZ | pred rs3 | src1   | I/F B  | src2 | C1     |
"""]]
 
 Bit 6 would be decoded as "operation refers to Integer or Float" including
@@ -1260,16 +1050,16 @@ still be respected*, making Simple-V in effect the "consistent public API".
 
 vew may be one of the following (giving a table "bytestable", used below):
 
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
 
 Pseudocode for vector length taking CSR SIMD-bitwidth into account:
 
@@ -1387,7 +1177,7 @@ So the question boils down to:
 
 Whilst the above may seem to be severe minuses, there are some strong
 pluses:
 
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
 * Smaller reduction of P's opcode space: around 10%.
 * The potential to use Compressed instructions in both Vector and SIMD
   due to the overloading of register meaning (implicit vectorisation,
@@ -1437,7 +1227,116 @@ RVV nor base RV have taken integer-overflow (carry) into account, which
 makes proposing it quite challenging given that the relevant (Base) RV
 sections are frozen. Consequently it makes sense to forgo this feature.
 
-## Virtual Memory page-faults
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+    /* Macro for saving task context */
+    .macro portSAVE_CONTEXT
+    .global pxCurrentTCB
+    /* make room in stack */
+    addi sp, sp, -REGBYTES * 32
+
+    /* Save Context */
+    STORE x1, 0x0(sp)
+    STORE x2, 1 * REGBYTES(sp)
+    STORE x3, 2 * REGBYTES(sp)
+    ...
+    ...
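+    /* (REGBYTES is the per-register save size: 4 on RV32, 8 on RV64.
+       x1..x31 land at consecutive REGBYTES offsets - precisely the
+       contiguity that allows the single vectorised STORE shown below
+       to replace this whole sequence.) */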
+    STORE x30, 29 * REGBYTES(sp)
+    STORE x31, 30 * REGBYTES(sp)
+
+    /* Store current stackpointer in task control block (TCB) */
+    LOAD t0, pxCurrentTCB //pointer
+    STORE sp, 0x0(t0)
+    .endm
+
+    /* Saves current exception program counter (EPC) as task program counter */
+    .macro portSAVE_EPC
+    csrr t0, mepc
+    STORE t0, 31 * REGBYTES(sp)
+    .endm
+
+    /* Saves current return address (RA) as task program counter */
+    .macro portSAVE_RA
+    STORE ra, 31 * REGBYTES(sp)
+    .endm
+
+    /* Macro for restoring task context */
+    .macro portRESTORE_CONTEXT
+
+    .global pxCurrentTCB
+    /* Load stack pointer from the current TCB */
+    LOAD sp, pxCurrentTCB
+    LOAD sp, 0x0(sp)
+
+    /* Load task program counter */
+    LOAD t0, 31 * REGBYTES(sp)
+    csrw mepc, t0
+
+    /* Run in machine mode */
+    li t0, MSTATUS_PRV1
+    csrs mstatus, t0
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    LOAD x4, 3 * REGBYTES(sp)
+    LOAD x5, 4 * REGBYTES(sp)
+    ...
+    ...
+    LOAD x30, 29 * REGBYTES(sp)
+    LOAD x31, 30 * REGBYTES(sp)
+
+    addi sp, sp, REGBYTES * 32
+    mret
+    .endm
+
+The important bits are the Load / Save context, which may be replaced
+with firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD) including using C.ST or C.LD, to indicate that the entire
+bank of registers is to be loaded/saved:
+
+    /* a few things are assumed here: (a) that when switching to
+       M-Mode an entirely different set of CSRs is used from that
+       which is used in U-Mode and (b) that the M-Mode x1 and x4
+       vectors are also not used anywhere else in M-Mode, consequently
+       only need to be set up just the once.
+     */
+    .macro VectorSetup
+    MVECTORCSRx1 = 31, defaultlen
+    MVECTORCSRx4 = 28, defaultlen
+
+    /* Save Context */
+    SETVL x0, x0, 31 /* x0 ignored silently */
+    STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    SETVL x0, x0, 28 /* x0 ignored silently */
+    LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear not
+to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the VectorSetup macro, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
+
 > I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
 > riscv-isa-manual in order to work out how to re-map RVV onto the standard
@@ -1562,9 +1461,9 @@ would still be there (and stalled). hmmm.
 
 > > Thrown away.
 
-discussion then led to the question of OoO architectures
+discussion then led to the question of OoO architectures
 
-> The costs of the imprecise-exception model are greater than the benefit.
+> The costs of the imprecise-exception model are greater than the benefit.
 > Software doesn't want to cope with it.  It's hard to debug. 
You can't > migrate state between different microarchitectures--unless you force all > implementations to support the same imprecise-exception model, which would @@ -1572,11 +1471,151 @@ discussion then led to the question of OoO architectures > relevant, is that the imprecise model increases the size of the context > structure, as the microarchitectural guts have to be spilled to memory.) - -## Implementation Paradigms - -TODO: assess various implementation paradigms: - +## Zero/Non-zero Predication + +>> >  it just occurred to me that there's another reason why the data +>> > should be left instead of zeroed.  if the standard register file is +>> > used, such that vectorised operations are translated to mean "please +>> > insert multiple register-contiguous operations into the instruction +>> > FIFO" and predication is used to *skip* some of those, then if the +>> > next "vector" operation uses the (standard) registers that were masked +>> > *out* of the previous operation it may proceed without blocking. +>> > +>> >  if however zeroing is made mandatory then that optimisation becomes +>> > flat-out impossible to deploy. +>> > +>> >  whilst i haven't fully thought through the full implications, i +>> > suspect RVV might also be able to benefit by being able to fit more +>> > overlapping operations into the available SRAM by doing something +>> > similar. +> +> +> Luke, this is called density time masking. It doesn’t apply to only your +> model with the “standard register file” is used. it applies to any +> architecture that attempts to speed up by skipping computation and writeback +> of masked elements. +> +> That said, the writing of zeros need not be explicit. It is possible to add +> a “zero bit” per element that, when set, forces a zero to be read from the +> vector (although the underlying storage may have old data). In this case, +> there may be a way to implement DTM as well. + + +## Implementation detail for scalar-only op detection + +Note 1: this idea is a pipeline-bypass concept, which may *or may not* be +worthwhile. + +Note 2: this is just one possible implementation. Another implementation +may choose to treat *all* operations as vectorised (including treating +scalars as vectors of length 1), choosing to add an extra pipeline stage +dedicated to *all* instructions. + +This section *specifically* covers the implementor's freedom to choose +that they wish to minimise disruption to an existing design by detecting +"scalar-only operations", bypassing the vectorisation phase (which may +or may not require an additional pipeline stage) + +[[scalardetect.png]] + +>> For scalar ops an implementation may choose to compare 2-3 bits through an +>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr +>> FIFO). + +> Those bits cannot be known until after the registers are decoded from the +> instruction and a lookup in the "vector length table" has completed. +> Considering that one of the reasons RISC-V keeps registers in invariant +> positions across all instructions is to simplify register decoding, I expect +> that inserting an SRAM read would lengthen the critical path in most +> implementations. + +reply: + +> briefly: the trick i mentioned about ANDing bits together to check if +> an op was fully-scalar or not was to be read out of a single 32-bit +> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per +> register indicating "is register vectorised yes no". 3R because you need +> to check src1, src2 and dest simultaneously. 
the entries are *generated* +> from the CSRs and are an optimisation that on slower embedded systems +> would likely not be needed. + +> is there anything unreasonable that anyone can foresee about that? +> what are the down-sides? + +## C.MV predicated src, predicated dest + +> Can this be usefully defined in such a way that it is +> equivalent to vector gather-scatter on each source, followed by a +> non-predicated vector-compare, followed by vector gather-scatter on the +> result? + +## element width conversion: restrict or remove? + +summary: don't restrict / remove. it's fine. + +> > it has virtually no cost/overhead as long as you specify +> > that inputs can only upconvert, and operations are always done at the +> > largest size, and downconversion only happens at the output. +> +> okaaay.  so that's a really good piece of implementation advice. +> algorithms do require data size conversion, so at some point you need to +> introduce the feature of upconverting and downconverting. +> +> > for int and uint, this is dead simple and fits well within the RVV pipeline +> > without any critical path, pipeline depth, or area implications. + + + +## Under review / discussion: remove CSR vector length, use VSETVL + +**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines +length on all regs**. This section kept for historical reasons. + +So the issue is as follows: + +* CSRs are used to set the "span" of a vector (how many of the standard + register file to contiguously use) +* VSETVL in RVV works as follows: it sets the vector length (copy of which + is placed in a dest register), and if the "required" length is longer + than the *available* length, the dest reg is set to the MIN of those + two. +* **HOWEVER**... in SV, *EVERY* vector register has its own separate + length and thus there is no way (at the time that VSETVL is called) to + know what to set the vector length *to*. +* At first glance it seems that it would be perfectly fine to just limit + the vector operation to the length specified in the destination + register's CSR, at the time that each instruction is issued... + except that that cannot possibly be guaranteed to match + with the value *already loaded into the target register from VSETVL*. + +Therefore a different approach is needed. + +Possible options include: + +* Removing the CSR "Vector Length" and always using the value from + VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and* + destreg equal to MIN(counterreg, lenimmed), with register-based + variant "VSETVL destreg, counterreg, lenreg" doing the same. +* Keeping the CSR "Vector Length" and having the lenreg version have + a "twist": "if lengreg is vectorised, read the length from the CSR" +* Other (TBD) + +The first option (of the ones brainstormed so far) is a lot simpler. +It does however mean that the length set in VSETVL will apply across-the-board +to all src1, src2 and dest vectorised registers until it is otherwise changed +(by another VSETVL call). This is probably desirable behaviour. + +## Implementation Paradigms + +TODO: assess various implementation paradigms. These are listed roughly +in order of simplicity (minimum compliance, for ultra-light-weight +embedded systems or to reduce design complexity and the burden of +design implementation and compliance, in non-critical areas), right the +way to high-performance systems. 
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+  required, however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
 * Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
 * In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
 * Out-of-order with instruction FIFOs and aggressive register-renaming
@@ -1588,10 +1627,93 @@ Also to be taken into consideration:
 
 * Comprehensive vectorisation: FIFOs and internal parallelism
 * Hybrid Parallelism
 
+### Full or partial software-emulation
+
+The absolute, absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
 # TODO Research
 
 > For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
 
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance in order to avoid it.
+
+TBD: floating-point compare and other exception handling
+
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving in the case of
+the Store buffer).
+
+Also, use the L1 Miss buffers as these already HAVE to be snooped by
+coherence traffic. These are used to monitor that all participating
+cache lines remain interference-free, and amalgamate same into a CPU
+signal accessible via branch or predicate.
+
+The Load buffers manage inbound traffic.
+The Store buffers manage outbound traffic.
+
+Done properly, the participating cache lines can exceed the associativity
+of the L1 cache without architectural harm (may incur additional latency).
+
+
+
+> > >  so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> >
+> > Can you explain what you mean by this paragraph?
+>
+> best put in sequential events, probably.
+>
+>     LR <-- 64-instruction countdown starts here
+>     ... 63
+>     ... 62
+>     LR same address <--- notes that core1 is on 61,
+>                          so pauses for **UP TO** 61 cycles
+>     ... 32
+>     SC <- core1 didn't reach zero, therefore valid, therefore
+>           core2 is now **UNBLOCKED**, is granted the
+>           load-reservation (and begins its **own** 64-cycle
+>           LR instruction countdown)
+>     ... 63
+>     ... 62
+>     ...
+>     ...
+> SC <- also valid + +Looks to me that you could effect the same functionality by simply +holding onto the cache line in core 1 preventing core 2 from + getting past the LR. + +On the other hand, the freeze is similar to how the MP CRAYs did +ATOMIC stuff. + # References * SIMD considered harmful @@ -1622,3 +1744,12 @@ Also to be taken into consideration: restarted if an exception occurs (VM page-table miss) * Dot Product Vector +* RVV slides 2017 +* Wavefront skipping using BRAMS +* Streaming Pipelines +* Barcelona SIMD Presentation +* +* Full Description (last page) of RVV instructions + +* PULP Low-energy Cluster Vector Processor +