X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=1f479badad0209ee49ebd85864e2376ee9613fb4;hb=a537f8e87eaf740e6eadb4517e1f93c8112bb3cb;hp=ad8fd3a5d13949a3f17c0b66937f8bb23c51a32a;hpb=4d66d4c9900081d26aa4a9fe7c84464e629a805b;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index ad8fd3a5d..1f479bada 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,21 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FILO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+[[!toc ]]
+
+# Introduction
+
 This proposal exists so as to be able to satisfy several disparate
 requirements: power-conscious, area-conscious, and performance-conscious
 designs all pull an ISA and its implementation in different conflicting
@@ -9,7 +25,7 @@ Additionally, the existing P (SIMD) proposal and the V (Vector)
 proposals, whilst each extremely powerful in their own right and clearly
 desirable, are also:
 
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
   so need work to adapt to the RISC-V ethos and paradigm
 * Are sufficiently large so as to make adoption (and exploration for
   analysis and review purposes) prohibitively expensive
@@ -32,40 +48,27 @@ would provide even more flexibility).
 
 Additionally it makes sense to *split out* the parallelism inherent within
 each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, would work well.
-
-**TODO**: reword this to better suit this document:
+a "best-of-both" parallelism extension, could be added *on top* of
+this proposal, to topologically provide the exact same functionality as
+each of P and V. Each of P and V then can focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
 
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted. I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
-  numbers) and unusual operations of V to create a single-issue "Esoteric
-  Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
-  of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
-  apply across *all* Extensions, whether those be DSP, M, Base, V, P -
-  everything.
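+To illustrate the "instruction queue" insight above, the following is a
+minimal conceptual sketch in python. Every name here (CSRvectorlen,
+issue_queue, issue) is invented purely for illustration and is not part
+of any specification: the point is solely that one opcode referencing a
+vector-tagged register expands into multiple near-identical operations,
+pending execution:
+
+    CSRvectorlen = {4: 3}   # register r4 tagged as a vector of length 3
+    issue_queue = []        # instructions pending execution
+
+    def issue(op, rd, rs1, rs2):
+        # the longest vector operand determines the expansion count
+        vlen = max(CSRvectorlen.get(r, 1) for r in (rd, rs1, rs2))
+        for i in range(vlen):
+            def step(r):
+                # only vector-tagged registers advance; scalars repeat
+                return r + i if CSRvectorlen.get(r, 1) > 1 else r
+            issue_queue.append((op, step(rd), step(rs1), step(rs2)))
+
+    issue("ADD", 2, 4, 4)
+    # queue now holds: ADD r2,r4,r4 / ADD r2,r5,r5 / ADD r2,r6,r6
+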
+Furthermore, an additional goal of this proposal is to reduce the number +of opcodes utilised by each of P and V as they currently stand, leveraging +existing RISC-V opcodes where possible, and also potentially allowing +P and V to make use of Compressed Instructions as a result. **TODO**: propose overflow registers be actually one of the integer regs (flowing to multiple regs). **TODO**: propose "mask" (predication) registers likewise. combination with -standard RV instructions and overflow registers extremely powerful - -**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular -register as being "if you use this reg in LOAD/STORE, use the offset -amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous". -can be used for matrix spanning. - +standard RV instructions and overflow registers extremely powerful, see +Aspex ASP. # Analysis and discussion of Vector vs SIMD -There are four combined areas between the two proposals that help with +There are five combined areas between the two proposals that help with parallelism without over-burdening the ISA with a huge proliferation of instructions: @@ -90,6 +93,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable for general-purpose computation, and in the context of developing a general-purpose ISA, is never going to satisfy 100 percent of implementors. +To explain this further: for increased workloads over time, as the +performance requirements increase for new target markets, implementors +choose to extend the SIMD width (so as to again avoid mixing parallelism +into the instruction issue phases: the primary "simplicity" benefit of +SIMD in the first place), with the result that the entire opcode space +effectively doubles with each new SIMD width that's added to the ISA. + That basically leaves "variable-length vector" as the clear *general-purpose* winner, at least in terms of greatly simplifying the instruction set, reducing the number of instructions required for any given task, and thus @@ -125,13 +135,16 @@ integer (and floating point) of various sizes is automatically inferred due to "type tagging" that is set with a special instruction. A register will be *specifically* marked as "16-bit Floating-Point" and, if added to an operand that is specifically tagged as "32-bit Integer" an implicit -type-conversion will take placce *without* requiring that type-conversion +type-conversion will take place *without* requiring that type-conversion to be explicitly done with its own separate instruction. However, implicit type-conversion is not only quite burdensome to implement (explosion of inferred type-to-type conversion) but also is never really going to be complete. It gets even worse when bit-widths -also have to be taken into consideration. +also have to be taken into consideration. Each new type results in +an increased O(N^2) conversion space that, as anyone who has examined +python's source code (which has built-in polymorphic type-conversion), +knows that the task is more complex than it first seems. Overall, type-conversion is generally best to leave to explicit type-conversion instructions, or in definite specific use-cases left to @@ -144,27 +157,89 @@ contains an extremely interesting feature: zero-overhead loops. This proposal would basically allow an inner loop of instructions to be repeated indefinitely, a fixed number of times. 
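+
+As a rough toy model (invented for illustration, not taken from the
+P-Ext specification), the saving can be counted in instructions reaching
+the pipeline: the loop-control overhead is paid once, up front, rather
+than on every iteration:
+
+    # python sketch: a one-instruction inner loop, repeated 100 times
+    def zero_overhead_loop(count, body):
+        # the loop unit re-issues the buffered body 'count' times; no
+        # decrement / compare / branch instructions enter the pipeline
+        issued = []
+        for _ in range(count):
+            issued.extend(body)
+        return issued
+
+    inner = ["ADD r1, r1, r2"]   # the one instruction that matters
+    assert len(zero_overhead_loop(100, inner)) == 100
+    # an explicit loop would issue 100 * (ADD + ADDI + BNE) = 300,
+    # with a potential pipeline bubble at each branch
+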
-Its specific advantage over explicit loops is that the pipeline in a -DSP can potentially be kept completely full *even in an in-order +Its specific advantage over explicit loops is that the pipeline in a DSP +can potentially be kept completely full *even in an in-order single-issue implementation*. Normally, it requires a superscalar architecture and -out-of-order execution capabilities to "pre-process" instructions in order -to keep ALU pipelines 100% occupied. - -This very simple proposal offers a way to increase pipeline activity in the -one key area which really matters: the inner loop. - -## Mask and Tagging - -*TODO: research masks as they can be superb and extremely powerful. -If B-Extension is implemented and provides Bit-Gather-Scatter it -becomes really cool and easy to switch out certain indexed values -from an array of data, but actually BGS **on its own** might be -sufficient. Bottom line, this is complex, and needs a proper analysis. -The other sections are pretty straightforward.* +out-of-order execution capabilities to "pre-process" instructions in +order to keep ALU pipelines 100% occupied. + +By bringing that capability in, this proposal could offer a way to increase +pipeline activity even in simpler implementations in the one key area +which really matters: the inner loop. + +However when looking at much more comprehensive schemes +"A portable specification of zero-overhead loop control hardware +applied to embedded processors" (ZOLC), optimising only the single +inner loop seems inadequate, tending to suggest that ZOLC may be +better off being proposed as an entirely separate Extension. + +## Mask and Tagging (Predication) + +Tagging (aka Masks aka Predication) is a pseudo-method of implementing +simplistic branching in a parallel fashion, by allowing execution on +elements of a vector to be switched on or off depending on the results +of prior operations in the same array position. + +The reason for considering this is simple: by *definition* it +is not possible to perform individual parallel branches in a SIMD +(Single-Instruction, **Multiple**-Data) context. Branches (modifying +of the Program Counter) will result in *all* parallel data having +a different instruction executed on it: that's just the definition of +SIMD, and it is simply unavoidable. + +So these are the ways in which conditional execution may be implemented: + +* explicit compare and branch: BNE x, y -> offs would jump offs + instructions if x was not equal to y +* explicit store of tag condition: CMP x, y -> tagbit +* implicit (condition-code) ADD results in a carry, carry bit implicitly + (or sometimes explicitly) goes into a "tag" (mask) register + +The first of these is a "normal" branch method, which is flat-out impossible +to parallelise without look-ahead and effectively rewriting instructions. +This would defeat the purpose of RISC. + +The latter two are where parallelism becomes easy to do without complexity: +every operation is modified to be "conditionally executed" (in an explicit +way directly in the instruction format *or* implicitly). + +RVV (Vector-Extension) proposes to have *explicit* storing of the compare +in a tag/mask register, and to *explicitly* have every vector operation +*require* that its operation be "predicated" on the bits within an +explicitly-named tag/mask register. 
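+
+A minimal sketch of this explicit style (python; list-based "vectors"
+and all function names invented for illustration): a compare fills a
+mask, and the subsequent operation names that mask explicitly, executing
+only on elements whose bit is set:
+
+    def vcmp_eq(a, b):
+        # explicit store of the tag condition, one bit per element
+        return [int(x == y) for x, y in zip(a, b)]
+
+    def vadd_masked(dest, src1, src2, mask):
+        for i, m in enumerate(mask):
+            if m:                        # masked-off elements untouched
+                dest[i] = src1[i] + src2[i]
+
+    v1, v2, v3 = [1, 2, 3, 4], [1, 0, 3, 0], [10, 20, 30, 40]
+    mask = vcmp_eq(v1, v2)               # [1, 0, 1, 0]
+    vadd_masked(v3, v1, v1, mask)        # v3 -> [2, 20, 6, 40]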
+ +SIMD (P-Extension) has not yet published precise documentation on what its +schema is to be: there is however verbal indication at the time of writing +that: + +> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will +> be executed using the same compare ALU logic for the base ISA with some +> minor modifications to handle smaller data types. The function will not +> be duplicated. + +This is an *implicit* form of predication as the base RV ISA does not have +condition-codes or predication. By adding a CSR it becomes possible +to also tag certain registers as "predicated if referenced as a destination". +Example: + + // in future operations from now on, if r0 is the destination use r5 as + // the PREDICATION register + SET_IMPLICIT_CSRPREDICATE r0, r5 + // store the compares in r5 as the PREDICATION register + CMPEQ8 r5, r1, r2 + // r0 is used here. ah ha! that means it's predicated using r5! + ADD8 r0, r1, r3 + +With enough registers (and in RISC-V there are enough registers) some fairly +complex predication can be set up and yet still execute without significant +stalling, even in a simple non-superscalar architecture. + +(For details on how Branch Instructions would be retro-fitted to indirectly +predicated equivalents, see Appendix) ## Conclusions -In the above sections the four different ways where parallel instruction +In the above sections the five different ways where parallel instruction execution has closely and loosely inter-related implications for the ISA and for implementors, were outlined. The pluses and minuses came out as follows: @@ -172,29 +247,236 @@ follows: * Fixed vs variable parallelism: variable * Implicit (indirect) vs fixed (integral) instruction bit-width: indirect * Implicit vs explicit type-conversion: explicit -* Implicit vs explicit inner loops: implicit -* Tag or no-tag: TODO +* Implicit vs explicit inner loops: implicit but best done separately +* Tag or no-tag: Complex but highly beneficial + +In particular: -In particular: variable-length vectors came out on top because of the -high setup, teardown and corner-cases associated with the fixed width -of SIMD. Implicit bit-width helps to extend the ISA to escape from -former limitations and restrictions (in a backwards-compatible fashion), -and implicit (zero-overhead) loops provide a means to keep pipelines -potentially 100% occupied *without* requiring a super-scalar or out-of-order -architecture. +* variable-length vectors came out on top because of the high setup, teardown + and corner-cases associated with the fixed width of SIMD. +* Implicit bit-width helps to extend the ISA to escape from + former limitations and restrictions (in a backwards-compatible fashion), + whilst also leaving implementors free to simmplify implementations + by using actual explicit internal parallelism. +* Implicit (zero-overhead) loops provide a means to keep pipelines + potentially 100% occupied in a single-issue in-order implementation + i.e. *without* requiring a super-scalar or out-of-order architecture, + but doing a proper, full job (ZOLC) is an entirely different matter. -Constructing a SIMD/Simple-Vector proposal based around even only these four -(five?) requirements would therefore seem to be a logical thing to do. +Constructing a SIMD/Simple-Vector proposal based around four of these five +requirements would therefore seem to be a logical thing to do. 
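+
+Returning to the SET_IMPLICIT_CSRPREDICATE example above: a speculative
+python sketch (names invented, nothing here is specification) of the
+implicit variant, where a key-value store maps a destination register to
+its predicate register and is consulted on every write:
+
+    csr_predicate = {}           # dest regnum -> predicate regnum
+
+    def set_implicit_csrpredicate(rd, rp):
+        csr_predicate[rd] = rp
+
+    def vadd(regfile, rd, rs1, rs2, vl):
+        # predication is looked up, not named in the instruction
+        pred = regfile[csr_predicate[rd]] if rd in csr_predicate else ~0
+        for i in range(vl):
+            if (pred >> i) & 1:  # bit i of the predicate gates element i
+                regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]
+
+    regfile = list(range(32))            # toy register file
+    set_implicit_csrpredicate(0, 5)      # writes to r0 predicated by r5
+    regfile[5] = 0b0101                  # as if set by CMPEQ8 r5, r1, r2
+    vadd(regfile, 0, 1, 3, 4)            # only elements 0 and 2 written
+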
# Instruction Format

-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width. Lots to
-look at and investigate!*
+The instruction format for Simple-V does not actually have *any* compare
+operations, *any* arithmetic, floating-point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on implicit CSR configurations for both vector length and
+bitwidth. This includes Compressed instructions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+  outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+  for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction),
+with some opportunities to use Compressed instructions bringing it down
+to 48. Also to consider is whether one or both of the last two remaining
+Compressed instruction codes in Quadrant 1 could be used as a parallelism
+prefix, bringing parallelised opcodes down to 32-bit and having the
+benefit of being explicit.*
+
+## Branch Instruction:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+0 | reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+0 | reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+0 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 100 | predicate rs3 || BLE |
+0 | reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+0 | reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+0 | reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+1 | reserved | src2 | src1 | 000 | predicate rs3 || FEQ |
+1 | reserved | src2 | src1 | 001 | predicate rs3 || FNE |
+1 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 100 | predicate rs3 || FLT |
+1 | reserved | src2 | src1 | 101 | predicate rs3 || FLE |
+1 | reserved | src2 | src1 | 110 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 111 | predicate rs3 || rsvd |
+"""]]
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         preg[pd][i] = (s1 [op] s2);
+
+The Simple-V equivalent extends this to cover the case where either (or
+both) of the source operands is a scalar rather than a vector, using the
+vector-length CSRs to decide whether to index each operand:
+
+    s1 = CSRvectorlen[src1] > 1;
+    s2 = CSRvectorlen[src2] > 1;
+    for (int i=0; i<vl; ++i)
+       preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+                          s2 ? reg[src2+i] : reg[src2]);
+
-> Ok so this is an aspect of Simple-V that I hadn't thought through,
-> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
-> 17.10 the CSRs are listed.  I note that there's some general-purpose
-> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
-> don't precisely know what those are for.
- ->  In the Simple-V proposal, *every* register in both the integer -> register-file *and* the floating-point register-file would have at -> least a 2-bit "data-width" CSR and probably something like an 8-bit -> "vector-length" CSR (less in RV32E, by exactly one bit). - ->  What I *don't* know is whether that would be considered perfectly -> reasonable or completely insane.  If it turns out that the proposed -> Simple-V CSRs can indeed be stored in SRAM then I would imagine that -> adding somewhere in the region of 10 bits per register would be... okay?  -> I really don't honestly know. - ->  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to -> be multi-ported? No I don't believe they would. - -## 17.11 Maximum Vector Length (MVL) - -Basically implicitly this is set to the maximum size of the register -file multiplied by the number of 8-bit packed ints that can fit into -a register (4 for RV32, 8 for RV64 and 16 for RV128). - -## !7.12 Vector Instruction Formats - -No equivalent in Simple-V because *all* instructions of *all* Extensions -are implicitly parallelised (and packed). - -## 17.13 Polymorphic Vector Instructions - -Polymorphism (implicit type-casting) is deliberately not supported -in Simple-V. - -## 17.14 Rapid Configuration Instructions - -TODO: analyse if this is useful to have an equivalent in Simple-V - -## 17.15 Vector-Type-Change Instructions - -TODO: analyse if this is useful to have an equivalent in Simple-V - -## 17.16 Vector Length - -Has a direct corresponding equivalent. - -## 17.17 Predicated Execution - -Predicated Execution is another name for "masking" or "tagging". Masked -(or tagged) implies that there is a bit field which is indexed, and each -bit associated with the corresponding indexed offset register within -the "Vector". If the tag / mask bit is 1, when a parallel operation is -issued, the indexed element of the vector has the operation carried out. -However if the tag / mask bit is *zero*, that particular indexed element -of the vector does *not* have the requested operation carried out. - -In V2.3-draft V, there is a significant (not recommended) difference: -the zero-tagged elements are *set to zero*. This loses a *significant* -advantage of mask / tagging, particularly if the entire mask register -is itself a general-purpose register, as that general-purpose register -can be inverted, shifted, and'ed, or'ed and so on. In other words -it becomes possible, especially if Carry/Overflow from each vector -operation is also accessible, to do conditional (step-by-step) vector -operations including things like turn vectors into 1024-bit or greater -operands with very few instructions, by treating the "carry" from -one instruction as a way to do "Conditional add of 1 to the register -next door". If V2.3-draft V sets zero-tagged elements to zero, such -extremely powerful techniques are simply not possible. - -It is noted that there is no mention of an equivalent to BEXT (element -skipping) which would be particularly fascinating and powerful to have. -In this mode, the "mask" would skip elements where its mask bit was zero -in either the source or the destination operand. - -Lots to be discussed. - -## 17.18 Vector Load/Store Instructions - -These may not have a direct equivalent in Simple-V, except if mask/tagging -is to be deployed. - -To be discussed. 
- -## 17.19 Vector Register Gather - -TODO - -## TODO, sort - -> However, there are also several features that go beyond simply attaching VL -> to a scalar operation and are crucial to being able to vectorize a lot of -> code. To name a few: -> - Conditional execution (i.e., predicated operations) -> - Inter-lane data movement (e.g. SLIDE, SELECT) -> - Reductions (e.g., VADD with a scalar destination) - - Ok so the Conditional and also the Reductions is one of the reasons - why as part of SimpleV / variable-SIMD / parallelism (gah gotta think - of a decent name) i proposed that it be implemented as "if you say r0 - is to be a vector / SIMD that means operations actually take place on - r0,r1,r2... r(N-1)". - - Consequently any parallel operation could be paused (or... more - specifically: vectors disabled by resetting it back to a default / - scalar / vector-length=1) yet the results would actually be in the - *main register file* (integer or float) and so anything that wasn't - possible to easily do in "simple" parallel terms could be done *out* - of parallel "mode" instead. - - I do appreciate that the above does imply that there is a limit to the - length that SimpleV (whatever) can be parallelised, namely that you - run out of registers! my thought there was, "leave space for the main - V-Ext proposal to extend it to the length that V currently supports". - Honestly i had not thought through precisely how that would work. - - Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that, - it reminds me of the discussion with Clifford on bit-manipulation - (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if - applied "globally and outside of V and P" SLIDE and SELECT might become - an extremely powerful way to do fast memory copy and reordering [2[. - - However I haven't quite got my head round how that would work: i am - used to the concept of register "tags" (the modern term is "masks") - and i *think* if "masks" were applied to a Simple-V-enhanced LOAD / - STORE you would get the exact same thing as SELECT. - - SLIDE you could do simply by setting say r0 vector-length to say 16 - (meaning that if referred to in any operation it would be an implicit - parallel operation on *all* registers r0 through r15), and temporarily - set say.... r7 vector-length to say... 5. Do a LOAD on r7 and it would - implicitly mean "load from memory into r7 through r11". Then you go - back and do an operation on r0 and ta-daa, you're actually doing an - operation on a SLID {SLIDED?) vector. - - The advantage of Simple-V (whatever) over V would be that you could - actually do *operations* in the middle of vectors (not just SLIDEs) - simply by (as above) setting r0 vector-length to 16 and r7 vector-length - to 5. There would be nothing preventing you from doing an ADD on r0 - (which meant do an ADD on r0 through r15) followed *immediately in the - next instruction with no setup cost* a MUL on r7 (which actually meant - "do a parallel MUL on r7 through r11"). - - btw it's worth mentioning that you'd get scalar-vector and vector-scalar - implicitly by having one of the source register be vector-length 1 - (the default) and one being N > 1. but without having special opcodes - to do it. i *believe* (or more like "logically infer or deduce" as - i haven't got access to the spec) that that would result in a further - opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V. 
- - Also, Reduction *might* be possible by specifying that the destination be - a scalar (vector-length=1) whilst the source be a vector. However... it - would be an awful lot of work to go through *every single instruction* - in *every* Extension, working out which ones could be parallelised (ADD, - MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth - the effort? maybe. Would it result in huge complexity? probably. - Could an implementor just go "I ain't doing *that* as parallel! - let's make it virtual-parallelism (sequential reduction) instead"? - absolutely. So, now that I think it through, Simple-V (whatever) - covers Reduction as well. huh, that's a surprise. - - -> - Vector-length speculation (making it possible to vectorize some loops with -> unknown trip count) - I don't think this part of the proposal is written -> down yet. - - Now that _is_ an interesting concept. A little scary, i imagine, with - the possibility of putting a processor into a hard infinite execution - loop... :) - - -> Also, note the vector ISA consumes relatively little opcode space (all the -> arithmetic fits in 7/8ths of a major opcode). This is mainly because data -> type and size is a function of runtime configuration, rather than of opcode. - - yes. i love that aspect of V, i am a huge fan of polymorphism [1] - which is why i am keen to advocate that the same runtime principle be - extended to the rest of the RISC-V ISA [3] - - Yikes that's a lot. I'm going to need to pull this into the wiki to - make sure it's not lost. - -[1] inherent data type conversion: 25 years ago i designed a hypothetical -hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit -(escape-extended) opcodes and 2-bit (escape-extended) operands that -only required a fixed 8-bit instruction length. that relied heavily -on polymorphism and runtime size configurations as well. At the time -I thought it would have meant one HELL of a lot of CSRs... but then I -met RISC-V and was cured instantly of that delusion^Wmisapprehension :) - -[2] Interestingly if you then also add in the other aspect of Simple-V -(the data-size, which is effectively functionally orthogonal / identical -to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE -operations become byte / half-word / word augmenters of B-Ext's proposed -"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored -LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it -would get really REALLY interesting would be masked-packed-vectored -B-Ext BGS instructions. I can't even get my head fully round that, -which is a good sign that the combination would be *really* powerful :) - -[3] ok sadly maybe not the polymorphism, it's too complicated and I -think would be much too hard for implementors to easily "slide in" to an -existing non-Simple-V implementation.  i say that despite really *really* -wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some -fashion, for optimising 3D Graphics.  *sigh*. - -## TODO: instructions (based on Hwacha) V-Ext duplication analysis - -This is partly speculative due to lack of access to an up-to-date -V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing). However -basin an analysis instead on Hwacha, a cursory examination shows over -an **85%** duplication of V-Ext operand-related instructions when -compared to Simple-V on a standard RG64G base. Even Vector Fetch -is analogous to "zero-overhead loop". 
- -Exceptions are: - -* Vector Indexed Memory Instructions (non-contiguous) -* Vector Atomic Memory Instructions. -* Some of the Vector Arithmetic ops: MADD, MSUB, - VSRL, VSRA, VEIDX, VFIRST, VSGNJN, VFSGNJX and potentially more. -* Consensual Jump - -## TODO: sort - -> I suspect that the "hardware loop" in question is actually a zero-overhead -> loop unit that diverts execution from address X to address Y if a certain -> condition is met. - - not quite.  The zero-overhead loop unit interestingly would be at -an [independent] level above vector-length.  The distinctions are -as follows: - -* Vector-length issues *virtual* instructions where the register - operands are *specifically* altered (to cover a range of registers), - whereas zero-overhead loops *specifically* do *NOT* alter the operands - in *ANY* way. - -* Vector-length-driven "virtual" instructions are driven by *one* - and *only* one instruction (whether it be a LOAD, STORE, or pure - one/two/three-operand opcode) whereas zero-overhead loop units - specifically apply to *multiple* instructions. - -Where vector-length-driven "virtual" instructions might get conceptually -blurred with zero-overhead loops is LOAD / STORE.  In the case of LOAD / -STORE, to actually be useful, vector-length-driven LOAD / STORE should -increment the LOAD / STORE memory address to correspondingly match the -increment in the register bank.  example: - -* set vector-length for r0 to 4 -* issue RV32 LOAD from addr 0x1230 to r0 - -translates effectively to: - -* RV32 LOAD from addr 0x1230 to r0 -* ... -* ... -* RV32 LOAD from addr 0x123B to r3 - -# P-Ext ISA - -| Mnemonic | 16-bit Instruction | -| ------------------ | ------------------------- | -| ADD16 rt, ra, rb | add | -| RADD16 rt, ra, rb | Signed Halving add | -| URADD16 rt, ra, rb | Unsigned Halving add | -| KADD16 rt, ra, rb | Signed Saturating add | -| UKADD16 rt, ra, rb | Unsigned Saturating add | -| SUB16 rt, ra, rb | sub | -| RSUB16 rt, ra, rb | Signed Halving sub | - +implementation efforts, without "extra baggage". + +# CSRs + +There are a number of CSRs needed, which are used at the instruction +decode phase to re-interpret standard RV opcodes (a practice that has +precedent in the setting of MISA to enable / disable extensions). + +* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) +* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) +* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) +* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) +* Integer Register N is a Predication Register (note: a key-value store) +* Vector Length CSR (VSETVL, VGETVL) + +Notes: + +* for the purposes of LOAD / STORE, Integer Registers which are + marked as a Vector will result in a Vector LOAD / STORE. +* Vector Lengths are *not* the same as vsetl but are an integral part + of vsetl. +* Actual vector length is *multipled* by how many blocks of length + "bitwidth" may fit into an XLEN-sized register file. +* Predication is a key-value store due to the implicit referencing, + as opposed to having the predicate register explicitly in the instruction. + +## Predication CSR + +The Predication CSR is a key-value store indicating whether, if a given +destination register (integer or floating-point) is referred to in an +instruction, it is to be predicated. The first entry is whether predication +is enabled. The second entry is whether the register index refers to a +floating-point or an integer register. 
The third entry is the index
+of that register which is to be predicated (if referred to). The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6 | 5 | (4..0) | (4..0) |
+| ----- | - | - | ------- | ------- |
+| r0 | pren0 | i/f | regidx | predidx |
+| r1 | pren1 | i/f | regidx | predidx |
+| .. | pren.. | i/f | regidx | predidx |
+| r15 | pren15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    fp_pred_enabled[32];
+    int_pred_enabled[32];
+    for (i = 0; i < 16; i++)
+      if CSRpred[i].pren:
+        idx = CSRpred[i].regidx
+        predidx = CSRpred[i].predidx
+        if CSRpred[i].type == 0: # integer
+          int_pred_enabled[idx] = 1
+          int_pred_reg[idx] = predidx
+        else:
+          fp_pred_enabled[idx] = 1
+          fp_pred_reg[idx] = predidx
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+         vreg[rd][i] = op(vreg[rs1][i], vreg[rs2][i]);
+
+In Simple-V the predication register p does *not* appear in the
+instruction: it is obtained implicitly, by looking the destination
+register up in the (inverted) Predication CSR Table described above.
+
+> What does an ADD of two different-sized vectors do in simple-V?
+
+* if the two source operands are not the same, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+  than the destination, throw an exception.
+
+> And what about instructions like JALR?  
+> What does jumping to a vector do?
+
+* Throw an exception. Whether that actually results in spawning threads
+  as part of the trap-handling remains to be seen.
+
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes across multiple independent registers)
+  meaning that SRAM or 2R1W can be used for entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly directly specified in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+  not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+  but with the down-side that they're an all-or-nothing part of the Extension.
+  No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+  parallelisation can be carried out, followed by further parallel Lane-based
+  work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+  is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware.
It's an API in effect that's +designed to be slotted in to an existing implementation (just after +instruction decode) with minimum disruption and effort. + +* minus: the complexity of having to use register renames, OoO, VLIW, + register file cacheing, all of which has been done before but is a + pain +* plus: transparent re-use of existing opcodes as-is just indirectly + saying "this register's now a vector" which +* plus: means that future instructions also get to be inherently + parallelised because there's no "separate vector opcodes" +* plus: Compressed instructions may also be (indirectly) parallelised +* minus: the indirect nature of Simple-V means that setup (setting + a CSR register to indicate vector length, a separate one to indicate + that it is a predicate register and so on) means a little more setup + time than Alt-RVP or RVV's "direct and within the (longer) instruction" + approach. +* plus: shared register file meaning that, like Alt-RVP, complex + operations not suited to parallelisation may be carried out interleaved + between parallelised instructions *without* requiring data to be dropped + down to memory and back (into a separate vectorised register engine). +* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register + files means that huge parallel workloads would use up considerable + chunks of the register file. However in the case of RV64 and 32-bit + operations, that effectively means 64 slots are available for parallel + operations. +* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to + be added, yet the instruction opcodes remain unchanged (and still appear + to be parallel). consistent "API" regardless of actual internal parallelism: + even an in-order single-issue implementation with a single ALU would still + appear to have parallel vectoristion. +* hard-to-judge: if actual inherent underlying ALU parallelism is added it's + hard to say if there would be pluses or minuses (on die area). At worse it + would be "no worse" than existing register renaming, OoO, VLIW and register + file cacheing schemes. + +## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft) + +RVV is extremely well-designed and has some amazing features, including +2D reorganisation of memory through LOAD/STORE "strides". + +* plus: regular predictable workload means that implementations may + streamline effects on L1/L2 Cache. +* plus: regular and clear parallel workload also means that lanes + (similar to Alt-RVP) may be used as an implementation detail, + using either SRAM or 2R1W registers. +* plus: separate engine with no impact on the rest of an implementation +* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse + really feasible. +* minus: no ISA abstraction or re-use either: additions to other Extensions + do not gain parallelism, resulting in prolific duplication of functionality + inside RVV *and out*. +* minus: when operations require a different approach (scalar operations + using the standard integer or FP regfile) an entire vector must be + transferred out to memory, into standard regfiles, then back to memory, + then back to the vector unit, this to occur potentially multiple times. +* minus: will never fit into Compressed instruction space (as-is. May + be able to do so if "indirect" features of Simple-V are partially adopted). +* plus-and-slight-minus: extended variants may address up to 256 + vectorised registers (requires 48/64-bit opcodes to do it). 
+* minus-and-partial-plus: separate engine plus complexity increases + implementation time and die area, meaning that adoption is likely only + to be in high-performance specialist supercomputing (where it will + be absolutely superb). + +## Traditional SIMD + +The only really good things about SIMD are how easy it is to implement and +get good performance. Unfortunately that makes it quite seductive... + +* plus: really straightforward, ALU basically does several packed operations + at once. Parallelism is inherent at the ALU, making the addition of + SIMD-style parallelism an easy decision that has zero significant impact + on the rest of any given architectural design and layout. +* plus (continuation): SIMD in simple in-order single-issue designs can + therefore result in superb throughput, easily achieved even with a very + simple execution model. +* minus: ridiculously complex setup and corner-cases that disproportionately + increase instruction count on what would otherwise be a "simple loop", + should the number of elements in an array not happen to exactly match + the SIMD group width. +* minus: getting data usefully out of registers (if separate regfiles + are used) means outputting to memory and back. +* minus: quite a lot of supplementary instructions for bit-level manipulation + are needed in order to efficiently extract (or prepare) SIMD operands. +* minus: MASSIVE proliferation of ISA both in terms of opcodes in one + dimension and parallelism (width): an at least O(N^2) and quite probably + O(N^3) ISA proliferation that often results in several thousand + separate instructions. all requiring separate and distinct corner-case + algorithms! +* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of + 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction. + For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires + four separate and distinct instructions: one for (r1:low r2:high), + one for (r1:high r2:low), one for (r1:high r2:high) and one for + (r1:low r2:low) *per function*. +* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch + between operand and result bit-widths. In combination with high/low + proliferation the situation is made even worse. +* minor-saving-grace: some implementations *may* have predication masks + that allow control over individual elements within the SIMD block. + +# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals + +This section compares the various parallelism proposals as they stand, +*against* traditional SIMD as opposed to *alongside* SIMD. In other words, +the question is asked "How can each of the proposals effectively implement +(or replace) SIMD, and how effective would they be"? + +## [[alt_rvp]] + +* Alt-RVP would not actually replace SIMD but would augment it: just as with + a SIMD architecture where the ALU becomes responsible for the parallelism, + Alt-RVP ALUs would likewise be so responsible... with *additional* + (lane-based) parallelism on top. +* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by + at least one dimension are avoided (architectural upgrades introducing + 128-bit then 256-bit then 512-bit variants of the exact same 64-bit + SIMD block) +* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation + of instructions as SIMD, albeit not quite as badly (due to Lanes). 
+* In the same discussion for Alt-RVP, an additional proposal was made to + be able to subdivide the bits of each register lane (columns) down into + arbitrary bit-lengths (RGB 565 for example). +* A recommendation was given instead to make the subdivisions down to 32-bit, + 16-bit or even 8-bit, effectively dividing the registerfile into + Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane + "swapping" instructions were then introduced, some of the disadvantages + of SIMD could be mitigated. + +## RVV + +* RVV is designed to replace SIMD with a better paradigm: arbitrary-length + parallelism. +* However whilst SIMD is usually designed for single-issue in-order simple + DSPs with a focus on Multimedia (Audio, Video and Image processing), + RVV's primary focus appears to be on Supercomputing: optimisation of + mathematical operations that fit into the OpenCL space. +* Adding functions (operations) that would normally fit (in parallel) + into a SIMD instruction requires an equivalent to be added to the + RVV Extension, if one does not exist. Given the specialist nature of + some SIMD instructions (8-bit or 16-bit saturated or halving add), + this possibility seems extremely unlikely to occur, even if the + implementation overhead of RVV were acceptable (compared to + normal SIMD/DSP-style single-issue in-order simplicity). + +## Simple-V + +* Simple-V borrows hugely from RVV as it is intended to be easy to + topologically transplant every single instruction from RVV (as + designed) into Simple-V equivalents, with *zero loss of functionality + or capability*. +* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP" + Extension which contained the basic primitives (non-parallelised + 8, 16 or 32-bit SIMD operations) inherently *become* parallel, + automatically. +* Additionally, standard operations (ADD, MUL) that would normally have + to have special SIMD-parallel opcodes added need no longer have *any* + of the length-dependent variants (2of 32-bit ADDs in a 64-bit register, + 4of 32-bit ADDs in a 128-bit register) because Simple-V takes the + *standard* RV opcodes (present and future) and automatically parallelises + them. +* By inheriting the RVV feature of arbitrary vector-length, then just as + with RVV the corner-cases and ISA proliferation of SIMD is avoided. +* Whilst not entirely finalised, registers are expected to be + capable of being subdivided down to an implementor-chosen bitwidth + in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8] + and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can + choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit + ALUs that perform twin 8-bit operations as they see fit, or anything + else including no subdivisions at all. +* Even though implementors have that choice even to have full 64-bit + (with RV64) SIMD, they *must* provide predication that transparently + switches off appropriate units on the last loop, thus neatly fitting + underlying SIMD ALU implementations *into* the arbitrary vector-length + RVV paradigm, keeping the uniform consistent API that is a key strategic + feature of Simple-V. 
+* With Simple-V fitting into the standard register files, certain classes
+  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+  can be done by applying *Parallelised* Bit-manipulation operations
+  followed by parallelised *straight* versions of element-to-element
+  arithmetic operations, even if the bit-manipulation operations require
+  changing the bitwidth of the "vectors" to do so. Predication can
+  be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit - is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
+
+# Implementing V on top of Simple-V
+
+* Number of Offset CSRs extends from 2
+* Extra register file: vector-file
+* Setup of Vector length and bitwidth CSRs now can specify vector-file
+  as well as integer or float file.
+* Extend CSR tables (bitwidth) with extra bits
+* TODO
+
+# Implementing P (renamed to DSP) on top of Simple-V
+
+* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
+  (caveat: anything not specified drops through to software-emulation / traps)
+* TODO
+
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
+
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+    register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+       /* note that this is ADD, not PADD */
+       int i, id, irs1, irs2;
+       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+       # also destination makes no sense as a scalar but what the hell...
+       for (i = 0, id=0, irs1=0, irs2=0; i < VL; i++)
+       {
+          if (CSRpredicate[predr][i]) # predication uses intregs
+             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+          # advance each index by one if vectorised, by zero if scalar
+          id   += CSRreg_is_vectorised[rd];
+          irs1 += CSRreg_is_vectorised[rs1];
+          irs2 += CSRreg_is_vectorised[rs2];
+       }
+    }
+
+## Retro-fitting Predication into branch-explicit ISA
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 .......
8 | 7 | 6 ....... 0 | +imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | + 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | + offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | +"""]] + +This would become: + +[[!table data=""" +31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 | +imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | +1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | +reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | +"""]] + +Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, +with the interesting side-effect that there is space within what is presently +the "immediate offset" field to reinterpret that to add in not only a bit +field to distinguish between floating-point compare and integer compare, +not only to add in a second source register, but also use some of the bits as +a predication target as well. + +[[!table data=""" +15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | + funct3 | imm | rs10 | imm | op | + 3 | 3 | 3 | 5 | 2 | + C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | +"""]] + +Now uses the CS format: + +[[!table data=""" +15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 | + funct3 | imm | rs10 | imm | | op | + 3 | 3 | 3 | 2 | 3 | 2 | + C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 | +"""]] + +Bit 6 would be decoded as "operation refers to Integer or Float" including +interpreting src1 and src2 accordingly as outlined in Table 12.2 of the +"C" Standard, version 2.0, +whilst Bit 5 would allow the operation to be extended, in combination with +funct3 = 110 or 111: a combination of four distinct (predicated) comparison +operators. In both floating-point and integer cases those could be +EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2). + +## Register reordering + +### Register File + +| Reg Num | Bits | +| ------- | ---- | +| r0 | (32..0) | +| r1 | (32..0) | +| r2 | (32..0) | +| r3 | (32..0) | +| r4 | (32..0) | +| r5 | (32..0) | +| r6 | (32..0) | +| r7 | (32..0) | +| .. | (32..0) | +| r31| (32..0) | + +### Vectorised CSR + +May not be an actual CSR: may be generated from Vector Length CSR: +single-bit is less burdensome on instruction decode phase. + +| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | +| - | - | - | - | - | - | - | - | +| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | + +### Vector Length CSR + +| Reg Num | (3..0) | +| ------- | ---- | +| r0 | 2 | +| r1 | 0 | +| r2 | 1 | +| r3 | 1 | +| r4 | 3 | +| r5 | 0 | +| r6 | 0 | +| r7 | 1 | + +### Virtual Register Reordering + +This example assumes the above Vector Length CSR table + +| Reg Num | Bits (0) | Bits (1) | Bits (2) | +| ------- | -------- | -------- | -------- | +| r0 | (32..0) | (32..0) | +| r2 | (32..0) | +| r3 | (32..0) | +| r4 | (32..0) | (32..0) | (32..0) | +| r7 | (32..0) | + +### Bitwidth Virtual Register Reordering + +This example goes a little further and illustrates the effect that a +bitwidth CSR has been set on a register. Preconditions: + +* RV32 assumed +* CSRintbitwidth[2] = 010 # integer r2 is 16-bit +* CSRintvlength[2] = 3 # integer r2 is a vector of length 3 +* vsetl rs1, 5 # set the vector length to 5 + +This is interpreted as follows: + +* Given that the context is RV32, ELEN=32. 
+* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2 +* Therefore the actual vector length is up to *six* elements +* However vsetl sets a length 5 therefore the last "element" is skipped + +So when using an operation that uses r2 as a source (or destination) +the operation is carried out as follows: + +* 16-bit operation on r2(15..0) - vector element index 0 +* 16-bit operation on r2(31..16) - vector element index 1 +* 16-bit operation on r3(15..0) - vector element index 2 +* 16-bit operation on r3(31..16) - vector element index 3 +* 16-bit operation on r4(15..0) - vector element index 4 +* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5 + +Predication has been left out of the above example for simplicity, however +predication is ANDed with the latter stages (vsetl not equal to maximum +capacity). + +Note also that it is entirely an implementor's choice as to whether to have +actual separate ALUs down to the minimum bitwidth, or whether to have something +more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD +operations carried out 32-bits at a time is perfectly acceptable, as is +8-bit SIMD operations carried out 16-bits at a time requiring two ALUs). +Regardless of the internal parallelism choice, *predication must +still be respected*, making Simple-V in effect the "consistent public API". + +vew may be one of the following (giving a table "bytestable", used below): + +| vew | bitwidth | +| --- | -------- | +| 000 | default | +| 001 | 8 | +| 010 | 16 | +| 011 | 32 | +| 100 | 64 | +| 101 | 128 | +| 110 | rsvd | +| 111 | rsvd | + +Pseudocode for vector length taking CSR SIMD-bitwidth into account: + + vew = CSRbitwidth[rs1] + if (vew == 0) + bytesperreg = (XLEN/8) # or FLEN as appropriate + else: + bytesperreg = bytestable[vew] # 1 2 4 8 16 + simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate + vlen = CSRvectorlen[rs1] * simdmult + +To index an element in a register rnum where the vector element index is i: + + function regoffs(rnum, i): + regidx = floor(i / simdmult) # integer-div rounded down + byteidx = i % simdmult # integer-remainder + return rnum + regidx, # actual real register + byteidx * 8, # low + byteidx * 8 + (vew-1), # high + +### Example Instruction translation: + +Instructions "ADD r2 r4 r4" would result in three instructions being +generated and placed into the FILO: + +* ADD r2 r4 r4 +* ADD r2 r5 r5 +* ADD r2 r6 r6 + +### Insights + +SIMD register file splitting still to consider. For RV64, benefits of doubling +(quadrupling in the case of Half-Precision IEEE754 FP) the apparent +size of the floating point register file to 64 (128 in the case of HP) +seem pretty clear and worth the complexity. + +64 virtual 32-bit F.P. registers and given that 32-bit FP operations are +done on 64-bit registers it's not so conceptually difficult.  May even +be achieved by *actually* splitting the regfile into 64 virtual 32-bit +registers such that a 64-bit FP scalar operation is dropped into (r0.H +r0.L) tuples.  Implementation therefore hidden through register renaming. + +Implementations intending to introduce VLIW, OoO and parallelism +(even without Simple-V) would then find that the instructions are +generated quicker (or in a more compact fashion that is less heavy +on caches). 
Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+## Analysis of CSR decoding on latency
+
+It could indeed have been logically deduced (or expected) that there
+would be additional decode latency in this proposal: in overloading
+the opcodes to have different meanings, there is guaranteed
+to be some state, somewhere, directly related to registers.
+
+There are several cases:
+
+* All operands vector-length=1 (scalars), all operands
+  packed-bitwidth="default": instructions are passed through direct as if
+  Simple-V did not exist. Simple-V is, in effect, completely disabled.
+* At least one operand vector-length > 1, all operands
+  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
+  virtual parallelism looping may be activated.
+* All operands vector-length=1 (scalars), at least one
+  operand packed-bitwidth != default: degenerate case of SIMD,
+  implementation-specific complexity here (packed decode before ALUs or
+  *IN* ALUs)
+* At least one operand vector-length > 1, at least one operand
+  packed-bitwidth != default: parallel vector ALUs (if any)
+  placed on "alert", virtual parallelism looping may be activated,
+  implementation-specific SIMD complexity kicks in (packed decode before
+  ALUs or *IN* ALUs).
+
+Bear in mind that the proposal includes that the decision whether
+to parallelise in hardware or whether to virtual-parallelise (to
+dramatically simplify compilers and also not to run into the SIMD
+instruction proliferation nightmare) *or* a transparent combination
+of both, be done on a *per-operand basis*, so that implementors can
+specifically choose to create an application-optimised implementation
+that they believe (or know) will sell extremely well, without having
+"Extra Standards-Mandated Baggage" that would otherwise blow their area
+or power budget completely out the window.
+
+Additionally, two possible CSR schemes have been proposed, in order to
+greatly reduce CSR space:
+
+* per-register CSRs (vector-length and packed-bitwidth)
+* a smaller number of CSRs with the same information but with an *INDEX*
+  specifying WHICH register in one of three regfiles (vector, fp, int)
+  the length and bitwidth applies to.
+
+(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
+
+In addition, LOAD/STORE has its own associated proposed CSRs that
+mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
+V (and Hwacha).
+
+Also bear in mind that, for reasons of simplicity for implementors,
+I was coming round to the idea of permitting implementors to choose
+exactly which bitwidths they would like to support in hardware and which
+to allow to fall through to software-trap emulation.
+ +So the question boils down to: + +* whether either (or both) of those two CSR schemes have significant + latency that could even potentially require an extra pipeline decode stage +* whether there are implementations that can be thought of which do *not* + introduce significant latency +* whether it is possible to explicitly (through quite simply + disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1, + all-simd-bitwidths=default) switch OFF any decoding, perhaps even to + the extreme of skipping an entire pipeline stage (if one is needed) +* whether packed bitwidth and associated regfile splitting is so complex + that it should definitely, definitely be made mandatory that implementors + move regfile splitting into the ALU, and what are the implications of that +* whether even if that *is* made mandatory, is software-trapped + "unsupported bitwidths" still desirable, on the basis that SIMD is such + a complete nightmare that *even* having a software implementation is + better, making Simple-V have more in common with a software API than + anything else. + +Whilst the above may seem to be severe minuses, there are some strong +pluses: + +* Significant reduction of V's opcode space: over 85%. +* Smaller reduction of P's opcode space: around 10%. +* The potential to use Compressed instructions in both Vector and SIMD + due to the overloading of register meaning (implicit vectorisation, + implicit packing) +* Not only present but also future extensions automatically gain parallelism. +* Already mentioned but worth emphasising: the simplification to compiler + writers and assembly-level writers of having the same consistent ISA + regardless of whether the internal level of parallelism (number of + parallel ALUs) is only equal to one ("virtual" parallelism), or is + greater than one, should not be underestimated. + +## Reducing Register Bank porting + +This looks quite reasonable. + + +The main details are outlined on page 4.  They propose a 2-level register +cache hierarchy, note that registers are typically only read once, that +you never write back from upper to lower cache level but always go in a +cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose +a scheme where you look ahead by only 2 instructions to determine which +registers to bring into the cache. + +The nice thing about a vector architecture is that you *know* that +*even more* registers are going to be pulled in: Hwacha uses this fact +to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough +by *introducing* deliberate latency into the execution phase. # References @@ -583,3 +1327,14 @@ translates effectively to: Figure 2 P17 and Section 3 on P16. * Hwacha * Hwacha +* Vector Workshop +* Predication +* Branch Divergence +* Life of Triangles (3D) +* Videocore-IV +* Discussion proposing CSRs that change ISA definition + +* Zero-overhead loops +* Multi-ported VLIW Register File Implementation +* Fast context save/restore proposal +* Register File Bank Cacheing