X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=5567b3d8bbee22a321a7887dd866fa1466d58f04;hb=ccfc957b55d0163099cb66301bcfe30ca1b8e6b4;hp=fe05c08222afee8d81c19f384e1bc28b38c2e8bd;hpb=14487c3a9f8e0bed801a8abdd43711d4b3037435;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index fe05c0822..5567b3d8b 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,1424 +1,12 @@
-# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
-
-Key insight: Simple-V is intended as an abstraction layer to provide
-a consistent "API" to parallelisation of existing *and future* operations.
-*Actual* internal hardware-level parallelism is *not* required, such
-that Simple-V may be viewed as providing a "compact" or "consolidated"
-means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FIFO), pending execution.
-
-*Actual* parallelism, if added independently of Simple-V in the form
-of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit *if*
-Simple-V was added on top.
-
-[[!toc ]]
-
-# Introduction
-
-This proposal exists so as to be able to satisfy several disparate
-requirements: power-conscious, area-conscious, and performance-conscious
-designs all pull an ISA and its implementation in different conflicting
-directions, as do the specific intended uses for any given implementation.
-
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposal,
-whilst each extremely powerful in their own right and clearly desirable,
-are also:
-
-* Clearly independent in their origins (AndesStar v3 and Cray respectively)
-  so need work to adapt to the RISC-V ethos and paradigm
-* Sufficiently large so as to make adoption (and exploration for
-  analysis and review purposes) prohibitively expensive
-* Both contain partial duplication of pre-existing RISC-V instructions
-  (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
-* Both require that their respective parallelism paradigm be implemented
-  along-side and integral to their respective functionality *or not at all*.
-* Both independently have methods for introducing parallelism that
-  could, if separated, benefit
-  *other areas of RISC-V not just DSP or Floating-point respectively*.
-
-Therefore it makes a huge amount of sense to have a means and method
-of introducing instruction parallelism in a flexible way that provides
-implementors with the option to choose exactly where they wish to offer
-performance improvements and where they wish to optimise for power
-and/or area (and if that can be offered even on a per-operation basis that
-would provide even more flexibility).
-
-Additionally it makes sense to *split out* the parallelism inherent within
-each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, could be added *on top* of
-this proposal, to topologically provide the exact same functionality of
-each of P and V. Each of P and V then can focus on providing the best
-operations possible for their respective target areas, without being
-hugely concerned about the actual parallelism.
-
-Furthermore, an additional goal of this proposal is to reduce the number
-of opcodes utilised by each of P and V as they currently stand, leveraging
-existing RISC-V opcodes where possible, and also potentially allowing
-P and V to make use of Compressed Instructions as a result.
-
-**TODO**: propose overflow registers be actually one of the integer regs
-(flowing to multiple regs).
-
-**TODO**: propose "mask" (predication) registers likewise. The combination
-with standard RV instructions and overflow registers is extremely powerful:
-see Aspex ASP.
-
-# Analysis and discussion of Vector vs SIMD
-
-There are five combined areas between the two proposals that help with
-parallelism without over-burdening the ISA with a huge proliferation of
-instructions:
-
-* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
-* Implicit vs fixed instruction bit-width (integral to instruction or not)
-* Implicit vs explicit type-conversion (compounded on bit-width)
-* Implicit vs explicit inner loops.
-* Masks / tagging (selecting/preventing certain indexed elements from execution)
-
-The pros and cons of each are discussed and analysed below.
-
-## Fixed vs variable parallelism length
-
-In David Patterson and Andrew Waterman's analysis of SIMD and Vector
-ISAs, the conclusion comes out clearly in favour of (effectively) variable
-length SIMD. As SIMD is a fixed width, typically 4, 8 or in extreme cases
-16 or 32 simultaneous operations, the setup, teardown and corner-cases of SIMD
-are extremely burdensome except for applications whose requirements
-*specifically* match the *precise and exact* depth of the SIMD engine.
-
-Thus, SIMD, no matter what width is chosen, is never going to be acceptable
-for general-purpose computation, and in the context of developing a
-general-purpose ISA, is never going to satisfy 100 percent of implementors.
-
-To explain this further: for increased workloads over time, as the
-performance requirements increase for new target markets, implementors
-choose to extend the SIMD width (so as to again avoid mixing parallelism
-into the instruction issue phases: the primary "simplicity" benefit of
-SIMD in the first place), with the result that the entire opcode space
-effectively doubles with each new SIMD width that's added to the ISA.
-
-That basically leaves "variable-length vector" as the clear *general-purpose*
-winner, at least in terms of greatly simplifying the instruction set,
-reducing the number of instructions required for any given task, and thus
-reducing power consumption for the same.
-
-## Implicit vs fixed instruction bit-width
-
-SIMD again has a severe disadvantage here, over Vector: huge proliferation
-of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
-have to then have operations *for each and between each*. It gets very
-messy, very quickly.
-
-The V-Extension on the other hand proposes to set the bit-width of
-future instructions on a per-register basis, such that subsequent instructions
-involving that register are *implicitly* of that particular bit-width until
-otherwise changed or reset.
-
-This has some extremely useful properties, without being particularly
-burdensome to implementations, given that instruction decode already has
-to direct the operation to a correctly-sized width ALU engine, anyway.
-
-Not least: in places where an ISA was previously constrained (due, for
-whatever reason, to limitations of the available operand space),
-implicit bit-width allows the meaning of certain operations to be
-type-overloaded *without* pollution or alteration of frozen and immutable
-instructions, in a fully backwards-compatible fashion.
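-
-As a minimal sketch of the idea (illustrative pseudocode only, reusing the
-CSR names from the worked example later in this document):
-
-    CSRintbitwidth[2] = 010  # integer register r2 is now "16-bit typed"
-    CSRintvlength[2]  = 3    # ... and a vector of length 3
-    # any subsequent standard ADD involving r2 is now implicitly a
-    # sequence of 16-bit element operations (on RV32, two elements
-    # per 32-bit register), with no new opcode having been added.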
-
-## Implicit and explicit type-conversion
-
-The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
-deal with over-population of instructions, such that type-casting from
-integer (and floating point) of various sizes is automatically inferred
-due to "type tagging" that is set with a special instruction. A register
-will be *specifically* marked as "16-bit Floating-Point" and, if added
-to an operand that is specifically tagged as "32-bit Integer", an implicit
-type-conversion will take place *without* requiring that type-conversion
-to be explicitly done with its own separate instruction.
-
-However, implicit type-conversion is not only quite burdensome to
-implement (explosion of inferred type-to-type conversion) but also is
-never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration. Each new type results in
-an increased O(N^2) conversion space and, as anyone who has examined
-python's source code (which has built-in polymorphic type-conversion)
-knows, the task is more complex than it first seems.
-
-Overall, type-conversion is generally best left to explicit
-type-conversion instructions, or in definite specific use-cases left to
-be part of an actual instruction (DSP or FP).
-
-## Zero-overhead loops vs explicit loops
-
-The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
-contains an extremely interesting feature: zero-overhead loops. This
-proposal would basically allow an inner loop of instructions to be
-repeated a fixed number of times, without explicit branch overhead.
-
-Its specific advantage over explicit loops is that the pipeline in a DSP
-can potentially be kept completely full *even in an in-order single-issue
-implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in
-order to keep ALU pipelines 100% occupied.
-
-By bringing that capability in, this proposal could offer a way to increase
-pipeline activity even in simpler implementations in the one key area
-which really matters: the inner loop.
-
-However when looking at much more comprehensive schemes such as
-"A portable specification of zero-overhead loop control hardware
-applied to embedded processors" (ZOLC), optimising only the single
-inner loop seems inadequate, tending to suggest that ZOLC may be
-better off being proposed as an entirely separate Extension.
-
-## Mask and Tagging (Predication)
-
-Tagging (aka Masks aka Predication) is a pseudo-method of implementing
-simplistic branching in a parallel fashion, by allowing execution on
-elements of a vector to be switched on or off depending on the results
-of prior operations in the same array position.
-
-The reason for considering this is simple: by *definition* it
-is not possible to perform individual parallel branches in a SIMD
-(Single-Instruction, **Multiple**-Data) context. Branches (modifying
-of the Program Counter) will result in *all* parallel data having
-a different instruction executed on it: that's just the definition of
-SIMD, and it is simply unavoidable.
-
-So these are the ways in which conditional execution may be implemented:
-
-* explicit compare and branch: BNE x, y -> offs would jump offs
-  instructions if x was not equal to y
-* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
-  (or sometimes explicitly) goes into a "tag" (mask) register
-
-The first of these is a "normal" branch method, which is flat-out impossible
-to parallelise without look-ahead and effectively rewriting instructions.
-This would defeat the purpose of RISC.
-
-The latter two are where parallelism becomes easy to do without complexity:
-every operation is modified to be "conditionally executed" (in an explicit
-way directly in the instruction format *or* implicitly).
-
-RVV (Vector-Extension) proposes to have *explicit* storing of the compare
-in a tag/mask register, and to *explicitly* have every vector operation
-*require* that its operation be "predicated" on the bits within an
-explicitly-named tag/mask register.
-
-SIMD (P-Extension) has not yet published precise documentation on what its
-schema is to be: there is however verbal indication at the time of writing
-that:
-
-> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
-> be executed using the same compare ALU logic for the base ISA with some
-> minor modifications to handle smaller data types. The function will not
-> be duplicated.
-
-This is an *implicit* form of predication as the base RV ISA does not have
-condition-codes or predication. By adding a CSR it becomes possible
-to also tag certain registers as "predicated if referenced as a destination".
-Example:
-
-    // in future operations from now on, if r0 is the destination use r5 as
-    // the PREDICATION register
-    SET_IMPLICIT_CSRPREDICATE r0, r5
-    // store the compares in r5 as the PREDICATION register
-    CMPEQ8 r5, r1, r2
-    // r0 is used here. ah ha! that means it's predicated using r5!
-    ADD8 r0, r1, r3
-
-With enough registers (and in RISC-V there are enough registers) some fairly
-complex predication can be set up and yet still execute without significant
-stalling, even in a simple non-superscalar architecture.
-
-(For details on how Branch Instructions would be retro-fitted to indirectly
-predicated equivalents, see Appendix)
-
-## Conclusions
-
-The above sections outlined five different areas in which parallel
-instruction execution has closely and loosely inter-related implications
-for the ISA and for implementors. The pluses and minuses came out as
-follows:
-
-* Fixed vs variable parallelism: variable
-* Implicit (indirect) vs fixed (integral) instruction bit-width: indirect
-* Implicit vs explicit type-conversion: explicit
-* Implicit vs explicit inner loops: implicit but best done separately
-* Tag or no-tag: Complex but highly beneficial
-
-In particular:
-
-* variable-length vectors came out on top because of the high setup, teardown
-  and corner-cases associated with the fixed width of SIMD.
-* Implicit bit-width helps to extend the ISA to escape from
-  former limitations and restrictions (in a backwards-compatible fashion),
-  whilst also leaving implementors free to simplify implementations
-  by using actual explicit internal parallelism.
-* Implicit (zero-overhead) loops provide a means to keep pipelines
-  potentially 100% occupied in a single-issue in-order implementation
-  i.e. *without* requiring a super-scalar or out-of-order architecture,
-  but doing a proper, full job (ZOLC) is an entirely different matter.
-
-Constructing a SIMD/Simple-Vector proposal based around four of these five
-requirements would therefore seem to be a logical thing to do.
-
-# Instruction Format
-
-The instruction format for Simple-V does not actually have *any* compare
-operations, *any* arithmetic, floating-point or memory instructions.
-Instead it *overloads* pre-existing branch operations into predicated
-variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth. This includes Compressed instructions.
-
-* For analysis of RVV see [[v_comparative_analysis]] which begins to
-  outline topologically-equivalent mappings of instructions
-* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
-  for format of Branch opcodes.
-
-**TODO**: *analyse and decide whether the implicit nature of predication
-as proposed is or is not a lot of hassle, and if explicit prefixes are
-a better idea instead. Parallelism therefore effectively may end up
-as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
-with some opportunities to use Compressed bringing it down to 48.
-Also to consider is whether one or both of the last two remaining Compressed
-instruction codes in Quadrant 1 could be used as a parallelism prefix,
-bringing parallelised opcodes down to 32-bit and having the benefit of
-being explicit.*
-
-## Branch Instruction:
-
-[[!table data="""
-31      | 30 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12] | imm[10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
-1       | 6        | 5        | 5     | 3      | 4        | 1       | 7       |
-I/F     | reserved | src2     | src1  | BPR    | predicate rs3     || BRANCH  |
-0       | reserved | src2     | src1  | 000    | predicate rs3     || BEQ     |
-0       | reserved | src2     | src1  | 001    | predicate rs3     || BNE     |
-0       | reserved | src2     | src1  | 010    | predicate rs3     || rsvd    |
-0       | reserved | src2     | src1  | 011    | predicate rs3     || rsvd    |
-0       | reserved | src2     | src1  | 100    | predicate rs3     || BLT     |
-0       | reserved | src2     | src1  | 101    | predicate rs3     || BGE     |
-0       | reserved | src2     | src1  | 110    | predicate rs3     || BLTU    |
-0       | reserved | src2     | src1  | 111    | predicate rs3     || BGEU    |
-1       | reserved | src2     | src1  | 000    | predicate rs3     || FEQ     |
-1       | reserved | src2     | src1  | 001    | predicate rs3     || FNE     |
-1       | reserved | src2     | src1  | 010    | predicate rs3     || rsvd    |
-1       | reserved | src2     | src1  | 011    | predicate rs3     || rsvd    |
-1       | reserved | src2     | src1  | 100    | predicate rs3     || FLT     |
-1       | reserved | src2     | src1  | 101    | predicate rs3     || FLE     |
-1       | reserved | src2     | src1  | 110    | predicate rs3     || rsvd    |
-1       | reserved | src2     | src1  | 111    | predicate rs3     || rsvd    |
-"""]]
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-         preg[pd][i] = cmp(s1 ? reg[src1+i] : reg[src1],
-                           s2 ? reg[src2+i] : reg[src2]);
-
-With the vector-length lookups added, where s1 and s2 indicate whether
-each source register has been marked as a vector, this transliterates
-in Simple-V to:
-
-    s1 = CSRvectorlen[src1] > 1;
-    s2 = CSRvectorlen[src2] > 1;
-    for (int i=0; i<vl; ++i)
-       preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
-                          s2 ? reg[src2+i] : reg[src2]);
-
-# CSRs
-
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
-
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multiplied* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. The first entry is whether predication
-is enabled. The second entry is whether the register index refers to a
-floating-point or an integer register. The third entry is the index
-of that register which is to be predicated (if referred to). The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6      | 5   | (4..0) | (4..0)  |
-| ----- | ------ | --- | ------ | ------- |
-| r0    | pren0  | i/f | regidx | predidx |
-| r1    | pren1  | i/f | regidx | predidx |
-| ..    | pren.. | i/f | regidx | predidx |
-| r15   | pren15 | i/f | regidx | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    fp_pred_enabled[32];
-    int_pred_enabled[32];
-    for (i = 0; i < 16; i++)
-       if CSRpred[i].pren:
-          idx = CSRpred[i].regidx
-          predidx = CSRpred[i].predidx
-          if CSRpred[i].type == 0: # integer
-             int_pred_enabled[idx] = 1
-             int_pred_reg[idx] = predidx
-          else:
-             fp_pred_enabled[idx] = 1
-             fp_pred_reg[idx] = predidx
-
-So when an operation is to be predicated, it is the internal state that
-is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i<vl; ++i)
-      if ([!]preg[p][i])
-         (d ? vreg[rd][i] : sreg[rd]) =
-            iop(s1 ? vreg[rs1][i] : sreg[rs1],
-                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
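-
-A minimal sketch (illustrative only: raise_illegal_instruction is a
-placeholder name, and CSRvectorlen is the state described above) of how
-a decode stage might enforce the two length rules:
-
-    // sources of differing lengths: throw an exception
-    if (CSRvectorlen[rs1] != CSRvectorlen[rs2])
-        raise_illegal_instruction();
-    // vector destination shorter than the (vector) source: exception
-    if (CSRvectorlen[rd] > 1 && CSRvectorlen[rs1] > CSRvectorlen[rd])
-        raise_illegal_instruction();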
-
-# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
-
-This section compares the various parallelism proposals as they stand,
-including traditional SIMD, in terms of features, ease of implementation,
-complexity, flexibility, and die area.
-
-## [[alt_rvp]]
-
-Primary benefit of Alt-RVP is the simplicity with which parallelism
-may be introduced (effective multiplication of regfiles and associated ALUs).
-
-* plus: the simplicity of the lanes (combined with the regularity of
-  allocating identical opcodes to multiple independent registers) meaning
-  that SRAM or 2R1W can be used for the entire regfile (potentially).
-* minus: a more complex instruction set where the parallelism is much
-  more explicitly directly specified in the instruction and
-* minus: if you *don't* have an explicit instruction (opcode) and you
-  need one, the only place it can be added is... in the vector unit and
-* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
-  not useable or accessible in other Extensions.
-* plus-and-minus: Lanes may be utilised for high-speed context-switching
-  but with the down-side that they're an all-or-nothing part of the Extension.
-  No Alt-RVP: no fast register-bank switching.
-* plus: Lane-switching would mean that complex operations not suited to
-  parallelisation can be carried out, followed by further parallel Lane-based
-  work, without moving register contents down to memory (and back)
-* minus: Access to registers across multiple lanes is challenging. "Solution"
-  is to drop data into memory and immediately back in again (like MMX).
-
-## Simple-V
-
-Primary benefit of Simple-V is the OO abstraction of parallel principles
-from actual (internal) parallel hardware. It's an API in effect that's
-designed to be slotted into an existing implementation (just after
-instruction decode) with minimum disruption and effort.
-
-* minus: the complexity of having to use register renames, OoO, VLIW,
-  register file cacheing, all of which has been done before but is a
-  pain
-* plus: transparent re-use of existing opcodes as-is just indirectly
-  saying "this register's now a vector" which
-* plus: means that future instructions also get to be inherently
-  parallelised because there's no "separate vector opcodes"
-* plus: Compressed instructions may also be (indirectly) parallelised
-* minus: the indirect nature of Simple-V means that setup (setting
-  a CSR register to indicate vector length, a separate one to indicate
-  that it is a predicate register and so on) means a little more setup
-  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
-  approach.
-* plus: shared register file meaning that, like Alt-RVP, complex
-  operations not suited to parallelisation may be carried out interleaved
-  between parallelised instructions *without* requiring data to be dropped
-  down to memory and back (into a separate vectorised register engine).
-* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
-  files means that huge parallel workloads would use up considerable
-  chunks of the register file. However in the case of RV64 and 32-bit
-  operations, that effectively means 64 slots are available for parallel
-  operations.
-* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
-  be added, yet the instruction opcodes remain unchanged (and still appear
-  to be parallel): a consistent "API" regardless of actual internal parallelism.
-  Even an in-order single-issue implementation with a single ALU would still
-  appear to have parallel vectorisation.
-* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
-  hard to say if there would be pluses or minuses (on die area). At worst it
-  would be "no worse" than existing register renaming, OoO, VLIW and register
-  file cacheing schemes.
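-
-As a minimal sketch of the "consistent API" principle above (names are
-illustrative, not part of the proposal): a standard scalar opcode is
-expanded at the instruction-issue phase into VL near-identical operations,
-regardless of whether any parallel ALUs exist underneath:
-
-    void issue_op(int opcode, int rd, int rs1, int rs2)
-    {
-        int vl = CSRvectorlen[rd];   // 1 == ordinary scalar behaviour
-        for (int i = 0; i < vl; i++)
-            // queue into the instruction FIFO, pending execution;
-            // a single-ALU design drains this serially, a multi-lane
-            // design in parallel - the ISA-level view is identical
-            enqueue(opcode, rd + i, rs1 + i, rs2 + i);
-    }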
-
-## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
-
-RVV is extremely well-designed and has some amazing features, including
-2D reorganisation of memory through LOAD/STORE "strides".
-
-* plus: regular predictable workload means that implementations may
-  streamline effects on L1/L2 Cache.
-* plus: regular and clear parallel workload also means that lanes
-  (similar to Alt-RVP) may be used as an implementation detail,
-  using either SRAM or 2R1W registers.
-* plus: separate engine with no impact on the rest of an implementation
-* minus: separate *complex* engine with no RTL (ALUs, pipeline stages) reuse
-  realistically feasible.
-* minus: no ISA abstraction or re-use either: additions to other Extensions
-  do not gain parallelism, resulting in prolific duplication of functionality
-  inside RVV *and out*.
-* minus: when operations require a different approach (scalar operations
-  using the standard integer or FP regfile) an entire vector must be
-  transferred out to memory, into standard regfiles, then back to memory,
-  then back to the vector unit, this to occur potentially multiple times.
-* minus: will never fit into Compressed instruction space (as-is. May
-  be able to do so if "indirect" features of Simple-V are partially adopted).
-* plus-and-slight-minus: extended variants may address up to 256
-  vectorised registers (requires 48/64-bit opcodes to do it).
-* minus-and-partial-plus: separate engine plus complexity increases
-  implementation time and die area, meaning that adoption is likely only
-  to be in high-performance specialist supercomputing (where it will
-  be absolutely superb).
-
-## Traditional SIMD
-
-The only really good things about SIMD are how easy it is to implement and
-how good the performance is. Unfortunately that makes it quite seductive...
-
-* plus: really straightforward, ALU basically does several packed operations
-  at once. Parallelism is inherent at the ALU, making the addition of
-  SIMD-style parallelism an easy decision that has zero significant impact
-  on the rest of any given architectural design and layout.
-* plus (continuation): SIMD in simple in-order single-issue designs can
-  therefore result in superb throughput, easily achieved even with a very
-  simple execution model.
-* minus: ridiculously complex setup and corner-cases that disproportionately
-  increase instruction count on what would otherwise be a "simple loop",
-  should the number of elements in an array not happen to exactly match
-  the SIMD group width.
-* minus: getting data usefully out of registers (if separate regfiles
-  are used) means outputting to memory and back.
-* minus: quite a lot of supplementary instructions for bit-level manipulation
-  are needed in order to efficiently extract (or prepare) SIMD operands.
-* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
-  dimension and parallelism (width): an at least O(N^2) and quite probably
-  O(N^3) ISA proliferation that often results in several thousand
-  separate instructions, all requiring separate and distinct corner-case
-  algorithms!
-* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
-  8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
-  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
-  four separate and distinct instructions: one for (r1:low r2:high),
-  one for (r1:high r2:low), one for (r1:high r2:high) and one for
-  (r1:low r2:low) *per function*.
-* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
-  between operand and result bit-widths. In combination with high/low
-  proliferation the situation is made even worse.
-* minor-saving-grace: some implementations *may* have predication masks
-  that allow control over individual elements within the SIMD block.
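-
-The "simple loop" corner-case above is worth spelling out. A sketch in
-illustrative C (add4 stands in for any hypothetical 4-wide SIMD
-intrinsic): whenever N is not a multiple of the SIMD width, a separate
-scalar tail-loop is needed - exactly the setup/teardown burden that
-variable-length vectors eliminate.
-
-    int i;
-    for (i = 0; i + 3 < N; i += 4)
-        add4(&a[i], &b[i], &c[i]);   /* one 4-wide SIMD operation */
-    for (; i < N; i++)
-        a[i] = b[i] + c[i];          /* 0..3 leftover elements, done singly */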
-
-# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals
-
-This section compares the various parallelism proposals as they stand,
-*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
-the question is asked "How can each of the proposals effectively implement
-(or replace) SIMD, and how effective would they be"?
-
-## [[alt_rvp]]
-
-* Alt-RVP would not actually replace SIMD but would augment it: just as with
-  a SIMD architecture where the ALU becomes responsible for the parallelism,
-  Alt-RVP ALUs would likewise be so responsible... with *additional*
-  (lane-based) parallelism on top.
-* Thus some of the downsides of SIMD ISA O(N^3) proliferation are avoided
-  by at least one dimension (architectural upgrades introducing
-  128-bit then 256-bit then 512-bit variants of the exact same 64-bit
-  SIMD block)
-* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
-  of instructions as SIMD, albeit not quite as badly (due to Lanes).
-* In the same discussion for Alt-RVP, an additional proposal was made to
-  be able to subdivide the bits of each register lane (columns) down into
-  arbitrary bit-lengths (RGB 565 for example).
-* A recommendation was given instead to make the subdivisions down to 32-bit,
-  16-bit or even 8-bit, effectively dividing the register file into
-  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
-  "swapping" instructions were then introduced, some of the disadvantages
-  of SIMD could be mitigated.
-
-## RVV
-
-* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
-  parallelism.
-* However whilst SIMD is usually designed for single-issue in-order simple
-  DSPs with a focus on Multimedia (Audio, Video and Image processing),
-  RVV's primary focus appears to be on Supercomputing: optimisation of
-  mathematical operations that fit into the OpenCL space.
-* Adding functions (operations) that would normally fit (in parallel)
-  into a SIMD instruction requires an equivalent to be added to the
-  RVV Extension, if one does not exist. Given the specialist nature of
-  some SIMD instructions (8-bit or 16-bit saturated or halving add),
-  this possibility seems extremely unlikely to occur, even if the
-  implementation overhead of RVV were acceptable (compared to
-  normal SIMD/DSP-style single-issue in-order simplicity).
-
-## Simple-V
-
-* Simple-V borrows hugely from RVV as it is intended to be easy to
-  topologically transplant every single instruction from RVV (as
-  designed) into Simple-V equivalents, with *zero loss of functionality
-  or capability*.
-* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
-  Extension which contained the basic primitives (non-parallelised
-  8, 16 or 32-bit SIMD operations) would inherently *become* parallel,
-  automatically.
-* Additionally, standard operations (ADD, MUL) that would normally have
-  to have special SIMD-parallel opcodes added need no longer have *any*
-  of the length-dependent variants (2of 32-bit ADDs in a 64-bit register,
-  4of 32-bit ADDs in a 128-bit register) because Simple-V takes the
-  *standard* RV opcodes (present and future) and automatically parallelises
-  them.
-* By inheriting the RVV feature of arbitrary vector-length, then just as
-  with RVV the corner-cases and ISA proliferation of SIMD is avoided.
-* Whilst not entirely finalised, registers are expected to be
-  capable of being subdivided down to an implementor-chosen bitwidth
-  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
-  and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
-  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
-  ALUs that perform twin 8-bit operations as they see fit, or anything
-  else including no subdivisions at all.
-* Even though implementors have that choice even to have full 64-bit
-  (with RV64) SIMD, they *must* provide predication that transparently
-  switches off appropriate units on the last loop, thus neatly fitting
-  underlying SIMD ALU implementations *into* the arbitrary vector-length
-  RVV paradigm, keeping the uniform consistent API that is a key strategic
-  feature of Simple-V.
-* With Simple-V fitting into the standard register files, certain classes
-  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
-  can be done by applying *Parallelised* Bit-manipulation operations
-  followed by parallelised *straight* versions of element-to-element
-  arithmetic operations, even if the bit-manipulation operations require
-  changing the bitwidth of the "vectors" to do so. Predication can
-  be utilised to skip high words (or low words) in source or destination.
-* In essence, the key downside of SIMD - massive duplication of
-  identical functions over time as an architecture evolves from 32-bit
-  wide SIMD all the way up to 512-bit - is avoided with Simple-V, through
-  vector-style parallelism being dropped on top of 8-bit or 16-bit
-  operations, all the while keeping a consistent ISA-level "API" irrespective
-  of implementor design choices (or indeed actual implementations).
-
-# Implementing V on top of Simple-V
-
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
-  as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
-
-# Implementing P (renamed to DSP) on top of Simple-V
-
-* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
-  (caveat: anything not specified drops through to software-emulation / traps)
-* TODO
-
-# Appendix
-
-## V-Extension to Simple-V Comparative Analysis
-
-This section has been moved to its own page [[v_comparative_analysis]]
-
-## P-Ext ISA
-
-This section has been moved to its own page [[p_comparative_analysis]]
-
-## Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id=0, irs1=0, irs2=0; i < CSRvectorlen[rd]; i++)
-       {
-          if (CSRpredicate[predr][i]) # skip element if predicate bit clear
-             x[rd+id] = x[rs1+irs1] + x[rs2+irs2];
-          # advance each register offset only if that register is marked
-          # as a vector, so a scalar operand is repeated for every element
-          if (CSRreg_is_vectorised[rd])  id   += 1;
-          if (CSRreg_is_vectorised[rs1]) irs1 += 1;
-          if (CSRreg_is_vectorised[rs2]) irs2 += 1;
-       }
-    }
-
-## Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explicitly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-becomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLE would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*! The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not. All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table data="""
- 31     |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7       | 6 ....... 0 |
-imm[12] | imm[10:5]  | rs2      | rs1       | funct3       | imm[4:1]     | imm[11] | opcode      |
- 1      | 6          | 5        | 5         | 3            | 4            | 1       | 7           |
-   offset[12,10:5]  || src2     | src1      | BEQ          | offset[11,4:1]        || BRANCH      |
-"""]]
-
-This would become:
-
-[[!table data="""
-31      | 30 .. 25 |24 ... 20 | 19 15 | 14  12 | 11 ..  8 | 7       | 6 ... 0 |
-imm[12] | imm[10:5]| rs2      | rs1   | funct3 | imm[4:1] | imm[11] | opcode  |
-1       | 6        | 5        | 5     | 3      | 4        | 1       | 7       |
-reserved          || src2     | src1  | BEQ    | predicate rs3     || BRANCH  |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is
-presently the "immediate offset" field to add in not only a bit to
-distinguish between floating-point compare and integer compare, and a
-second source register, but also to use some of the bits as a predication
-target as well.
-
-[[!table data="""
-15 ...... 13 | 12 ...........  10 | 9.....  7 | 6 ................. 2 | 1 .. 0 |
-   funct3    | imm                | rs10      | imm                   | op     |
-   3         | 3                  | 3         | 5                     | 2      |
-   C.BEQZ    | offset[8,4:3]      | src       | offset[7:6,2:1,5]     | C1     |
-"""]]
-
-Now uses the CS format:
-
-[[!table data="""
-15 ...... 13 | 12 ...........  10 | 9.....  7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
-   funct3    | imm                | rs10      | imm    |              | op     |
-   3         | 3                  | 3         | 2      | 3            | 2      |
-   C.BEQZ    | predicate rs3      | src1      | I/F B  | src2         | C1     |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float", including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct (predicated) comparison
-operators. In both floating-point and integer cases those could be
-EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
-
-## Register reordering
-
-### Register File
-
-| Reg Num | Bits    |
-| ------- | ------- |
-| r0      | (31..0) |
-| r1      | (31..0) |
-| r2      | (31..0) |
-| r3      | (31..0) |
-| r4      | (31..0) |
-| r5      | (31..0) |
-| r6      | (31..0) |
-| r7      | (31..0) |
-| ..      | (31..0) |
-| r31     | (31..0) |
-
-### Vectorised CSR
-
-This may not be an actual CSR: it may be generated from the Vector Length
-CSR, as a single bit is less burdensome on the instruction decode phase.
-
-| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
-| - | - | - | - | - | - | - | - |
-| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
-
-### Vector Length CSR
-
-| Reg Num | (3..0) |
-| ------- | ------ |
-| r0      | 2      |
-| r1      | 0      |
-| r2      | 1      |
-| r3      | 1      |
-| r4      | 3      |
-| r5      | 0      |
-| r6      | 0      |
-| r7      | 1      |
-
-### Virtual Register Reordering
-
-This example assumes the above Vector Length CSR table:
-
-| Reg Num | Bits (0) | Bits (1) | Bits (2) |
-| ------- | -------- | -------- | -------- |
-| r0      | (31..0)  | (31..0)  |          |
-| r2      | (31..0)  |          |          |
-| r3      | (31..0)  |          |          |
-| r4      | (31..0)  | (31..0)  | (31..0)  |
-| r7      | (31..0)  |          |          |
-
-### Bitwidth Virtual Register Reordering
-
-This example goes a little further and illustrates the effect that a
-bitwidth CSR has been set on a register. Preconditions:
-
-* RV32 assumed
-* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
-* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
-* vsetl rs1, 5 # set the vector length to 5
-
-This is interpreted as follows:
-
-* Given that the context is RV32, ELEN=32.
-* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
-* Therefore the actual vector length is up to *six* elements
-* However vsetl sets a length 5 therefore the last "element" is skipped
-
-So when using an operation that uses r2 as a source (or destination)
-the operation is carried out as follows:
-
-* 16-bit operation on r2(15..0) - vector element index 0
-* 16-bit operation on r2(31..16) - vector element index 1
-* 16-bit operation on r3(15..0) - vector element index 2
-* 16-bit operation on r3(31..16) - vector element index 3
-* 16-bit operation on r4(15..0) - vector element index 4
-* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
-
-Predication has been left out of the above example for simplicity, however
-predication is ANDed with the latter stages (vsetl not equal to maximum
-capacity).
-
-Note also that it is entirely an implementor's choice as to whether to have
-actual separate ALUs down to the minimum bitwidth, or whether to have something
-more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
-operations carried out 32-bits at a time is perfectly acceptable, as is
-8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
-Regardless of the internal parallelism choice, *predication must
-still be respected*, making Simple-V in effect the "consistent public API".
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default  |
-| 001 | 8        |
-| 010 | 16       |
-| 011 | 32       |
-| 100 | 64       |
-| 101 | 128      |
-| 110 | rsvd     |
-| 111 | rsvd     |
-
-Pseudocode for vector length taking CSR SIMD-bitwidth into account:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-
-To index an element in a register rnum where the vector element index is i:
-
-    function regoffs(rnum, i):
-        regidx = floor(i / simdmult)           # integer-div rounded down
-        byteidx = (i % simdmult) * bytesperreg # byte offset within register
-        return rnum + regidx,                  # actual real register
-               byteidx * 8,                    # low
-               byteidx * 8 + bytesperreg*8 - 1 # high
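-
-As a quick worked check against the walkthrough above (RV32, r2 set to
-16-bit, so bytesperreg = 2 and simdmult = (32/8)/2 = 2), take vector
-element index 4 of the "vector" starting at r2:
-
-    regidx  = floor(4 / 2)  # = 2, so the real register is r2+2 = r4
-    byteidx = (4 % 2) * 2   # = 0, element sits at the bottom of r4
-    # low = 0, high = 0 + 16-1 = 15: a 16-bit operation on r4(15..0),
-    # matching "vector element index 4" in the list above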
-
-### Example Instruction translation:
-
-The instruction "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FIFO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
-### Insights
-
-SIMD register file splitting still to consider. For RV64, benefits of doubling
-(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
-size of the floating point register file to 64 (128 in the case of HP)
-seem pretty clear and worth the complexity.
-
-With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
-are done on 64-bit registers, it's not so conceptually difficult. May even
-be achieved by *actually* splitting the regfile into 64 virtual 32-bit
-registers such that a 64-bit FP scalar operation is dropped into (r0.H
-r0.L) tuples. Implementation therefore hidden through register renaming.
-
-Implementations intending to introduce VLIW, OoO and parallelism
-(even without Simple-V) would then find that the instructions are
-generated quicker (or in a more compact fashion that is less heavy
-on caches). Interestingly we observe then that Simple-V is about
-"consolidation of instruction generation", where actual parallelism
-of underlying hardware is an implementor-choice that could just as
-equally be applied *without* Simple-V even being implemented.
-
-## Analysis of CSR decoding on latency
-
-It could indeed have been logically deduced (or expected) that there
-would be additional decode latency in this proposal, because when the
-opcodes are overloaded to have different meanings, there is guaranteed
-to be some state, somewhere, directly related to registers.
-
-There are several cases:
-
-* All operands vector-length=1 (scalars), all operands
-  packed-bitwidth="default": instructions are passed through direct as if
-  Simple-V did not exist. Simple-V is, in effect, completely disabled.
-* At least one operand vector-length > 1, all operands
-  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
-  virtual parallelism looping may be activated.
-* All operands vector-length=1 (scalars), at least one
-  operand packed-bitwidth != default: degenerate case of SIMD,
-  implementation-specific complexity here (packed decode before ALUs or
-  *IN* ALUs)
-* At least one operand vector-length > 1, at least one operand
-  packed-bitwidth != default: parallel vector ALUs (if any)
-  placed on "alert", virtual parallelism looping may be activated,
-  implementation-specific SIMD complexity kicks in (packed decode before
-  ALUs or *IN* ALUs).
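-
-The first of those cases suggests a cheap fast-path, anticipating the
-"all-vlens=1, all-simd-bitwidths=default" detection discussed below. A
-sketch (names illustrative only) of the decode-stage gating involved:
-
-    // pre-computed flags, updated whenever the relevant CSRs are written
-    if (all_vector_lengths_one && all_bitwidths_default)
-        issue(instr);           // pass-through: Simple-V effectively disabled
-    else
-        simple_v_decode(instr); // CSR lookups, loop / predication activation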
-
-Bear in mind that the proposal includes that the decision whether
-to parallelise in hardware or whether to virtual-parallelise (to
-dramatically simplify compilers and also not to run into the SIMD
-instruction proliferation nightmare) *or* a transparent combination
-of both, be done on a *per-operand basis*, so that implementors can
-specifically choose to create an application-optimised implementation
-that they believe (or know) will sell extremely well, without having
-"Extra Standards-Mandated Baggage" that would otherwise blow their area
-or power budget completely out the window.
-
-Additionally, two possible CSR schemes have been proposed, in order to
-greatly reduce CSR space:
-
-* per-register CSRs (vector-length and packed-bitwidth)
-* a smaller number of CSRs with the same information but with an *INDEX*
-  specifying WHICH register in one of three regfiles (vector, fp, int)
-  the length and bitwidth applies to.
-
-(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
-
-In addition, LOAD/STORE has its own associated proposed CSRs that
-mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
-V (and Hwacha).
-
-Also bear in mind that, for reasons of simplicity for implementors,
-I was coming round to the idea of permitting implementors to choose
-exactly which bitwidths they would like to support in hardware and which
-to allow to fall through to software-trap emulation.
-
-So the question boils down to:
-
-* whether either (or both) of those two CSR schemes have significant
-  latency that could even potentially require an extra pipeline decode stage
-* whether there are implementations that can be thought of which do *not*
-  introduce significant latency
-* whether it is possible to explicitly (through quite simply
-  disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
-  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
-  the extreme of skipping an entire pipeline stage (if one is needed)
-* whether packed bitwidth and associated regfile splitting is so complex
-  that it should definitely, definitely be made mandatory that implementors
-  move regfile splitting into the ALU, and what are the implications of that
-* whether even if that *is* made mandatory, is software-trapped
-  "unsupported bitwidths" still desirable, on the basis that SIMD is such
-  a complete nightmare that *even* having a software implementation is
-  better, making Simple-V have more in common with a software API than
-  anything else.
-
-Whilst the above may seem to be severe minuses, there are some strong
-pluses:
-
-* Significant reduction of V's opcode space: over 85%.
-* Smaller reduction of P's opcode space: around 10%.
-* The potential to use Compressed instructions in both Vector and SIMD
-  due to the overloading of register meaning (implicit vectorisation,
-  implicit packing)
-* Not only present but also future extensions automatically gain parallelism.
-* Already mentioned but worth emphasising: the simplification to compiler
-  writers and assembly-level writers of having the same consistent ISA
-  regardless of whether the internal level of parallelism (number of
-  parallel ALUs) is only equal to one ("virtual" parallelism), or is
-  greater than one, should not be underestimated.
-
-## Reducing Register Bank porting
-
-This looks quite reasonable.
-
-The main details are outlined on page 4. They propose a 2-level register
-cache hierarchy, note that registers are typically only read once, that
-you never write back from upper to lower cache level but always go in a
-cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
-a scheme where you look ahead by only 2 instructions to determine which
-registers to bring into the cache.
-
-The nice thing about a vector architecture is that you *know* that
-*even more* registers are going to be pulled in: Hwacha uses this fact
-to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
-by *introducing* deliberate latency into the execution phase.
-
-# Virtual Memory page-faults
-
-> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
-> riscv-isa-manual in order to work out how to re-map RVV onto the standard
-> ISA, and came across interesting comments at the bottom of pages 75
-> and 76:
-
-> " A common mechanism used in other ISAs to further reduce save/restore
-> code size is load- multiple and store-multiple instructions. "
-
-> Fascinatingly, due to Simple-V proposing to use the *standard* register
-> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
-> that: load-multiple and store-multiple instructions.
-> Which brings us on to this comment:
-
-> "For virtual memory systems, some data accesses could be resident in
-> physical memory and
-> some could not, which requires a new restart mechanism for partially
-> executed instructions."
-
-> Which then of course brings us to the interesting question: how does RVV
-> cope with the scenario when, particularly with LD.X (Indexed / indirect
-> loads), part-way through the loading a page fault occurs?
-
-> Has this been noted or discussed before?
-
-For applications-class platforms, the RVV exception model is
-element-precise (that is, if an exception occurs on element j of a
-vector instruction, elements 0..j-1 have completed execution and elements
-j+1..vl-1 have not executed).
-
-Certain classes of embedded platforms where exceptions are always fatal
-might choose to offer resumable/swappable interrupts but not precise
-exceptions.
-
-> Is RVV designed in any way to be re-entrant?
-
-Yes.
-
-> What would the implications be for instructions that were in a FIFO at
-> the time, in out-of-order and VLIW implementations, where partial decode
-> had taken place?
-
-The usual bag of tricks for maintaining precise exceptions applies to
-vector machines as well. Register renaming makes the job easier, and
-it's relatively cheaper for vectors, since the control cost is amortized
-over longer registers.
-
-> Would it be reasonable at least to say *bypass* (and freeze) the
-> instruction FIFO (drop down to a single-issue execution model temporarily)
-> for the purposes of executing the instructions in the interrupt (whilst
-> setting up the VM page), then re-continue the instruction with all
-> state intact?
-
-This approach has been done successfully, but it's desirable to be
-able to swap out the vector unit state to support context switches on
-exceptions that result in long-latency I/O.
-
-> Or would it be better to switch to an entirely separate secondary
-> hyperthread context?
-
-> Does anyone have any ideas or know if there is any academic literature
-> on solutions to this problem?
-
-The Vector VAX offered imprecise but restartable and swappable exceptions:
-http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf
-
-Sec. 4.6 of Krste's dissertation assesses some of
-the tradeoffs and references a bunch of related work:
-http://people.eecs.berkeley.edu/~krste/thesis.pdf
-
-----
-
-Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 F.P.
-exceptions" and thought, "hmmm that could go into a CSR, must re-read
-the section on FP state CSRs in RVV 0.4-Draft again" then I suddenly
-thought, "ah ha! what if the memory exceptions, instead of having
-an immediate exception thrown, were simply stored in a type of predication
-bit-field with a flag "error: this element failed"?
+# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-Then, *after* the vector load (or store, or even operation) was
-performed, you could *then* raise an exception, at which point it
-would be possible (yes in software... I know....) to go "hmmm, these
-indexed operations didn't work, let's get them into memory by triggering
-page-loads", then *re-run the entire instruction* but this time with a
-"memory-predication CSR" that stops the already-performed operations
-(whether they be loads, stores or an arithmetic / FP operation) from
-being carried out a second time.
+**OBSOLETE. This document is out of date and records early ideas and discussions. [Go to the up-to-date document](https://libre-soc.org/openpower/sv/)**
 
-This theoretically could end up being done multiple times in an SMP
-environment, and also for LD.X there would be the remote (and annoying)
-possibility that the indexed memory address could end up being modified.
+This list is auto-generated from a page tag "oldstandards":
 
-The advantage would be that the order of execution need not be
-sequential, which potentially could have some big advantages.
-Am still thinking through the implications as any dependent operations
-(particularly ones already decoded and moved into the execution FIFO)
-would still be there (and stalled). hmmm.
+[[!inline pages="tagged(oldstandards)" actions="no" archive="yes" quick="yes"]]
 
 # References
 
@@ -1449,3 +37,13 @@ would still be there (and stalled). hmmm.
 * Discussion on RVV "re-entrant" capabilities allowing operations to be
   restarted if an exception occurs (VM page-table miss)
 
+* Dot Product Vector
+* RVV slides 2017
+* Wavefront skipping using BRAMs
+* Streaming Pipelines
+* Barcelona SIMD Presentation
+* Full Description (last page) of RVV instructions
+
+* PULP Low-energy Cluster Vector Processor
+