From 6112eeed535a31c43d234630ffe355d7fddb9b27 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Thu, 8 Sep 2022 01:16:49 +0100
Subject: [PATCH]

---
 simple_v_extension.mdwn | 1712 ---------------------------------------
 1 file changed, 1712 deletions(-)

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 9007b5a62..f37e1fc44 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -4,1718 +4,6 @@
 **OBSOLETE. This document is out of date and involved early ideas and discussions. [Go to the up-to-date document](https://libre-soc.org/openpower/sv/)**
 
-Key insight: Simple-V is intended as an abstraction layer to provide
-a consistent "API" to parallelisation of existing *and future* operations.
-*Actual* internal hardware-level parallelism is *not* required, such
-that Simple-V may be viewed as providing a "compact" or "consolidated"
-means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FIFO), pending execution.
-
-*Actual* parallelism, if added independently of Simple-V in the form
-of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit from
-the uniformity of a consistent API.
-
-**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
-
-* Talk slides:
-* Specification: has now moved to its own page: [[specification]]
-
-[[!toc ]]
-
-# Introduction
-
-This proposal exists so as to be able to satisfy several disparate
-requirements: power-conscious, area-conscious, and performance-conscious
-designs all pull an ISA and its implementation in different conflicting
-directions, as do the specific intended uses for any given implementation.
-
-The existing P (SIMD) proposal and the V (Vector) proposals,
-whilst each extremely powerful in their own right and clearly desirable,
-are also:
-
-* Clearly independent in their origins (AndesStar v3 and Cray respectively)
-  so need work to adapt to the RISC-V ethos and paradigm
-* Are sufficiently large so as to make adoption (and exploration for
-  analysis and review purposes) prohibitively expensive
-* Both contain partial duplication of pre-existing RISC-V instructions
-  (an undesirable characteristic)
-* Both have independent, incompatible and disparate methods for introducing
-  parallelism at the instruction level
-* Both require that their respective parallelism paradigm be implemented
-  along-side and integral to their respective functionality *or not at all*.
-* Both independently have methods for introducing parallelism that
-  could, if separated, benefit
-  *other areas of RISC-V, not just DSP or Floating-point respectively*.
-
-There are also key differences between Vectorisation and SIMD (full
-details outlined in the Appendix), the key points being:
-
-* SIMD has an extremely seductive and compelling ease-of-implementation
-  argument: each operation is passed to the ALU, which is where the
-  parallelism lies. There is *negligible* (if any) impact on the rest
-  of the core (with life instead being made hell for compiler writers
-  and applications writers due to extreme ISA proliferation).
-* By contrast, Vectorisation has quite some complexity (for considerable
-  flexibility, reduction in opcode proliferation and much more).
-* Vectorisation typically includes much more comprehensive memory load
-  and store schemes (unit stride, constant-stride and indexed), which
-  in turn have ramifications: virtual memory misses (TLB cache misses)
-  and even multiple page-faults... all caused by a *single instruction*,
-  yet with a clear benefit that the regularisation of LOAD/STOREs can
-  be optimised for minimal impact on caches and maximised throughput.
-* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
-  to pages), and these load/stores have absolutely nothing to do with the
-  SIMD / ALU engine, no matter how wide the operand. Simplicity but with
-  more impact on instruction and data caches.
-
-Overall it makes a huge amount of sense to have a means and method
-of introducing instruction parallelism in a flexible way that provides
-implementors with the option to choose exactly where they wish to offer
-performance improvements and where they wish to optimise for power
-and/or area (and if that can be offered even on a per-operation basis that
-would provide even more flexibility).
-
-Additionally it makes sense to *split out* the parallelism inherent within
-each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, could be added *on top* of
-this proposal, to topologically provide the exact same functionality of
-each of P and V. Each of P and V then can focus on providing the best
-operations possible for their respective target areas, without being
-hugely concerned about the actual parallelism.
-
-Furthermore, an additional goal of this proposal is to reduce the number
-of opcodes utilised by each of P and V as they currently stand, leveraging
-existing RISC-V opcodes where possible, and also potentially allowing
-P and V to make use of Compressed Instructions as a result.
-
-# Analysis and discussion of Vector vs SIMD
-
-There are six combined areas between the two proposals that help with
-parallelism (increased performance, reduced power / area) without
-over-burdening the ISA with a huge proliferation of
-instructions:
-
-* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
-* Implicit vs fixed instruction bit-width (integral to instruction or not)
-* Implicit vs explicit type-conversion (compounded on bit-width)
-* Implicit vs explicit inner loops.
-* Single-instruction LOAD/STORE.
-* Masks / tagging (selecting/preventing certain indexed elements from execution)
-
-The pros and cons of each are discussed and analysed below.
-
-## Fixed vs variable parallelism length
-
-In David Patterson and Andrew Waterman's analysis of SIMD and Vector
-ISAs, the conclusion comes out clearly in favour of (effectively) variable
-length SIMD. As SIMD is a fixed width, typically 4, 8 or in extreme cases
-16 or 32 simultaneous operations, the setup, teardown and corner-cases of SIMD
-are extremely burdensome except for applications whose requirements
-*specifically* match the *precise and exact* depth of the SIMD engine.
-
-Thus, SIMD, no matter what width is chosen, is never going to be acceptable
-for general-purpose computation, and in the context of developing a
-general-purpose ISA, is never going to satisfy 100 percent of implementors.
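-
-To make the setup/teardown burden concrete, here is a minimal C sketch
-(illustrative only: SIMD_WIDTH, MAXVL and the setvl() helper are
-hypothetical stand-ins, not part of any proposal here) contrasting a
-fixed-width SIMD loop, with its mandatory scalar tail, against a
-strip-mined variable-length vector loop:
-
-    #include <stddef.h>
-
-    #define SIMD_WIDTH 4  /* fixed by the ISA: the root of the problem     */
-    #define MAXVL      4  /* fixed by the implementation: hidden by the ISA */
-
-    /* Fixed-width SIMD: the tail ("corner case") needs separate code,
-       and every new hardware width doubles the opcode space. */
-    void add_simd(int *dst, const int *a, const int *b, size_t n) {
-        size_t i = 0;
-        for (; i + SIMD_WIDTH <= n; i += SIMD_WIDTH)  /* main SIMD loop */
-            for (size_t j = 0; j < SIMD_WIDTH; j++)   /* one "SIMD op"  */
-                dst[i + j] = a[i + j] + b[i + j];
-        for (; i < n; i++)                            /* scalar tail    */
-            dst[i] = a[i] + b[i];
-    }
-
-    /* Variable-length vector: one strip-mined loop, no tail code.
-       setvl() stands in for a "set vector length" instruction. */
-    static size_t setvl(size_t remaining) {
-        return remaining < MAXVL ? remaining : MAXVL;
-    }
-
-    void add_vector(int *dst, const int *a, const int *b, size_t n) {
-        while (n > 0) {
-            size_t vl = setvl(n);            /* hardware picks the width */
-            for (size_t j = 0; j < vl; j++)  /* one vectorised add       */
-                dst[j] = a[j] + b[j];
-            dst += vl; a += vl; b += vl; n -= vl;
-        }
-    }
-
-Note that only add_vector is immune to a change in hardware width:
-MAXVL never appears in the ISA-visible loop structure, whereas
-SIMD_WIDTH is baked into both loops of add_simd.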
-
-To explain this further: for increased workloads over time, as the
-performance requirements increase for new target markets, implementors
-choose to extend the SIMD width (so as to again avoid mixing parallelism
-into the instruction issue phases: the primary "simplicity" benefit of
-SIMD in the first place), with the result that the entire opcode space
-effectively doubles with each new SIMD width that's added to the ISA.
-
-That basically leaves "variable-length vector" as the clear *general-purpose*
-winner, at least in terms of greatly simplifying the instruction set,
-reducing the number of instructions required for any given task, and thus
-reducing power consumption for the same.
-
-## Implicit vs fixed instruction bit-width
-
-SIMD again has a severe disadvantage here, over Vector: huge proliferation
-of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
-have to then have operations *for each and between each*. It gets very
-messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
-proliferation profile.
-
-The V-Extension on the other hand proposes to set the bit-width of
-future instructions on a per-register basis, such that subsequent instructions
-involving that register are *implicitly* of that particular bit-width until
-otherwise changed or reset.
-
-This has some extremely useful properties, without being particularly
-burdensome to implementations, given that instruction decode already has
-to direct the operation to a correctly-sized ALU engine anyway.
-
-Not least: in places where an ISA was previously constrained (for
-whatever reason, including limitations of the available operand space),
-implicit bit-width allows the meaning of certain operations to be
-type-overloaded *without* pollution or alteration of frozen and immutable
-instructions, in a fully backwards-compatible fashion.
-
-## Implicit and explicit type-conversion
-
-The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
-deal with over-population of instructions, such that type-casting from
-integer (and floating point) of various sizes is automatically inferred
-due to "type tagging" that is set with a special instruction. A register
-will be *specifically* marked as "16-bit Floating-Point" and, if added
-to an operand that is specifically tagged as "32-bit Integer", an implicit
-type-conversion will take place *without* requiring that type-conversion
-to be explicitly done with its own separate instruction.
-
-However, implicit type-conversion is not only quite burdensome to
-implement (explosion of inferred type-to-type conversion) but also is
-never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration. Each new type results in
-an increased O(N^2) conversion space and, as anyone who has examined
-Python's source code (which has built-in polymorphic type-conversion)
-knows, the task is more complex than it first seems.
-
-Overall, type-conversion is generally best left to explicit
-type-conversion instructions, or in definite specific use-cases to be
-part of an actual instruction (DSP or FP).
-
-## Zero-overhead loops vs explicit loops
-
-The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
-contains an extremely interesting feature: zero-overhead loops. This
-proposal would basically allow an inner loop of instructions to be
-repeated a fixed number of times without the cost of explicit
-branch instructions.
-
-Its specific advantage over explicit loops is that the pipeline in a DSP
-can potentially be kept completely full *even in an in-order single-issue
-implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in
-order to keep ALU pipelines 100% occupied.
-
-By bringing that capability in, this proposal could offer a way to increase
-pipeline activity even in simpler implementations in the one key area
-which really matters: the inner loop.
-
-However when looking at much more comprehensive schemes such as
-"A portable specification of zero-overhead loop control hardware
-applied to embedded processors" (ZOLC), optimising only the single
-inner loop seems inadequate, tending to suggest that ZOLC may be
-better off being proposed as an entirely separate Extension.
-
-## Single-instruction LOAD/STORE
-
-In traditional Vector Architectures there are instructions which
-result in multiple register-memory transfer operations resulting
-from a single instruction. They're complicated to implement in hardware,
-yet the benefits are a huge consistent regularisation of memory accesses
-that can be highly optimised with respect to both actual memory and any
-L1, L2 or other caches. In Hwacha EECS-2015-263 the consequences of
-getting this architecturally wrong are made explicitly clear:
-L2 cache-thrashing at the very least.
-
-Complications arise when Virtual Memory is involved: TLB cache misses
-need to be dealt with, as do page faults. Some of the tradeoffs are
-discussed in , Section
-4.6, and an article by Jeff Bush, written when faced with some of these
-issues, is particularly enlightening
-
-
-Interestingly, none of this complexity is faced in SIMD architectures...
-but then they do not get the opportunity to optimise for highly-streamlined
-memory accesses either.
-
-With the "bang-per-buck" ratio being so high and the indirect improvement
-in L1 Instruction Cache usage (reduced instruction count), as well as
-the opportunity to optimise L1 and L2 cache usage, the case for including
-Vector LOAD/STORE is compelling.
-
-## Mask and Tagging (Predication)
-
-Tagging (aka Masks aka Predication) is a pseudo-method of implementing
-simplistic branching in a parallel fashion, by allowing execution on
-elements of a vector to be switched on or off depending on the results
-of prior operations in the same array position.
-
-The reason for considering this is simple: by *definition* it
-is not possible to perform individual parallel branches in a SIMD
-(Single-Instruction, **Multiple**-Data) context. Branches (modifying
-of the Program Counter) will result in *all* parallel data having
-a different instruction executed on it: that's just the definition of
-SIMD, and it is simply unavoidable.
-
-So these are the ways in which conditional execution may be implemented:
-
-* explicit compare and branch: BNE x, y -> offs would jump offs
-  instructions if x was not equal to y
-* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) such as ADD results in a carry, carry bit
-  implicitly (or sometimes explicitly) goes into a "tag" (mask) register
-
-The first of these is a "normal" branch method, which is flat-out impossible
-to parallelise without look-ahead and effectively rewriting instructions.
-This would defeat the purpose of RISC.
-
-The latter two are where parallelism becomes easy to do without complexity:
-every operation is modified to be "conditionally executed" (in an explicit
-way directly in the instruction format *or* implicitly).
-
-RVV (Vector-Extension) proposes to have *explicit* storing of the compare
-in a tag/mask register, and to *explicitly* have every vector operation
-*require* that its operation be "predicated" on the bits within an
-explicitly-named tag/mask register.
-
-SIMD (P-Extension) has not yet published precise documentation on what its
-schema is to be: there is however verbal indication at the time of writing
-that:
-
-> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
-> be executed using the same compare ALU logic for the base ISA with some
-> minor modifications to handle smaller data types. The function will not
-> be duplicated.
-
-This is an *implicit* form of predication as the base RV ISA does not have
-condition-codes or predication. By adding a CSR it becomes possible
-to also tag certain registers as "predicated if referenced as a destination".
-Example:
-
-    // in future operations from now on, if r0 is the destination use r5 as
-    // the PREDICATION register
-    SET_IMPLICIT_CSRPREDICATE r0, r5
-    // store the compares in r5 as the PREDICATION register
-    CMPEQ8 r5, r1, r2
-    // r0 is used here. ah ha! that means it's predicated using r5!
-    ADD8 r0, r1, r3
-
-With enough registers (and in RISC-V there are enough registers) some fairly
-complex predication can be set up and yet still execute without significant
-stalling, even in a simple non-superscalar architecture.
-
-(For details on how Branch Instructions would be retro-fitted to indirectly
-predicated equivalents, see Appendix)
-
-## Conclusions
-
-In the above sections the six different areas where parallel instruction
-execution has closely and loosely inter-related implications for the ISA and
-for implementors were outlined. The pluses and minuses came out as
-follows:
-
-* Fixed vs variable parallelism: variable
-* Implicit (indirect) vs fixed (integral) instruction bit-width: indirect
-* Implicit vs explicit type-conversion: explicit
-* Implicit vs explicit inner loops: implicit but best done separately
-* Single-instruction Vector LOAD/STORE: Complex but highly beneficial
-* Tag or no-tag: Complex but highly beneficial
-
-In particular:
-
-* variable-length vectors came out on top because of the high setup, teardown
-  and corner-cases associated with the fixed width of SIMD.
-* Implicit bit-width helps to extend the ISA to escape from
-  former limitations and restrictions (in a backwards-compatible fashion),
-  whilst also leaving implementors free to simplify implementations
-  by using actual explicit internal parallelism.
-* Implicit (zero-overhead) loops provide a means to keep pipelines
-  potentially 100% occupied in a single-issue in-order implementation
-  i.e. *without* requiring a super-scalar or out-of-order architecture,
-  but doing a proper, full job (ZOLC) is an entirely different matter.
-
-Constructing a SIMD/Simple-Vector proposal based around four of these six
-requirements would therefore seem to be a logical thing to do.
-
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors' desire to focus on power, area or performance.
In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism". They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (variable-width SIMD) scheme, implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (variable-width SIMD) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward. All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-To make that clear: should an implementor choose a particularly wide
-SIMD-style ALU, each parallel unit *must* have predication so that
-the parallel SIMD ALU may emulate variable-length parallel operations.
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
-
-In addition, implementors will be free to choose whether to provide an
-absolute bare minimum level of compliance with the "API" (software-traps
-when vectorisation is detected), all the way up to full supercomputing
-level all-hardware parallelism. Options are covered in the Appendix.
-
-
-### FMV, FNEG and FABS Instructions
-
-These are identical in form to C.MV, except covering floating-point
-register copying. The same double-predication rules also apply.
-However when elwidth is not set to default, the instruction is implicitly
-and automatically converted to a (vectorised) floating-point type conversion
-operation of the appropriate size covering the source and destination
-register bitwidths.
-
-(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
-
-### FCVT Instructions
-
-These are again identical in form to C.MV, except that they cover
-floating-point to integer and integer to floating-point. When element
-width in each vector is set to default, the instructions behave exactly
-as they are defined for standard RV (scalar) operations, except vectorised
-in exactly the same fashion as outlined in C.MV.
-
-However when the source or destination element width is not set to default,
-the opcode's explicit element widths are *over-ridden* to new definitions,
-and the opcode's element width is taken as indicative of the SIMD width
-(if applicable, i.e. if packed SIMD is requested) instead.
-
-For example FCVT.S.L would normally be used to convert a 64-bit
-integer in register rs1 to a 32-bit floating-point number in rd.
-If however the source rs1 is set to be a vector, where elwidth is set to
-default/2 and "packed SIMD" is enabled, then the first 32 bits of
-rs1 are converted to a floating-point number to be stored in rd's
-first element and the upper 32 bits are *also* converted to floating-point
-and stored in the second. The 32 bit size comes from the fact that
-FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
-divide that by two it means that rs1 element width is to be taken as 32.
-
-Similar rules apply to the destination register.
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
-  than the destination, throw an exception.
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
-  as part of the trap-handling remains to be seen.
-
-# Under consideration
-
-From the Chennai 2018 slides the following issues were raised.
-Efforts to analyse and answer these questions are below.
-
-* Should the future extra bank be included now?
-* How many Register and Predication CSRs should there be?
-  (and how many in RV32E)
-* How many in M-Mode (for doing context-switch)?
-* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-* Can CLIP be done as a CSR (mode, like elwidth)?
-* SIMD saturation (etc.) also set as a mode?
-* Include src1/src2 predication on Comparison Ops?
-  (same arrangement as C.MV, with same flexibility/power)
-* 8/16-bit ops: is it worthwhile adding a "start offset"?
-  (a bit like misaligned addressing... for registers)
-  or just use predication to skip the start?
-
-## Should the future (extra) bank be included (made mandatory)?
-
-The implications of expanding the *standard* register file from
-32 entries per bank to 64 per bank amount to quite an extensive
-architectural change. It also has implications for context-switching.
-
-Therefore, on balance, it is not recommended and certainly should
-not be made a *mandatory* requirement for the use of SV. SV's design
-ethos is to be minimally-disruptive for implementors to shoe-horn
-into an existing design.
-
-## How large should the Register and Predication CSR key-value stores be?
-
-This is something that definitely needs actual evaluation and for
-code to be run and the results analysed. At the time of writing
-(12jul2018) it is too early to tell. An approximate best-guess
-however would be 16 entries.
-
-RV32E however is a special case, given that it is highly unlikely
-(but not outside the realm of possibility) that SV would be added to
-it for performance reasons: the primary motivation would instead be
-a reduction in instruction count. The number of CSR entries therefore
-has to be considered extremely carefully.
-
-## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-
-The minimum required CSR entries would be 1 for each register-bank:
-one for integer and one for floating-point. However, as shown
-in the "Context Switch Example" section, for optimal efficiency
-(minimal instructions in a low-latency situation) the CSRs for
-the context-switch should be set up *and left alone*.
-
-This means that it is not really a good idea to touch the CSRs
-used for context-switching in the M-Mode (or S-Mode) trap, so
-if there is ever demonstrated a need for vectors then there would
-need to be *at least* one more free. However just one does not make
-much sense (as one on its own only covers scalar-vector ops), so it is
-more likely that at least two extra would be needed.
-
-This applies *in addition* in the RV32E case, if an RV32E implementation
-happens also to support U/S/M modes. That would be considered quite
-rare, but is not outside of the realm of possibility.
-
-Conclusion: all needs careful analysis and future work.
-
-## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-
-On balance it's a neat idea, however it does seem to be one where the
-benefits are not really clear. It would however obviate the need for
-an exception to be raised if the VL runs out of registers to put
-things in (gets to x31, tries a non-existent x32 and fails), however
-the "fly in the ointment" is that x0 is hard-coded to "zero". The
-increment therefore would need to be double-stepped to skip over x0.
-Some microarchitectures could run into difficulties (SIMD-like ones
-in particular) so it needs a lot more thought.
-
-## Can CLIP be done as a CSR (mode, like elwidth)?
-
-RVV appears to be going this way. At the time of writing (12jun2018)
-it's noted that in the V2.3-Draft V0.4 RVV Chapter, RVV intends to do
-clip by way of exactly this method: setting a "clip mode" in a CSR.
-
-No details are given; however the most sensible thing to have would be
-to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
-extra bits specifying the type of clipping to be carried out, on
-a per-register basis. Other bits may be used for other purposes
-(see SIMD saturation below).
-
-## SIMD saturation (etc.) also set as a mode?
-
-Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
-may also need extra details (what the saturation maximum is, for example).
-
-## Include src1/src2 predication on Comparison Ops?
-
-In C.MV (and other ops - see "C.MV Instruction") the decision
-was taken, unlike in ADD (etc.) which are 3-operand ops, to use
-*both* the src *and* dest predication masks to give an extremely
-powerful and flexible instruction that covers a huge number of
-"traditional" vector opcodes.
-
-The natural question therefore to ask is: where else could this
-flexibility be deployed? What about comparison operations?
-
-Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
-predicated comparison operations are actually a *three* operand
-instruction:
-
-    regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i  # i = element index
-
-Therefore at first glance it does not make sense to use src1 and src2
-predication masks, as it breaks the rule of 3-operand instructions
-to use the *destination* predication register.
-
-In this case however, the destination *is* a predication register
-as opposed to being a predication mask that is applied *to* the
-(vectorised) operation, element-at-a-time on src1 and src2.
-
-Thus the question is directly inter-related to whether the modification
-of the predication mask should *itself* be predicated.
-
-It is quite complex, in other words, and needs careful consideration.
-
-## 8/16-bit ops: is it worthwhile adding a "start offset"?
-
-The idea here is to make it possible, particularly in a "Packed SIMD"
-case, to be able to avoid doing unaligned Load/Store operations
-by specifying that operations, instead of being carried out
-element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
-element Packed SIMD cases.
-
-For example rather than take 2 32-bit registers divided into 4 8-bit
-elements and have them ADDed element-for-element as follows:
-
-    r3[0] = add r4[0], r6[0]
-    r3[1] = add r4[1], r6[1]
-    r3[2] = add r4[2], r6[2]
-    r3[3] = add r4[3], r6[3]
-
-an offset of 1 would result in four operations as follows, instead:
-
-    r3[0] = add r4[1], r6[0]
-    r3[1] = add r4[2], r6[1]
-    r3[2] = add r4[3], r6[2]
-    r3[3] = add r5[0], r6[3]
-
-In non-packed-SIMD mode there is no benefit at all, as a vector may
-be created using a different CSR that has the offset built-in. So this
-leaves just the packed-SIMD case to consider.
-
-Two ways in which this could be implemented / emulated (without special
-hardware):
-
-* bit-manipulation that shuffles the data along by one byte (or one word)
-  either prior to or as part of the operation requiring the offset.
-* just use an unaligned Load/Store sequence, even if there are performance
-  penalties for doing so.
-
-The question then is whether the performance hit is worth the extra hardware
-involving byte-shuffling/shifting the data by an arbitrary offset. On
-balance given that there are two reasonable instruction-based options, the
-hardware-offset option should be left out for the initial version of SV,
-with the option to consider it in an "advanced" version of the specification.
-
-# Implementing V on top of Simple-V
-
-With Simple-V converting the original RVV draft concept-for-concept
-from explicit opcodes to implicit overloading of existing RV Standard
-Extensions, certain features were (deliberately) excluded that need
-to be added back in for RVV to reach its full potential. This is
-made slightly complicated by the fact that RVV itself has two
-levels: Base and reserved future functionality.
-
-* Representation Encoding is entirely left out of Simple-V in favour of
-  implicitly taking the exact (explicit) meaning from RV Standard Extensions.
-* VCLIP and VCLIPI do not have corresponding RV Standard Extension
-  opcodes (and are the only such operations).
-* Extended Element bitwidths (1 through to 24576 bits) were left out
-  of Simple-V as, again, there is no corresponding RV Standard Extension
-  that covers anything even below 32-bit operands.
-* Polymorphism was entirely left out of Simple-V due to the inherent
-  complexity of automatic type-conversion.
-* Vector Register files were specifically left out of Simple-V in favour
-  of fitting on top of the integer and floating-point files. An
-  "RVV re-retro-fit" needs to be able to mark registers (implicitly) as
-  being actually in a separate *vector* register file.
-* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
-  register file size is 5 bits (32 registers), whilst the "Extended"
-  variant of RVV specifies 8 bits (256 registers) and has yet to
-  be published.
-* One big difference: in Sections 17.12 and 17.17 there are only two possible
-  predication registers in RVV "Base". Through the "indirect" method,
-  Simple-V provides a key-value CSR table that allows (arbitrarily)
-  up to 16 (TBD) of either the floating-point or integer registers to
-  be marked as "predicated" (key), and if so, which integer register to
-  use as the predication mask (value); a sketch of such a lookup follows
-  below.
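-
-As an illustration of the "indirect" method, here is a hedged C sketch
-of such a key-value predication lookup (the struct layout, the table size
-of 16 and the function name get_pred_val are assumptions for illustration,
-matching the pseudocode used later in this document, not a definitive CSR
-format):
-
-    #include <stdbool.h>
-    #include <stdint.h>
-
-    /* One entry in the hypothetical predication key-value CSR table:
-       key   = which register is predicated when used as a destination,
-       value = which integer register holds the predication mask. */
-    typedef struct {
-        bool    active;    /* entry in use                           */
-        bool    is_fp;     /* key refers to the FP register file     */
-        uint8_t key_reg;   /* predicated register (the "key")        */
-        uint8_t mask_reg;  /* integer reg holding mask (the "value") */
-    } pred_csr_entry_t;
-
-    #define NUM_PRED_CSRS 16          /* "up to 16 (TBD)" as above */
-    static pred_csr_entry_t pred_csr[NUM_PRED_CSRS];
-    extern uint64_t ireg[32];         /* integer register file     */
-
-    /* Look up the predication mask for a destination register.
-       Returns all-ones (no predication) when no entry matches. */
-    uint64_t get_pred_val(bool is_fp, uint8_t dest_reg) {
-        for (int i = 0; i < NUM_PRED_CSRS; i++) {
-            if (pred_csr[i].active &&
-                pred_csr[i].is_fp == is_fp &&
-                pred_csr[i].key_reg == dest_reg)
-                return ireg[pred_csr[i].mask_reg];
-        }
-        return ~(uint64_t)0;   /* unpredicated: every element executes */
-    }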
- -**TODO** - -# Implementing P (renamed to DSP) on top of Simple-V - -* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR - (caveat: anything not specified drops through to software-emulation / traps) -* TODO - -# Appendix - -## V-Extension to Simple-V Comparative Analysis - -This section has been moved to its own page [[v_comparative_analysis]] - -## P-Ext ISA - -This section has been moved to its own page [[p_comparative_analysis]] - -## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals - -This section compares the various parallelism proposals as they stand, -including traditional SIMD, in terms of features, ease of implementation, -complexity, flexibility, and die area. - -### [[harmonised_rvv_rvp]] - -This is an interesting proposal under development to retro-fit the AndesStar -P-Ext into V-Ext. - -### [[alt_rvp]] - -Primary benefit of Alt-RVP is the simplicity with which parallelism -may be introduced (effective multiplication of regfiles and associated ALUs). - -* plus: the simplicity of the lanes (combined with the regularity of - allocating identical opcodes multiple independent registers) meaning - that SRAM or 2R1W can be used for entire regfile (potentially). -* minus: a more complex instruction set where the parallelism is much - more explicitly directly specified in the instruction and -* minus: if you *don't* have an explicit instruction (opcode) and you - need one, the only place it can be added is... in the vector unit and -* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are - not useable or accessible in other Extensions. -* plus-and-minus: Lanes may be utilised for high-speed context-switching - but with the down-side that they're an all-or-nothing part of the Extension. - No Alt-RVP: no fast register-bank switching. -* plus: Lane-switching would mean that complex operations not suited to - parallelisation can be carried out, followed by further parallel Lane-based - work, without moving register contents down to memory (and back) -* minus: Access to registers across multiple lanes is challenging. "Solution" - is to drop data into memory and immediately back in again (like MMX). - -### Simple-V - -Primary benefit of Simple-V is the OO abstraction of parallel principles -from actual (internal) parallel hardware. It's an API in effect that's -designed to be slotted in to an existing implementation (just after -instruction decode) with minimum disruption and effort. - -* minus: the complexity (if full parallelism is to be exploited) - of having to use register renames, OoO, VLIW, register file cacheing, - all of which has been done before but is a pain -* plus: transparent re-use of existing opcodes as-is just indirectly - saying "this register's now a vector" which -* plus: means that future instructions also get to be inherently - parallelised because there's no "separate vector opcodes" -* plus: Compressed instructions may also be (indirectly) parallelised -* minus: the indirect nature of Simple-V means that setup (setting - a CSR register to indicate vector length, a separate one to indicate - that it is a predicate register and so on) means a little more setup - time than Alt-RVP or RVV's "direct and within the (longer) instruction" - approach. 
-* plus: shared register file meaning that, like Alt-RVP, complex
-  operations not suited to parallelisation may be carried out interleaved
-  between parallelised instructions *without* requiring data to be dropped
-  down to memory and back (into a separate vectorised register engine).
-* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
-  files means that huge parallel workloads would use up considerable
-  chunks of the register file. However in the case of RV64 and 32-bit
-  operations, that effectively means 64 slots are available for parallel
-  operations.
-* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
-  be added, yet the instruction opcodes remain unchanged (and still appear
-  to be parallel). Consistent "API" regardless of actual internal parallelism:
-  even an in-order single-issue implementation with a single ALU would still
-  appear to have parallel vectorisation.
-* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
-  hard to say if there would be pluses or minuses (on die area). At worst it
-  would be "no worse" than existing register renaming, OoO, VLIW and register
-  file cacheing schemes.
-
-### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
-
-RVV is extremely well-designed and has some amazing features, including
-2D reorganisation of memory through LOAD/STORE "strides".
-
-* plus: regular predictable workload means that implementations may
-  streamline effects on L1/L2 Cache.
-* plus: regular and clear parallel workload also means that lanes
-  (similar to Alt-RVP) may be used as an implementation detail,
-  using either SRAM or 2R1W registers.
-* plus: separate engine with no impact on the rest of an implementation
-* minus: separate *complex* engine with no RTL (ALUs, pipeline stages) reuse
-  realistically feasible.
-* minus: no ISA abstraction or re-use either: additions to other Extensions
-  do not gain parallelism, resulting in prolific duplication of functionality
-  inside RVV *and out*.
-* minus: when operations require a different approach (scalar operations
-  using the standard integer or FP regfile) an entire vector must be
-  transferred out to memory, into standard regfiles, then back to memory,
-  then back to the vector unit, this to occur potentially multiple times.
-* minus: will never fit into Compressed instruction space (as-is. May
-  be able to do so if "indirect" features of Simple-V are partially adopted).
-* plus-and-slight-minus: extended variants may address up to 256
-  vectorised registers (requires 48/64-bit opcodes to do it).
-* minus-and-partial-plus: separate engine plus complexity increases
-  implementation time and die area, meaning that adoption is likely only
-  to be in high-performance specialist supercomputing (where it will
-  be absolutely superb).
-
-### Traditional SIMD
-
-The only really good things about SIMD are how easy it is to implement and
-get good performance. Unfortunately that makes it quite seductive...
-
-* plus: really straightforward, ALU basically does several packed operations
-  at once. Parallelism is inherent at the ALU, making the addition of
-  SIMD-style parallelism an easy decision that has zero significant impact
-  on the rest of any given architectural design and layout.
-* plus (continuation): SIMD in simple in-order single-issue designs can
-  therefore result in superb throughput, easily achieved even with a very
-  simple execution model.
-* minus: ridiculously complex setup and corner-cases that disproportionately - increase instruction count on what would otherwise be a "simple loop", - should the number of elements in an array not happen to exactly match - the SIMD group width. -* minus: getting data usefully out of registers (if separate regfiles - are used) means outputting to memory and back. -* minus: quite a lot of supplementary instructions for bit-level manipulation - are needed in order to efficiently extract (or prepare) SIMD operands. -* minus: MASSIVE proliferation of ISA both in terms of opcodes in one - dimension and parallelism (width): an at least O(N^2) and quite probably - O(N^3) ISA proliferation that often results in several thousand - separate instructions. all requiring separate and distinct corner-case - algorithms! -* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of - 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction. - For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires - four separate and distinct instructions: one for (r1:low r2:high), - one for (r1:high r2:low), one for (r1:high r2:high) and one for - (r1:low r2:low) *per function*. -* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch - between operand and result bit-widths. In combination with high/low - proliferation the situation is made even worse. -* minor-saving-grace: some implementations *may* have predication masks - that allow control over individual elements within the SIMD block. - -## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals - -This section compares the various parallelism proposals as they stand, -*against* traditional SIMD as opposed to *alongside* SIMD. In other words, -the question is asked "How can each of the proposals effectively implement -(or replace) SIMD, and how effective would they be"? - -### [[alt_rvp]] - -* Alt-RVP would not actually replace SIMD but would augment it: just as with - a SIMD architecture where the ALU becomes responsible for the parallelism, - Alt-RVP ALUs would likewise be so responsible... with *additional* - (lane-based) parallelism on top. -* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by - at least one dimension are avoided (architectural upgrades introducing - 128-bit then 256-bit then 512-bit variants of the exact same 64-bit - SIMD block) -* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation - of instructions as SIMD, albeit not quite as badly (due to Lanes). -* In the same discussion for Alt-RVP, an additional proposal was made to - be able to subdivide the bits of each register lane (columns) down into - arbitrary bit-lengths (RGB 565 for example). -* A recommendation was given instead to make the subdivisions down to 32-bit, - 16-bit or even 8-bit, effectively dividing the registerfile into - Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane - "swapping" instructions were then introduced, some of the disadvantages - of SIMD could be mitigated. - -### RVV - -* RVV is designed to replace SIMD with a better paradigm: arbitrary-length - parallelism. -* However whilst SIMD is usually designed for single-issue in-order simple - DSPs with a focus on Multimedia (Audio, Video and Image processing), - RVV's primary focus appears to be on Supercomputing: optimisation of - mathematical operations that fit into the OpenCL space. 
-* Adding functions (operations) that would normally fit (in parallel) - into a SIMD instruction requires an equivalent to be added to the - RVV Extension, if one does not exist. Given the specialist nature of - some SIMD instructions (8-bit or 16-bit saturated or halving add), - this possibility seems extremely unlikely to occur, even if the - implementation overhead of RVV were acceptable (compared to - normal SIMD/DSP-style single-issue in-order simplicity). - -### Simple-V - -* Simple-V borrows hugely from RVV as it is intended to be easy to - topologically transplant every single instruction from RVV (as - designed) into Simple-V equivalents, with *zero loss of functionality - or capability*. -* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP" - Extension which contained the basic primitives (non-parallelised - 8, 16 or 32-bit SIMD operations) inherently *become* parallel, - automatically. -* Additionally, standard operations (ADD, MUL) that would normally have - to have special SIMD-parallel opcodes added need no longer have *any* - of the length-dependent variants (2of 32-bit ADDs in a 64-bit register, - 4of 32-bit ADDs in a 128-bit register) because Simple-V takes the - *standard* RV opcodes (present and future) and automatically parallelises - them. -* By inheriting the RVV feature of arbitrary vector-length, then just as - with RVV the corner-cases and ISA proliferation of SIMD is avoided. -* Whilst not entirely finalised, registers are expected to be - capable of being subdivided down to an implementor-chosen bitwidth - in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8] - and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can - choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit - ALUs that perform twin 8-bit operations as they see fit, or anything - else including no subdivisions at all. -* Even though implementors have that choice even to have full 64-bit - (with RV64) SIMD, they *must* provide predication that transparently - switches off appropriate units on the last loop, thus neatly fitting - underlying SIMD ALU implementations *into* the arbitrary vector-length - RVV paradigm, keeping the uniform consistent API that is a key strategic - feature of Simple-V. -* With Simple-V fitting into the standard register files, certain classes - of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0]) - can be done by applying *Parallelised* Bit-manipulation operations - followed by parallelised *straight* versions of element-to-element - arithmetic operations, even if the bit-manipulation operations require - changing the bitwidth of the "vectors" to do so. Predication can - be utilised to skip high words (or low words) in source or destination. -* In essence, the key downside of SIMD - massive duplication of - identical functions over time as an architecture evolves from 32-bit - wide SIMD all the way up to 512-bit, is avoided with Simple-V, through - vector-style parallelism being dropped on top of 8-bit or 16-bit - operations, all the while keeping a consistent ISA-level "API" irrespective - of implementor design choices (or indeed actual implementations). - -### Example Instruction translation: - -Instructions "ADD r7 r4 r4" would result in three instructions being -generated and placed into the FIFO. r7 and r4 are marked as "vectorised": - -* ADD r7 r4 r4 -* ADD r8 r5 r5 -* ADD r9 r6 r6 - -Instructions "ADD r7 r4 r1" would result in three instructions being -generated and placed into the FIFO. 
r7 and r1 are marked as "vectorised"
-whilst r4 is not:
-
-* ADD r7 r4 r1
-* ADD r8 r4 r2
-* ADD r9 r4 r3
-
-## Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
-    function op_add(rd, rs1, rs2) # add not VADD!
-      int i, id=0, irs1=0, irs2=0;
-      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
-      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
-      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
-      predval = get_pred_val(FALSE, rd);
-      for (i = 0; i < VL; i++)
-        if (predval & 1<<i) # skip this element if its predicate bit is zero
-           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
-           if (!int_vec[rd ].isvector) break;
-        if (int_vec[rd ].isvector)  { id += 1; }
-        if (int_vec[rs1].isvector)  { irs1 += 1; }
-        if (int_vec[rs2].isvector)  { irs2 += 1; }
-
-## Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explicitly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-becomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLE would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*! The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not. All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table data="""
-31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
-imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
-"""]]
-
-This would become:
-
-[[!table data="""
-31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
-reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is
-presently the "immediate offset" field: not only to add in a bit field
-to distinguish between floating-point compare and integer compare, and
-to add in a second source register, but also to use some of the bits
-as a predication target as well.
-
-[[!table data="""
-15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
-funct3 | imm | rs10 | imm | op |
-3 | 3 | 3 | 5 | 2 |
-C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
-"""]]
-
-The retro-fitted version uses the CS format:
-
-[[!table data="""
-15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
-funct3 | imm | rs10 | imm | | op |
-3 | 3 | 3 | 2 | 3 | 2 |
-C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct (predicated) comparison
-operators.
In both floating-point and integer cases those could be
-EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
-
-## Register reordering
-
-### Register File
-
-| Reg Num | Bits |
-| ------- | ---- |
-| r0 | (31..0) |
-| r1 | (31..0) |
-| r2 | (31..0) |
-| r3 | (31..0) |
-| r4 | (31..0) |
-| r5 | (31..0) |
-| r6 | (31..0) |
-| r7 | (31..0) |
-| .. | (31..0) |
-| r31 | (31..0) |
-
-### Vectorised CSR
-
-This may not be an actual CSR: it may be generated from the Vector
-Length CSR, as a single bit is less burdensome on the instruction
-decode phase.
-
-| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
-| - | - | - | - | - | - | - | - |
-| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
-
-### Vector Length CSR
-
-| Reg Num | VL (3..0) |
-| ------- | --------- |
-| r0 | 2 |
-| r1 | 0 |
-| r2 | 1 |
-| r3 | 1 |
-| r4 | 3 |
-| r5 | 0 |
-| r6 | 0 |
-| r7 | 1 |
-
-### Virtual Register Reordering
-
-This example assumes the above Vector Length CSR table.
-
-| Reg Num | Bits (0) | Bits (1) | Bits (2) |
-| ------- | -------- | -------- | -------- |
-| r0 | (31..0) | (31..0) | |
-| r2 | (31..0) | | |
-| r3 | (31..0) | | |
-| r4 | (31..0) | (31..0) | (31..0) |
-| r7 | (31..0) | | |
-
-### Bitwidth Virtual Register Reordering
-
-This example goes a little further and illustrates the effect that a
-bitwidth CSR has been set on a register. Preconditions:
-
-* RV32 assumed
-* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
-* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
-* vsetl rs1, 5 # set the vector length to 5
-
-This is interpreted as follows:
-
-* Given that the context is RV32, ELEN=32.
-* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
-* Therefore the actual vector length is up to *six* elements
-* However vsetl sets a length of 5, therefore the last "element" is skipped
-
-So when using an operation that uses r2 as a source (or destination)
-the operation is carried out as follows:
-
-* 16-bit operation on r2(15..0) - vector element index 0
-* 16-bit operation on r2(31..16) - vector element index 1
-* 16-bit operation on r3(15..0) - vector element index 2
-* 16-bit operation on r3(31..16) - vector element index 3
-* 16-bit operation on r4(15..0) - vector element index 4
-* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
-
-Predication has been left out of the above example for simplicity, however
-predication is ANDed with the latter stages (vsetl not equal to maximum
-capacity).
-
-Note also that it is entirely an implementor's choice as to whether to have
-actual separate ALUs down to the minimum bitwidth, or whether to have something
-more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
-operations carried out 32-bits at a time is perfectly acceptable, as is
-8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
-Regardless of the internal parallelism choice, *predication must
-still be respected*, making Simple-V in effect the "consistent public API".
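-
-For anyone wanting to check the element-to-register mapping by hand,
-here is a small self-contained C sketch that reproduces the worked
-example above (RV32, r2 with 16-bit elwidth, the vector covering
-r2..r4, vsetl having set VL=5; all numbers are taken from the bullets
-above, nothing new is being proposed):
-
-    #include <stdio.h>
-
-    int main(void) {
-        const int base_reg = 2, elwidth = 16, elen = 32, vl = 5;
-        const int simdmult = elen / elwidth;    /* 2 elements per register */
-        for (int i = 0; i < vl; i++) {
-            int reg = base_reg + i / simdmult;  /* which real register    */
-            int lo  = (i % simdmult) * elwidth; /* low bit of the element */
-            printf("element %d -> r%d(%d..%d)\n",
-                   i, reg, lo + elwidth - 1, lo);
-        }
-        return 0;
-    }
-
-Running it prints the five element mappings listed above; the upper
-half of r4 never appears, mirroring the skipped sixth element.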
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth | bytestable |
-| --- | -------- | ---------- |
-| 000 | default | XLEN/8 |
-| 001 | 8 | 1 |
-| 010 | 16 | 2 |
-| 011 | 32 | 4 |
-| 100 | 64 | 8 |
-| 101 | 128 | 16 |
-| 110 | rsvd | rsvd |
-| 111 | rsvd | rsvd |
-
-Pseudocode for vector length taking CSR SIMD-bitwidth into account:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0)
-        bytesperreg = (XLEN/8) # or FLEN as appropriate
-    else
-        bytesperreg = bytestable[vew] # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-
-To index an element in a register rnum where the vector element index is i:
-
-    function regoffs(rnum, i):
-        regidx = floor(i / simdmult)       # integer-div rounded down
-        byteidx = i % simdmult             # element index within the register
-        elwidth = bytesperreg * 8          # element width in bits
-        return rnum + regidx,              # actual real register
-               byteidx * elwidth,          # low bit
-               byteidx * elwidth + elwidth - 1, # high bit
-
-### Insights
-
-SIMD register file splitting still to consider. For RV64, benefits of doubling
-(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
-size of the floating point register file to 64 (128 in the case of HP)
-seem pretty clear and worth the complexity.
-
-With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
-are done on 64-bit registers, it's not so conceptually difficult. It may
-even be achieved by *actually* splitting the regfile into 64 virtual 32-bit
-registers such that a 64-bit FP scalar operation is dropped into (r0.H
-r0.L) tuples. Implementation is therefore hidden through register renaming.
-
-Implementations intending to introduce VLIW, OoO and parallelism
-(even without Simple-V) would then find that the instructions are
-generated quicker (or in a more compact fashion that is less heavy
-on caches). Interestingly we observe then that Simple-V is about
-"consolidation of instruction generation", where actual parallelism
-of underlying hardware is an implementor-choice that could just as
-equally be applied *without* Simple-V even being implemented.
-
-## Analysis of CSR decoding on latency
-
-It could indeed have been logically deduced (or expected) that there
-would be additional decode latency in this proposal because, by
-overloading the opcodes to have different meanings, there is guaranteed
-to be some state, somewhere, directly related to registers.
-
-There are several cases (a decode sketch follows the list):
-
-* All operands vector-length=1 (scalars), all operands
-  packed-bitwidth="default": instructions are passed straight through as if
-  Simple-V did not exist. Simple-V is, in effect, completely disabled.
-* At least one operand vector-length > 1, all operands
-  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
-  virtual parallelism looping may be activated.
-* All operands vector-length=1 (scalars), at least one
-  operand packed-bitwidth != default: degenerate case of SIMD,
-  implementation-specific complexity here (packed decode before ALUs or
-  *IN* ALUs)
-* At least one operand vector-length > 1, at least one operand
-  packed-bitwidth != default: parallel vector ALUs (if any)
-  placed on "alert", virtual parallelism looping may be activated,
-  implementation-specific SIMD complexity kicks in (packed decode before
-  ALUs or *IN* ALUs).
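-
-The four cases above lend themselves to a simple classification step at
-decode time. The following C sketch is purely illustrative (the type and
-enum names are invented for this example; nothing here mandates an
-implementation technique):
-
-    #include <stdbool.h>
-
-    /* Hypothetical per-operand state, mirroring the CSR schemes discussed. */
-    typedef struct { int vlen; bool packed_bitwidth_set; } operand_t;
-
-    typedef enum {
-        SCALAR_PASSTHROUGH,   /* Simple-V effectively disabled         */
-        VECTOR_LOOP,          /* virtual-parallelism looping activated */
-        PACKED_SIMD_ONLY,     /* degenerate SIMD case                  */
-        VECTOR_PACKED_SIMD    /* both looping and packed decode        */
-    } decode_case_t;
-
-    decode_case_t classify(const operand_t *ops, int n) {
-        bool any_vec = false, any_packed = false;
-        for (int i = 0; i < n; i++) {
-            if (ops[i].vlen > 1)            any_vec = true;
-            if (ops[i].packed_bitwidth_set) any_packed = true;
-        }
-        if (any_vec) return any_packed ? VECTOR_PACKED_SIMD : VECTOR_LOOP;
-        return any_packed ? PACKED_SIMD_ONLY : SCALAR_PASSTHROUGH;
-    }
-
-The first case is the one that matters most for latency: if the
-classification can be done cheaply (e.g. a single-bit "vectorised" flag
-per register, as noted in the "Vectorised CSR" section above), scalar
-code pays nothing.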
-
-Bear in mind that the proposal includes that the decision whether
-to parallelise in hardware or whether to virtual-parallelise (to
-dramatically simplify compilers and also not to run into the SIMD
-instruction proliferation nightmare) *or* a transparent combination
-of both, be done on a *per-operand basis*, so that implementors can
-specifically choose to create an application-optimised implementation
-that they believe (or know) will sell extremely well, without having
-"Extra Standards-Mandated Baggage" that would otherwise blow their area
-or power budget completely out the window.
-
-Additionally, two possible CSR schemes have been proposed, in order to
-greatly reduce CSR space:
-
-* per-register CSRs (vector-length and packed-bitwidth)
-* a smaller number of CSRs with the same information but with an *INDEX*
-  specifying WHICH register in one of three regfiles (vector, fp, int)
-  the length and bitwidth applies to.
-
-(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
-
-In addition, LOAD/STORE has its own associated proposed CSRs that
-mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
-V (and Hwacha).
-
-Also bear in mind that, for reasons of simplicity for implementors,
-I was coming round to the idea of permitting implementors to choose
-exactly which bitwidths they would like to support in hardware and which
-to allow to fall through to software-trap emulation.
-
-So the question boils down to:
-
-* whether either (or both) of those two CSR schemes have significant
-  latency that could even potentially require an extra pipeline decode stage
-* whether there are implementations that can be thought of which do *not*
-  introduce significant latency
-* whether it is possible to explicitly (through quite simply
-  disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
-  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
-  the extreme of skipping an entire pipeline stage (if one is needed)
-* whether packed bitwidth and associated regfile splitting is so complex
-  that it should definitely, definitely be made mandatory that implementors
-  move regfile splitting into the ALU, and what are the implications of that
-* whether, even if that *is* made mandatory, software-trapped
-  "unsupported bitwidths" are still desirable, on the basis that SIMD is
-  such a complete nightmare that *even* having a software implementation is
-  better, making Simple-V have more in common with a software API than
-  anything else.
-
-Whilst the above may seem to be severe minuses, there are some strong
-pluses:
-
-* Significant reduction of V's opcode space: over 95%.
-* Smaller reduction of P's opcode space: around 10%.
-* The potential to use Compressed instructions in both Vector and SIMD
-  due to the overloading of register meaning (implicit vectorisation,
-  implicit packing)
-* Not only present but also future extensions automatically gain parallelism.
-* Already mentioned but worth emphasising: the simplification to compiler
-  writers and assembly-level writers of having the same consistent ISA
-  regardless of whether the internal level of parallelism (number of
-  parallel ALUs) is only equal to one ("virtual" parallelism), or is
-  greater than one, should not be underestimated.
-
-## Reducing Register Bank porting
-
-This looks quite reasonable.
-
-
-The main details are outlined on page 4.
## Reducing Register Bank porting

This looks quite reasonable. The main details are outlined on page 4.

They propose a 2-level register cache hierarchy; note that registers are
typically only read once; that you never write back from the upper to the
lower cache level, but always go in a cycle (lower -> upper -> ALU ->
lower); and, at the top of page 5, they propose a scheme where you look
ahead by only 2 instructions to determine which registers to bring into
the cache.

The nice thing about a vector architecture is that you *know* that
*even more* registers are going to be pulled in: Hwacha uses this fact
to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
by *introducing* deliberate latency into the execution phase.

## Overflow registers in combination with predication

**TODO**: propose that overflow registers actually be one of the integer
regs (flowing to multiple regs).

**TODO**: propose "mask" (predication) registers likewise. The combination
with standard RV instructions and overflow registers is extremely
powerful: see Aspex ASP.

When integer overflow is stored in an easily-accessible bit (or another
register), parallelisation turns this into a group of bits which can
potentially be interacted with in predication, in interesting and powerful
ways. For example, by taking the integer-overflow result as a predication
field and shifting it by one, a predicated vectorised "add one" can emulate
"carry" on arbitrary (unlimited) length addition, as sketched below.

However, despite RVV having made room for floating-point exceptions,
neither RVV nor base RV has taken integer overflow (carry) into account,
which makes proposing it quite challenging given that the relevant (Base)
RV sections are frozen. Consequently it makes sense to forgo this feature.
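Even though the feature is forgone, the mechanism is worth making
concrete. Below is a toy Python model (not ISA-accurate; the 8-bit element
width, the vector length and all names are choices for this example only)
of a predicated, vectorised "add one" driven by a shifted overflow
bit-field, performing arbitrary-length addition on 8-bit limbs:

    WIDTH = 8
    MASK = (1 << WIDTH) - 1

    def vadd(a, b):
        """Element-wise add, capturing per-element overflow as a bit-field."""
        sums = [(x + y) & MASK for x, y in zip(a, b)]
        ovf = [(x + y) > MASK for x, y in zip(a, b)]
        return sums, ovf

    # little-endian 8-bit limbs: a = 0xFF01FF, b = 0x000001
    a = [0xFF, 0x01, 0xFF, 0x00]
    b = [0x01, 0x00, 0x00, 0x00]
    sums, ovf = vadd(a, b)
    pred = [False] + ovf[:-1]   # shift the overflow field up one element
    while any(pred):            # predicated "add one"; repeat in case the
        ovf = [p and s == MASK for s, p in zip(sums, pred)]  # carry ripples
        sums = [(s + 1) & MASK if p else s for s, p in zip(sums, pred)]
        pred = [False] + ovf[:-1]
    print(sums)                 # [0x00, 0x02, 0xFF, 0x00]: limbs of 0xFF0200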
## Context Switch Example

An unusual side-effect of Simple-V mapping onto the standard register files
is that LOAD-multiple and STORE-multiple are accidentally available, as long
as it is acceptable that the register(s) to be loaded/stored are contiguous
(per instruction). An additional accidental benefit is that Compressed LD/ST
may also be used.

To illustrate how this works, here is some example code from FreeRTOS
(GPLv2 licensed, portasm.S):

    /* Macro for saving task context */
    .macro portSAVE_CONTEXT
    .global pxCurrentTCB
    /* make room in stack */
    addi sp, sp, -REGBYTES * 32

    /* Save Context */
    STORE x1, 0x0(sp)
    STORE x2, 1 * REGBYTES(sp)
    STORE x3, 2 * REGBYTES(sp)
    ...
    ...
    STORE x30, 29 * REGBYTES(sp)
    STORE x31, 30 * REGBYTES(sp)

    /* Store current stackpointer in task control block (TCB) */
    LOAD t0, pxCurrentTCB //pointer
    STORE sp, 0x0(t0)
    .endm

    /* Saves current error program counter (EPC) as task program counter */
    .macro portSAVE_EPC
    csrr t0, mepc
    STORE t0, 31 * REGBYTES(sp)
    .endm

    /* Saves current return address (RA) as task program counter */
    .macro portSAVE_RA
    STORE ra, 31 * REGBYTES(sp)
    .endm

    /* Macro for restoring task context */
    .macro portRESTORE_CONTEXT

    .global pxCurrentTCB
    /* Load stack pointer from the current TCB */
    LOAD sp, pxCurrentTCB
    LOAD sp, 0x0(sp)

    /* Load task program counter */
    LOAD t0, 31 * REGBYTES(sp)
    csrw mepc, t0

    /* Run in machine mode */
    li t0, MSTATUS_PRV1
    csrs mstatus, t0

    /* Restore registers,
    Skip global pointer because that does not change */
    LOAD x1, 0x0(sp)
    LOAD x4, 3 * REGBYTES(sp)
    LOAD x5, 4 * REGBYTES(sp)
    ...
    ...
    LOAD x30, 29 * REGBYTES(sp)
    LOAD x31, 30 * REGBYTES(sp)

    addi sp, sp, REGBYTES * 32
    mret
    .endm

The important bits are the Load / Save context, which may be replaced by
first setting up the Vectors and then using a *single* STORE (or LOAD),
including using C.ST or C.LD, to indicate that the entire bank of
registers is to be loaded/saved:

    /* a few things are assumed here: (a) that when switching to
       M-Mode an entirely different set of CSRs is used from that
       which is used in U-Mode and (b) that the M-Mode x1 and x4
       vectors are also not used anywhere else in M-Mode, consequently
       only need to be set up just the once.
     */
    .macro VectorSetup
    MVECTORCSRx1 = 31, defaultlen
    MVECTORCSRx4 = 28, defaultlen

    /* Save Context */
    SETVL x0, x0, 31 /* x0 ignored silently */
    STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth

    /* Restore registers,
    Skip global pointer because that does not change */
    LOAD x1, 0x0(sp)
    SETVL x0, x0, 28 /* x0 ignored silently */
    LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth

Note that although it may just be a bug in portasm.S, x2 and x3 appear not
to be restored. If, however, this is a bug and they *do* need to be
restored, then the SETVL call may be moved to *outside* the Save / Restore
Context assembly code, into the VectorSetup macro, as long as vectors are
never used anywhere else (i.e. VL is never altered by M-Mode).

In effect the entire bank of repeated LOAD / STORE instructions is replaced
by one single (compressed, if it is available) instruction.

## Virtual Memory page-faults on LOAD/STORE

### Notes from conversations

> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
> riscv-isa-manual in order to work out how to re-map RVV onto the standard
> ISA, and came across some interesting comments at the bottom of pages 75
> and 76:

> "A common mechanism used in other ISAs to further reduce save/restore
> code size is load-multiple and store-multiple instructions."

> Fascinatingly, due to Simple-V proposing to use the *standard* register
> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
> that: load-multiple and store-multiple instructions. Which brings us
> on to this comment:

> "For virtual memory systems, some data accesses could be resident in
> physical memory and some could not, which requires a new restart
> mechanism for partially executed instructions."

> Which then of course brings us to the interesting question: how does RVV
> cope with the scenario when, particularly with LD.X (Indexed / indirect
> loads), part-way through the loading a page fault occurs?

> Has this been noted or discussed before?

For applications-class platforms, the RVV exception model is
element-precise (that is, if an exception occurs on element j of a
vector instruction, elements 0..j-1 have completed execution and elements
j+1..vl-1 have not executed). A toy model follows below.

Certain classes of embedded platforms where exceptions are always fatal
might choose to offer resumable/swappable interrupts but not precise
exceptions.

> Is RVV designed in any way to be re-entrant?

Yes.

> What would the implications be for instructions that were in a FIFO at
> the time, in out-of-order and VLIW implementations, where partial decode
> had taken place?

The usual bag of tricks for maintaining precise exceptions applies to
vector machines as well. Register renaming makes the job easier, and
it's relatively cheaper for vectors, since the control cost is amortized
over longer registers.
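As a toy model of the element-precise behaviour described above (Python;
all names invented, and a real implementation traps in hardware rather
than raising a software exception):

    class PageFault(Exception):
        def __init__(self, element):
            self.element = element

    def vload(mem, base, vl, dest, start=0):
        """Elements start..j-1 commit; a fault on element j leaves
           j..vl-1 unexecuted (element-precise)."""
        for i in range(start, vl):
            if base + i not in mem:   # page not resident
                raise PageFault(i)    # dest[0..i-1] already written
            dest[i] = mem[base + i]

    mem = {0: 10, 1: 11, 3: 13}       # address 2 is "not paged in"
    dest = [None] * 4
    try:
        vload(mem, 0, 4, dest)
    except PageFault as pf:
        mem[pf.element] = 12          # handler "pages in" the missing data
        vload(mem, 0, 4, dest, start=pf.element)   # restart at element j
    print(dest)                       # [10, 11, 12, 13]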
> Would it be reasonable at least to say *bypass* (and freeze) the
> instruction FIFO (drop down to a single-issue execution model temporarily)
> for the purposes of executing the instructions in the interrupt (whilst
> setting up the VM page), then re-continue the instruction with all
> state intact?

This approach has been done successfully, but it's desirable to be
able to swap out the vector unit state to support context switches on
exceptions that result in long-latency I/O.

> Or would it be better to switch to an entirely separate secondary
> hyperthread context?

> Does anyone have any ideas or know if there is any academic literature
> on solutions to this problem?

The Vector VAX offered imprecise but restartable and swappable exceptions:
http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf

Sec. 4.6 of Krste's dissertation assesses some of
the tradeoffs and references a bunch of related work:
http://people.eecs.berkeley.edu/~krste/thesis.pdf

----

Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 (1985)
F.P. exceptions" and thought, "hmmm, that could go into a CSR, must re-read
the section on FP state CSRs in RVV 0.4-Draft again", then i suddenly
thought, "ah ha! what if the memory exceptions, instead of having
an immediate exception thrown, were simply stored in a type of predication
bit-field with a flag "error: this element failed"?"

Then, *after* the vector load (or store, or even operation) was
performed, you could *then* raise an exception, at which point it
would be possible (yes in software... I know....) to go "hmmm, these
indexed operations didn't work, let's get them into memory by triggering
page-loads", then *re-run the entire instruction* but this time with a
"memory-predication CSR" that stops the already-performed operations
(whether they be loads, stores or an arithmetic / FP operation) from
being carried out a second time (see the sketch at the end of this
sub-section).

This could theoretically end up being done multiple times in an SMP
environment, and for LD.X there would also be the remote (and annoying)
possibility that the indexed memory address could end up being modified.

The advantage would be that the order of execution need not be
sequential, which potentially could have some big advantages.
Am still thinking through the implications, as any dependent operations
(particularly ones already decoded and moved into the execution FIFO)
would still be there (and stalled). hmmm.

----

> > # assume internal parallelism of 8 and MAXVECTORLEN of 8
> > VSETL r0, 8
> > FADD x1, x2, x3
>
> > x3[0]: ok
> > x3[1]: exception
> > x3[2]: ok
> > ...
> > ...
> > x3[7]: ok
>
> > what happens to result elements 2-7?  those may be *big* results
> > (RV128) or in the RVV-Extended may be arbitrary bit-widths far greater.
>
> (you replied:)
>
> Thrown away.

Discussion then led to the question of OoO architectures:

> The costs of the imprecise-exception model are greater than the benefit.
> Software doesn't want to cope with it.  It's hard to debug.  You can't
> migrate state between different microarchitectures--unless you force all
> implementations to support the same imprecise-exception model, which would
> greatly limit implementation flexibility.  (Less important, but still
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)
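Returning to the "memory-predication CSR" idea above, here is a toy Python
model; the CSR itself is modelled as a plain `done` bit-list, and all
names are invented for this sketch:

    def vload_collect(mem, base, vl, dest, done):
        """Attempt every element (order need not be sequential); record
           failures in a predication-style bit-field instead of trapping."""
        failed = [False] * vl
        for i in range(vl):
            if done[i]:
                continue              # memory-predication: already performed
            if base + i in mem:
                dest[i] = mem[base + i]
                done[i] = True
            else:
                failed[i] = True      # flag "error: this element failed"
        return failed                 # raise one exception afterwards if set

    mem, dest, done = {0: 10, 3: 13}, [None] * 4, [False] * 4
    failed = vload_collect(mem, 0, 4, dest, done)
    for i, f in enumerate(failed):    # software triggers the page-loads...
        if f:
            mem[i] = 99
    vload_collect(mem, 0, 4, dest, done)  # ...and re-runs; done[] masks 0, 3
    print(dest)                       # [10, 99, 99, 13]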
## Zero/Non-zero Predication

>> > it just occurred to me that there's another reason why the data
>> > should be left instead of zeroed.  if the standard register file is
>> > used, such that vectorised operations are translated to mean "please
>> > insert multiple register-contiguous operations into the instruction
>> > FIFO" and predication is used to *skip* some of those, then if the
>> > next "vector" operation uses the (standard) registers that were masked
>> > *out* of the previous operation it may proceed without blocking.
>> >
>> > if however zeroing is made mandatory then that optimisation becomes
>> > flat-out impossible to deploy.
>> >
>> > whilst i haven't fully thought through the full implications, i
>> > suspect RVV might also be able to benefit by being able to fit more
>> > overlapping operations into the available SRAM by doing something
>> > similar.
>
> Luke, this is called density time masking. It doesn't apply only to your
> model where the "standard register file" is used: it applies to any
> architecture that attempts to speed up by skipping computation and writeback
> of masked elements.
>
> That said, the writing of zeros need not be explicit. It is possible to add
> a "zero bit" per element that, when set, forces a zero to be read from the
> vector (although the underlying storage may have old data). In this case,
> there may be a way to implement DTM as well.

## Implementation detail for scalar-only op detection

Note 1: this idea is a pipeline-bypass concept, which may *or may not* be
worthwhile.

Note 2: this is just one possible implementation. Another implementation
may choose to treat *all* operations as vectorised (including treating
scalars as vectors of length 1), choosing to add an extra pipeline stage
dedicated to *all* instructions.

This section *specifically* covers the implementor's freedom to minimise
disruption to an existing design by detecting "scalar-only operations"
and bypassing the vectorisation phase (which may or may not require an
additional pipeline stage).

[[scalardetect.png]]

>> For scalar ops an implementation may choose to compare 2-3 bits through an
>> AND gate: are src & dest scalar? Yep, ok send straight to ALU (or instr
>> FIFO).

> Those bits cannot be known until after the registers are decoded from the
> instruction and a lookup in the "vector length table" has completed.
> Considering that one of the reasons RISC-V keeps registers in invariant
> positions across all instructions is to simplify register decoding, I expect
> that inserting an SRAM read would lengthen the critical path in most
> implementations.

reply:

> briefly: the trick i mentioned about ANDing bits together to check if
> an op was fully-scalar or not was to be read out of a single 32-bit
> 3R1W SRAM (64-bit if an FPU exists). the 32/64-bit SRAM contains 1 bit per
> register indicating "is register vectorised, yes / no". 3R because you need
> to check src1, src2 and dest simultaneously. the entries are *generated*
> from the CSRs and are an optimisation that on slower embedded systems
> would likely not be needed.

> is there anything unreasonable that anyone can foresee about that?
> what are the down-sides?
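The bitmap trick in that reply can be stated in a few lines of Python (a
sketch only: in hardware this would be the 3R1W SRAM read plus OR/AND
gating, not software, and all names here are invented):

    def build_vec_bitmap(vectorised_regs):
        """One bit per register: set if that register's CSR vlen > 1.
           Regenerated whenever the Simple-V CSRs are written."""
        bitmap = 0
        for r in vectorised_regs:
            bitmap |= 1 << r
        return bitmap

    def is_fully_scalar(bitmap, rd, rs1, rs2):
        # the 3R "read": one bit each for dest, src1 and src2, ORed
        # together; if none is marked vectorised, bypass vectorisation
        return (((bitmap >> rd) | (bitmap >> rs1) | (bitmap >> rs2)) & 1) == 0

    bitmap = build_vec_bitmap({8, 9})        # x8, x9 marked vectorised
    print(is_fully_scalar(bitmap, 1, 2, 3))  # True: straight to ALU / FIFO
    print(is_fully_scalar(bitmap, 8, 2, 3))  # False: vectorisation needed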
## C.MV predicated src, predicated dest

> Can this be usefully defined in such a way that it is
> equivalent to vector gather-scatter on each source, followed by a
> non-predicated vector-compare, followed by vector gather-scatter on the
> result?

## element width conversion: restrict or remove?

summary: don't restrict / remove. it's fine.

> > it has virtually no cost/overhead as long as you specify
> > that inputs can only upconvert, and operations are always done at the
> > largest size, and downconversion only happens at the output.
>
> okaaay.  so that's a really good piece of implementation advice.
> algorithms do require data size conversion, so at some point you need to
> introduce the feature of upconverting and downconverting.
>
> > for int and uint, this is dead simple and fits well within the RVV pipeline
> > without any critical path, pipeline depth, or area implications.

## Under review / discussion: remove CSR vector length, use VSETVL

**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines
length on all regs**. This section kept for historical reasons.

So the issue is as follows:

* CSRs are used to set the "span" of a vector (how many of the standard
  register file to contiguously use)
* VSETVL in RVV works as follows: it sets the vector length (a copy of
  which is placed in a dest register), and if the "required" length is
  longer than the *available* length, the dest reg is set to the MIN of
  those two.
* **HOWEVER**... in SV, *EVERY* vector register has its own separate
  length, and thus there is no way (at the time that VSETVL is called) to
  know what to set the vector length *to*.
* At first glance it seems that it would be perfectly fine to just limit
  the vector operation to the length specified in the destination
  register's CSR, at the time that each instruction is issued...
  except that this cannot possibly be guaranteed to match
  the value *already loaded into the target register by VSETVL*.

Therefore a different approach is needed.

Possible options include:

* Removing the CSR "Vector Length" and always using the value from
  VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and*
  destreg equal to MIN(counterreg, lenimmed), with the register-based
  variant "VSETVL destreg, counterreg, lenreg" doing the same.
* Keeping the CSR "Vector Length" and having the lenreg version have
  a "twist": "if lenreg is vectorised, read the length from the CSR"
* Other (TBD)

The first option (of the ones brainstormed so far) is a lot simpler.
It does, however, mean that the length set in VSETVL will apply
across-the-board to all src1, src2 and dest vectorised registers until it
is otherwise changed (by another VSETVL call). This is probably desirable
behaviour.
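In pseudocode, the chosen first option would behave roughly as follows (a
sketch: the clamp to a hardware maximum, MVL, is an assumption borrowed
from the MAXVECTORLEN discussion earlier, and the function name is
invented):

    MVL = 8   # hardware maximum vector length (implementation choice)

    def vsetvl(counter, length):
        """Sets VL *and* destreg to MIN(counterreg, lenimmed/lenreg),
           clamped to the hardware maximum; the result then applies
           across-the-board to all vectorised operands."""
        return min(counter, length, MVL)   # written to VL and to destreg

    print(vsetvl(100, 31))  # 8: capped at the hardware maximum
    print(vsetvl(4, 31))    # 4: loop tail, counter shorter than requested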
## Implementation Paradigms

TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
embedded systems or to reduce design complexity and the burden of
design implementation and compliance, in non-critical areas), right the
way through to high-performance systems:

* Full (or partial) software-emulated (via traps): full support for CSRs
  required, however when a register is used that is detected (in hardware)
  to be vectorised, an exception is thrown.
* Single-issue in-order, reduced pipeline depth (traditional SIMD / DSP)
* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
* Out-of-order with instruction FIFOs and aggressive register-renaming
* VLIW

Also to be taken into consideration:

* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
* Comprehensive vectorisation: FIFOs and internal parallelism
* Hybrid Parallelism

### Full or partial software-emulation

The absolute minimal implementation is to provide the full
set of CSRs and detection logic for when any of the source or destination
registers are vectorised. On detection, a trap is thrown, whether it's
a branch, LOAD, STORE, or an arithmetic operation.

Implementors are entirely free to choose whether to allow absolutely every
single operation to be software-emulated, or whether to provide some emulation
and some hardware support. In particular, for an RV32E implementation
where fast context-switching is a requirement (see "Context Switch Example"),
it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
exception, as every context-switch would result in double-traps.

# TODO Research

> For great floating point DSPs check TI's C3x, C4x, and C6xx DSPs

Idea: basic simple butterfly swap on a few element indices, primarily targeted
at SIMD / DSP: high-byte/low-byte swapping, high-word/low-word swapping,
perhaps allowing reindexing of permutations up to 4 elements? 8? The reason:
such operations are less costly than a full indexed-shuffle, which requires
a separate instruction cycle.

Predication "all zeros" needs to be "leave alone". Detection of
ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
Destruction of destination indices requires a copy of the entire vector
in advance in order to avoid it.

TBD: floating-point compare and other exception handling

------

Multi-LR/SC

Please don't try to use the L1 itself.

Use the Load and Store buffers, which capture instruction state prior
to being accessed in the L1 (and, in the case of the Store buffer, prior
to the data arriving).

Also, use the L1 Miss buffers, as these already HAVE to be snooped by
coherence traffic. These are used to monitor that all participating
cache lines remain interference-free, amalgamating this into a CPU
signal accessible via branch or predicate. The Load buffers manage
inbound traffic; the Store buffers manage outbound traffic.

Done properly, the participating cache lines can exceed the associativity
of the L1 cache without architectural harm (though this may incur
additional latency).

> > > so, let's say instead of another LR *cancelling* the load
> > > reservation, the SMP core / hardware thread *blocks* for
> > > up to 63 further instructions, waiting for the reservation
> > > to clear.
> >
> > Can you explain what you mean by this paragraph?
>
> best put in sequential events, probably.
>
> LR <-- 64-instruction countdown starts here
> ... 63
> ... 62
> LR same address <--- notes that core1 is on 61,
>    so pauses for **UP TO** 61 cycles
> ... 32
> SC <- core1 didn't reach zero, therefore valid, therefore
>       core2 is now **UNBLOCKED**, is granted the
>       load-reservation (and begins its **own** 64-cycle
>       LR instruction countdown)
> ... 63
> ... 62
> ...
> ...
> SC <- also valid

It looks to me as though you could effect the same functionality by simply
holding onto the cache line in core 1, preventing core 2 from getting
past the LR.

On the other hand, the freeze is similar to how the MP CRAYs did
ATOMIC stuff.
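A toy model of the countdown scheme in the exchange above (a sketch under
stated assumptions only: a single shared reservation, a 64-instruction
countdown, and the "pause for up to the remaining count" rule; all names
are invented):

    class Reservation:
        COUNTDOWN = 64

        def __init__(self):
            self.owner, self.remaining = None, 0

        def step(self):
            """Called once per instruction; an expired countdown
               silently clears the reservation."""
            if self.owner is not None:
                self.remaining -= 1
                if self.remaining == 0:
                    self.owner = None

        def lr(self, core):
            """Returns 0 if granted immediately, otherwise the maximum
               number of instructions this core must pause."""
            if self.owner in (None, core):
                self.owner, self.remaining = core, self.COUNTDOWN
                return 0
            return self.remaining     # pause for **UP TO** this many

        def sc(self, core):
            """Valid only while this core holds a live reservation;
               always clears it, unblocking any waiting core."""
            ok = self.owner == core and self.remaining > 0
            self.owner = None
            return ok

    res = Reservation()
    res.lr("core1")                   # core1's 64-instruction countdown starts
    for _ in range(3):
        res.step()
    print(res.lr("core2"))            # 61: core2 pauses for up to 61 cycles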
# References

* SIMD considered harmful

-- 
2.30.2