X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=6052db45f14896bfcb239098d0c18fcb053f1f5a;hb=958b1c9d1542ddb811a734a77e90adb8ebbbc7d8;hp=19f39487b967ae69e9efb5257e7c1873cd94d06c;hpb=dd998d14a710196e26d557999115031ae1b8f040;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 19f39487b..6052db45f 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,31 +1,71 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal

+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit from
+the uniformity of a consistent API.
+
+**No arithmetic operations are added or required to be added.** SV is purely
+a parallelism API and consequently is suitable for use even with RV32E.
+
+* Talk slides:
+* Specification: now moved to its own page: [[specification]]
+
 [[!toc ]]

+# Introduction
+
 This proposal exists so as to be able to satisfy several disparate
 requirements: power-conscious, area-conscious, and performance-conscious
 designs all pull an ISA and its implementation in different conflicting
 directions, as do the specific intended uses for any given implementation.

-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
 whilst each extremely powerful in their own right and clearly desirable,
 are also:

-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
   so need work to adapt to the RISC-V ethos and paradigm
 * Are sufficiently large so as to make adoption (and exploration for
   analysis and review purposes) prohibitively expensive
 * Both contain partial duplication of pre-existing RISC-V instructions
   (an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
-  at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+  parallelism at the instruction level
 * Both require that their respective parallelism paradigm be implemented
   along-side and integral to their respective functionality *or not at all*.
 * Both independently have methods for introducing parallelism that
   could, if separated, benefit
   *other areas of RISC-V not just DSP or Floating-point respectively*.

-Therefore it makes a huge amount of sense to have a means and method
+There are also key differences between Vectorisation and SIMD (full
+details outlined in the Appendix), the key points being:
+
+* SIMD has a seductively compelling ease-of-implementation argument:
+  each operation is passed to the ALU, which is where the parallelism
+  lies. There is *negligible* (if any) impact on the rest of the core
+  (with life instead being made hell for compiler writers and applications
+  writers due to extreme ISA proliferation).
+* By contrast, Vectorisation has quite some complexity (for considerable + flexibility, reduction in opcode proliferation and much more). +* Vectorisation typically includes much more comprehensive memory load + and store schemes (unit stride, constant-stride and indexed), which + in turn have ramifications: virtual memory misses (TLB cache misses) + and even multiple page-faults... all caused by a *single instruction*, + yet with a clear benefit that the regularisation of LOAD/STOREs can + be optimised for minimal impact on caches and maximised throughput. +* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned + to pages), and these load/stores have absolutely nothing to do with the + SIMD / ALU engine, no matter how wide the operand. Simplicity but with + more impact on instruction and data caches. + +Overall it makes a huge amount of sense to have a means and method of introducing instruction parallelism in a flexible way that provides implementors with the option to choose exactly where they wish to offer performance improvements and where they wish to optimise for power @@ -36,148 +76,27 @@ Additionally it makes sense to *split out* the parallelism inherent within each of P and V, and to see if each of P and V then, in *combination* with a "best-of-both" parallelism extension, could be added on *on top* of this proposal, to topologically provide the exact same functionality of -each of P and V. +each of P and V. Each of P and V then can focus on providing the best +operations possible for their respective target areas, without being +hugely concerned about the actual parallelism. Furthermore, an additional goal of this proposal is to reduce the number of opcodes utilised by each of P and V as they currently stand, leveraging existing RISC-V opcodes where possible, and also potentially allowing P and V to make use of Compressed Instructions as a result. -**TODO**: reword this to better suit this document: - -Having looked at both P and V as they stand, they're _both_ very much -"separate engines" that, despite both their respective merits and -extremely powerful features, don't really cleanly fit into the RV design -ethos (or the flexible extensibility) and, as such, are both in danger -of not being widely adopted. I'm inclined towards recommending: - -* splitting out the DSP aspects of P-SIMD to create a single-issue DSP -* splitting out the polymorphism, esoteric data types (GF, complex - numbers) and unusual operations of V to create a single-issue "Esoteric - Floating-Point" extension -* splitting out the loop-aspects, vector aspects and data-width aspects - of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they - apply across *all* Extensions, whether those be DSP, M, Base, V, P - - everything. - -**TODO**: propose overflow registers be actually one of the integer regs -(flowing to multiple regs). - -**TODO**: propose "mask" (predication) registers likewise. combination with -standard RV instructions and overflow registers extremely powerful - -## CSRs marking registers as Vector - -A 32-bit CSR would be needed (1 bit per integer register) to indicate -whether a register was, if referred to, implicitly to be treated as -a vector. - -A second 32-bit CSR would be needed (1 bit per floating-point register) -to indicate whether a floating-point register was to be treated as a -vector. 
- -In this way any standard (current or future) operation involving -register operands may detect if the operation is to be vector-vector, -vector-scalar or scalar-scalar (standard) simply through a single -bit test. - -## CSR vector-length and CSR SIMD packed-bitwidth - -**TODO** analyse each of these: - -* splitting out the loop-aspects, vector aspects and data-width aspects -* integer reg 0 *and* fp reg0 share CSR vlen 0 *and* CSR packed-bitwidth 0 -* integer reg 1 *and* fp reg1 share CSR vlen 1 *and* CSR packed-bitwidth 1 -* .... -* ....  - -instead: - -* CSR vlen 0 *and* CSR packed-bitwidth 0 register contain extra bits - specifying an *INDEX* of WHICH int/fp register they refer to -* CSR vlen 1 *and* CSR packed-bitwidth 1 register contain extra bits - specifying an *INDEX* of WHICH int/fp register they refer to -* ... -* ... - -Have to be very *very* careful about not implementing too few of those -(or too many). Assess implementation impact on decode latency. Is it -worth it? - -Implementation of the latter: - -Operation involving (referring to) register M: - -> bitwidth = default # default for opcode? -> vectorlen = 1 # scalar -> -> for (o = 0, o < 2, o++) ->   if (CSR-Vector_registernum[o] == M) ->       bitwidth = CSR-Vector_bitwidth[o] ->       vectorlen = CSR-Vector_len[o] ->       break - -and for the former it would simply be: - -> bitwidth = CSR-Vector_bitwidth[M] -> vectorlen = CSR-Vector_len[M] - -Alternatives: - -* One single "global" vector-length CSR - -## Stride - -**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular -register as being "if you use this reg in LOAD/STORE, use the offset -amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous". -can be used for matrix spanning. - -> For LOAD/STORE, could a better option be to interpret the offset in the -> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is -> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12), -> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the -> vector-control CSRs to select between offset-as-stride and unit-stride -> memory accesses? - -So there would be an instruction like this: - -| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM | -| opcode | 5 bit | 1 bit | 1 bit | 5 bit, OFFn=XLEN | - - -which would mean: - -* CSR-Offset register n <= (float|int) register number N -* CSR-Offset Stride-mode = offset or unit -* CSR-Offset amount register n = contents of register M - -LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set): - -> offs = 0 -> stride = 1 -> vector-len = CSR-Vector-length register N -> -> for (o = 0, o < 2, o++) -> if (CSR-Offset register o == M) -> offs = CSR-Offset amount register o -> if CSR-Offset Stride-mode == offset: -> stride = ldoffs -> break -> -> for (i = 0, i < vector-len; i++) -> r[N+i] = mem[(offs*i + r[M+i])*stride] - # Analysis and discussion of Vector vs SIMD -There are four combined areas between the two proposals that help with -parallelism without over-burdening the ISA with a huge proliferation of +There are six combined areas between the two proposals that help with +parallelism (increased performance, reduced power / area) without +over-burdening the ISA with a huge proliferation of instructions: * Fixed vs variable parallelism (fixed or variable "M" in SIMD) * Implicit vs fixed instruction bit-width (integral to instruction or not) * Implicit vs explicit type-conversion (compounded on bit-width) * Implicit vs explicit inner loops. 
+* Single-instruction LOAD/STORE.
 * Masks / tagging (selecting/preventing certain indexed elements from execution)

 The pros and cons of each are discussed and analysed below.
@@ -195,6 +114,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable for
 general-purpose computation, and in the context of developing a
 general-purpose ISA, is never going to satisfy 100 percent of implementors.

+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
 That basically leaves "variable-length vector" as the clear *general-purpose*
 winner, at least in terms of greatly simplifying the instruction set, reducing
 the number of instructions required for any given task, and thus
@@ -205,7 +131,8 @@ reducing power consumption for the same.

 SIMD again has a severe disadvantage here, over Vector: huge proliferation
 of specialist instructions that target 8-bit, 16-bit, 32-bit, 64-bit, and
 have to then have operations *for each and between each*. It gets very
-messy, very quickly.
+messy, very quickly: *six* separate dimensions giving an O(N^6) instruction
+proliferation profile.

 The V-Extension on the other hand proposes to set the bit-width of
 future instructions on a per-register basis, such that subsequent instructions
@@ -217,7 +144,7 @@ burdensome to implementations, given that instruction decode already has to
 direct the operation to a correctly-sized width ALU engine, anyway.

 Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
 implicit bit-width allows the meaning of certain operations to be
 type-overloaded *without* pollution or alteration of frozen and immutable
 instructions, in a fully backwards-compatible fashion.
@@ -230,13 +157,16 @@ integer (and floating point) of various sizes is automatically inferred
 due to "type tagging" that is set with a special instruction.  A register
 will be *specifically* marked as "16-bit Floating-Point" and, if added
 to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
 to be explicitly done with its own separate instruction.

 However, implicit type-conversion is not only quite burdensome to
 implement (explosion of inferred type-to-type conversion) but also is
 never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration.  Each new type results in
+an increased O(N^2) conversion space and, as anyone who has examined
+python's source code (which has built-in polymorphic type-conversion)
+will know, the task is more complex than it first seems.

 Overall, type-conversion is generally best to leave to explicit
 type-conversion instructions, or in definite specific use-cases left to
@@ -249,14 +179,48 @@ contains an extremely interesting feature: zero-overhead loops.
 This proposal would basically allow an inner loop of instructions to be
 repeated indefinitely, a fixed number of times.

-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
 implementation*.  Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
-
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
+
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However, when looking at much more comprehensive schemes such as
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, tending to suggest that ZOLC may be
+better off being proposed as an entirely separate Extension.
+
+## Single-instruction LOAD/STORE
+
+In traditional Vector Architectures there are instructions which
+result in multiple register-memory transfer operations from a
+single instruction.  They're complicated to implement in hardware,
+yet the benefit is a huge, consistent regularisation of memory accesses
+that can be highly optimised with respect to both actual memory and any
+L1, L2 or other caches.  In Hwacha EECS-2015-263 the consequences of
+getting this architecturally wrong are made explicitly clear:
+L2 cache-thrashing at the very least.
+
+Complications arise when Virtual Memory is involved: TLB cache misses
+need to be dealt with, as do page faults.  Some of the tradeoffs are
+discussed in Hwacha EECS-2015-263, Section 4.6, and an article by
+Jeff Bush, written when faced with some of these issues, is
+particularly enlightening.
+
+Interestingly, none of this complexity is faced in SIMD architectures...
+but then they do not get the opportunity to optimise for highly-streamlined
+memory accesses either.
+
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.

 ## Mask and Tagging (Predication)

@@ -277,8 +241,8 @@ So these are the ways in which conditional execution may be implemented:

 * explicit compare and branch: BNE x, y -> offs would jump offs instructions
   if x was not equal to y
 * explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
-  (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code): an ADD, for example, results in a carry, and the
+  carry bit implicitly (or sometimes explicitly) goes into a "tag" (mask) register

 The first of these is a "normal" branch method, which is flat-out impossible
 to parallelise without look-ahead and effectively rewriting instructions.
@@ -307,87 +271,20 @@ condition-codes or predication.  By adding a CSR it becomes possible
 to also tag certain registers as "predicated if referenced as a destination".
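+To illustrate the *intent* (not any normative encoding), here is a minimal
+python-style sketch of how such destination-predication could be applied at
+the decode/issue stage; the `pred_csr` table and the function names here are
+hypothetical, purely for illustration:
+
+    # hypothetical: maps a destination register number to the register
+    # holding its predication mask (as set up via a CSR instruction)
+    pred_csr = {}
+
+    def execute_predicated(regs, op, rd, rs1, rs2, vlen):
+        # an untagged destination behaves as if all mask bits were set
+        mask = regs[pred_csr[rd]] if rd in pred_csr else (1 << vlen) - 1
+        for i in range(vlen):
+            if mask & (1 << i):  # element skipped when its mask bit is zero
+                regs[rd + i] = op(regs[rs1 + i], regs[rs2 + i])
+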
Example: -> // in future operations if r0 is the destination use r5 as -> // the PREDICATION register -> IMPLICICSRPREDICATE r0, r5 -> // store the compares in r5 as the PREDICATION register -> CMPEQ8 r5, r1, r2 -> // r0 is used here. ah ha! that means it's predicated using r5! -> ADD8 r0, r1, r3 + // in future operations from now on, if r0 is the destination use r5 as + // the PREDICATION register + SET_IMPLICIT_CSRPREDICATE r0, r5 + // store the compares in r5 as the PREDICATION register + CMPEQ8 r5, r1, r2 + // r0 is used here. ah ha! that means it's predicated using r5! + ADD8 r0, r1, r3 -With enough registers (and there are enough registers) some fairly +With enough registers (and in RISC-V there are enough registers) some fairly complex predication can be set up and yet still execute without significant stalling, even in a simple non-superscalar architecture. -### Retro-fitting Predication into branch-explicit ISA - -One of the goals of this parallelism proposal is to avoid instruction -duplication. However, with the base ISA having been designed explictly -to *avoid* condition-codes entirely, shoe-horning predication into it -bcomes quite challenging. - -However what if all branch instructions, if referencing a vectorised -register, were instead given *completely new analogous meanings* that -resulted in a parallel bit-wise predication register being set? This -would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE, -BLT and BGE. - -We might imagine that FEQ, FLT and FLT would also need to be converted, -however these are effectively *already* in the precise form needed and -do not need to be converted *at all*! The difference is that FEQ, FLT -and FLE *specifically* write a 1 to an integer register if the condition -holds, and 0 if not. All that needs to be done here is to say, "if -the integer register is tagged with a bit that says it is a predication -register, the **bit** in the integer register is set based on the -current vector index" instead. - -There is, in the standard Conditional Branch instruction, more than -adequate space to interpret it in a similar fashion: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | -"""]] - -This would become: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | -"""]] - -Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, -with the interesting side-effect that there is space within what is presently -the "immediate offset" field to reinterpret that to add in not only a bit -field to distinguish between floating-point compare and integer compare, -not only to add in a second source register, but also use some of the bits as -a predication target as well. - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | op | - 3 | 3 | 3 | 5 | 2 | - C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | -"""]] - -Now uses the CS format: - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 
0 |
- funct3 | imm | rs10 | imm |       | op |
-    3   |  3  |  3   |  2  |   3   |  2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct comparison operators.
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see Appendix)

 ## Conclusions

@@ -399,29 +296,25 @@ follows:

 * Fixed vs variable parallelism: variable
 * Implicit (indirect) vs fixed (integral) instruction bit-width: indirect
 * Implicit vs explicit type-conversion: explicit
-* Implicit vs explicit inner loops: implicit
-* Tag or no-tag: Complex and needs further thought
-
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD.  Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
-
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
-
-# Instruction Format
-
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width. Lots to
-look at and investigate!*
+* Implicit vs explicit inner loops: implicit but best done separately
+* Single-instruction Vector LOAD/STORE: Complex but highly beneficial
+* Tag or no-tag: Complex but highly beneficial
+
+In particular:
+
+* variable-length vectors came out on top because of the high setup, teardown
+  and corner-cases associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+  former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simplify implementations
+  by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+  potentially 100% occupied in a single-issue in-order implementation,
+  i.e. *without* requiring a super-scalar or out-of-order architecture,
+  but doing a proper, full job (ZOLC) is an entirely different matter.
+
+Constructing a SIMD/Simple-Vector proposal based around four of these six
+requirements would therefore seem to be a logical thing to do.

 # Note on implementation of parallelism

@@ -447,505 +340,273 @@ basis* whether and how much "Virtual Parallelism" to deploy.  It is
 absolutely critical to note that it is proposed that such choices MUST be
 **entirely transparent** to the end-user and the compiler.  Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
+a Vector (variable-width SIMD) may not precisely match the width of the
 parallelism within the implementation, the end-user **should not care**
 and in this way the performance benefits are gained but the ISA remains
 straightforward.  All that happens at the end of an instruction run is: some
 parallel units (if there are any) would remain offline, completely
 transparently to the ISA, the program, and the compiler.
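+As a sketch of the point (illustrative python pseudocode only, assuming the
+queue-based single-ALU micro-architecture described in the "Key insight" at
+the top of this document): a vectorised operation may simply be expanded, at
+issue time, into a sequence of element operations:
+
+    # "virtual parallelism": one vectorised ADD becomes vlen element ops
+    # in the instruction queue; a wider implementation would dispatch the
+    # very same list across parallel ALU lanes instead.
+    def issue_vector_add(queue, rd, rs1, rs2, vlen):
+        for i in range(vlen):
+            queue.append(("ADD", rd + i, rs1 + i, rs2 + i))
+
+    fifo = []
+    issue_vector_add(fifo, 16, 8, 12, 4)  # one "instruction", four ALU ops
+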
-The "SIMD considered harmful" trap of having huge complexity and extra
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
 instructions to deal with corner-cases is avoided, and implementors
 get to choose precisely where to focus and target the benefits of their
 implementation efforts, without "extra baggage".

-# V-Extension to Simple-V Comparative Analysis
-
-This section covers the ways in which Simple-V is comparable
-to, or more flexible than, V-Extension (V2.3-draft).  Also covered is
-one major weak-point (register files are fixed size, where V is
-arbitrary length), and how best to deal with that, should V be adapted
-to be on top of Simple-V.
-
-The first stages of this section go over each of the sections of V2.3-draft V
-where appropriate
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism.  Options are covered in the Appendix.

-## 17.3 Shape Encoding

-Simple-V's proposed means of expressing whether a register (from the
-standard integer or the standard floating-point file) is a scalar or
-a vector is to simply set the vector length to 1.  The instruction
-would however have to specify which register file (integer or FP) that
-the vector-length was to be applied to.
+### FMV, FNEG and FABS Instructions

-Extended shapes (2-D etc) would not be part of Simple-V at all.
+These are identical in form to C.MV, except covering floating-point
+register copying.  The same double-predication rules also apply.
+However when elwidth is not set to default the instruction is implicitly
+and automatically converted into a (vectorised) floating-point
+type-conversion operation of the appropriate size covering the source
+and destination register bitwidths.

-## 17.4 Representation Encoding
+(Note that FMV, FNEG and FABS are all actually pseudo-instructions)

-Simple-V would not have representation-encoding.  This is part of
-polymorphism, which is considered too complex to implement (TODO: confirm?)
+### FCVT Instructions

-## 17.5 Element Bitwidth
+These are again identical in form to C.MV, except that they cover
+floating-point to integer and integer to floating-point.  When element
+width in each vector is set to default, the instructions behave exactly
+as they are defined for standard RV (scalar) operations, except vectorised
+in exactly the same fashion as outlined in C.MV.

-This is directly equivalent to Simple-V's "Packed", and implies that
-integer (or floating-point) are divided down into vector-indexable
-chunks of size Bitwidth.
+However when the source or destination element width is not set to default,
+the opcode's explicit element widths are *over-ridden* to new definitions,
+and the opcode's element width is taken as indicative of the SIMD width
+(if applicable, i.e. if packed SIMD is requested) instead.

-In this way it becomes possible to have ADD effectively and implicitly
-turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where
-vector-length has been set to greater than 1, it becomes a "Packed"
-(SIMD) instruction.
+For example FCVT.S.L would normally be used to convert a 64-bit
+integer in register rs1 to a 64-bit floating-point number in rd.
+If however the source rs1 is set to be a vector, where elwidth is set to
+default/2 and "packed SIMD" is enabled, then the first 32 bits of
+rs1 are converted to a floating-point number to be stored in rd's
+first element, and the higher 32 bits are *also* converted to floating-point
+and stored in the second.  The 32-bit size comes from the fact that
+FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
+divide that by two it means that rs1's element width is to be taken as 32.

-It remains to be decided what should be done when RV32 / RV64 ADD (sized)
-opcodes are used.  One useful idea would be, on an RV64 system where
-a 32-bit-sized ADD was performed, to simply use the least significant
-32-bits of the register (exactly as is currently done) but at the same
-time to *respect the packed bitwidth as well*.
+Similar rules apply to the destination register.

-The extended encoding (Table 17.6) would not be part of Simple-V.
-
-## 17.6 Base Vector Extension Supported Types
+# Exceptions

-TODO: analyse. probably exactly the same.
+> What does an ADD of two different-sized vectors do in simple-V?

-## 17.7 Maximum Vector Element Width
+* if the two source operands are not the same length, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+  than the destination, throw an exception.

-No equivalent in Simple-V
+> And what about instructions like JALR?
+> What does jumping to a vector do?

-## 17.8 Vector Configuration Registers
+* Throw an exception.  Whether that actually results in spawning threads
+  as part of the trap-handling remains to be seen.

-TODO: analyse.
+# Under consideration

-## 17.9 Legal Vector Unit Configurations
+From the Chennai 2018 slides the following issues were raised.
+Efforts to analyse and answer these questions are below.
+
+* Should a future extra bank be included now?
+* How many Register and Predication CSRs should there be?
+  (and how many in RV32E)
+* How many in M-Mode (for doing context-switch)?
+* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
+* Can CLIP be done as a CSR (mode, like elwidth)?
+* SIMD saturation (etc.) also set as a mode?
+* Include src1/src2 predication on Comparison Ops?
+  (same arrangement as C.MV, with same flexibility/power)
+* 8/16-bit ops: is it worthwhile adding a "start offset"?
+  (a bit like misaligned addressing... for registers)
+  or just use predication to skip the start?

-TODO: analyse
+## Should a future (extra) bank be included (made mandatory)?

-## 17.10 Vector Unit CSRs
+The implications of expanding the *standard* register file from
+32 entries per bank to 64 per bank are quite an extensive architectural
+change.  It also has implications for context-switching.

-TODO: analyse
+Therefore, on balance, it is not recommended and certainly should
+not be made a *mandatory* requirement for the use of SV.
+SV's design ethos is to be minimally-disruptive for implementors to shoe-horn
+into an existing design.

-> Ok so this is an aspect of Simple-V that I hadn't thought through,
-> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section
-> 17.10 the CSRs are listed.  I note that there's some general-purpose
-> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i
-> don't precisely know what those are for.
+## How large should the Register and Predication CSR key-value stores be?

->  In the Simple-V proposal, *every* register in both the integer
-> register-file *and* the floating-point register-file would have at
-> least a 2-bit "data-width" CSR and probably something like an 8-bit
-> "vector-length" CSR (less in RV32E, by exactly one bit).
+This is something that definitely needs actual evaluation, and for
+code to be run and the results analysed.  At the time of writing
+(12jul2018) it is too early to tell.  An approximate best-guess
+however would be 16 entries.

->  What I *don't* know is whether that would be considered perfectly
-> reasonable or completely insane.  If it turns out that the proposed
-> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
-> adding somewhere in the region of 10 bits per register would be... okay?
-> I really don't honestly know.
+RV32E however is a special case, given that it is highly unlikely
+(but not outside the realm of possibility) that it would be used
+for performance reasons: instead, SV on RV32E would be used for
+reducing instruction count.  The number of CSR entries therefore
+has to be considered extremely carefully.

->  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
-> be multi-ported? No I don't believe they would.
+## How many CSR entries in M-Mode or S-Mode (for context-switching)?

-## 17.11 Maximum Vector Length (MVL)
+The minimum required CSR entries would be 1 for each register-bank:
+one for integer and one for floating-point.  However, as shown
+in the "Context Switch Example" section, for optimal efficiency
+(minimal instructions in a low-latency situation) the CSRs for
+the context-switch should be set up *and left alone*.

-Basically implicitly this is set to the maximum size of the register
-file multiplied by the number of 8-bit packed ints that can fit into
-a register (4 for RV32, 8 for RV64 and 16 for RV128).
+This means that it is not really a good idea to touch the CSRs
+used for context-switching in the M-Mode (or S-Mode) trap, so
+if there is ever demonstrated a need for vectors then there would
+need to be *at least* one more free.  However just one does not make
+much sense (as one alone only covers scalar-vector ops), so it is more
+likely that at least two extra would be needed.

-## !7.12 Vector Instruction Formats
+This is *in addition* - in the RV32E case - to the CSRs needed if an
+RV32E implementation happens also to support U/S/M modes.  This would
+be considered quite rare but not outside of the realm of possibility.

-No equivalent in Simple-V because *all* instructions of *all* Extensions
-are implicitly parallelised (and packed).
+Conclusion: this all needs careful analysis and future work.

-## 17.13 Polymorphic Vector Instructions
+## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?

-Polymorphism (implicit type-casting) is deliberately not supported
-in Simple-V.
+On balance it's a neat idea; however it does seem to be one where the
+benefits are not really clear.
It would however obviate the need for +an exception to be raised if the VL runs out of registers to put +things in (gets to x31, tries a non-existent x32 and fails), however +the "fly in the ointment" is that x0 is hard-coded to "zero". The +increment therefore would need to be double-stepped to skip over x0. +Some microarchitectures could run into difficulties (SIMD-like ones +in particular) so it needs a lot more thought. -## 17.14 Rapid Configuration Instructions +## Can CLIP be done as a CSR (mode, like elwidth) -TODO: analyse if this is useful to have an equivalent in Simple-V +RVV appears to be going this way. At the time of writing (12jun2018) +it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do +clip by way of exactly this method: setting a "clip mode" in a CSR. -## 17.15 Vector-Type-Change Instructions +No details are given however the most sensible thing to have would be +to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have +extra bits specifying the type of clipping to be carried out, on +a per-register basis. Other bits may be used for other purposes +(see SIMD saturation below) -TODO: analyse if this is useful to have an equivalent in Simple-V +## SIMD saturation (etc.) also set as a mode? -## 17.16 Vector Length +Similar to "CLIP" as an extension to the CSR key-value store, "saturate" +may also need extra details (what the saturation maximum is for example). -Has a direct corresponding equivalent. +## Include src1/src2 predication on Comparison Ops? -## 17.17 Predicated Execution +In the C.MV (and other ops - see "C.MV Instruction"), the decision +was taken, unlike in ADD (etc.) which are 3-operand ops, to use +*both* the src *and* dest predication masks to give an extremely +powerful and flexible instruction that covers a huge number of +"traditional" vector opcodes. -Predicated Execution is another name for "masking" or "tagging". Masked -(or tagged) implies that there is a bit field which is indexed, and each -bit associated with the corresponding indexed offset register within -the "Vector". If the tag / mask bit is 1, when a parallel operation is -issued, the indexed element of the vector has the operation carried out. -However if the tag / mask bit is *zero*, that particular indexed element -of the vector does *not* have the requested operation carried out. +The natural question therefore to ask is: where else could this +flexibility be deployed? What about comparison operations? -In V2.3-draft V, there is a significant (not recommended) difference: -the zero-tagged elements are *set to zero*. This loses a *significant* -advantage of mask / tagging, particularly if the entire mask register -is itself a general-purpose register, as that general-purpose register -can be inverted, shifted, and'ed, or'ed and so on. In other words -it becomes possible, especially if Carry/Overflow from each vector -operation is also accessible, to do conditional (step-by-step) vector -operations including things like turn vectors into 1024-bit or greater -operands with very few instructions, by treating the "carry" from -one instruction as a way to do "Conditional add of 1 to the register -next door". If V2.3-draft V sets zero-tagged elements to zero, such -extremely powerful techniques are simply not possible. +Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst +predicated comparison operations are actually a *three* operand +instruction: + + regs[pred] |= 1<< (cmp(regs[src1], regs[src2]) ? 
1 : 0) -It is noted that there is no mention of an equivalent to BEXT (element -skipping) which would be particularly fascinating and powerful to have. -In this mode, the "mask" would skip elements where its mask bit was zero -in either the source or the destination operand. +Therefore at first glance it does not make sense to use src1 and src2 +predication masks, as it breaks the rule of 3-operand instructions +to use the *destination* predication register. -Lots to be discussed. +In this case however, the destination *is* a predication register +as opposed to being a predication mask that is applied *to* the +(vectorised) operation, element-at-a-time on src1 and src2. -## 17.18 Vector Load/Store Instructions +Thus the question is directly inter-related to whether the modification +of the predication mask should *itself* be predicated. -These may not have a direct equivalent in Simple-V, except if mask/tagging -is to be deployed. +It is quite complex, in other words, and needs careful consideration. -To be discussed. +## 8/16-bit ops is it worthwhile adding a "start offset"? -## 17.19 Vector Register Gather +The idea here is to make it possible, particularly in a "Packed SIMD" +case, to be able to avoid doing unaligned Load/Store operations +by specifying that operations, instead of being carried out +element-for-element, are offset by a fixed amount *even* in 8 and 16-bit +element Packed SIMD cases. -TODO +For example rather than take 2 32-bit registers divided into 4 8-bit +elements and have them ADDed element-for-element as follows: -## TODO, sort + r3[0] = add r4[0], r6[0] + r3[1] = add r4[1], r6[1] + r3[2] = add r4[2], r6[2] + r3[3] = add r4[3], r6[3] -> However, there are also several features that go beyond simply attaching VL -> to a scalar operation and are crucial to being able to vectorize a lot of -> code. To name a few: -> - Conditional execution (i.e., predicated operations) -> - Inter-lane data movement (e.g. SLIDE, SELECT) -> - Reductions (e.g., VADD with a scalar destination) - - Ok so the Conditional and also the Reductions is one of the reasons - why as part of SimpleV / variable-SIMD / parallelism (gah gotta think - of a decent name) i proposed that it be implemented as "if you say r0 - is to be a vector / SIMD that means operations actually take place on - r0,r1,r2... r(N-1)". - - Consequently any parallel operation could be paused (or... more - specifically: vectors disabled by resetting it back to a default / - scalar / vector-length=1) yet the results would actually be in the - *main register file* (integer or float) and so anything that wasn't - possible to easily do in "simple" parallel terms could be done *out* - of parallel "mode" instead. - - I do appreciate that the above does imply that there is a limit to the - length that SimpleV (whatever) can be parallelised, namely that you - run out of registers! my thought there was, "leave space for the main - V-Ext proposal to extend it to the length that V currently supports". - Honestly i had not thought through precisely how that would work. - - Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that, - it reminds me of the discussion with Clifford on bit-manipulation - (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if - applied "globally and outside of V and P" SLIDE and SELECT might become - an extremely powerful way to do fast memory copy and reordering [2[. 
- - However I haven't quite got my head round how that would work: i am - used to the concept of register "tags" (the modern term is "masks") - and i *think* if "masks" were applied to a Simple-V-enhanced LOAD / - STORE you would get the exact same thing as SELECT. - - SLIDE you could do simply by setting say r0 vector-length to say 16 - (meaning that if referred to in any operation it would be an implicit - parallel operation on *all* registers r0 through r15), and temporarily - set say.... r7 vector-length to say... 5. Do a LOAD on r7 and it would - implicitly mean "load from memory into r7 through r11". Then you go - back and do an operation on r0 and ta-daa, you're actually doing an - operation on a SLID {SLIDED?) vector. - - The advantage of Simple-V (whatever) over V would be that you could - actually do *operations* in the middle of vectors (not just SLIDEs) - simply by (as above) setting r0 vector-length to 16 and r7 vector-length - to 5. There would be nothing preventing you from doing an ADD on r0 - (which meant do an ADD on r0 through r15) followed *immediately in the - next instruction with no setup cost* a MUL on r7 (which actually meant - "do a parallel MUL on r7 through r11"). - - btw it's worth mentioning that you'd get scalar-vector and vector-scalar - implicitly by having one of the source register be vector-length 1 - (the default) and one being N > 1. but without having special opcodes - to do it. i *believe* (or more like "logically infer or deduce" as - i haven't got access to the spec) that that would result in a further - opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V. - - Also, Reduction *might* be possible by specifying that the destination be - a scalar (vector-length=1) whilst the source be a vector. However... it - would be an awful lot of work to go through *every single instruction* - in *every* Extension, working out which ones could be parallelised (ADD, - MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth - the effort? maybe. Would it result in huge complexity? probably. - Could an implementor just go "I ain't doing *that* as parallel! - let's make it virtual-parallelism (sequential reduction) instead"? - absolutely. So, now that I think it through, Simple-V (whatever) - covers Reduction as well. huh, that's a surprise. - - -> - Vector-length speculation (making it possible to vectorize some loops with -> unknown trip count) - I don't think this part of the proposal is written -> down yet. - - Now that _is_ an interesting concept. A little scary, i imagine, with - the possibility of putting a processor into a hard infinite execution - loop... :) - - -> Also, note the vector ISA consumes relatively little opcode space (all the -> arithmetic fits in 7/8ths of a major opcode). This is mainly because data -> type and size is a function of runtime configuration, rather than of opcode. - - yes. i love that aspect of V, i am a huge fan of polymorphism [1] - which is why i am keen to advocate that the same runtime principle be - extended to the rest of the RISC-V ISA [3] - - Yikes that's a lot. I'm going to need to pull this into the wiki to - make sure it's not lost. - -[1] inherent data type conversion: 25 years ago i designed a hypothetical -hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit -(escape-extended) opcodes and 2-bit (escape-extended) operands that -only required a fixed 8-bit instruction length. that relied heavily -on polymorphism and runtime size configurations as well. 
At the time -I thought it would have meant one HELL of a lot of CSRs... but then I -met RISC-V and was cured instantly of that delusion^Wmisapprehension :) - -[2] Interestingly if you then also add in the other aspect of Simple-V -(the data-size, which is effectively functionally orthogonal / identical -to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE -operations become byte / half-word / word augmenters of B-Ext's proposed -"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored -LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it -would get really REALLY interesting would be masked-packed-vectored -B-Ext BGS instructions. I can't even get my head fully round that, -which is a good sign that the combination would be *really* powerful :) - -[3] ok sadly maybe not the polymorphism, it's too complicated and I -think would be much too hard for implementors to easily "slide in" to an -existing non-Simple-V implementation.  i say that despite really *really* -wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some -fashion, for optimising 3D Graphics.  *sigh*. - -## TODO: analyse, auto-increment on unit-stride and constant-stride - -so i thought about that for a day or so, and wondered if it would be -possible to propose a variant of zero-overhead loop that included -auto-incrementing the two address registers a2 and a3, as well as -providing a means to interact between the zero-overhead loop and the -vsetvl instruction. a sort-of pseudo-assembly of that would look like: - -> # a2 to be auto-incremented by t0*4 -> zero-overhead-set-auto-increment a2, t0, 4 -> # a2 to be auto-incremented by t0*4 -> zero-overhead-set-auto-increment a3, t0, 4 -> zero-overhead-set-loop-terminator-condition a0 zero -> zero-overhead-set-start-end stripmine, stripmine+endoffset -> stripmine: -> vsetvl t0,a0 -> vlw v0, a2 -> vlw v1, a3 -> vfma v1, a1, v0, v1 -> vsw v1, a3 -> sub a0, a0, t0 ->stripmine+endoffset: - -the question is: would something like this even be desirable? it's a -variant of auto-increment [1]. last time i saw any hint of auto-increment -register opcodes was in the 1980s... 68000 if i recall correctly... yep -see [1] - -[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html - -Reply: - -Another option for auto-increment is for vector-memory-access instructions -to support post-increment addressing for unit-stride and constant-stride -modes. This can be implemented by the scalar unit passing the operation -to the vector unit while itself executing an appropriate multiply-and-add -to produce the incremented address. This does *not* require additional -ports on the scalar register file, unlike scalar post-increment addressing -modes. - -## TODO: instructions (based on Hwacha) V-Ext duplication analysis - -This is partly speculative due to lack of access to an up-to-date -V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing). However -basin an analysis instead on Hwacha, a cursory examination shows over -an **85%** duplication of V-Ext operand-related instructions when -compared to Simple-V on a standard RG64G base. Even Vector Fetch -is analogous to "zero-overhead loop". - -Exceptions are: - -* Vector Indexed Memory Instructions (non-contiguous) -* Vector Atomic Memory Instructions. -* Some of the Vector Misc ops: VEIDX, VFIRST, VCLASS, VPOPC - and potentially more. 
-* Consensual Jump - -Table of RV32V Instructions - -| RV32V | RV Equivalent (FP) | RV Equivalent (Int) | -| ----- | --- | | -| VADD | FADD | ADD | -| VSUB | FSUB | SUB | -| VSL | | | -| VSR | | | -| VAND | | AND | -| VOR | | OR | -| VXOR | | XOR | -| VSEQ | FEQ | BEQ | -| VSNE | !FEQ | BNE | -| VSLT | FLT | BLT | -| VSGE | !FLE | BGE | -| VCLIP | | | -| VCVT | | | -| VMPOP | | | -| VMFIRST | | | -| VEXTRACT | | | -| VINSERT | | | -| VMERGE | | | -| VSELECT | | | -| VSLIDE | | | -| VDIV | FDIV | DIV | -| VREM | | REM | -| VMUL | FMUL | MUL | -| VMULH | | | -| VMIN | FMIN | | -| VMAX | FMUX | | -| VSGNJ | FSGNJ | | -| VSGNJN | FSGNJN | | -| VSGNJX | FSNGJX | | -| VSQRT | FSQRT | | -| VCLASS | | | -| VPOPC | | | -| VADDI | | ADDI | -| VSLI | | SLI | -| VSRI | | SRI | -| VANDI | | ANDI | -| VORI | | ORI | -| VXORI | | XORI | -| VCLIPI | | | -| VMADD | FMADD | | -| VMSUB | FMSUB | | -| VNMADD | FNMSUB | | -| VNMSUB | FNMADD | | -| VLD | FLD | LD | -| VLDS | | LW | -| VLDX | | LWU | -| VST | FST | ST | -| VSTS | | | -| VSTX | | | -| VAMOSWAP | | AMOSWAP | -| VAMOADD | | AMOADD | -| VAMOAND | | AMOAND | -| VAMOOR | | AMOOR | -| VAMOXOR | | AMOXOR | -| VAMOMIN | | AMOMIN | -| VAMOMAX | | AMOMAX | - -## TODO: sort - -> I suspect that the "hardware loop" in question is actually a zero-overhead -> loop unit that diverts execution from address X to address Y if a certain -> condition is met. - - not quite.  The zero-overhead loop unit interestingly would be at -an [independent] level above vector-length.  The distinctions are -as follows: - -* Vector-length issues *virtual* instructions where the register - operands are *specifically* altered (to cover a range of registers), - whereas zero-overhead loops *specifically* do *NOT* alter the operands - in *ANY* way. - -* Vector-length-driven "virtual" instructions are driven by *one* - and *only* one instruction (whether it be a LOAD, STORE, or pure - one/two/three-operand opcode) whereas zero-overhead loop units - specifically apply to *multiple* instructions. - -Where vector-length-driven "virtual" instructions might get conceptually -blurred with zero-overhead loops is LOAD / STORE.  In the case of LOAD / -STORE, to actually be useful, vector-length-driven LOAD / STORE should -increment the LOAD / STORE memory address to correspondingly match the -increment in the register bank.  example: - -* set vector-length for r0 to 4 -* issue RV32 LOAD from addr 0x1230 to r0 - -translates effectively to: - -* RV32 LOAD from addr 0x1230 to r0 -* ... -* ... 
-* RV32 LOAD from addr 0x123B to r3 - -# P-Ext ISA - -## 16-bit Arithmetic - -| Mnemonic | 16-bit Instruction | Simple-V Equivalent | -| ------------------ | ------------------------- | ------------------- | -| ADD16 rt, ra, rb | add | RV ADD (bitwidth=16) | -| RADD16 rt, ra, rb | Signed Halving add | | -| URADD16 rt, ra, rb | Unsigned Halving add | | -| KADD16 rt, ra, rb | Signed Saturating add | | -| UKADD16 rt, ra, rb | Unsigned Saturating add | | -| SUB16 rt, ra, rb | sub | RV SUB (bitwidth=16) | -| RSUB16 rt, ra, rb | Signed Halving sub | | -| URSUB16 rt, ra, rb | Unsigned Halving sub | | -| KSUB16 rt, ra, rb | Signed Saturating sub | | -| UKSUB16 rt, ra, rb | Unsigned Saturating sub | | -| CRAS16 rt, ra, rb | Cross Add & Sub | | -| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub | | -| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub | | -| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub | | -| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | | -| CRSA16 rt, ra, rb | Cross Sub & Add | | -| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add | | -| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add | | -| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add | | -| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | | - -## 8-bit Arithmetic - -| Mnemonic | 16-bit Instruction | Simple-V Equivalent | -| ------------------ | ------------------------- | ------------------- | -| ADD8 rt, ra, rb | add | RV ADD (bitwidth=8)| -| RADD8 rt, ra, rb | Signed Halving add | | -| URADD8 rt, ra, rb | Unsigned Halving add | | -| KADD8 rt, ra, rb | Signed Saturating add | | -| UKADD8 rt, ra, rb | Unsigned Saturating add | | -| SUB8 rt, ra, rb | sub | RV SUB (bitwidth=8)| -| RSUB8 rt, ra, rb | Signed Halving sub | | -| URSUB8 rt, ra, rb | Unsigned Halving sub | | +an offset of 1 would result in four operations as follows, instead: -# Exceptions + r3[0] = add r4[1], r6[0] + r3[1] = add r4[2], r6[1] + r3[2] = add r4[3], r6[2] + r3[3] = add r5[0], r6[3] -> What does an ADD of two different-sized vectors do in simple-V? +In non-packed-SIMD mode there is no benefit at all, as a vector may +be created using a different CSR that has the offset built-in. So this +leaves just the packed-SIMD case to consider. -* if the two source operands are not the same, throw an exception. -* if the destination operand is also a vector, and the source is longer - than the destination, throw an exception. +Two ways in which this could be implemented / emulated (without special +hardware): -> And what about instructions like JALR?  -> What does jumping to a vector do? +* bit-manipulation that shuffles the data along by one byte (or one word) + either prior to or as part of the operation requiring the offset. +* just use an unaligned Load/Store sequence, even if there are performance + penalties for doing so. -* Throw an exception. Whether that actually results in spawning threads - as part of the trap-handling remains to be seen. +The question then is whether the performance hit is worth the extra hardware +involving byte-shuffling/shifting the data by an arbitrary offset. On +balance given that there are two reasonable instruction-based options, the +hardware-offset option should be left out for the initial version of SV, +with the option to consider it in an "advanced" version of the specification. 
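+For reference, a python-style sketch of the first (bit-manipulation) option
+- an assumption about one possible instruction sequence, not a normative
+requirement - showing an offset of 1 on packed 8-bit elements, implemented
+as a shift-and-merge of the two source registers followed by a standard
+element-for-element add:
+
+    # emulate a "start offset" of 1: rs1 is held in r4 and r5 (4 x 8-bit
+    # elements each), so the shuffled operand is r4[1..3] followed by r5[0]
+    def packed_add8_offset1(r4, r5, r6):
+        shuffled = ((r4 >> 8) | ((r5 & 0xff) << 24)) & 0xffffffff
+        result = 0
+        for i in range(4):  # 8-bit adds: no carry across element boundaries
+            a = (shuffled >> (8 * i)) & 0xff
+            b = (r6 >> (8 * i)) & 0xff
+            result |= ((a + b) & 0xff) << (8 * i)
+        return result
+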
# Impementing V on top of Simple-V -* Number of Offset CSRs extends from 2 -* Extra register file: vector-file -* Setup of Vector length and bitwidth CSRs now can specify vector-file - as well as integer or float file. -* TODO +With Simple-V converting the original RVV draft concept-for-concept +from explicit opcodes to implicit overloading of existing RV Standard +Extensions, certain features were (deliberately) excluded that need +to be added back in for RVV to reach its full potential. This is +made slightly complicated by the fact that RVV itself has two +levels: Base and reserved future functionality. + +* Representation Encoding is entirely left out of Simple-V in favour of + implicitly taking the exact (explicit) meaning from RV Standard Extensions. +* VCLIP and VCLIPI do not have corresponding RV Standard Extension + opcodes (and are the only such operations). +* Extended Element bitwidths (1 through to 24576 bits) were left out + of Simple-V as, again, there is no corresponding RV Standard Extension + that covers anything even below 32-bit operands. +* Polymorphism was entirely left out of Simple-V due to the inherent + complexity of automatic type-conversion. +* Vector Register files were specifically left out of Simple-V in favour + of fitting on top of the integer and floating-point files. An + "RVV re-retro-fit" needs to be able to mark (implicitly marked) + registers as being actually in a separate *vector* register file. +* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector + register file size is 5 bits (32 registers), whilst the "Extended" + variant of RVV specifies 8 bits (256 registers) and has yet to + be published. +* One big difference: Sections 17.12 and 17.17, there are only two possible + predication registers in RVV "Base". Through the "indirect" method, + Simple-V provides a key-value CSR table that allows (arbitrarily) + up to 16 (TBD) of either the floating-point or integer registers to + be marked as "predicated" (key), and if so, which integer register to + use as the predication mask (value). + +**TODO** # Implementing P (renamed to DSP) on top of Simple-V @@ -953,9 +614,492 @@ translates effectively to: (caveat: anything not specified drops through to software-emulation / traps) * TODO -# Analysis of CSR decoding on latency +# Appendix + +## V-Extension to Simple-V Comparative Analysis + +This section has been moved to its own page [[v_comparative_analysis]] + +## P-Ext ISA + +This section has been moved to its own page [[p_comparative_analysis]] + +## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals + +This section compares the various parallelism proposals as they stand, +including traditional SIMD, in terms of features, ease of implementation, +complexity, flexibility, and die area. + +### [[harmonised_rvv_rvp]] + +This is an interesting proposal under development to retro-fit the AndesStar +P-Ext into V-Ext. + +### [[alt_rvp]] + +Primary benefit of Alt-RVP is the simplicity with which parallelism +may be introduced (effective multiplication of regfiles and associated ALUs). + +* plus: the simplicity of the lanes (combined with the regularity of + allocating identical opcodes multiple independent registers) meaning + that SRAM or 2R1W can be used for entire regfile (potentially). 
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly and directly specified in the instruction and
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit and
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+  not usable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+  but with the down-side that they're an all-or-nothing part of the Extension.
+  No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+  parallelisation can be carried out, followed by further parallel Lane-based
+  work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+  is to drop data into memory and immediately back in again (like MMX).
+
+### Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware. It's an API in effect that's
+designed to be slotted into an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity (if full parallelism is to be exploited)
+  of having to use register renames, OoO, VLIW, register file cacheing,
+  all of which has been done before but is a pain
+* plus: transparent re-use of existing opcodes as-is, just indirectly
+  saying "this register's now a vector", which
+* plus: means that future instructions also get to be inherently
+  parallelised because there are no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+  a CSR register to indicate vector length, a separate one to indicate
+  that it is a predicate register and so on) means a little more setup
+  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+  approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+  operations not suited to parallelisation may be carried out interleaved
+  between parallelised instructions *without* requiring data to be dropped
+  down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+  files means that huge parallel workloads would use up considerable
+  chunks of the register file. However in the case of RV64 and 32-bit
+  operations, that effectively means 64 slots are available for parallel
+  operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+  be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel): a consistent "API" regardless of actual internal
+  parallelism. Even an in-order single-issue implementation with a single
+  ALU would still appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area). At worst it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file cacheing schemes.
+
+### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+  streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+  (similar to Alt-RVP) may be used as an implementation detail,
+  using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+  really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+  do not gain parallelism, resulting in prolific duplication of functionality
+  inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+  using the standard integer or FP regfile) an entire vector must be
+  transferred out to memory, into standard regfiles, then back to memory,
+  then back to the vector unit: potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is; it may
+  be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+  vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+  implementation time and die area, meaning that adoption is likely only
+  to be in high-performance specialist supercomputing (where it will
+  be absolutely superb).
+
+### Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement
+and how easy it is to get good performance. Unfortunately that makes it
+quite seductive...
+
+* plus: really straightforward, the ALU basically does several packed
+  operations at once. Parallelism is inherent at the ALU, making the
+  addition of SIMD-style parallelism an easy decision that has zero
+  significant impact on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+  therefore result in superb throughput, easily achieved even with a very
+  simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+  increase instruction count on what would otherwise be a "simple loop",
+  should the number of elements in an array not happen to exactly match
+  the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+  are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+  are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width) in another: an at least O(N^2) and
+  quite probably O(N^3) ISA proliferation that often results in several
+  thousand separate instructions, all requiring separate and distinct
+  corner-case algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+  8, 16, 32 or 64-bit reordering is built into the SIMD instruction.
+  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+  four separate and distinct instructions: one for (r1:low r2:high),
+  one for (r1:high r2:low), one for (r1:high r2:high) and one for
+  (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+  between operand and result bit-widths. In combination with high/low
+  proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+  that allow control over individual elements within the SIMD block.
+
+## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
+the question is asked: "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be?"
+
+### [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+  a SIMD architecture where the ALU becomes responsible for the parallelism,
+  Alt-RVP ALUs would likewise be so responsible... with *additional*
+  (lane-based) parallelism on top.
+* Thus at least one dimension of SIMD's combinatorial ISA proliferation is
+  avoided (architectural upgrades introducing 128-bit then 256-bit then
+  512-bit variants of the exact same 64-bit SIMD block).
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+  of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+  be able to subdivide the bits of each register lane (columns) down into
+  arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+  16-bit or even 8-bit, effectively dividing the register file into
+  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
+  "swapping" instructions were then introduced, some of the disadvantages
+  of SIMD could be mitigated.
+
+### RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+  parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+  DSPs with a focus on Multimedia (Audio, Video and Image processing),
+  RVV's primary focus appears to be on Supercomputing: optimisation of
+  mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+  into a SIMD instruction requires an equivalent to be added to the
+  RVV Extension, if one does not exist. Given the specialist nature of
+  some SIMD instructions (8-bit or 16-bit saturated or halving add),
+  this seems extremely unlikely to occur, even if the
+  implementation overhead of RVV were acceptable (compared to
+  normal SIMD/DSP-style single-issue in-order simplicity).
+
+### Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+  topologically transplant every single instruction from RVV (as
+  designed) into Simple-V equivalents, with *zero loss of functionality
+  or capability*.
+* With the "parallelism" abstracted out, the basic primitives of a
+  hypothetical SIMD-less "DSP" Extension (non-parallelised 8, 16 or
+  32-bit SIMD operations) inherently *become* parallel, automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
+  four 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, just as
+  with RVV the corner-cases and ISA proliferation of SIMD are avoided.
+* Whilst not entirely finalised, registers are expected to be
+  capable of being subdivided down to an implementor-chosen bitwidth
+  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+  and r1[7..0], or just r1[31..16] r1[15..0]), where implementors can
+  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+  ALUs that perform twin 8-bit operations as they see fit, or anything
+  else including no subdivisions at all.
+* Even though implementors have that choice, even to have full 64-bit
+  (with RV64) SIMD, they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+  can be done by applying *Parallelised* Bit-manipulation operations
+  followed by parallelised *straight* versions of element-to-element
+  arithmetic operations, even if the bit-manipulation operations require
+  changing the bitwidth of the "vectors" to do so. Predication can
+  be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD (massive duplication of
+  identical functions over time as an architecture evolves from 32-bit
+  wide SIMD all the way up to 512-bit) is avoided with Simple-V, through
+  vector-style parallelism being dropped on top of 8-bit or 16-bit
+  operations, all the while keeping a consistent ISA-level "API" irrespective
+  of implementor design choices (or indeed actual implementations).
+
+### Example Instruction translation
+
+The instruction "ADD r7 r4 r4" would result in three instructions being
+generated and placed into the FIFO, with r7 and r4 marked as "vectorised":
+
+* ADD r7 r4 r4
+* ADD r8 r5 r5
+* ADD r9 r6 r6
+
+The instruction "ADD r7 r4 r1" would likewise result in three instructions
+being generated and placed into the FIFO, with r7 and r1 marked as
+"vectorised" whilst r4 is not:
+
+* ADD r7 r4 r1
+* ADD r8 r4 r2
+* ADD r9 r4 r3
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    function op_add(rd, rs1, rs2) # add not VADD!
+      int i, id=0, irs1=0, irs2=0;
+      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
+      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
+      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
+      predval = get_pred_val(FALSE, rd);
+      for (i = 0; i < VL; i++)
+        if (predval & 1<<i) # predication uses intregs
+           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
+           if (int_vec[rd ].isvector)  { id += 1; }
+           if (int_vec[rs1].isvector)  { irs1 += 1; }
+           if (int_vec[rs2].isvector)  { irs2 += 1; }
+
+## Branch Instructions
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However, what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say: "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
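+
+As an illustration, here is a minimal sketch of a vectorised FEQ
+interpreted in this way, in the same pseudocode style as op_add above
+(fp_vec, freg and ireg are assumed helper tables analogous to int_vec;
+this is illustrative, not normative):
+
+    function op_feq(rd, rs1, rs2) # rd tagged as a predication register
+      int i, irs1=0, irs2=0;
+      rs1 = fp_vec[rs1].isvector ? fp_vec[rs1].regidx : rs1;
+      rs2 = fp_vec[rs2].isvector ? fp_vec[rs2].regidx : rs2;
+      for (i = 0; i < VL; i++)
+        # set or clear bit i of the (scalar) integer destination,
+        # instead of writing 1 or 0 to the whole register
+        if (freg[rs1+irs1] == freg[rs2+irs2])
+           ireg[rd] |=  (1<<i);
+        else
+           ireg[rd] &= ~(1<<i);
+        if (fp_vec[rs1].isvector)  { irs1 += 1; }
+        if (fp_vec[rs2].isvector)  { irs2 += 1; }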
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table  data="""
+31      | 30 ... 25 | 24..20 | 19..15 | 14..12 | 11..8    | 7       | 6..0   |
+imm[12] | imm[10:5] | rs2    | rs1    | funct3 | imm[4:1] | imm[11] | opcode |
+1       | 6         | 5      | 5      | 3      | 4        | 1       | 7      |
+offset[12,10:5]    || src2   | src1   | BEQ    | offset[11,4:1]    || BRANCH |
+"""]]
+
+This would become:
+
+[[!table  data="""
+31      | 30 ... 25 | 24..20 | 19..15 | 14..12 | 11..8    | 7       | 6..0   |
+imm[12] | imm[10:5] | rs2    | rs1    | funct3 | imm[4:1] | imm[11] | opcode |
+1       | 6         | 5      | 5      | 3      | 4        | 1       | 7      |
+reserved          || src2    | src1   | BEQ    | predicate rs3     || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to add in not only a bit field to distinguish
+between floating-point compare and integer compare, but also a second
+source register, and to use some of the bits as a predication target
+as well.
-
+[[!table  data="""
+15..13 | 12 ... 10     | 9..7 | 6 ... 2           | 1..0 |
+funct3 | imm           | rs10 | imm               | op   |
+3      | 3             | 3    | 5                 | 2    |
+C.BEQZ | offset[8,4:3] | src  | offset[7:6,2:1,5] | C1   |
+"""]]
+
+Retro-fitted, C.BEQZ would now use the CS format:
+
+[[!table  data="""
+15..13 | 12 ... 10 | 9..7 | 6..5  | 4..2 | 1..0 |
+funct3 | imm       | rs10 | imm   |      | op   |
+3      | 3         | 3    | 2     | 3    | 2    |
+C.BEQZ | pred rs3  | src1 | I/F B | src2 | C1   |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float", including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators. In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+
+## Register reordering
+
+### Register File
+
+| Reg Num | Bits    |
+| ------- | ------- |
+| r0      | (31..0) |
+| r1      | (31..0) |
+| r2      | (31..0) |
+| r3      | (31..0) |
+| r4      | (31..0) |
+| r5      | (31..0) |
+| r6      | (31..0) |
+| r7      | (31..0) |
+| ..      | (31..0) |
+| r31     | (31..0) |
+
+### Vectorised CSR
+
+May not be an actual CSR: may be generated from the Vector Length CSR
+below (bit N set if CSRvectorlen[rN] is non-zero), as a single bit per
+register is less burdensome on the instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
+
+### Vector Length CSR
+
+| Reg Num | VLen (3..0) |
+| ------- | ----------- |
+| r0      | 2           |
+| r1      | 0           |
+| r2      | 1           |
+| r3      | 1           |
+| r4      | 3           |
+| r5      | 0           |
+| r6      | 0           |
+| r7      | 1           |
+
+### Virtual Register Reordering
+
+This example assumes the above Vector Length CSR table:
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0      | (31..0)  | (31..0)  |          |
+| r2      | (31..0)  |          |          |
+| r3      | (31..0)  |          |          |
+| r4      | (31..0)  | (31..0)  | (31..0)  |
+| r7      | (31..0)  |          |          |
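+
+The reordering above amounts to a trivial remap at instruction-issue
+time. A minimal sketch in pseudocode (the function name is illustrative,
+not normative):
+
+    function element_reg(rnum, i): # i = element index, 0 <= i < VL
+        # contiguous allocation: element i of "vector" rN is register rN+i
+        return rnum + i
+
+So with CSRvectorlen[4] = 3 as in the table, an operation on r4 touches
+element_reg(4,0) = r4, element_reg(4,1) = r5 and element_reg(4,2) = r6.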
+
+### Bitwidth Virtual Register Reordering
+
+This example goes a little further and illustrates the effect of a
+bitwidth CSR having been set on a register. Preconditions:
+
+* RV32 assumed
+* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
+* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
+* vsetl rs1, 5 # set the vector length to 5
+
+This is interpreted as follows:
+
+* Given that the context is RV32, ELEN=32.
+* With ELEN=32 and bitwidth=16, the number of SIMD elements per register is 2.
+* Therefore the actual vector length is up to *six* elements.
+* However vsetl sets a length of 5, therefore the last "element" is skipped.
+
+So an operation that uses r2 as a source (or destination) is carried
+out as follows:
+
+* 16-bit operation on r2(15..0) - vector element index 0
+* 16-bit operation on r2(31..16) - vector element index 1
+* 16-bit operation on r3(15..0) - vector element index 2
+* 16-bit operation on r3(31..16) - vector element index 3
+* 16-bit operation on r4(15..0) - vector element index 4
+* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
+
+Predication has been left out of the above example for simplicity; however,
+predication is ANDed with the latter stages (vsetl not equal to maximum
+capacity).
+
+Note also that it is entirely an implementor's choice as to whether to have
+actual separate ALUs down to the minimum bitwidth, or whether to have something
+more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
+operations carried out 32-bits at a time is perfectly acceptable, as are
+8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
+Regardless of the internal parallelism choice, *predication must
+still be respected*, making Simple-V in effect the "consistent public API".
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default  | XLEN/8     |
+| 001 | 8        | 1          |
+| 010 | 16       | 2          |
+| 011 | 32       | 4          |
+| 100 | 64       | 8          |
+| 101 | 128      | 16         |
+| 110 | rsvd     | rsvd       |
+| 111 | rsvd     | rsvd       |
+
+Pseudocode for vector length, taking the CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx  = floor(i / simdmult)       # integer-div rounded down
+        elidx   = i % simdmult              # element index within register
+        elwidth = bytesperreg * 8           # element width in bits
+        return rnum + regidx,               # actual real register
+               elidx * elwidth,             # low bit
+               elidx * elwidth + elwidth-1, # high bit
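+
+As a worked example, applying the pseudocode above to the earlier
+preconditions (RV32, vew = 010 i.e. 16-bit, CSRintvlength[2] = 3):
+
+    bytesperreg = bytestable[010] = 2
+    simdmult    = (32/8) / 2 = 2    # two 16-bit elements per register
+    vlen        = 3 * 2 = 6         # matches "up to *six* elements" above
+
+    regoffs(2, 4) = (2 + floor(4/2), (4%2)*16, (4%2)*16 + 15)
+                  = (r4, 0, 15)     # i.e. r4(15..0), element index 4
+
+which agrees with the element list given in the example.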
+### Insights
+
+SIMD register file splitting is still to be considered. For RV64, the
+benefits of doubling (quadrupling in the case of Half-Precision IEEE754 FP)
+the apparent size of the floating point register file to 64 (128 in the
+case of HP) seem pretty clear and worth the complexity.
+
+With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
+are done on 64-bit registers, it's not so conceptually difficult. It may
+even be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples. The implementation is therefore hidden through register
+renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated more quickly (or in a more compact fashion that is less heavy
+on caches). Interestingly, we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+## Analysis of CSR decoding on latency

 It could indeed have been logically deduced (or expected) that there
 would be additional decode latency in this proposal, because if
@@ -1031,7 +1175,7 @@ So the question boils down to:

 Whilst the above may seem to be severe minuses, there are some strong
 pluses:

-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
 * Smaller reduction of P's opcode space: around 10%.
 * The potential to use Compressed instructions in both Vector and SIMD due
   to the overloading of register meaning (implicit vectorisation,
@@ -1043,6 +1187,530 @@ pluses:
   parallel ALUs) is only equal to one ("virtual" parallelism), or is
   greater than one, should not be underestimated.

+## Reducing Register Bank porting
+
+This looks quite reasonable.
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoiding thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+## Overflow registers in combination with predication
+
+**TODO**: propose overflow registers be actually one of the integer regs
+(flowing to multiple regs).
+
+**TODO**: propose "mask" (predication) registers likewise. The combination
+with standard RV instructions and overflow registers is extremely powerful:
+see the Aspex ASP.
+
+When integer overflow is stored in an easily-accessible bit (or another
+register), parallelisation turns this into a group of bits which can
+potentially be interacted with in predication, in interesting and powerful
+ways. For example, by taking the integer-overflow result as a predication
+field and shifting it by one, a predicated vectorised "add one" can emulate
+"carry" on arbitrary (unlimited) length addition.
+
+However, despite RVV having made room for floating-point exceptions, neither
+RVV nor base RV have taken integer-overflow (carry) into account, which
+makes proposing it quite challenging given that the relevant (Base) RV
+sections are frozen. Consequently it makes sense to forgo this feature.
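+
+Purely as an illustrative sketch of what is being forgone (the
+overflow-capture mechanism shown does not exist in the proposal): assuming
+a hypothetical CSR arrangement in which a vectorised ADD deposits its
+per-element carry-out bits into integer register x9, one ripple step of an
+arbitrary-length add of vectors at x16 and x24 into x8 might look like:
+
+    add  x8, x16, x24   # x8/x16/x24 vectorised; x9 collects carry bits
+    slli x9, x9, 1      # shift carries up by one element index
+    addi x8, x8, 1      # vectorised "add one", predicated on x9
+
+A full arbitrary-length addition would need to repeat the last two steps
+until the carry predicate becomes all-zeros, since the predicated
+increment can itself overflow.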
+
+## Context Switch Example
+
+An unusual side-effect of Simple-V mapping onto the standard register files
+is that LOAD-multiple and STORE-multiple are accidentally available, as long
+as it is acceptable that the register(s) to be loaded/stored are contiguous
+(per instruction). An additional accidental benefit is that Compressed LD/ST
+may also be used.
+
+To illustrate how this works, here is some example code from FreeRTOS
+(GPLv2 licensed, portasm.S):
+
+    /* Macro for saving task context */
+    .macro portSAVE_CONTEXT
+    .global pxCurrentTCB
+    /* make room in stack */
+    addi sp, sp, -REGBYTES * 32
+
+    /* Save Context */
+    STORE x1, 0x0(sp)
+    STORE x2, 1 * REGBYTES(sp)
+    STORE x3, 2 * REGBYTES(sp)
+    ...
+    ...
+    STORE x30, 29 * REGBYTES(sp)
+    STORE x31, 30 * REGBYTES(sp)
+
+    /* Store current stackpointer in task control block (TCB) */
+    LOAD t0, pxCurrentTCB //pointer
+    STORE sp, 0x0(t0)
+    .endm
+
+    /* Saves current error program counter (EPC) as task program counter */
+    .macro portSAVE_EPC
+    csrr t0, mepc
+    STORE t0, 31 * REGBYTES(sp)
+    .endm
+
+    /* Saves current return address (RA) as task program counter */
+    .macro portSAVE_RA
+    STORE ra, 31 * REGBYTES(sp)
+    .endm
+
+    /* Macro for restoring task context */
+    .macro portRESTORE_CONTEXT
+
+    .global pxCurrentTCB
+    /* Load stack pointer from the current TCB */
+    LOAD sp, pxCurrentTCB
+    LOAD sp, 0x0(sp)
+
+    /* Load task program counter */
+    LOAD t0, 31 * REGBYTES(sp)
+    csrw mepc, t0
+
+    /* Run in machine mode */
+    li t0, MSTATUS_PRV1
+    csrs mstatus, t0
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    LOAD x4, 3 * REGBYTES(sp)
+    LOAD x5, 4 * REGBYTES(sp)
+    ...
+    ...
+    LOAD x30, 29 * REGBYTES(sp)
+    LOAD x31, 30 * REGBYTES(sp)
+
+    addi sp, sp, REGBYTES * 32
+    mret
+    .endm
+
+The important bits are the Load / Save context, which may be replaced
+by firstly setting up the Vectors and secondly using a *single* STORE
+(or LOAD), optionally even a Compressed C.ST or C.LD, to indicate that
+the entire bank of registers is to be loaded/saved:
+
+    /* a few things are assumed here: (a) that when switching to
+       M-Mode an entirely different set of CSRs is used from that
+       which is used in U-Mode and (b) that the M-Mode x1 and x4
+       vectors are also not used anywhere else in M-Mode, and
+       consequently only need to be set up just the once.
+     */
+    .macro VectorSetup
+    MVECTORCSRx1 = 31, defaultlen
+    MVECTORCSRx4 = 28, defaultlen
+
+    /* Save Context */
+    SETVL x0, x0, 31 /* x0 ignored silently */
+    STORE x1, 0x0(sp) // x1 marked as 31-long vector of default bitwidth
+
+    /* Restore registers,
+       Skip global pointer because that does not change */
+    LOAD x1, 0x0(sp)
+    SETVL x0, x0, 28 /* x0 ignored silently */
+    LOAD x4, 3 * REGBYTES(sp) // x4 marked as 28-long default bitwidth
+
+Note that although it may just be a bug in portasm.S, x2 and x3 appear
+not to be restored. If however this is a bug and they *do* need to be
+restored, then the SETVL call may be moved to *outside* the Save / Restore
+Context assembly code, into the VectorSetup macro, as long as vectors are
+never used anywhere else (i.e. VL is never altered by M-Mode).
+
+In effect the entire bank of repeated LOAD / STORE instructions is replaced
+by one single (compressed if it is available) instruction.
+
+## Virtual Memory page-faults on LOAD/STORE
+
+### Notes from conversations
+
+> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
+> riscv-isa-manual in order to work out how to re-map RVV onto the standard
+> ISA, and came across some interesting comments at the bottom of pages 75
+> and 76:
+
+> " A common mechanism used in other ISAs to further reduce save/restore
+> code size is load-multiple and store-multiple instructions. "
+
+> Fascinatingly, due to Simple-V proposing to use the *standard* register
+> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
+> that: load-multiple and store-multiple instructions. Which brings us
+> on to this comment:
+
+> "For virtual memory systems, some data accesses could be resident in
+> physical memory and
+> some could not, which requires a new restart mechanism for partially
+> executed instructions."
+
+> Which then of course brings us to the interesting question: how does RVV
+> cope with the scenario when, particularly with LD.X (Indexed / indirect
+> loads), part-way through the loading a page fault occurs?
+
+> Has this been noted or discussed before?
+
+For applications-class platforms, the RVV exception model is
+element-precise (that is, if an exception occurs on element j of a
+vector instruction, elements 0..j-1 have completed execution and elements
+j+1..vl-1 have not executed).
+
+Certain classes of embedded platforms where exceptions are always fatal
+might choose to offer resumable/swappable interrupts but not precise
+exceptions.
+
+> Is RVV designed in any way to be re-entrant?
+
+Yes.
+
+> What would the implications be for instructions that were in a FIFO at
+> the time, in out-of-order and VLIW implementations, where partial decode
+> had taken place?
+
+The usual bag of tricks for maintaining precise exceptions applies to
+vector machines as well. Register renaming makes the job easier, and
+it's relatively cheaper for vectors, since the control cost is amortized
+over longer registers.
+
+> Would it be reasonable at least to say *bypass* (and freeze) the
+> instruction FIFO (drop down to a single-issue execution model temporarily)
+> for the purposes of executing the instructions in the interrupt (whilst
+> setting up the VM page), then re-continue the instruction with all
+> state intact?
+
+This approach has been done successfully, but it's desirable to be
+able to swap out the vector unit state to support context switches on
+exceptions that result in long-latency I/O.
+
+> Or would it be better to switch to an entirely separate secondary
+> hyperthread context?
+
+> Does anyone have any ideas or know if there is any academic literature
+> on solutions to this problem?
+
+The Vector VAX offered imprecise but restartable and swappable exceptions:
+http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf
+
+Sec. 4.6 of Krste's dissertation assesses some of
+the tradeoffs and references a bunch of related work:
+http://people.eecs.berkeley.edu/~krste/thesis.pdf
+
+----
+
+Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 F.P.
+exceptions" and thought, "hmmm that could go into a CSR, must re-read
+the section on FP state CSRs in RVV 0.4-Draft again", then suddenly
+thought: "ah ha! what if memory exceptions, instead of having an
+immediate exception thrown, were simply stored in a type of predication
+bit-field with a flag saying 'error: this element failed'?"
+
+Then, *after* the vector load (or store, or even operation) was
+performed, you could *then* raise an exception, at which point it
+would be possible (yes in software... I know....) to go "hmmm, these
+indexed operations didn't work, let's get them into memory by triggering
+page-loads", then *re-run the entire instruction* but this time with a
+"memory-predication CSR" that stops the already-performed operations
+(whether they be loads, stores or an arithmetic / FP operation) from
+being carried out a second time.
+
+This theoretically could end up being done multiple times in an SMP
+environment, and also for LD.X there would be the remote (and annoying)
+possibility that the indexed memory address could end up being modified.
+
+The advantage would be that the order of execution need not be
+sequential, which potentially could have some big advantages.
+Am still thinking through the implications as any dependent operations +(particularly ones already decoded and moved into the execution FIFO) +would still be there (and stalled). hmmm. + +---- + + > > # assume internal parallelism of 8 and MAXVECTORLEN of 8 + > > VSETL r0, 8 + > > FADD x1, x2, x3 + > + > > x3[0]: ok + > > x3[1]: exception + > > x3[2]: ok + > > ... + > > ... + > > x3[7]: ok + > + > > what happens to result elements 2-7?  those may be *big* results + > > (RV128) + > > or in the RVV-Extended may be arbitrary bit-widths far greater. + > + >  (you replied:) + > + > Thrown away. + +discussion then led to the question of OoO architectures + +> The costs of the imprecise-exception model are greater than the benefit. +> Software doesn't want to cope with it.  It's hard to debug.  You can't +> migrate state between different microarchitectures--unless you force all +> implementations to support the same imprecise-exception model, which would +> greatly limit implementation flexibility.  (Less important, but still +> relevant, is that the imprecise model increases the size of the context +> structure, as the microarchitectural guts have to be spilled to memory.) + +## Zero/Non-zero Predication + +>> >  it just occurred to me that there's another reason why the data +>> > should be left instead of zeroed.  if the standard register file is +>> > used, such that vectorised operations are translated to mean "please +>> > insert multiple register-contiguous operations into the instruction +>> > FIFO" and predication is used to *skip* some of those, then if the +>> > next "vector" operation uses the (standard) registers that were masked +>> > *out* of the previous operation it may proceed without blocking. +>> > +>> >  if however zeroing is made mandatory then that optimisation becomes +>> > flat-out impossible to deploy. +>> > +>> >  whilst i haven't fully thought through the full implications, i +>> > suspect RVV might also be able to benefit by being able to fit more +>> > overlapping operations into the available SRAM by doing something +>> > similar. +> +> +> Luke, this is called density time masking. It doesn’t apply to only your +> model with the “standard register file” is used. it applies to any +> architecture that attempts to speed up by skipping computation and writeback +> of masked elements. +> +> That said, the writing of zeros need not be explicit. It is possible to add +> a “zero bit” per element that, when set, forces a zero to be read from the +> vector (although the underlying storage may have old data). In this case, +> there may be a way to implement DTM as well. + + +## Implementation detail for scalar-only op detection + +Note 1: this idea is a pipeline-bypass concept, which may *or may not* be +worthwhile. + +Note 2: this is just one possible implementation. Another implementation +may choose to treat *all* operations as vectorised (including treating +scalars as vectors of length 1), choosing to add an extra pipeline stage +dedicated to *all* instructions. + +This section *specifically* covers the implementor's freedom to choose +that they wish to minimise disruption to an existing design by detecting +"scalar-only operations", bypassing the vectorisation phase (which may +or may not require an additional pipeline stage) + +[[scalardetect.png]] + +>> For scalar ops an implementation may choose to compare 2-3 bits through an +>> AND gate: are src & dest scalar? Yep, ok send straight to ALU  (or instr +>> FIFO). 
+ +> Those bits cannot be known until after the registers are decoded from the +> instruction and a lookup in the "vector length table" has completed. +> Considering that one of the reasons RISC-V keeps registers in invariant +> positions across all instructions is to simplify register decoding, I expect +> that inserting an SRAM read would lengthen the critical path in most +> implementations. + +reply: + +> briefly: the trick i mentioned about ANDing bits together to check if +> an op was fully-scalar or not was to be read out of a single 32-bit +> 3R1W SRAM (64-bit if FPU exists). the 32/64-bit SRAM contains 1 bit per +> register indicating "is register vectorised yes no". 3R because you need +> to check src1, src2 and dest simultaneously. the entries are *generated* +> from the CSRs and are an optimisation that on slower embedded systems +> would likely not be needed. + +> is there anything unreasonable that anyone can foresee about that? +> what are the down-sides? + +## C.MV predicated src, predicated dest + +> Can this be usefully defined in such a way that it is +> equivalent to vector gather-scatter on each source, followed by a +> non-predicated vector-compare, followed by vector gather-scatter on the +> result? + +## element width conversion: restrict or remove? + +summary: don't restrict / remove. it's fine. + +> > it has virtually no cost/overhead as long as you specify +> > that inputs can only upconvert, and operations are always done at the +> > largest size, and downconversion only happens at the output. +> +> okaaay.  so that's a really good piece of implementation advice. +> algorithms do require data size conversion, so at some point you need to +> introduce the feature of upconverting and downconverting. +> +> > for int and uint, this is dead simple and fits well within the RVV pipeline +> > without any critical path, pipeline depth, or area implications. + + + +## Under review / discussion: remove CSR vector length, use VSETVL + +**DECISION: 11jun2018 - CSR vector length removed, VSETVL determines +length on all regs**. This section kept for historical reasons. + +So the issue is as follows: + +* CSRs are used to set the "span" of a vector (how many of the standard + register file to contiguously use) +* VSETVL in RVV works as follows: it sets the vector length (copy of which + is placed in a dest register), and if the "required" length is longer + than the *available* length, the dest reg is set to the MIN of those + two. +* **HOWEVER**... in SV, *EVERY* vector register has its own separate + length and thus there is no way (at the time that VSETVL is called) to + know what to set the vector length *to*. +* At first glance it seems that it would be perfectly fine to just limit + the vector operation to the length specified in the destination + register's CSR, at the time that each instruction is issued... + except that that cannot possibly be guaranteed to match + with the value *already loaded into the target register from VSETVL*. + +Therefore a different approach is needed. + +Possible options include: + +* Removing the CSR "Vector Length" and always using the value from + VSETVL. "VSETVL destreg, counterreg, #lenimmed" will set VL *and* + destreg equal to MIN(counterreg, lenimmed), with register-based + variant "VSETVL destreg, counterreg, lenreg" doing the same. 
+* Keeping the CSR "Vector Length" and having the lenreg version have
+  a "twist": "if lenreg is vectorised, read the length from the CSR"
+* Other (TBD)
+
+The first option (of the ones brainstormed so far) is a lot simpler.
+It does however mean that the length set in VSETVL will apply across-the-board
+to all src1, src2 and dest vectorised registers until it is otherwise changed
+(by another VSETVL call). This is probably desirable behaviour.
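+
+A minimal sketch of the chosen (first-option) semantics, in the same
+pseudocode style used elsewhere in this document:
+
+    # VSETVL destreg, counterreg, lenimmed
+    # (the register-based variant reads lenreg instead of lenimmed)
+    VL = MIN(counterreg, lenimmed)
+    destreg = VL   # copy of the chosen vector length handed back
+    # VL now applies across-the-board to every vectorised register
+    # until the next VSETVL call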
+
+## Implementation Paradigms
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way up to high-performance systems.
+
+* Full (or partial) software-emulation (via traps): full support for CSRs
+  required, however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comprehensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+### Full or partial software-emulation
+
+The absolute minimal implementation is to provide the full
+set of CSRs and detection logic for when any of the source or destination
+registers are vectorised. On detection, a trap is thrown, whether it's
+a branch, LOAD, STORE, or an arithmetic operation.
+
+Implementors are entirely free to choose whether to allow absolutely every
+single operation to be software-emulated, or whether to provide some emulation
+and some hardware support. In particular, for an RV32E implementation
+where fast context-switching is a requirement (see "Context Switch Example"),
+it makes no sense to allow Vectorised-LOAD/STORE to be implemented as an
+exception, as every context-switch will result in double-traps.
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Avoiding destruction of destination indices requires a copy of the
+entire vector to be taken in advance.
+
+TBD: floating-point compare and other exception handling
+
+------
+
+Multi-LR/SC
+
+Please don't try to use the L1 itself.
+
+Use the Load and Store buffers, which capture instruction state prior
+to being accessed in the L1 (and prior to data arriving in the case of
+the Store buffer).
+
+Also, use the L1 Miss buffers, as these already HAVE to be snooped by
+coherence traffic. These are used to monitor that all participating
+cache lines remain interference-free, and amalgamate same into a CPU
+signal accessible via branch or predicate.
+
+The Load buffers manage inbound traffic; the Store buffers manage
+outbound traffic.
+
+Done properly, the participating cache lines can exceed the associativity
+of the L1 cache without architectural harm (may incur additional latency).
+
+> > > so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> >
+> > Can you explain what you mean by this paragraph?
+>
+> best put in sequential events, probably.
+>
+> LR <-- 64-instruction countdown starts here
+> ... 63
+> ... 62
+> LR same address <--- notes that core1 is on 61,
+>                      so pauses for **UP TO** 61 cycles
+> ... 32
+> SC <- core1 didn't reach zero, therefore valid, therefore
+>       core2 is now **UNBLOCKED**, is granted the
+>       load-reservation (and begins its **own** 64-cycle
+>       LR instruction countdown)
+> ... 63
+> ... 62
+> ...
+> ...
+> SC <- also valid
+
+It looks to me as if you could effect the same functionality by simply
+holding onto the cache line in core 1, preventing core 2 from getting
+past the LR.
+
+On the other hand, the freeze is similar to how the MP CRAYs did
+ATOMIC stuff.

 # References

@@ -1062,3 +1730,24 @@ pluses:
 * Branch Divergence
 * Life of Triangles (3D)
 * Videocore-IV
+* Discussion proposing CSRs that change ISA definition
+
+* Zero-overhead loops
+* Multi-ported VLIW Register File Implementation
+* Fast context save/restore proposal
+* Register File Bank Cacheing
+* Expired Patent on Vector Virtual Memory solutions
+
+* Discussion on RVV "re-entrant" capabilities allowing operations to be
+  restarted if an exception occurs (VM page-table miss)
+
+* Dot Product Vector
+* RVV slides 2017
+* Wavefront skipping using BRAMS
+* Streaming Pipelines
+* Barcelona SIMD Presentation
+* Full Description (last page) of RVV instructions
+
+* PULP Low-energy Cluster Vector Processor