X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=7aff13daf9a43aecfb01b46f9f6183b6ce440da7;hb=82afca3c8990cf848d836b8bae877e6e983467e7;hp=94ee77af7a9b304c509d25922aa0792004f0e440;hpb=0998547c269af1f9bd116538f1cea6641e56608d;p=libreriscv.git diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn index 94ee77af7..7aff13daf 100644 --- a/simple_v_extension.mdwn +++ b/simple_v_extension.mdwn @@ -1,21 +1,19 @@ # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal -[[!toc ]] - -# Summary - Key insight: Simple-V is intended as an abstraction layer to provide a consistent "API" to parallelisation of existing *and future* operations. *Actual* internal hardware-level parallelism is *not* required, such that Simple-V may be viewed as providing a "compact" or "consolidated" means of issuing multiple near-identical arithmetic instructions to an -instruction queue (FILO), pending execution. +instruction queue (FIFO), pending execution. *Actual* parallelism, if added independently of Simple-V in the form of Out-of-order restructuring (including parallel ALU lanes) or VLIW implementations, or SIMD, or anything else, would then benefit *if* Simple-V was added on top. +[[!toc ]] + # Introduction This proposal exists so as to be able to satisfy several disparate @@ -23,7 +21,7 @@ requirements: power-conscious, area-conscious, and performance-conscious designs all pull an ISA and its implementation in different conflicting directions, as do the specific intended uses for any given implementation. -Additionally, the existing P (SIMD) proposal and the V (Vector) proposals, +The existing P (SIMD) proposal and the V (Vector) proposals, whilst each extremely powerful in their own right and clearly desirable, are also: @@ -33,15 +31,36 @@ are also: analysis and review purposes) prohibitively expensive * Both contain partial duplication of pre-existing RISC-V instructions (an undesirable characteristic) -* Both have independent and disparate methods for introducing parallelism - at the instruction level. +* Both have independent, incompatible and disparate methods for introducing + parallelism at the instruction level * Both require that their respective parallelism paradigm be implemented along-side and integral to their respective functionality *or not at all*. * Both independently have methods for introducing parallelism that could, if separated, benefit *other areas of RISC-V not just DSP or Floating-point respectively*. -Therefore it makes a huge amount of sense to have a means and method +There are also key differences between Vectorisation and SIMD (full +details outlined in the Appendix), the key points being: + +* SIMD has an extremely seductively compelling ease of implementation argument: + each operation is passed to the ALU, which is where the parallelism + lies. There is *negligeable* (if any) impact on the rest of the core + (with life instead being made hell for compiler writers and applications + writers due to extreme ISA proliferation). +* By contrast, Vectorisation has quite some complexity (for considerable + flexibility, reduction in opcode proliferation and much more). +* Vectorisation typically includes much more comprehensive memory load + and store schemes (unit stride, constant-stride and indexed), which + in turn have ramifications: virtual memory misses (TLB cache misses) + and even multiple page-faults... 
all caused by a *single instruction*, + yet with a clear benefit that the regularisation of LOAD/STOREs can + be optimised for minimal impact on caches and maximised throughput. +* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned + to pages), and these load/stores have absolutely nothing to do with the + SIMD / ALU engine, no matter how wide the operand. Simplicity but with + more impact on instruction and data caches. + +Overall it makes a huge amount of sense to have a means and method of introducing instruction parallelism in a flexible way that provides implementors with the option to choose exactly where they wish to offer performance improvements and where they wish to optimise for power @@ -52,148 +71,27 @@ Additionally it makes sense to *split out* the parallelism inherent within each of P and V, and to see if each of P and V then, in *combination* with a "best-of-both" parallelism extension, could be added on *on top* of this proposal, to topologically provide the exact same functionality of -each of P and V. +each of P and V. Each of P and V then can focus on providing the best +operations possible for their respective target areas, without being +hugely concerned about the actual parallelism. Furthermore, an additional goal of this proposal is to reduce the number of opcodes utilised by each of P and V as they currently stand, leveraging existing RISC-V opcodes where possible, and also potentially allowing P and V to make use of Compressed Instructions as a result. -**TODO**: reword this to better suit this document: - -Having looked at both P and V as they stand, they're _both_ very much -"separate engines" that, despite both their respective merits and -extremely powerful features, don't really cleanly fit into the RV design -ethos (or the flexible extensibility) and, as such, are both in danger -of not being widely adopted. I'm inclined towards recommending: - -* splitting out the DSP aspects of P-SIMD to create a single-issue DSP -* splitting out the polymorphism, esoteric data types (GF, complex - numbers) and unusual operations of V to create a single-issue "Esoteric - Floating-Point" extension -* splitting out the loop-aspects, vector aspects and data-width aspects - of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they - apply across *all* Extensions, whether those be DSP, M, Base, V, P - - everything. - -**TODO**: propose overflow registers be actually one of the integer regs -(flowing to multiple regs). - -**TODO**: propose "mask" (predication) registers likewise. combination with -standard RV instructions and overflow registers extremely powerful - -## CSRs marking registers as Vector - -A 32-bit CSR would be needed (1 bit per integer register) to indicate -whether a register was, if referred to, implicitly to be treated as -a vector. - -A second 32-bit CSR would be needed (1 bit per floating-point register) -to indicate whether a floating-point register was to be treated as a -vector. - -In this way any standard (current or future) operation involving -register operands may detect if the operation is to be vector-vector, -vector-scalar or scalar-scalar (standard) simply through a single -bit test. - -## CSR vector-length and CSR SIMD packed-bitwidth - -**TODO** analyse each of these: - -* splitting out the loop-aspects, vector aspects and data-width aspects -* integer reg 0 *and* fp reg0 share CSR vlen 0 *and* CSR packed-bitwidth 0 -* integer reg 1 *and* fp reg1 share CSR vlen 1 *and* CSR packed-bitwidth 1 -* .... -* ....  
- -instead: - -* CSR vlen 0 *and* CSR packed-bitwidth 0 register contain extra bits - specifying an *INDEX* of WHICH int/fp register they refer to -* CSR vlen 1 *and* CSR packed-bitwidth 1 register contain extra bits - specifying an *INDEX* of WHICH int/fp register they refer to -* ... -* ... - -Have to be very *very* careful about not implementing too few of those -(or too many). Assess implementation impact on decode latency. Is it -worth it? - -Implementation of the latter: - -Operation involving (referring to) register M: - - bitwidth = default # default for opcode? - vectorlen = 1 # scalar - - for (o = 0, o < 2, o++) -   if (CSR-Vector_registernum[o] == M) -       bitwidth = CSR-Vector_bitwidth[o] -       vectorlen = CSR-Vector_len[o] -       break - -and for the former it would simply be: - - bitwidth = CSR-Vector_bitwidth[M] - vectorlen = CSR-Vector_len[M] - -Alternatives: - -* One single "global" vector-length CSR - -## Stride - -**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular -register as being "if you use this reg in LOAD/STORE, use the offset -amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous". -can be used for matrix spanning. - -> For LOAD/STORE, could a better option be to interpret the offset in the -> opcode as a stride instead, so "LOAD t3, 12(t2)" would, if t3 is -> configured as a length-4 vector base, result in t3 = *t2, t4 = *(t2+12), -> t5 = *(t2+24), t6 = *(t2+32)?  Perhaps include a bit in the -> vector-control CSRs to select between offset-as-stride and unit-stride -> memory accesses? - -So there would be an instruction like this: - -| SETOFF | On=rN | OBank={float|int} | Smode={offs|unit} | OFFn=rM | -| opcode | 5 bit | 1 bit | 1 bit | 5 bit, OFFn=XLEN | - - -which would mean: - -* CSR-Offset register n <= (float|int) register number N -* CSR-Offset Stride-mode = offset or unit -* CSR-Offset amount register n = contents of register M - -LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set): - - offs = 0 - stride = 1 - vector-len = CSR-Vector-length register N - - for (o = 0, o < 2, o++) - if (CSR-Offset register o == M) - offs = CSR-Offset amount register o - if CSR-Offset Stride-mode == offset: - stride = ldoffs - break - - for (i = 0, i < vector-len; i++) - r[N+i] = mem[(offs*i + r[M+i])*stride] - # Analysis and discussion of Vector vs SIMD -There are four combined areas between the two proposals that help with -parallelism without over-burdening the ISA with a huge proliferation of +There are six combined areas between the two proposals that help with +parallelism (increased performance, reduced power / area) without +over-burdening the ISA with a huge proliferation of instructions: * Fixed vs variable parallelism (fixed or variable "M" in SIMD) * Implicit vs fixed instruction bit-width (integral to instruction or not) * Implicit vs explicit type-conversion (compounded on bit-width) * Implicit vs explicit inner loops. +* Single-instruction LOAD/STORE. * Masks / tagging (selecting/preventing certain indexed elements from execution) The pros and cons of each are discussed and analysed below. @@ -211,6 +109,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable for general-purpose computation, and in the context of developing a general-purpose ISA, is never going to satisfy 100 percent of implementors. 
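+
+To see the corner-case problem concretely, here is a hedged sketch in C of
+what a compiler must emit for a simple byte-wise add when the only available
+parallelism is a fixed 4-wide SIMD add (simd_add8x4 is an invented stand-in,
+not an intrinsic from any actual SIMD ISA):
+
+    #include <stdint.h>
+
+    /* hypothetical: a fixed-width 4-element SIMD add, standing in for */
+    /* whatever intrinsic / opcode a given SIMD ISA actually provides  */
+    extern void simd_add8x4(uint8_t *d, const uint8_t *a, const uint8_t *b);
+
+    void add_bytes(uint8_t *d, const uint8_t *a, const uint8_t *b, int n)
+    {
+        int i = 0;
+        for (; i + 4 <= n; i += 4)    /* main fixed-width SIMD loop    */
+            simd_add8x4(&d[i], &a[i], &b[i]);
+        for (; i < n; i++)            /* scalar cleanup: 0-3 stragglers */
+            d[i] = a[i] + b[i];
+    }
+
+Every additional SIMD width multiplies these loop variants (and the opcodes
+behind them), whereas a variable-length vector loop has no such tail.
+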
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
 That basically leaves "variable-length vector" as the clear *general-purpose*
 winner, at least in terms of greatly simplifying the instruction set, reducing
 the number of instructions required for any given task, and thus
@@ -233,7 +138,7 @@ burdensome to implementations, given that instruction decode already has
 to direct the operation to a correctly-sized width ALU engine, anyway.
 
 Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
 implicit bit-width allows the meaning of certain operations to be
 type-overloaded *without* pollution or alteration of frozen and immutable
 instructions, in a fully backwards-compatible fashion.
@@ -246,13 +151,16 @@ integer (and floating point) of various sizes is automatically inferred
 due to "type tagging" that is set with a special instruction.  A register
 will be *specifically* marked as "16-bit Floating-Point" and, if added to
 an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
 to be explicitly done with its own separate instruction.
 
 However, implicit type-conversion is not only quite burdensome to
 implement (explosion of inferred type-to-type conversion) but also is
 never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration.  Each new type increases the
+conversion space by O(N^2) and, as anyone who has examined python's
+source code (which has built-in polymorphic type-conversion) knows,
+the task is more complex than it first seems.
 
 Overall, type-conversion is generally best to leave to explicit
 type-conversion instructions, or in definite specific use-cases left to
@@ -281,6 +189,33 @@ applied to embedded processors" (ZOLC), optimising only the single
 inner loop seems inadequate, tending to suggest that ZOLC may be
 better off being proposed as an entirely separate Extension.
 
+## Single-instruction LOAD/STORE
+
+In traditional Vector Architectures there are instructions which
+result in multiple register-memory transfer operations resulting
+from a single instruction.  They are complicated to implement in hardware,
+yet the benefit is a huge, consistent regularisation of memory accesses
+that can be highly optimised with respect to both actual memory and any
+L1, L2 or other caches.  Hwacha EECS-2015-263 makes explicitly clear
+the consequences of getting this architecturally wrong:
+L2 cache-thrashing at the very least.
+
+Complications arise when Virtual Memory is involved: TLB cache misses
+need to be dealt with, as do page faults.  Some of the tradeoffs are
+discussed in , Section
+4.6, and an article by Jeff Bush when faced with some of these issues
+is particularly enlightening.
+
+Interestingly, none of this complexity is faced in SIMD architectures... 
+but then they do not get the opportunity to optimise for highly-streamlined +memory accesses either. + +With the "bang-per-buck" ratio being so high and the indirect improvement +in L1 Instruction Cache usage (reduced instruction count), as well as +the opportunity to optimise L1 and L2 cache usage, the case for including +Vector LOAD/STORE is compelling. + ## Mask and Tagging (Predication) Tagging (aka Masks aka Predication) is a pseudo-method of implementing @@ -300,8 +235,8 @@ So these are the ways in which conditional execution may be implemented: * explicit compare and branch: BNE x, y -> offs would jump offs instructions if x was not equal to y * explicit store of tag condition: CMP x, y -> tagbit -* implicit (condition-code) ADD results in a carry, carry bit implicitly - (or sometimes explicitly) goes into a "tag" (mask) register +* implicit (condition-code) such as ADD results in a carry, carry bit + implicitly (or sometimes explicitly) goes into a "tag" (mask) register The first of these is a "normal" branch method, which is flat-out impossible to parallelise without look-ahead and effectively rewriting instructions. @@ -330,87 +265,20 @@ condition-codes or predication. By adding a CSR it becomes possible to also tag certain registers as "predicated if referenced as a destination". Example: - // in future operations if r0 is the destination use r5 as + // in future operations from now on, if r0 is the destination use r5 as // the PREDICATION register - IMPLICICSRPREDICATE r0, r5 - // store the compares in r5 as the PREDICATION register + SET_IMPLICIT_CSRPREDICATE r0, r5 + // store the compares in r5 as the PREDICATION register CMPEQ8 r5, r1, r2 - // r0 is used here. ah ha! that means it's predicated using r5! + // r0 is used here. ah ha! that means it's predicated using r5! ADD8 r0, r1, r3 -With enough registers (and there are enough registers) some fairly +With enough registers (and in RISC-V there are enough registers) some fairly complex predication can be set up and yet still execute without significant stalling, even in a simple non-superscalar architecture. -### Retro-fitting Predication into branch-explicit ISA - -One of the goals of this parallelism proposal is to avoid instruction -duplication. However, with the base ISA having been designed explictly -to *avoid* condition-codes entirely, shoe-horning predication into it -bcomes quite challenging. - -However what if all branch instructions, if referencing a vectorised -register, were instead given *completely new analogous meanings* that -resulted in a parallel bit-wise predication register being set? This -would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE, -BLT and BGE. - -We might imagine that FEQ, FLT and FLT would also need to be converted, -however these are effectively *already* in the precise form needed and -do not need to be converted *at all*! The difference is that FEQ, FLT -and FLE *specifically* write a 1 to an integer register if the condition -holds, and 0 if not. All that needs to be done here is to say, "if -the integer register is tagged with a bit that says it is a predication -register, the **bit** in the integer register is set based on the -current vector index" instead. - -There is, in the standard Conditional Branch instruction, more than -adequate space to interpret it in a similar fashion: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 
0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | -"""]] - -This would become: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | -"""]] - -Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, -with the interesting side-effect that there is space within what is presently -the "immediate offset" field to reinterpret that to add in not only a bit -field to distinguish between floating-point compare and integer compare, -not only to add in a second source register, but also use some of the bits as -a predication target as well. - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | op | - 3 | 3 | 3 | 5 | 2 | - C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | -"""]] - -Now uses the CS format: - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | | op | - 3 | 3 | 3 | 2 | 3 | 2 | - C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 | -"""]] - -Bit 6 would be decoded as "operation refers to Integer or Float" including -interpreting src1 and src2 accordingly as outlined in Table 12.2 of the -"C" Standard, version 2.0, -whilst Bit 5 would allow the operation to be extended, in combination with -funct3 = 110 or 111: a combination of four distinct comparison operators. +(For details on how Branch Instructions would be retro-fitted to indirectly +predicated equivalents, see Appendix) ## Conclusions @@ -423,6 +291,7 @@ follows: * Implicit (indirect) vs fixed (integral) instruction bit-width: indirect * Implicit vs explicit type-conversion: explicit * Implicit vs explicit inner loops: implicit but best done separately +* Single-instruction Vector LOAD/STORE: Complex but highly beneficial * Tag or no-tag: Complex but highly beneficial In particular: @@ -438,19 +307,9 @@ In particular: i.e. *without* requiring a super-scalar or out-of-order architecture, but doing a proper, full job (ZOLC) is an entirely different matter. -Constructing a SIMD/Simple-Vector proposal based around four of these five +Constructing a SIMD/Simple-Vector proposal based around four of these six requirements would therefore seem to be a logical thing to do. -# Instruction Format - -**TODO** *basically borrow from both P and V, which should be quite simple -to do, with the exception of Tag/no-tag, which needs a bit more -thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS -gather-scatterer, and, if implemented, could actually be a really useful -way to span 8-bit up to 64-bit groups of data, where BGS as it stands -and described by Clifford does **bits** of up to 16 width. Lots to -look at and investigate!* - # Note on implementation of parallelism One extremely important aspect of this proposal is to respect and support @@ -475,562 +334,506 @@ basis* whether and how much "Virtual Parallelism" to deploy. It is absolutely critical to note that it is proposed that such choices MUST be **entirely transparent** to the end-user and the compiler. 
Whilst -a Vector (varible-width SIM) may not precisely match the width of the +a Vector (varible-width SIMD) may not precisely match the width of the parallelism within the implementation, the end-user **should not care** and in this way the performance benefits are gained but the ISA remains straightforward. All that happens at the end of an instruction run is: some parallel units (if there are any) would remain offline, completely transparently to the ISA, the program, and the compiler. -The "SIMD considered harmful" trap of having huge complexity and extra +To make that clear: should an implementor choose a particularly wide +SIMD-style ALU, each parallel unit *must* have predication so that +the parallel SIMD ALU may emulate variable-length parallel operations. +Thus the "SIMD considered harmful" trap of having huge complexity and extra instructions to deal with corner-cases is thus avoided, and implementors get to choose precisely where to focus and target the benefits of their implementation efforts, without "extra baggage". -# Example of vector / vector, vector / scalar, scalar / scalar => vector add - - register CSRvectorlen[XLEN][4]; # not quite decided yet about this one... - register CSRpredicate[XLEN][4]; # 2^4 is max vector length - register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well - register x[32][XLEN]; - - function op_add(rd, rs1, rs2, predr) - { -    /* note that this is ADD, not PADD */ -    int i, id, irs1, irs2; -    # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored -    # also destination makes no sense as a scalar but what the hell... -    for (i = 0, id=0, irs1=0, irs2=0; i -Simple-V's proposed means of expressing whether a register (from the -standard integer or the standard floating-point file) is a scalar or -a vector is to simply set the vector length to 1. The instruction -would however have to specify which register file (integer or FP) that -the vector-length was to be applied to. +There are a number of CSRs needed, which are used at the instruction +decode phase to re-interpret RV opcodes (a practice that has +precedent in the setting of MISA to enable / disable extensions). -Extended shapes (2-D etc) would not be part of Simple-V at all. +* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1) +* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) +* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) +* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) +* Integer Register N is a Predication Register (note: a key-value store) +* Vector Length CSR (VSETVL, VGETVL) -## 17.4 Representation Encoding - -Simple-V would not have representation-encoding. This is part of -polymorphism, which is considered too complex to implement (TODO: confirm?) - -## 17.5 Element Bitwidth - -This is directly equivalent to Simple-V's "Packed", and implies that -integer (or floating-point) are divided down into vector-indexable -chunks of size Bitwidth. - -In this way it becomes possible to have ADD effectively and implicitly -turn into ADDb (8-bit add), ADDw (16-bit add) and so on, and where -vector-length has been set to greater than 1, it becomes a "Packed" -(SIMD) instruction. - -It remains to be decided what should be done when RV32 / RV64 ADD (sized) -opcodes are used. 
One useful idea would be, on an RV64 system where -a 32-bit-sized ADD was performed, to simply use the least significant -32-bits of the register (exactly as is currently done) but at the same -time to *respect the packed bitwidth as well*. - -The extended encoding (Table 17.6) would not be part of Simple-V. - -## 17.6 Base Vector Extension Supported Types - -TODO: analyse. probably exactly the same. - -## 17.7 Maximum Vector Element Width - -No equivalent in Simple-V - -## 17.8 Vector Configuration Registers - -TODO: analyse. - -## 17.9 Legal Vector Unit Configurations - -TODO: analyse - -## 17.10 Vector Unit CSRs +Notes: -TODO: analyse +* for the purposes of LOAD / STORE, Integer Registers which are + marked as a Vector will result in a Vector LOAD / STORE. +* Vector Lengths are *not* the same as vsetl but are an integral part + of vsetl. +* Actual vector length is *multipled* by how many blocks of length + "bitwidth" may fit into an XLEN-sized register file. +* Predication is a key-value store due to the implicit referencing, + as opposed to having the predicate register explicitly in the instruction. +* Whilst the predication CSR is a key-value store it *generates* easier-to-use + state information. +* TODO: assess whether the same technique could be applied to the other + Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4, + V2.3-Draft ISA Reference) it becomes possible to greatly reduce state + needed for context-switches (empty slots need never be stored). + +## Predication CSR + +The Predication CSR is a key-value store indicating whether, if a given +destination register (integer or floating-point) is referred to in an +instruction, it is to be predicated. The first entry is whether predication +is enabled. The second entry is whether the register index refers to a +floating-point or an integer register. The third entry is the index +of that register which is to be predicated (if referred to). The fourth entry +is the integer register that is treated as a bitfield, indexable by the +vector element index. + +| RegNo | 6 | 5 | (4..0) | (4..0) | +| ----- | - | - | ------- | ------- | +| r0 | pren0 | i/f | regidx | predidx | +| r1 | pren1 | i/f | regidx | predidx | +| .. | pren.. | i/f | regidx | predidx | +| r15 | pren15 | i/f | regidx | predidx | + +The Predication CSR Table is a key-value store, so implementation-wise +it will be faster to turn the table around (maintain topologically +equivalent state): + + fp_pred_enabled[32]; + int_pred_enabled[32]; + for (i = 0; i < 16; i++) + if CSRpred[i].pren: + idx = CSRpred[i].regidx + predidx = CSRpred[i].predidx + if CSRpred[i].type == 0: # integer + int_pred_enabled[idx] = 1 + int_pred_reg[idx] = predidx + else: + fp_pred_enabled[idx] = 1 + fp_pred_reg[idx] = predidx + +So when an operation is to be predicated, it is the internal state that +is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following +pseudo-code for operations is given, where p is the explicit (direct) +reference to the predication register to be used: -> Ok so this is an aspect of Simple-V that I hadn't thought through, -> yet (proposal / idea only a few days old!).  in V2.3-Draft ISA Section -> 17.10 the CSRs are listed.  I note that there's some general-purpose -> CSRs (including a global/active vector-length) and 16 vcfgN CSRs.  i -> don't precisely know what those are for. 
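+
+    # sketch only (assumed shape, following the description above):
+    # p = explicitly-referenced predicate register, vl = vector length,
+    # [!] = optional inversion of the predicate bit
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+            (*dest)[i] = iop((*src1)[i], (*src2)[i])  # 2-operand example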
+ for (int i=0; i  In the Simple-V proposal, *every* register in both the integer -> register-file *and* the floating-point register-file would have at -> least a 2-bit "data-width" CSR and probably something like an 8-bit -> "vector-length" CSR (less in RV32E, by exactly one bit). +This instead becomes an *indirect* reference using the *internal* state +table generated from the Predication CSR key-value store: ->  What I *don't* know is whether that would be considered perfectly -> reasonable or completely insane.  If it turns out that the proposed -> Simple-V CSRs can indeed be stored in SRAM then I would imagine that -> adding somewhere in the region of 10 bits per register would be... okay?  -> I really don't honestly know. + if type(iop) == INT: + pred_enabled = int_pred_enabled + preg = int_pred_reg[rd] + else: + pred_enabled = fp_pred_enabled + preg = fp_pred_reg[rd] ->  Would these proposed 10-or-so-bit per-register Simple-V CSRs need to -> be multi-ported? No I don't believe they would. + for (int i=0; i 1; + s2 = CSRvectorlen[src2] > 1; + for (int i=0; i However, there are also several features that go beyond simply attaching VL -> to a scalar operation and are crucial to being able to vectorize a lot of -> code. To name a few: -> - Conditional execution (i.e., predicated operations) -> - Inter-lane data movement (e.g. SLIDE, SELECT) -> - Reductions (e.g., VADD with a scalar destination) - - Ok so the Conditional and also the Reductions is one of the reasons - why as part of SimpleV / variable-SIMD / parallelism (gah gotta think - of a decent name) i proposed that it be implemented as "if you say r0 - is to be a vector / SIMD that means operations actually take place on - r0,r1,r2... r(N-1)". - - Consequently any parallel operation could be paused (or... more - specifically: vectors disabled by resetting it back to a default / - scalar / vector-length=1) yet the results would actually be in the - *main register file* (integer or float) and so anything that wasn't - possible to easily do in "simple" parallel terms could be done *out* - of parallel "mode" instead. - - I do appreciate that the above does imply that there is a limit to the - length that SimpleV (whatever) can be parallelised, namely that you - run out of registers! my thought there was, "leave space for the main - V-Ext proposal to extend it to the length that V currently supports". - Honestly i had not thought through precisely how that would work. - - Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that, - it reminds me of the discussion with Clifford on bit-manipulation - (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if - applied "globally and outside of V and P" SLIDE and SELECT might become - an extremely powerful way to do fast memory copy and reordering [2[. - - However I haven't quite got my head round how that would work: i am - used to the concept of register "tags" (the modern term is "masks") - and i *think* if "masks" were applied to a Simple-V-enhanced LOAD / - STORE you would get the exact same thing as SELECT. - - SLIDE you could do simply by setting say r0 vector-length to say 16 - (meaning that if referred to in any operation it would be an implicit - parallel operation on *all* registers r0 through r15), and temporarily - set say.... r7 vector-length to say... 5. Do a LOAD on r7 and it would - implicitly mean "load from memory into r7 through r11". Then you go - back and do an operation on r0 and ta-daa, you're actually doing an - operation on a SLID {SLIDED?) 
vector. - - The advantage of Simple-V (whatever) over V would be that you could - actually do *operations* in the middle of vectors (not just SLIDEs) - simply by (as above) setting r0 vector-length to 16 and r7 vector-length - to 5. There would be nothing preventing you from doing an ADD on r0 - (which meant do an ADD on r0 through r15) followed *immediately in the - next instruction with no setup cost* a MUL on r7 (which actually meant - "do a parallel MUL on r7 through r11"). - - btw it's worth mentioning that you'd get scalar-vector and vector-scalar - implicitly by having one of the source register be vector-length 1 - (the default) and one being N > 1. but without having special opcodes - to do it. i *believe* (or more like "logically infer or deduce" as - i haven't got access to the spec) that that would result in a further - opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V. - - Also, Reduction *might* be possible by specifying that the destination be - a scalar (vector-length=1) whilst the source be a vector. However... it - would be an awful lot of work to go through *every single instruction* - in *every* Extension, working out which ones could be parallelised (ADD, - MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth - the effort? maybe. Would it result in huge complexity? probably. - Could an implementor just go "I ain't doing *that* as parallel! - let's make it virtual-parallelism (sequential reduction) instead"? - absolutely. So, now that I think it through, Simple-V (whatever) - covers Reduction as well. huh, that's a surprise. - - -> - Vector-length speculation (making it possible to vectorize some loops with -> unknown trip count) - I don't think this part of the proposal is written -> down yet. - - Now that _is_ an interesting concept. A little scary, i imagine, with - the possibility of putting a processor into a hard infinite execution - loop... :) - - -> Also, note the vector ISA consumes relatively little opcode space (all the -> arithmetic fits in 7/8ths of a major opcode). This is mainly because data -> type and size is a function of runtime configuration, rather than of opcode. - - yes. i love that aspect of V, i am a huge fan of polymorphism [1] - which is why i am keen to advocate that the same runtime principle be - extended to the rest of the RISC-V ISA [3] - - Yikes that's a lot. I'm going to need to pull this into the wiki to - make sure it's not lost. - -[1] inherent data type conversion: 25 years ago i designed a hypothetical -hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit -(escape-extended) opcodes and 2-bit (escape-extended) operands that -only required a fixed 8-bit instruction length. that relied heavily -on polymorphism and runtime size configurations as well. At the time -I thought it would have meant one HELL of a lot of CSRs... but then I -met RISC-V and was cured instantly of that delusion^Wmisapprehension :) - -[2] Interestingly if you then also add in the other aspect of Simple-V -(the data-size, which is effectively functionally orthogonal / identical -to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE -operations become byte / half-word / word augmenters of B-Ext's proposed -"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored -LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it -would get really REALLY interesting would be masked-packed-vectored -B-Ext BGS instructions. 
I can't even get my head fully round that, -which is a good sign that the combination would be *really* powerful :) - -[3] ok sadly maybe not the polymorphism, it's too complicated and I -think would be much too hard for implementors to easily "slide in" to an -existing non-Simple-V implementation.  i say that despite really *really* -wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some -fashion, for optimising 3D Graphics.  *sigh*. - -## TODO: analyse, auto-increment on unit-stride and constant-stride - -so i thought about that for a day or so, and wondered if it would be -possible to propose a variant of zero-overhead loop that included -auto-incrementing the two address registers a2 and a3, as well as -providing a means to interact between the zero-overhead loop and the -vsetvl instruction. a sort-of pseudo-assembly of that would look like: - - # a2 to be auto-incremented by t0 times 4 - zero-overhead-set-auto-increment a2, t0, 4 - # a2 to be auto-incremented by t0 times 4 - zero-overhead-set-auto-increment a3, t0, 4 - zero-overhead-set-loop-terminator-condition a0 zero - zero-overhead-set-start-end stripmine, stripmine+endoffset - stripmine: - vsetvl t0,a0 - vlw v0, a2 - vlw v1, a3 - vfma v1, a1, v0, v1 - vsw v1, a3 - sub a0, a0, t0 - stripmine+endoffset: - -the question is: would something like this even be desirable? it's a -variant of auto-increment [1]. last time i saw any hint of auto-increment -register opcodes was in the 1980s... 68000 if i recall correctly... yep -see [1] - -[1] http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node6.html - -Reply: - -Another option for auto-increment is for vector-memory-access instructions -to support post-increment addressing for unit-stride and constant-stride -modes. This can be implemented by the scalar unit passing the operation -to the vector unit while itself executing an appropriate multiply-and-add -to produce the incremented address. This does *not* require additional -ports on the scalar register file, unlike scalar post-increment addressing -modes. - -## TODO: instructions (based on Hwacha) V-Ext duplication analysis - -This is partly speculative due to lack of access to an up-to-date -V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing). However -basin an analysis instead on Hwacha, a cursory examination shows over -an **85%** duplication of V-Ext operand-related instructions when -compared to Simple-V on a standard RG64G base. Even Vector Fetch -is analogous to "zero-overhead loop". - -Exceptions are: - -* Vector Indexed Memory Instructions (non-contiguous) -* Vector Atomic Memory Instructions. -* Some of the Vector Misc ops: VEIDX, VFIRST, VCLASS, VPOPC - and potentially more. 
-* Consensual Jump - -Table of RV32V Instructions - -| RV32V | RV Equivalent (FP) | RV Equivalent (Int) | Notes | -| ----- | --- | | | -| VADD | FADD | ADD | | -| VSUB | FSUB | SUB | | -| VSL | | SLL | | -| VSR | | SRL | | -| VAND | | AND | | -| VOR | | OR | | -| VXOR | | XOR | | -| VSEQ | FEQ | BEQ | {1} | -| VSNE | !FEQ | BNE | {1} | -| VSLT | FLT | BLT | {1} | -| VSGE | !FLE | BGE | {1} | -| VCLIP | | | | -| VCVT | FCVT | | | -| VMPOP | | | | -| VMFIRST | | | | -| VEXTRACT | | | | -| VINSERT | | | | -| VMERGE | | | | -| VSELECT | | | | -| VSLIDE | | | | -| VDIV | FDIV | DIV | | -| VREM | | REM | | -| VMUL | FMUL | MUL | | -| VMULH | | | | -| VMIN | FMIN | | | -| VMAX | FMUX | | | -| VSGNJ | FSGNJ | | | -| VSGNJN | FSGNJN | | | -| VSGNJX | FSNGJX | | | -| VSQRT | FSQRT | | | -| VCLASS | | | | -| VPOPC | | | | -| VADDI | | ADDI | | -| VSLI | | SLI | | -| VSRI | | SRI | | -| VANDI | | ANDI | | -| VORI | | ORI | | -| VXORI | | XORI | | -| VCLIPI | | | | -| VMADD | FMADD | | | -| VMSUB | FMSUB | | | -| VNMADD | FNMSUB | | | -| VNMSUB | FNMADD | | | -| VLD | FLD | LD | | -| VLDS | | LW | | -| VLDX | | LWU | | -| VST | FST | ST | | -| VSTS | | | | -| VSTX | | | | -| VAMOSWAP | | AMOSWAP | | -| VAMOADD | | AMOADD | | -| VAMOAND | | AMOAND | | -| VAMOOR | | AMOOR | | -| VAMOXOR | | AMOXOR | | -| VAMOMIN | | AMOMIN | | -| VAMOMAX | | AMOMAX | | - -Notes: - -* {1} retro-fit predication variants into branch instructions (base and C), - decoding triggered by CSR bit marking register as "Vector type". - -## TODO: sort - -> I suspect that the "hardware loop" in question is actually a zero-overhead -> loop unit that diverts execution from address X to address Y if a certain -> condition is met. - - not quite.  The zero-overhead loop unit interestingly would be at -an [independent] level above vector-length.  The distinctions are -as follows: - -* Vector-length issues *virtual* instructions where the register - operands are *specifically* altered (to cover a range of registers), - whereas zero-overhead loops *specifically* do *NOT* alter the operands - in *ANY* way. - -* Vector-length-driven "virtual" instructions are driven by *one* - and *only* one instruction (whether it be a LOAD, STORE, or pure - one/two/three-operand opcode) whereas zero-overhead loop units - specifically apply to *multiple* instructions. - -Where vector-length-driven "virtual" instructions might get conceptually -blurred with zero-overhead loops is LOAD / STORE.  In the case of LOAD / -STORE, to actually be useful, vector-length-driven LOAD / STORE should -increment the LOAD / STORE memory address to correspondingly match the -increment in the register bank.  example: - -* set vector-length for r0 to 4 -* issue RV32 LOAD from addr 0x1230 to r0 - -translates effectively to: - -* RV32 LOAD from addr 0x1230 to r0 -* ... -* ... 
-* RV32 LOAD from addr 0x123B to r3 - -# P-Ext ISA - -## 16-bit Arithmetic - -| Mnemonic | 16-bit Instruction | Simple-V Equivalent | -| ------------------ | ------------------------- | ------------------- | -| ADD16 rt, ra, rb | add | RV ADD (bitwidth=16) | -| RADD16 rt, ra, rb | Signed Halving add | | -| URADD16 rt, ra, rb | Unsigned Halving add | | -| KADD16 rt, ra, rb | Signed Saturating add | | -| UKADD16 rt, ra, rb | Unsigned Saturating add | | -| SUB16 rt, ra, rb | sub | RV SUB (bitwidth=16) | -| RSUB16 rt, ra, rb | Signed Halving sub | | -| URSUB16 rt, ra, rb | Unsigned Halving sub | | -| KSUB16 rt, ra, rb | Signed Saturating sub | | -| UKSUB16 rt, ra, rb | Unsigned Saturating sub | | -| CRAS16 rt, ra, rb | Cross Add & Sub | | -| RCRAS16 rt, ra, rb | Signed Halving Cross Add & Sub | | -| URCRAS16 rt, ra, rb| Unsigned Halving Cross Add & Sub | | -| KCRAS16 rt, ra, rb | Signed Saturating Cross Add & Sub | | -| UKCRAS16 rt, ra, rb| Unsigned Saturating Cross Add & Sub | | -| CRSA16 rt, ra, rb | Cross Sub & Add | | -| RCRSA16 rt, ra, rb | Signed Halving Cross Sub & Add | | -| URCRSA16 rt, ra, rb| Unsigned Halving Cross Sub & Add | | -| KCRSA16 rt, ra, rb | Signed Saturating Cross Sub & Add | | -| UKCRSA16 rt, ra, rb| Unsigned Saturating Cross Sub & Add | | - -## 8-bit Arithmetic - -| Mnemonic | 16-bit Instruction | Simple-V Equivalent | -| ------------------ | ------------------------- | ------------------- | -| ADD8 rt, ra, rb | add | RV ADD (bitwidth=8)| -| RADD8 rt, ra, rb | Signed Halving add | | -| URADD8 rt, ra, rb | Unsigned Halving add | | -| KADD8 rt, ra, rb | Signed Saturating add | | -| UKADD8 rt, ra, rb | Unsigned Saturating add | | -| SUB8 rt, ra, rb | sub | RV SUB (bitwidth=8)| -| RSUB8 rt, ra, rb | Signed Halving sub | | -| URSUB8 rt, ra, rb | Unsigned Halving sub | | +Unfortunately it is not possible to fit the full functionality +of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed) +require another operand (rs2) in addition to the operand width +(which is also missing), offset, base, and src/dest. + +However a close approximation may be achieved by taking the top bit +of the offset in each of the five types of LD (and ST), reducing the +offset to 4 bits and utilising the 5th bit to indicate whether "stride" +is to be enabled. In this way it is at least possible to introduce +that functionality. + +(**TODO**: *assess whether the loss of one bit from offset is worth having +"stride" capability.*) + +We also assume (including for the "stride" variant) that the "width" +parameter, which is missing, is derived and implicit, just as it is +with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD +and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for +C.FLW and C.FLD the width is implicitly 4 and 8 respectively. + +Interestingly we note that the Vectorised Simple-V variant of +LOAD/STORE (Compressed and otherwise), due to it effectively using the +standard register file(s), is the direct functional equivalent of +standard load-multiple and store-multiple instructions found in other +processors. + +In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on +page 76, "For virtual memory systems some data accesses could be resident +in physical memory and some not". The interesting question then arises: +how does RVV deal with the exact same scenario? +Expired U.S. 
Patent 5895501 (Filing Date Sep 3 1996) describes a method
+of detecting early page / segmentation faults and adjusting the TLB
+in advance accordingly; other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
 
 # Exceptions
 
@@ -1041,18 +844,69 @@ translates effectively to:
 
     than the destination, throw an exception.
 
 > And what about instructions like JALR? 
-> What does jumping to a vector do?
+> What does jumping to a vector do?
 
 * Throw an exception.  Whether that actually results in spawning threads
   as part of the trap-handling remains to be seen.
 
-# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
+# Implementing V on top of Simple-V
+
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential.  This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+  implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+  opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+  of Simple-V as, again, there is no corresponding RV Standard Extension
+  that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+  complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+  of fitting on top of the integer and floating-point files.  An
+  "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+  registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+  register file size is 5 bits (32 registers), whilst the "Extended"
+  variant of RVV specifies 8 bits (256 registers) and has yet to
+  be published.
+* One big difference: in Sections 17.12 and 17.17 there are only two possible
+  predication registers in RVV "Base".  Through the "indirect" method,
+  Simple-V provides a key-value CSR table that allows (arbitrarily)
+  up to 16 (TBD) of either the floating-point or integer registers to
+  be marked as "predicated" (key), and if so, which integer register to
+  use as the predication mask (value).
+
+**TODO**
+
+# Implementing P (renamed to DSP) on top of Simple-V
+
+* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
+  (caveat: anything not specified drops through to software-emulation / traps)
+* TODO
+
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
+
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
 
 This section compares the various parallelism proposals as they stand,
 including traditional SIMD, in terms of features, ease of implementation,
 complexity, flexibility, and die area.
 
-## [[alt_rvp]]
+### [[alt_rvp]]
 
 Primary benefit of Alt-RVP is the simplicity with which parallelism
 may be introduced (effective multiplication of regfiles and associated ALUs).
@@ -1075,16 +929,16 @@ may be introduced (effective multiplication of regfiles and associated ALUs).
* minus: Access to registers across multiple lanes is challenging. "Solution" is to drop data into memory and immediately back in again (like MMX). -## Simple-V +### Simple-V Primary benefit of Simple-V is the OO abstraction of parallel principles -from actual hardware. It's an API in effect that's designed to be -slotted in to an existing implementation (just after instruction decode) -with minimum disruption and effort. +from actual (internal) parallel hardware. It's an API in effect that's +designed to be slotted in to an existing implementation (just after +instruction decode) with minimum disruption and effort. -* minus: the complexity of having to use register renames, OoO, VLIW, - register file cacheing, all of which has been done before but is a - pain +* minus: the complexity (if full parallelism is to be exploited) + of having to use register renames, OoO, VLIW, register file cacheing, + all of which has been done before but is a pain * plus: transparent re-use of existing opcodes as-is just indirectly saying "this register's now a vector" which * plus: means that future instructions also get to be inherently @@ -1114,12 +968,12 @@ with minimum disruption and effort. would be "no worse" than existing register renaming, OoO, VLIW and register file cacheing schemes. -## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft) +### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft) RVV is extremely well-designed and has some amazing features, including 2D reorganisation of memory through LOAD/STORE "strides". -* plus: regular predictable workload means that implmentations may +* plus: regular predictable workload means that implementations may streamline effects on L1/L2 Cache. * plus: regular and clear parallel workload also means that lanes (similar to Alt-RVP) may be used as an implementation detail, @@ -1143,7 +997,7 @@ RVV is extremely well-designed and has some amazing features, including to be in high-performance specialist supercomputing (where it will be absolutely superb). -## Traditional SIMD +### Traditional SIMD The only really good things about SIMD are how easy it is to implement and get good performance. Unfortunately that makes it quite seductive... @@ -1180,20 +1034,20 @@ get good performance. Unfortunately that makes it quite seductive... * minor-saving-grace: some implementations *may* have predication masks that allow control over individual elements within the SIMD block. -# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals +## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals This section compares the various parallelism proposals as they stand, *against* traditional SIMD as opposed to *alongside* SIMD. In other words, the question is asked "How can each of the proposals effectively implement (or replace) SIMD, and how effective would they be"? -## [[alt_rvp]] +### [[alt_rvp]] * Alt-RVP would not actually replace SIMD but would augment it: just as with a SIMD architecture where the ALU becomes responsible for the parallelism, Alt-RVP ALUs would likewise be so responsible... with *additional* (lane-based) parallelism on top. 
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by +* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by at least one dimension are avoided (architectural upgrades introducing 128-bit then 256-bit then 512-bit variants of the exact same 64-bit SIMD block) @@ -1208,7 +1062,7 @@ the question is asked "How can each of the proposals effectively implement "swapping" instructions were then introduced, some of the disadvantages of SIMD could be mitigated. -## RVV +### RVV * RVV is designed to replace SIMD with a better paradigm: arbitrary-length parallelism. @@ -1216,7 +1070,7 @@ the question is asked "How can each of the proposals effectively implement DSPs with a focus on Multimedia (Audio, Video and Image processing), RVV's primary focus appears to be on Supercomputing: optimisation of mathematical operations that fit into the OpenCL space. -* Adding functions (operations) that would normally fit (in parallel) +* Adding functions (operations) that would normally fit (in parallel) into a SIMD instruction requires an equivalent to be added to the RVV Extension, if one does not exist. Given the specialist nature of some SIMD instructions (8-bit or 16-bit saturated or halving add), @@ -1224,7 +1078,7 @@ the question is asked "How can each of the proposals effectively implement implementation overhead of RVV were acceptable (compared to normal SIMD/DSP-style single-issue in-order simplicity). -## Simple-V +### Simple-V * Simple-V borrows hugely from RVV as it is intended to be easy to topologically transplant every single instruction from RVV (as @@ -1269,23 +1123,115 @@ the question is asked "How can each of the proposals effectively implement operations, all the while keeping a consistent ISA-level "API" irrespective of implementor design choices (or indeed actual implementations). -# Impementing V on top of Simple-V +### Example Instruction translation: -* Number of Offset CSRs extends from 2 -* Extra register file: vector-file -* Setup of Vector length and bitwidth CSRs now can specify vector-file - as well as integer or float file. -* TODO +Instructions "ADD r2 r4 r4" would result in three instructions being +generated and placed into the FIFO: -# Implementing P (renamed to DSP) on top of Simple-V +* ADD r2 r4 r4 +* ADD r2 r5 r5 +* ADD r2 r6 r6 -* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR - (caveat: anything not specified drops through to software-emulation / traps) -* TODO +## Example of vector / vector, vector / scalar, scalar / scalar => vector add -# Register reordering + register CSRvectorlen[XLEN][4]; # not quite decided yet about this one... + register CSRpredicate[XLEN][4]; # 2^4 is max vector length + register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well + register x[32][XLEN]; -## Register File + function op_add(rd, rs1, rs2, predr) + { +    /* note that this is ADD, not PADD */ +    int i, id, irs1, irs2; +    # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored +    # also destination makes no sense as a scalar but what the hell... +    for (i = 0, id=0, irs1=0, irs2=0; i + +One of the goals of this parallelism proposal is to avoid instruction +duplication. However, with the base ISA having been designed explictly +to *avoid* condition-codes entirely, shoe-horning predication into it +bcomes quite challenging. 
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set?  This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*!  The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not.  All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table  data="""
+31      | 30 .. 25  | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8  | 7       | 6 .. 0 |
+imm[12] | imm[10:5] | rs2      | rs1      | funct3   | imm[4:1] | imm[11] | opcode |
+1       | 6         | 5        | 5        | 3        | 4        | 1       | 7      |
+offset[12,10:5]    || src2     | src1     | BEQ      | offset[11,4:1]    || BRANCH |
+"""]]
+
+This would become:
+
+[[!table  data="""
+31      | 30 .. 25  | 24 .. 20 | 19 .. 15 | 14 .. 12 | 11 .. 8       | 7       | 6 .. 0 |
+imm[12] | imm[10:5] | rs2      | rs1      | funct3   | imm[4:1]      | imm[11] | opcode |
+1       | 6         | 5        | 5        | 3        | 4             | 1       | 7      |
+reserved          || src2      | src1     | BEQ      | predicate rs3          || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret it so as to add in not only a bit
+field distinguishing floating-point compare from integer compare, but also
+a second source register, and to use some of the bits as a predication
+target as well.
+
+[[!table  data="""
+15 .. 13 | 12 ....... 10 | 9 .. 7 | 6 ......... 2     | 1 .. 0 |
+funct3   | imm           | rs10   | imm               | op     |
+3        | 3             | 3      | 5                 | 2      |
+C.BEQZ   | offset[8,4:3] | src    | offset[7:6,2:1,5] | C1     |
+"""]]
+
+Now uses the CS format:
+
+[[!table  data="""
+15 .. 13 | 12 .. 10 | 9 .. 7 | 6 .. 5 | 4 .. 2 | 1 .. 0 |
+funct3   | imm      | rs10   | imm    |        | op     |
+3        | 3        | 3      | 2      | 3      | 2      |
+C.BEQZ   | pred rs3 | src1   | I/F B  | src2   | C1     |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators.
+
+## Register reordering
+
+### Register File
 
 | Reg Num | Bits |
 | ------- | ---- |
 | r0 | (32..0) |
 | r1 | (32..0) |
 | r2 | (32..0) |
 | r3 | (32..0) |
 | r4 | (32..0) |
 | r5 | (32..0) |
 | r6 | (32..0) |
 | r7 | (32..0) |
+| .. | (32..0) |
+| r31| (32..0) |
 
-## Vectorised CSR
+### Vectorised CSR
 
 May not be an actual CSR: it may be generated from the Vector Length CSR,
 as a single bit is less burdensome on the instruction decode phase.
 
 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
 | - | - | - | - | - | - | - | - |
 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
 
-## Vector Length CSR
+### Vector Length CSR
 
 | Reg Num | (3..0) |
 | ------- | ------ |
 | r0 | 2 |
 | r1 | 0 |
 | r2 | 1 |
 | r3 | 1 |
 | r4 | 3 |
 | r5 | 0 |
@@ -1320,7 +1268,9 @@ single-bit is less burdensome on instruction decode phase. 
+
+## Register reordering
+
+### Register File

 | Reg Num | Bits |
 | ------- | ---- |
@@ -1297,17 +1243,19 @@ the question is asked "How can each of the proposals effectively implement
 | r5      | (32..0) |
 | r6      | (32..0) |
 | r7      | (32..0) |
+| ..      | (32..0) |
+| r31     | (32..0) |

-## Vectorised CSR
+### Vectorised CSR

 May not be an actual CSR: may be generated from Vector Length CSR:
 single-bit is less burdensome on instruction decode phase.

 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
-| - | - | - | - | - | - | - | - |
+| - | - | - | - | - | - | - | - |
 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

-## Vector Length CSR
+### Vector Length CSR

 | Reg Num | (3..0) |
 | ------- | ---- |
@@ -1320,7 +1268,9 @@ single-bit is less burdensome on instruction decode phase.
 | r6      | 0 |
 | r7      | 1 |

-## Virtual Register Reordering:
+### Virtual Register Reordering
+
+This example assumes the Vector Length CSR table above.

 | Reg Num | Bits (0) | Bits (1) | Bits (2) |
 | ------- | -------- | -------- | -------- |
@@ -1330,16 +1280,78 @@ single-bit is less burdensome on instruction decode phase.
 | r4      | (32..0)  | (32..0)  | (32..0)  |
 | r7      | (32..0)  |

-## Example Instruction translation:
-
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
-## Insights
+### Bitwidth Virtual Register Reordering
+
+This example goes a little further and illustrates the effect of a
+bitwidth CSR having been set on a register. Preconditions:
+
+* RV32 assumed
+* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
+* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
+* vsetl rs1, 5 # set the vector length to 5
+
+This is interpreted as follows:
+
+* Given that the context is RV32, ELEN=32.
+* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2.
+* Therefore the actual vector length is up to *six* elements (3 x 2).
+* However vsetl sets a length of 5, therefore the last "element" is skipped.
+
+So when using an operation that uses r2 as a source (or destination)
+the operation is carried out as follows:
+
+* 16-bit operation on r2(15..0) - vector element index 0
+* 16-bit operation on r2(31..16) - vector element index 1
+* 16-bit operation on r3(15..0) - vector element index 2
+* 16-bit operation on r3(31..16) - vector element index 3
+* 16-bit operation on r4(15..0) - vector element index 4
+* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
+
+Predication has been left out of the above example for simplicity; in
+effect, predication is simply ANDed with the stages above (the case of
+vsetl being less than the maximum capacity behaves as if the final
+elements were predicated out).
+
+Note also that it is entirely an implementor's choice as to whether to have
+actual separate ALUs down to the minimum bitwidth, or whether to have
+something more akin to traditional SIMD (at any level of subdivision:
+8-bit SIMD operations carried out 32 bits at a time is perfectly
+acceptable, as is 8-bit SIMD operations carried out 16 bits at a time,
+requiring two ALUs). Regardless of the internal parallelism choice,
+*predication must still be respected*, making Simple-V in effect the
+"consistent public API".
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0):
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)        # integer-div rounded down
+        elemidx = i % simdmult              # integer-remainder
+        bitwidth = bytesperreg * 8          # width of one element in bits
+        return rnum + regidx,               # actual real register
+               elemidx * bitwidth,          # low
+               elemidx * bitwidth + bitwidth - 1, # high
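+
+As a cross-check, here is a runnable version of the same calculation
+(Python purely for illustration; the values are those of the worked
+example above: RV32 so XLEN=32, vew=010 i.e. 16-bit, vsetl length 5):
+
+    XLEN = 32
+    bytestable = [XLEN // 8, 1, 2, 4, 8, 16]    # indexed by vew encoding
+
+    def regoffs(rnum, i, vew):
+        bytesperelem = bytestable[vew]
+        simdmult = (XLEN // 8) // bytesperelem  # elements per register
+        bitwidth = bytesperelem * 8
+        regidx, elemidx = divmod(i, simdmult)
+        low = elemidx * bitwidth
+        return rnum + regidx, low, low + bitwidth - 1
+
+    for i in range(5):  # (2,0,15) (2,16,31) (3,0,15) (3,16,31) (4,0,15)
+        print(regoffs(2, i, 0b010))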
+
+### Insights

 SIMD register file splitting still to consider. For RV64, benefits of
 doubling (quadrupling in the case of Half-Precision IEEE754 FP) the apparent
@@ -1360,7 +1372,7 @@ on caches).
 Interestingly we observe then that, with Simple-V, the parallelism of the
 underlying hardware is an implementor-choice that could just as equally
 be applied *without* Simple-V even being implemented.

-# Analysis of CSR decoding on latency
+## Analysis of CSR decoding on latency

 It could indeed have been logically deduced (or expected), that there
 would be additional decode latency in this proposal, because if
@@ -1436,7 +1448,7 @@ So the question boils down to:

 Whilst the above may seem to be severe minuses, there are some strong
 pluses:

-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
 * Smaller reduction of P's opcode space: around 10%.
 * The potential to use Compressed instructions in both Vector and SIMD
   due to the overloading of register meaning (implicit vectorisation,
@@ -1448,9 +1460,7 @@ pluses:
   parallel ALUs) is only equal to one ("virtual" parallelism), or is
   greater than one, should not be underestimated.

-# Appendix
-
-# Reducing Register Bank porting
+## Reducing Register Bank porting

 This looks quite reasonable.
@@ -1467,7 +1477,205 @@ The nice thing about a vector architecture is that you *know* that
 to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
 by *introducing* deliberate latency into the execution phase.

+## Overflow registers in combination with predication
+
+**TODO**: propose overflow registers be actually one of the integer regs
+(flowing to multiple regs).
+
+**TODO**: propose "mask" (predication) registers likewise. The combination
+with standard RV instructions and overflow registers is extremely powerful:
+see the Aspex ASP.
+
+When integer overflow is stored in an easily-accessible bit (or another
+register), parallelisation turns this into a group of bits which can
+potentially be interacted with in predication, in interesting and powerful
+ways. For example, by taking the integer-overflow result as a predication
+field and shifting it by one, a predicated vectorised "add one" can emulate
+"carry" on arbitrary (unlimited) length addition.
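+
+A sketch of that idea (not part of this proposal; Python used purely for
+illustration, the vector layout and names being assumptions): the overflow
+bits, shifted up by one element, drive a repeated predicated "add one"
+pass:
+
+    XLEN = 32
+    MASK = (1 << XLEN) - 1
+
+    # element 0 is least significant; a and b are equal-length vectors
+    def vector_add_with_carry(a, b):
+        sums = [x + y for x, y in zip(a, b)]
+        result = [s & MASK for s in sums]
+        pred = [0] + [s >> XLEN for s in sums][:-1]  # overflow, shifted
+        while any(pred):                  # predicated "add one", repeated
+            sums = [r + p for r, p in zip(result, pred)]
+            result = [s & MASK for s in sums]
+            pred = [0] + [s >> XLEN for s in sums][:-1]
+        return result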
+
+However despite RVV having made room for floating-point exceptions, neither
+RVV nor base RV has taken integer-overflow (carry) into account, which
+makes proposing it quite challenging given that the relevant (Base) RV
+sections are frozen. Consequently it makes sense to forgo this feature.
+
+## Virtual Memory page-faults on LOAD/STORE
+
+### Notes from conversations
+
+> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
+> riscv-isa-manual in order to work out how to re-map RVV onto the standard
+> ISA, and came across some interesting comments at the bottom of pages 75
+> and 76:

+> "A common mechanism used in other ISAs to further reduce save/restore
+> code size is load-multiple and store-multiple instructions."

+> Fascinatingly, due to Simple-V proposing to use the *standard* register
+> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
+> that: load-multiple and store-multiple instructions. Which brings us
+> on to this comment:

+> "For virtual memory systems, some data accesses could be resident in
+> physical memory and some could not, which requires a new restart
+> mechanism for partially executed instructions."

+> Which then of course brings us to the interesting question: how does RVV
+> cope with the scenario when, particularly with LD.X (Indexed / indirect
+> loads), part-way through the loading a page fault occurs?

+> Has this been noted or discussed before?
+
+For applications-class platforms, the RVV exception model is
+element-precise (that is, if an exception occurs on element j of a
+vector instruction, elements 0..j-1 have completed execution and elements
+j+1..vl-1 have not executed).
+
+Certain classes of embedded platforms where exceptions are always fatal
+might choose to offer resumable/swappable interrupts but not precise
+exceptions.
+
+
+> Is RVV designed in any way to be re-entrant?
+
+Yes.
+
+
+> What would the implications be for instructions that were in a FIFO at
+> the time, in out-of-order and VLIW implementations, where partial decode
+> had taken place?
+
+The usual bag of tricks for maintaining precise exceptions applies to
+vector machines as well. Register renaming makes the job easier, and
+it's relatively cheaper for vectors, since the control cost is amortized
+over longer registers.
+
+
+> Would it be reasonable at least to say *bypass* (and freeze) the
+> instruction FIFO (drop down to a single-issue execution model temporarily)
+> for the purposes of executing the instructions in the interrupt (whilst
+> setting up the VM page), then re-continue the instruction with all
+> state intact?
+
+This approach has been done successfully, but it's desirable to be
+able to swap out the vector unit state to support context switches on
+exceptions that result in long-latency I/O.
+
+
+> Or would it be better to switch to an entirely separate secondary
+> hyperthread context?
+
+> Does anyone have any ideas or know if there is any academic literature
+> on solutions to this problem?
+
+The Vector VAX offered imprecise but restartable and swappable exceptions:
+http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf
+
+Sec. 4.6 of Krste's dissertation assesses some of
+the tradeoffs and references a bunch of related work:
+http://people.eecs.berkeley.edu/~krste/thesis.pdf
+
+
+----
+
+Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 F.P
+exceptions" and thought, "hmmm that could go into a CSR, must re-read
+the section on FP state CSRs in RVV 0.4-Draft again" then I suddenly
+thought, "ah ha! what if the memory exceptions, instead of an immediate
+exception being thrown, were simply stored in a type of predication
+bit-field with a flag "error: this element failed"?
+
+Then, *after* the vector load (or store, or even operation) was
+performed, you could *then* raise an exception, at which point it
+would be possible (yes in software... I know....) to go "hmmm, these
+indexed operations didn't work, let's get them into memory by triggering
+page-loads", then *re-run the entire instruction* but this time with a
+"memory-predication CSR" that stops the already-performed operations
+(whether they be loads, stores or an arithmetic / FP operation) from
+being carried out a second time.
+
+This theoretically could end up being done multiple times in an SMP
+environment, and also for LD.X there would be the remote (and annoying)
+possibility that the indexed memory address could end up being modified.
+
+The advantage would be that the order of execution need not be
+sequential, which could potentially be a big win.
+Am still thinking through the implications as any dependent operations
+(particularly ones already decoded and moved into the execution FIFO)
+would still be there (and stalled). hmmm.
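+
+The musings above might look, in illustrative Python, something like the
+following (a sketch only: the names, the dict-as-page-table trick and the
+software fault handler are all assumptions, not a proposed implementation):
+
+    # Faulting elements set a bit in a "memory-predication" field instead
+    # of trapping immediately; the faults are serviced *after* the sweep,
+    # and the instruction is re-run with already-performed elements
+    # masked out so that they are not carried out a second time.
+    def indexed_load_with_replay(mem, base, indices, service_page_fault):
+        vlen = len(indices)
+        done = [False] * vlen          # the "memory-predication" bits
+        result = [0] * vlen
+        while not all(done):
+            faulted = []
+            for i in range(vlen):
+                if done[i]:
+                    continue           # already performed: skip on re-run
+                addr = base + indices[i]
+                if addr in mem:        # stands in for "page resident"
+                    result[i] = mem[addr]
+                    done[i] = True
+                else:
+                    faulted.append(addr)
+            for addr in faulted:
+                service_page_fault(mem, addr)  # must make addr resident
+        return result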
+
+----
+
+ > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+ > > VSETL r0, 8
+ > > FADD x1, x2, x3
+ >
+ > > x3[0]: ok
+ > > x3[1]: exception
+ > > x3[2]: ok
+ > > ...
+ > > ...
+ > > x3[7]: ok
+ >
+ >  (you replied:)
+ >
+ > Thrown away.
+
+Discussion then led to the question of OoO architectures:
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it.  It's hard to debug.  You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility.  (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+
+## Implementation Paradigms
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way through to high-performance systems:
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+  required; however when a register is used that is detected (in hardware)
+  to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comprehensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: a simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to mean "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Avoiding destruction of destination indices requires a copy of the
+entire vector in advance.
+
+TBD: floating-point compare and other exception handling

 # References

@@ -1493,3 +1701,9 @@ by *introducing* deliberate latency into the execution phase.
 * Multi-ported VLIW Register File Implementation
 * Fast context save/restore proposal
 * Register File Bank Cacheing
+* Expired Patent on Vector Virtual Memory solutions
+
+* Discussion on RVV "re-entrant" capabilities allowing operations to be
+  restarted if an exception occurs (VM page-table miss)
+
+* Dot Product Vector