X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=1f479badad0209ee49ebd85864e2376ee9613fb4;hb=a537f8e87eaf740e6eadb4517e1f93c8112bb3cb;hp=ec979a35869274e836676ea714e8869c02423b6c;hpb=a89370d558b165d7aa849c40024165618a5cccf0;p=libreriscv.git

diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index ec979a358..1f479bada 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,9 +1,5 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
-[[!toc ]]
-
-# Summary
-
 Key insight: Simple-V is intended as an abstraction layer to provide a
 consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -16,6 +12,8 @@ of Out-of-order restructuring (including parallel ALU lanes) or VLIW
 implementations, or SIMD, or anything else, would then benefit *if*
 Simple-V was added on top.
 
+[[!toc ]]
+
 # Introduction
 
 This proposal exists so as to be able to satisfy several disparate
@@ -52,35 +50,21 @@ Additionally it makes sense to *split out* the parallelism inherent within
 each of P and V, and to see if each of P and V then, in *combination* with
 a "best-of-both" parallelism extension, could be added *on top* of
 this proposal, to topologically provide the exact same functionality of
-each of P and V.
+each of P and V.  Each of P and V can then focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
 
 Furthermore, an additional goal of this proposal is to reduce the number
 of opcodes utilised by each of P and V as they currently stand, leveraging
 existing RISC-V opcodes where possible, and also potentially allowing
 P and V to make use of Compressed Instructions as a result.
 
-**TODO**: reword this to better suit this document:
-
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted.  I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
-  numbers) and unusual operations of V to create a single-issue "Esoteric
-  Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
-  of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
-  apply across *all* Extensions, whether those be DSP, M, Base, V, P -
-  everything.
-
 **TODO**: propose overflow registers be actually one of the integer regs
 (flowing to multiple regs).
 
 **TODO**: propose "mask" (predication) registers likewise.  In combination with
-standard RV instructions and overflow registers extremely powerful
+standard RV instructions and overflow registers this would be extremely
+powerful; see Aspex ASP.
 
 # Analysis and discussion of Vector vs SIMD
 
@@ -109,6 +93,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be
 acceptable for general-purpose computation, and in the context of developing
 a general-purpose ISA, is never going to satisfy 100 percent of implementors.
+To explain this further: as workloads and performance requirements increase
+over time for new target markets, implementors choose to extend the SIMD
+width (so as to again avoid mixing parallelism into the instruction issue
+phases: the primary "simplicity" benefit of SIMD in the first place).  The
+result is that the entire opcode space effectively doubles with each new
+SIMD width that's added to the ISA.
+
 That basically leaves "variable-length vector" as the clear *general-purpose*
 winner, at least in terms of greatly simplifying the instruction set, reducing
 the number of instructions required for any given task, and thus
@@ -144,13 +135,16 @@ integer (and floating point) of various sizes is automatically inferred
 due to "type tagging" that is set with a special instruction.  A register
 will be *specifically* marked as "16-bit Floating-Point" and, if added to
 an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
 to be explicitly done with its own separate instruction.
 
 However, implicit type-conversion is not only quite burdensome to
 implement (explosion of inferred type-to-type conversion) but also is
 never really going to be complete.  It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration.  Each new type increases the
+O(N^2) conversion space and, as anyone who has examined python's source
+code (which has built-in polymorphic type-conversion) knows, the task is
+more complex than it first seems.
 
 Overall, type-conversion is generally best to leave to explicit
 type-conversion instructions, or in definite specific use-cases left to
@@ -228,89 +222,20 @@ condition-codes or predication.  By adding a CSR it becomes possible
 to also tag certain registers as "predicated if referenced as a destination".
 Example:
 
-    // in future operations if r0 is the destination use r5 as
+    // from now on, if r0 is the destination use r5 as
     // the PREDICATION register
-    IMPLICICSRPREDICATE r0, r5
+    SET_IMPLICIT_CSRPREDICATE r0, r5
     // store the compares in r5 as the PREDICATION register
    CMPEQ8 r5, r1, r2
     // r0 is used here.  ah ha!  that means it's predicated using r5!
     ADD8 r0, r1, r3
 
-With enough registers (and there are enough registers) some fairly
+With enough registers (and in RISC-V there are enough registers) some fairly
 complex predication can be set up and yet still execute without significant
 stalling, even in a simple non-superscalar architecture.
 
-### Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explictly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-bcomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLT would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*!
The difference is that FEQ, FLT -and FLE *specifically* write a 1 to an integer register if the condition -holds, and 0 if not. All that needs to be done here is to say, "if -the integer register is tagged with a bit that says it is a predication -register, the **bit** in the integer register is set based on the -current vector index" instead. - -There is, in the standard Conditional Branch instruction, more than -adequate space to interpret it in a similar fashion: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | -"""]] - -This would become: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | -"""]] - -Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, -with the interesting side-effect that there is space within what is presently -the "immediate offset" field to reinterpret that to add in not only a bit -field to distinguish between floating-point compare and integer compare, -not only to add in a second source register, but also use some of the bits as -a predication target as well. - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | op | - 3 | 3 | 3 | 5 | 2 | - C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | -"""]] - -Now uses the CS format: - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | | op | - 3 | 3 | 3 | 2 | 3 | 2 | - C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 | -"""]] - -Bit 6 would be decoded as "operation refers to Integer or Float" including -interpreting src1 and src2 accordingly as outlined in Table 12.2 of the -"C" Standard, version 2.0, -whilst Bit 5 would allow the operation to be extended, in combination with -funct3 = 110 or 111: a combination of four distinct (predicated) comparison -operators. In both floating-point and integer cases those could be -EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2). +(For details on how Branch Instructions would be retro-fitted to indirectly +predicated equivalents, see Appendix) ## Conclusions @@ -343,16 +268,215 @@ requirements would therefore seem to be a logical thing to do. # Instruction Format -**TODO** *basically borrow from both P and V, which should be quite simple -to do, with the exception of Tag/no-tag, which needs a bit more -thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS -gather-scatterer, and, if implemented, could actually be a really useful -way to span 8-bit up to 64-bit groups of data, where BGS as it stands -and described by Clifford does **bits** of up to 16 width. Lots to -look at and investigate* +The instruction format for Simple-V does not actually have *any* compare +operations, *any* arithmetic, floating point or memory instructions. +Instead it *overloads* pre-existing branch operations into predicated +variants, and implicitly overloads arithmetic operations and LOAD/STORE +depending on implicit CSR configurations for both vector length and +bitwidth. This includes Compressed instructions. 
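
To make the implicit overloading concrete, here is a minimal sketch in plain
C (an illustration only: the names interp_add, CSRvectorlen, CSRpredicate and
regs are invented for this sketch and are not part of the proposal, and
element-bitwidth handling is omitted for brevity) of how a standard scalar
ADD might behave when its destination register has been tagged as a vector
in the CSRs:

    #include <stdint.h>
    #include <stdbool.h>

    #define NREGS 32
    uint64_t regs[NREGS];          /* integer register file                    */
    int      CSRvectorlen[NREGS];  /* 1 = scalar, >1 = vector of that length   */
    int      CSRpredicate[NREGS];  /* 0 = unpredicated, else predicate regnum  */

    /* illustrative-only interpretation of "ADD rd, rs1, rs2" under Simple-V;
       bounds/wrap-around of the register block is not addressed here */
    void interp_add(int rd, int rs1, int rs2)
    {
        int vl = CSRvectorlen[rd];
        if (vl <= 1) {                        /* not tagged: plain scalar ADD */
            regs[rd] = regs[rs1] + regs[rs2];
            return;
        }
        for (int i = 0; i < vl; i++) {        /* tagged: loop over elements   */
            bool enabled = true;
            if (CSRpredicate[rd] != 0)        /* per-element predicate bit    */
                enabled = (regs[CSRpredicate[rd]] >> i) & 1;
            if (enabled)
                regs[rd + i] = regs[rs1 + i] + regs[rs2 + i];
        }
    }

The same single scalar opcode thus covers the scalar, vector and
predicated-vector cases purely through CSR state, which is the point of
having no new arithmetic opcodes at all.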
 * For analysis of RVV see [[v_comparative_analysis]] which begins to
 outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+  for the format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and whether explicit prefixes
+would be a better idea instead.  Parallelism may therefore effectively end
+up always being 64-bit opcodes (32 for the prefix, 32 for the instruction),
+with some opportunities to use Compressed instructions, bringing it down
+to 48.  Also to consider is whether one or both of the last two remaining
+Compressed instruction codes in Quadrant 1 could be used as a parallelism
+prefix, bringing parallelised opcodes down to 32-bit and having the benefit
+of being explicit.*
+
+## Branch Instruction
+
+[[!table  data="""
+31      | 30 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 ..  8 | 7       | 6 ... 0 |
+imm[12] | imm[10:5]| rs2      | rs1      | funct3   | imm[4:1] | imm[11] | opcode  |
+1       | 6        | 5        | 5        | 3        | 4        | 1       | 7       |
+I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+0   | reserved | src2 | src1 | 000 | predicate rs3 || BEQ    |
+0   | reserved | src2 | src1 | 001 | predicate rs3 || BNE    |
+0   | reserved | src2 | src1 | 010 | predicate rs3 || rsvd   |
+0   | reserved | src2 | src1 | 011 | predicate rs3 || rsvd   |
+0   | reserved | src2 | src1 | 100 | predicate rs3 || BLE    |
+0   | reserved | src2 | src1 | 101 | predicate rs3 || BGE    |
+0   | reserved | src2 | src1 | 110 | predicate rs3 || BLTU   |
+0   | reserved | src2 | src1 | 111 | predicate rs3 || BGEU   |
+1   | reserved | src2 | src1 | 000 | predicate rs3 || FEQ    |
+1   | reserved | src2 | src1 | 001 | predicate rs3 || FNE    |
+1   | reserved | src2 | src1 | 010 | predicate rs3 || rsvd   |
+1   | reserved | src2 | src1 | 011 | predicate rs3 || rsvd   |
+1   | reserved | src2 | src1 | 100 | predicate rs3 || FLT    |
+1   | reserved | src2 | src1 | 101 | predicate rs3 || FLE    |
+1   | reserved | src2 | src1 | 110 | predicate rs3 || rsvd   |
+1   | reserved | src2 | src1 | 111 | predicate rs3 || rsvd   |
+"""]]
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i 1;
+      s2 = CSRvectorlen[src2] > 1;
+      for (int i=0; i r(N..N+M-1)
 * Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
 * Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
 
 Notes:
 
@@ -860,6 +985,78 @@ This section has been moved to its own page [[p_comparative_analysis]]
          irs2 += 1;
 }
 
+## Retro-fitting Predication into branch-explicit ISA
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However, what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set?  This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted;
+however, these are effectively *already* in the precise form needed and
+do not need to be converted *at all*!  The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not.
+All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table  data="""
+  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7       | 6 ....... 0 |
+imm[12] | imm[10:5]  |   rs2    |    rs1    |    funct3    |  imm[4:1]    | imm[11] | opcode      |
+   1    |     6      |    5     |     5     |       3      |     4        |    1    |   7         |
+   offset[12,10:5]  || src2     |   src1    |   BEQ        |  offset[11,4:1]       || BRANCH      |
+"""]]
+
+This would become:
+
+[[!table  data="""
+  31    |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7       | 6 ....... 0 |
+imm[12] | imm[10:5]  |   rs2    |    rs1    |    funct3    |  imm[4:1]    | imm[11] | opcode      |
+   1    |     6      |    5     |     5     |       3      |     4        |    1    |   7         |
+   reserved         || src2     |   src1    |   BEQ        |  predicate rs3        || BRANCH      |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that what is presently the "immediate
+offset" field has enough space to be reinterpreted so as to add in not
+only a bit field to distinguish between floating-point compare and
+integer compare, but also a second source register, and to use some of
+the bits as a predication target as well.
+
+[[!table  data="""
+15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+   funct3    |       imm          |   rs10   |        imm            |   op   |
+      3      |        3           |    3     |         5             |    2   |
+   C.BEQZ    |   offset[8,4:3]    |   src    |  offset[7:6,2:1,5]    |   C1   |
+"""]]
+
+This now uses the CS format:
+
+[[!table  data="""
+15 ...... 13 | 12 ...........  10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+   funct3    |       imm          |   rs10   |  imm   |              |   op   |
+      3      |        3           |    3     |   2    |      3       |    2   |
+   C.BEQZ    |  predicate rs3     |   src1   |  I/F B |     src2     |   C1   |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float", including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators.  In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+
 ## Register reordering
 
 ### Register File
@@ -926,6 +1123,7 @@ This is interpreted as follows:
 
 * Given that the context is RV32, ELEN=32.
 * With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
 * Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length of 5, therefore the last "element" is skipped
 
 So when using an operation that uses r2 as a source (or destination)
 the operation is carried out as follows:
 
@@ -949,6 +1147,38 @@ operations carried out 32-bits at a time is perfectly acceptable, as is
 
 Regardless of the internal parallelism choice, *predication must still
 be respected*, making Simple-V in effect the "consistent public API".
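
To illustrate that last point, here is a short C sketch of the
*architecturally visible* behaviour (the names vadd16, get_half, set_half,
pred and vl are invented for this example and are not part of the proposal):
however the hardware chooses to group the packed 16-bit elements internally,
the result must be as if each element's predicate bit had been tested
individually:

    #include <stdint.h>

    /* two 16-bit elements are packed into each 32-bit register */
    static uint16_t get_half(uint32_t *rf, int elt)
    {
        return (uint16_t)(rf[elt / 2] >> ((elt % 2) * 16));
    }

    static void set_half(uint32_t *rf, int elt, uint16_t v)
    {
        int sh = (elt % 2) * 16;
        rf[elt / 2] = (rf[elt / 2] & ~(0xffffu << sh)) | ((uint32_t)v << sh);
    }

    /* predicated 16-bit vector add over vl elements, one predicate bit each */
    void vadd16(uint32_t *rd, uint32_t *rs1, uint32_t *rs2,
                uint32_t pred, int vl)
    {
        for (int i = 0; i < vl; i++) {
            if ((pred >> i) & 1)  /* masked-out elements are never written */
                set_half(rd, i, (uint16_t)(get_half(rs1, i) + get_half(rs2, i)));
        }
    }

In the example above (two 16-bit elements per 32-bit register, vsetl of 5),
vl would be 5, so the sixth packed element slot is simply never touched.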
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default  |
+| 001 | 8        |
+| 010 | 16       |
+| 011 | 32       |
+| 100 | 64       |
+| 101 | 128      |
+| 110 | rsvd     |
+| 111 | rsvd     |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0):
+        bytesperreg = (XLEN/8)                # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] / 8     # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg         # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)             # integer-div rounded down
+        byteidx = (i % simdmult) * bytesperreg   # byte offset within the register
+        return rnum + regidx,                    # actual real register
+               byteidx * 8,                      # low bit of the element
+               byteidx * 8 + bytesperreg*8 - 1,  # high bit of the element
+
 ### Example Instruction translation:
 
 Instructions "ADD r2 r4 r4" would result in three instructions being