From: Luke Kenneth Casson Leighton Date: Tue, 17 Apr 2018 04:01:51 +0000 (+0100) Subject: shuffle X-Git-Tag: convert-csv-opcode-to-binary~5635 X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=050786c1b84f5a03bac2b5f0df7057b100807e5e;p=libreriscv.git shuffle --- diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn index 9bfc9ccc1..f077c5758 100644 --- a/simple_v_extension.mdwn +++ b/simple_v_extension.mdwn @@ -93,12 +93,12 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable for general-purpose computation, and in the context of developing a general-purpose ISA, is never going to satisfy 100 percent of implementors. -Worse, for increased workloads over time, as the performance requirements -increase for new target markets, implementors choose to extend the SIMD -width (so as to again avoid mixing parallelism into the instruction issue -phases: the primary "simplicity" benefit of SIMD in the first place), -with the result that the entire opcode space effectively doubles -with each new SIMD width that's added to the ISA. +To explain this further: for increased workloads over time, as the +performance requirements increase for new target markets, implementors +choose to extend the SIMD width (so as to again avoid mixing parallelism +into the instruction issue phases: the primary "simplicity" benefit of +SIMD in the first place), with the result that the entire opcode space +effectively doubles with each new SIMD width that's added to the ISA. That basically leaves "variable-length vector" as the clear *general-purpose* winner, at least in terms of greatly simplifying the instruction set, @@ -141,7 +141,10 @@ to be explicitly done with its own separate instruction. However, implicit type-conversion is not only quite burdensome to implement (explosion of inferred type-to-type conversion) but also is never really going to be complete. It gets even worse when bit-widths -also have to be taken into consideration. 
+also have to be taken into consideration. Each new type results in +an increased O(N^2) conversion space that, as anyone who has examined +python's source code (which has built-in polymorphic type-conversion), +knows that the task is more complex than it first seems. Overall, type-conversion is generally best to leave to explicit type-conversion instructions, or in definite specific use-cases left to @@ -219,89 +222,20 @@ condition-codes or predication. By adding a CSR it becomes possible to also tag certain registers as "predicated if referenced as a destination". Example: - // in future operations if r0 is the destination use r5 as + // in future operations from now on, if r0 is the destination use r5 as // the PREDICATION register - IMPLICICSRPREDICATE r0, r5 + SET_IMPLICIT_CSRPREDICATE r0, r5 // store the compares in r5 as the PREDICATION register CMPEQ8 r5, r1, r2 // r0 is used here. ah ha! that means it's predicated using r5! ADD8 r0, r1, r3 -With enough registers (and there are enough registers) some fairly +With enough registers (and in RISC-V there are enough registers) some fairly complex predication can be set up and yet still execute without significant stalling, even in a simple non-superscalar architecture. -### Retro-fitting Predication into branch-explicit ISA - -One of the goals of this parallelism proposal is to avoid instruction -duplication. However, with the base ISA having been designed explictly -to *avoid* condition-codes entirely, shoe-horning predication into it -bcomes quite challenging. - -However what if all branch instructions, if referencing a vectorised -register, were instead given *completely new analogous meanings* that -resulted in a parallel bit-wise predication register being set? This -would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE, -BLT and BGE. 
- -We might imagine that FEQ, FLT and FLT would also need to be converted, -however these are effectively *already* in the precise form needed and -do not need to be converted *at all*! The difference is that FEQ, FLT -and FLE *specifically* write a 1 to an integer register if the condition -holds, and 0 if not. All that needs to be done here is to say, "if -the integer register is tagged with a bit that says it is a predication -register, the **bit** in the integer register is set based on the -current vector index" instead. - -There is, in the standard Conditional Branch instruction, more than -adequate space to interpret it in a similar fashion: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | -"""]] - -This would become: - -[[!table data=""" - 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | -imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | - 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | - reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | -"""]] - -Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, -with the interesting side-effect that there is space within what is presently -the "immediate offset" field to reinterpret that to add in not only a bit -field to distinguish between floating-point compare and integer compare, -not only to add in a second source register, but also use some of the bits as -a predication target as well. - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | op | - 3 | 3 | 3 | 5 | 2 | - C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | -"""]] - -Now uses the CS format: - -[[!table data=""" -15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 
5 | 4......... 2 | 1 .. 0 | - funct3 | imm | rs10 | imm | | op | - 3 | 3 | 3 | 2 | 3 | 2 | - C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 | -"""]] - -Bit 6 would be decoded as "operation refers to Integer or Float" including -interpreting src1 and src2 accordingly as outlined in Table 12.2 of the -"C" Standard, version 2.0, -whilst Bit 5 would allow the operation to be extended, in combination with -funct3 = 110 or 111: a combination of four distinct (predicated) comparison -operators. In both floating-point and integer cases those could be -EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2). +(For details on how Branch Instructions would be retro-fitted to indirectly +predicated equivalents, see Appendix) ## Conclusions @@ -344,6 +278,18 @@ look at and investigate* * For analysis of RVV see [[v_comparative_analysis]] which begins to outline topologically-equivalent mappings of instructions +* Also see Appendix "Retro-fitting Predication into branch-explicit ISA" + for format of Branch opcodes. + +**TODO**: *analyse and decide whether the implicit nature of predication +as proposed is or is not a lot of hassle, and if explicit prefixes are +a better idea instead. Parallelism therefore effectively may end up +as always being 64-bit opcodes (32 for the prefix, 32 for the instruction) +with some opportunities for to use Compressed bringing it down to 48. +Also to consider is whether one or both of the last two remaining Compressed +instruction codes in Quadrant 1 could be used as a parallelism prefix, +bringing parallelised opcodes down to 32-bit and having the benefit of +being explicit.* # Note on implementation of parallelism @@ -851,6 +797,78 @@ This section has been moved to its own page [[p_comparative_analysis]]          irs2 += 1; } +## Retro-fitting Predication into branch-explicit ISA + +One of the goals of this parallelism proposal is to avoid instruction +duplication. 
However, with the base ISA having been designed explictly +to *avoid* condition-codes entirely, shoe-horning predication into it +bcomes quite challenging. + +However what if all branch instructions, if referencing a vectorised +register, were instead given *completely new analogous meanings* that +resulted in a parallel bit-wise predication register being set? This +would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE, +BLT and BGE. + +We might imagine that FEQ, FLT and FLT would also need to be converted, +however these are effectively *already* in the precise form needed and +do not need to be converted *at all*! The difference is that FEQ, FLT +and FLE *specifically* write a 1 to an integer register if the condition +holds, and 0 if not. All that needs to be done here is to say, "if +the integer register is tagged with a bit that says it is a predication +register, the **bit** in the integer register is set based on the +current vector index" instead. + +There is, in the standard Conditional Branch instruction, more than +adequate space to interpret it in a similar fashion: + +[[!table data=""" + 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 | +imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | + 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | + offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | +"""]] + +This would become: + +[[!table data=""" +31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 
0 | +imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | +1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | +reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | +"""]] + +Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, +with the interesting side-effect that there is space within what is presently +the "immediate offset" field to reinterpret that to add in not only a bit +field to distinguish between floating-point compare and integer compare, +not only to add in a second source register, but also use some of the bits as +a predication target as well. + +[[!table data=""" +15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 | + funct3 | imm | rs10 | imm | op | + 3 | 3 | 3 | 5 | 2 | + C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | +"""]] + +Now uses the CS format: + +[[!table data=""" +15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 | + funct3 | imm | rs10 | imm | | op | + 3 | 3 | 3 | 2 | 3 | 2 | + C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 | +"""]] + +Bit 6 would be decoded as "operation refers to Integer or Float" including +interpreting src1 and src2 accordingly as outlined in Table 12.2 of the +"C" Standard, version 2.0, +whilst Bit 5 would allow the operation to be extended, in combination with +funct3 = 110 or 111: a combination of four distinct (predicated) comparison +operators. In both floating-point and integer cases those could be +EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2). + ## Register reordering ### Register File
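
Note on the retro-fitted compressed branch format: as a rough illustration of the C.BEQZ/C.BNEZ layout that this patch moves into the Appendix (the CS-style table with `predicate rs3`, `src1`, `I/F B`, `src2`), the following is a minimal sketch in C, not the proposal's implementation. The names `cbranch_fields`, `decode_cbranch` and `vec_cmpeq_predicate` are hypothetical, as is the choice of a 64-bit predicate mask. The polarity of the Integer/Float bit is not stated in the text, so the sketch extracts it without interpreting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the retro-fitted 16-bit compressed branch, following the
 * CS-style table in the Appendix. All names here are hypothetical. */
typedef struct {
    unsigned funct3;       /* bits 15..13: C.BEQZ (110) or C.BNEZ (111)   */
    unsigned pred_rs3;     /* bits 12..10: predicate target register      */
    unsigned src1;         /* bits  9..7 : first source register          */
    unsigned int_or_float; /* bit   6    : I/F selector (polarity is not  */
                           /*              specified in the text)         */
    unsigned ext;          /* bit   5    : "B" bit extending the compare  */
    unsigned src2;         /* bits  4..2 : second source register         */
    unsigned op;           /* bits  1..0 : quadrant (C1 = 01)             */
} cbranch_fields;

/* Split a 16-bit instruction word into the fields listed above. */
static cbranch_fields decode_cbranch(uint16_t insn)
{
    cbranch_fields f;
    f.funct3       = (insn >> 13) & 0x7;
    f.pred_rs3     = (insn >> 10) & 0x7;
    f.src1         = (insn >> 7)  & 0x7;
    f.int_or_float = (insn >> 6)  & 0x1;
    f.ext          = (insn >> 5)  & 0x1;
    f.src2         = (insn >> 2)  & 0x7;
    f.op           = insn & 0x3;
    return f;
}

/* Assumed semantics of a vectorised integer EQ "branch": instead of
 * jumping, set bit i of the predicate when a[i] == b[i].
 * The mask is 64 bits wide, so vl must not exceed 64. */
static uint64_t vec_cmpeq_predicate(const int64_t *a, const int64_t *b,
                                    size_t vl)
{
    uint64_t pred = 0;
    assert(vl <= 64);
    for (size_t i = 0; i < vl; i++) {
        if (a[i] == b[i])
            pred |= (uint64_t)1 << i;
    }
    return pred;
}
```

A caller would pass a raw 16-bit instruction word to `decode_cbranch`, then use `pred_rs3` to select the register that receives the mask returned by `vec_cmpeq_predicate`. Whether the predicate register is overwritten or merged with its previous contents is left open here, since the text does not say.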