for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.
-Worse, for increased workloads over time, as the performance requirements
-increase for new target markets, implementors choose to extend the SIMD
-width (so as to again avoid mixing parallelism into the instruction issue
-phases: the primary "simplicity" benefit of SIMD in the first place),
-with the result that the entire opcode space effectively doubles
-with each new SIMD width that's added to the ISA.
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
However, implicit type-conversion is not only quite burdensome to
implement (explosion of inferred type-to-type conversion) but also is
never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration. Each new type enlarges an
+O(N^2) conversion space and, as anyone who has examined Python's source
+code (which has built-in polymorphic type-conversion) knows, the task
+is more complex than it first seems.
Overall, type-conversion is generally best to leave to explicit
type-conversion instructions, or in definite specific use-cases left to
to also tag certain registers as "predicated if referenced as a destination".
Example:
- // in future operations if r0 is the destination use r5 as
+ // from now on, if r0 is the destination of an operation, use r5 as
// the PREDICATION register
- IMPLICICSRPREDICATE r0, r5
+ SET_IMPLICIT_CSRPREDICATE r0, r5
// store the compares in r5 as the PREDICATION register
CMPEQ8 r5, r1, r2
// r0 is used here. ah ha! that means it's predicated using r5!
ADD8 r0, r1, r3
-With enough registers (and there are enough registers) some fairly
+With enough registers (and in RISC-V there are enough registers) some fairly
complex predication can be set up and yet still execute without significant
stalling, even in a simple non-superscalar architecture.
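
The three-instruction example above can be modelled roughly in C. This is only a sketch under stated assumptions: the `regfile_t` type, the `pred_for_dest` table and all function names are hypothetical, registers are assumed to be 64 bits wide and to hold eight 8-bit lanes, and a lane whose predicate bit is clear is assumed to leave the destination lane unchanged. None of these details are fixed by the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS   32
#define LANES   8      /* assumption: eight 8-bit lanes per 64-bit register */
#define NO_PRED 0xff   /* marker: this destination is not predicated */

/* Hypothetical model of the integer register file plus the implicit tags. */
typedef struct {
    uint64_t x[NREGS];
    uint8_t  pred_for_dest[NREGS];  /* predicate register used when x[i] is a destination */
} regfile_t;

void regfile_init(regfile_t *rf)
{
    for (int r = 0; r < NREGS; r++) {
        rf->x[r] = 0;
        rf->pred_for_dest[r] = NO_PRED;
    }
}

/* Extract 8-bit lane i from a 64-bit register value. */
static uint8_t lane(uint64_t v, int i)
{
    return (uint8_t)(v >> (8 * i));
}

/* SET_IMPLICIT_CSRPREDICATE rd, rp: tag rd as "predicated by rp". */
void set_implicit_predicate(regfile_t *rf, int rd, int rp)
{
    assert(rd >= 0 && rd < NREGS && rp >= 0 && rp < NREGS);
    rf->pred_for_dest[rd] = (uint8_t)rp;
}

/* CMPEQ8 rd, rs1, rs2: write one result bit per lane into rd. */
void cmpeq8(regfile_t *rf, int rd, int rs1, int rs2)
{
    uint64_t bits = 0;
    for (int i = 0; i < LANES; i++)
        if (lane(rf->x[rs1], i) == lane(rf->x[rs2], i))
            bits |= (uint64_t)1 << i;
    rf->x[rd] = bits;
}

/* ADD8 rd, rs1, rs2: lane-wise add, masked if rd carries an implicit tag. */
void add8(regfile_t *rf, int rd, int rs1, int rs2)
{
    uint8_t  rp     = rf->pred_for_dest[rd];
    uint64_t mask   = (rp == NO_PRED) ? 0xff : rf->x[rp];
    uint64_t result = rf->x[rd];

    for (int i = 0; i < LANES; i++) {
        if ((mask >> i) & 1) {
            uint8_t sum = (uint8_t)(lane(rf->x[rs1], i) + lane(rf->x[rs2], i));
            result &= ~((uint64_t)0xff << (8 * i));  /* clear the lane */
            result |= (uint64_t)sum << (8 * i);      /* write the new value */
        }
    }
    rf->x[rd] = result;
}
```

Calling `set_implicit_predicate(&rf, 0, 5)`, `cmpeq8(&rf, 5, 1, 2)` and `add8(&rf, 0, 1, 3)` in that order mirrors the assembly listing: the add never names r5, yet only the lanes where r1 and r2 compared equal are written.
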
-### Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explictly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-bcomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLT would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*! The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not. All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
-"""]]
-
-This would become:
-
-[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is presently
-the "immediate offset" field to reinterpret that to add in not only a bit
-field to distinguish between floating-point compare and integer compare,
-not only to add in a second source register, but also use some of the bits as
-a predication target as well.
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
-"""]]
-
-Now uses the CS format:
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct (predicated) comparison
-operators. In both floating-point and integer cases those could be
-EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see the Appendix.)
## Conclusions
* For analysis of RVV see [[v_comparative_analysis]] which begins to
outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for the format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and whether explicit prefixes
+are a better idea instead. With explicit prefixes, parallelised operations
+may effectively end up always being 64-bit opcodes (32 for the prefix,
+32 for the instruction), with some opportunities to use Compressed to
+bring that down to 48. Also to consider is whether one or both of the
+last two remaining Compressed instruction codes in Quadrant 1 could be
+used as a parallelism prefix, bringing parallelised opcodes down to
+32-bit and having the benefit of being explicit.*
# Note on implementation of parallelism
irs2 += 1;
}
+## Retro-fitting Predication into branch-explicit ISA
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However, what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted;
+however, these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference from the branch
+instructions is that FEQ, FLT and FLE *specifically* write a 1 to an
+integer register if the condition holds, and 0 if not. All that needs
+to be done here is to say, "if the integer register is tagged with a bit
+that says it is a predication register, the **bit** in the integer
+register is set based on the current vector index" instead.
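
As a rough illustration of the paragraph above, the following C sketch contrasts the existing scalar FEQ result with the proposed "set one bit per vector index" behaviour. The function names are hypothetical, and two details are assumptions rather than part of the text: element *i* maps to bit *i* of the destination, and a false comparison clears that bit instead of leaving it alone.

```c
#include <assert.h>
#include <stdint.h>

/* Existing scalar behaviour: the whole integer register becomes 1 or 0. */
uint64_t feq_scalar(float a, float b)
{
    return (a == b) ? 1 : 0;
}

/* Assumed behaviour when rd is tagged as a predication register:
 * only the bit selected by the current vector index is updated. */
uint64_t feq_predicate_bit(uint64_t rd_old, float a, float b, unsigned index)
{
    assert(index < 64);
    uint64_t bit = (uint64_t)1 << index;
    return (a == b) ? (rd_old | bit) : (rd_old & ~bit);
}

/* Apply the element-wise compare across a vector of length vl. */
uint64_t feq_vector(const float *a, const float *b, unsigned vl)
{
    assert(vl <= 64);
    uint64_t rd = 0;
    for (unsigned i = 0; i < vl; i++)
        rd = feq_predicate_bit(rd, a[i], b[i], i);
    return rd;
}
```

The per-element comparison itself is untouched; only the place where the result lands changes, which is why no new opcode is needed for the floating-point compares.
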
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
+imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+"""]]
+
+This would become:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is enough space within what is
+presently the "immediate offset" field to add not only a bit field to
+distinguish between floating-point compare and integer compare, and
+not only a second source register, but also to use some of the bits as
+a predication target as well.
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+ funct3 | imm | rs1' | imm | op |
+ 3 | 3 | 3 | 5 | 2 |
+ C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+"""]]
+
+Now uses the CS format:
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+ funct3 | imm | rs1' | imm | rs2' | op |
+ 3 | 3 | 3 | 2 | 3 | 2 |
+ C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float", including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended: in combination with
+funct3 = 110 or 111 it gives four distinct (predicated) comparison
+operators. In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping src1 and src2).
+
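The retro-fitted C.BEQZ/C.BNEZ layout above could be decoded along the following lines. This is a sketch only: the type and function names are hypothetical, and several choices are assumptions the text does not fix, namely that bit 6 set means "float", that the operator is selected as `(bit5 << 1) | funct3[0]` giving EQ, NE, LT, LE in that order, and that the 3-bit predicate field uses the same x8 to x15 mapping that RVC applies to its 3-bit register fields.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CMP_EQ, CMP_NE, CMP_LT, CMP_LE } cmp_op_t;

/* Hypothetical decoded form of a retro-fitted C.BEQZ / C.BNEZ. */
typedef struct {
    unsigned pred;      /* predicate target register ("predicate rs3") */
    unsigned src1;
    unsigned src2;
    int      is_float;  /* assumption: bit 6 set selects floating-point */
    cmp_op_t op;
} cpred_t;

/* Extract bits hi..lo (inclusive) of a 16-bit instruction. */
static unsigned bits(uint16_t insn, unsigned hi, unsigned lo)
{
    return (insn >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Returns 1 and fills *out if insn is in quadrant C1 with funct3 = 110
 * (C.BEQZ) or 111 (C.BNEZ); returns 0 otherwise. */
int decode_cpred(uint16_t insn, cpred_t *out)
{
    unsigned funct3 = bits(insn, 15, 13);
    if (bits(insn, 1, 0) != 0x1 || (funct3 != 0x6 && funct3 != 0x7))
        return 0;

    /* 3-bit register fields: assumed to address registers 8..15. */
    out->pred     = 8 + bits(insn, 12, 10);
    out->src1     = 8 + bits(insn, 9, 7);
    out->src2     = 8 + bits(insn, 4, 2);
    out->is_float = (int)bits(insn, 6, 6);

    /* Assumed operator mapping: bit 5 and the low bit of funct3. */
    out->op = (cmp_op_t)((bits(insn, 5, 5) << 1) | (funct3 & 1u));
    return 1;
}
```

GT and GE need no extra encodings in this scheme: an assembler would emit LT or LE with the two source fields exchanged.
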
## Register reordering <a name="register_reordering"></a>
### Register File