It is critical to note that it is proposed that such choices MUST
be **entirely transparent** to the end-user and the compiler. Whilst
a Vector (variable-width SIMD) may not precisely match the width of the
parallelism within the implementation, the end-user **should not care**,
and in this way the performance benefits are gained but the ISA remains
straightforward. All that happens at the end of an instruction run is that some
parallel units (if there are any) remain offline, completely
transparently to the ISA, the program, and the compiler.
-The "SIMD considered harmful" trap of having huge complexity and extra
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
instructions to deal with corner-cases is thus avoided, and implementors
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".
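The per-lane predication idea above can be sketched as follows. This is a minimal illustrative model, not part of the proposal: the four-lane width, the `simd_add` helper, and the Python lists standing in for registers are all assumptions.

```python
LANES = 4  # assumed SIMD ALU width: purely an implementation choice

def simd_add(dest, src1, src2, vl):
    """Add vl elements of src1 and src2 into dest, LANES at a time."""
    for base in range(0, vl, LANES):
        # Per-lane predicate: a lane is active only while base+lane < vl
        predicate = [base + lane < vl for lane in range(LANES)]
        for lane in range(LANES):
            if predicate[lane]:   # inactive lanes simply remain offline
                dest[base + lane] = src1[base + lane] + src2[base + lane]

a = [1, 2, 3, 4, 5, 6, 7]
b = [10, 20, 30, 40, 50, 60, 70]
d = [0] * 8
simd_add(d, a, b, 7)   # vl = 7 is deliberately not a multiple of LANES
```

Note how the final pass runs with one lane predicated off: no corner-case cleanup instructions are needed, which is exactly the trap being avoided.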
In addition, implementors will be free to choose whether to provide an
absolute bare minimum level of compliance with the "API" (software-traps
when vectorisation is detected), all the way up to full supercomputing-level
all-hardware parallelism. Options are covered in the Appendix.

# CSRs <a name="csrs"></a>
There are a number of CSRs needed, which are used at the instruction
decode phase to re-interpret RV opcodes (a practice that has
precedent in the setting of MISA to enable / disable extensions).
* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
"bitwidth" may fit into an XLEN-sized register file.
* Predication is a key-value store due to the implicit referencing,
as opposed to having the predicate register explicitly in the instruction.
* Whilst the predication CSR is a key-value store it *generates* easier-to-use
  state information.
* TODO: assess whether the same technique could be applied to the other
  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
  needed for context-switches (empty slots need never be stored).
## Predication CSR
An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
whether a register, if referred to in any standard instruction, is
implicitly to be treated as a vector.

Note:

* A vector length of 1 indicates that it is to be treated as a scalar.
  Bitwidths (on the same register) are interpreted and meaningful.
* A vector length of 0 indicates that the parallelism is to be switched
  off for this register (treated as a scalar). When length is 0,
  the bitwidth CSR for the register is *ignored*.
Internally, implementations may choose to use the non-zero vector length
to set a bit-field per register, to be used in the instruction decode phase.
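The decode-phase bit-field mentioned above might be derived as in this sketch. The `CSRvectorlen` array name and the example register assignments are hypothetical, chosen only to match the description in this section.

```python
# Hypothetical CSR state: 32 x 4-bit vector lengths, one per register.
CSRvectorlen = [0] * 32
CSRvectorlen[3] = 4    # r3 is a vector of length 4
CSRvectorlen[5] = 1    # r5 has length 1 (scalar; bitwidths still apply)

# Decode-phase bit-field: one bit per register, set for every non-zero
# vector length, so decode can test vectorisation with a single AND.
vector_bitfield = 0
for reg, vlen in enumerate(CSRvectorlen):
    if vlen != 0:
        vector_bitfield |= 1 << reg
```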
## Branch Instructions
Branch operations use standard RV opcodes that are reinterpreted to be
"predicate variants" in the instance where either of the two src registers
has its corresponding CSRvectorlen[src] entry as non-zero. When this
reinterpretation is enabled the predicate target register rs3 is to be
treated as a bitfield (up to a maximum of XLEN bits corresponding to a
maximum of XLEN elements).

If either of src1 or src2 is a scalar (CSRvectorlen[src] == 0) the comparison
goes ahead as vector-scalar or scalar-vector. Implementors should note that
this could require considerable multi-porting of the register file in order
to parallelise properly, so may have to involve the use of register caching
and transparent copying (see the Multiple-Banked Register File Architectures
paper).

In instances where no vectorisation is detected on either src register
the operation is treated as an absolutely standard scalar branch operation.
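The reinterpreted compare can be sketched as follows. The function name and the Python lists standing in for vector registers are illustrative assumptions; the bitfield-in-rs3 behaviour is what the text above describes.

```python
def predicated_beq(src1, src2, vl):
    """Element-wise equality compare, returning an rs3-style bitfield."""
    rs3 = 0
    for i in range(vl):          # vl is bounded by XLEN elements
        if src1[i] == src2[i]:
            rs3 |= 1 << i        # bit i holds the result for element i
    return rs3

# elements 0 and 2 match, so bits 0 and 2 of the predicate are set:
mask = predicated_beq([1, 2, 3, 4], [1, 0, 3, 0], 4)
```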
This is the overloaded table for Integer-base Branch operations. Opcode
(bits 6..0) is set in all cases to 1100011.
[[!table data="""
reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
"""]]
Note that just as with the standard (scalar, non-predicated) branch
operations, BLT, BGT, BLEU and BGTU may be synthesised by inverting
src1 and src2.

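The operand-swap synthesis is the same trick scalar RISC-V uses for its branch pseudo-instructions; a minimal sketch (comparisons modelled as plain Python functions, 32-bit unsigned view assumed):

```python
def blt(a, b):                  # signed less-than: available directly
    return a < b

def bgeu(a, b):                 # unsigned greater-or-equal (32-bit view)
    return (a & 0xFFFFFFFF) >= (b & 0xFFFFFFFF)

def bgt(a, b):                  # BGT src1,src2 == BLT src2,src1
    return blt(b, a)

def bleu(a, b):                 # BLEU src1,src2 == BGEU src2,src1
    return bgeu(b, a)
```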
Below is the overloaded table for Floating-point Predication operations.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
up to 16 (TBD) of either the floating-point or integer registers to
be marked as "predicated" (key), and if so, which integer register to
use as the predication mask (value).
**TODO**
# Implementing P (renamed to DSP) on top of Simple-V
designed to be slotted into an existing implementation (just after
instruction decode) with minimum disruption and effort.
* minus: the complexity (if full parallelism is to be exploited)
  of having to use register renames, OoO, VLIW, register file caching,
  all of which has been done before but is a pain
* plus: transparent re-use of existing opcodes as-is just indirectly
saying "this register's now a vector" which
* plus: means that future instructions also get to be inherently
a SIMD architecture where the ALU becomes responsible for the parallelism,
Alt-RVP ALUs would likewise be so responsible... with *additional*
(lane-based) parallelism on top.
* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
at least one dimension are avoided (architectural upgrades introducing
128-bit then 256-bit then 512-bit variants of the exact same 64-bit
SIMD block)
adequate space to interpret it in a similar fashion:
[[!table data="""
31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
 offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
This would become:
a predication target as well.
[[!table data="""
15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
funct3 | imm | rs10 | imm | op |
3 | 3 | 3 | 5 | 2 |
C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]
Now uses the CS format:
[[!table data="""
15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
funct3 | imm | rs10 | imm | | op |
3 | 3 | 3 | 2 | 3 | 2 |
C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
"""]]
Bit 6 would be decoded as "operation refers to Integer or Float" including
vew may be one of the following (giving a table "bytestable", used below):
| vew | bitwidth | bytestable |
| --- | -------- | ---------- |
| 000 | default | XLEN/8 |
| 001 | 8 | 1 |
| 010 | 16 | 2 |
| 011 | 32 | 4 |
| 100 | 64 | 8 |
| 101 | 128 | 16 |
| 110 | rsvd | rsvd |
| 111 | rsvd | rsvd |
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
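A sketch of one possible rendering, using the bytestable above; the CSR array names (`CSRvectorlen`, `CSRbitwidth`) are assumptions taken from the surrounding text:

```python
XLEN = 64                                 # assumed register width in bits
bytestable = [XLEN // 8, 1, 2, 4, 8, 16]  # indexed by vew, per the table

def vector_length(reg, CSRvectorlen, CSRbitwidth):
    """Effective element count for register reg."""
    vew = CSRbitwidth[reg]                # 0 == "default" (XLEN-wide)
    bytesperelem = bytestable[vew]
    elemsperreg = (XLEN // 8) // bytesperelem  # SIMD elements per register
    return CSRvectorlen[reg] * elemsperreg

# a vector of length 2 whose elements are 16-bit (vew == 0b010) spans
# 2 registers x 4 elements each:
vlen = vector_length(3, {3: 2}, {3: 0b010})
```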
Whilst the above may seem to be severe minuses, there are some strong
pluses:
* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
>
> Thrown away.
discussion then led to the question of OoO architectures
> The costs of the imprecise-exception model are greater than the benefit.
> Software doesn't want to cope with it. It's hard to debug. You can't
> migrate state between different microarchitectures--unless you force all
> implementations to support the same imprecise-exception model, which would
> structure, as the microarchitectural guts have to be spilled to memory.)
## Implementation Paradigms <a name="implementation_paradigms"></a>
TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
Destructive overwriting of destination indices can only be avoided by
taking a copy of the entire vector in advance.
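A small sketch of why the advance copy is needed, using Python lists as stand-in registers (the shift-by-one operation is an illustrative assumption): when destination indices overlap the source, an in-place traversal reads values that earlier elements have already overwritten.

```python
regs = [10, 20, 30, 40, 50]

# Naive in-place traversal: each write destroys a value that a later
# element still needs to read, so the overwrite cascades incorrectly.
broken = list(regs)
for i in range(1, len(broken)):
    broken[i] = broken[i - 1]

# Taking a copy of the entire vector in advance preserves the intended
# element-by-element semantics.
snapshot = list(regs)
safe = list(regs)
for i in range(1, len(safe)):
    safe[i] = snapshot[i - 1]
```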
TBD: floating-point compare and other exception handling

# References
* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>