# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
-[[!toc ]]
-
-# Summary
-
Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
implementations, or SIMD, or anything else, would then benefit *if*
Simple-V was added on top.
+[[!toc ]]
+
# Introduction
This proposal exists so as to be able to satisfy several disparate
each of P and V, and to see if each of P and V then, in *combination* with
a "best-of-both" parallelism extension, could be added on *on top* of
this proposal, to topologically provide the exact same functionality of
-each of P and V.
+each of P and V. Each of P and V then can focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
Furthermore, an additional goal of this proposal is to reduce the number
of opcodes utilised by each of P and V as they currently stand, leveraging
existing RISC-V opcodes where possible, and also potentially allowing
P and V to make use of Compressed Instructions as a result.
-**TODO**: reword this to better suit this document:
-
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted. I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
- numbers) and unusual operations of V to create a single-issue "Esoteric
- Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
- of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
- apply across *all* Extensions, whether those be DSP, M, Base, V, P -
- everything.
-
**TODO**: propose overflow registers be actually one of the integer regs
(flowing to multiple regs).
**TODO**: propose "mask" (predication) registers likewise. combination with
-standard RV instructions and overflow registers extremely powerful
+standard RV instructions and overflow registers extremely powerful, see
+Aspex ASP.
# Analysis and discussion of Vector vs SIMD
for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.
However, implicit type-conversion is not only quite burdensome to
implement (explosion of inferred type-to-type conversion) but also is
never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration. Each new type results in
+an increased O(N^2) conversion space that, as anyone who has examined
+python's source code (which has built-in polymorphic type-conversion),
+knows that the task is more complex than it first seems.
Overall, type-conversion is generally best to leave to explicit
type-conversion instructions, or in definite specific use-cases left to
to also tag certain registers as "predicated if referenced as a destination".
Example:
- // in future operations if r0 is the destination use r5 as
+ // in future operations from now on, if r0 is the destination use r5 as
// the PREDICATION register
- IMPLICICSRPREDICATE r0, r5
+ SET_IMPLICIT_CSRPREDICATE r0, r5
// store the compares in r5 as the PREDICATION register
CMPEQ8 r5, r1, r2
// r0 is used here. ah ha! that means it's predicated using r5!
ADD8 r0, r1, r3
-With enough registers (and there are enough registers) some fairly
+With enough registers (and in RISC-V there are enough registers) some fairly
complex predication can be set up and yet still execute without significant
stalling, even in a simple non-superscalar architecture.
-### Retro-fitting Predication into branch-explicit ISA
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explictly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-bcomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLT would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*! The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not. All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
-"""]]
-
-This would become:
-
-[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is presently
-the "immediate offset" field to reinterpret that to add in not only a bit
-field to distinguish between floating-point compare and integer compare,
-not only to add in a second source register, but also use some of the bits as
-a predication target as well.
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
-"""]]
-
-Now uses the CS format:
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct comparison operators.
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see Appendix)
## Conclusions
# Instruction Format
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width. Lots to
-look at and investigate!*
+The instruction format for Simple-V does not actually have *any* compare
+operations, *any* arithmetic, floating point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on implicit CSR configurations for both vector length and
+bitwidth. This includes Compressed instructions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+ outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities for to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit and having the benefit of
+being explicit.*
+
+## Branch Instruction:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+0 | reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+0 | reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+0 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 100 | predicate rs3 || BLE |
+0 | reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+0 | reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+0 | reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+1 | reserved | src2 | src1 | 000 | predicate rs3 || FEQ |
+1 | reserved | src2 | src1 | 001 | predicate rs3 || FNE |
+1 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 100 | predicate rs3 || FLT |
+1 | reserved | src2 | src1 | 101 | predicate rs3 || FLE |
+1 | reserved | src2 | src1 | 110 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 111 | predicate rs3 || rsvd |
+"""]]
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+ s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+ if I/F == INT: # integer type cmp
+ pred_enabled = int_pred_enabled # TODO: exception if not set!
+ preg = int_pred_reg[rd]
+ else:
+ pred_enabled = fp_pred_enabled # TODO: exception if not set!
+ preg = fp_pred_reg[rd]
+
+ s1 = CSRvectorlen[src1] > 1;
+ s2 = CSRvectorlen[src2] > 1;
+ for (int i=0; i<vl; ++i)
+ preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+ s2 ? reg[src2+i] : reg[src2]);
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering") setting Vector-Length * (number of SIMD elements) bits
+ in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+ Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+ be put to good use if there is a suitable use-case.
+* FEQ and FNE (and BEQ and BNE) are included in order to save one
+ instruction having to invert the resultant predicate bitfield.
+ FLT and FLE may be inverted to FGT and FGE if needed by swapping
+ src1 and src2 (likewise the integer counterparts).
+
+## Compressed Branch Instruction:
+
+[[!table data="""
+15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
+funct3 | imm | rs10 | imm | | op | |
+3 | 3 | 3 | 2 | 3 | 2 | |
+C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
+110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
+111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
+110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
+111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5 13 14 and 15 make up the comparator type
+* In both floating-point and integer cases there are four predication
+ comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+ src1 and src2).
+
+## LOAD / STORE Instructions
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction.
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] |||| rs1 | funct3 | rd | opcode |
+1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format. Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+ (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction, it's
+ implicit based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+ is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth):
+
+ if (unit-strided) stride = elsize;
+ else stride = areg[as2]; // constant-strided
+
+ pred_enabled = int_pred_enabled
+ preg = int_pred_reg[rd]
+
+ for (int i=0; i<vl; ++i)
+ if (preg_enabled[rd] && [!]preg[i])
+ for (int j=0; j<seglen+1; j++)
+ {
+ if CSRvectorised[rs2])
+ offs = vreg[rs2][i]
+ else
+ offs = i*(seglen+1)*stride;
+ vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+ }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
+
+A similar instruction exists for STORE, with identical topological
+translation of all features. **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, where bits 2-4 are
+a src register instead of dest:
+
+[[!table data="""
+15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
+funct3 | imm | rs10 | imm | rd0 | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled. In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
+page 76, "For virtual memory systems some data accesses could be resident
+in physical memory and some not". The interesting question then arises:
+how does RVV deal with the exact same scenario?
# Note on implementation of parallelism
# CSRs <a name="csrs"></a>
There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has precedent
-in the setting of MISA to enable / disable extensions).
+decode phase to re-interpret standard RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (key-value store)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
Notes:
The reason for setting this limit is so that predication registers, when
marked as such, may fit into a single register as opposed to fanning out
over several registers. This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
## Vector-length CSRs
predicated.
An example of how to subdivide the register file when bitwidth != default
-is given in the section "Virtual Register Reordering".
-
-# Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
- register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
- register CSRpredicate[XLEN][4]; # 2^4 is max vector length
- register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
- register x[32][XLEN];
-
- function op_add(rd, rs1, rs2, predr)
- {
- /* note that this is ADD, not PADD */
- int i, id, irs1, irs2;
- # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
- # also destination makes no sense as a scalar but what the hell...
- for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
- if (CSRpredicate[predr][i]) # i *think* this is right...
- x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
- # now increment the idxs
- if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
- id += 1;
- if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
- irs1 += 1;
- if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
- irs2 += 1;
- }
-
-# V-Extension to Simple-V Comparative Analysis
-
-This section has been moved to its own page [[v_comparative_analysis]]
-
-# P-Ext ISA
-
-This section has been moved to its own page [[p_comparative_analysis]]
+is given in the section "Bitwidth Virtual Register Reordering".
# Exceptions
(caveat: anything not specified drops through to software-emulation / traps)
* TODO
-# Register reordering <a name="register_reordering"></a>
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
-## Register File
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+ register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+ register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+ register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+ register x[32][XLEN];
+
+ function op_add(rd, rs1, rs2, predr)
+ {
+ /* note that this is ADD, not PADD */
+ int i, id, irs1, irs2;
+ # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+ # also destination makes no sense as a scalar but what the hell...
+ for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+ if (CSRpredicate[predr][i]) # i *think* this is right...
+ x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+ # now increment the idxs
+ if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
+ id += 1;
+ if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+ irs1 += 1;
+ if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+ irs2 += 1;
+ }
+
+## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explictly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+bcomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLT would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
+imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+"""]]
+
+This would become:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
+"""]]
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret that to add in not only a bit
+field to distinguish between floating-point compare and integer compare,
+not only to add in a second source register, but also use some of the bits as
+a predication target as well.
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+ funct3 | imm | rs10 | imm | op |
+ 3 | 3 | 3 | 5 | 2 |
+ C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+"""]]
+
+Now uses the CS format:
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+ funct3 | imm | rs10 | imm | | op |
+ 3 | 3 | 3 | 2 | 3 | 2 |
+ C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators. In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
+
+## Register reordering <a name="register_reordering"></a>
+
+### Register File
| Reg Num | Bits |
| ------- | ---- |
| .. | (32..0) |
| r31| (32..0) |
-## Vectorised CSR
+### Vectorised CSR
May not be an actual CSR: may be generated from Vector Length CSR:
single-bit is less burdensome on instruction decode phase.
| - | - | - | - | - | - | - | - |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
-## Vector Length CSR
+### Vector Length CSR
| Reg Num | (3..0) |
| ------- | ---- |
| r6 | 0 |
| r7 | 1 |
-## Virtual Register Reordering:
+### Virtual Register Reordering
This example assumes the above Vector Length CSR table
| r4 | (32..0) | (32..0) | (32..0) |
| r7 | (32..0) |
+### Bitwidth Virtual Register Reordering
+
This example goes a little further and illustrates the effect that a
-bitwidth CSR has been set on a register
+bitwidth CSR has been set on a register. Preconditions:
* RV32 assumed
* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
* Given that the context is RV32, ELEN=32.
* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
* Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length 5 therefore the last "element" is skipped
So when using an operation that uses r2 as a source (or destination)
the operation is carried out as follows:
Regardless of the internal parallelism choice, *predication must
still be respected*, making Simple-V in effect the "consistent public API".
-## Example Instruction translation: <a name="example_translation"></a>
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default |
+| 001 | 8 |
+| 010 | 16 |
+| 011 | 32 |
+| 100 | 64 |
+| 101 | 128 |
+| 110 | rsvd |
+| 111 | rsvd |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+ vew = CSRbitwidth[rs1]
+ if (vew == 0)
+ bytesperreg = (XLEN/8) # or FLEN as appropriate
+ else:
+ bytesperreg = bytestable[vew] # 1 2 4 8 16
+ simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+ vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+ function regoffs(rnum, i):
+ regidx = floor(i / simdmult) # integer-div rounded down
+ byteidx = i % simdmult # integer-remainder
+ return rnum + regidx, # actual real register
+ byteidx * 8, # low
+ byteidx * 8 + (vew-1), # high
+
+### Example Instruction translation: <a name="example_translation"></a>
Instructions "ADD r2 r4 r4" would result in three instructions being
generated and placed into the FILO:
* ADD r2 r5 r5
* ADD r2 r6 r6
-## Insights
+### Insights
SIMD register file splitting still to consider. For RV64, benefits of doubling
(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
of underlying hardware is an implementor-choice that could just as
equally be applied *without* Simple-V even being implemented.
-# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
+## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
It could indeed have been logically deduced (or expected), that there
would be additional decode latency in this proposal, because if
parallel ALUs) is only equal to one ("virtual" parallelism), or is
greater than one, should not be underestimated.
-# Appendix
-
-# Reducing Register Bank porting
+## Reducing Register Bank porting
This looks quite reasonable.
<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
by *introducing* deliberate latency into the execution phase.
-
-
# References
* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>