*Actual* internal hardware-level parallelism is *not* required, such
that Simple-V may be viewed as providing a "compact" or "consolidated"
means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FILO), pending execution.
+instruction queue (FIFO), pending execution.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:
analysis and review purposes) prohibitively expensive
* Both contain partial duplication of pre-existing RISC-V instructions
(an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
- at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+ parallelism at the instruction level
* Both require that their respective parallelism paradigm be implemented
along-side and integral to their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
* Vectorisation typically includes much more comprehensive memory load
and store schemes (unit stride, constant-stride and indexed), which
in turn have ramifications: virtual memory misses (TLB cache misses)
- and even multiple page-faults... all caused by a *single instruction*.
+ and even multiple page-faults... all caused by a *single instruction*,
+ yet with a clear benefit that the regularisation of LOAD/STOREs can
+ be optimised for minimal impact on caches and maximised throughput.
* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
to pages), and these load/stores have absolutely nothing to do with the
- SIMD / ALU engine, no matter how wide the operand.
+ SIMD / ALU engine, no matter how wide the operand. Simplicity but with
+ more impact on instruction and data caches.
Overall it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
to direct the operation to a correctly-sized width ALU engine, anyway.
Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.
but then they do not get the opportunity to optimise for highly-streamlined
memory accesses either.
-With the "bang-per-buck" ratio being so high and the direct improvement
-in L1 Instruction Cache usage, as well as the opportunity to optimise
-L1 and L2 cache usage, the case for including Vector LOAD/STORE is
-compelling.
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
## Mask and Tagging (Predication)
* explicit compare and branch: BNE x, y -> offs would jump offs
instructions if x was not equal to y
* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
- (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code) such as ADD results in a carry, carry bit
+ implicitly (or sometimes explicitly) goes into a "tag" (mask) register
The first of these is a "normal" branch method, which is flat-out impossible
to parallelise without look-ahead and effectively rewriting instructions.
i.e. *without* requiring a super-scalar or out-of-order architecture,
but doing a proper, full job (ZOLC) is an entirely different matter.
-Constructing a SIMD/Simple-Vector proposal based around four of these five
+Constructing a SIMD/Simple-Vector proposal based around four of these six
requirements would therefore seem to be a logical thing to do.
-# Instruction Format
+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance. In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
+Parallelism". They achieve a 16-way SIMD at an **instruction** level
+by providing a combination of a 4-way parallel ALU *and* an externally
+transparent loop that feeds 4 sequential sets of data into each of the
+4 ALUs.
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD) implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is absolutely critical to note that it is proposed that such choices MUST
+be **entirely transparent** to the end-user and the compiler. Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**
+and in this way the performance benefits are gained but the ISA remains
+straightforward. All that happens at the end of an instruction run is: some
+parallel units (if there are any) would remain offline, completely
+transparently to the ISA, the program, and the compiler.
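The transparency requirement can be illustrated with a minimal Python sketch. It assumes a hypothetical `run_vector_op` helper (not part of the proposal) that models an implementation with a configurable number of ALU lanes: the results are identical whether one ALU or four are used, and only the number of sequential passes differs.

```python
import operator

def run_vector_op(op, a, b, lanes):
    """Process equal-length operand lists on `lanes` hypothetical ALUs.

    Returns the results plus the number of sequential passes needed.
    """
    results = []
    passes = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # One pass: each ALU handles one element; ALUs with no element stay idle.
        results.extend(op(x, y) for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return results, passes

a = list(range(16))
b = [10] * 16
hybrid, hybrid_passes = run_vector_op(operator.add, a, b, lanes=4)  # 4 ALUs, looped
serial, serial_passes = run_vector_op(operator.add, a, b, lanes=1)  # single ALU
```

Here the 4-lane run corresponds to the hybrid arrangement described for Videocore-IV (16 elements in 4 passes), while the 1-lane run is the fully sequential option; the program cannot tell the two apart from the results.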
+
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+The "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is thus avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
+
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism. Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+ of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+ "bitwidth" may fit into an XLEN-sized register.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. The first entry is whether predication
+is enabled. The second entry is whether the register index refers to a
+floating-point or an integer register. The third entry is the index
+of that register which is to be predicated (if referred to). The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6 | 5 | (4..0) | (4..0) |
+| ----- | - | - | ------- | ------- |
+| r0 | pren0 | i/f | regidx | predidx |
+| r1 | pren1 | i/f | regidx | predidx |
+| .. | pren.. | i/f | regidx | predidx |
+| r15 | pren15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    fp_pred_enabled[32];
+    int_pred_enabled[32];
+    for (i = 0; i < 16; i++)
+        if CSRpred[i].pren:
+            idx = CSRpred[i].regidx
+            predidx = CSRpred[i].predidx
+            if CSRpred[i].type == 0: # integer
+                int_pred_enabled[idx] = 1
+                int_pred_reg[idx] = predidx
+            else:
+                fp_pred_enabled[idx] = 1
+                fp_pred_reg[idx] = predidx
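A minimal runnable sketch of the same table inversion, in Python. The `PredEntry` tuple and the `invert_pred_csr` function are illustrative names only, and `is_fp` stands in for the i/f bit; a `None` slot plays the role of "predication not enabled" for that register.

```python
from typing import NamedTuple

class PredEntry(NamedTuple):
    pren: bool    # predication enabled for this entry
    is_fp: bool   # i/f bit: True = floating-point register, False = integer
    regidx: int   # register to be predicated (the "key")
    predidx: int  # integer register holding the predicate bitfield (the "value")

def invert_pred_csr(csr_pred):
    """Turn the key-value CSR entries into per-register lookup tables."""
    int_pred_reg = [None] * 32
    fp_pred_reg = [None] * 32
    for entry in csr_pred:
        if not entry.pren:
            continue  # disabled entries contribute no state
        table = fp_pred_reg if entry.is_fp else int_pred_reg
        table[entry.regidx] = entry.predidx
    return int_pred_reg, fp_pred_reg

csr = [
    PredEntry(True, False, 4, 10),   # integer r4 predicated by r10
    PredEntry(True, True, 2, 11),    # fp f2 predicated by r11
    PredEntry(False, False, 7, 12),  # disabled entry
]
int_tbl, fp_tbl = invert_pred_csr(csr)
```

After inversion, a decoder can look up the predicate register for a destination with a single indexed read rather than searching all 16 entries.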
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store:
+
+    if type(iop) == INT:
+        pred_enabled = int_pred_enabled
+        preg = int_pred_reg[rd]
+    else:
+        pred_enabled = fp_pred_enabled
+        preg = fp_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+        if (pred_enabled[rd] && [!]preg[i])
+           (d ? vreg[rd][i] : sreg[rd]) =
+            iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
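As a hedged illustration of the indirect lookup, the following Python sketch models a flat integer register file in which a vector register N of length vl occupies entries N to N+vl-1. The function name, the `vector_regs` set and the choice to treat a missing predicate as "all elements active" are assumptions made for the example; predicate inversion is omitted.

```python
def predicated_vector_op(op, regs, pred_reg, rd, rs1, rs2, vl, vector_regs):
    """Apply op element-wise; element i is written only if predicate bit i is set.

    regs        : flat register file (list of ints)
    pred_reg    : per-register table; pred_reg[rd] is a register index or None
    vector_regs : set of register numbers currently marked as vectors
    """
    predidx = pred_reg[rd]
    # Assumption: with no predicate configured, every element is active.
    mask = regs[predidx] if predidx is not None else ~0
    for i in range(vl):
        if not (mask >> i) & 1:
            continue  # masked-out element: destination left untouched
        a = regs[rs1 + i] if rs1 in vector_regs else regs[rs1]
        b = regs[rs2 + i] if rs2 in vector_regs else regs[rs2]
        dest = rd + i if rd in vector_regs else rd
        regs[dest] = op(a, b)

regs = [0] * 32
regs[4:8] = [1, 2, 3, 4]
regs[8:12] = [10, 20, 30, 40]
regs[20] = 0b0101                 # predicate: elements 0 and 2 active
pred_reg = [None] * 32
pred_reg[12] = 20                 # r12 is predicated by the bitfield in r20
predicated_vector_op(lambda a, b: a + b, regs, pred_reg, 12, 4, 8, 4, {4, 8, 12})
```

Only elements 0 and 2 of the destination vector starting at r12 are written; the other two keep their previous contents.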
+
+## MAXVECTORDEPTH
+
+MAXVECTORDEPTH is the same concept as MVL in RVV. However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers. This keeps the implementation a little simpler.
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Vector-length CSRs
+
+Vector lengths are interpreted as meaning "any instruction referring to
+r(N) generates implicit identical instructions referring to registers
+r(N) through r(N+M-1), where M is the Vector Length". Vector Lengths may
+be set to use up to 16 registers in the register file.
+
+One separate CSR table is needed for each of the integer and floating-point
+register files:
+
+| RegNo | (3..0) |
+| ----- | ------ |
+| r0 | vlen0 |
+| r1 | vlen1 |
+| .. | vlen.. |
+| r31 | vlen31 |
+
+An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
+whether a register, if referred to in any standard instruction, is
+implicitly to be treated as a vector.
+
+Note:
+
+* A vector length of 1 indicates that it is to be treated as a scalar.
+ Bitwidths (on the same register) are interpreted and meaningful.
+* A vector length of 0 indicates that the parallelism is to be switched
+ off for this register (treated as a scalar). When length is 0,
+ the bitwidth CSR for the register is *ignored*.
+
+Internally, implementations may choose to use the non-zero vector length
+to set a bit-field per register, to be used in the instruction decode phase.
+In this way any standard (current or future) operation involving
+register operands may detect if the operation is to be vector-vector,
+vector-scalar or scalar-scalar (standard) simply through a single
+bit test.
+
+Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
+bitwidth is specifically not set) it becomes:
+
+    CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
+
+This is in contrast to RVV:
+
+    CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
+
+## Element (SIMD) bitwidth CSRs
+
+Element bitwidths may be specified with a per-register CSR, and indicate
+how a register (integer or floating-point) is to be subdivided.
+
+| RegNo | (2..0) |
+| ----- | ------ |
+| r0 | vew0 |
+| r1 | vew1 |
+| .. | vew.. |
+| r31 | vew31 |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default |
+| 001 | 8 |
+| 010 | 16 |
+| 011 | 32 |
+| 100 | 64 |
+| 101 | 128 |
+| 110 | rsvd |
+| 111 | rsvd |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
+
+Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
+into account, it becomes:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0)
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
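The calculation can be checked with a small Python sketch. The function name `vsetl`, its parameter names and the default of MAXVECTORDEPTH equal to XLEN are assumptions for illustration. The pseudo-code does not say what happens when the element is wider than the register (for example 128-bit elements on RV64, where the integer division would give zero), so this sketch simply rejects that case.

```python
BYTESTABLE = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}  # vew encoding -> bytes per element

def vsetl(csr_vectorlen, csr_bitwidth, rs1, requested, xlen=64, max_vector_depth=None):
    """Return the vector length that a "vsetl rs1, rs2" would establish.

    `requested` is the value held in rs2.
    """
    if max_vector_depth is None:
        max_vector_depth = xlen          # assumption: limit equals regfile bitwidth
    reg_bytes = xlen // 8
    vew = csr_bitwidth[rs1]
    bytes_per_elem = reg_bytes if vew == 0 else BYTESTABLE[vew]
    if bytes_per_elem > reg_bytes:
        # Not covered by the pseudo-code; rejected here rather than guessed at.
        raise ValueError("element wider than register")
    simdmult = reg_bytes // bytes_per_elem   # SIMD elements per register
    vlen = csr_vectorlen[rs1] * simdmult
    return min(vlen, max_vector_depth, requested)

vectorlen = {5: 4}   # r5 is a vector spanning 4 registers
bitwidth = {5: 2}    # vew=010: 16-bit elements
full = vsetl(vectorlen, bitwidth, 5, requested=100)    # 4 regs * 4 elements each
capped = vsetl(vectorlen, bitwidth, 5, requested=10)   # limited by rs2
default = vsetl(vectorlen, {5: 0}, 5, requested=100)   # default width: 1 per register
```

With 16-bit elements on RV64 each register holds four elements, so a 4-register vector exposes 16 individually predicable elements.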
+
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
+# Instructions
+
+By being a topological remap of RVV concepts, the following RVV instructions
+remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT, VMERGE, VSELECT,
+VSLIDE, VCLASS and VPOPC. Two instructions, VCLIP and VCLIPI, do not
+have RV Standard equivalents, so are left out of Simple-V.
+All other instructions from RVV are topologically re-mapped and retain
+their complete functionality, intact.
+
+## Instruction Format
The instruction format for Simple-V does not actually have *any* explicit
-compare operations, *any* arithmetic, floating point or memory instructions.
+compare operations, *any* arithmetic, floating point or *any*
+memory instructions.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth. *This includes Compressed instructions* as well as future ones.
+depending on CSR configurations for vector length, bitwidth and
+predication. *This includes Compressed instructions* as well as any
+future instructions and Custom Extensions.
* For analysis of RVV see [[v_comparative_analysis]] which begins to
outline topologically-equivalent mappings of instructions
## Branch Instruction:
+Branch operations use standard RV opcodes that are reinterpreted to be
+"predicate variants" in the instance where either of the two src registers
+have their corresponding CSRvectorlen[src] entry as non-zero. When this
+reinterpretation is enabled the predicate target register rs3 is to be
+treated as a bitfield (up to a maximum of XLEN bits corresponding to a
+maximum of XLEN elements).
+
+If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector. Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register cacheing
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src registers
+the operation is treated as an absolutely standard scalar branch operation.
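A minimal Python sketch of a predicated compare under these rules, assuming a flat register file and a hypothetical `vector_regs` set marking which registers are vectors. The function returns the bitfield that would be written to the predicate target rs3; it is an illustration of the behaviour described above, not a definitive implementation.

```python
import operator

def predicated_compare(cmp, regs, src1, src2, vl, vector_regs):
    """Compare element-wise and return a bitfield with bit i set where cmp holds."""
    mask = 0
    for i in range(vl):
        # A scalar operand is reused for every element (vector-scalar compare).
        a = regs[src1 + i] if src1 in vector_regs else regs[src1]
        b = regs[src2 + i] if src2 in vector_regs else regs[src2]
        if cmp(a, b):
            mask |= 1 << i
    return mask

regs = [0] * 32
regs[4:8] = [1, 5, 3, 5]   # vector starting at r4, length 4
regs[9] = 5                # scalar r9
eq_mask = predicated_compare(operator.eq, regs, 4, 9, 4, {4})
lt_mask = predicated_compare(operator.lt, regs, 4, 9, 4, {4})
```

In this example `eq_mask` has bits 1 and 3 set and `lt_mask` has bits 0 and 2 set, matching the elements of the vector that equal or fall below the scalar.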
+
This is the overloaded table for Integer-base Branch operations. Opcode
(bits 6..0) is set in all cases to 1100011.
[[!table data="""
31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12|10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
7 | 5 | 5 | 3 | 4 | 1 | 7 |
reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
"""]]
-This is the overloaded table for Floating-point Predication operations.
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLE, BGT, BLEU and BGTU may be synthesised by inverting
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
As with
Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
RVF compare can always be followed up with an integer BEQ or a BNE (or
a compressed comparison to zero or non-zero), in predication terms that
becomes more of an impact as an explicit (scalar) instruction is needed
-to invert the predicate. An additional encoding funct3=011 is therefore
-proposed to cater for this.
+to invert the predicate bitmask. An additional encoding funct3=011 is
+therefore proposed to cater for this.
[[!table data="""
31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
5 | 2 | 5 | 5 | 3 | 4 | 7 |
10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
-10100 | 00/01/11 | src2 | src1 | *011* | pred rs3 | FNE |
+10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | FNE |
10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
"""]]
if I/F == INT: # integer type cmp
pred_enabled = int_pred_enabled # TODO: exception if not set!
preg = int_pred_reg[rd]
+ reg = int_regfile
else:
pred_enabled = fp_pred_enabled # TODO: exception if not set!
preg = fp_pred_reg[rd]
+ reg = fp_regfile
s1 = CSRvectorlen[src1] > 1;
s2 = CSRvectorlen[src2] > 1;
* Predicated SIMD comparisons would break src1 and src2 further down
into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length * (number of SIMD elements) bits
+ Reordering") setting Vector-Length times (number of SIMD elements) bits
in Predicate Register rs3 as opposed to just Vector-Length bits.
* Predicated Branches do not actually have an adjustment to the Program
Counter, so all of bits 25 through 30 in every case are not needed.
Notes:
* Bits 5 13 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
* In both floating-point and integer cases there are four predication
comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
src1 and src2).
For full analysis of topological adaptation of RVV LOAD/STORE
see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
Revised LOAD:
* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
* **TODO**: clarify where width maps to elsize
-Pseudo-code (excludes CSR SIMD bitwidth):
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
if (unit-strided) stride = elsize;
else stride = areg[as2]; // constant-strided
in physical memory and some not". The interesting question then arises:
how does RVV deal with the exact same scenario?
Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults.
-
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors desire to focus on power, area or performance. In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism". They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (varible-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward. All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-The "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is thus avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
-
-# CSRs <a name="csrs"></a>
-
-There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
-precedent in the setting of MISA to enable / disable extensions).
-
-* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-* Vector Length CSR (VSETVL, VGETVL)
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
- marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
- of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
- "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
- as opposed to having the predicate register explicitly in the instruction.
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. The first entry is whether predication
-is enabled. The second entry is whether the register index refers to a
-floating-point or an integer register. The third entry is the index
-of that register which is to be predicated (if referred to). The fourth entry
-is the integer register that is treated as a bitfield, indexable by the
-vector element index.
-
-| RegNo | 6 | 5 | (4..0) | (4..0) |
-| ----- | - | - | ------- | ------- |
-| r0 | pren0 | i/f | regidx | predidx |
-| r1 | pren1 | i/f | regidx | predidx |
-| .. | pren.. | i/f | regidx | predidx |
-| r15 | pren15 | i/f | regidx | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
- fp_pred_enabled[32];
- int_pred_enabled[32];
- for (i = 0; i < 16; i++)
- if CSRpred[i].pren:
- idx = CSRpred[i].regidx
- predidx = CSRpred[i].predidx
- if CSRpred[i].type == 0: # integer
- int_pred_enabled[idx] = 1
- int_pred_reg[idx] = predidx
- else:
- fp_pred_enabled[idx] = 1
- fp_pred_reg[idx] = predidx
-
-So when an operation is to be predicated, it is the internal state that
-is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
- for (int i=0; i<vl; ++i)
- if ([!]preg[p][i])
- (d ? vreg[rd][i] : sreg[rd]) =
- iop(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-This instead becomes an *indirect* reference using the *internal* state
-table generated from the Predication CSR key-value store:
-
- if type(iop) == INT:
- pred_enabled = int_pred_enabled
- preg = int_pred_reg[rd]
- else:
- pred_enabled = fp_pred_enabled
- preg = fp_pred_reg[rd]
-
- for (int i=0; i<vl; ++i)
- if (preg_enabled[rd] && [!]preg[i])
- (d ? vreg[rd][i] : sreg[rd]) =
- iop(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
-
-## MAXVECTORDEPTH
-
-MAXVECTORDEPTH is the same concept as MVL in RVV. However in Simple-V,
-given that its primary (base, unextended) purpose is for 3D, Video and
-other purposes (not requiring supercomputing capability), it makes sense
-to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
-and so on).
-
-The reason for setting this limit is so that predication registers, when
-marked as such, may fit into a single register as opposed to fanning out
-over several registers. This keeps the implementation a little simpler.
-Note that RVV on top of Simple-V may choose to over-ride this decision.
-
-## Vector-length CSRs
-
-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N+M-1) where M is the Vector Length". Vector Lengths may be set to
-use up to 16 registers in the register file.
-
-One separate CSR table is needed for each of the integer and floating-point
-register files:
-
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0 | vlen0 |
-| r1 | vlen1 |
-| .. | vlen.. |
-| r31 | vlen31 |
-
-An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
-whether a register was, if referred to in any standard instructions,
-implicitly to be treated as a vector. A vector length of 1 indicates
-that it is to be treated as a scalar. Vector lengths of 0 are reserved.
-
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
-
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth is specifically not set) it becomes:
-
- CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
-
-This is in contrast to RVV:
-
- CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
-
-## Element (SIMD) bitwidth CSRs
-
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
-
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0 | vew0 |
-| r1 | vew1 |
-| .. | vew.. |
-| r31 | vew31 |
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
-
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
-
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
-
- vew = CSRbitwidth[rs1]
- if (vew == 0)
- bytesperreg = (XLEN/8) # or FLEN as appropriate
- else:
- bytesperreg = bytestable[vew] # 1 2 4 8 16
- simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
- vlen = CSRvectorlen[rs1] * simdmult
- CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
-
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
+of detecting early page / segmentation faults and adjusting the TLB
+in advance, accordingly: other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
# Exceptions
# Impementing V on top of Simple-V
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
- as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential. This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+ implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+ opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+ of Simple-V as, again, there is no corresponding RV Standard Extension
+ that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+ complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+ of fitting on top of the integer and floating-point files. An
+ "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+ registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+ register file size is 5 bits (32 registers), whilst the "Extended"
+ variant of RVV specifies 8 bits (256 registers) and has yet to
+ be published.
+* One big difference: per Sections 17.12 and 17.17, there are only two possible
+ predication registers in RVV "Base". Through the "indirect" method,
+ Simple-V provides a key-value CSR table that allows (arbitrarily)
+ up to 16 (TBD) of either the floating-point or integer registers to
+ be marked as "predicated" (key), and if so, which integer register to
+ use as the predication mask (value).
+
+**TODO**
# Implementing P (renamed to DSP) on top of Simple-V
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.
+### [[harmonised_rvv_rvp]]
+
+This is an interesting proposal under development to retro-fit the AndesStar
+P-Ext into V-Ext.
+
### [[alt_rvp]]
Primary benefit of Alt-RVP is the simplicity with which parallelism
designed to be slotted in to an existing implementation (just after
instruction decode) with minimum disruption and effort.
-* minus: the complexity of having to use register renames, OoO, VLIW,
- register file cacheing, all of which has been done before but is a
- pain
+* minus: the complexity (if full parallelism is to be exploited)
+ of having to use register renames, OoO, VLIW, register file cacheing,
+ all of which has been done before but is a pain
* plus: transparent re-use of existing opcodes as-is just indirectly
saying "this register's now a vector" which
* plus: means that future instructions also get to be inherently
a SIMD architecture where the ALU becomes responsible for the parallelism,
Alt-RVP ALUs would likewise be so responsible... with *additional*
(lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
at least one dimension are avoided (architectural upgrades introducing
128-bit then 256-bit then 512-bit variants of the exact same 64-bit
SIMD block)
operations, all the while keeping a consistent ISA-level "API" irrespective
of implementor design choices (or indeed actual implementations).
+### Example Instruction translation: <a name="example_translation"></a>
+
+The instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
+
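The translation above can be sketched as a simple expansion loop. This is a minimal illustration, not the proposal's actual decode logic: the function name `expand_vector_op` is hypothetical, and it assumes that r4 has been marked as a vector of length 3 while r2 is left as a scalar, which is what the three generated instructions imply.

```python
from collections import deque


def expand_vector_op(opcode, rd, rs1, rs2, vector_len, vectorised):
    """Expand one instruction into per-element scalar instructions.

    Registers listed in `vectorised` advance by one per element;
    all other registers are treated as scalars and stay fixed.
    """
    fifo = deque()
    for i in range(vector_len):
        regs = [r + i if r in vectorised else r for r in (rd, rs1, rs2)]
        fifo.append((opcode, *regs))
    return fifo


# Assumption: r4 is marked as a vector of length 3, r2 is scalar.
fifo = expand_vector_op("ADD", 2, 4, 4, vector_len=3, vectorised={4})
for op, rd, rs1, rs2 in fifo:
    print(f"{op} r{rd} r{rs1} r{rs2}")
```

No hardware parallelism is implied by this: the generated instructions are simply queued in order, and how many execute at once is left to the implementation.
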
## Example of vector / vector, vector / scalar, scalar / scalar => vector add
register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
adequate space to interpret it in a similar fashion:
[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
+imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
This would become:
a predication target as well.
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | op |
+3 | 3 | 3 | 5 | 2 |
+C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]
Now uses the CS format:
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
"""]]
Bit 6 would be decoded as "operation refers to Integer or Float" including
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
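As a rough sketch, the "bytestable" column can be expressed as a lookup from the 3-bit vew field to an element size in bytes. The function name `bytes_per_element` and the choice of XLEN=64 as the default are assumptions made for illustration; the reserved encodings are rejected here, although the actual behaviour for them is not specified in the table.

```python
XLEN = 64  # assumption for illustration; RV32 would use 32


def bytes_per_element(vew, xlen=XLEN):
    """Map the 3-bit vew field to an element size in bytes (bytestable)."""
    bytestable = {
        0b000: xlen // 8,  # default: full register width
        0b001: 1,          # 8-bit
        0b010: 2,          # 16-bit
        0b011: 4,          # 32-bit
        0b100: 8,          # 64-bit
        0b101: 16,         # 128-bit
    }
    if vew not in bytestable:
        # 0b110 and 0b111 are marked "rsvd" in the table
        raise ValueError(f"reserved vew encoding: {vew:#05b}")
    return bytestable[vew]


print(bytes_per_element(0b011))           # 32-bit elements
print(bytes_per_element(0b000, xlen=32))  # default width on RV32
```
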
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
byteidx * 8, # low
byteidx * 8 + (vew-1), # high
-### Example Instruction translation: <a name="example_translation"></a>
-
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
### Insights
SIMD register file splitting still to consider. For RV64, benefits of doubling
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
-## Virtual Memory page-faults
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
> riscv-isa-manual in order to work out how to re-map RVV onto the standard
(particularly ones already decoded and moved into the execution FIFO)
would still be there (and stalled). hmmm.
+----
+
+ > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+ > > VSETL r0, 8
+ > > FADD x1, x2, x3
+ >
+ > > x3[0]: ok
+ > > x3[1]: exception
+ > > x3[2]: ok
+ > > ...
+ > > ...
+ > > x3[7]: ok
+ >
+ > > what happens to result elements 2-7? those may be *big* results
+ > > (RV128)
+ > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+ >
+ > (you replied:)
+ >
+ > Thrown away.
+
+Discussion then led to the question of OoO architectures:
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it. It's hard to debug. You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility. (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity, from minimum compliance (for ultra-light-weight
+embedded systems, or to reduce design complexity and the burden of
+design implementation and compliance in non-critical areas) right the
+way through to high-performance systems.
+
+* Fully (or partially) software-emulated (via traps): full support for CSRs
+ is required; however, when a register is used that is detected (in hardware)
+ to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comprehensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targeted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases results in a nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Avoiding destruction of destination indices requires a copy of the entire
+vector in advance.
+
+TBD: floating-point compare and other exception handling
+
# References
* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
* Discussion on RVV "re-entrant" capabilities allowing operations to be
restarted if an exception occurs (VM page-table miss)
<https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
+* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>
+* RVV slides 2017 <https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
+* Wavefront skipping using BRAMS <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2015.pdf>
+* Streaming Pipelines <http://www.ece.ubc.ca/~lemieux/publications/severance-fpga2014.pdf>