*Actual* internal hardware-level parallelism is *not* required, such
that Simple-V may be viewed as providing a "compact" or "consolidated"
means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FILO), pending execution.
+instruction queue (FIFO), pending execution.
*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.
-Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
+The existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:
analysis and review purposes) prohibitively expensive
* Both contain partial duplication of pre-existing RISC-V instructions
(an undesirable characteristic)
-* Both have independent and disparate methods for introducing parallelism
- at the instruction level.
+* Both have independent, incompatible and disparate methods for introducing
+ parallelism at the instruction level
* Both require that their respective parallelism paradigm be implemented
along-side and integral to their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
* Vectorisation typically includes much more comprehensive memory load
and store schemes (unit stride, constant-stride and indexed), which
in turn have ramifications: virtual memory misses (TLB cache misses)
- and even multiple page-faults... all caused by a *single instruction*.
+ and even multiple page-faults... all caused by a *single instruction*,
+ yet with a clear benefit that the regularisation of LOAD/STOREs can
+ be optimised for minimal impact on caches and maximised throughput.
* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
to pages), and these load/stores have absolutely nothing to do with the
- SIMD / ALU engine, no matter how wide the operand.
+ SIMD / ALU engine, no matter how wide the operand. Simplicity but with
+ more impact on instruction and data caches.
Overall it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
# Analysis and discussion of Vector vs SIMD
-There are five combined areas between the two proposals that help with
-parallelism without over-burdening the ISA with a huge proliferation of
+There are six combined areas between the two proposals that help with
+parallelism (increased performance, reduced power / area) without
+over-burdening the ISA with a huge proliferation of
instructions:
* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops.
+* Single-instruction LOAD/STORE.
* Masks / tagging (selecting/preventing certain indexed elements from execution)
The pros and cons of each are discussed and analysed below.
to direct the operation to a correctly-sized width ALU engine, anyway.
Not least: in places where an ISA was previously constrained (due for
-whatever reason, including limitations of the available operand spcace),
+whatever reason, including limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.
inner loop seems inadequate, tending to suggest that ZOLC may be
better off being proposed as an entirely separate Extension.
+## Single-instruction LOAD/STORE
+
+In traditional Vector Architectures there are instructions which
+result in multiple register-memory transfer operations resulting
+from a single instruction. They're complicated to implement in hardware,
+yet the benefits are a huge consistent regularisation of memory accesses
+that can be highly optimised with respect to both actual memory and any
+L1, L2 or other caches. In Hwacha EECS-2015-263 it is explicitly made
+clear the consequences of getting this architecturally wrong:
+L2 cache-thrashing at the very least.
+
+Complications arise when Virtual Memory is involved: TLB cache misses
+need to be dealt with, as do page faults. Some of the tradeoffs are
+discussed in <http://people.eecs.berkeley.edu/~krste/thesis.pdf>, Section
+4.6, and an article by Jeff Bush when faced with some of these issues
+is particularly enlightening
+<https://jbush001.github.io/2015/11/03/lost-in-translation.html>
+
+Interestingly, none of this complexity is faced in SIMD architectures...
+but then they do not get the opportunity to optimise for highly-streamlined
+memory accesses either.
+
+With the "bang-per-buck" ratio being so high and the indirect improvement
+in L1 Instruction Cache usage (reduced instruction count), as well as
+the opportunity to optimise L1 and L2 cache usage, the case for including
+Vector LOAD/STORE is compelling.
+
## Mask and Tagging (Predication)
Tagging (aka Masks aka Predication) is a pseudo-method of implementing
* explicit compare and branch: BNE x, y -> offs would jump offs
instructions if x was not equal to y
* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) ADD results in a carry, carry bit implicitly
- (or sometimes explicitly) goes into a "tag" (mask) register
+* implicit (condition-code) such as ADD results in a carry, carry bit
+ implicitly (or sometimes explicitly) goes into a "tag" (mask) register
The first of these is a "normal" branch method, which is flat-out impossible
to parallelise without look-ahead and effectively rewriting instructions.
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Single-instruction Vector LOAD/STORE: <b>Complex but highly beneficial</b>
* Tag or no-tag: <b>Complex but highly beneficial</b>
In particular:
i.e. *without* requiring a super-scalar or out-of-order architecture,
but doing a proper, full job (ZOLC) is an entirely different matter.
-Constructing a SIMD/Simple-Vector proposal based around four of these five
+Constructing a SIMD/Simple-Vector proposal based around four of these six
requirements would therefore seem to be a logical thing to do.
-# Instruction Format
-
-The instruction format for Simple-V does not actually have *any* compare
-operations, *any* arithmetic, floating point or memory instructions.
-Instead it *overloads* pre-existing branch operations into predicated
-variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth. This includes Compressed instructions.
-
-* For analysis of RVV see [[v_comparative_analysis]] which begins to
- outline topologically-equivalent mappings of instructions
-* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
- for format of Branch opcodes.
-
-**TODO**: *analyse and decide whether the implicit nature of predication
-as proposed is or is not a lot of hassle, and if explicit prefixes are
-a better idea instead. Parallelism therefore effectively may end up
-as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
-with some opportunities for to use Compressed bringing it down to 48.
-Also to consider is whether one or both of the last two remaining Compressed
-instruction codes in Quadrant 1 could be used as a parallelism prefix,
-bringing parallelised opcodes down to 32-bit and having the benefit of
-being explicit.*
-
-## Branch Instruction:
-
-[[!table data="""
-31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
-I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
-0 | reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
-0 | reserved | src2 | src1 | 001 | predicate rs3 || BNE |
-0 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
-0 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
-0 | reserved | src2 | src1 | 100 | predicate rs3 || BLE |
-0 | reserved | src2 | src1 | 101 | predicate rs3 || BGE |
-0 | reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
-0 | reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
-1 | reserved | src2 | src1 | 000 | predicate rs3 || FEQ |
-1 | reserved | src2 | src1 | 001 | predicate rs3 || FNE |
-1 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
-1 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
-1 | reserved | src2 | src1 | 100 | predicate rs3 || FLT |
-1 | reserved | src2 | src1 | 101 | predicate rs3 || FLE |
-1 | reserved | src2 | src1 | 110 | predicate rs3 || rsvd |
-1 | reserved | src2 | src1 | 111 | predicate rs3 || rsvd |
-"""]]
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
- for (int i=0; i<vl; ++i)
- if ([!]preg[p][i])
- preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
- if I/F == INT: # integer type cmp
- pred_enabled = int_pred_enabled # TODO: exception if not set!
- preg = int_pred_reg[rd]
- else:
- pred_enabled = fp_pred_enabled # TODO: exception if not set!
- preg = fp_pred_reg[rd]
-
- s1 = CSRvectorlen[src1] > 1;
- s2 = CSRvectorlen[src2] > 1;
- for (int i=0; i<vl; ++i)
- preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
- s2 ? reg[src2+i] : reg[src2]);
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
- into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length * (number of SIMD elements) bits
- in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually have an adjustment to the Program
- Counter, so all of bits 25 through 30 in every case are not needed.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
- be put to good use if there is a suitable use-case.
-* FEQ and FNE (and BEQ and BNE) are included in order to save one
- instruction having to invert the resultant predicate bitfield.
- FLT and FLE may be inverted to FGT and FGE if needed by swapping
- src1 and src2 (likewise the integer counterparts).
-
-## Compressed Branch Instruction:
-
-[[!table data="""
-15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
-funct3 | imm | rs10 | imm | | op | |
-3 | 3 | 3 | 2 | 3 | 2 | |
-C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
-110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
-111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
-110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
-111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
-"""]]
-
-Notes:
-
-* Bits 5 13 14 and 15 make up the comparator type
-* In both floating-point and integer cases there are four predication
- comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
- src1 and src2).
-
-## LOAD / STORE Instructions
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
-
-Revised LOAD:
-
-[[!table data="""
-31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
-imm[11:0] |||| rs1 | funct3 | rd | opcode |
-1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
-? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format. Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
- (for both integer and floating-point variants).
-* Predication CSR-marking register is not explicitly shown in instruction, it's
- implicit based on the CSR predicate state for the rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
- is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth):
-
- if (unit-strided) stride = elsize;
- else stride = areg[as2]; // constant-strided
-
- pred_enabled = int_pred_enabled
- preg = int_pred_reg[rd]
-
- for (int i=0; i<vl; ++i)
- if (preg_enabled[rd] && [!]preg[i])
- for (int j=0; j<seglen+1; j++)
- {
- if CSRvectorised[rs2])
- offs = vreg[rs2][i]
- else
- offs = i*(seglen+1)*stride;
- vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
- }
-
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
-
-A similar instruction exists for STORE, with identical topological
-translation of all features. **TODO**
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are of the same format, where bits 2-4 are
-a src register instead of dest:
-
-[[!table data="""
-15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
-funct3 | imm | rs10 | imm | rd0 | op |
-3 | 3 | 3 | 2 | 3 | 2 |
-C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled. In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
-page 76, "For virtual memory systems some data accesses could be resident
-in physical memory and some not". The interesting question then arises:
-how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults.
-
# Note on implementation of parallelism
One extremely important aspect of this proposal is to respect and support
It is absolutely critical to note that it is proposed that such choices MUST
be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (varible-width SIM) may not precisely match the width of the
+a Vector (varible-width SIMD) may not precisely match the width of the
parallelism within the implementation, the end-user **should not care**
and in this way the performance benefits are gained but the ISA remains
straightforward. All that happens at the end of an instruction run is: some
parallel units (if there are any) would remain offline, completely
transparently to the ISA, the program, and the compiler.
-The "SIMD considered harmful" trap of having huge complexity and extra
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
instructions to deal with corner-cases is thus avoided, and implementors
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".
+In addition, implementors will be free to choose whether to provide an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected), all the way up to full supercomputing
+level all-hardware parallelism. Options are covered in the Appendix.
+
# CSRs <a name="csrs"></a>
There are a number of CSRs needed, which are used at the instruction
-decode phase to re-interpret standard RV opcodes (a practice that has
+decode phase to re-interpret RV opcodes (a practice that has
precedent in the setting of MISA to enable / disable extensions).
* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
"bitwidth" may fit into an XLEN-sized register file.
* Predication is a key-value store due to the implicit referencing,
as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
## Predication CSR
An array of 32 4-bit CSRs is needed (4 bits per register) to indicate
whether a register was, if referred to in any standard instructions,
-implicitly to be treated as a vector. A vector length of 1 indicates
-that it is to be treated as a scalar. Vector lengths of 0 are reserved.
+implicitly to be treated as a vector.
+
+Note:
+
+* A vector length of 1 indicates that it is to be treated as a scalar.
+ Bitwidths (on the same register) are interpreted and meaningful.
+* A vector length of 0 indicates that the parallelism is to be switched
+ off for this register (treated as a scalar). When length is 0,
+ the bitwidth CSR for the register is *ignored*.
Internally, implementations may choose to use the non-zero vector length
to set a bit-field per register, to be used in the instruction decode phase.
An example of how to subdivide the register file when bitwidth != default
is given in the section "Bitwidth Virtual Register Reordering".
+# Instructions
+
+By being a topological remap of RVV concepts, the following RVV instructions
+remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT, VMERGE, VSELECT,
+VSLIDE, VCLASS and VPOPC. Two instructions, VCLIP and VCLIPI, do not
+have RV Standard equivalents, so are left out of Simple-V.
+All other instructions from RVV are topologically re-mapped and retain
+their complete functionality, intact.
+
+## Instruction Format
+
+The instruction format for Simple-V does not actually have *any* explicit
+compare operations, *any* arithmetic, floating point or *any*
+memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on CSR configurations for vector length, bitwidth and
+predication. *This includes Compressed instructions* as well as any
+future instructions and Custom Extensions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+ outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities for to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit (when combined with C)
+and having the benefit of being explicit.*
+
+## Branch Instruction:
+
+Branch operations use standard RV opcodes that are reinterpreted to be
+"predicate variants" in the instance where either of the two src registers
+have their corresponding CSRvectorlen[src] entry as non-zero. When this
+reinterpretation is enabled the predicate target register rs3 is to be
+treated as a bitfield (up to a maximum of XLEN bits corresponding to a
+maximum of XLEN elements).
+
+If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
+goes ahead as vector-scalar or scalar-vector. Implementors should note that
+this could require considerable multi-porting of the register file in order
+to parallelise properly, so may have to involve the use of register cacheing
+and transparent copying (see Multiple-Banked Register File Architectures
+paper).
+
+In instances where no vectorisation is detected on either src registers
+the operation is treated as an absolutely standard scalar branch operation.
+
+This is the overloaded table for Integer-base Branch operations. Opcode
+(bits 6..0) is set in all cases to 1100011.
+
+[[!table data="""
+31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+7 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+reserved | src2 | src1 | 100 | predicate rs3 || BLE |
+reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+"""]]
+
+Note that just as with the standard (scalar, non-predicated) branch
+operations, BLT, BGT, BLEU and BTGU may be synthesised by inverting
+src1 and src2.
+
+Below is the overloaded table for Floating-point Predication operations.
+Interestingly no change is needed to the instruction format because
+FP Compare already stores a 1 or a zero in its "rd" integer register
+target, i.e. it's not actually a Branch at all: it's a compare.
+The target needs to simply change to be a predication bitfield (done
+implicitly).
+
+As with
+Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
+Likewise Single-precision, fmt bits 26..25) is still set to 00.
+Double-precision is still set to 01, whilst Quad-precision
+appears not to have a definition in V2.3-Draft (but should be unaffected).
+
+It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
+and whilst in ordinary branch code this is fine because the standard
+RVF compare can always be followed up with an integer BEQ or a BNE (or
+a compressed comparison to zero or non-zero), in predication terms that
+becomes more of an impact as an explicit (scalar) instruction is needed
+to invert the predicate bitmask. An additional encoding funct3=011 is
+therefore proposed to cater for this.
+
+[[!table data="""
+31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
+funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
+5 | 2 | 5 | 5 | 3 | 4 | 7 |
+10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
+10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | FNE |
+10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
+10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
+"""]]
+
+Note (**TBD**): floating-point exceptions will need to be extended
+to cater for multiple exceptions (and statuses of the same). The
+usual approach is to have an array of status codes and bit-fields,
+and one exception, rather than throw separate exceptions for each
+Vector element.
+
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+ for (int i=0; i<vl; ++i)
+ if ([!]preg[p][i])
+ preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
+ s2 ? vreg[rs2][i] : sreg[rs2]);
+
+With associated predication, vector-length adjustments and so on,
+and temporarily ignoring bitwidth (which makes the comparisons more
+complex), this becomes:
+
+ if I/F == INT: # integer type cmp
+ pred_enabled = int_pred_enabled # TODO: exception if not set!
+ preg = int_pred_reg[rd]
+ reg = int_regfile
+ else:
+ pred_enabled = fp_pred_enabled # TODO: exception if not set!
+ preg = fp_pred_reg[rd]
+ reg = fp_regfile
+
+ s1 = CSRvectorlen[src1] > 1;
+ s2 = CSRvectorlen[src2] > 1;
+ for (int i=0; i<vl; ++i)
+ preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
+ s2 ? reg[src2+i] : reg[src2]);
+
+Notes:
+
+* Predicated SIMD comparisons would break src1 and src2 further down
+ into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
+ Reordering") setting Vector-Length times (number of SIMD elements) bits
+ in Predicate Register rs3 as opposed to just Vector-Length bits.
+* Predicated Branches do not actually have an adjustment to the Program
+ Counter, so all of bits 25 through 30 in every case are not needed.
+* There are plenty of reserved opcodes for which bits 25 through 30 could
+ be put to good use if there is a suitable use-case.
+* FEQ and FNE (and BEQ and BNE) are included in order to save one
+ instruction having to invert the resultant predicate bitfield.
+ FLT and FLE may be inverted to FGT and FGE if needed by swapping
+ src1 and src2 (likewise the integer counterparts).
+
+## Compressed Branch Instruction:
+
+[[!table data="""
+15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
+funct3 | imm | rs10 | imm | | op | |
+3 | 3 | 3 | 2 | 3 | 2 | |
+C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
+110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
+111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
+110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
+111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
+"""]]
+
+Notes:
+
+* Bits 5 13 14 and 15 make up the comparator type
+* Bit 6 indicates whether to use integer or floating-point comparisons
+* In both floating-point and integer cases there are four predication
+ comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
+ src1 and src2).
+
+## LOAD / STORE Instructions
+
+For full analysis of topological adaptation of RVV LOAD/STORE
+see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
+may be implicitly overloaded into the one base RV LOAD instruction,
+and likewise for STORE.
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] |||| rs1 | funct3 | rd | opcode |
+1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+The exact same corresponding adaptation is also carried out on the single,
+double and quad precision floating-point LOAD-FP and STORE-FP operations,
+which fit the exact same instruction format. Thus all three types
+(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
+as well as FSW, FSD and FSQ.
+
+Notes:
+
+* LOAD remains functionally (topologically) identical to RVV LOAD
+ (for both integer and floating-point variants).
+* Predication CSR-marking register is not explicitly shown in instruction, it's
+ implicit based on the CSR predicate state for the rd (destination) register
+* rs2, the source, may *also be marked as a vector*, which implicitly
+ is taken to indicate "Indexed Load" (LD.X)
+* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
+* Bit 31 is reserved (ideas under consideration: auto-increment)
+* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
+* **TODO**: clarify where width maps to elsize
+
+Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
+
+ if (unit-strided) stride = elsize;
+ else stride = areg[as2]; // constant-strided
+
+ pred_enabled = int_pred_enabled
+ preg = int_pred_reg[rd]
+
+ for (int i=0; i<vl; ++i)
+ if (preg_enabled[rd] && [!]preg[i])
+ for (int j=0; j<seglen+1; j++)
+ {
+ if CSRvectorised[rs2])
+ offs = vreg[rs2][i]
+ else
+ offs = i*(seglen+1)*stride;
+ vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
+ }
+
+Taking CSR (SIMD) bitwidth into account involves using the vector
+length and register encoding according to the "Bitwidth Virtual Register
+Reordering" scheme shown in the Appendix (see function "regoffs").
+
+A similar instruction exists for STORE, with identical topological
+translation of all features. **TODO**
+
+## Compressed LOAD / STORE Instructions
+
+Compressed LOAD and STORE are of the same format, where bits 2-4 are
+a src register instead of dest:
+
+[[!table data="""
+15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
+funct3 | imm | rs10 | imm | rd0 | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
+"""]]
+
+Unfortunately it is not possible to fit the full functionality
+of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
+require another operand (rs2) in addition to the operand width
+(which is also missing), offset, base, and src/dest.
+
+However a close approximation may be achieved by taking the top bit
+of the offset in each of the five types of LD (and ST), reducing the
+offset to 4 bits and utilising the 5th bit to indicate whether "stride"
+is to be enabled. In this way it is at least possible to introduce
+that functionality.
+
+(**TODO**: *assess whether the loss of one bit from offset is worth having
+"stride" capability.*)
+
+We also assume (including for the "stride" variant) that the "width"
+parameter, which is missing, is derived and implicit, just as it is
+with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
+and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
+C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
+
+Interestingly we note that the Vectorised Simple-V variant of
+LOAD/STORE (Compressed and otherwise), due to it effectively using the
+standard register file(s), is the direct functional equivalent of
+standard load-multiple and store-multiple instructions found in other
+processors.
+
+In Section 12.3 riscv-isa manual V2.3-draft it is noted the comments on
+page 76, "For virtual memory systems some data accesses could be resident
+in physical memory and some not". The interesting question then arises:
+how does RVV deal with the exact same scenario?
+Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
+of detecting early page / segmentation faults and adjusting the TLB
+in advance, accordingly: other strategies are explored in the Appendix
+Section "Virtual Memory Page Faults".
+
# Exceptions
> What does an ADD of two different-sized vectors do in simple-V?
# Impementing V on top of Simple-V
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
- as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
+With Simple-V converting the original RVV draft concept-for-concept
+from explicit opcodes to implicit overloading of existing RV Standard
+Extensions, certain features were (deliberately) excluded that need
+to be added back in for RVV to reach its full potential. This is
+made slightly complicated by the fact that RVV itself has two
+levels: Base and reserved future functionality.
+
+* Representation Encoding is entirely left out of Simple-V in favour of
+ implicitly taking the exact (explicit) meaning from RV Standard Extensions.
+* VCLIP and VCLIPI do not have corresponding RV Standard Extension
+ opcodes (and are the only such operations).
+* Extended Element bitwidths (1 through to 24576 bits) were left out
+ of Simple-V as, again, there is no corresponding RV Standard Extension
+ that covers anything even below 32-bit operands.
+* Polymorphism was entirely left out of Simple-V due to the inherent
+ complexity of automatic type-conversion.
+* Vector Register files were specifically left out of Simple-V in favour
+ of fitting on top of the integer and floating-point files. An
+ "RVV re-retro-fit" needs to be able to mark (implicitly marked)
+ registers as being actually in a separate *vector* register file.
+* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
+ register file size is 5 bits (32 registers), whilst the "Extended"
+ variant of RVV specifies 8 bits (256 registers) and has yet to
+ be published.
+* One big difference: Sections 17.12 and 17.17, there are only two possible
+ predication registers in RVV "Base". Through the "indirect" method,
+ Simple-V provides a key-value CSR table that allows (arbitrarily)
+ up to 16 (TBD) of either the floating-point or integer registers to
+ be marked as "predicated" (key), and if so, which integer register to
+ use as the predication mask (value).
+
+**TODO**
# Implementing P (renamed to DSP) on top of Simple-V
designed to be slotted in to an existing implementation (just after
instruction decode) with minimum disruption and effort.
-* minus: the complexity of having to use register renames, OoO, VLIW,
- register file cacheing, all of which has been done before but is a
- pain
+* minus: the complexity (if full parallelism is to be exploited)
+ of having to use register renames, OoO, VLIW, register file cacheing,
+ all of which has been done before but is a pain
* plus: transparent re-use of existing opcodes as-is just indirectly
saying "this register's now a vector" which
* plus: means that future instructions also get to be inherently
a SIMD architecture where the ALU becomes responsible for the parallelism,
Alt-RVP ALUs would likewise be so responsible... with *additional*
(lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+* Thus at least some of the downsides of SIMD ISA O(N^5) proliferation by
at least one dimension are avoided (architectural upgrades introducing
128-bit then 256-bit then 512-bit variants of the exact same 64-bit
SIMD block)
operations, all the while keeping a consistent ISA-level "API" irrespective
of implementor design choices (or indeed actual implementations).
+### Example Instruction translation: <a name="example_translation"></a>
+
+Instructions "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
+
## Example of vector / vector, vector / scalar, scalar / scalar => vector add
register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
adequate space to interpret it in a similar fashion:
[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 |
+imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]
This would become:
a predication target as well.
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | op |
+3 | 3 | 3 | 5 | 2 |
+C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]
Now uses the CS format:
[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 0 |
+funct3 | imm | rs10 | imm | | op |
+3 | 3 | 3 | 2 | 3 | 2 |
+C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 |
"""]]
Bit 6 would be decoded as "operation refers to Integer or Float" including
vew may be one of the following (giving a table "bytestable", used below):
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
+| vew | bitwidth | bytestable |
+| --- | -------- | ---------- |
+| 000 | default | XLEN/8 |
+| 001 | 8 | 1 |
+| 010 | 16 | 2 |
+| 011 | 32 | 4 |
+| 100 | 64 | 8 |
+| 101 | 128 | 16 |
+| 110 | rsvd | rsvd |
+| 111 | rsvd | rsvd |
Pseudocode for vector length taking CSR SIMD-bitwidth into account:
byteidx * 8, # low
byteidx * 8 + (vew-1), # high
-### Example Instruction translation: <a name="example_translation"></a>
-
-Instructions "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
### Insights
SIMD register file splitting still to consider. For RV64, benefits of doubling
Whilst the above may seem to be severe minuses, there are some strong
pluses:
-* Significant reduction of V's opcode space: over 85%.
+* Significant reduction of V's opcode space: over 95%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
due to the overloading of register meaning (implicit vectorisation,
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.
-## Virtual Memory page-faults
+## Virtual Memory page-faults on LOAD/STORE
+
+
+### Notes from conversations
> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
> riscv-isa-manual in order to work out how to re-map RVV onto the standard
(particularly ones already decoded and moved into the execution FIFO)
would still be there (and stalled). hmmm.
+----
+
+ > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
+ > > VSETL r0, 8
+ > > FADD x1, x2, x3
+ >
+ > > x3[0]: ok
+ > > x3[1]: exception
+ > > x3[2]: ok
+ > > ...
+ > > ...
+ > > x3[7]: ok
+ >
+ > > what happens to result elements 2-7? those may be *big* results
+ > > (RV128)
+ > > or in the RVV-Extended may be arbitrary bit-widths far greater.
+ >
+ > (you replied:)
+ >
+ > Thrown away.
+
+discussion then led to the question of OoO architectures
+
+> The costs of the imprecise-exception model are greater than the benefit.
+> Software doesn't want to cope with it. It's hard to debug. You can't
+> migrate state between different microarchitectures--unless you force all
+> implementations to support the same imprecise-exception model, which would
+> greatly limit implementation flexibility. (Less important, but still
+> relevant, is that the imprecise model increases the size of the context
+> structure, as the microarchitectural guts have to be spilled to memory.)
+
+
+## Implementation Paradigms <a name="implementation_paradigms"></a>
+
+TODO: assess various implementation paradigms. These are listed roughly
+in order of simplicity (minimum compliance, for ultra-light-weight
+embedded systems or to reduce design complexity and the burden of
+design implementation and compliance, in non-critical areas), right the
+way to high-performance systems.
+
+* Full (or partial) software-emulated (via traps): full support for CSRs
+ required, however when a register is used that is detected (in hardware)
+ to be vectorised, an exception is thrown.
+* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
+* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
+* Out-of-order with instruction FIFOs and aggressive register-renaming
+* VLIW
+
+Also to be taken into consideration:
+
+* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
+* Comphrensive vectorisation: FIFOs and internal parallelism
+* Hybrid Parallelism
+
+# TODO Research
+
+> For great floating point DSPs check TI’s C3x, C4X, and C6xx DSPs
+
+Idea: basic simple butterfly swap on a few element indices, primarily targetted
+at SIMD / DSP. High-byte low-byte swapping, high-word low-word swapping,
+perhaps allow reindexing of permutations up to 4 elements? 8? Reason:
+such operations are less costly than a full indexed-shuffle, which requires
+a separate instruction cycle.
+
+Predication "all zeros" needs to be "leave alone". Detection of
+ADD r1, rs1, rs0 cases result in nop on predication index 0, whereas
+ADD r0, rs1, rs2 is actually a desirable copy from r2 into r0.
+Destruction of destination indices requires a copy of the entire vector
+in advance to avoid.
+
+TBD: floating-point compare and other exception handling
+
# References
* SIMD considered harmful <https://www.sigarch.org/simd-instructions-considered-harmful/>
* Discussion on RVV "re-entrant" capabilities allowing operations to be
restarted if an exception occurs (VM page-table miss)
<https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/IuNFitTw9fM/CCKBUlzsAAAJ>
+* Dot Product Vector <https://people.eecs.berkeley.edu/~biancolin/papers/arith17.pdf>