X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=1f479badad0209ee49ebd85864e2376ee9613fb4;hb=a537f8e87eaf740e6eadb4517e1f93c8112bb3cb;hp=ad8fd3a5d13949a3f17c0b66937f8bb23c51a32a;hpb=4d66d4c9900081d26aa4a9fe7c84464e629a805b;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index ad8fd3a5d..1f479bada 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,21 @@
# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+[[!toc ]]
+
+# Introduction
+
This proposal exists so as to be able to satisfy several disparate
requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
@@ -9,7 +25,7 @@ Additionally, the existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
so need work to adapt to the RISC-V ethos and paradigm
* Are sufficiently large so as to make adoption (and exploration for
analysis and review purposes) prohibitively expensive
@@ -32,40 +48,27 @@ would provide even more flexibility).
Additionally it makes sense to *split out* the parallelism inherent within
each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, would work well.
-
-**TODO**: reword this to better suit this document:
+a "best-of-both" parallelism extension, could be added *on top* of
+this proposal, to topologically provide the exact same functionality of
+each of P and V. Each of P and V then can focus on providing the best
+operations possible for their respective target areas, without being
+hugely concerned about the actual parallelism.
-Having looked at both P and V as they stand, they're _both_ very much
-"separate engines" that, despite both their respective merits and
-extremely powerful features, don't really cleanly fit into the RV design
-ethos (or the flexible extensibility) and, as such, are both in danger
-of not being widely adopted. I'm inclined towards recommending:
-
-* splitting out the DSP aspects of P-SIMD to create a single-issue DSP
-* splitting out the polymorphism, esoteric data types (GF, complex
- numbers) and unusual operations of V to create a single-issue "Esoteric
- Floating-Point" extension
-* splitting out the loop-aspects, vector aspects and data-width aspects
- of both P and V to a *new* "P-SIMD / Simple-V" and requiring that they
- apply across *all* Extensions, whether those be DSP, M, Base, V, P -
- everything.
+Furthermore, an additional goal of this proposal is to reduce the number
+of opcodes utilised by each of P and V as they currently stand, leveraging
+existing RISC-V opcodes where possible, and also potentially allowing
+P and V to make use of Compressed Instructions as a result.
**TODO**: propose overflow registers be actually one of the integer regs
(flowing to multiple regs).
**TODO**: propose "mask" (predication) registers likewise. combination with
-standard RV instructions and overflow registers extremely powerful
-
-**TODO**: propose two LOAD/STORE offset CSRs, which mark a particular
-register as being "if you use this reg in LOAD/STORE, use the offset
-amount CSRoffsN (N=0,1) instead of treating LOAD/STORE as contiguous".
-can be used for matrix spanning.
-
+standard RV instructions and overflow registers extremely powerful, see
+Aspex ASP.
# Analysis and discussion of Vector vs SIMD
-There are four combined areas between the two proposals that help with
+There are five combined areas between the two proposals that help with
parallelism without over-burdening the ISA with a huge proliferation of
instructions:
@@ -90,6 +93,13 @@ Thus, SIMD, no matter what width is chosen, is never going to be acceptable
for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.
+To explain this further: for increased workloads over time, as the
+performance requirements increase for new target markets, implementors
+choose to extend the SIMD width (so as to again avoid mixing parallelism
+into the instruction issue phases: the primary "simplicity" benefit of
+SIMD in the first place), with the result that the entire opcode space
+effectively doubles with each new SIMD width that's added to the ISA.
+
That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
@@ -125,13 +135,16 @@ integer (and floating point) of various sizes is automatically inferred
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take placce *without* requiring that type-conversion
+type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.
However, implicit type-conversion is not only quite burdensome to
implement (explosion of inferred type-to-type conversion) but also is
never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration.
+also have to be taken into consideration. Each new type enlarges an
+O(N^2) type-to-type conversion space, and anyone who has examined
+python's source code (which has built-in polymorphic type-conversion)
+knows that the task is more complex than it first seems.
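
To make the quadratic growth concrete, the sketch below counts the ordered (source, destination) pairs that an implicit-conversion scheme would have to define. The type names are purely illustrative assumptions and are not taken from the P or V proposals.

```python
from itertools import permutations

def conversion_pairs(types):
    """Return every ordered (src, dst) pair of distinct types."""
    return list(permutations(types, 2))

# Hypothetical tag set: the names are illustrative, not from P or V.
types = ["int8", "int16", "int32", "int64", "fp16", "fp32", "fp64"]

# Each added type requires conversions to and from every existing type.
for n in range(2, len(types) + 1):
    print(n, "types need", len(conversion_pairs(types[:n])), "conversions")
```

With N types there are N*(N-1) such pairs, so adding an eighth type to the seven above would add fourteen more conversions.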
Overall, type-conversion is generally best to leave to explicit
type-conversion instructions, or in definite specific use-cases left to
@@ -144,27 +157,89 @@ contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated indefinitely, a fixed number of times.
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
-
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
-
-## Mask and Tagging
-
-*TODO: research masks as they can be superb and extremely powerful.
-If B-Extension is implemented and provides Bit-Gather-Scatter it
-becomes really cool and easy to switch out certain indexed values
-from an array of data, but actually BGS **on its own** might be
-sufficient. Bottom line, this is complex, and needs a proper analysis.
-The other sections are pretty straightforward.*
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
+
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However, when looking at much more comprehensive schemes such as
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, which suggests that ZOLC may be
+better off being proposed as an entirely separate Extension.
+
+## Mask and Tagging (Predication)
+
+Tagging (aka Masks aka Predication) is a pseudo-method of implementing
+simplistic branching in a parallel fashion, by allowing execution on
+elements of a vector to be switched on or off depending on the results
+of prior operations in the same array position.
+
+The reason for considering this is simple: by *definition* it
+is not possible to perform individual parallel branches in a SIMD
+(Single-Instruction, **Multiple**-Data) context. Branches (modifying
+of the Program Counter) will result in *all* parallel data having
+a different instruction executed on it: that's just the definition of
+SIMD, and it is simply unavoidable.
+
+So these are the ways in which conditional execution may be implemented:
+
+* explicit compare and branch: BNE x, y -> offs would jump offs
+ instructions if x was not equal to y
+* explicit store of tag condition: CMP x, y -> tagbit
+* implicit (condition-code) ADD results in a carry, carry bit implicitly
+ (or sometimes explicitly) goes into a "tag" (mask) register
+
+The first of these is a "normal" branch method, which is flat-out impossible
+to parallelise without look-ahead and effectively rewriting instructions.
+This would defeat the purpose of RISC.
+
+The latter two are where parallelism becomes easy to do without complexity:
+every operation is modified to be "conditionally executed" (in an explicit
+way directly in the instruction format *or* implicitly).
+
+RVV (Vector-Extension) proposes to have *explicit* storing of the compare
+in a tag/mask register, and to *explicitly* have every vector operation
+*require* that its operation be "predicated" on the bits within an
+explicitly-named tag/mask register.
+
+SIMD (P-Extension) has not yet published precise documentation on what its
+schema is to be: there is however verbal indication at the time of writing
+that:
+
+> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
+> be executed using the same compare ALU logic for the base ISA with some
+> minor modifications to handle smaller data types. The function will not
+> be duplicated.
+
+This is an *implicit* form of predication as the base RV ISA does not have
+condition-codes or predication. By adding a CSR it becomes possible
+to also tag certain registers as "predicated if referenced as a destination".
+Example:
+
+    // in future operations from now on, if r0 is the destination use r5 as
+    // the PREDICATION register
+    SET_IMPLICIT_CSRPREDICATE r0, r5
+    // store the compares in r5 as the PREDICATION register
+    CMPEQ8 r5, r1, r2
+    // r0 is used here. ah ha! that means it's predicated using r5!
+    ADD8 r0, r1, r3
+
+With enough registers (and in RISC-V there are enough registers) some fairly
+complex predication can be set up and yet still execute without significant
+stalling, even in a simple non-superscalar architecture.
+
+(For details on how Branch Instructions would be retro-fitted to indirectly
+predicated equivalents, see Appendix)
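
The three-instruction example above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: XLEN is taken to be 32 (four 8-bit elements per register), masked-out elements are assumed to be left unchanged rather than zeroed, and the function names `cmpeq8` and `add8` and the `implicit_pred` dictionary are hypothetical stand-ins for the instructions and the CSR, not part of any specification.

```python
LANES = 4  # assumption: XLEN=32, so four packed 8-bit elements per register

def lane(value, i):
    """Extract 8-bit element i from a packed register value."""
    return (value >> (8 * i)) & 0xFF

def cmpeq8(regs, pred, src1, src2):
    """CMPEQ8: set bit i of regs[pred] where element i of src1 equals src2."""
    mask = 0
    for i in range(LANES):
        if lane(regs[src1], i) == lane(regs[src2], i):
            mask |= 1 << i
    regs[pred] = mask

def add8(regs, implicit_pred, dest, src1, src2):
    """ADD8: packed 8-bit add, predicated if dest has an implicit predicate."""
    pred = implicit_pred.get(dest)  # None means "dest is not predicated"
    result = regs[dest]
    for i in range(LANES):
        if pred is not None and not (regs[pred] >> i) & 1:
            continue  # assumption: masked-out elements keep their old value
        total = (lane(regs[src1], i) + lane(regs[src2], i)) & 0xFF
        result = (result & ~(0xFF << (8 * i))) | (total << (8 * i))
    regs[dest] = result

regs = [0] * 32
implicit_pred = {0: 5}              # SET_IMPLICIT_CSRPREDICATE r0, r5
regs[1] = 0x04030201
regs[2] = 0x04FF02FF
regs[3] = 0x10101010
cmpeq8(regs, 5, 1, 2)               # CMPEQ8 r5, r1, r2
add8(regs, implicit_pred, 0, 1, 3)  # ADD8 r0, r1, r3
print(bin(regs[5]), hex(regs[0]))
```

Only elements 1 and 3 of r1 and r2 match, so only those two elements of r0 are written. Because r5 is an ordinary integer register, it could equally be inverted, shifted or combined with other masks between the compare and the add.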
## Conclusions
-In the above sections the four different ways where parallel instruction
+In the above sections the five different ways where parallel instruction
execution has closely and loosely inter-related implications for the ISA and
for implementors, were outlined. The pluses and minuses came out as
follows:
@@ -172,29 +247,236 @@ follows:
* Fixed vs variable parallelism: variable
* Implicit (indirect) vs fixed (integral) instruction bit-width: indirect
* Implicit vs explicit type-conversion: explicit
-* Implicit vs explicit inner loops: implicit
-* Tag or no-tag: TODO
+* Implicit vs explicit inner loops: implicit but best done separately
+* Tag or no-tag: Complex but highly beneficial
+
+In particular:
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD. Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
+* variable-length vectors came out on top because of the high setup, teardown
+ and corner-cases associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+ former limitations and restrictions (in a backwards-compatible fashion),
+ whilst also leaving implementors free to simplify implementations
+ by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+ potentially 100% occupied in a single-issue in-order implementation
+ i.e. *without* requiring a super-scalar or out-of-order architecture,
+ but doing a proper, full job (ZOLC) is an entirely different matter.
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
+Constructing a SIMD/Simple-Vector proposal based around four of these five
+requirements would therefore seem to be a logical thing to do.
# Instruction Format
-**TODO** *basically borrow from both P and V, which should be quite simple
-to do, with the exception of Tag/no-tag, which needs a bit more
-thought. V's Section 17.19 of Draft V2.3 spec is reminiscent of B's BGS
-gather-scatterer, and, if implemented, could actually be a really useful
-way to span 8-bit up to 64-bit groups of data, where BGS as it stands
-and described by Clifford does **bits** of up to 16 width. Lots to
-look at and investigate!*
+The instruction format for Simple-V does not actually have *any* compare
+operations, *any* arithmetic, floating point or memory instructions.
+Instead it *overloads* pre-existing branch operations into predicated
+variants, and implicitly overloads arithmetic operations and LOAD/STORE
+depending on implicit CSR configurations for both vector length and
+bitwidth. This includes Compressed instructions.
+
+* For analysis of RVV see [[v_comparative_analysis]] which begins to
+ outline topologically-equivalent mappings of instructions
+* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
+ for format of Branch opcodes.
+
+**TODO**: *analyse and decide whether the implicit nature of predication
+as proposed is or is not a lot of hassle, and if explicit prefixes are
+a better idea instead. Parallelism therefore effectively may end up
+as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
+with some opportunities to use Compressed bringing it down to 48.
+Also to consider is whether one or both of the last two remaining Compressed
+instruction codes in Quadrant 1 could be used as a parallelism prefix,
+bringing parallelised opcodes down to 32-bit and having the benefit of
+being explicit.*
+
+## Branch Instruction:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+I/F | reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
+0 | reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
+0 | reserved | src2 | src1 | 001 | predicate rs3 || BNE |
+0 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+0 | reserved | src2 | src1 | 100 | predicate rs3 || BLT |
+0 | reserved | src2 | src1 | 101 | predicate rs3 || BGE |
+0 | reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
+0 | reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
+1 | reserved | src2 | src1 | 000 | predicate rs3 || FEQ |
+1 | reserved | src2 | src1 | 001 | predicate rs3 || FNE |
+1 | reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 100 | predicate rs3 || FLT |
+1 | reserved | src2 | src1 | 101 | predicate rs3 || FLE |
+1 | reserved | src2 | src1 | 110 | predicate rs3 || rsvd |
+1 | reserved | src2 | src1 | 111 | predicate rs3 || rsvd |
+"""]]
+
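
The field layout in the table above can be checked with a small decoder. This is a sketch only: the `PredCompare` tuple and the function name are hypothetical, the bit positions are read directly off the table (bit 31 is I/F, bits 11..7 hold the predicate register rs3), and `0b1100011` is the standard RISC-V BRANCH major opcode.

```python
from collections import namedtuple

BRANCH_OPCODE = 0b1100011  # standard RISC-V BRANCH major opcode

# Hypothetical container for the decoded fields of the table above.
PredCompare = namedtuple("PredCompare", "is_fp funct3 src1 src2 pred")

def decode_predicated_branch(insn):
    """Split a 32-bit BRANCH-format word into the predicated-compare fields."""
    if insn & 0x7F != BRANCH_OPCODE:
        raise ValueError("not a BRANCH-format instruction")
    return PredCompare(
        is_fp=bool((insn >> 31) & 1),  # I/F: 0 = integer, 1 = floating-point
        funct3=(insn >> 12) & 0x7,     # selects the comparison
        src1=(insn >> 15) & 0x1F,
        src2=(insn >> 20) & 0x1F,
        pred=(insn >> 7) & 0x1F,       # rs3: register receiving the mask bits
    )

# Integer compare, funct3=001, src1=r1, src2=r2, predicate stored in r5.
insn = (2 << 20) | (1 << 15) | (0b001 << 12) | (5 << 7) | BRANCH_OPCODE
print(decode_predicated_branch(insn))
```

The sketch ignores the reserved bits 30..25; whether an implementation should trap when they are non-zero is left open here.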
+In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
+for predicated compare operations of function "cmp":
+
+    for (int i=0; i
+    s1 = CSRvectorlen[src1] > 1;
+    s2 = CSRvectorlen[src2] > 1;
+    for (int i=0; i
-> Ok so this is an aspect of Simple-V that I hadn't thought through,
-> yet (proposal / idea only a few days old!). in V2.3-Draft ISA Section
-> 17.10 the CSRs are listed. I note that there's some general-purpose
-> CSRs (including a global/active vector-length) and 16 vcfgN CSRs. i
-> don't precisely know what those are for.
-
-> In the Simple-V proposal, *every* register in both the integer
-> register-file *and* the floating-point register-file would have at
-> least a 2-bit "data-width" CSR and probably something like an 8-bit
-> "vector-length" CSR (less in RV32E, by exactly one bit).
-
-> What I *don't* know is whether that would be considered perfectly
-> reasonable or completely insane. If it turns out that the proposed
-> Simple-V CSRs can indeed be stored in SRAM then I would imagine that
-> adding somewhere in the region of 10 bits per register would be... okay?
-> I really don't honestly know.
-
-> Would these proposed 10-or-so-bit per-register Simple-V CSRs need to
-> be multi-ported? No I don't believe they would.
-
-## 17.11 Maximum Vector Length (MVL)
-
-Basically implicitly this is set to the maximum size of the register
-file multiplied by the number of 8-bit packed ints that can fit into
-a register (4 for RV32, 8 for RV64 and 16 for RV128).
-
-## !7.12 Vector Instruction Formats
-
-No equivalent in Simple-V because *all* instructions of *all* Extensions
-are implicitly parallelised (and packed).
-
-## 17.13 Polymorphic Vector Instructions
-
-Polymorphism (implicit type-casting) is deliberately not supported
-in Simple-V.
-
-## 17.14 Rapid Configuration Instructions
-
-TODO: analyse if this is useful to have an equivalent in Simple-V
-
-## 17.15 Vector-Type-Change Instructions
-
-TODO: analyse if this is useful to have an equivalent in Simple-V
-
-## 17.16 Vector Length
-
-Has a direct corresponding equivalent.
-
-## 17.17 Predicated Execution
-
-Predicated Execution is another name for "masking" or "tagging". Masked
-(or tagged) implies that there is a bit field which is indexed, and each
-bit associated with the corresponding indexed offset register within
-the "Vector". If the tag / mask bit is 1, when a parallel operation is
-issued, the indexed element of the vector has the operation carried out.
-However if the tag / mask bit is *zero*, that particular indexed element
-of the vector does *not* have the requested operation carried out.
-
-In V2.3-draft V, there is a significant (not recommended) difference:
-the zero-tagged elements are *set to zero*. This loses a *significant*
-advantage of mask / tagging, particularly if the entire mask register
-is itself a general-purpose register, as that general-purpose register
-can be inverted, shifted, and'ed, or'ed and so on. In other words
-it becomes possible, especially if Carry/Overflow from each vector
-operation is also accessible, to do conditional (step-by-step) vector
-operations including things like turn vectors into 1024-bit or greater
-operands with very few instructions, by treating the "carry" from
-one instruction as a way to do "Conditional add of 1 to the register
-next door". If V2.3-draft V sets zero-tagged elements to zero, such
-extremely powerful techniques are simply not possible.
-
-It is noted that there is no mention of an equivalent to BEXT (element
-skipping) which would be particularly fascinating and powerful to have.
-In this mode, the "mask" would skip elements where its mask bit was zero
-in either the source or the destination operand.
-
-Lots to be discussed.
-
-## 17.18 Vector Load/Store Instructions
-
-These may not have a direct equivalent in Simple-V, except if mask/tagging
-is to be deployed.
-
-To be discussed.
-
-## 17.19 Vector Register Gather
-
-TODO
-
-## TODO, sort
-
-> However, there are also several features that go beyond simply attaching VL
-> to a scalar operation and are crucial to being able to vectorize a lot of
-> code. To name a few:
-> - Conditional execution (i.e., predicated operations)
-> - Inter-lane data movement (e.g. SLIDE, SELECT)
-> - Reductions (e.g., VADD with a scalar destination)
-
- Ok so the Conditional and also the Reductions is one of the reasons
- why as part of SimpleV / variable-SIMD / parallelism (gah gotta think
- of a decent name) i proposed that it be implemented as "if you say r0
- is to be a vector / SIMD that means operations actually take place on
- r0,r1,r2... r(N-1)".
-
- Consequently any parallel operation could be paused (or... more
- specifically: vectors disabled by resetting it back to a default /
- scalar / vector-length=1) yet the results would actually be in the
- *main register file* (integer or float) and so anything that wasn't
- possible to easily do in "simple" parallel terms could be done *out*
- of parallel "mode" instead.
-
- I do appreciate that the above does imply that there is a limit to the
- length that SimpleV (whatever) can be parallelised, namely that you
- run out of registers! my thought there was, "leave space for the main
- V-Ext proposal to extend it to the length that V currently supports".
- Honestly i had not thought through precisely how that would work.
-
- Inter-lane (SELECT) i saw 17.19 in V2.3-Draft p117, I liked that,
- it reminds me of the discussion with Clifford on bit-manipulation
- (gather-scatter except not Bit Gather Scatter, *data* gather scatter): if
- applied "globally and outside of V and P" SLIDE and SELECT might become
- an extremely powerful way to do fast memory copy and reordering [2[.
-
- However I haven't quite got my head round how that would work: i am
- used to the concept of register "tags" (the modern term is "masks")
- and i *think* if "masks" were applied to a Simple-V-enhanced LOAD /
- STORE you would get the exact same thing as SELECT.
-
- SLIDE you could do simply by setting say r0 vector-length to say 16
- (meaning that if referred to in any operation it would be an implicit
- parallel operation on *all* registers r0 through r15), and temporarily
- set say.... r7 vector-length to say... 5. Do a LOAD on r7 and it would
- implicitly mean "load from memory into r7 through r11". Then you go
- back and do an operation on r0 and ta-daa, you're actually doing an
- operation on a SLID {SLIDED?) vector.
-
- The advantage of Simple-V (whatever) over V would be that you could
- actually do *operations* in the middle of vectors (not just SLIDEs)
- simply by (as above) setting r0 vector-length to 16 and r7 vector-length
- to 5. There would be nothing preventing you from doing an ADD on r0
- (which meant do an ADD on r0 through r15) followed *immediately in the
- next instruction with no setup cost* a MUL on r7 (which actually meant
- "do a parallel MUL on r7 through r11").
-
- btw it's worth mentioning that you'd get scalar-vector and vector-scalar
- implicitly by having one of the source register be vector-length 1
- (the default) and one being N > 1. but without having special opcodes
- to do it. i *believe* (or more like "logically infer or deduce" as
- i haven't got access to the spec) that that would result in a further
- opcode reduction when comparing [draft] V-Ext to [proposed] Simple-V.
-
- Also, Reduction *might* be possible by specifying that the destination be
- a scalar (vector-length=1) whilst the source be a vector. However... it
- would be an awful lot of work to go through *every single instruction*
- in *every* Extension, working out which ones could be parallelised (ADD,
- MUL, XOR) and those that definitely could not (DIV, SUB). Is that worth
- the effort? maybe. Would it result in huge complexity? probably.
- Could an implementor just go "I ain't doing *that* as parallel!
- let's make it virtual-parallelism (sequential reduction) instead"?
- absolutely. So, now that I think it through, Simple-V (whatever)
- covers Reduction as well. huh, that's a surprise.
-
-
-> - Vector-length speculation (making it possible to vectorize some loops with
-> unknown trip count) - I don't think this part of the proposal is written
-> down yet.
-
- Now that _is_ an interesting concept. A little scary, i imagine, with
- the possibility of putting a processor into a hard infinite execution
- loop... :)
-
-
-> Also, note the vector ISA consumes relatively little opcode space (all the
-> arithmetic fits in 7/8ths of a major opcode). This is mainly because data
-> type and size is a function of runtime configuration, rather than of opcode.
-
- yes. i love that aspect of V, i am a huge fan of polymorphism [1]
- which is why i am keen to advocate that the same runtime principle be
- extended to the rest of the RISC-V ISA [3]
-
- Yikes that's a lot. I'm going to need to pull this into the wiki to
- make sure it's not lost.
-
-[1] inherent data type conversion: 25 years ago i designed a hypothetical
-hyper-hyper-hyper-escape-code-sequencing ISA based around 2-bit
-(escape-extended) opcodes and 2-bit (escape-extended) operands that
-only required a fixed 8-bit instruction length. that relied heavily
-on polymorphism and runtime size configurations as well. At the time
-I thought it would have meant one HELL of a lot of CSRs... but then I
-met RISC-V and was cured instantly of that delusion^Wmisapprehension :)
-
-[2] Interestingly if you then also add in the other aspect of Simple-V
-(the data-size, which is effectively functionally orthogonal / identical
-to "Packed" of Packed-SIMD), masked and packed *and* vectored LOAD / STORE
-operations become byte / half-word / word augmenters of B-Ext's proposed
-"BGS" i.e. where B-Ext's BGS dealt with bits, masked-packed-vectored
-LOAD / STORE would deal with 8 / 16 / 32 bits at a time. Where it
-would get really REALLY interesting would be masked-packed-vectored
-B-Ext BGS instructions. I can't even get my head fully round that,
-which is a good sign that the combination would be *really* powerful :)
-
-[3] ok sadly maybe not the polymorphism, it's too complicated and I
-think would be much too hard for implementors to easily "slide in" to an
-existing non-Simple-V implementation. i say that despite really *really*
-wanting IEEE 704 FP Half-precision to end up somewhere in RISC-V in some
-fashion, for optimising 3D Graphics. *sigh*.
-
-## TODO: instructions (based on Hwacha) V-Ext duplication analysis
-
-This is partly speculative due to lack of access to an up-to-date
-V-Ext Spec (V2.3-draft RVV 0.4-Draft at the time of writing). However
-basin an analysis instead on Hwacha, a cursory examination shows over
-an **85%** duplication of V-Ext operand-related instructions when
-compared to Simple-V on a standard RG64G base. Even Vector Fetch
-is analogous to "zero-overhead loop".
-
-Exceptions are:
-
-* Vector Indexed Memory Instructions (non-contiguous)
-* Vector Atomic Memory Instructions.
-* Some of the Vector Arithmetic ops: MADD, MSUB,
- VSRL, VSRA, VEIDX, VFIRST, VSGNJN, VFSGNJX and potentially more.
-* Consensual Jump
-
-## TODO: sort
-
-> I suspect that the "hardware loop" in question is actually a zero-overhead
-> loop unit that diverts execution from address X to address Y if a certain
-> condition is met.
-
- not quite. The zero-overhead loop unit interestingly would be at
-an [independent] level above vector-length. The distinctions are
-as follows:
-
-* Vector-length issues *virtual* instructions where the register
- operands are *specifically* altered (to cover a range of registers),
- whereas zero-overhead loops *specifically* do *NOT* alter the operands
- in *ANY* way.
-
-* Vector-length-driven "virtual" instructions are driven by *one*
- and *only* one instruction (whether it be a LOAD, STORE, or pure
- one/two/three-operand opcode) whereas zero-overhead loop units
- specifically apply to *multiple* instructions.
-
-Where vector-length-driven "virtual" instructions might get conceptually
-blurred with zero-overhead loops is LOAD / STORE. In the case of LOAD /
-STORE, to actually be useful, vector-length-driven LOAD / STORE should
-increment the LOAD / STORE memory address to correspondingly match the
-increment in the register bank. example:
-
-* set vector-length for r0 to 4
-* issue RV32 LOAD from addr 0x1230 to r0
-
-translates effectively to:
-
-* RV32 LOAD from addr 0x1230 to r0
-* ...
-* ...
-* RV32 LOAD from addr 0x123B to r3
-
-# P-Ext ISA
-
-| Mnemonic | 16-bit Instruction |
-| ------------------ | ------------------------- |
-| ADD16 rt, ra, rb | add |
-| RADD16 rt, ra, rb | Signed Halving add |
-| URADD16 rt, ra, rb | Unsigned Halving add |
-| KADD16 rt, ra, rb | Signed Saturating add |
-| UKADD16 rt, ra, rb | Unsigned Saturating add |
-| SUB16 rt, ra, rb | sub |
-| RSUB16 rt, ra, rb | Signed Halving sub |
-
+implementation efforts, without "extra baggage".
+
+# CSRs
+
+There are a number of CSRs needed, which are used at the instruction
+decode phase to re-interpret standard RV opcodes (a practice that has
+precedent in the setting of MISA to enable / disable extensions).
+
+* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+* Vector Length CSR (VSETVL, VGETVL)
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+ of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+ "bitwidth" may fit into an XLEN-sized register.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
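
The register-range and element-count rules in the list and notes above can be sketched as follows. The function names are hypothetical, and treating a vector that runs past the last register as an error is an assumption of this sketch, since the text above does not say what should happen in that case.

```python
NUM_REGS = 32  # integer or floating-point registers in the standard base ISA

def expand_register(n, vector_len, num_regs=NUM_REGS):
    """r(N) marked as a vector of length M covers r(N..N+M-1)."""
    if vector_len < 1 or n + vector_len > num_regs:
        # assumption: running off the end of the register file is an error
        raise ValueError("vector does not fit in the register file")
    return list(range(n, n + vector_len))

def element_count(vector_len, bitwidth, xlen=64):
    """Registers in the vector times packed elements per XLEN-sized register."""
    if xlen % bitwidth != 0:
        raise ValueError("bitwidth must divide XLEN")
    return vector_len * (xlen // bitwidth)

print(expand_register(7, 5))
print(element_count(4, 16, xlen=64))
```

For instance, marking r7 as a vector of length 5 covers r7 through r11, and a vector of four RV64 registers with a 16-bit implicit bitwidth holds sixteen elements.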
+
+## Predication CSR
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. The first entry is whether predication
+is enabled. The second entry is whether the register index refers to a
+floating-point or an integer register. The third entry is the index
+of that register which is to be predicated (if referred to). The fourth entry
+is the integer register that is treated as a bitfield, indexable by the
+vector element index.
+
+| RegNo | 6 | 5 | (4..0) | (4..0) |
+| ----- | - | - | ------- | ------- |
+| r0 | pren0 | i/f | regidx | predidx |
+| r1 | pren1 | i/f | regidx | predidx |
+| .. | pren.. | i/f | regidx | predidx |
+| r15 | pren15 | i/f | regidx | predidx |
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
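+    # inverted lookup tables, indexed by destination register number
+    int_pred_enabled = [False] * 32
+    int_pred_reg = [0] * 32
+    fp_pred_enabled = [False] * 32
+    fp_pred_reg = [0] * 32
+    for i in range(16):
+        if CSRpred[i].pren:
+            idx = CSRpred[i].regidx
+            predidx = CSRpred[i].predidx
+            if CSRpred[i].type == 0: # i/f bit clear: integer
+                int_pred_enabled[idx] = True
+                int_pred_reg[idx] = predidx
+            else: # i/f bit set: floating-point
+                fp_pred_enabled[idx] = True
+                fp_pred_reg[idx] = predidx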
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i
+
+> What does an ADD of two different-sized vectors do in simple-V?
+
+* if the two source operands are not the same length, throw an exception.
+* if the destination operand is also a vector, and the source is longer
+ than the destination, throw an exception.
+
+> And what about instructions like JALR?
+> What does jumping to a vector do?
+
+* Throw an exception. Whether that actually results in spawning threads
+ as part of the trap-handling remains to be seen.
+
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+ allocating identical opcodes to multiple independent registers) means
+ that SRAM or 2R1W can (potentially) be used for the entire regfile.
+* minus: a more complex instruction set where the parallelism is much
+ more explicitly directly specified in the instruction.
+* minus: if you *don't* have an explicit instruction (opcode) and you
+ need one, the only place it can be added is... in the vector unit.
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+ not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+ but with the down-side that they're an all-or-nothing part of the Extension.
+ No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+ parallelisation can be carried out, followed by further parallel Lane-based
+ work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+ is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware. It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity of having to use register renames, OoO, VLIW,
+ register file cacheing, all of which has been done before but is a
+ pain
+* plus: transparent re-use of existing opcodes as-is just indirectly
+ saying "this register's now a vector" which
+* plus: means that future instructions also get to be inherently
+ parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V (setting a CSR register to
+ indicate vector length, a separate one to indicate that it is a
+ predicate register and so on) means a little more setup time than
+ Alt-RVP or RVV's "direct and within the (longer) instruction"
+ approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+ operations not suited to parallelisation may be carried out interleaved
+ between parallelised instructions *without* requiring data to be dropped
+ down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+ files means that huge parallel workloads would use up considerable
+ chunks of the register file. However in the case of RV64 and 32-bit
+ operations, that effectively means 64 slots are available for parallel
+ operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+ be added, yet the instruction opcodes remain unchanged (and still appear
+ to be parallel). The "API" is consistent regardless of actual internal
+ parallelism: even an in-order single-issue implementation with a single
+ ALU would still appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+ hard to say if there would be pluses or minuses (on die area). At worst it
+ would be "no worse" than existing register renaming, OoO, VLIW and register
+ file cacheing schemes.
+
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+ streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+ (similar to Alt-RVP) may be used as an implementation detail,
+ using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+ really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+ do not gain parallelism, resulting in prolific duplication of functionality
+ inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+ using the standard integer or FP regfile) an entire vector must be
+ transferred out to memory, into standard regfiles, then back to memory,
+ then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is. May
+ be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+ vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+ implementation time and die area, meaning that adoption is likely only
+ to be in high-performance specialist supercomputing (where it will
+ be absolutely superb).
+
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance. Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+ at once. Parallelism is inherent at the ALU, making the addition of
+ SIMD-style parallelism an easy decision that has zero significant impact
+ on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+ therefore result in superb throughput, easily achieved even with a very
+ simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+ increase instruction count on what would otherwise be a "simple loop",
+ should the number of elements in an array not happen to exactly match
+ the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+ are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+ are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+ dimension and parallelism (width) in another: an at least O(N^2) and quite
+ probably O(N^3) ISA proliferation that often results in several thousand
+ separate instructions, all requiring separate and distinct corner-case
+ algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+ 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+ For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+ four separate and distinct instructions: one for (r1:low r2:high),
+ one for (r1:high r2:low), one for (r1:high r2:high) and one for
+ (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+ between operand and result bit-widths. In combination with high/low
+ proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+ that allow control over individual elements within the SIMD block.
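+
+The fixed-width corner case noted in the list above can be sketched as
+follows. This is only an illustration in Python, not part of any proposal:
+the function names, the group width of 4 and the `max_vl` parameter are all
+assumptions. The first function needs a separate tail loop when the array
+length is not a multiple of the SIMD width; the second uses a per-iteration
+vector length so one loop body covers every case.
+
```python
def simd_add(a, b, width=4):
    """Fixed-width SIMD style: whole groups first, then a scalar tail loop."""
    n = len(a)
    out = [0] * n
    full = n - (n % width)
    for base in range(0, full, width):      # main loop: whole SIMD groups only
        for lane in range(width):           # stands in for one packed operation
            out[base + lane] = a[base + lane] + b[base + lane]
    for i in range(full, n):                # corner case: leftover elements
        out[i] = a[i] + b[i]
    return out


def vector_add(a, b, max_vl=4):
    """Vector-length style: the same loop body handles the final short pass."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(max_vl, n - i)             # hypothetical "set vector length" step
        for lane in range(vl):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return out
```
+
+Both functions return the same result; the difference is that the second
+has no separately-coded tail.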
+
+# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+## [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+ a SIMD architecture where the ALU becomes responsible for the parallelism,
+ Alt-RVP ALUs would likewise be so responsible... with *additional*
+ (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+ at least one dimension are avoided (architectural upgrades introducing
+ 128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+ SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+ of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+ be able to subdivide the bits of each register lane (columns) down into
+ arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+ 16-bit or even 8-bit, effectively dividing the registerfile into
+ Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
+ "swapping" instructions were then introduced, some of the disadvantages
+ of SIMD could be mitigated.
+
+## RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+ parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+ DSPs with a focus on Multimedia (Audio, Video and Image processing),
+ RVV's primary focus appears to be on Supercomputing: optimisation of
+ mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+ into a SIMD instruction requires an equivalent to be added to the
+ RVV Extension, if one does not exist. Given the specialist nature of
+ some SIMD instructions (8-bit or 16-bit saturated or halving add),
+ this possibility seems extremely unlikely to occur, even if the
+ implementation overhead of RVV were acceptable (compared to
+ normal SIMD/DSP-style single-issue in-order simplicity).
+
+## Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+ topologically transplant every single instruction from RVV (as
+ designed) into Simple-V equivalents, with *zero loss of functionality
+ or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+ Extension which contained the basic primitives (non-parallelised
+ 8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+ automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+ to have special SIMD-parallel opcodes added need no longer have *any*
+ of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
+ four 32-bit ADDs in a 128-bit register) because Simple-V takes the
+ *standard* RV opcodes (present and future) and automatically parallelises
+ them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+ with RVV the corner-cases and ISA proliferation of SIMD are avoided.
+* Whilst not entirely finalised, registers are expected to be
+ capable of being subdivided down to an implementor-chosen bitwidth
+ in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+ and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+ choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+ ALUs that perform twin 8-bit operations as they see fit, or anything
+ else including no subdivisions at all.
+* Even though implementors have that choice even to have full 64-bit
+ (with RV64) SIMD, they *must* provide predication that transparently
+ switches off appropriate units on the last loop, thus neatly fitting
+ underlying SIMD ALU implementations *into* the arbitrary vector-length
+ RVV paradigm, keeping the uniform consistent API that is a key strategic
+ feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+ of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+ can be done by applying *Parallelised* Bit-manipulation operations
+ followed by parallelised *straight* versions of element-to-element
+ arithmetic operations, even if the bit-manipulation operations require
+ changing the bitwidth of the "vectors" to do so. Predication can
+ be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+ identical functions over time as an architecture evolves from 32-bit
+ wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
+ vector-style parallelism being dropped on top of 8-bit or 16-bit
+ operations, all the while keeping a consistent ISA-level "API" irrespective
+ of implementor design choices (or indeed actual implementations).
+
+# Implementing V on top of Simple-V
+
+* Number of Offset CSRs extends from 2
+* Extra register file: vector-file
+* Setup of Vector length and bitwidth CSRs now can specify vector-file
+ as well as integer or float file.
+* Extend CSR tables (bitwidth) with extra bits
+* TODO
+
+# Implementing P (renamed to DSP) on top of Simple-V
+
+* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
+ (caveat: anything not specified drops through to software-emulation / traps)
+* TODO
+
+# Appendix
+
+## V-Extension to Simple-V Comparative Analysis
+
+This section has been moved to its own page [[v_comparative_analysis]]
+
+## P-Ext ISA
+
+This section has been moved to its own page [[p_comparative_analysis]]
+
+## Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+    register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+        /* note that this is ADD, not PADD */
+        int i, id, irs1, irs2;
+        # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+        # also destination makes no sense as a scalar but what the hell...
+        for (i = 0, id=0, irs1=0, irs2=0; i
+
+One of the goals of this parallelism proposal is to avoid instruction
+duplication. However, with the base ISA having been designed explicitly
+to *avoid* condition-codes entirely, shoe-horning predication into it
+becomes quite challenging.
+
+However what if all branch instructions, if referencing a vectorised
+register, were instead given *completely new analogous meanings* that
+resulted in a parallel bit-wise predication register being set? This
+would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
+BLT and BGE.
+
+We might imagine that FEQ, FLT and FLE would also need to be converted,
+however these are effectively *already* in the precise form needed and
+do not need to be converted *at all*! The difference is that FEQ, FLT
+and FLE *specifically* write a 1 to an integer register if the condition
+holds, and 0 if not. All that needs to be done here is to say, "if
+the integer register is tagged with a bit that says it is a predication
+register, the **bit** in the integer register is set based on the
+current vector index" instead.
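+
+A minimal sketch of that reinterpretation, in Python and under stated
+assumptions: the function name `vector_feq` is hypothetical, the operands
+are modelled as plain lists, and the destination integer register is
+modelled as a Python integer used as a bitfield.
+
```python
def vector_feq(src1, src2, vl):
    """Model of FEQ writing to a register tagged as a predicate.

    Instead of writing 1 or 0 to the whole destination register, bit i
    of the result is set when the comparison holds for element i.
    """
    pred = 0
    for i in range(vl):                 # i is the current vector index
        if src1[i] == src2[i]:
            pred |= 1 << i              # set only the bit for this element
    return pred
```
+
+With `vl` set to 1 this degenerates to the scalar behaviour: the result is
+either 1 or 0.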
+
+There is, in the standard Conditional Branch instruction, more than
+adequate space to interpret it in a similar fashion:
+
+[[!table data="""
+ 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
+imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+ 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+ offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
+"""]]
+
+This would become:
+
+[[!table data="""
+31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
+imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
+1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
+reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
+"""]]
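+
+As an illustration only, the reinterpreted layout could be decoded along
+the following lines. The sketch is in Python; the function name and the
+dictionary keys are assumptions, as is treating bits 11..7 as one plain
+5-bit "predicate rs3" field (the table does not state how the former
+imm[4:1] and imm[11] bits are ordered within it).
+
```python
BRANCH_OPCODE = 0b1100011  # standard RISC-V BRANCH major opcode


def decode_predicated_branch(word):
    """Split a 32-bit word into the fields of the table above."""
    return {
        "opcode": word & 0x7F,            # bits 6..0
        "pred_rs3": (word >> 7) & 0x1F,   # bits 11..7 (was imm[4:1], imm[11])
        "funct3": (word >> 12) & 0x7,     # bits 14..12
        "rs1": (word >> 15) & 0x1F,       # bits 19..15
        "rs2": (word >> 20) & 0x1F,       # bits 24..20
        "reserved": (word >> 25) & 0x7F,  # bits 31..25 (was imm[12], imm[10:5])
    }
```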
+
+Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
+with the interesting side-effect that there is space within what is presently
+the "immediate offset" field to reinterpret that to add in not only a bit
+field to distinguish between floating-point compare and integer compare,
+not only to add in a second source register, but also use some of the bits as
+a predication target as well.
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
+ funct3 | imm | rs1' | imm | op |
+ 3 | 3 | 3 | 5 | 2 |
+ C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
+"""]]
+
+Now uses the CS format:
+
+[[!table data="""
+15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
+ funct3 | imm | rs1' | imm | | op |
+ 3 | 3 | 3 | 2 | 3 | 2 |
+ C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
+"""]]
+
+Bit 6 would be decoded as "operation refers to Integer or Float" including
+interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
+"C" Standard, version 2.0,
+whilst Bit 5 would allow the operation to be extended, in combination with
+funct3 = 110 or 111: a combination of four distinct (predicated) comparison
+operators. In both floating-point and integer cases those could be
+EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping src1 and src2).
+
+## Register reordering
+
+### Register File
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0 | (31..0) |
+| r1 | (31..0) |
+| r2 | (31..0) |
+| r3 | (31..0) |
+| r4 | (31..0) |
+| r5 | (31..0) |
+| r6 | (31..0) |
+| r7 | (31..0) |
+| .. | (31..0) |
+| r31| (31..0) |
+
+### Vectorised CSR
+
+May not be an actual CSR: may be generated from Vector Length CSR:
+single-bit is less burdensome on instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+### Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
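+
+A hedged sketch of how the single-bit "Vectorised" flags might be
+generated from the Vector Length table, written in Python. It assumes that
+a register counts as vectorised when its length is greater than 1; that
+rule is an assumption, chosen because it reproduces the example bit
+pattern shown earlier for these eight registers. The function name is
+hypothetical.
+
```python
def vectorised_mask(vector_len):
    """Return a bitmask with bit N set when register N has length > 1."""
    mask = 0
    for regnum, vlen in enumerate(vector_len):
        if vlen > 1:                  # assumed rule: length 0 or 1 is scalar
            mask |= 1 << regnum
    return mask


# Vector Length CSR values for r0..r7, copied from the table above.
EXAMPLE_VLEN = [2, 0, 1, 1, 3, 0, 0, 1]
```
+
+For the table above only r0 and r4 have a length greater than 1, so bits
+0 and 4 are set.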
+
+### Virtual Register Reordering
+
+This example assumes the above Vector Length CSR table
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0 | (31..0) | (31..0) | |
+| r2 | (31..0) | | |
+| r3 | (31..0) | | |
+| r4 | (31..0) | (31..0) | (31..0) |
+| r7 | (31..0) | | |
+
+### Bitwidth Virtual Register Reordering
+
+This example goes a little further and illustrates the effect of setting
+a bitwidth CSR on a register. Preconditions:
+
+* RV32 assumed
+* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
+* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
+* vsetl rs1, 5 # set the vector length to 5
+
+This is interpreted as follows:
+
+* Given that the context is RV32, ELEN=32.
+* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
+* Therefore the actual vector length is up to *six* elements
+* However vsetl sets a length 5 therefore the last "element" is skipped
+
+So when using an operation that uses r2 as a source (or destination)
+the operation is carried out as follows:
+
+* 16-bit operation on r2(15..0) - vector element index 0
+* 16-bit operation on r2(31..16) - vector element index 1
+* 16-bit operation on r3(15..0) - vector element index 2
+* 16-bit operation on r3(31..16) - vector element index 3
+* 16-bit operation on r4(15..0) - vector element index 4
+* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
+
+Predication has been left out of the above example for simplicity, however
+predication is ANDed with the latter stages (vsetl not equal to maximum
+capacity).
+
+Note also that it is entirely an implementor's choice as to whether to have
+actual separate ALUs down to the minimum bitwidth, or whether to have something
+more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
+operations carried out 32-bits at a time is perfectly acceptable, as is
+8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
+Regardless of the internal parallelism choice, *predication must
+still be respected*, making Simple-V in effect the "consistent public API".
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | -------- |
+| 000 | default |
+| 001 | 8 |
+| 010 | 16 |
+| 011 | 32 |
+| 100 | 64 |
+| 101 | 128 |
+| 110 | rsvd |
+| 111 | rsvd |
+
+Pseudocode for vector length taking CSR SIMD-bitwidth into account:
+
+    vew = CSRbitwidth[rs1]
+    if (vew == 0):
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 1 2 4 8 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+
+To index an element in a register rnum where the vector element index is i:
+
+    function regoffs(rnum, i):
+        regidx = floor(i / simdmult)    # integer-div rounded down
+        elidx = i % simdmult            # integer-remainder: element within register
+        elwidth = bytesperreg * 8       # element width in bits
+        return rnum + regidx,           # actual real register
+               elidx * elwidth,         # low bit
+               elidx * elwidth + (elwidth-1)  # high bit
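+
+The two fragments above can be combined into a small runnable Python
+sketch that reproduces the r2 / r3 / r4 layout from the worked RV32
+example. All names here (`BYTES_TABLE`, `simd_mult`, `element_layout`)
+are illustrative assumptions, bit offsets are computed from the element
+width in bits, and element widths wider than XLEN are not handled.
+
```python
# vew encoding -> bytes per element, following the vew table above
BYTES_TABLE = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def simd_mult(vew, xlen=32):
    """Return (elements per register, element width in bits)."""
    reg_bytes = xlen // 8
    el_bytes = reg_bytes if vew == 0 else BYTES_TABLE[vew]
    return reg_bytes // el_bytes, el_bytes * 8


def element_layout(rnum, vew, csr_vlen, vl, xlen=32):
    """List (register, low bit, high bit) for each element operated on.

    csr_vlen is the per-register vector length CSR value and vl is the
    length requested via vsetl; the smaller capacity wins.
    """
    mult, el_bits = simd_mult(vew, xlen)
    count = min(csr_vlen * mult, vl)
    layout = []
    for i in range(count):
        reg = rnum + i // mult            # which real register holds element i
        low = (i % mult) * el_bits        # lowest bit of element i
        layout.append((reg, low, low + el_bits - 1))
    return layout
```
+
+Calling `element_layout(2, 0b010, 3, 5)` corresponds to the preconditions
+of the example: 16-bit elements in r2, a register vector length of 3 and
+vsetl of 5.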
+
+### Example Instruction translation:
+
+Assuming the Vector Length CSR table above (r2 scalar, r4 of length 3),
+the instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FILO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
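+
+The expansion listed above can be sketched in Python as follows. This is a
+rough model under assumptions, not a definition: the `expand` function is
+hypothetical, a register is treated as a vector only when its length is
+greater than 1, and scalar operands are simply repeated unchanged.
+
```python
def expand(op, rd, rs1, rs2, vector_len):
    """Expand one instruction into the per-element instructions issued.

    vector_len maps register number -> vector length CSR value.
    """
    def is_vec(reg):
        return vector_len.get(reg, 0) > 1   # assumed rule: 0 or 1 means scalar

    regs = (rd, rs1, rs2)
    count = max([vector_len[r] for r in regs if is_vec(r)], default=1)
    issued = []
    for i in range(count):
        # vector operands step through consecutive registers, scalars do not
        d, a, b = (r + i if is_vec(r) else r for r in regs)
        issued.append(f"{op} r{d} r{a} r{b}")
    return issued
```
+
+With no vector operands at all, the instruction passes through as a single
+unchanged entry.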
+
+### Insights
+
+SIMD register file splitting still to consider. For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+64 virtual 32-bit F.P. registers and given that 32-bit FP operations are
+done on 64-bit registers it's not so conceptually difficult. May even
+be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples. Implementation therefore hidden through register renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches). Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+## Analysis of CSR decoding on latency
+
+It could indeed have been logically deduced (or expected), that there
+would be additional decode latency in this proposal, because when
+opcodes are overloaded to have different meanings, there is guaranteed
+to be some state, somewhere, directly related to registers.
+
+There are several cases:
+
+* All operands vector-length=1 (scalars), all operands
+ packed-bitwidth="default": instructions are passed through direct as if
+ Simple-V did not exist. Simple-V is, in effect, completely disabled.
+* At least one operand vector-length > 1, all operands
+ packed-bitwidth="default": any parallel vector ALUs placed on "alert",
+ virtual parallelism looping may be activated.
+* All operands vector-length=1 (scalars), at least one
+ operand packed-bitwidth != default: degenerate case of SIMD,
+ implementation-specific complexity here (packed decode before ALUs or
+ *IN* ALUs)
+* At least one operand vector-length > 1, at least one operand
+ packed-bitwidth != default: parallel vector ALUs (if any)
+ placed on "alert", virtual parallelism looping may be activated,
+ implementation-specific SIMD complexity kicks in (packed decode before
+ ALUs or *IN* ALUs).
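+
+The four cases above amount to a two-bit classification of the operands.
+A minimal Python sketch, in which the function name, the returned labels
+and the use of `vew == 0` for "default" bitwidth are assumptions made for
+illustration:
+
```python
def classify(operands):
    """Classify an instruction by its operands' CSR state.

    operands is a list of (vector_length, vew) pairs, one per register
    operand; vew == 0 stands for packed-bitwidth "default".
    """
    any_vector = any(vlen > 1 for vlen, _ in operands)
    any_packed = any(vew != 0 for _, vew in operands)
    if not any_vector and not any_packed:
        return "passthrough"      # Simple-V effectively disabled
    if any_vector and not any_packed:
        return "vector"           # virtual parallelism looping may activate
    if not any_vector and any_packed:
        return "simd"             # degenerate packed-SIMD case
    return "vector+simd"          # both looping and packed decode
```
+
+Only the first case allows decode to be skipped entirely, which is why
+detecting it cheaply matters for the latency questions that follow.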
+
+Bear in mind that the proposal includes that the decision whether
+to parallelise in hardware or whether to virtual-parallelise (to
+dramatically simplify compilers and also not to run into the SIMD
+instruction proliferation nightmare) *or* a transparent combination
+of both, be done on a *per-operand basis*, so that implementors can
+specifically choose to create an application-optimised implementation
+that they believe (or know) will sell extremely well, without having
+"Extra Standards-Mandated Baggage" that would otherwise blow their area
+or power budget completely out the window.
+
+Additionally, two possible CSR schemes have been proposed, in order to
+greatly reduce CSR space:
+
+* per-register CSRs (vector-length and packed-bitwidth)
+* a smaller number of CSRs with the same information but with an *INDEX*
+ specifying WHICH register in one of three regfiles (vector, fp, int)
+ the length and bitwidth applies to.
+
+(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
+
+In addition, LOAD/STORE has its own associated proposed CSRs that
+mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
+V (and Hwacha).
+
+Also bear in mind that, for reasons of simplicity for implementors,
+I was coming round to the idea of permitting implementors to choose
+exactly which bitwidths they would like to support in hardware and which
+to allow to fall through to software-trap emulation.
+
+So the question boils down to:
+
+* whether either (or both) of those two CSR schemes have significant
+ latency that could even potentially require an extra pipeline decode stage
+* whether there are implementations that can be thought of which do *not*
+ introduce significant latency
+* whether it is possible to explicitly (through quite simply
+ disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
+ all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
+ the extreme of skipping an entire pipeline stage (if one is needed)
+* whether packed bitwidth and associated regfile splitting is so complex
+ that it should definitely, definitely be made mandatory that implementors
+ move regfile splitting into the ALU, and what are the implications of that
+* whether, even if that *is* made mandatory, software-trapped
+ "unsupported bitwidths" are still desirable, on the basis that SIMD is such
+ a complete nightmare that *even* having a software implementation is
+ better, making Simple-V have more in common with a software API than
+ anything else.
+
+Whilst the above may seem to be severe minuses, there are some strong
+pluses:
+
+* Significant reduction of V's opcode space: over 85%.
+* Smaller reduction of P's opcode space: around 10%.
+* The potential to use Compressed instructions in both Vector and SIMD
+ due to the overloading of register meaning (implicit vectorisation,
+ implicit packing)
+* Not only present but also future extensions automatically gain parallelism.
+* Already mentioned but worth emphasising: the simplification to compiler
+ writers and assembly-level writers of having the same consistent ISA
+ regardless of whether the internal level of parallelism (number of
+ parallel ALUs) is only equal to one ("virtual" parallelism), or is
+ greater than one, should not be underestimated.
+
+## Reducing Register Bank porting
+
+This looks quite reasonable.
+
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
# References
@@ -583,3 +1327,14 @@ translates effectively to:
Figure 2 P17 and Section 3 on P16.
* Hwacha
* Hwacha
+* Vector Workshop
+* Predication
+* Branch Divergence
+* Life of Triangles (3D)
+* Videocore-IV
+* Discussion proposing CSRs that change ISA definition
+
+* Zero-overhead loops
+* Multi-ported VLIW Register File Implementation
+* Fast context save/restore proposal
+* Register File Bank Cacheing