[[!toc ]]
+# Summary
+
+Key insight: Simple-V is intended as an abstraction layer to provide
+a consistent "API" to parallelisation of existing *and future* operations.
+*Actual* internal hardware-level parallelism is *not* required, such
+that Simple-V may be viewed as providing a "compact" or "consolidated"
+means of issuing multiple near-identical arithmetic instructions to an
+instruction queue (FIFO), pending execution.
+
+*Actual* parallelism, if added independently of Simple-V in the form
+of Out-of-order restructuring (including parallel ALU lanes) or VLIW
+implementations, or SIMD, or anything else, would then benefit *if*
+Simple-V was added on top.
+
+# Introduction
+
This proposal exists so as to be able to satisfy several disparate
requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
whilst each extremely powerful in their own right and clearly desirable,
are also:
-* Clearly independent in their origins (Cray and AndeStar v3 respectively)
+* Clearly independent in their origins (Cray and AndesStar v3 respectively)
so need work to adapt to the RISC-V ethos and paradigm
* Are sufficiently large so as to make adoption (and exploration for
analysis and review purposes) prohibitively expensive
Operation involving (referring to) register M:
-> bitwidth = default # default for opcode?
-> vectorlen = 1 # scalar
->
-> for (o = 0, o < 2, o++)
-> if (CSR-Vector_registernum[o] == M)
-> bitwidth = CSR-Vector_bitwidth[o]
-> vectorlen = CSR-Vector_len[o]
-> break
+    bitwidth = default    # default for opcode?
+    vectorlen = 1         # scalar
+
+    for (o = 0; o < 2; o++)
+        if (CSR-Vector_registernum[o] == M)
+            bitwidth = CSR-Vector_bitwidth[o]
+            vectorlen = CSR-Vector_len[o]
+            break
and for the former it would simply be:
-> bitwidth = CSR-Vector_bitwidth[M]
-> vectorlen = CSR-Vector_len[M]
+    bitwidth = CSR-Vector_bitwidth[M]
+    vectorlen = CSR-Vector_len[M]
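A minimal Python sketch of the two lookup styles, for comparison. It assumes a small table of (register number, bitwidth, vector length) entries for the table-search variant and one entry per register for the directly indexed variant. `DEFAULT_BITWIDTH`, `lookup_cam` and `lookup_direct` are hypothetical names used purely for illustration.

```python
DEFAULT_BITWIDTH = 32  # assumed opcode default, for illustration only


def lookup_cam(table, m):
    """Search a small table of (regnum, bitwidth, vectorlen) entries for register m."""
    for regnum, bitwidth, vectorlen in table:
        if regnum == m:
            return bitwidth, vectorlen
    # register m is not tagged: default width, scalar
    return DEFAULT_BITWIDTH, 1


def lookup_direct(bitwidths, vectorlens, m):
    """One CSR entry per register: index directly by register number."""
    return bitwidths[m], vectorlens[m]


cam = [(5, 16, 4), (9, 8, 2)]   # two hypothetical CSR entries
print(lookup_cam(cam, 5))       # register 5 is tagged
print(lookup_cam(cam, 6))       # register 6 falls through to the defaults
```

The table search costs a comparison per entry on every register reference, whereas the direct form costs CSR storage for every register.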
Alternatives:
LOAD rN, ldoffs(rM) would then be (assuming packed bit-width not set):
-> offs = 0
-> stride = 1
-> vector-len = CSR-Vector-length register N
->
-> for (o = 0, o < 2, o++)
-> if (CSR-Offset register o == M)
-> offs = CSR-Offset amount register o
-> if CSR-Offset Stride-mode == offset:
-> stride = ldoffs
-> break
->
-> for (i = 0, i < vector-len; i++)
-> r[N+i] = mem[(offs*i + r[M+i])*stride]
+    offs = 0
+    stride = 1
+    vector-len = CSR-Vector-length register N
+
+    for (o = 0; o < 2; o++)
+        if (CSR-Offset register o == M)
+            offs = CSR-Offset amount register o
+            if CSR-Offset Stride-mode == offset:
+                stride = ldoffs
+            break
+
+    for (i = 0; i < vector-len; i++)
+        r[N+i] = mem[(offs*i + r[M+i])*stride]
# Analysis and discussion of Vector vs SIMD
proposal would basically allow an inner loop of instructions to be
repeated indefinitely, a fixed number of times.
-Its specific advantage over explicit loops is that the pipeline in a
-DSP can potentially be kept completely full *even in an in-order
+Its specific advantage over explicit loops is that the pipeline in a DSP
+can potentially be kept completely full *even in an in-order single-issue
implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in order
-to keep ALU pipelines 100% occupied.
+out-of-order execution capabilities to "pre-process" instructions in
+order to keep ALU pipelines 100% occupied.
-This very simple proposal offers a way to increase pipeline activity in the
-one key area which really matters: the inner loop.
+By bringing that capability in, this proposal could offer a way to increase
+pipeline activity even in simpler implementations in the one key area
+which really matters: the inner loop.
+
+However, when looking at much more comprehensive schemes such as
+"A portable specification of zero-overhead loop control hardware
+applied to embedded processors" (ZOLC), optimising only the single
+inner loop seems inadequate, which suggests that ZOLC may be
+better off being proposed as an entirely separate Extension.
## Mask and Tagging (Predication)
to also tag certain registers as "predicated if referenced as a destination".
Example:
-> // in future operations if r0 is the destination use r5 as
-> // the PREDICATION register
-> IMPLICICSRPREDICATE r0, r5
-> // store the compares in r5 as the PREDICATION register
-> CMPEQ8 r5, r1, r2
-> // r0 is used here. ah ha! that means it's predicated using r5!
-> ADD8 r0, r1, r3
+    // in future operations if r0 is the destination use r5 as
+    // the PREDICATION register
+    IMPLICICSRPREDICATE r0, r5
+    // store the compares in r5 as the PREDICATION register
+    CMPEQ8 r5, r1, r2
+    // r0 is used here.  ah ha!  that means it's predicated using r5!
+    ADD8 r0, r1, r3
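The effect of that sequence can be sketched in Python. This is an illustration under stated assumptions, not a definition: each register is modelled as a list of 8-bit elements, and elements whose predicate bit is clear are assumed to keep the old destination value. The helper names `cmpeq8` and `add8` are hypothetical.

```python
def cmpeq8(a, b):
    """Element-wise compare: 1 where the 8-bit elements are equal, else 0."""
    return [1 if x == y else 0 for x, y in zip(a, b)]


def add8(dest_old, a, b, pred):
    """Element-wise 8-bit add, written only where the predicate bit is set."""
    return [(x + y) & 0xFF if p else old
            for old, x, y, p in zip(dest_old, a, b, pred)]


r0 = [0, 0, 0, 0]
r1 = [1, 2, 3, 4]
r2 = [1, 9, 3, 9]
r3 = [10, 20, 254, 40]

r5 = cmpeq8(r1, r2)          # r5 is the predication register tagged onto r0
r0 = add8(r0, r1, r3, r5)    # only elements 0 and 2 are written
print(r5, r0)
```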
With enough registers (and there are enough registers) some fairly
complex predication can be set up and yet still execute without significant
* Fixed vs variable parallelism: <b>variable</b>
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit</b>
-* Tag or no-tag: <b>Complex and needs further thought</b>
+* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
+* Tag or no-tag: <b>Complex but highly beneficial</b>
+
+In particular:
-In particular: variable-length vectors came out on top because of the
-high setup, teardown and corner-cases associated with the fixed width
-of SIMD. Implicit bit-width helps to extend the ISA to escape from
-former limitations and restrictions (in a backwards-compatible fashion),
-and implicit (zero-overhead) loops provide a means to keep pipelines
-potentially 100% occupied *without* requiring a super-scalar or out-of-order
-architecture.
+* Variable-length vectors came out on top because of the high setup, teardown
+  and corner-case costs associated with the fixed width of SIMD.
+* Implicit bit-width helps to extend the ISA to escape from
+  former limitations and restrictions (in a backwards-compatible fashion),
+  whilst also leaving implementors free to simplify implementations
+  by using actual explicit internal parallelism.
+* Implicit (zero-overhead) loops provide a means to keep pipelines
+  potentially 100% occupied in a single-issue in-order implementation
+  i.e. *without* requiring a super-scalar or out-of-order architecture,
+  but doing a proper, full job (ZOLC) is an entirely different matter.
-Constructing a SIMD/Simple-Vector proposal based around even only these four
-(five?) requirements would therefore seem to be a logical thing to do.
+Constructing a SIMD/Simple-Vector proposal based around four of these five
+requirements would therefore seem to be a logical thing to do.
# Instruction Format
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".
+# Example of vector / vector, vector / scalar, scalar / scalar => vector add
+
+    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
+    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
+    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
+    register x[32][XLEN];
+
+    function op_add(rd, rs1, rs2, predr)
+    {
+        /* note that this is ADD, not PADD */
+        int i, id, irs1, irs2;
+        # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
+        # also destination makes no sense as a scalar but what the hell...
+        for (i = 0, id=0, irs1=0, irs2=0; i<CSRvectorlen[rd]; i++)
+        {
+            if (CSRpredicate[predr][i]) # i *think* this is right...
+                x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
+            # now increment the idxs
+            if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
+                id += 1;
+            if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
+                irs1 += 1;
+            if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
+                irs2 += 1;
+        }
+    }
+
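A runnable Python sketch of the same idea, assuming the register file is a flat list of integers, the element count comes from the destination register (as in the pseudo-code), and the set of vectorised registers is passed in explicitly. The parameter names are illustrative only.

```python
def op_add(x, rd, rs1, rs2, vectorlen, is_vectorised, predicate):
    """Element-wise ADD over x, a flat list of register values.

    vectorlen:     element count, taken from the destination register
    is_vectorised: set of register numbers tagged as vectors
    predicate:     list of 0/1 flags, one per element
    """
    d = s1 = s2 = 0
    for i in range(vectorlen):
        if predicate[i]:
            x[rd + d] = x[rs1 + s1] + x[rs2 + s2]
        # only vectorised operands advance to the next register;
        # scalar operands keep referring to the same register
        d += 1 if rd in is_vectorised else 0
        s1 += 1 if rs1 in is_vectorised else 0
        s2 += 1 if rs2 in is_vectorised else 0
    return x


regs = list(range(32))                 # x[n] == n initially
# vector (x8..x10) = vector (x16..x18) + scalar x4, middle element masked out
op_add(regs, 8, 16, 4, 3, {8, 16}, [1, 0, 1])
print(regs[8:11])
```

Here x8 and x10 receive 16+4 and 18+4 respectively, and x9 is left untouched because its predicate bit is clear.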
# V-Extension to Simple-V Comparative Analysis
This section covers the ways in which Simple-V is comparable
## 17.18 Vector Load/Store Instructions
-These may not have a direct equivalent in Simple-V, except if mask/tagging
-is to be deployed.
+The Vector Load/Store instructions as proposed in V are extremely powerful
+and can be used for reordering and regular restructuring.
+
+Vector Load:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          vreg[vd+j][i] = mem[areg[as1] + (i*(seglen+1)+j)*stride];
+
+Store:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          mem[areg[base] + (i*(seglen+1)+j)*stride] = vreg[vd+j][i];
+
+Indexed Load:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          vreg[vd+j][i] = mem[sreg[base] + vreg[vs2][i] + j*elsize];
+
+Indexed Store:
+
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+          mem[sreg[base] + vreg[vs2][i] + j*elsize] = vreg[vd+j][i];
+
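To illustrate the "restructuring" that the segment loads give, here is a small Python model of the Vector Load pseudo-code above. It assumes an element-addressed memory (a plain list, so a unit stride is 1) and returns the destination registers as a list of lists; `vector_load` and its parameters are hypothetical names.

```python
def vector_load(mem, base, vl, seglen, stride, pred):
    """Model of the strided segment load: vreg[j][i] <- mem[base + (i*(seglen+1)+j)*stride]."""
    fields = seglen + 1
    vreg = [[None] * vl for _ in range(fields)]
    for i in range(vl):
        if pred[i]:                       # predicated-out elements are not loaded
            for j in range(fields):
                vreg[j][i] = mem[base + (i * fields + j) * stride]
    return vreg


# memory holds interleaved pairs: (100, 101), (102, 103), (104, 105), ...
mem = list(range(100, 120))
v = vector_load(mem, base=0, vl=3, seglen=1, stride=1, pred=[1, 1, 1])
print(v)
```

With `seglen=1` the interleaved pairs in memory are split into two separate vectors, i.e. an array-of-structures to structure-of-arrays conversion done by the load itself.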
+Keeping these instructions as-is for Simple-V is highly recommended.
+However: one of the goals of this Extension is to retro-fit (re-use)
+existing RV Load/Store:
+
+[[!table data="""
+31 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:0] | rs1 | funct3 | rd | opcode |
+ 12 | 5 | 3 | 5 | 7 |
+ offset[11:0] | base | width | dest | LOAD |
+"""]]
+
+[[!table data="""
+31 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
+ 7 | 5 | 5 | 3 | 5 | 7 |
+ offset[11:5] | src | base | width | offset[4:0] | STORE |
+"""]]
+
+The RV32 instruction opcodes are as follows:
+
+[[!table data="""
+31 28 27 | 26 25 | 24 20 |19 15 |14| 13 12 | 11 7 | 6 0 | op |
+imm[4:0] | 00 | 00000 | rs1 | 1| m | vd | 0000111 | VLD |
+imm[4:0] | 01 | rs2 | rs1 | 1| m | vd | 0000111 | VLDS|
+imm[4:0] | 11 | vs2 | rs1 | 1| m | vd | 0000111 | VLDX|
+vs3 | 00 | 00000 | rs1 |1 | m |imm[4:0]| 0100111 |VST |
+vs3 | 01 | rs2 | rs1 |1 | m |imm[4:0]| 0100111 |VSTS |
+vs3 | 11 | vs2 | rs1 |1 | m |imm[4:0]| 0100111 |VSTX |
+"""]]
+
+Conversion on LOAD as follows:
-To be discussed.
+* rd or rs1 are CSR-vectorised indicating "Vector Mode"
+* rd equivalent to vd
+* rs1 equivalent to rs1
+* imm[4:0] from RV format (24..20) is same
+* imm[9:5] from RV format (29..25) is rs2 (rs2=00000 for VLD)
+* imm[11:10] from RV format (31..30) is opcode (VLD, VLDS, VLDX)
+* width from RV format (14..12) is same (width and zero/sign extend)
+
+[[!table data="""
+31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:0] ||| rs1 | funct3 | rd | opcode |
+2 | 5 | 5 | 5 | 3 | 5 | 7 |
+00 | 00000 | imm[4:0] | base | width | dest | LOAD |
+01 | rs2 | imm[4:0] | base | width | dest | LOAD.S |
+11 | rs2 | imm[4:0] | base | width | dest | LOAD.X |
+"""]]
+
+Similar conversion on STORE as follows:
+
+[[!table data="""
+31 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+imm[11:5] || rs2 | rs1 | funct3 | imm[4:0] | opcode |
+2 | 5 | 5 | 5 | 3 | 5 | 7 |
+00 | 00000 | src | base | width | offs[4:0] | STORE |
+01 | rs3 | src | base | width | offs[4:0] | STORE.S |
+11 | rs3 | src | base | width | offs[4:0] | STORE.X |
+"""]]
+
+Notes:
+
+* Predication CSR-marking register is not explicitly shown in instruction
+* In both LOAD and STORE, it is now possible to mark rs2 (or rs3) as a vector.
+* That in turn means that Indexed Load need not have an explicit opcode
+* That in turn means that bit 30 may indicate "stride" and bit 31 is free
+
+Revised LOAD:
+
+[[!table data="""
+31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
+ imm[11:0] |||| rs1 | funct3 | rd | opcode |
+ 1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
+ ? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
+"""]]
+
+Where in turn the pseudo-code may now combine the two:
+
+    if (unit-strided) stride = elsize;
+    else stride = areg[as2]; // constant-strided
+    for (int i=0; i<vl; ++i)
+      if ([!]preg[p][i])
+        for (int j=0; j<seglen+1; j++)
+        {
+          if (CSRvectorised[rs2])
+             offs = vreg[rs2][i]
+          else
+             offs = i*(seglen+1)*stride;
+          vreg[vd+j][i] = mem[sreg[base] + offs + j*stride];
+        }
+
+Notes:
+
+* j is multiplied by stride, not elsize, including in the rs2 vectorised case.
+* There may be more sophisticated variants involving the 31st bit, however
+ it would be nice to reserve that bit for post-increment of address registers
## 17.19 Vector Register Gather
providing a means to interact between the zero-overhead loop and the
vsetvl instruction. a sort-of pseudo-assembly of that would look like:
-> # a2 to be auto-incremented by t0*4
-> zero-overhead-set-auto-increment a2, t0, 4
-> # a2 to be auto-incremented by t0*4
-> zero-overhead-set-auto-increment a3, t0, 4
-> zero-overhead-set-loop-terminator-condition a0 zero
-> zero-overhead-set-start-end stripmine, stripmine+endoffset
-> stripmine:
-> vsetvl t0,a0
-> vlw v0, a2
-> vlw v1, a3
-> vfma v1, a1, v0, v1
-> vsw v1, a3
-> sub a0, a0, t0
->stripmine+endoffset:
+    # a2 to be auto-incremented by t0 times 4
+    zero-overhead-set-auto-increment a2, t0, 4
+    # a3 to be auto-incremented by t0 times 4
+    zero-overhead-set-auto-increment a3, t0, 4
+    zero-overhead-set-loop-terminator-condition a0 zero
+    zero-overhead-set-start-end stripmine, stripmine+endoffset
+    stripmine:
+      vsetvl t0,a0
+      vlw v0, a2
+      vlw v1, a3
+      vfma v1, a1, v0, v1
+      vsw v1, a3
+      sub a0, a0, t0
+    stripmine+endoffset:
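For readers unfamiliar with strip-mining, the loop above behaves roughly like the following Python sketch. `MAX_VL` is an assumed hardware maximum vector length and `setvl` is a hypothetical stand-in for vsetvl. The comments map each step back to the pseudo-assembly.

```python
MAX_VL = 4  # assumed hardware maximum vector length, for illustration


def setvl(remaining):
    """Model of vsetvl: grant min(requested, hardware maximum) elements."""
    return min(remaining, MAX_VL)


def stripmine_fma(a, x, y):
    """y[i] = a * x[i] + y[i], processed in chunks of at most MAX_VL elements."""
    n, pos, chunks = len(x), 0, []
    while n != 0:                        # loop-terminator condition: a0 == zero
        vl = setvl(n)                    # vsetvl t0, a0
        for i in range(pos, pos + vl):   # vlw, vlw, vfma, vsw
            y[i] = a * x[i] + y[i]
        pos += vl                        # the auto-increment of a2 and a3
        n -= vl                          # sub a0, a0, t0
        chunks.append(vl)
    return chunks


y = [1] * 10
chunks = stripmine_fma(2, list(range(10)), y)
print(chunks, y)
```

Ten elements with a maximum vector length of 4 take three passes of 4, 4 and 2 elements, with no separate tail-handling code.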
the question is: would something like this even be desirable? it's a
variant of auto-increment [1]. last time i saw any hint of auto-increment
* Throw an exception. Whether that actually results in spawning threads
as part of the trap-handling remains to be seen.
+# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
+
+This section compares the various parallelism proposals as they stand,
+including traditional SIMD, in terms of features, ease of implementation,
+complexity, flexibility, and die area.
+
+## [[alt_rvp]]
+
+Primary benefit of Alt-RVP is the simplicity with which parallelism
+may be introduced (effective multiplication of regfiles and associated ALUs).
+
+* plus: the simplicity of the lanes (combined with the regularity of
+  allocating identical opcodes to multiple independent registers) means
+  that SRAM or 2R1W can be used for the entire regfile (potentially).
+* minus: a more complex instruction set where the parallelism is much
+  more explicitly and directly specified in the instruction.
+* minus: if you *don't* have an explicit instruction (opcode) and you
+  need one, the only place it can be added is... in the vector unit.
+* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
+ not useable or accessible in other Extensions.
+* plus-and-minus: Lanes may be utilised for high-speed context-switching
+ but with the down-side that they're an all-or-nothing part of the Extension.
+ No Alt-RVP: no fast register-bank switching.
+* plus: Lane-switching would mean that complex operations not suited to
+ parallelisation can be carried out, followed by further parallel Lane-based
+ work, without moving register contents down to memory (and back)
+* minus: Access to registers across multiple lanes is challenging. "Solution"
+ is to drop data into memory and immediately back in again (like MMX).
+
+## Simple-V
+
+Primary benefit of Simple-V is the OO abstraction of parallel principles
+from actual (internal) parallel hardware. It's an API in effect that's
+designed to be slotted in to an existing implementation (just after
+instruction decode) with minimum disruption and effort.
+
+* minus: the complexity of having to use register renames, OoO, VLIW,
+ register file cacheing, all of which has been done before but is a
+ pain
+* plus: transparent re-use of existing opcodes as-is, just indirectly
+  saying "this register's now a vector".
+* plus: this in turn means that future instructions also get to be
+  inherently parallelised because there's no "separate vector opcodes"
+* plus: Compressed instructions may also be (indirectly) parallelised
+* minus: the indirect nature of Simple-V means that setup (setting
+  a CSR register to indicate vector length, a separate one to indicate
+  that it is a predicate register and so on) takes a little more
+  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
+  approach.
+* plus: shared register file meaning that, like Alt-RVP, complex
+ operations not suited to parallelisation may be carried out interleaved
+ between parallelised instructions *without* requiring data to be dropped
+ down to memory and back (into a separate vectorised register engine).
+* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
+ files means that huge parallel workloads would use up considerable
+ chunks of the register file. However in the case of RV64 and 32-bit
+ operations, that effectively means 64 slots are available for parallel
+ operations.
+* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
+  be added, yet the instruction opcodes remain unchanged (and still appear
+  to be parallel). The "API" stays consistent regardless of actual internal
+  parallelism: even an in-order single-issue implementation with a single
+  ALU would still appear to have parallel vectorisation.
+* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
+  hard to say if there would be pluses or minuses (on die area). At worst it
+  would be "no worse" than existing register renaming, OoO, VLIW and register
+  file cacheing schemes.
+
+## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
+
+RVV is extremely well-designed and has some amazing features, including
+2D reorganisation of memory through LOAD/STORE "strides".
+
+* plus: regular predictable workload means that implementations may
+ streamline effects on L1/L2 Cache.
+* plus: regular and clear parallel workload also means that lanes
+ (similar to Alt-RVP) may be used as an implementation detail,
+ using either SRAM or 2R1W registers.
+* plus: separate engine with no impact on the rest of an implementation
+* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
+ really feasible.
+* minus: no ISA abstraction or re-use either: additions to other Extensions
+ do not gain parallelism, resulting in prolific duplication of functionality
+ inside RVV *and out*.
+* minus: when operations require a different approach (scalar operations
+ using the standard integer or FP regfile) an entire vector must be
+ transferred out to memory, into standard regfiles, then back to memory,
+ then back to the vector unit, this to occur potentially multiple times.
+* minus: will never fit into Compressed instruction space (as-is. May
+ be able to do so if "indirect" features of Simple-V are partially adopted).
+* plus-and-slight-minus: extended variants may address up to 256
+ vectorised registers (requires 48/64-bit opcodes to do it).
+* minus-and-partial-plus: separate engine plus complexity increases
+ implementation time and die area, meaning that adoption is likely only
+ to be in high-performance specialist supercomputing (where it will
+ be absolutely superb).
+
+## Traditional SIMD
+
+The only really good things about SIMD are how easy it is to implement and
+get good performance. Unfortunately that makes it quite seductive...
+
+* plus: really straightforward, ALU basically does several packed operations
+ at once. Parallelism is inherent at the ALU, making the addition of
+ SIMD-style parallelism an easy decision that has zero significant impact
+ on the rest of any given architectural design and layout.
+* plus (continuation): SIMD in simple in-order single-issue designs can
+ therefore result in superb throughput, easily achieved even with a very
+ simple execution model.
+* minus: ridiculously complex setup and corner-cases that disproportionately
+ increase instruction count on what would otherwise be a "simple loop",
+ should the number of elements in an array not happen to exactly match
+ the SIMD group width.
+* minus: getting data usefully out of registers (if separate regfiles
+ are used) means outputting to memory and back.
+* minus: quite a lot of supplementary instructions for bit-level manipulation
+ are needed in order to efficiently extract (or prepare) SIMD operands.
+* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
+  dimension and parallelism (width) in another: an at least O(N^2) and quite
+  probably O(N^3) ISA proliferation that often results in several thousand
+  separate instructions, all requiring separate and distinct corner-case
+  algorithms!
+* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
+ 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
+ For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
+ four separate and distinct instructions: one for (r1:low r2:high),
+ one for (r1:high r2:low), one for (r1:high r2:high) and one for
+ (r1:low r2:low) *per function*.
+* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
+ between operand and result bit-widths. In combination with high/low
+ proliferation the situation is made even worse.
+* minor-saving-grace: some implementations *may* have predication masks
+ that allow control over individual elements within the SIMD block.
+
+# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
+
+This section compares the various parallelism proposals as they stand,
+*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
+the question is asked "How can each of the proposals effectively implement
+(or replace) SIMD, and how effective would they be"?
+
+## [[alt_rvp]]
+
+* Alt-RVP would not actually replace SIMD but would augment it: just as with
+ a SIMD architecture where the ALU becomes responsible for the parallelism,
+ Alt-RVP ALUs would likewise be so responsible... with *additional*
+ (lane-based) parallelism on top.
+* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation by
+ at least one dimension are avoided (architectural upgrades introducing
+ 128-bit then 256-bit then 512-bit variants of the exact same 64-bit
+ SIMD block)
+* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
+ of instructions as SIMD, albeit not quite as badly (due to Lanes).
+* In the same discussion for Alt-RVP, an additional proposal was made to
+ be able to subdivide the bits of each register lane (columns) down into
+ arbitrary bit-lengths (RGB 565 for example).
+* A recommendation was given instead to make the subdivisions down to 32-bit,
+ 16-bit or even 8-bit, effectively dividing the registerfile into
+ Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
+ "swapping" instructions were then introduced, some of the disadvantages
+ of SIMD could be mitigated.
+
+## RVV
+
+* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
+ parallelism.
+* However whilst SIMD is usually designed for single-issue in-order simple
+ DSPs with a focus on Multimedia (Audio, Video and Image processing),
+ RVV's primary focus appears to be on Supercomputing: optimisation of
+ mathematical operations that fit into the OpenCL space.
+* Adding functions (operations) that would normally fit (in parallel)
+ into a SIMD instruction requires an equivalent to be added to the
+ RVV Extension, if one does not exist. Given the specialist nature of
+ some SIMD instructions (8-bit or 16-bit saturated or halving add),
+ this possibility seems extremely unlikely to occur, even if the
+ implementation overhead of RVV were acceptable (compared to
+ normal SIMD/DSP-style single-issue in-order simplicity).
+
+## Simple-V
+
+* Simple-V borrows hugely from RVV as it is intended to be easy to
+ topologically transplant every single instruction from RVV (as
+ designed) into Simple-V equivalents, with *zero loss of functionality
+ or capability*.
+* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
+ Extension which contained the basic primitives (non-parallelised
+ 8, 16 or 32-bit SIMD operations) inherently *become* parallel,
+ automatically.
+* Additionally, standard operations (ADD, MUL) that would normally have
+  to have special SIMD-parallel opcodes added need no longer have *any*
+  of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
+  four 32-bit ADDs in a 128-bit register) because Simple-V takes the
+  *standard* RV opcodes (present and future) and automatically parallelises
+  them.
+* By inheriting the RVV feature of arbitrary vector-length, then just as
+  with RVV the corner-cases and ISA proliferation of SIMD are avoided.
+* Whilst not entirely finalised, registers are expected to be
+ capable of being subdivided down to an implementor-chosen bitwidth
+ in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
+ and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
+ choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
+ ALUs that perform twin 8-bit operations as they see fit, or anything
+ else including no subdivisions at all.
+* Even though implementors are free to choose full 64-bit SIMD (with
+  RV64), they *must* provide predication that transparently
+  switches off appropriate units on the last loop, thus neatly fitting
+  underlying SIMD ALU implementations *into* the arbitrary vector-length
+  RVV paradigm, keeping the uniform consistent API that is a key strategic
+  feature of Simple-V.
+* With Simple-V fitting into the standard register files, certain classes
+ of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
+ can be done by applying *Parallelised* Bit-manipulation operations
+ followed by parallelised *straight* versions of element-to-element
+ arithmetic operations, even if the bit-manipulation operations require
+ changing the bitwidth of the "vectors" to do so. Predication can
+ be utilised to skip high words (or low words) in source or destination.
+* In essence, the key downside of SIMD - massive duplication of
+ identical functions over time as an architecture evolves from 32-bit
+ wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
+ vector-style parallelism being dropped on top of 8-bit or 16-bit
+ operations, all the while keeping a consistent ISA-level "API" irrespective
+ of implementor design choices (or indeed actual implementations).
+
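The requirement above, that a SIMD-based implementation must switch off unused units on the last loop, can be sketched as a per-pass lane-enable mask. This is a minimal Python illustration, assuming a fixed-width SIMD ALU of `simd_width` lanes underneath an arbitrary vector length. `lane_masks` is a hypothetical helper, not part of any proposal.

```python
def lane_masks(n_elements, simd_width):
    """Return one lane-enable mask per pass over a SIMD ALU of simd_width lanes."""
    masks = []
    for start in range(0, n_elements, simd_width):
        active = min(simd_width, n_elements - start)
        # lanes beyond the end of the vector are switched off on the last pass
        masks.append([1] * active + [0] * (simd_width - active))
    return masks


for mask in lane_masks(10, 4):
    print(mask)
```

A vector of 10 elements on a 4-lane ALU needs three passes, and only the final pass has lanes disabled, so software never sees the underlying SIMD width.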
# Implementing V on top of Simple-V
* Number of Offset CSRs extends from 2
(caveat: anything not specified drops through to software-emulation / traps)
* TODO
-# Analysis of CSR decoding on latency
-
-<a name="csr_decoding_analysis"></a>
+# Register reordering <a name="register_reordering"></a>
+
+## Register File
+
+| Reg Num | Bits |
+| ------- | ---- |
+| r0      | (31..0) |
+| r1      | (31..0) |
+| r2      | (31..0) |
+| r3      | (31..0) |
+| r4      | (31..0) |
+| r5      | (31..0) |
+| r6      | (31..0) |
+| r7      | (31..0) |
+
+## Vectorised CSR
+
+May not be an actual CSR: may be generated from Vector Length CSR:
+single-bit is less burdensome on instruction decode phase.
+
+| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
+| - | - | - | - | - | - | - | - |
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
+
+## Vector Length CSR
+
+| Reg Num | (3..0) |
+| ------- | ---- |
+| r0 | 2 |
+| r1 | 0 |
+| r2 | 1 |
+| r3 | 1 |
+| r4 | 3 |
+| r5 | 0 |
+| r6 | 0 |
+| r7 | 1 |
+
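As noted under "Vectorised CSR", the single-bit-per-register field could be generated from the Vector Length CSR rather than stored. A small Python sketch, assuming the rule is "a length greater than 1 marks a vector" (an assumption that happens to reproduce the example bit pattern shown earlier):

```python
# Vector Length CSR contents from the table above, indexed by register number
VECTOR_LEN = {0: 2, 1: 0, 2: 1, 3: 1, 4: 3, 5: 0, 6: 0, 7: 1}


def vectorised_bits(vector_len):
    """Derive the one-bit-per-register 'vectorised' field from the lengths."""
    bits = 0
    for reg, length in vector_len.items():
        if length > 1:          # assumed rule: length > 1 marks a vector
            bits |= 1 << reg
    return bits


print(format(vectorised_bits(VECTOR_LEN), "08b"))
```

Bits 0 and 4 come out set, matching r0 (length 2) and r4 (length 3).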
+## Virtual Register Reordering:
+
+| Reg Num | Bits (0) | Bits (1) | Bits (2) |
+| ------- | -------- | -------- | -------- |
+| r0      | (31..0)  | (31..0)  |          |
+| r2      | (31..0)  |          |          |
+| r3      | (31..0)  |          |          |
+| r4      | (31..0)  | (31..0)  | (31..0)  |
+| r7      | (31..0)  |          |          |
+
+## Example Instruction translation: <a name="example_translation"></a>
+
+The instruction "ADD r2 r4 r4" would result in three instructions being
+generated and placed into the FIFO:
+
+* ADD r2 r4 r4
+* ADD r2 r5 r5
+* ADD r2 r6 r6
+
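The translation can be sketched in Python as follows. This is an illustration only: it assumes that a register is treated as a vector when its Vector Length entry is greater than 1, and that the number of generated instructions is the largest length among the vectorised operands. Neither rule is settled by the text above, and `expand` is a hypothetical name.

```python
# Vector Length CSR contents from the example table, indexed by register number
VECTOR_LEN = {0: 2, 1: 0, 2: 1, 3: 1, 4: 3, 5: 0, 6: 0, 7: 1}


def expand(op, rd, rs1, rs2, vector_len):
    """Expand one instruction into a list of per-element scalar instructions."""
    def is_vec(r):
        return vector_len.get(r, 0) > 1

    lengths = [vector_len[r] for r in (rd, rs1, rs2) if is_vec(r)]
    count = max(lengths, default=1)      # all-scalar: a single instruction
    return [(op,
             rd + (i if is_vec(rd) else 0),     # scalar operands stay put
             rs1 + (i if is_vec(rs1) else 0),
             rs2 + (i if is_vec(rs2) else 0))
            for i in range(count)]


queue = expand("ADD", 2, 4, 4, VECTOR_LEN)
for op, rd, rs1, rs2 in queue:
    print(f"{op} r{rd} r{rs1} r{rs2}")
```

With r4 tagged as a vector of length 3 and r2 left scalar, this reproduces the three instructions listed above.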
+## Insights
+
+SIMD register file splitting still to consider. For RV64, benefits of doubling
+(quadrupling in the case of Half-Precision IEEE754 FP) the apparent
+size of the floating point register file to 64 (128 in the case of HP)
+seem pretty clear and worth the complexity.
+
+With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
+are done on 64-bit registers, it's not so conceptually difficult. It may
+even be achieved by *actually* splitting the regfile into 64 virtual 32-bit
+registers such that a 64-bit FP scalar operation is dropped into (r0.H
+r0.L) tuples. The implementation is therefore hidden through register
+renaming.
+
+Implementations intending to introduce VLIW, OoO and parallelism
+(even without Simple-V) would then find that the instructions are
+generated quicker (or in a more compact fashion that is less heavy
+on caches). Interestingly we observe then that Simple-V is about
+"consolidation of instruction generation", where actual parallelism
+of underlying hardware is an implementor-choice that could just as
+equally be applied *without* Simple-V even being implemented.
+
+# Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
It could indeed have been logically deduced (or expected), that there
would be additional decode latency in this proposal, because if
parallel ALUs) is only equal to one ("virtual" parallelism), or is
greater than one, should not be underestimated.
+# Appendix
+
+# Reducing Register Bank porting
+
+This looks quite reasonable.
+<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>
+
+The main details are outlined on page 4. They propose a 2-level register
+cache hierarchy, note that registers are typically only read once, that
+you never write back from upper to lower cache level but always go in a
+cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
+a scheme where you look ahead by only 2 instructions to determine which
+registers to bring into the cache.
+
+The nice thing about a vector architecture is that you *know* that
+*even more* registers are going to be pulled in: Hwacha uses this fact
+to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
+by *introducing* deliberate latency into the execution phase.
+
+
# References
* Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
* Life of Triangles (3D) <https://jbush001.github.io/2016/02/27/life-of-triangle.html>
* Videocore-IV <https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-3d-Graphics-Pipeline>
+* Discussion proposing CSRs that change ISA definition
+ <https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/InzQ1wr_3Ak>
+* Zero-overhead loops <https://pdfs.semanticscholar.org/dbaa/66985cc730d4b44d79f519e96ec9c43ab5b7.pdf>
+* Multi-ported VLIW Register File Implementation <https://ce-publications.et.tudelft.nl/publications/1517_multiple_contexts_in_a_multiported_vliw_register_file_impl.pdf>
+* Fast context save/restore proposal <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57F823FA.6030701%40gmail.com>
+* Register File Bank Cacheing <https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>