-Vector lengths are interpreted as meaning "any instruction referring to
-r(N) generates implicit identical instructions referring to registers
-r(N) through r(N+M-1), where M is the Vector Length". Vector Lengths may
-be set to use up to 16 registers in the register file.
-
-One separate CSR table is needed for each of the integer and floating-point
-register files:
-
-| RegNo | (3..0) |
-| ----- | ------ |
-| r0 | vlen0 |
-| r1 | vlen1 |
-| .. | vlen.. |
-| r31 | vlen31 |
-
-An array of 32 4-bit CSRs (4 bits per register) is needed to indicate
-whether a register, when referred to in any standard instruction, is
-implicitly to be treated as a vector. A vector length of 1 indicates
-that the register is to be treated as a scalar. Vector lengths of 0 are reserved.
-
-Internally, implementations may choose to use the non-zero vector length
-to set a bit-field per register, to be used in the instruction decode phase.
-In this way any standard (current or future) operation involving
-register operands may detect if the operation is to be vector-vector,
-vector-scalar or scalar-scalar (standard) simply through a single
-bit test.
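As a sketch (hypothetical helper name; treating vlen == 1 as scalar, per the CSR description above), the per-register bit-field could be derived like this:

```python
# Sketch: derive a 32-bit "register is vectorised" bitfield from the
# per-register vector-length CSRs (hypothetical names, illustration only).
def build_vectorised_bitfield(csr_vectorlen):
    """csr_vectorlen: list of 32 4-bit values; 1 = scalar, >1 = vector."""
    bitfield = 0
    for regno, vlen in enumerate(csr_vectorlen):
        if vlen > 1:                 # vector: set the bit for this register
            bitfield |= 1 << regno
    return bitfield

vlens = [1] * 32
vlens[4] = 3                         # r4 is a vector of length 3
bf = build_vectorised_bitfield(vlens)
assert bf & (1 << 4)                 # decode does a single bit test: vector
assert not bf & (1 << 5)             # scalar
```

Instruction decode then needs only the single bit test shown in the two asserts, rather than a 4-bit compare per operand.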
-
-Note that when using the "vsetl rs1, rs2" instruction (caveat: when the
-bitwidth has specifically *not* been set) the calculation becomes:
-
- CSRvlength = MIN(MIN(CSRvectorlen[rs1], MAXVECTORDEPTH), rs2)
-
-This is in contrast to RVV:
-
- CSRvlength = MIN(MIN(rs1, MAXVECTORDEPTH), rs2)
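A minimal Python sketch of the contrast between the two formulas, assuming a MAXVECTORDEPTH of 16 purely for illustration:

```python
MAXVECTORDEPTH = 16  # assumption for illustration

def vsetl_simplev(csr_vectorlen_rs1, rs2):
    # Simple-V: length comes from the *register's* vector-length CSR
    return min(min(csr_vectorlen_rs1, MAXVECTORDEPTH), rs2)

def vsetl_rvv(rs1, rs2):
    # RVV: length comes from the *value* held in rs1
    return min(min(rs1, MAXVECTORDEPTH), rs2)

assert vsetl_simplev(3, 5) == 3   # CSR-configured length wins
assert vsetl_rvv(7, 5) == 5       # requested length in rs1 is clamped by rs2
```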
-
-## Element (SIMD) bitwidth CSRs
-
-Element bitwidths may be specified with a per-register CSR, and indicate
-how a register (integer or floating-point) is to be subdivided.
-
-| RegNo | (2..0) |
-| ----- | ------ |
-| r0 | vew0 |
-| r1 | vew1 |
-| .. | vew.. |
-| r31 | vew31 |
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
-
-Extending this table (with extra bits) is covered in the section
-"Implementing RVV on top of Simple-V".
-
-Note that when using the "vsetl rs1, rs2" instruction, taking bitwidth
-into account, it becomes:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0):
-        bytesperreg = (XLEN/8)            # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew]     # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg     # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-The reason for multiplying the vector length by the number of SIMD elements
-(in each individual register) is so that each SIMD element may optionally be
-predicated.
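The pseudocode above can be sketched as runnable Python (RV32 and MAXVECTORDEPTH=16 assumed purely for illustration):

```python
# Sketch of the bitwidth-aware vsetl computation (RV32 assumed).
XLEN = 32
MAXVECTORDEPTH = 16                      # assumption for illustration
bytestable = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}  # vew code -> bytes per element

def vsetl(csr_bitwidth_rs1, csr_vectorlen_rs1, rs2):
    vew = csr_bitwidth_rs1
    if vew == 0:
        bytesperreg = XLEN // 8          # default: full-width elements
    else:
        bytesperreg = bytestable[vew]
    simdmult = (XLEN // 8) // bytesperreg  # SIMD elements per register
    vlen = csr_vectorlen_rs1 * simdmult
    return min(min(vlen, MAXVECTORDEPTH), rs2)

# r2: vew=010 (16-bit), vector length 3, vsetl requests 5:
assert vsetl(0b010, 3, 5) == 5           # 3 regs * 2 elements = 6, capped at 5
```

The final assert matches the worked r2 example given later in "Bitwidth Virtual Register Reordering".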
-
-An example of how to subdivide the register file when bitwidth != default
-is given in the section "Bitwidth Virtual Register Reordering".
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
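A minimal sketch of these exception checks (illustrative function name, lengths taken from the per-register vector-length CSRs):

```python
# Sketch of the exception rules above for a vector ADD (illustration only).
def check_add(vlen_rd, vlen_rs1, vlen_rs2):
    if vlen_rs1 != vlen_rs2:
        raise ValueError("illegal: source vector lengths differ")
    if vlen_rd > 1 and vlen_rs1 > vlen_rd:
        raise ValueError("illegal: source longer than vector destination")

check_add(4, 4, 4)        # fine: all lengths match
try:
    check_add(2, 4, 4)    # source longer than vector destination
except ValueError:
    pass                  # trap taken
```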
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
-
-# Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
-
-This section compares the various parallelism proposals as they stand,
-including traditional SIMD, in terms of features, ease of implementation,
-complexity, flexibility, and die area.
-
-## [[alt_rvp]]
-
-Primary benefit of Alt-RVP is the simplicity with which parallelism
-may be introduced (effective multiplication of regfiles and associated ALUs).
-
-* plus: the simplicity of the lanes (combined with the regularity of
- allocating identical opcodes to multiple independent registers) means
- that SRAM or 2R1W can (potentially) be used for the entire regfile.
-* minus: a more complex instruction set where the parallelism is much
- more explicitly directly specified in the instruction and
-* minus: if you *don't* have an explicit instruction (opcode) and you
- need one, the only place it can be added is... in the vector unit and
-* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
- not useable or accessible in other Extensions.
-* plus-and-minus: Lanes may be utilised for high-speed context-switching
- but with the down-side that they're an all-or-nothing part of the Extension.
- No Alt-RVP: no fast register-bank switching.
-* plus: Lane-switching would mean that complex operations not suited to
- parallelisation can be carried out, followed by further parallel Lane-based
- work, without moving register contents down to memory (and back)
-* minus: Access to registers across multiple lanes is challenging. "Solution"
- is to drop data into memory and immediately back in again (like MMX).
-
-## Simple-V
-
-Primary benefit of Simple-V is the OO abstraction of parallel principles
-from actual (internal) parallel hardware. In effect it is an API
-designed to be slotted in to an existing implementation (just after
-instruction decode) with minimum disruption and effort.
-
-* minus: the complexity of having to use register renames, OoO, VLIW,
- register file cacheing, all of which has been done before but is a
- pain
-* plus: transparent re-use of existing opcodes as-is just indirectly
- saying "this register's now a vector" which
-* plus: means that future instructions also get to be inherently
- parallelised because there's no "separate vector opcodes"
-* plus: Compressed instructions may also be (indirectly) parallelised
-* minus: the indirect nature of Simple-V means that setup (setting
- a CSR register to indicate vector length, a separate one to indicate
- that it is a predicate register and so on) requires a little more setup
- time than Alt-RVP or RVV's "direct and within the (longer) instruction"
- approach.
-* plus: shared register file meaning that, like Alt-RVP, complex
- operations not suited to parallelisation may be carried out interleaved
- between parallelised instructions *without* requiring data to be dropped
- down to memory and back (into a separate vectorised register engine).
-* plus-and-maybe-minus: re-use of integer and floating-point 32-wide register
- files means that huge parallel workloads would use up considerable
- chunks of the register file. However in the case of RV64 and 32-bit
- operations, that effectively means 64 slots are available for parallel
- operations.
-* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
- be added, yet the instruction opcodes remain unchanged (and still appear
- to be parallel): a consistent "API" regardless of actual internal
- parallelism. Even an in-order single-issue implementation with a single
- ALU would still appear to have parallel vectorisation.
-* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
- hard to say if there would be pluses or minuses (on die area). At worst it
- would be "no worse" than existing register renaming, OoO, VLIW and register
- file cacheing schemes.
-
-## RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
-
-RVV is extremely well-designed and has some amazing features, including
-2D reorganisation of memory through LOAD/STORE "strides".
-
-* plus: regular predictable workload means that implementations may
- streamline effects on L1/L2 Cache.
-* plus: regular and clear parallel workload also means that lanes
- (similar to Alt-RVP) may be used as an implementation detail,
- using either SRAM or 2R1W registers.
-* plus: separate engine with no impact on the rest of an implementation
-* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
- really feasible.
-* minus: no ISA abstraction or re-use either: additions to other Extensions
- do not gain parallelism, resulting in prolific duplication of functionality
- inside RVV *and out*.
-* minus: when operations require a different approach (scalar operations
- using the standard integer or FP regfile) an entire vector must be
- transferred out to memory, into standard regfiles, then back to memory,
- then back to the vector unit, this to occur potentially multiple times.
-* minus: will never fit into Compressed instruction space (as-is. May
- be able to do so if "indirect" features of Simple-V are partially adopted).
-* plus-and-slight-minus: extended variants may address up to 256
- vectorised registers (requires 48/64-bit opcodes to do it).
-* minus-and-partial-plus: separate engine plus complexity increases
- implementation time and die area, meaning that adoption is likely only
- to be in high-performance specialist supercomputing (where it will
- be absolutely superb).
-
-## Traditional SIMD
-
-The only really good things about SIMD are how easy it is to implement and
-how easy it is to get good performance from it. Unfortunately that makes it
-quite seductive...
-
-* plus: really straightforward, ALU basically does several packed operations
- at once. Parallelism is inherent at the ALU, making the addition of
- SIMD-style parallelism an easy decision that has zero significant impact
- on the rest of any given architectural design and layout.
-* plus (continuation): SIMD in simple in-order single-issue designs can
- therefore result in superb throughput, easily achieved even with a very
- simple execution model.
-* minus: ridiculously complex setup and corner-cases that disproportionately
- increase instruction count on what would otherwise be a "simple loop",
- should the number of elements in an array not happen to exactly match
- the SIMD group width.
-* minus: getting data usefully out of registers (if separate regfiles
- are used) means outputting to memory and back.
-* minus: quite a lot of supplementary instructions for bit-level manipulation
- are needed in order to efficiently extract (or prepare) SIMD operands.
-* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
- dimension and parallelism (width): an at least O(N^2) and quite probably
- O(N^3) ISA proliferation that often results in several thousand
- separate instructions, all requiring separate and distinct corner-case
- algorithms!
-* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
- 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
- For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
- four separate and distinct instructions: one for (r1:low r2:high),
- one for (r1:high r2:low), one for (r1:high r2:high) and one for
- (r1:low r2:low) *per function*.
-* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
- between operand and result bit-widths. In combination with high/low
- proliferation the situation is made even worse.
-* minor-saving-grace: some implementations *may* have predication masks
- that allow control over individual elements within the SIMD block.
-
-# Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
-
-This section compares the various parallelism proposals as they stand,
-*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
-the question is asked "How can each of the proposals effectively implement
-(or replace) SIMD, and how effective would they be"?
-
-## [[alt_rvp]]
-
-* Alt-RVP would not actually replace SIMD but would augment it: just as with
- a SIMD architecture where the ALU becomes responsible for the parallelism,
- Alt-RVP ALUs would likewise be so responsible... with *additional*
- (lane-based) parallelism on top.
-* Thus at least some of the downsides of SIMD ISA O(N^3) proliferation are
- avoided, in at least one dimension (architectural upgrades introducing
- 128-bit then 256-bit then 512-bit variants of the exact same 64-bit
- SIMD block).
-* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
- of instructions as SIMD, albeit not quite as badly (due to Lanes).
-* In the same discussion for Alt-RVP, an additional proposal was made to
- be able to subdivide the bits of each register lane (columns) down into
- arbitrary bit-lengths (RGB 565 for example).
-* A recommendation was given instead to make the subdivisions down to 32-bit,
- 16-bit or even 8-bit, effectively dividing the registerfile into
- Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
- "swapping" instructions were then introduced, some of the disadvantages
- of SIMD could be mitigated.
-
-## RVV
-
-* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
- parallelism.
-* However whilst SIMD is usually designed for single-issue in-order simple
- DSPs with a focus on Multimedia (Audio, Video and Image processing),
- RVV's primary focus appears to be on Supercomputing: optimisation of
- mathematical operations that fit into the OpenCL space.
-* Adding functions (operations) that would normally fit (in parallel)
- into a SIMD instruction requires an equivalent to be added to the
- RVV Extension, if one does not exist. Given the specialist nature of
- some SIMD instructions (8-bit or 16-bit saturated or halving add),
- this possibility seems extremely unlikely to occur, even if the
- implementation overhead of RVV were acceptable (compared to
- normal SIMD/DSP-style single-issue in-order simplicity).
-
-## Simple-V
-
-* Simple-V borrows hugely from RVV as it is intended to be easy to
- topologically transplant every single instruction from RVV (as
- designed) into Simple-V equivalents, with *zero loss of functionality
- or capability*.
-* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
- Extension containing the basic primitives (non-parallelised
- 8, 16 or 32-bit SIMD operations) inherently *becomes* parallel,
- automatically.
-* Additionally, standard operations (ADD, MUL) that would normally have
- to have special SIMD-parallel opcodes added need no longer have *any*
- of the length-dependent variants (2of 32-bit ADDs in a 64-bit register,
- 4of 32-bit ADDs in a 128-bit register) because Simple-V takes the
- *standard* RV opcodes (present and future) and automatically parallelises
- them.
-* By inheriting the RVV feature of arbitrary vector-length, then just as
- with RVV the corner-cases and ISA proliferation of SIMD are avoided.
-* Whilst not entirely finalised, registers are expected to be
- capable of being subdivided down to an implementor-chosen bitwidth
- in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
- and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
- choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
- ALUs that perform twin 8-bit operations as they see fit, or anything
- else including no subdivisions at all.
-* Even though implementors have that choice even to have full 64-bit
- (with RV64) SIMD, they *must* provide predication that transparently
- switches off appropriate units on the last loop, thus neatly fitting
- underlying SIMD ALU implementations *into* the arbitrary vector-length
- RVV paradigm, keeping the uniform consistent API that is a key strategic
- feature of Simple-V.
-* With Simple-V fitting into the standard register files, certain classes
- of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
- can be done by applying *Parallelised* Bit-manipulation operations
- followed by parallelised *straight* versions of element-to-element
- arithmetic operations, even if the bit-manipulation operations require
- changing the bitwidth of the "vectors" to do so. Predication can
- be utilised to skip high words (or low words) in source or destination.
-* In essence, the key downside of SIMD - massive duplication of
- identical functions over time as an architecture evolves from 32-bit
- wide SIMD all the way up to 512-bit, is avoided with Simple-V, through
- vector-style parallelism being dropped on top of 8-bit or 16-bit
- operations, all the while keeping a consistent ISA-level "API" irrespective
- of implementor design choices (or indeed actual implementations).
-
-# Implementing V on top of Simple-V
-
-* Number of Offset CSRs extends from 2
-* Extra register file: vector-file
-* Setup of Vector length and bitwidth CSRs now can specify vector-file
- as well as integer or float file.
-* Extend CSR tables (bitwidth) with extra bits
-* TODO
-
-# Implementing P (renamed to DSP) on top of Simple-V
-
-* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
- (caveat: anything not specified drops through to software-emulation / traps)
-* TODO
-
-# Appendix
-
-## V-Extension to Simple-V Comparative Analysis
-
-This section has been moved to its own page [[v_comparative_analysis]]
-
-## P-Ext ISA
-
-This section has been moved to its own page [[p_comparative_analysis]]
-
-## Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
-    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
-    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
-    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
-    register x[32][XLEN];
-
-    function op_add(rd, rs1, rs2, predr)
-    {
-       /* note that this is ADD, not PADD */
-       int i, id, irs1, irs2;
-       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
-       # also destination makes no sense as a scalar but what the hell...
-       for (i = 0, id = 0, irs1 = 0, irs2 = 0; i < CSRvectorlen[rd]; i++)
-       {
-          if (CSRpredicate[predr][i]) # i *think* this is right...
-             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
-          # now increment the idxs (per loop iteration)
-          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
-             id += 1;
-          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
-             irs1 += 1;
-          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
-             irs2 += 1;
-       }
-    }
-
-## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explicitly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-becomes quite challenging.
-
-However what if all branch instructions, if referencing a vectorised
-register, were instead given *completely new analogous meanings* that
-resulted in a parallel bit-wise predication register being set? This
-would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
-BLT and BGE.
-
-We might imagine that FEQ, FLT and FLE would also need to be converted,
-however these are effectively *already* in the precise form needed and
-do not need to be converted *at all*! The difference is that FEQ, FLT
-and FLE *specifically* write a 1 to an integer register if the condition
-holds, and 0 if not. All that needs to be done here is to say, "if
-the integer register is tagged with a bit that says it is a predication
-register, the **bit** in the integer register is set based on the
-current vector index" instead.
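A minimal Python sketch of that reinterpretation (hypothetical state names: intregs, is_predicate): when the destination integer register is tagged as a predication register, the comparison writes a single bit at the current vector element index instead of the whole register.

```python
# Sketch: FEQ-style compare writing a *bit* of a predication-tagged
# integer register at the current vector element index (illustration only).
def feq_element(intregs, rd, is_predicate, element_idx, a, b):
    result = 1 if a == b else 0
    if is_predicate[rd]:
        # tagged as a predicate register: set/clear the bit at the index
        if result:
            intregs[rd] |= 1 << element_idx
        else:
            intregs[rd] &= ~(1 << element_idx)
    else:
        intregs[rd] = result          # standard scalar behaviour: 0 or 1

regs = [0] * 32
tags = [False] * 32
tags[5] = True                        # r5 is tagged as a predicate register
for i, (a, b) in enumerate([(1.0, 1.0), (2.0, 3.0), (4.0, 4.0)]):
    feq_element(regs, 5, tags, i, a, b)
assert regs[5] == 0b101               # elements 0 and 2 compared equal
```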
-
-There is, in the standard Conditional Branch instruction, more than
-adequate space to interpret it in a similar fashion:
-
-[[!table data="""
- 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
-imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
- 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
- offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
-"""]]
-
-This would become:
-
-[[!table data="""
-31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
-reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
-"""]]
-
-Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
-with the interesting side-effect that there is space within what is presently
-the "immediate offset" field to reinterpret it: adding not only a bit
-field to distinguish between floating-point compare and integer compare
-and a second source register, but also using some of the bits as
-a predication target as well.
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | op |
- 3 | 3 | 3 | 5 | 2 |
- C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
-"""]]
-
-Now uses the CS format:
-
-[[!table data="""
-15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
- funct3 | imm | rs10 | imm | | op |
- 3 | 3 | 3 | 2 | 3 | 2 |
- C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
-"""]]
-
-Bit 6 would be decoded as "operation refers to Integer or Float" including
-interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
-"C" Standard, version 2.0,
-whilst Bit 5 would allow the operation to be extended, in combination with
-funct3 = 110 or 111: a combination of four distinct (predicated) comparison
-operators. In both floating-point and integer cases those could be
-EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting src1 and src2).
-
-## Register reordering <a name="register_reordering"></a>
-
-### Register File
-
-| Reg Num | Bits |
-| ------- | ---- |
-| r0 | (31..0) |
-| r1 | (31..0) |
-| r2 | (31..0) |
-| r3 | (31..0) |
-| r4 | (31..0) |
-| r5 | (31..0) |
-| r6 | (31..0) |
-| r7 | (31..0) |
-| .. | (31..0) |
-| r31| (31..0) |
-
-### Vectorised CSR
-
-This may not be an actual CSR: it may be generated from the Vector Length
-CSR, as a single bit is less burdensome on the instruction decode phase.
-
-| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
-| - | - | - | - | - | - | - | - |
-| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
-
-### Vector Length CSR
-
-| Reg Num | (3..0) |
-| ------- | ---- |
-| r0 | 2 |
-| r1 | 0 |
-| r2 | 1 |
-| r3 | 1 |
-| r4 | 3 |
-| r5 | 0 |
-| r6 | 0 |
-| r7 | 1 |
-
-### Virtual Register Reordering
-
-This example assumes the above Vector Length CSR table
-
-| Reg Num | Bits (0) | Bits (1) | Bits (2) |
-| ------- | -------- | -------- | -------- |
-| r0 | (31..0) | (31..0) | |
-| r2 | (31..0) | | |
-| r3 | (31..0) | | |
-| r4 | (31..0) | (31..0) | (31..0) |
-| r7 | (31..0) | | |
-
-### Bitwidth Virtual Register Reordering
-
-This example goes a little further and illustrates the effect that a
-bitwidth CSR has been set on a register. Preconditions:
-
-* RV32 assumed
-* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
-* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
-* vsetl rs1, 5 # set the vector length to 5
-
-This is interpreted as follows:
-
-* Given that the context is RV32, ELEN=32.
-* With ELEN=32 and bitwidth=16, the number of SIMD elements is 2
-* Therefore the actual vector length is up to *six* elements
-* However vsetl sets a length of 5, therefore the last "element" is skipped
-
-So when using an operation that uses r2 as a source (or destination)
-the operation is carried out as follows:
-
-* 16-bit operation on r2(15..0) - vector element index 0
-* 16-bit operation on r2(31..16) - vector element index 1
-* 16-bit operation on r3(15..0) - vector element index 2
-* 16-bit operation on r3(31..16) - vector element index 3
-* 16-bit operation on r4(15..0) - vector element index 4
-* 16-bit operation on r4(31..16) **NOT** carried out due to length being 5
-
-Predication has been left out of the above example for simplicity, however
-predication is ANDed with the length-based masking of the later elements
-(the case where vsetl is set below maximum capacity).
-
-Note also that it is entirely an implementor's choice as to whether to have
-actual separate ALUs down to the minimum bitwidth, or whether to have something
-more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
-operations carried out 32-bits at a time is perfectly acceptable, as is
-8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
-Regardless of the internal parallelism choice, *predication must
-still be respected*, making Simple-V in effect the "consistent public API".
-
-vew may be one of the following (giving a table "bytestable", used below):
-
-| vew | bitwidth |
-| --- | -------- |
-| 000 | default |
-| 001 | 8 |
-| 010 | 16 |
-| 011 | 32 |
-| 100 | 64 |
-| 101 | 128 |
-| 110 | rsvd |
-| 111 | rsvd |
-
-Pseudocode for vector length taking CSR SIMD-bitwidth into account:
-
-    vew = CSRbitwidth[rs1]
-    if (vew == 0):
-        bytesperreg = (XLEN/8)            # or FLEN as appropriate
-    else:
-        bytesperreg = bytestable[vew]     # 1 2 4 8 16
-    simdmult = (XLEN/8) / bytesperreg     # or FLEN as appropriate
-    vlen = CSRvectorlen[rs1] * simdmult
-
-To index an element in a register rnum where the vector element index is i
-(note: bit offsets are expressed in terms of the element bitwidth, i.e.
-bytesperreg * 8, not the raw vew encoding):
-
-    function regoffs(rnum, i):
-        regidx = floor(i / simdmult)      # integer-div rounded down
-        elemidx = i % simdmult            # element index within the register
-        elwidth = bytesperreg * 8         # element width in bits
-        return rnum + regidx,             # actual real register
-               elemidx * elwidth,         # low bit
-               elemidx * elwidth + elwidth - 1, # high bit
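A runnable Python sketch of element indexing for the worked r2 example above (RV32, 16-bit elements, offsets expressed in bits):

```python
# Sketch: map vector element index i to (register, low bit, high bit),
# RV32 assumed; elwidth_bits is the element width in bits.
XLEN = 32

def regoffs(rnum, i, elwidth_bits):
    simdmult = XLEN // elwidth_bits          # SIMD elements per register
    regidx = i // simdmult                   # integer-div rounded down
    elemidx = i % simdmult                   # element within that register
    lo = elemidx * elwidth_bits
    return rnum + regidx, lo, lo + elwidth_bits - 1

# vector starting at r2, 16-bit elements, vector length 5:
slices = [regoffs(2, i, 16) for i in range(5)]
assert slices == [(2, 0, 15), (2, 16, 31), (3, 0, 15), (3, 16, 31), (4, 0, 15)]
```

The assert reproduces the five 16-bit operations listed in the worked example, with the sixth element (r4 bits 31..16) dropped by the length of 5.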
-
-### Example Instruction translation: <a name="example_translation"></a>
-
-The instruction "ADD r2 r4 r4" would result in three instructions being
-generated and placed into the FILO:
-
-* ADD r2 r4 r4
-* ADD r2 r5 r5
-* ADD r2 r6 r6
-
-### Insights
-
-SIMD register file splitting is still to be considered. For RV64, the
-benefits of doubling (quadrupling in the case of Half-Precision IEEE754 FP)
-the apparent size of the floating-point register file to 64 (128 in the
-case of HP) seem pretty clear and worth the complexity.
-
-With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
-are done on 64-bit registers, it's not so conceptually difficult. It may even
-be achieved by *actually* splitting the regfile into 64 virtual 32-bit
-registers such that a 64-bit FP scalar operation is dropped into (r0.H
-r0.L) tuples. The implementation is therefore hidden through register renaming.
-
-Implementations intending to introduce VLIW, OoO and parallelism
-(even without Simple-V) would then find that the instructions are
-generated quicker (or in a more compact fashion that is less heavy
-on caches). Interestingly we observe then that Simple-V is about
-"consolidation of instruction generation", where actual parallelism
-of underlying hardware is an implementor-choice that could just as
-equally be applied *without* Simple-V even being implemented.
-
-## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>
-
-It could indeed have been logically deduced (or expected) that there
-would be additional decode latency in this proposal: if the opcodes are
-overloaded to have different meanings, there is guaranteed to be some
-state, somewhere, directly related to the registers.
-
-There are several cases:
-
-* All operands vector-length=1 (scalars), all operands
- packed-bitwidth="default": instructions are passed through direct as if
- Simple-V did not exist. Simple-V is, in effect, completely disabled.
-* At least one operand vector-length > 1, all operands
- packed-bitwidth="default": any parallel vector ALUs placed on "alert",
- virtual parallelism looping may be activated.
-* All operands vector-length=1 (scalars), at least one
- operand packed-bitwidth != default: degenerate case of SIMD,
- implementation-specific complexity here (packed decode before ALUs or
- *IN* ALUs)
-* At least one operand vector-length > 1, at least one operand
- packed-bitwidth != default: parallel vector ALUs (if any)
- placed on "alert", virtual parallelism looping may be activated,
- implementation-specific SIMD complexity kicks in (packed decode before
- ALUs or *IN* ALUs).
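These four cases can be sketched as a simple classifier (illustrative names; 0 taken as the "default" bitwidth encoding):

```python
# Sketch classifying the four decode cases above (illustration only).
def decode_case(vlens, bitwidths):
    any_vec = any(v > 1 for v in vlens)            # any vector operand?
    any_packed = any(b != 0 for b in bitwidths)    # any non-default bitwidth?
    if not any_vec and not any_packed:
        return "scalar passthrough"                # Simple-V in effect disabled
    if any_vec and not any_packed:
        return "vector looping"
    if not any_vec and any_packed:
        return "degenerate SIMD"
    return "vector + SIMD"

assert decode_case([1, 1], [0, 0]) == "scalar passthrough"
assert decode_case([4, 1], [0, 0]) == "vector looping"
assert decode_case([1, 1], [0b010, 0]) == "degenerate SIMD"
assert decode_case([4, 1], [0b010, 0]) == "vector + SIMD"
```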
-
-Bear in mind that the proposal includes that the decision whether
-to parallelise in hardware or whether to virtual-parallelise (to
-dramatically simplify compilers and also not to run into the SIMD
-instruction proliferation nightmare) *or* a transparent combination
-of both, be done on a *per-operand basis*, so that implementors can
-specifically choose to create an application-optimised implementation
-that they believe (or know) will sell extremely well, without having
-"Extra Standards-Mandated Baggage" that would otherwise blow their area
-or power budget completely out the window.
-
-Additionally, two possible CSR schemes have been proposed, in order to
-greatly reduce CSR space:
-
-* per-register CSRs (vector-length and packed-bitwidth)
-* a smaller number of CSRs with the same information but with an *INDEX*
- specifying WHICH register in one of three regfiles (vector, fp, int)
- the length and bitwidth applies to.
-
-(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)
-
-In addition, LOAD/STORE has its own associated proposed CSRs that
-mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
-V (and Hwacha).
-
-Also bear in mind that, for reasons of simplicity for implementors,
-I was coming round to the idea of permitting implementors to choose
-exactly which bitwidths they would like to support in hardware and which
-to allow to fall through to software-trap emulation.
-
-So the question boils down to:
-
-* whether either (or both) of those two CSR schemes have significant
- latency that could even potentially require an extra pipeline decode stage
-* whether there are implementations that can be thought of which do *not*
- introduce significant latency
-* whether it is possible to explicitly (through quite simply
- disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
- all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
- the extreme of skipping an entire pipeline stage (if one is needed)
-* whether packed bitwidth and associated regfile splitting is so complex
- that it should definitely, definitely be made mandatory that implementors
- move regfile splitting into the ALU, and what are the implications of that
-* whether even if that *is* made mandatory, is software-trapped
- "unsupported bitwidths" still desirable, on the basis that SIMD is such
- a complete nightmare that *even* having a software implementation is
- better, making Simple-V have more in common with a software API than
- anything else.
-
-Whilst the above may seem to be severe minuses, there are some strong
-pluses:
-
-* Significant reduction of V's opcode space: over 85%.
-* Smaller reduction of P's opcode space: around 10%.
-* The potential to use Compressed instructions in both Vector and SIMD
- due to the overloading of register meaning (implicit vectorisation,
- implicit packing)
-* Not only present but also future extensions automatically gain parallelism.
-* Already mentioned but worth emphasising: the simplification to compiler
- writers and assembly-level writers of having the same consistent ISA
- regardless of whether the internal level of parallelism (number of
- parallel ALUs) is only equal to one ("virtual" parallelism), or is
- greater than one, should not be underestimated.
-
-## Reducing Register Bank porting