Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments
on page 76: "For virtual memory systems some data accesses could be
resident in physical memory and some not". The interesting question then
arises: how does RVV deal with exactly this scenario?
Expired U.S. Patent 5895501 (filing date Sep 3 1996) describes a method
of detecting page / segmentation faults early and adjusting the TLB
in advance, accordingly: other strategies are explored in the Appendix
section "Virtual Memory Page Faults".

# Exceptions

> What does an ADD of two different-sized vectors do in simple-V?

* if the two source operands are not the same length, throw an exception.
* if the destination operand is also a vector, and the source is longer
  than the destination, throw an exception. (A sketch of these checks
  follows.)
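
A minimal sketch of these checks (Python; names such as `vlen` and
`VectorLengthError` are illustrative assumptions, not part of the spec):

    class VectorLengthError(Exception):
        pass

    # hypothetical decode-time length checks for a two-source operation
    def check_operand_lengths(vlen, rd, rs1, rs2):
        if vlen[rs1] != vlen[rs2]:
            raise VectorLengthError("source operands differ in length")
        if vlen[rd] > 1 and vlen[rs1] > vlen[rd]:
            raise VectorLengthError("source longer than vector destination")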

> And what about instructions like JALR?
> What does jumping to a vector do?

* Throw an exception. Whether that actually results in spawning threads
  as part of the trap-handling remains to be seen.

# Implementing V on top of Simple-V

With Simple-V converting the original RVV draft concept-for-concept
from explicit opcodes to implicit overloading of existing RV Standard
Extensions, certain features were (deliberately) excluded that need
to be added back in for RVV to reach its full potential. This is
made slightly complicated by the fact that RVV itself has two
levels: Base and reserved future functionality.

* Representation Encoding is entirely left out of Simple-V in favour of
  implicitly taking the exact (explicit) meaning from RV Standard Extensions.
* VCLIP and VCLIPI do not have corresponding RV Standard Extension
  opcodes (and are the only such operations).
* Extended Element bitwidths (1 through to 24576 bits) were left out
  of Simple-V as, again, there is no corresponding RV Standard Extension
  that covers anything even below 32-bit operands.
* Polymorphism was entirely left out of Simple-V due to the inherent
  complexity of automatic type-conversion.
* Vector Register files were specifically left out of Simple-V in favour
  of fitting on top of the integer and floating-point files. An
  "RVV re-retro-fit" needs to be able to mark (implicitly-marked)
  registers as actually being in a separate *vector* register file.
* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
  register file size is 5 bits (32 registers), whilst the "Extended"
  variant of RVV specifies 8 bits (256 registers) and has yet to
  be published.
* One big difference: in Sections 17.12 and 17.17 there are only two possible
  predication registers in RVV "Base". Through the "indirect" method,
  Simple-V provides a key-value CSR table that allows (arbitrarily)
  up to 16 (TBD) of either the floating-point or integer registers to
  be marked as "predicated" (key), and if so, which integer register to
  use as the predication mask (value). A sketch of that lookup follows.
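
A minimal sketch of the key-value predication CSR lookup (Python; the
table layout and all names are illustrative assumptions):

    # hypothetical key-value predication CSR table, up to 16 (TBD) entries.
    # key: (regfile, regnum) marked as predicated
    # value: integer register holding the predication mask
    predication_csr = {
        ("int", 3): 9,    # integer r3 is predicated by the mask in r9
        ("fp", 2): 10,    # floating-point f2 is predicated by r10
    }

    def predication_mask_reg(regfile, regnum):
        """Return the integer register holding the mask, or None."""
        return predication_csr.get((regfile, regnum))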

**TODO**

# Implementing P (renamed to DSP) on top of Simple-V

* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
  (caveat: anything not specified drops through to software-emulation / traps)
* TODO

# Appendix

## V-Extension to Simple-V Comparative Analysis

This section has been moved to its own page [[v_comparative_analysis]]

## P-Ext ISA

This section has been moved to its own page [[p_comparative_analysis]]

## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>

This section compares the various parallelism proposals as they stand,
including traditional SIMD, in terms of features, ease of implementation,
complexity, flexibility, and die area.

### [[alt_rvp]]

The primary benefit of Alt-RVP is the simplicity with which parallelism
may be introduced (effective multiplication of regfiles and associated ALUs).

* plus: the simplicity of the lanes (combined with the regularity of
  allocating identical opcodes to multiple independent registers) meaning
  that SRAM or 2R1W can (potentially) be used for the entire regfile.
* minus: a more complex instruction set where the parallelism is much
  more explicitly and directly specified in the instruction and
* minus: if you *don't* have an explicit instruction (opcode) and you
  need one, the only place it can be added is... in the vector unit and
* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
  not usable or accessible in other Extensions.
* plus-and-minus: Lanes may be utilised for high-speed context-switching
  but with the down-side that they're an all-or-nothing part of the Extension.
  No Alt-RVP: no fast register-bank switching.
* plus: Lane-switching would mean that complex operations not suited to
  parallelisation can be carried out, followed by further parallel Lane-based
  work, without moving register contents down to memory (and back).
* minus: Access to registers across multiple lanes is challenging. The
  "solution" is to drop data into memory and immediately back in again
  (like MMX).

### Simple-V

The primary benefit of Simple-V is the OO abstraction of parallel principles
from actual (internal) parallel hardware. It's an API, in effect, that's
designed to be slotted in to an existing implementation (just after
instruction decode) with minimum disruption and effort.

* minus: the complexity of having to use register renames, OoO, VLIW,
  register file cacheing, all of which has been done before but is a
  pain
* plus: transparent re-use of existing opcodes as-is, just indirectly
  saying "this register's now a vector", which
* plus: means that future instructions also get to be inherently
  parallelised because there are no "separate vector opcodes"
* plus: Compressed instructions may also be (indirectly) parallelised
* minus: the indirect nature of Simple-V means that setup (setting
  a CSR register to indicate vector length, a separate one to indicate
  that it is a predicate register, and so on) takes a little more
  time than Alt-RVP or RVV's "direct and within the (longer) instruction"
  approach.
* plus: shared register file meaning that, like Alt-RVP, complex
  operations not suited to parallelisation may be carried out interleaved
  between parallelised instructions *without* requiring data to be dropped
  down to memory and back (into a separate vectorised register engine).
* plus-and-maybe-minus: re-use of the integer and floating-point 32-wide
  register files means that huge parallel workloads would use up considerable
  chunks of the register file. However in the case of RV64 and 32-bit
  operations, that effectively means 64 slots are available for parallel
  operations.
* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
  be added, yet the instruction opcodes remain unchanged (and still appear
  to be parallel). A consistent "API" regardless of actual internal
  parallelism: even an in-order single-issue implementation with a single
  ALU would still appear to have parallel vectorisation.
* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
  hard to say if there would be pluses or minuses (on die area). At worst it
  would be "no worse" than existing register renaming, OoO, VLIW and register
  file cacheing schemes.

### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)

RVV is extremely well-designed and has some amazing features, including
2D reorganisation of memory through LOAD/STORE "strides".

* plus: regular predictable workload means that implementations may
  streamline effects on L1/L2 Cache.
* plus: regular and clear parallel workload also means that lanes
  (similar to Alt-RVP) may be used as an implementation detail,
  using either SRAM or 2R1W registers.
* plus: separate engine with no impact on the rest of an implementation
* minus: separate *complex* engine where no RTL (ALUs, pipeline stages)
  reuse is realistically feasible.
* minus: no ISA abstraction or re-use either: additions to other Extensions
  do not gain parallelism, resulting in prolific duplication of functionality
  inside RVV *and out*.
* minus: when operations require a different approach (scalar operations
  using the standard integer or FP regfile) an entire vector must be
  transferred out to memory, into standard regfiles, then back to memory,
  then back to the vector unit, potentially multiple times.
* minus: will never fit into Compressed instruction space (as-is; it may
  be able to do so if the "indirect" features of Simple-V are partially
  adopted).
* plus-and-slight-minus: extended variants may address up to 256
  vectorised registers (requiring 48/64-bit opcodes to do it).
* minus-and-partial-plus: separate engine plus complexity increases
  implementation time and die area, meaning that adoption is likely only
  to be in high-performance specialist supercomputing (where it will
  be absolutely superb).

### Traditional SIMD

The only really good things about SIMD are how easy it is to implement and
how easy it is to get good performance from it. Unfortunately that makes
it quite seductive...

* plus: really straightforward, the ALU basically does several packed
  operations at once. Parallelism is inherent at the ALU, making the
  addition of SIMD-style parallelism an easy decision that has zero
  significant impact on the rest of any given architectural design and
  layout.
* plus (continuation): SIMD in simple in-order single-issue designs can
  therefore result in superb throughput, easily achieved even with a very
  simple execution model.
* minus: ridiculously complex setup and corner-cases that disproportionately
  increase instruction count on what would otherwise be a "simple loop",
  should the number of elements in an array not happen to exactly match
  the SIMD group width.
* minus: getting data usefully out of registers (if separate regfiles
  are used) means outputting to memory and back.
* minus: quite a lot of supplementary instructions for bit-level manipulation
  are needed in order to efficiently extract (or prepare) SIMD operands.
* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
  dimension and parallelism (width) in another: an at least O(N^2) and quite
  probably O(N^3) ISA proliferation that often results in several thousand
  separate instructions, all requiring separate and distinct corner-case
  algorithms!
* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
  8, 16, 32 or 64-bit reordering is built in to the SIMD instruction.
  For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
  four separate and distinct instructions: one for (r1:low r2:high),
  one for (r1:high r2:low), one for (r1:high r2:high) and one for
  (r1:low r2:low) *per function*.
* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
  between operand and result bit-widths. In combination with high/low
  proliferation the situation is made even worse.
* minor-saving-grace: some implementations *may* have predication masks
  that allow control over individual elements within the SIMD block.

## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>

This section compares the various parallelism proposals as they stand,
*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
the question is asked: "How can each of the proposals effectively implement
(or replace) SIMD, and how effective would they be?"

### [[alt_rvp]]

* Alt-RVP would not actually replace SIMD but would augment it: just as with
  a SIMD architecture where the ALU becomes responsible for the parallelism,
  Alt-RVP ALUs would likewise be so responsible... with *additional*
  (lane-based) parallelism on top.
* Thus at least some of the downsides of SIMD's O(N^3) ISA proliferation are
  avoided, by at least one dimension (architectural upgrades introducing
  128-bit, then 256-bit, then 512-bit variants of the exact same 64-bit
  SIMD block).
* Thus, unfortunately, Alt-RVP would suffer the same inherent proliferation
  of instructions as SIMD, albeit not quite as badly (due to Lanes).
* In the same discussion for Alt-RVP, an additional proposal was made to
  be able to subdivide the bits of each register lane (columns) down into
  arbitrary bit-lengths (RGB 565 for example).
* A recommendation was given instead to make the subdivisions down to 32-bit,
  16-bit or even 8-bit, effectively dividing the register file into
  Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
  "swapping" instructions were then introduced, some of the disadvantages
  of SIMD could be mitigated.

### RVV

* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
  parallelism.
* However whilst SIMD is usually designed for single-issue in-order simple
  DSPs with a focus on Multimedia (Audio, Video and Image processing),
  RVV's primary focus appears to be on Supercomputing: optimisation of
  mathematical operations that fit into the OpenCL space.
* Adding functions (operations) that would normally fit (in parallel)
  into a SIMD instruction requires an equivalent to be added to the
  RVV Extension, if one does not exist. Given the specialist nature of
  some SIMD instructions (8-bit or 16-bit saturated or halving add),
  this possibility seems extremely unlikely to occur, even if the
  implementation overhead of RVV were acceptable (compared to
  normal SIMD/DSP-style single-issue in-order simplicity).

### Simple-V

* Simple-V borrows hugely from RVV in that it is intended to be easy to
  topologically transplant every single instruction from RVV (as
  designed) into Simple-V equivalents, with *zero loss of functionality
  or capability*.
* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
  Extension which contained the basic primitives (non-parallelised
  8, 16 or 32-bit SIMD operations) would inherently *become* parallel,
  automatically.
* Additionally, standard operations (ADD, MUL) that would normally have
  to have special SIMD-parallel opcodes added need no longer have *any*
  of the length-dependent variants (2 of 32-bit ADDs in a 64-bit register,
  4 of 32-bit ADDs in a 128-bit register) because Simple-V takes the
  *standard* RV opcodes (present and future) and automatically parallelises
  them.
* By inheriting the RVV feature of arbitrary vector-length, then just as
  with RVV the corner-cases and ISA proliferation of SIMD are avoided.
* Whilst not entirely finalised, registers are expected to be
  capable of being subdivided down to an implementor-chosen bitwidth
  in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
  and r1[7..0], or just r1[31..16] r1[15..0]), where implementors can
  choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
  ALUs that perform twin 8-bit operations, as they see fit, or anything
  else including no subdivisions at all.
* Even though implementors have that choice (even to have full 64-bit
  SIMD with RV64), they *must* provide predication that transparently
  switches off appropriate units on the last loop, thus neatly fitting
  underlying SIMD ALU implementations *into* the arbitrary vector-length
  RVV paradigm, keeping the uniform consistent API that is a key strategic
  feature of Simple-V.
* With Simple-V fitting into the standard register files, certain classes
  of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
  can be done by applying *parallelised* bit-manipulation operations
  followed by parallelised *straight* versions of element-to-element
  arithmetic operations, even if the bit-manipulation operations require
  changing the bitwidth of the "vectors" to do so. Predication can
  be utilised to skip high words (or low words) in source or destination.
* In essence, the key downside of SIMD - massive duplication of
  identical functions over time as an architecture evolves from 32-bit
  wide SIMD all the way up to 512-bit - is avoided with Simple-V, through
  vector-style parallelism being dropped on top of 8-bit or 16-bit
  operations, all the while keeping a consistent ISA-level "API" irrespective
  of implementor design choices (or indeed actual implementations).

### Example Instruction translation: <a name="example_translation"></a>

The instruction "ADD r2 r4 r4" (assuming r4 is marked as a vector of
length 3 and r2 as a scalar) would result in three instructions being
generated and placed into the FIFO (a sketch of the expansion follows
the list):

* ADD r2 r4 r4
* ADD r2 r5 r5
* ADD r2 r6 r6
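
A minimal sketch of that decode-stage expansion (Python; all names are
illustrative assumptions, and the loop bound is taken as the longest
operand's vector length):

    # hypothetical expansion of one vectorised instruction into the FIFO
    def expand_to_fifo(opcode, rd, rs1, rs2, vlen, is_vector):
        fifo = []
        id_, i1, i2 = 0, 0, 0
        for _ in range(max(vlen[rd], vlen[rs1], vlen[rs2])):
            fifo.append((opcode, rd + id_, rs1 + i1, rs2 + i2))
            # only vectorised operands advance to the next register
            id_ += is_vector[rd]
            i1 += is_vector[rs1]
            i2 += is_vector[rs2]
        return fifo

    # "ADD r2 r4 r4" with vlen = {2: 1, 4: 3}, is_vector = {2: 0, 4: 1}:
    # [("ADD", 2, 4, 4), ("ADD", 2, 5, 5), ("ADD", 2, 6, 6)]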

## Example of vector / vector, vector / scalar, scalar / scalar => vector add

    register CSRvectorlen[XLEN][4]; # not quite decided yet about this one...
    register CSRpredicate[XLEN][4]; # 2^4 is max vector length
    register CSRreg_is_vectorised[XLEN]; # just for fun support scalars as well
    register x[32][XLEN];

    function op_add(rd, rs1, rs2, predr)
    {
       /* note that this is ADD, not PADD */
       int i, id, irs1, irs2;
       # checks CSRvectorlen[rd] == CSRvectorlen[rs] etc. ignored
       # also destination makes no sense as a scalar but what the hell...
       for (i = 0, id=0, irs1=0, irs2=0; i < CSRvectorlen[rd]; i++)
       {
          if (CSRpredicate[predr][i]) # skip elements masked out
             x[rd+id] <= x[rs1+irs1] + x[rs2+irs2];
          # now increment the indices, but only for vectorised registers
          if (CSRreg_is_vectorised[rd]) # bitfield check rd, scalar/vector?
             id += 1;
          if (CSRreg_is_vectorised[rs1]) # bitfield check rs1, scalar/vector?
             irs1 += 1;
          if (CSRreg_is_vectorised[rs2]) # bitfield check rs2, scalar/vector?
             irs2 += 1;
       }
    }
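
Note the key property of the loop: each register index advances only when
the corresponding register is marked as vectorised, which is how vector /
scalar and scalar / scalar operand combinations fall out of the exact same
code path.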

## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>

One of the goals of this parallelism proposal is to avoid instruction
duplication. However, with the base ISA having been designed explicitly
to *avoid* condition-codes entirely, shoe-horning predication into it
becomes quite challenging.

However what if all branch instructions, if referencing a vectorised
register, were instead given *completely new analogous meanings* that
resulted in a parallel bit-wise predication register being set? This
would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE,
BLT and BGE.

We might imagine that FEQ, FLT and FLE would also need to be converted,
however these are effectively *already* in the precise form needed and
do not need to be converted *at all*! The difference is that FEQ, FLT
and FLE *specifically* write a 1 to an integer register if the condition
holds, and 0 if not. All that needs to be done here is to say: "if
the integer register is tagged with a bit that says it is a predication
register, the **bit** in the integer register is set based on the
current vector index" instead.
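
A minimal sketch of that reinterpretation (Python; the register model and
all names are illustrative assumptions):

    # hypothetical: FEQ writing one predication *bit* per vector element,
    # instead of 0/1 into the whole integer destination register
    def feq(x, rd, frs1_val, frs2_val, element_idx, rd_is_predication):
        result = 1 if frs1_val == frs2_val else 0
        if rd_is_predication:
            # set/clear only the bit at the current vector index
            x[rd] = (x[rd] & ~(1 << element_idx)) | (result << element_idx)
        else:
            x[rd] = result    # standard scalar FEQ behaviour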

There is, in the standard Conditional Branch instruction, more than
adequate space to interpret it in a similar fashion:

[[!table data="""
 31 |30 ..... 25 |24 ... 20 | 19 ... 15 | 14 ...... 12 | 11 ....... 8 | 7 | 6 ....... 0 |
imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
 offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH |
"""]]

This would become:

[[!table data="""
31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH |
"""]]

Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted,
with the interesting side-effect that there is space within what is
presently the "immediate offset" field to reinterpret it: not only to add
a bit field distinguishing floating-point compare from integer compare,
and a second source register, but also to use some of the bits as a
predication target as well.

[[!table data="""
15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 ................. 2 | 1 .. 0 |
 funct3 | imm | rs10 | imm | op |
 3 | 3 | 3 | 5 | 2 |
 C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 |
"""]]

This would now use the CS format:

[[!table data="""
15 ...... 13 | 12 ........... 10 | 9..... 7 | 6 .. 5 | 4......... 2 | 1 .. 0 |
 funct3 | imm | rs10 | imm | | op |
 3 | 3 | 3 | 2 | 3 | 2 |
 C.BEQZ | predicate rs3 | src1 | I/F B | src2 | C1 |
"""]]

Bit 6 would be decoded as "operation refers to Integer or Float", including
interpreting src1 and src2 accordingly as outlined in Table 12.2 of the
"C" Standard, version 2.0, whilst Bit 5 would allow the operation to be
extended, in combination with funct3 = 110 or 111: a combination of four
distinct (predicated) comparison operators. In both the floating-point and
integer cases those could be EQ/NEQ/LT/LE (with GT and GE being synthesised
by inverting src1 and src2).
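
A minimal sketch of that decode (Python; the exact assignment of funct3 /
Bit 5 combinations to the four operators is an assumption for illustration):

    # hypothetical decode of the retro-fitted compressed compare
    def decode_predicated_compare(funct3, bit6, bit5):
        is_float = bool(bit6)    # Bit 6: Integer vs Float operands
        # funct3 (110 or 111) combined with Bit 5: four comparison ops
        ops = {(0b110, 0): "EQ", (0b110, 1): "NEQ",
               (0b111, 0): "LT", (0b111, 1): "LE"}
        return is_float, ops[(funct3, bit5)]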

## Register reordering <a name="register_reordering"></a>

### Register File

| Reg Num | Bits |
| ------- | ---- |
| r0 | (31..0) |
| r1 | (31..0) |
| r2 | (31..0) |
| r3 | (31..0) |
| r4 | (31..0) |
| r5 | (31..0) |
| r6 | (31..0) |
| r7 | (31..0) |
| .. | (31..0) |
| r31| (31..0) |

### Vectorised CSR

May not be an actual CSR: it may be generated from the Vector Length CSR
(a single bit is less burdensome on the instruction decode phase).

| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| - | - | - | - | - | - | - | - |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
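
In this example the single bit for register n may simply be derived as
(CSRvectorlen[n] > 1): only r0 (length 2) and r4 (length 3) have their
bits set.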

### Vector Length CSR

| Reg Num | Value (3..0) |
| ------- | ------------ |
| r0 | 2 |
| r1 | 0 |
| r2 | 1 |
| r3 | 1 |
| r4 | 3 |
| r5 | 0 |
| r6 | 0 |
| r7 | 1 |

### Virtual Register Reordering

This example assumes the above Vector Length CSR table.

| Reg Num | Bits (0) | Bits (1) | Bits (2) |
| ------- | -------- | -------- | -------- |
| r0 | (31..0) | (31..0) | |
| r2 | (31..0) | | |
| r3 | (31..0) | | |
| r4 | (31..0) | (31..0) | (31..0) |
| r7 | (31..0) | | |

### Bitwidth Virtual Register Reordering

This example goes a little further and illustrates the effect of setting
a bitwidth CSR on a register. Preconditions:

* RV32 assumed
* CSRintbitwidth[2] = 010 # integer r2 is 16-bit
* CSRintvlength[2] = 3 # integer r2 is a vector of length 3
* vsetl rs1, 5 # set the vector length to 5

This is interpreted as follows:

* Given that the context is RV32, ELEN=32.
* With ELEN=32 and bitwidth=16, the number of SIMD elements per register is 2.
* Therefore the actual vector length is up to *six* elements (3 registers
  of 2 elements each).
* However vsetl sets a length of 5, therefore the last "element" is skipped.

So when using an operation that uses r2 as a source (or destination)
the operation is carried out as follows:

* 16-bit operation on r2(15..0) - vector element index 0
* 16-bit operation on r2(31..16) - vector element index 1
* 16-bit operation on r3(15..0) - vector element index 2
* 16-bit operation on r3(31..16) - vector element index 3
* 16-bit operation on r4(15..0) - vector element index 4
* 16-bit operation on r4(31..16) **NOT** carried out due to the length being 5

Predication has been left out of the above example for simplicity; however
predication is ANDed with the latter stages (vsetl not equal to maximum
capacity).

Note also that it is entirely an implementor's choice as to whether to have
actual separate ALUs down to the minimum bitwidth, or whether to have something
more akin to traditional SIMD (at any level of subdivision: 8-bit SIMD
operations carried out 32-bits at a time is perfectly acceptable, as is
8-bit SIMD operations carried out 16-bits at a time requiring two ALUs).
Regardless of the internal parallelism choice, *predication must
still be respected*, making Simple-V in effect the "consistent public API".

vew may be one of the following (giving a table "bytestable", used below):

| vew | bitwidth |
| --- | -------- |
| 000 | default |
| 001 | 8 |
| 010 | 16 |
| 011 | 32 |
| 100 | 64 |
| 101 | 128 |
| 110 | rsvd |
| 111 | rsvd |

Pseudocode for vector length, taking the CSR SIMD-bitwidth into account:

    vew = CSRbitwidth[rs1]
    if vew == 0:
        bytesperreg = (XLEN/8)        # or FLEN as appropriate
    else:
        bytesperreg = bytestable[vew] # 1 2 4 8 16
    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
    vlen = CSRvectorlen[rs1] * simdmult
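
Plugging in the earlier example (RV32, integer r2 with vew = 010 i.e.
16-bit elements, and CSRvectorlen[2] = 3): bytesperreg = 2, simdmult =
(32/8)/2 = 2, so vlen = 3 * 2 = 6, the *six* elements noted above.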

To index an element in a register rnum where the vector element index is i:

    function regoffs(rnum, i):
        regidx = floor(i / simdmult)   # integer-div rounded down
        elemidx = i % simdmult         # element within the register
        bitwidth = bytesperreg * 8
        return rnum + regidx,          # actual real register
               elemidx * bitwidth,     # low bit
               elemidx * bitwidth + bitwidth - 1 # high bit
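
For the r2 example above (simdmult = 2, 16-bit elements), regoffs(2, 3)
returns (r3, 16, 31): vector element index 3 is r3(31..16), matching the
list of operations given earlier.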

### Insights

SIMD register file splitting is still to be considered. For RV64, the
benefits of doubling (quadrupling in the case of Half-Precision IEEE754 FP)
the apparent size of the floating-point register file to 64 (128 in the
case of HP) seem pretty clear and worth the complexity.

With 64 virtual 32-bit F.P. registers, and given that 32-bit FP operations
are done on 64-bit registers anyway, it's not so conceptually difficult.
It may even be achieved by *actually* splitting the regfile into 64 virtual
32-bit registers, such that a 64-bit FP scalar operation is dropped into
(r0.H r0.L) tuples. The implementation is therefore hidden through register
renaming.

Implementations intending to introduce VLIW, OoO and parallelism
(even without Simple-V) would then find that the instructions are
generated quicker (or in a more compact fashion that is less heavy
on caches). Interestingly we observe then that Simple-V is about
"consolidation of instruction generation", where actual parallelism
of underlying hardware is an implementor-choice that could just as
equally be applied *without* Simple-V even being implemented.

## Analysis of CSR decoding on latency <a name="csr_decoding_analysis"></a>

It could indeed have been logically deduced (or expected) that there
would be additional decode latency in this proposal: when opcodes are
overloaded to have different meanings, there is guaranteed to be some
state, somewhere, directly related to the registers.

There are several cases (a decode sketch follows the list):

* All operands vector-length=1 (scalars), all operands
  packed-bitwidth="default": instructions are passed through direct as if
  Simple-V did not exist. Simple-V is, in effect, completely disabled.
* At least one operand vector-length > 1, all operands
  packed-bitwidth="default": any parallel vector ALUs placed on "alert",
  virtual parallelism looping may be activated.
* All operands vector-length=1 (scalars), at least one
  operand packed-bitwidth != default: degenerate case of SIMD,
  implementation-specific complexity here (packed decode before ALUs or
  *IN* ALUs)
* At least one operand vector-length > 1, at least one operand
  packed-bitwidth != default: parallel vector ALUs (if any)
  placed on "alert", virtual parallelism looping may be activated,
  implementation-specific SIMD complexity kicks in (packed decode before
  ALUs or *IN* ALUs).
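
A minimal sketch of that case analysis at decode time (Python; all names,
and the "default" encoding of zero, are illustrative assumptions):

    DEFAULT = 0    # assumed encoding of packed-bitwidth="default"

    # hypothetical check: which Simple-V decode path applies?
    def simple_v_case(operands, vlen, bitwidth):
        any_vector = any(vlen[r] > 1 for r in operands)
        any_packed = any(bitwidth[r] != DEFAULT for r in operands)
        if not any_vector and not any_packed:
            return "passthrough"      # Simple-V effectively disabled
        if any_vector and not any_packed:
            return "vector-loop"      # parallel / virtual-parallel looping
        if not any_vector and any_packed:
            return "simd-degenerate"  # packed decode before or in the ALUs
        return "vector-and-simd"      # both mechanisms active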

Bear in mind that the proposal includes that the decision whether
to parallelise in hardware or whether to virtual-parallelise (to
dramatically simplify compilers and also not to run into the SIMD
instruction proliferation nightmare) *or* a transparent combination
of both, be done on a *per-operand basis*, so that implementors can
specifically choose to create an application-optimised implementation
that they believe (or know) will sell extremely well, without having
"Extra Standards-Mandated Baggage" that would otherwise blow their area
or power budget completely out the window.


Additionally, two possible CSR schemes have been proposed, in order to
greatly reduce CSR space:

* per-register CSRs (vector-length and packed-bitwidth)
* a smaller number of CSRs with the same information but with an *INDEX*
  specifying WHICH register in one of three regfiles (vector, fp, int)
  the length and bitwidth applies to.

(See "CSR vector-length and CSR SIMD packed-bitwidth" section for details)

In addition, LOAD/STORE has its own associated proposed CSRs that
mirror the STRIDE (but not yet STRIDE-SEGMENT?) functionality of
V (and Hwacha).

Also bear in mind that, for reasons of simplicity for implementors,
I was coming round to the idea of permitting implementors to choose
exactly which bitwidths they would like to support in hardware and which
to allow to fall through to software-trap emulation.

So the question boils down to:

* whether either (or both) of those two CSR schemes have significant
  latency that could even potentially require an extra pipeline decode stage
* whether there are implementations that can be thought of which do *not*
  introduce significant latency
* whether it is possible to explicitly (through quite simply
  disabling Simple-V-Ext) or implicitly (detect the case all-vlens=1,
  all-simd-bitwidths=default) switch OFF any decoding, perhaps even to
  the extreme of skipping an entire pipeline stage (if one is needed)
* whether packed bitwidth and associated regfile splitting is so complex
  that it should definitely, definitely be made mandatory that implementors
  move regfile splitting into the ALU, and what are the implications of that
* whether even if that *is* made mandatory, is software-trapped
  "unsupported bitwidths" still desirable, on the basis that SIMD is such
  a complete nightmare that *even* having a software implementation is
  better, making Simple-V have more in common with a software API than
  anything else.

Whilst the above may seem to be severe minuses, there are some strong
pluses:

* Significant reduction of V's opcode space: over 85%.
* Smaller reduction of P's opcode space: around 10%.
* The potential to use Compressed instructions in both Vector and SIMD
  due to the overloading of register meaning (implicit vectorisation,
  implicit packing)
* Not only present but also future extensions automatically gain parallelism.
* Already mentioned but worth emphasising: the simplification to compiler
  writers and assembly-level writers of having the same consistent ISA
  regardless of whether the internal level of parallelism (number of
  parallel ALUs) is only equal to one ("virtual" parallelism), or is
  greater than one, should not be underestimated.

## Reducing Register Bank porting

This looks quite reasonable.
<https://www.princeton.edu/~rblee/ELE572Papers/MultiBankRegFile_ISCA2000.pdf>

The main details are outlined on page 4. They propose a 2-level register
cache hierarchy, note that registers are typically only read once, that
you never write back from upper to lower cache level but always go in a
cycle lower -> upper -> ALU -> lower, and at the top of page 5 propose
a scheme where you look ahead by only 2 instructions to determine which
registers to bring into the cache.

The nice thing about a vector architecture is that you *know* that
*even more* registers are going to be pulled in: Hwacha uses this fact
to optimise L1/L2 cache-line usage (avoid thrashing), strangely enough
by *introducing* deliberate latency into the execution phase.

## Overflow registers in combination with predication

**TODO**: propose overflow registers be actually one of the integer regs
(flowing to multiple regs).

**TODO**: propose "mask" (predication) registers likewise. The combination
with standard RV instructions and overflow registers is extremely powerful;
see Aspex ASP.

When integer overflow is stored in an easily-accessible bit (or another
register), parallelisation turns this into a group of bits which can
potentially be interacted with in predication, in interesting and powerful
ways. For example, by taking the integer-overflow result as a predication
field and shifting it by one, a predicated vectorised "add one" can emulate
"carry" on arbitrary (unlimited) length addition.

However despite RVV having made room for floating-point exceptions, neither
RVV nor base RV have taken integer overflow (carry) into account, which
makes proposing it quite challenging given that the relevant (Base) RV
sections are frozen. Consequently it makes sense to forgo this feature.

## Virtual Memory page-faults on LOAD/STORE

### Notes from conversations

> I was going through the C.LOAD / C.STORE section 12.3 of V2.3-Draft
> riscv-isa-manual in order to work out how to re-map RVV onto the standard
> ISA, and came across some interesting comments at the bottom of pages 75
> and 76:

> " A common mechanism used in other ISAs to further reduce save/restore
> code size is load-multiple and store-multiple instructions. "

> Fascinatingly, due to Simple-V proposing to use the *standard* register
> file, both C.LOAD / C.STORE *and* LOAD / STORE would in effect be exactly
> that: load-multiple and store-multiple instructions. Which brings us
> on to this comment:

> "For virtual memory systems, some data accesses could be resident in
> physical memory and
> some could not, which requires a new restart mechanism for partially
> executed instructions."

> Which then of course brings us to the interesting question: how does RVV
> cope with the scenario when, particularly with LD.X (Indexed / indirect
> loads), part-way through the loading a page fault occurs?

> Has this been noted or discussed before?

For applications-class platforms, the RVV exception model is
element-precise (that is, if an exception occurs on element j of a
vector instruction, elements 0..j-1 have completed execution and elements
j+1..vl-1 have not executed).

Certain classes of embedded platforms where exceptions are always fatal
might choose to offer resumable/swappable interrupts but not precise
exceptions.

> Is RVV designed in any way to be re-entrant?

Yes.

> What would the implications be for instructions that were in a FIFO at
> the time, in out-of-order and VLIW implementations, where partial decode
> had taken place?

The usual bag of tricks for maintaining precise exceptions applies to
vector machines as well. Register renaming makes the job easier, and
it's relatively cheaper for vectors, since the control cost is amortized
over longer registers.

> Would it be reasonable at least to say *bypass* (and freeze) the
> instruction FIFO (drop down to a single-issue execution model temporarily)
> for the purposes of executing the instructions in the interrupt (whilst
> setting up the VM page), then re-continue the instruction with all
> state intact?

This approach has been done successfully, but it's desirable to be
able to swap out the vector unit state to support context switches on
exceptions that result in long-latency I/O.

> Or would it be better to switch to an entirely separate secondary
> hyperthread context?

> Does anyone have any ideas or know if there is any academic literature
> on solutions to this problem?

The Vector VAX offered imprecise but restartable and swappable exceptions:
<http://mprc.pku.edu.cn/~liuxianhua/chn/corpus/Notes/articles/isca/1990/VAX%20vector%20architecture.pdf>

Sec. 4.6 of Krste's dissertation assesses some of
the tradeoffs and references a bunch of related work:
<http://people.eecs.berkeley.edu/~krste/thesis.pdf>


----

Started reading section 4.6 of Krste's thesis, noted the "IEEE 754 F.P.
exceptions" and thought, "hmmm that could go into a CSR, must re-read
the section on FP state CSRs in RVV 0.4-Draft again", then i suddenly
thought: "ah ha! what if the memory exceptions were, instead of having
an immediate exception thrown, simply stored in a type of predication
bit-field with a flag 'error: this element failed'?"

Then, *after* the vector load (or store, or even operation) was
performed, you could *then* raise an exception, at which point it
would be possible (yes in software... I know....) to go "hmmm, these
indexed operations didn't work, let's get them into memory by triggering
page-loads", then *re-run the entire instruction* but this time with a
"memory-predication CSR" that stops the already-performed operations
(whether they be loads, stores or an arithmetic / FP operation) from
being carried out a second time.
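
A minimal sketch of that fault-mask idea (Python; the "memory-predication
CSR" behaviour, the dict-based memory model and all names are illustrative
assumptions):

    # hypothetical: indexed vector load that records per-element faults
    # instead of trapping immediately, for a later predicated re-run
    def indexed_load(mem, base, indices, done_mask):
        results = [None] * len(indices)
        fault_mask = [0] * len(indices)
        for i, idx in enumerate(indices):
            if done_mask[i]:          # completed on a previous run: skip
                continue
            if (base + idx) in mem:   # resident in "physical memory"
                results[i] = mem[base + idx]
                done_mask[i] = 1      # memory-predication CSR bit
            else:
                fault_mask[i] = 1     # "error: this element failed"
        return results, fault_mask    # trap afterwards if any(fault_mask)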

This could theoretically end up being done multiple times in an SMP
environment, and for LD.X there would also be the remote (and annoying)
outside possibility that the indexed memory address could end up being
modified.

The advantage would be that the order of execution need not be
sequential, which potentially could have some big advantages.
Am still thinking through the implications as any dependent operations
(particularly ones already decoded and moved into the execution FIFO)
would still be there (and stalled). hmmm.

----

 > > # assume internal parallelism of 8 and MAXVECTORLEN of 8
 > > VSETL r0, 8
 > > FADD x1, x2, x3
 >
 > > x3[0]: ok
 > > x3[1]: exception
 > > x3[2]: ok
 > > ...
 > > ...
 > > x3[7]: ok
 >
 > > what happens to result elements 2-7? those may be *big* results
 > > (RV128)
 > > or in the RVV-Extended may be arbitrary bit-widths far greater.
 >
 > (you replied:)
 >
 > Thrown away.

discussion then led to the question of OoO architectures

> The costs of the imprecise-exception model are greater than the benefit.
> Software doesn't want to cope with it. It's hard to debug. You can't
> migrate state between different microarchitectures--unless you force all
> implementations to support the same imprecise-exception model, which would
> greatly limit implementation flexibility. (Less important, but still
> relevant, is that the imprecise model increases the size of the context
> structure, as the microarchitectural guts have to be spilled to memory.)


## Implementation Paradigms

TODO: assess various implementation paradigms. These are listed roughly
in order of simplicity (minimum compliance, for ultra-light-weight
embedded systems or to reduce design complexity and the burden of
design implementation and compliance, in non-critical areas), right the
way to high-performance systems.

* Full (or partial) software-emulation (via traps): full support for CSRs
  required, however when a register is used that is detected (in hardware)
  to be vectorised, an exception is thrown.
* Single-issue In-order, reduced pipeline depth (traditional SIMD / DSP)
* In-order 5+ stage pipelines with instruction FIFOs and mild register-renaming
* Out-of-order with instruction FIFOs and aggressive register-renaming
* VLIW

Also to be taken into consideration:

* "Virtual" vectorisation: single-issue loop, no internal ALU parallelism
* Comprehensive vectorisation: FIFOs and internal parallelism
* Hybrid Parallelism

# TODO Research