-# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
-
-Key insight: Simple-V is intended as an abstraction layer to provide
-a consistent "API" to parallelisation of existing *and future* operations.
-*Actual* internal hardware-level parallelism is *not* required, such
-that Simple-V may be viewed as providing a "compact" or "consolidated"
-means of issuing multiple near-identical arithmetic instructions to an
-instruction queue (FIFO), pending execution.
-
-*Actual* parallelism, if added independently of Simple-V in the form
-of Out-of-order restructuring (including parallel ALU lanes) or VLIW
-implementations, or SIMD, or anything else, would then benefit from
-the uniformity of a consistent API.
-
-**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequently is suitable for use even with RV32E.
-
-* Talk slides: <http://hands.com/~lkcl/simple_v_chennai_2018.pdf>
-* Specification: now moved to its own page: [[specification]]
-
-[[!toc ]]
-
-# Introduction
-
-This proposal exists so as to be able to satisfy several disparate
-requirements: power-conscious, area-conscious, and performance-conscious
-designs all pull an ISA and its implementation in different conflicting
-directions, as do the specific intended uses for any given implementation.
-
-The existing P (SIMD) proposal and the V (Vector) proposals,
-whilst each extremely powerful in their own right and clearly desirable,
-are also:
-
-* Clearly independent in their origins (AndesStar v3 and Cray respectively)
- so need work to adapt to the RISC-V ethos and paradigm
-* Sufficiently large so as to make adoption (and exploration for
- analysis and review purposes) prohibitively expensive
-* Both contain partial duplication of pre-existing RISC-V instructions
- (an undesirable characteristic)
-* Both have independent, incompatible and disparate methods for introducing
- parallelism at the instruction level
-* Both require that their respective parallelism paradigm be implemented
- along-side and integral to their respective functionality *or not at all*.
-* Both independently have methods for introducing parallelism that
- could, if separated, benefit
- *other areas of RISC-V not just DSP or Floating-point respectively*.
-
-There are also key differences between Vectorisation and SIMD (full
-details outlined in the Appendix), the key points being:
-
-* SIMD has an extremely seductively compelling ease of implementation argument:
- each operation is passed to the ALU, which is where the parallelism
- lies. There is *negligible* (if any) impact on the rest of the core
- (with life instead being made hell for compiler writers and applications
- writers due to extreme ISA proliferation).
-* By contrast, Vectorisation has quite some complexity (for considerable
- flexibility, reduction in opcode proliferation and much more).
-* Vectorisation typically includes much more comprehensive memory load
- and store schemes (unit stride, constant-stride and indexed), which
- in turn have ramifications: virtual memory misses (TLB cache misses)
- and even multiple page-faults... all caused by a *single instruction*,
- yet with a clear benefit that the regularisation of LOAD/STOREs can
- be optimised for minimal impact on caches and maximised throughput.
-* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
- to pages), and these load/stores have absolutely nothing to do with the
- SIMD / ALU engine, no matter how wide the operand. Simplicity but with
- more impact on instruction and data caches.
-
-Overall it makes a huge amount of sense to have a means and method
-of introducing instruction parallelism in a flexible way that provides
-implementors with the option to choose exactly where they wish to offer
-performance improvements and where they wish to optimise for power
-and/or area (and if that can be offered even on a per-operation basis that
-would provide even more flexibility).
-
-Additionally it makes sense to *split out* the parallelism inherent within
-each of P and V, and to see if each of P and V then, in *combination* with
-a "best-of-both" parallelism extension, could be added *on top* of
-this proposal, to topologically provide the exact same functionality of
-each of P and V. Each of P and V then can focus on providing the best
-operations possible for their respective target areas, without being
-hugely concerned about the actual parallelism.
-
-Furthermore, an additional goal of this proposal is to reduce the number
-of opcodes utilised by each of P and V as they currently stand, leveraging
-existing RISC-V opcodes where possible, and also potentially allowing
-P and V to make use of Compressed Instructions as a result.
-
-# Analysis and discussion of Vector vs SIMD
-
-There are six combined areas between the two proposals that help with
-parallelism (increased performance, reduced power / area) without
-over-burdening the ISA with a huge proliferation of
-instructions:
-
-* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
-* Implicit vs fixed instruction bit-width (integral to instruction or not)
-* Implicit vs explicit type-conversion (compounded on bit-width)
-* Implicit vs explicit inner loops.
-* Single-instruction LOAD/STORE.
-* Masks / tagging (selecting/preventing certain indexed elements from execution)
-
-The pros and cons of each are discussed and analysed below.
-
-## Fixed vs variable parallelism length
-
-In David Patterson and Andrew Waterman's analysis of SIMD and Vector
-ISAs, the analysis comes out clearly in favour of (effectively) variable
-length SIMD. As SIMD is a fixed width, typically 4, 8 or in extreme cases
-16 or 32 simultaneous operations, the setup, teardown and corner-cases of SIMD
-are extremely burdensome except for applications whose requirements
-*specifically* match the *precise and exact* depth of the SIMD engine.
-
-Thus, SIMD, no matter what width is chosen, is never going to be acceptable
-for general-purpose computation, and in the context of developing a
-general-purpose ISA, is never going to satisfy 100 percent of implementors.
-
-To explain this further: for increased workloads over time, as the
-performance requirements increase for new target markets, implementors
-choose to extend the SIMD width (so as to again avoid mixing parallelism
-into the instruction issue phases: the primary "simplicity" benefit of
-SIMD in the first place), with the result that the entire opcode space
-effectively doubles with each new SIMD width that's added to the ISA.
-
-That basically leaves "variable-length vector" as the clear *general-purpose*
-winner, at least in terms of greatly simplifying the instruction set,
-reducing the number of instructions required for any given task, and thus
-reducing power consumption for the same.
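-
-A minimal sketch of the corner-case argument above, written in Python
-purely for illustration. The function names and the `max_vl` parameter
-are hypothetical and are not part of any proposal: the point is only that
-fixed-width SIMD needs a separate clean-up loop for the leftover elements,
-whereas a variable-length scheme shortens the final pass instead.
-
```python
def fixed_simd_add(a, b, width=4):
    """Fixed-width SIMD: full-width chunks plus a scalar clean-up loop."""
    n = len(a)
    out = [0] * n
    main = n - (n % width)                   # elements covered by full chunks
    for i in range(0, main, width):          # main body: exactly `width` lanes
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
    for i in range(main, n):                 # corner case: scalar tail
        out[i] = a[i] + b[i]
    return out


def variable_length_add(a, b, max_vl=4):
    """Variable-length vectors: one loop, with the length shrinking on the last pass."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(max_vl, n - i)              # hypothetical "set vector length" step
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out
```
-
-Both functions compute the same result; only the second does so without
-a separate tail path, which is the property the analysis favours.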
-
-## Implicit vs fixed instruction bit-width
-
-SIMD again has a severe disadvantage here, over Vector: huge proliferation
-of specialist instructions that target 8-bit, 16-bit, 32-bit and 64-bit
-operands, and that then need operations *for each and between each*. It gets
-very messy, very quickly: *six* separate dimensions giving an O(N^6)
-instruction proliferation profile.
-
-The V-Extension on the other hand proposes to set the bit-width of
-future instructions on a per-register basis, such that subsequent instructions
-involving that register are *implicitly* of that particular bit-width until
-otherwise changed or reset.
-
-This has some extremely useful properties, without being particularly
-burdensome to implementations, given that instruction decode already has
-to direct the operation to a correctly-sized width ALU engine, anyway.
-
-Not least: in places where an ISA was previously constrained (for
-whatever reason, including limitations of the available operand space),
-implicit bit-width allows the meaning of certain operations to be
-type-overloaded *without* pollution or alteration of frozen and immutable
-instructions, in a fully backwards-compatible fashion.
-
-## Implicit and explicit type-conversion
-
-The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
-deal with over-population of instructions, such that type-casting from
-integer (and floating point) of various sizes is automatically inferred
-due to "type tagging" that is set with a special instruction. A register
-will be *specifically* marked as "16-bit Floating-Point" and, if added
-to an operand that is specifically tagged as "32-bit Integer" an implicit
-type-conversion will take place *without* requiring that type-conversion
-to be explicitly done with its own separate instruction.
-
-However, implicit type-conversion is not only quite burdensome to
-implement (explosion of inferred type-to-type conversion) but also is
-never really going to be complete. It gets even worse when bit-widths
-also have to be taken into consideration. Each new type enlarges an
-O(N^2) conversion space and, as anyone who has examined Python's source
-code (which has built-in polymorphic type-conversion) knows, the task is
-more complex than it first seems.
-
-Overall, type-conversion is generally best left to explicit
-type-conversion instructions, or in definite specific use-cases made
-part of an actual instruction (DSP or FP).
-
-## Zero-overhead loops vs explicit loops
-
-The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
-contains an extremely interesting feature: zero-overhead loops. This
-proposal would basically allow an inner loop of instructions to be
-repeated a fixed number of times without per-iteration branch overhead.
-
-Its specific advantage over explicit loops is that the pipeline in a DSP
-can potentially be kept completely full *even in an in-order single-issue
-implementation*. Normally, it requires a superscalar architecture and
-out-of-order execution capabilities to "pre-process" instructions in
-order to keep ALU pipelines 100% occupied.
-
-By bringing that capability in, this proposal could offer a way to increase
-pipeline activity even in simpler implementations in the one key area
-which really matters: the inner loop.
-
-However, when looking at much more comprehensive schemes such as
-"A portable specification of zero-overhead loop control hardware
-applied to embedded processors" (ZOLC), optimising only the single
-inner loop seems inadequate, which tends to suggest that ZOLC may be
-better off being proposed as an entirely separate Extension.
-
-## Single-instruction LOAD/STORE
-
-In traditional Vector Architectures there are instructions which
-result in multiple register-memory transfer operations resulting
-from a single instruction. They're complicated to implement in hardware,
-yet the benefits are a huge consistent regularisation of memory accesses
-that can be highly optimised with respect to both actual memory and any
-L1, L2 or other caches. Hwacha (EECS-2015-263) makes explicitly
-clear the consequences of getting this architecturally wrong:
-L2 cache-thrashing at the very least.
-
-Complications arise when Virtual Memory is involved: TLB cache misses
-need to be dealt with, as do page faults. Some of the tradeoffs are
-discussed in <http://people.eecs.berkeley.edu/~krste/thesis.pdf>, Section
-4.6, and an article by Jeff Bush, written when he was faced with some of
-these issues, is particularly enlightening:
-<https://jbush001.github.io/2015/11/03/lost-in-translation.html>
-
-Interestingly, none of this complexity is faced in SIMD architectures...
-but then they do not get the opportunity to optimise for highly-streamlined
-memory accesses either.
-
-With the "bang-per-buck" ratio being so high and the indirect improvement
-in L1 Instruction Cache usage (reduced instruction count), as well as
-the opportunity to optimise L1 and L2 cache usage, the case for including
-Vector LOAD/STORE is compelling.
-
-## Mask and Tagging (Predication)
-
-Tagging (aka Masks aka Predication) is a pseudo-method of implementing
-simplistic branching in a parallel fashion, by allowing execution on
-elements of a vector to be switched on or off depending on the results
-of prior operations in the same array position.
-
-The reason for considering this is simple: by *definition* it
-is not possible to perform individual parallel branches in a SIMD
-(Single-Instruction, **Multiple**-Data) context. Branches (modification
-of the Program Counter) will result in *all* parallel data having
-a different instruction executed on it: that's just the definition of
-SIMD, and it is simply unavoidable.
-
-So these are the ways in which conditional execution may be implemented:
-
-* explicit compare and branch: BNE x, y -> offs would jump offs
- instructions if x was not equal to y
-* explicit store of tag condition: CMP x, y -> tagbit
-* implicit (condition-code) such as ADD results in a carry, carry bit
- implicitly (or sometimes explicitly) goes into a "tag" (mask) register
-
-The first of these is a "normal" branch method, which is flat-out impossible
-to parallelise without look-ahead and effectively rewriting instructions.
-This would defeat the purpose of RISC.
-
-The latter two are where parallelism becomes easy to do without complexity:
-every operation is modified to be "conditionally executed" (in an explicit
-way directly in the instruction format *or* implicitly).
-
-RVV (Vector-Extension) proposes to have *explicit* storing of the compare
-in a tag/mask register, and to *explicitly* have every vector operation
-*require* that its operation be "predicated" on the bits within an
-explicitly-named tag/mask register.
-
-SIMD (P-Extension) has not yet published precise documentation on what its
-schema is to be: there is however verbal indication at the time of writing
-that:
-
-> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
-> be executed using the same compare ALU logic for the base ISA with some
-> minor modifications to handle smaller data types. The function will not
-> be duplicated.
-
-This is an *implicit* form of predication as the base RV ISA does not have
-condition-codes or predication. By adding a CSR it becomes possible
-to also tag certain registers as "predicated if referenced as a destination".
-Example:
-
- // in future operations from now on, if r0 is the destination use r5 as
- // the PREDICATION register
- SET_IMPLICIT_CSRPREDICATE r0, r5
- // store the compares in r5 as the PREDICATION register
- CMPEQ8 r5, r1, r2
- // r0 is used here. ah ha! that means it's predicated using r5!
- ADD8 r0, r1, r3
-
-With enough registers (and in RISC-V there are enough registers) some fairly
-complex predication can be set up and yet still execute without significant
-stalling, even in a simple non-superscalar architecture.
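-
-The implicit-predication example above can be sketched as follows. This is
-an illustrative Python model only: `pred_table` stands in for the proposed
-CSR, the register names are taken from the example, and leaving masked-out
-destination elements unchanged (rather than zeroing them) is an assumption,
-not something this document specifies.
-
```python
def cmpeq(regs, rd, rs1, rs2):
    """Element-wise compare: bit i of regs[rd] is set when element i matches."""
    mask = 0
    for i, (x, y) in enumerate(zip(regs[rs1], regs[rs2])):
        if x == y:
            mask |= 1 << i
    regs[rd] = mask


def predicated_add(regs, pred_table, rd, rs1, rs2):
    """Add element-wise, skipping elements whose predicate bit is clear."""
    pred_reg = pred_table.get(rd)            # None means "rd is not predicated"
    mask = regs[pred_reg] if pred_reg is not None else ~0
    for i, (x, y) in enumerate(zip(regs[rs1], regs[rs2])):
        if (mask >> i) & 1:
            regs[rd][i] = x + y              # assumption: other elements untouched


regs = {
    "r0": [0, 0, 0, 0],
    "r1": [1, 2, 3, 4],
    "r2": [1, 9, 3, 9],
    "r3": [10, 20, 30, 40],
    "r5": 0,
}
pred_table = {"r0": "r5"}                    # SET_IMPLICIT_CSRPREDICATE r0, r5
cmpeq(regs, "r5", "r1", "r2")                # CMPEQ8 r5, r1, r2
predicated_add(regs, pred_table, "r0", "r1", "r3")   # ADD8 r0, r1, r3
```
-
-Only elements 0 and 2 compare equal, so only those two destination elements
-are written by the add.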
-
-(For details on how Branch Instructions would be retro-fitted to indirectly
-predicated equivalents, see Appendix.)
-
-## Conclusions
-
-In the above sections the six different ways in which parallel instruction
-execution has closely and loosely inter-related implications for the ISA and
-for implementors were outlined. The pluses and minuses came out as
-follows:
-
-* Fixed vs variable parallelism: <b>variable</b>
-* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
-* Implicit vs explicit type-conversion: <b>explicit</b>
-* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
-* Single-instruction Vector LOAD/STORE: <b>Complex but highly beneficial</b>
-* Tag or no-tag: <b>Complex but highly beneficial</b>
-
-In particular:
-
-* Variable-length vectors came out on top because of the high setup and
- teardown costs and the corner-cases associated with the fixed width of SIMD.
-* Implicit bit-width helps to extend the ISA to escape from
- former limitations and restrictions (in a backwards-compatible fashion),
- whilst also leaving implementors free to simplify implementations
- by using actual explicit internal parallelism.
-* Implicit (zero-overhead) loops provide a means to keep pipelines
- potentially 100% occupied in a single-issue in-order implementation
- i.e. *without* requiring a super-scalar or out-of-order architecture,
- but doing a proper, full job (ZOLC) is an entirely different matter.
-
-Constructing a SIMD/Simple-Vector proposal based around four of these six
-requirements would therefore seem to be a logical thing to do.
-
-# Note on implementation of parallelism
-
-One extremely important aspect of this proposal is to respect and support
-implementors' desire to focus on power, area or performance. In that regard,
-it is proposed that implementors be free to choose whether to implement
-the Vector (or variable-width SIMD) parallelism as sequential operations
-with a single ALU, fully parallel (if practical) with multiple ALUs, or
-a hybrid combination of both.
-
-In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
-Parallelism". They achieve a 16-way SIMD at an **instruction** level
-by providing a combination of a 4-way parallel ALU *and* an externally
-transparent loop that feeds 4 sequential sets of data into each of the
-4 ALUs.
-
-Also in the same core, it is worth noting that particularly uncommon
-but essential operations (Reciprocal-Square-Root for example) are
-*not* part of the 4-way parallel ALU but instead have a *single* ALU.
-Under the proposed Vector (variable-width SIMD) implementors would
-be free to do precisely that: i.e. free to choose *on a per operation
-basis* whether and how much "Virtual Parallelism" to deploy.
-
-It is absolutely critical to note that it is proposed that such choices MUST
-be **entirely transparent** to the end-user and the compiler. Whilst
-a Vector (variable-width SIMD) may not precisely match the width of the
-parallelism within the implementation, the end-user **should not care**
-and in this way the performance benefits are gained but the ISA remains
-straightforward. All that happens at the end of an instruction run is: some
-parallel units (if there are any) would remain offline, completely
-transparently to the ISA, the program, and the compiler.
-
-To make that clear: should an implementor choose a particularly wide
-SIMD-style ALU, each parallel unit *must* have predication so that
-the parallel SIMD ALU may emulate variable-length parallel operations.
-Thus the "SIMD considered harmful" trap of having huge complexity and extra
-instructions to deal with corner-cases is avoided, and implementors
-get to choose precisely where to focus and target the benefits of their
-implementation efforts, without "extra baggage".
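-
-As a rough illustration of the paragraph above, the following Python sketch
-models a wide SIMD-style ALU whose lanes are individually enabled, so that
-any vector length can be handled without extra instructions. The names
-`tail_mask` and `wide_alu_add` and the 8-lane default are hypothetical.
-
```python
def tail_mask(remaining, lanes):
    """Per-lane enable bits for one pass when `remaining` elements are left."""
    active = min(remaining, lanes)
    return (1 << active) - 1


def wide_alu_add(a, b, lanes=8):
    """Emulate a variable-length add on a fixed `lanes`-wide predicated ALU."""
    vl = len(a)
    out = [0] * vl
    for start in range(0, vl, lanes):        # one pass per group of lanes
        mask = tail_mask(vl - start, lanes)
        for lane in range(lanes):            # lanes would run in parallel in hardware
            if (mask >> lane) & 1:           # disabled lanes simply stay idle
                out[start + lane] = a[start + lane] + b[start + lane]
    return out
```
-
-With 11 elements and 8 lanes, the second pass enables only three lanes; the
-program itself never sees the difference.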
-
-In addition, implementors will be free to choose whether to provide an
-absolute bare minimum level of compliance with the "API" (software-traps
-when vectorisation is detected), all the way up to full supercomputing
-level all-hardware parallelism. Options are covered in the Appendix.
-
-
-### FMV, FNEG and FABS Instructions
-
-These are identical in form to C.MV, except covering floating-point
-register copying. The same double-predication rules also apply.
-However when elwidth is not set to default the instruction is implicitly
-and automatically converted to a (vectorised) floating-point type conversion
-operation of the appropriate size covering the source and destination
-register bitwidths.
-
-(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
-
-### FCVT Instructions
-
-These are again identical in form to C.MV, except that they cover
-floating-point to integer and integer to floating-point. When element
-width in each vector is set to default, the instructions behave exactly
-as they are defined for standard RV (scalar) operations, except vectorised
-in exactly the same fashion as outlined in C.MV.
-
-However when the source or destination element width is not set to default,
-the opcode's explicit element widths are *over-ridden* to new definitions,
-and the opcode's element width is taken as indicative of the SIMD width
-(if applicable i.e. if packed SIMD is requested) instead.
-
-For example FCVT.S.L would normally be used to convert a 64-bit
-integer in register rs1 to a 32-bit (single-precision) floating-point
-number in rd.
-If however the source rs1 is set to be a vector, where elwidth is set to
-default/2 and "packed SIMD" is enabled, then the first 32 bits of
-rs1 are converted to a floating-point number to be stored in rd's
-first element and the higher 32-bits *also* converted to floating-point
-and stored in the second. The 32 bit size comes from the fact that
-FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
-divide that by two it means that rs1 element width is to be taken as 32.
-
-Similar rules apply to the destination register.
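-
-The FCVT.S.L example can be modelled roughly as below. This Python sketch
-assumes that the lowest bits of rs1 form the first element and that elements
-are treated as signed (as FCVT.S.L is a signed conversion); both function
-names are hypothetical.
-
```python
import struct


def split_elements(reg_value, reg_bits=64, divisor=2):
    """Split a register into signed elements of reg_bits/divisor bits, low bits first."""
    width = reg_bits // divisor
    elems = []
    for i in range(divisor):
        raw = (reg_value >> (i * width)) & ((1 << width) - 1)
        if raw >= 1 << (width - 1):          # sign-extend the element
            raw -= 1 << width
        elems.append(raw)
    return elems


def packed_int_to_float32(reg_value):
    """Convert each 32-bit integer element to a single-precision value."""
    return [
        struct.unpack("<f", struct.pack("<f", float(e)))[0]   # round to float32
        for e in split_elements(reg_value)
    ]


rs1 = (0xFFFFFFFE << 32) | 3                 # elements: 3 (low), -2 (high)
rd = packed_int_to_float32(rs1)
```
-
-Here the element width of 32 falls out of the opcode's 64-bit integer width
-divided by two, as described above.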
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
-
-# Under consideration <a name="issues"></a>
-
-From the Chennai 2018 slides the following issues were raised.
-Efforts to analyse and answer these questions are below.
-
-* Should future extra bank be included now?
-* How many Register and Predication CSRs should there be?
- (and how many in RV32E)
-* How many in M-Mode (for doing context-switch)?
-* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-* Can CLIP be done as a CSR (mode, like elwidth)
-* SIMD saturation (etc.) also set as a mode?
-* Include src1/src2 predication on Comparison Ops?
- (same arrangement as C.MV, with same flexibility/power)
-* 8/16-bit ops is it worthwhile adding a "start offset"?
- (a bit like misaligned addressing... for registers)
- or just use predication to skip start?
-
-## Future (extra) bank be included (made mandatory)
-
-Expanding the *standard* register file from 32 entries per bank to
-64 per bank is quite an extensive architectural change. It also has
-implications for context-switching.
-
-Therefore, on balance, it is not recommended and certainly should
-not be made a *mandatory* requirement for the use of SV. SV's design
-ethos is to be minimally-disruptive for implementors to shoe-horn
-into an existing design.
-
-## How large should the Register and Predication CSR key-value stores be?
-
-This is something that definitely needs actual evaluation and for
-code to be run and the results analysed. At the time of writing
-(12jul2018) it is too early to tell. An approximate best-guess
-however would be 16 entries.
-
-RV32E however is a special case, given that it is highly unlikely
-(but not outside the realm of possibility) that it would be used
-for performance reasons but instead for reducing instruction count.
-The number of CSR entries therefore has to be considered extremely
-carefully.
-
-## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-
-The minimum required CSR entries would be 1 for each register-bank:
-one for integer and one for floating-point. However, as shown
-in the "Context Switch Example" section, for optimal efficiency
-(minimal instructions in a low-latency situation) the CSRs for
-the context-switch should be set up *and left alone*.
-
-This means that it is not really a good idea to touch the CSRs
-used for context-switching in the M-Mode (or S-Mode) trap, so
-if there is ever demonstrated a need for vectors then there would
-need to be *at least* one more free. However just one does not make
-much sense (as one only covers scalar-vector ops) so it is more
-likely that at least two extra would be needed.
-
-This would be *in addition* to the above, in the RV32E case, if an RV32E
-implementation happens also to support U/S/M modes. This would be considered
-quite rare but not outside of the realm of possibility.
-
-Conclusion: all needs careful analysis and future work.
-
-## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-
-On balance it's a neat idea, but one where the benefits are not really
-clear. It would obviate the need for an exception to be raised if the
-VL runs out of registers to put things in (gets to x31, tries a
-non-existent x32 and fails). The "fly in the ointment", however, is that
-x0 is hard-coded to "zero", so the increment would need to be
-double-stepped to skip over x0. Some microarchitectures could run into
-difficulties (SIMD-like ones in particular) so it needs a lot more thought.
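-
-A small sketch of what such wrapping might look like, in Python and under
-the assumption that the sequence never starts at x0. The function name
-`wrapped_registers` is hypothetical.
-
```python
def wrapped_registers(start, vl, num_regs=32):
    """List `vl` register numbers from `start`, wrapping past x31 and skipping x0."""
    regs = []
    r = start
    for _ in range(vl):
        regs.append(r)
        r += 1
        if r == num_regs:
            r = 1                            # double-step: x0 is hard-wired to zero
    return regs
```
-
-Starting at x30 with a vector length of 4 gives the x30 x31 x1 x2 sequence
-from the question.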
-
-## Can CLIP be done as a CSR (mode, like elwidth)
-
-RVV appears to be going this way. At the time of writing (12jun2018)
-it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
-clip by way of exactly this method: setting a "clip mode" in a CSR.
-
-No details are given; however, the most sensible thing would be
-to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
-extra bits specifying the type of clipping to be carried out, on
-a per-register basis. Other bits may be used for other purposes
-(see SIMD saturation below).
-
-## SIMD saturation (etc.) also set as a mode?
-
-Similar to "CLIP" as an extension to the CSR key-value store, "saturate"
-may also need extra details (what the saturation maximum is for example).
-
-## Include src1/src2 predication on Comparison Ops?
-
-In the C.MV (and other ops - see "C.MV Instruction"), the decision
-was taken, unlike in ADD (etc.) which are 3-operand ops, to use
-*both* the src *and* dest predication masks to give an extremely
-powerful and flexible instruction that covers a huge number of
-"traditional" vector opcodes.
-
-The natural question therefore to ask is: where else could this
-flexibility be deployed? What about comparison operations?
-
-Unfortunately, C.MV is basically "regs[dest] = regs[src]" whilst
-predicated comparison operations are actually a *three* operand
-instruction:
-
- // i is the index of the element being compared
- regs[pred] |= (cmp(regs[src1], regs[src2]) ? 1 : 0) << i
-
-Therefore at first glance it does not make sense to use src1 and src2
-predication masks, as it breaks the rule of 3-operand instructions
-to use the *destination* predication register.
-
-In this case however, the destination *is* a predication register
-as opposed to being a predication mask that is applied *to* the
-(vectorised) operation, element-at-a-time on src1 and src2.
-
-Thus the question is directly inter-related to whether the modification
-of the predication mask should *itself* be predicated.
-
-It is quite complex, in other words, and needs careful consideration.
-
-## 8/16-bit ops is it worthwhile adding a "start offset"?
-
-The idea here is to make it possible, particularly in a "Packed SIMD"
-case, to be able to avoid doing unaligned Load/Store operations
-by specifying that operations, instead of being carried out
-element-for-element, are offset by a fixed amount *even* in 8 and 16-bit
-element Packed SIMD cases.
-
-For example rather than take 2 32-bit registers divided into 4 8-bit
-elements and have them ADDed element-for-element as follows:
-
- r3[0] = add r4[0], r6[0]
- r3[1] = add r4[1], r6[1]
- r3[2] = add r4[2], r6[2]
- r3[3] = add r4[3], r6[3]
-
-an offset of 1 would result in four operations as follows, instead:
-
- r3[0] = add r4[1], r6[0]
- r3[1] = add r4[2], r6[1]
- r3[2] = add r4[3], r6[2]
- r3[3] = add r5[0], r6[3]
-
-In non-packed-SIMD mode there is no benefit at all, as a vector may
-be created using a different CSR that has the offset built-in. So this
-leaves just the packed-SIMD case to consider.
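-
-The offset behaviour shown above can be modelled with the following Python
-sketch. It assumes that the source run continues into the next-numbered
-register (r4 then r5, as in the example) and that 8-bit results wrap rather
-than saturate; `offset_packed_add` is a hypothetical name.
-
```python
def offset_packed_add(regs, rd, rs1, rs2, offset=0, elems_per_reg=4, elwidth=8):
    """rd[i] = element (i + offset) of the run starting at rs1, plus rs2[i]."""
    mask = (1 << elwidth) - 1
    result = []
    for i in range(elems_per_reg):
        j = i + offset                       # position within the rs1, rs1+1, ... run
        src1 = regs[rs1 + j // elems_per_reg][j % elems_per_reg]
        src2 = regs[rs2][i]
        result.append((src1 + src2) & mask)  # assumption: modular 8-bit arithmetic
    regs[rd] = result
    return result


regs = {4: [1, 2, 3, 4], 5: [5, 6, 7, 8], 6: [10, 20, 30, 40]}
aligned = offset_packed_add(regs, 3, 4, 6, offset=0)
shifted = offset_packed_add(regs, 3, 4, 6, offset=1)
```
-
-With an offset of 1 the last destination element takes its first source
-from r5[0], matching the listing above.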
-
-Two ways in which this could be implemented / emulated (without special
-hardware):
-
-* bit-manipulation that shuffles the data along by one byte (or one word)
- either prior to or as part of the operation requiring the offset.
-* just use an unaligned Load/Store sequence, even if there are performance
- penalties for doing so.
-
-The question then is whether the performance hit is worth the extra hardware
-involving byte-shuffling/shifting the data by an arbitrary offset. On
-balance given that there are two reasonable instruction-based options, the
-hardware-offset option should be left out for the initial version of SV,
-with the option to consider it in an "advanced" version of the specification.
-
-# Implementing V on top of Simple-V
-
-With Simple-V converting the original RVV draft concept-for-concept
-from explicit opcodes to implicit overloading of existing RV Standard
-Extensions, certain features were (deliberately) excluded that need
-to be added back in for RVV to reach its full potential. This is
-made slightly complicated by the fact that RVV itself has two
-levels: Base and reserved future functionality.
-
-* Representation Encoding is entirely left out of Simple-V in favour of
- implicitly taking the exact (explicit) meaning from RV Standard Extensions.
-* VCLIP and VCLIPI do not have corresponding RV Standard Extension
- opcodes (and are the only such operations).
-* Extended Element bitwidths (1 through to 24576 bits) were left out
- of Simple-V as, again, there is no corresponding RV Standard Extension
- that covers anything even below 32-bit operands.
-* Polymorphism was entirely left out of Simple-V due to the inherent
- complexity of automatic type-conversion.
-* Vector Register files were specifically left out of Simple-V in favour
- of fitting on top of the integer and floating-point files. An
- "RVV re-retro-fit" needs to be able to mark (implicitly marked)
- registers as being actually in a separate *vector* register file.
-* Fortunately in RVV (Draft 0.4, V2.3-Draft), the "base" vector
- register file size is 5 bits (32 registers), whilst the "Extended"
- variant of RVV specifies 8 bits (256 registers) and has yet to
- be published.
-* One big difference: per Sections 17.12 and 17.17, there are only two possible
- predication registers in RVV "Base". Through the "indirect" method,
- Simple-V provides a key-value CSR table that allows (arbitrarily)
- up to 16 (TBD) of either the floating-point or integer registers to
- be marked as "predicated" (key), and if so, which integer register to
- use as the predication mask (value).
-
-**TODO**
-
-# Implementing P (renamed to DSP) on top of Simple-V
-
-* Implementors indicate chosen bitwidth support in Vector-bitwidth CSR
- (caveat: anything not specified drops through to software-emulation / traps)
-* TODO
-
-# Appendix
-
-## V-Extension to Simple-V Comparative Analysis
-
-This section has been moved to its own page [[v_comparative_analysis]]
-
-## P-Ext ISA
-
-This section has been moved to its own page [[p_comparative_analysis]]
-
-## Comparison of "Traditional" SIMD, Alt-RVP, Simple-V and RVV Proposals <a name="parallelism_comparisons"></a>
-
-This section compares the various parallelism proposals as they stand,
-including traditional SIMD, in terms of features, ease of implementation,
-complexity, flexibility, and die area.
-
-### [[harmonised_rvv_rvp]]
-
-This is an interesting proposal under development to retro-fit the AndesStar
-P-Ext into V-Ext.
-
-### [[alt_rvp]]
-
-Primary benefit of Alt-RVP is the simplicity with which parallelism
-may be introduced (effective multiplication of regfiles and associated ALUs).
-
-* plus: the simplicity of the lanes (combined with the regularity of
- allocating identical opcodes to multiple independent registers) meaning
- that SRAM or 2R1W can be used for entire regfile (potentially).
-* minus: a more complex instruction set where the parallelism is much
- more explicitly directly specified in the instruction.
-* minus: if you *don't* have an explicit instruction (opcode) and you
- need one, the only place it can be added is... in the vector unit.
-* minus: opcode functions (and associated ALUs) duplicated in Alt-RVP are
- not useable or accessible in other Extensions.
-* plus-and-minus: Lanes may be utilised for high-speed context-switching
- but with the down-side that they're an all-or-nothing part of the Extension.
- No Alt-RVP: no fast register-bank switching.
-* plus: Lane-switching would mean that complex operations not suited to
- parallelisation can be carried out, followed by further parallel Lane-based
- work, without moving register contents down to memory (and back)
-* minus: Access to registers across multiple lanes is challenging. "Solution"
- is to drop data into memory and immediately back in again (like MMX).
-
-### Simple-V
-
-Primary benefit of Simple-V is the OO abstraction of parallel principles
-from actual (internal) parallel hardware. It's an API in effect that's
-designed to be slotted in to an existing implementation (just after
-instruction decode) with minimum disruption and effort.
-
-* minus: the complexity (if full parallelism is to be exploited)
- of having to use register renames, OoO, VLIW, register file caching,
- all of which has been done before but is a pain
-* plus: transparent re-use of existing opcodes as-is, just indirectly
- saying "this register's now a vector"
-* plus: this in turn means that future instructions also get to be
- inherently parallelised because there are no "separate vector opcodes"
-* plus: Compressed instructions may also be (indirectly) parallelised
-* minus: the indirect nature of Simple-V means that setup (setting
- a CSR register to indicate vector length, a separate one to indicate
- that it is a predicate register and so on) takes a little more
- time than Alt-RVP or RVV's "direct and within the (longer) instruction"
- approach.
-* plus: shared register file meaning that, like Alt-RVP, complex
- operations not suited to parallelisation may be carried out interleaved
- between parallelised instructions *without* requiring data to be dropped
- down to memory and back (into a separate vectorised register engine).
-* plus-and-maybe-minus: re-use of the 32-entry integer and floating-point
- register files means that huge parallel workloads would use up
- considerable chunks of the register file. However, in the case of RV64
- and 32-bit operations, that effectively means 64 slots are available
- for parallel operations.
-* plus: inherent parallelism (actual parallel ALUs) doesn't actually need to
- be added, yet the instruction opcodes remain unchanged (and still appear
- to be parallel). The "API" is consistent regardless of actual internal
- parallelism: even an in-order single-issue implementation with a single
- ALU would still appear to have parallel vectorisation.
-* hard-to-judge: if actual inherent underlying ALU parallelism is added it's
- hard to say if there would be pluses or minuses (on die area). At worst it
- would be "no worse" than existing register renaming, OoO, VLIW and register
- file caching schemes.
-
-### RVV (as it stands, Draft 0.4 Section 17, RISC-V ISA V2.3-Draft)
-
-RVV is extremely well-designed and has some amazing features, including
-2D reorganisation of memory through LOAD/STORE "strides".
-
-* plus: regular predictable workload means that implementations may
- streamline effects on L1/L2 Cache.
-* plus: regular and clear parallel workload also means that lanes
- (similar to Alt-RVP) may be used as an implementation detail,
- using either SRAM or 2R1W registers.
-* plus: separate engine with no impact on the rest of an implementation
-* minus: separate *complex* engine with no RTL (ALUs, Pipeline stages) reuse
- really feasible.
-* minus: no ISA abstraction or re-use either: additions to other Extensions
- do not gain parallelism, resulting in prolific duplication of functionality
- inside RVV *and out*.
-* minus: when operations require a different approach (scalar operations
- using the standard integer or FP regfile) an entire vector must be
- transferred out to memory, into standard regfiles, then back to memory,
- then back to the vector unit, potentially multiple times.
-* minus: will never fit into Compressed instruction space as-is (it may
- be able to do so if "indirect" features of Simple-V are partially adopted).
-* plus-and-slight-minus: extended variants may address up to 256
- vectorised registers (requires 48/64-bit opcodes to do it).
-* minus-and-partial-plus: separate engine plus complexity increases
- implementation time and die area, meaning that adoption is likely only
- to be in high-performance specialist supercomputing (where it will
- be absolutely superb).
-
-### Traditional SIMD
-
-The only really good things about SIMD are how easy it is to implement and
-how easily it gets good performance. Unfortunately that makes it quite seductive...
-
-* plus: really straightforward, ALU basically does several packed operations
- at once. Parallelism is inherent at the ALU, making the addition of
- SIMD-style parallelism an easy decision that has zero significant impact
- on the rest of any given architectural design and layout.
-* plus (continuation): SIMD in simple in-order single-issue designs can
- therefore result in superb throughput, easily achieved even with a very
- simple execution model.
-* minus: ridiculously complex setup and corner-cases that disproportionately
- increase instruction count on what would otherwise be a "simple loop",
- should the number of elements in an array not happen to exactly match
- the SIMD group width.
-* minus: getting data usefully out of registers (if separate regfiles
- are used) means outputting to memory and back.
-* minus: quite a lot of supplementary instructions for bit-level manipulation
- are needed in order to efficiently extract (or prepare) SIMD operands.
-* minus: MASSIVE proliferation of ISA both in terms of opcodes in one
- dimension and parallelism (width) in another: an at least O(N^2) and quite
- probably O(N^3) ISA proliferation that often results in several thousand
- separate instructions, all requiring separate and distinct corner-case
- algorithms!
-* minus: EVEN BIGGER proliferation of SIMD ISA if the functionality of
- 8, 16, 32 or 64-bit reordering is built-in to the SIMD instruction.
- For example: add (high|low) 16-bits of r1 to (low|high) of r2 requires
- four separate and distinct instructions: one for (r1:low r2:high),
- one for (r1:high r2:low), one for (r1:high r2:high) and one for
- (r1:low r2:low) *per function*.
-* minus: EVEN BIGGER proliferation of SIMD ISA if there is a mismatch
- between operand and result bit-widths. In combination with high/low
- proliferation the situation is made even worse.
-* minor-saving-grace: some implementations *may* have predication masks
- that allow control over individual elements within the SIMD block.
-
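The corner-case cost mentioned above can be sketched by comparing a fixed-width SIMD loop against a vector-length loop. This is an illustrative Python model, not real SIMD code: `SIMD_WIDTH`, `simd_add` and `vl_add` are hypothetical names, and each inner `for` loop stands in for a single packed or vectorised operation.

```python
from typing import List

SIMD_WIDTH = 4  # hypothetical fixed SIMD group width, in elements


def simd_add(a: List[int], b: List[int]) -> List[int]:
    """Fixed-width SIMD style: full groups first, then a separate scalar tail."""
    n = len(a)
    out = [0] * n
    full = n - n % SIMD_WIDTH
    for base in range(0, full, SIMD_WIDTH):   # main packed loop
        for lane in range(SIMD_WIDTH):        # models one packed operation
            out[base + lane] = a[base + lane] + b[base + lane]
    for i in range(full, n):                  # corner-case cleanup loop
        out[i] = a[i] + b[i]
    return out


def vl_add(a: List[int], b: List[int], max_vl: int = 4) -> List[int]:
    """Vector-length style: the final pass simply runs with a shorter VL."""
    n = len(a)
    out = [0] * n
    base = 0
    while base < n:
        vl = min(max_vl, n - base)            # stands in for setting the vector length
        for i in range(vl):                   # models one vectorised operation
            out[base + i] = a[base + i] + b[base + i]
        base += vl
    return out
```

Both functions give the same result for any array length; the difference is that `simd_add` needs a second, separate loop for the leftover elements, whereas `vl_add` has only one loop body.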
-## Comparison *to* Traditional SIMD: Alt-RVP, Simple-V and RVV Proposals <a name="simd_comparison"></a>
-
-This section compares the various parallelism proposals as they stand,
-*against* traditional SIMD as opposed to *alongside* SIMD. In other words,
-the question is asked "How can each of the proposals effectively implement
-(or replace) SIMD, and how effective would they be"?
-
-### [[alt_rvp]]
-
-* Alt-RVP would not actually replace SIMD but would augment it: just as with
- a SIMD architecture where the ALU becomes responsible for the parallelism,
- Alt-RVP ALUs would likewise be so responsible... with *additional*
- (lane-based) parallelism on top.
-* Thus at least one dimension of the O(N^5) SIMD ISA proliferation is
- avoided (architectural upgrades introducing 128-bit then 256-bit then
- 512-bit variants of the exact same 64-bit SIMD block)
-* Unfortunately, Alt-RVP would otherwise suffer the same inherent
- proliferation of instructions as SIMD, albeit not quite as badly (due to Lanes).
-* In the same discussion for Alt-RVP, an additional proposal was made to
- be able to subdivide the bits of each register lane (columns) down into
- arbitrary bit-lengths (RGB 565 for example).
-* A recommendation was given instead to make the subdivisions down to 32-bit,
- 16-bit or even 8-bit, effectively dividing the registerfile into
- Lane0(H), Lane0(L), Lane1(H) ... LaneN(L) or further. If inter-lane
- "swapping" instructions were then introduced, some of the disadvantages
- of SIMD could be mitigated.
-
-### RVV
-
-* RVV is designed to replace SIMD with a better paradigm: arbitrary-length
- parallelism.
-* However whilst SIMD is usually designed for single-issue in-order simple
- DSPs with a focus on Multimedia (Audio, Video and Image processing),
- RVV's primary focus appears to be on Supercomputing: optimisation of
- mathematical operations that fit into the OpenCL space.
-* Adding functions (operations) that would normally fit (in parallel)
- into a SIMD instruction requires an equivalent to be added to the
- RVV Extension, if one does not exist. Given the specialist nature of
- some SIMD instructions (8-bit or 16-bit saturated or halving add),
- this possibility seems extremely unlikely to occur, even if the
- implementation overhead of RVV were acceptable (compared to
- normal SIMD/DSP-style single-issue in-order simplicity).
-
-### Simple-V
-
-* Simple-V borrows hugely from RVV as it is intended to be easy to
- topologically transplant every single instruction from RVV (as
- designed) into Simple-V equivalents, with *zero loss of functionality
- or capability*.
-* With the "parallelism" abstracted out, a hypothetical SIMD-less "DSP"
- Extension which contained the basic primitives (non-parallelised
- 8, 16 or 32-bit SIMD operations) inherently *becomes* parallel,
- automatically.
-* Additionally, standard operations (ADD, MUL) that would normally have
- to have special SIMD-parallel opcodes added need no longer have *any*
- of the length-dependent variants (two 32-bit ADDs in a 64-bit register,
- four 32-bit ADDs in a 128-bit register) because Simple-V takes the
- *standard* RV opcodes (present and future) and automatically parallelises
- them.
-* By inheriting the RVV feature of arbitrary vector-length, then just as
- with RVV the corner-cases and ISA proliferation of SIMD are avoided.
-* Whilst not entirely finalised, registers are expected to be
- capable of being subdivided down to an implementor-chosen bitwidth
- in the underlying hardware (r1 becomes r1[31..24] r1[23..16] r1[15..8]
- and r1[7..0], or just r1[31..16] r1[15..0]) where implementors can
- choose to have separate independent 8-bit ALUs or dual-SIMD 16-bit
- ALUs that perform twin 8-bit operations as they see fit, or anything
- else including no subdivisions at all.
-* Even though implementors have that choice, up to and including full 64-bit
- (with RV64) SIMD, they *must* provide predication that transparently
- switches off appropriate units on the last loop, thus neatly fitting
- underlying SIMD ALU implementations *into* the arbitrary vector-length
- RVV paradigm, keeping the uniform consistent API that is a key strategic
- feature of Simple-V.
-* With Simple-V fitting into the standard register files, certain classes
- of SIMD operations such as High/Low arithmetic (r1[31..16] + r2[15..0])
- can be done by applying *Parallelised* Bit-manipulation operations
- followed by parallelised *straight* versions of element-to-element
- arithmetic operations, even if the bit-manipulation operations require
- changing the bitwidth of the "vectors" to do so. Predication can
- be utilised to skip high words (or low words) in source or destination.
-* In essence, the key downside of SIMD (massive duplication of
- identical functions over time as an architecture evolves from 32-bit
- wide SIMD all the way up to 512-bit) is avoided with Simple-V, through
- vector-style parallelism being dropped on top of 8-bit or 16-bit
- operations, all the while keeping a consistent ISA-level "API" irrespective
- of implementor design choices (or indeed actual implementations).
-
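To make the subdivision and last-loop predication ideas concrete, here is a small Python model of a 32-bit register treated as four 8-bit elements. The names `packed_add8` and `tail_predicate` are hypothetical, and leaving masked-off elements unchanged in the destination is an assumption of this sketch (the bitwidth scheme is, as noted above, not finalised).

```python
def packed_add8(rd: int, rs1: int, rs2: int, pred: int) -> int:
    """Add four 8-bit elements packed in 32-bit values, under a predicate.

    Bit i of `pred` enables element i (register bits [8*i+7 .. 8*i]).
    Disabled elements keep the old destination value (assumed behaviour).
    """
    result = rd
    for i in range(4):
        if pred & (1 << i):
            shift = 8 * i
            a = (rs1 >> shift) & 0xFF
            b = (rs2 >> shift) & 0xFF
            result &= ~(0xFF << shift) & 0xFFFFFFFF   # clear this element
            result |= ((a + b) & 0xFF) << shift       # wraps, no carry into neighbour
    return result


def tail_predicate(remaining: int, lanes: int = 4) -> int:
    """Mask enabling only the first `remaining` elements (all of them if more remain)."""
    return (1 << min(remaining, lanes)) - 1
```

On the final pass of a loop with only two elements left, `tail_predicate(2)` yields `0b11`, so the upper two 8-bit elements of the destination are left untouched even if the underlying ALU is a full-width SIMD unit.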
-### Example Instruction translation: <a name="example_translation"></a>
-
-The instruction "ADD r7 r4 r4" would result in three instructions being
-generated and placed into the FIFO when VL is 3 and r7 and r4 are marked
-as "vectorised":
-
-* ADD r7 r4 r4
-* ADD r8 r5 r5
-* ADD r9 r6 r6
-
-The instruction "ADD r7 r4 r1" would likewise result in three instructions
-being generated and placed into the FIFO. r7 and r1 are marked as
-"vectorised" whilst r4 is not:
-
-* ADD r7 r4 r1
-* ADD r8 r4 r2
-* ADD r9 r4 r3
-
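As a rough illustration (not part of the proposal itself), the two expansions above can be modelled in a few lines of Python. The function name `expand`, the `vectorised` set and the explicit `vl` argument are assumptions made for this sketch; in Simple-V this state would live in CSRs. Issuing an instruction that has no vectorised registers exactly once is also an assumption.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, int, int, int]  # (opcode, rd, rs1, rs2)


def expand(op: str, rd: int, rs1: int, rs2: int,
           vectorised: Set[int], vl: int) -> List[Instr]:
    """Expand one instruction into the scalar instructions queued in the FIFO."""

    def step(reg: int, i: int) -> int:
        # Vectorised registers advance by one per element; scalars repeat.
        return reg + i if reg in vectorised else reg

    # Assumption: with no vectorised registers the instruction is issued
    # once, exactly as on a scalar-only core.
    if not {rd, rs1, rs2} & vectorised:
        return [(op, rd, rs1, rs2)]
    return [(op, step(rd, i), step(rs1, i), step(rs2, i)) for i in range(vl)]
```

With `vl=3`, `expand("ADD", 7, 4, 4, {7, 4}, 3)` reproduces the first list and `expand("ADD", 7, 4, 1, {7, 1}, 3)` the second.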
-## Example of vector / vector, vector / scalar, scalar / scalar => vector add
-
-    function op_add(rd, rs1, rs2) # add not VADD!
-      int i, id=0, irs1=0, irs2=0;
-      rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
-      rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
-      rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
-      predval = get_pred_val(FALSE, rd);
-      for (i = 0; i < VL; i++)
-        if (predval & 1<<i) # predication uses intregs
-           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
-        if (int_vec[rd ].isvector)  { id += 1; }
-        if (int_vec[rs1].isvector)  { irs1 += 1; }
-        if (int_vec[rs2].isvector)  { irs2 += 1; }
-
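A runnable Python rendering of the pseudocode above may make the three cases easier to experiment with. It is a sketch under assumptions: `VecEntry` stands in for the `int_vec` CSR table, the predicate is passed in directly as `predval` rather than fetched with `get_pred_val`, `VL` is an explicit argument, and register-width wraparound is not modelled. The vector flags are read once, using the register numbers from the opcode, before any redirection to `regidx`; that is one reading of the pseudocode, not a statement of the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class VecEntry:
    """One entry of the (assumed) int_vec table, keyed by register number."""
    isvector: bool = False
    regidx: int = 0  # register at which the vector's elements start


def op_add(ireg: List[int], int_vec: Dict[int, VecEntry],
           rd: int, rs1: int, rs2: int, vl: int, predval: int) -> None:
    """Model of a scalar ADD opcode expanded element by element."""

    def lookup(reg: int) -> Tuple[bool, int]:
        entry = int_vec.get(reg, VecEntry())
        return entry.isvector, (entry.regidx if entry.isvector else reg)

    d_vec, d = lookup(rd)
    s1_vec, s1 = lookup(rs1)
    s2_vec, s2 = lookup(rs2)
    for i in range(vl):
        if predval & (1 << i):  # bit i of the predicate enables element i
            ireg[d] = ireg[s1] + ireg[s2]
        # Vector operands step to the next register; scalars stay put.
        # As in the pseudocode, stepping happens even for masked elements.
        d += 1 if d_vec else 0
        s1 += 1 if s1_vec else 0
        s2 += 1 if s2_vec else 0
```

Marking only `rd` and `rs2` as vectors gives the vector/scalar case, marking all three gives vector/vector, and an empty table degenerates to a plain scalar add repeated into the same destination.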
-## Retro-fitting Predication into branch-explicit ISA <a name="predication_retrofit"></a>
-
-One of the goals of this parallelism proposal is to avoid instruction
-duplication. However, with the base ISA having been designed explicitly
-to *avoid* condition-codes entirely, shoe-horning predication into it
-becomes quite challenging.