# Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal

Key insight: Simple-V is intended as an abstraction layer to provide
a consistent "API" to parallelisation of existing *and future* operations.
*Actual* internal hardware-level parallelism is *not* required, such
that Simple-V may be viewed as providing a "compact" or "consolidated"
means of issuing multiple near-identical arithmetic instructions to an
instruction queue (FIFO), pending execution.

*Actual* parallelism, if added independently of Simple-V in the form
of Out-of-order restructuring (including parallel ALU lanes) or VLIW
implementations, or SIMD, or anything else, would then benefit from
the uniformity of a consistent API.

[[!toc ]]

# Introduction

This proposal exists so as to be able to satisfy several disparate
requirements: power-conscious, area-conscious, and performance-conscious
designs all pull an ISA and its implementation in different conflicting
directions, as do the specific intended uses for any given implementation.

The existing P (SIMD) proposal and the V (Vector) proposals,
whilst each extremely powerful in their own right and clearly desirable,
are also:

* Clearly independent in their origins (AndesStar v3 and Cray respectively)
  so need work to adapt to the RISC-V ethos and paradigm
* Sufficiently large so as to make adoption (and exploration for
  analysis and review purposes) prohibitively expensive
* Both contain partial duplication of pre-existing RISC-V instructions
  (an undesirable characteristic)
* Both have independent, incompatible and disparate methods for introducing
  parallelism at the instruction level
* Both require that their respective parallelism paradigm be implemented
  alongside and integral to their respective functionality *or not at all*.
* Both independently have methods for introducing parallelism that
  could, if separated, benefit
  *other areas of RISC-V, not just DSP or Floating-point respectively*.

There are also key differences between Vectorisation and SIMD (full
details are outlined in the Appendix), the key points being:

* SIMD has a seductively compelling ease-of-implementation argument:
  each operation is passed to the ALU, which is where the parallelism
  lies. There is *negligible* (if any) impact on the rest of the core
  (with life instead being made hell for compiler writers and application
  writers due to extreme ISA proliferation).
* By contrast, Vectorisation has quite some complexity (in exchange for
  considerable flexibility, a reduction in opcode proliferation and much more).
* Vectorisation typically includes much more comprehensive memory load
  and store schemes (unit-stride, constant-stride and indexed), which
  in turn have ramifications: virtual memory misses (TLB cache misses)
  and even multiple page-faults... all caused by a *single instruction*,
  yet with the clear benefit that the regularisation of LOAD/STOREs can
  be optimised for minimal impact on caches and maximised throughput.
* By contrast, SIMD can use "standard" memory load/stores (32-bit aligned
  to pages), and these load/stores have absolutely nothing to do with the
  SIMD / ALU engine, no matter how wide the operand: simplicity, but with
  more impact on instruction and data caches.
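
The three Vector load modes mentioned above can be sketched behaviourally
(an illustrative model only: element width is fixed at one array slot, and
none of the function names are proposed mnemonics):

```python
# Behavioural sketch of the three classic vector LOAD modes.
# "mem" stands in for word-addressable memory.

def vload_unit(mem, base, vl):
    # unit-stride: elements are contiguous from base
    return [mem[base + i] for i in range(vl)]

def vload_strided(mem, base, stride, vl):
    # constant-stride: elements are "stride" apart
    return [mem[base + i * stride] for i in range(vl)]

def vload_indexed(mem, base, idx, vl):
    # indexed (gather): offsets come from another vector
    return [mem[base + idx[i]] for i in range(vl)]

mem = list(range(100))
assert vload_unit(mem, 10, 4) == [10, 11, 12, 13]
assert vload_strided(mem, 10, 3, 4) == [10, 13, 16, 19]
assert vload_indexed(mem, 10, [5, 0, 7, 2], 4) == [15, 10, 17, 12]
```

Any one of these may touch several pages, which is exactly where the TLB-miss
and page-fault complications described above come from.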

Overall it makes a huge amount of sense to have a means and method
of introducing instruction parallelism in a flexible way that provides
implementors with the option to choose exactly where they wish to offer
performance improvements and where they wish to optimise for power
and/or area (and if that can be offered even on a per-operation basis,
that provides even more flexibility).

Additionally it makes sense to *split out* the parallelism inherent within
each of P and V, and to see if each of P and V, in *combination* with
a "best-of-both" parallelism extension, could be layered back *on top* of
this proposal to topologically provide the exact same functionality of
each of P and V. Each of P and V can then focus on providing the best
operations possible for their respective target areas, without being
hugely concerned about the actual parallelism.

Furthermore, an additional goal of this proposal is to reduce the number
of opcodes utilised by each of P and V as they currently stand, leveraging
existing RISC-V opcodes where possible, and also potentially allowing
P and V to make use of Compressed Instructions as a result.

# Analysis and discussion of Vector vs SIMD

There are six combined areas between the two proposals that help with
parallelism (increased performance, reduced power / area) without
over-burdening the ISA with a huge proliferation of
instructions:

* Fixed vs variable parallelism (fixed or variable "M" in SIMD)
* Implicit vs fixed instruction bit-width (integral to instruction or not)
* Implicit vs explicit type-conversion (compounded on bit-width)
* Implicit vs explicit inner loops
* Single-instruction LOAD/STORE
* Masks / tagging (selecting/preventing certain indexed elements from execution)

The pros and cons of each are discussed and analysed below.

## Fixed vs variable parallelism length

In David Patterson and Andrew Waterman's analysis of SIMD and Vector
ISAs, the analysis comes out clearly in favour of (effectively)
variable-length SIMD. As SIMD is a fixed width, typically 4, 8 or in
extreme cases 16 or 32 simultaneous operations, the setup, teardown and
corner-cases of SIMD are extremely burdensome except for applications
whose requirements *specifically* match the *precise and exact* depth
of the SIMD engine.

Thus, SIMD, no matter what width is chosen, is never going to be acceptable
for general-purpose computation, and in the context of developing a
general-purpose ISA, is never going to satisfy 100 percent of implementors.

To explain this further: as performance requirements increase over time
for new target markets, implementors choose to extend the SIMD width
(so as to again avoid mixing parallelism into the instruction issue
phases: the primary "simplicity" benefit of SIMD in the first place),
with the result that the entire opcode space effectively doubles with
each new SIMD width that is added to the ISA.

That basically leaves "variable-length vector" as the clear *general-purpose*
winner, at least in terms of greatly simplifying the instruction set,
reducing the number of instructions required for any given task, and thus
reducing power consumption for the same.
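
The setup/teardown burden can be made concrete with a sketch: a fixed
4-wide SIMD loop needs a scalar tail for any length that is not a multiple
of the SIMD width, whereas a vsetvl-style variable-length loop handles any
length uniformly. The widths and the operation-trace encoding below are
assumptions for illustration only:

```python
# Sketch: processing N elements with a fixed 4-wide SIMD unit forces a
# scalar tail loop, whereas a variable-length vector ISA folds the tail
# into the same loop via a "set vector length" request (vsetvl-style).

SIMD_WIDTH = 4

def simd_add(a, b):
    ops = []                       # record issued operations
    n, i = len(a), 0
    while i + SIMD_WIDTH <= n:     # full-width SIMD bodies
        ops.append(("simd4", i))
        i += SIMD_WIDTH
    while i < n:                   # scalar tail: the corner-case burden
        ops.append(("scalar", i))
        i += 1
    return ops

def vector_add(a, b, mvl=8):
    ops = []
    n, i = len(a), 0
    while i < n:
        vl = min(mvl, n - i)       # hardware grants up to MVL elements
        ops.append(("vec", i, vl))
        i += vl
    return ops

assert simd_add([0]*7, [0]*7) == [("simd4", 0), ("scalar", 4),
                                  ("scalar", 5), ("scalar", 6)]
assert vector_add([0]*7, [0]*7) == [("vec", 0, 7)]
```

The tail loop is precisely the code that doubles again with every new SIMD
width added to an ISA; the vector version is invariant.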

## Implicit vs fixed instruction bit-width

SIMD again has a severe disadvantage here, over Vector: huge proliferation
of specialist instructions that target 8-bit, 16-bit, 32-bit and 64-bit
data, and then have to have operations *for each and between each*. It
gets very messy, very quickly: *six* separate dimensions giving an O(N^6)
instruction proliferation profile.

The V-Extension on the other hand proposes to set the bit-width of
future instructions on a per-register basis, such that subsequent instructions
involving that register are *implicitly* of that particular bit-width until
otherwise changed or reset.

This has some extremely useful properties, without being particularly
burdensome to implementations, given that instruction decode already has
to direct the operation to a correctly-sized width ALU engine, anyway.

Not least: in places where an ISA was previously constrained (due, for
whatever reason, to limitations of the available operand space),
implicit bit-width allows the meaning of certain operations to be
type-overloaded *without* pollution or alteration of frozen and immutable
instructions, in a fully backwards-compatible fashion.

## Implicit and explicit type-conversion

The Draft 2.3 V-extension proposal has (deprecated) polymorphism to help
deal with over-population of instructions, such that type-casting from
integer (and floating point) of various sizes is automatically inferred
due to "type tagging" that is set with a special instruction. A register
will be *specifically* marked as "16-bit Floating-Point" and, if added
to an operand that is specifically tagged as "32-bit Integer", an implicit
type-conversion will take place *without* requiring that type-conversion
to be explicitly done with its own separate instruction.

However, implicit type-conversion is not only quite burdensome to
implement (an explosion of inferred type-to-type conversions) but is also
never really going to be complete. It gets even worse when bit-widths
also have to be taken into consideration. Each new type results in an
O(N^2) increase of the conversion space and, as anyone who has examined
python's source code (which has built-in polymorphic type-conversion)
knows, the task is more complex than it first seems.
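
The quadratic growth is easy to demonstrate: with T base types and W
bitwidths there are T*W distinct representations, and an implicit scheme
must define (or infer) a conversion for every ordered pair of them. A
small counting sketch:

```python
# Sketch: the implicit-conversion space grows quadratically. With T base
# types and W bitwidths there are T*W distinct representations, and an
# implicit scheme needs a rule for every ordered pair of distinct
# representations.

from itertools import permutations

def conversion_pairs(types, widths):
    reps = [(t, w) for t in types for w in widths]
    return list(permutations(reps, 2))

pairs = conversion_pairs(["int", "float"], [8, 16, 32, 64])
# 2 types * 4 widths = 8 representations -> 8 * 7 = 56 ordered pairs
assert len(pairs) == 56
```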

Overall, type-conversion is generally best left to explicit
type-conversion instructions, or in definite specific use-cases made
part of an actual instruction (DSP or FP).

## Zero-overhead loops vs explicit loops

The initial Draft P-SIMD Proposal by Chuanhua Chang of Andes Technology
contains an extremely interesting feature: zero-overhead loops. This
proposal would basically allow an inner loop of instructions to be
repeated a fixed number of times without any explicit branching.

Its specific advantage over explicit loops is that the pipeline in a DSP
can potentially be kept completely full *even in an in-order single-issue
implementation*. Normally, it requires a superscalar architecture and
out-of-order execution capabilities to "pre-process" instructions in
order to keep ALU pipelines 100% occupied.

By bringing that capability in, this proposal could offer a way to increase
pipeline activity even in simpler implementations in the one key area
which really matters: the inner loop.
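
The semantics can be sketched as an interpreter: the hardware repeats a
marked block a fixed number of times, and crucially executes zero branch
instructions while doing so (the "program" encoding here is purely
illustrative, not a proposed mechanism):

```python
# Sketch of zero-overhead loop semantics: a marked instruction block is
# repeated a fixed number of times without any per-iteration branch
# instruction being fetched or executed.

def run_zol(block, count, state):
    branches_executed = 0          # a real ZOL adds none
    for _ in range(count):
        for insn in block:
            insn(state)
    return state, branches_executed

state = {"acc": 0}
inner = [lambda s: s.__setitem__("acc", s["acc"] + 2)]
state, branches = run_zol(inner, 5, state)
assert state["acc"] == 10
assert branches == 0
```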

However, when compared against much more comprehensive schemes such as
"A portable specification of zero-overhead loop control hardware
applied to embedded processors" (ZOLC), optimising only the single
inner loop seems inadequate, tending to suggest that ZOLC may be
better off being proposed as an entirely separate Extension.

## Single-instruction LOAD/STORE

In traditional Vector Architectures there are instructions which
result in multiple register-memory transfer operations resulting
from a single instruction. They are complicated to implement in hardware,
yet the benefit is a hugely consistent regularisation of memory accesses
that can be highly optimised with respect to both actual memory and any
L1, L2 or other caches. In Hwacha EECS-2015-263 the consequences of
getting this architecturally wrong are made explicitly clear:
L2 cache-thrashing at the very least.

Complications arise when Virtual Memory is involved: TLB cache misses
need to be dealt with, as do page faults. Some of the tradeoffs are
discussed in <http://people.eecs.berkeley.edu/~krste/thesis.pdf>, Section
4.6, and an article by Jeff Bush on facing some of these issues
is particularly enlightening:
<https://jbush001.github.io/2015/11/03/lost-in-translation.html>

Interestingly, none of this complexity is faced in SIMD architectures...
but then they do not get the opportunity to optimise for highly-streamlined
memory accesses either.

With the "bang-per-buck" ratio being so high and the indirect improvement
in L1 Instruction Cache usage (reduced instruction count), as well as
the opportunity to optimise L1 and L2 cache usage, the case for including
Vector LOAD/STORE is compelling.

## Mask and Tagging (Predication)

Tagging (aka Masks aka Predication) is a pseudo-method of implementing
simplistic branching in a parallel fashion, by allowing execution on
elements of a vector to be switched on or off depending on the results
of prior operations in the same array position.

The reason for considering this is simple: by *definition* it
is not possible to perform individual parallel branches in a SIMD
(Single-Instruction, **Multiple**-Data) context. Branches (modifying
of the Program Counter) will result in *all* parallel data having
a different instruction executed on it: that's just the definition of
SIMD, and it is simply unavoidable.

So these are the ways in which conditional execution may be implemented:

* explicit compare and branch: BNE x, y -> offs would jump offs
  instructions if x was not equal to y
* explicit store of tag condition: CMP x, y -> tagbit
* implicit (condition-code): e.g. ADD results in a carry, and the carry bit
  implicitly (or sometimes explicitly) goes into a "tag" (mask) register

The first of these is a "normal" branch method, which is flat-out impossible
to parallelise without look-ahead and effectively rewriting instructions.
This would defeat the purpose of RISC.

The latter two are where parallelism becomes easy to do without complexity:
every operation is modified to be "conditionally executed" (in an explicit
way directly in the instruction format *or* implicitly).

RVV (Vector-Extension) proposes to have *explicit* storing of the compare
in a tag/mask register, and to *explicitly* have every vector operation
*require* that its operation be "predicated" on the bits within an
explicitly-named tag/mask register.

SIMD (P-Extension) has not yet published precise documentation on what its
schema is to be: there is however verbal indication at the time of writing
that:

> The "compare" instructions in the DSP/SIMD ISA proposed by Andes will
> be executed using the same compare ALU logic for the base ISA with some
> minor modifications to handle smaller data types. The function will not
> be duplicated.

This is an *implicit* form of predication as the base RV ISA does not have
condition-codes or predication. By adding a CSR it becomes possible
to also tag certain registers as "predicated if referenced as a destination".
Example:

    // in future operations from now on, if r0 is the destination use r5 as
    // the PREDICATION register
    SET_IMPLICIT_CSRPREDICATE r0, r5
    // store the compares in r5 as the PREDICATION register
    CMPEQ8 r5, r1, r2
    // r0 is used here. ah ha! that means it's predicated using r5!
    ADD8 r0, r1, r3

With enough registers (and in RISC-V there are enough registers) some fairly
complex predication can be set up and yet still execute without significant
stalling, even in a simple non-superscalar architecture.
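
The example above can be sketched behaviourally. The instruction names are
taken from the example itself; everything else (4-element registers, the
dict-based CSR) is an assumption made purely to keep the model small:

```python
# Behavioural sketch: r5 is registered as the predicate for destination
# r0, CMPEQ8 fills r5 with per-element compare bits, and a subsequent
# ADD8 targeting r0 is implicitly masked by r5.

regs = {i: [0, 0, 0, 0] for i in range(8)}
pred_csr = {}                      # dest-reg -> predicate-reg

def set_implicit_csrpredicate(rd, rp):
    pred_csr[rd] = rp

def cmpeq8(rd, rs1, rs2):
    regs[rd] = [int(a == b) for a, b in zip(regs[rs1], regs[rs2])]

def add8(rd, rs1, rs2):
    mask = regs[pred_csr[rd]] if rd in pred_csr else [1] * 4
    regs[rd] = [a + b if m else old
                for a, b, m, old in zip(regs[rs1], regs[rs2], mask, regs[rd])]

regs[1] = [1, 2, 3, 4]
regs[2] = [1, 0, 3, 0]
regs[3] = [10, 10, 10, 10]
set_implicit_csrpredicate(0, 5)    # r0's predicate lives in r5
cmpeq8(5, 1, 2)                    # r5 = [1, 0, 1, 0]
add8(0, 1, 3)                      # only elements 0 and 2 are written
assert regs[5] == [1, 0, 1, 0]
assert regs[0] == [11, 0, 13, 0]
```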

(For details on how Branch Instructions would be retro-fitted to indirectly
predicated equivalents, see Appendix.)

## Conclusions

In the above sections the six different ways in which parallel instruction
execution has closely and loosely inter-related implications for the ISA and
for implementors were outlined. The pluses and minuses came out as
follows:

* Fixed vs variable parallelism: <b>variable</b>
* Implicit (indirect) vs fixed (integral) instruction bit-width: <b>indirect</b>
* Implicit vs explicit type-conversion: <b>explicit</b>
* Implicit vs explicit inner loops: <b>implicit but best done separately</b>
* Single-instruction Vector LOAD/STORE: <b>Complex but highly beneficial</b>
* Tag or no-tag: <b>Complex but highly beneficial</b>

In particular:

* variable-length vectors came out on top because of the high setup, teardown
  and corner-case costs associated with the fixed width of SIMD.
* Implicit bit-width helps to extend the ISA to escape from
  former limitations and restrictions (in a backwards-compatible fashion),
  whilst also leaving implementors free to simplify implementations
  by using actual explicit internal parallelism.
* Implicit (zero-overhead) loops provide a means to keep pipelines
  potentially 100% occupied in a single-issue in-order implementation
  i.e. *without* requiring a super-scalar or out-of-order architecture,
  but doing a proper, full job (ZOLC) is an entirely different matter.

Constructing a SIMD/Simple-Vector proposal based around four of these six
requirements would therefore seem to be a logical thing to do.

# Note on implementation of parallelism

One extremely important aspect of this proposal is to respect and support
implementors' desire to focus on power, area or performance. In that regard,
it is proposed that implementors be free to choose whether to implement
the Vector (or variable-width SIMD) parallelism as sequential operations
with a single ALU, fully parallel (if practical) with multiple ALUs, or
a hybrid combination of both.

In Broadcom's Videocore-IV, they chose hybrid, and called it "Virtual
Parallelism". They achieve 16-way SIMD at an **instruction** level
by providing a combination of a 4-way parallel ALU *and* an externally
transparent loop that feeds 4 sequential sets of data into each of the
4 ALUs.
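
This "Virtual Parallelism" scheme can be sketched as follows (a model of
the 4-lane, 4-beat arrangement described above; lane count and the
addition operation are illustrative assumptions):

```python
# Sketch of "Virtual Parallelism": a 16-element operation issued as one
# instruction, executed on 4 physical ALU lanes over 4 sequential beats.

LANES = 4

def virtual_parallel_add(a, b):
    assert len(a) == len(b) == 16
    result = [0] * 16
    beats = 0
    for beat in range(len(a) // LANES):    # 4 sequential feeds
        base = beat * LANES
        for lane in range(LANES):          # 4 lanes operate in parallel
            result[base + lane] = a[base + lane] + b[base + lane]
        beats += 1
    return result, beats

res, beats = virtual_parallel_add(list(range(16)), [1] * 16)
assert res == [i + 1 for i in range(16)]
assert beats == 4
```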

Also in the same core, it is worth noting that particularly uncommon
but essential operations (Reciprocal-Square-Root for example) are
*not* part of the 4-way parallel ALU but instead have a *single* ALU.
Under the proposed Vector (variable-width SIMD) scheme, implementors would
be free to do precisely that: i.e. free to choose *on a per-operation
basis* whether and how much "Virtual Parallelism" to deploy.

It is absolutely critical to note that it is proposed that such choices MUST
be **entirely transparent** to the end-user and the compiler. Whilst
a Vector (variable-width SIMD) may not precisely match the width of the
parallelism within the implementation, the end-user **should not care**,
and in this way the performance benefits are gained but the ISA remains
straightforward. All that happens at the end of an instruction run is: some
parallel units (if there are any) would remain offline, completely
transparently to the ISA, the program, and the compiler.

To make that clear: should an implementor choose a particularly wide
SIMD-style ALU, each parallel unit *must* have predication so that
the parallel SIMD ALU may emulate variable-length parallel operations.
The "SIMD considered harmful" trap of having huge complexity and extra
instructions to deal with corner-cases is thus avoided, and implementors
get to choose precisely where to focus and target the benefits of their
implementation efforts, without "extra baggage".

In addition, implementors will be free to choose whether to provide an
absolute bare minimum level of compliance with the "API" (software-traps
when vectorisation is detected), all the way up to full
supercomputing-level all-hardware parallelism. Options are covered in the
Appendix.

# CSRs <a name="csrs"></a>

There are two CSR tables needed to create the lookup tables which are used at
the register decode phase:

* Integer Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
* Integer Register N is a Predication Register (note: a key-value store)

Also (see Appendix, "Context Switch Example") it may turn out to be important
to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
Vectorised LOAD / STORE may be used to load and store multiple registers:
something that is missing from the Base RV ISA.

Notes:

* for the purposes of LOAD / STORE, Integer Registers which are
  marked as a Vector will result in a Vector LOAD / STORE.
* Vector Lengths are *not* the same as vsetl but are an integral part
  of vsetl.
* Actual vector length is *multiplied* by how many blocks of length
  "bitwidth" may fit into an XLEN-sized register file.
* Predication is a key-value store due to the implicit referencing,
  as opposed to having the predicate register explicitly in the instruction.
* Whilst the predication CSR is a key-value store it *generates* easier-to-use
  state information.
* TODO: assess whether the same technique could be applied to the other
  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
  needed for context-switches (empty slots need never be stored).

## Predication CSR <a name="predication_csr_table"></a>

The Predication CSR is a key-value store indicating whether, if a given
destination register (integer or floating-point) is referred to in an
instruction, it is to be predicated. It is important to note
that the *actual* register used for the predicate mask is *different* from
the one named in the CSR entry, due to the level of indirection through
the lookup table. This includes (as a future option) redirecting to a
*second* bank of integer registers.

* regidx is the register which, in combination with the
  i/f flag, acts as the key: if that integer or floating-point register
  is referred to as a destination, the lookup table is referenced to find
  the predication mask to use on that operation.
* predidx (in combination with the bank bit in the future) is the
  *actual* register to be used for the predication mask. Note:
  in effect predidx is actually a 6-bit register address, as the bank
  bit is the MSB (and is nominally set to zero for now).
* inv indicates that the predication mask bits are to be inverted
  prior to use *without* actually modifying the contents of the
  register itself.
* zeroing is either 1 or 0, and if set to 1, the operation must
  place zeros in any element position where the predication mask is
  set to zero. If zeroing is set to 0, unpredicated elements *must*
  be left alone. Some microarchitectures may choose to interpret
  this as skipping the operation entirely. Others which wish to
  stick more closely to a SIMD architecture may choose instead to
  interpret unpredicated elements as an internal "copy element"
  operation (which would be necessary in SIMD microarchitectures
  that perform register-renaming).

| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
| ----- | ------ | ------ | ----- | --- | ------ | ------- |
| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |

The Predication CSR Table is a key-value store, so implementation-wise
it will be faster to turn the table around (maintain topologically
equivalent state):

    struct pred {
        bool zero;    // zeroing mode
        bool inv;     // invert the mask before use
        bool bank;    // 0 for now, 1=rsvd
        bool enabled; // predication is active for this register
        int predidx;  // redirection: actual int register to use
    }

    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
    struct pred int_pred_reg[32];  // 64 in future (bank=1)

    for (i = 0; i < 16; i++) {
        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
        idx = CSRpred[i].regidx
        tb[idx].zero = CSRpred[i].zero
        tb[idx].inv = CSRpred[i].inv
        tb[idx].bank = CSRpred[i].bank
        tb[idx].predidx = CSRpred[i].predidx
        tb[idx].enabled = true
    }

So when an operation is to be predicated, it is the internal state that
is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
pseudo-code for operations is given, where p is the explicit (direct)
reference to the predication register to be used:

    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            (d ? vreg[rd][i] : sreg[rd]) =
                iop(s1 ? vreg[rs1][i] : sreg[rs1],
                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs

This instead becomes an *indirect* reference using the *internal* state
table generated from the Predication CSR key-value store, which is used
as follows:

    if type(iop) == INT:
        preg = int_pred_reg
    else:
        preg = fp_pred_reg

    for (int i=0; i<vl; ++i)
        predidx = preg[rd].predidx; // the indirection takes place HERE
        if (!preg[rd].enabled)
            predicate = ~0x0; // all parallel ops enabled
        else {
            predicate = intregfile[predidx]; // get actual reg contents HERE
            if (preg[rd].inv) // invert if requested
                predicate = ~predicate;
        }
        if (predicate & (1<<i))
            (d ? regfile[rd+i] : regfile[rd]) =
                iop(s1 ? regfile[rs1+i] : regfile[rs1],
                    s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
        else if (preg[rd].zero)
            regfile[rd+i] = 0; // place zero in the masked-out dest element

Note:

* d, s1 and s2 are booleans indicating whether destination,
  source1 and source2 are vector or scalar
* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
  above, for clarity. rd, rs1 and rs2 must ALSO each go through
  register-level redirection (from the Register CSR table) if they are
  vectors.

If written as a function, obtaining the predication mask (but not whether
zeroing takes place) may be done as follows:

    def get_pred_val(bool is_fp_op, int reg):
        tb = fp_pred if is_fp_op else int_pred
        if (!tb[reg].enabled):
            return ~0x0 // all ops enabled
        predidx = tb[reg].predidx // redirection occurs HERE
        predicate = intreg[predidx] // actual predicate HERE
        if (tb[reg].inv):
            predicate = ~predicate // invert ALL bits
        return predicate

## MAXVECTORDEPTH

MAXVECTORDEPTH is the same concept as MVL in RVV. However in Simple-V,
given that its primary (base, unextended) purpose is for 3D, Video and
other purposes (not requiring supercomputing capability), it makes sense
to limit MAXVECTORDEPTH to the regfile bitwidth (32 for RV32, 64 for RV64
and so on).

The reason for setting this limit is so that predication registers, when
marked as such, may fit into a single register as opposed to fanning out
over several registers. This keeps the implementation a little simpler.
Note also (as described in the VSETVL section) that the *minimum*
for MAXVECTORDEPTH must be the total number of registers (15 for RV32E
and 31 for RV32 or RV64).

Note that RVV on top of Simple-V may choose to over-ride this decision.

## Register CSR key-value (CAM) table

The purpose of the Register CSR table is four-fold:

* To mark integer and floating-point registers as requiring "redirection"
  if ever used as a source or destination in any given operation.
  This involves a level of indirection through a 5-to-6-bit lookup table
  (where the 6th bit - bank - is always set to 0 for now).
* To indicate whether, after redirection through the lookup table, the
  register is a vector (or remains a scalar).
* To over-ride the implicit or explicit bitwidth that the operation would
  normally give the register.
* To indicate if the register is to be interpreted as "packed" (SIMD)
  i.e. containing multiple contiguous elements of size equal to "bitwidth".

| RgCSR | 15 | 14 | 13 | (12..11) | 10 | (9..5) | (4..0) |
| ----- | ------ | ------ | ------- | ----- | --- | ------ | ------ |
| 0 | simd0 | bank0 | isvec0 | vew0 | i/f | regkey | regidx |
| 1 | simd1 | bank1 | isvec1 | vew1 | i/f | regkey | regidx |
| .. | simd.. | bank.. | isvec.. | vew.. | i/f | regkey | regidx |
| 15 | simd15 | bank15 | isvec15 | vew15 | i/f | regkey | regidx |

vew may be one of the following (giving a table "bytestable", used below):

| vew | bitwidth |
| --- | --------- |
| 00 | default |
| 01 | default/2 |
| 10 | 8 |
| 11 | 16 |

Extending this table (with extra bits) is covered in the section
"Implementing RVV on top of Simple-V".

As the above table is a CAM (key-value store) it may be appropriate
to expand it as follows:

    struct vectorised fp_vec[32], int_vec[32]; // 64 in future

    for (i = 0; i < 16; i++) { // 16 CSRs?
        tb = int_vec if CSRvec[i].type == 0 else fp_vec
        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
        tb[idx].elwidth = CSRvec[i].elwidth
        tb[idx].regidx = CSRvec[i].regidx // indirection
        tb[idx].isvector = CSRvec[i].isvector // 0=scalar
        tb[idx].packed = CSRvec[i].packed // SIMD or not
        tb[idx].bank = CSRvec[i].bank // 0 (1=rsvd)
    }
TODO: move elsewhere

    # TODO: use elsewhere (retire for now)
    vew = CSRbitwidth[rs1]
    if (vew == 0)
        elwidth = XLEN # or FLEN as appropriate
    elif (vew == 1)
        elwidth = XLEN/2 # or FLEN/2 as appropriate
    else:
        elwidth = bytestable[vew] # 8 or 16
    simdmult = XLEN / elwidth # packed elements per register
    vlen = CSRvectorlen[rs1] * simdmult
    CSRvlength = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)

The reason for multiplying the vector length by the number of SIMD elements
(in each individual register) is so that each SIMD element may optionally be
predicated.
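
A runnable rendering of the intended calculation, assuming XLEN=64 and the
vew encoding table from this section (the function name and the explicit
width table are illustrative, not part of the proposal):

```python
# Sketch of the effective-vector-length calculation: packed SIMD
# elements multiply the nominal vector length, then the result is
# MIN-filtered against MAXVECTORDEPTH and the rs2 request.

XLEN = 64
MAXVECTORDEPTH = 64
bitwidth_table = {2: 8, 3: 16}     # vew=10 -> 8-bit, vew=11 -> 16-bit

def effective_vl(vectorlen, vew, rs2):
    if vew == 0:
        elwidth = XLEN             # default: full-register elements
    elif vew == 1:
        elwidth = XLEN // 2        # default/2
    else:
        elwidth = bitwidth_table[vew]
    simdmult = XLEN // elwidth     # packed elements per register
    vlen = vectorlen * simdmult
    return min(min(vlen, MAXVECTORDEPTH), rs2)

# 4 registers of packed 8-bit elements -> 32 predicable elements
assert effective_vl(4, 2, 64) == 32
# rs2 caps the request
assert effective_vl(4, 2, 10) == 10
```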

An example of how to subdivide the register file when bitwidth != default
is given in the section "Bitwidth Virtual Register Reordering".

# Instructions

Despite being a 98% complete and accurate topological remap of RVV
concepts and functionality, the only instructions needed are VSETVL
and VGETVL. *All* RVV instructions can be re-mapped, however xBitManip
becomes a critical dependency for efficient manipulation of predication
masks (as a bit-field). Despite the removal of all but VSETVL and VGETVL,
*all instructions from RVV are topologically re-mapped and retain their
complete functionality, intact*.

Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
equivalents, so are left out of Simple-V. VSELECT could be included if
there existed a MV.X instruction in RV (MV.X is a hypothetical
non-immediate variant of MV that would allow another register to
specify which register was to be copied). Note that if any of these three
instructions are added to any given RV extension, their functionality
will be inherently parallelised.

## Instruction Format

The instruction format for Simple-V does not actually have *any* explicit
compare operations, *any* arithmetic, floating point or *any*
memory instructions.
Instead it *overloads* pre-existing branch operations into predicated
variants, and implicitly overloads arithmetic operations, MV,
FCVT, and LOAD/STORE
depending on CSR configurations for bitwidth and
predication. **Everything** becomes parallelised. *This includes
Compressed instructions* as well as any
future instructions and Custom Extensions.

* For analysis of RVV see [[v_comparative_analysis]] which begins to
  outline topologically-equivalent mappings of instructions
* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
  for format of Branch opcodes.

**TODO**: *analyse and decide whether the implicit nature of predication
as proposed is or is not a lot of hassle, and if explicit prefixes are
a better idea instead. Parallelism therefore effectively may end up
as always being 64-bit opcodes (32 for the prefix, 32 for the instruction),
with some opportunities to use Compressed bringing it down to 48.
Also to consider is whether one or both of the last two remaining Compressed
instruction codes in Quadrant 1 could be used as a parallelism prefix,
bringing parallelised opcodes down to 32-bit (when combined with C)
and having the benefit of being explicit.*
-
-## VSETVL
-
-NOTE TODO: 28may2018: VSETVL may need to be *really* different from RVV,
-with the instruction format remaining the same.
-
-VSETVL is slightly different from RVV in that the minimum vector length
-is required to be at least the number of registers in the register file,
-and no more than XLEN. This allows vector LOAD/STORE to be used to switch
-the entire bank of registers using a single instruction (see Appendix,
-"Context Switch Example"). The reason for limiting VSETVL to XLEN is
-down to the fact that predication bits fit into a single register of length
-XLEN bits.
-
-The second change is that when VSETVL is requested to be stored
-into x0, it is *ignored* silently (VSETVL x0, x5, #4)
-
-The third change is that there is an additional immediate added to VSETVL,
-to which VL is set after first going through MIN-filtering.
-So When using the "vsetl rs1, rs2, #vlen" instruction, it becomes:
-
- VL = MIN(MIN(vlen, MAXVECTORDEPTH), rs2)
-
-where RegfileLen <= MAXVECTORDEPTH < XLEN
-
-This has implication for the microarchitecture, as VL is required to be
-set (limits from MAXVECTORDEPTH notwithstanding) to the actual value
-requested in the #immediate parameter. RVV has the option to set VL
-to an arbitrary value that suits the conditions and the micro-architecture:
-SV does *not* permit that.
-
-The reason is so that if SV is to be used for a context-switch or as a
-substitute for LOAD/STORE-Multiple, the operation can be done with only
-2-3 instructions (setup of the CSRs, VSETVL x0, x0, #{regfilelen-1},
-single LD/ST operation). If VL does *not* get set to the register file
-length when VSETVL is called, then a software-loop would be needed.
-To avoid this need, VL *must* be set to exactly what is requested
-(limits notwithstanding).
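-
-The VL-setting rule above can be sketched as a small runnable model
-(Python purely for clarity; the `vsetvl` function, `regs` array and
-the MAXVECTORDEPTH value are illustrative assumptions, not part of
-the spec):

```python
# Sketch of SV VSETVL semantics (not RVV): VL is set to exactly the
# requested length after MIN-filtering. MAXVECTORDEPTH is illustrative.
MAXVECTORDEPTH = 32  # RegfileLen <= MAXVECTORDEPTH <= XLEN

def vsetvl(rd, rs2_val, vlen_imm, regs):
    vl = min(min(vlen_imm, MAXVECTORDEPTH), rs2_val)
    if rd != 0:               # writes to x0 are silently ignored
        regs[rd] = vl
    return vl

regs = [0] * 32
vl = vsetvl(0, 16, 31, regs)  # VSETVL x0, x5(=16), #31
# vl == 16: the requested #31 is MIN-filtered against rs2 (16)
```

-Note how, unlike RVV, the implementation has no freedom to pick a
-smaller VL than the MIN-filtered request.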
-
-Therefore, in turn, unlike RVV, implementors *must* provide
-pseudo-parallelism (using sequential loops in hardware) if actual
-hardware-parallelism in the ALUs is not deployed. A hybrid is also
-permitted (as used in Broadcom's VideoCore-IV) however this must be
-*entirely* transparent to the ISA.
-
-## Branch Instructions
-
-Branch operations use standard RV opcodes that are reinterpreted to
-be "predicate variants" in the instance where either of the two src
-registers is marked as a vector (isvector=1). When this reinterpretation
-is enabled the "immediate" field of the branch operation is taken to be a
-predication target register, rs3. The predicate target register rs3 is
-to be treated as a bitfield (up to a maximum of XLEN bits corresponding
-to a maximum of XLEN elements).
-
-If either of src1 or src2 are scalars (CSRvectorlen[src] == 0) the comparison
-goes ahead as vector-scalar or scalar-vector. Implementors should note that
-this could require considerable multi-porting of the register file in order
-to parallelise properly, so may have to involve the use of register caching
-and transparent copying (see Multiple-Banked Register File Architectures
-paper).
-
-In instances where no vectorisation is detected on either src registers
-the operation is treated as an absolutely standard scalar branch operation.
-
-This is the overloaded table for Integer-based Branch operations. Opcode
-(bits 6..0) is set in all cases to 1100011.
-
-[[!table data="""
-31 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12,10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-7 | 5 | 5 | 3 | 4 | 1 | 7 |
-reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
-reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
-reserved | src2 | src1 | 001 | predicate rs3 || BNE |
-reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 100 | predicate rs3 || BLE |
-reserved | src2 | src1 | 101 | predicate rs3 || BGE |
-reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
-reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
-"""]]
-
-Note that just as with the standard (scalar, non-predicated) branch
-operations, BLT, BGT, BLEU and BGTU may be synthesised by swapping
-src1 and src2.
-
-Below is the overloaded table for Floating-point Predication operations.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield (done
-implicitly).
-
-As with standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
-Likewise for Single-precision, fmt (bits 26..25) is still set to 00.
-Double-precision is still set to 01, whilst Quad-precision
-appears not to have a definition in V2.3-Draft (but should be unaffected).
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-has more of an impact. To deal with this, SV's predication has
-had "invert" added to it.
-
-[[!table data="""
-31 .. 27| 26 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 7 | 6 ... 0 |
-funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
-5 | 2 | 5 | 5 | 3 | 5 | 7 |
-10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
-10100 | 00/01/11 | src2 | src1 | **011**| pred rs3 | rsvd |
-10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
-10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
-"""]]
-
-Note (**TBD**): floating-point exceptions will need to be extended
-to cater for multiple exceptions (and statuses of the same). The
-usual approach is to have an array of status codes and bit-fields,
-and one exception, rather than throw separate exceptions for each
-Vector element.
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
- for (int i=0; i<vl; ++i)
- if ([!]preg[p][i])
- preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
-    if I/F == INT: # integer type cmp
-        preg = int_pred_reg
-        reg = int_regfile
-    else:
-        preg = fp_pred_reg
-        reg = fp_regfile
-
-    s1 = reg_is_vectorised(src1);
-    s2 = reg_is_vectorised(src2);
-    if (!s2 && !s1) goto branch;
-    for (int i = 0; i < VL; ++i)
-        if (cmp(s1 ? reg[src1+i] : reg[src1],
-                s2 ? reg[src2+i] : reg[src2]))
-            preg[rs3] |= 1<<i; # bitfield not vector
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
- into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length times (number of SIMD elements) bits
- in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually adjust the Program
-  Counter, so bits 25 through 30 are not needed in any case.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
-  be put to good use if there is a suitable use-case.
-  FLT and FLE may be inverted to FGT and FGE if needed by swapping
-  src1 and src2 (likewise the integer counterparts).
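-
-The predicated-compare loop above can be exercised as a runnable
-sketch (Python purely for testability; the flat register file and the
-function name are illustrative assumptions, and elwidth/SIMD is
-ignored as in the pseudocode):

```python
# Sketch of the predicated-compare loop: the result is a bitfield in
# the rs3 predicate register, one bit per element (names illustrative).
def predicated_cmp(cmp, reg, src1, src2, s1, s2, vl):
    """Return the rs3 predicate bitfield for a vectorised branch-compare."""
    pred = 0
    for i in range(vl):
        a = reg[src1 + i] if s1 else reg[src1]   # vector or scalar src1
        b = reg[src2 + i] if s2 else reg[src2]   # vector or scalar src2
        if cmp(a, b):
            pred |= 1 << i                        # bitfield, not a vector
    return pred

reg = [0, 5, 1, 2, 3, 9]          # x1=5 (scalar), x2..x5 = vector
pred = predicated_cmp(lambda a, b: a < b, reg, 2, 1, True, False, 4)
# pred == 0b0111: elements 1, 2, 3 compare less-than 5; element 9 does not
```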
-
-## Compressed Branch Instructions
-
-Compressed Branch instructions are likewise re-interpreted as predicated
-2-register operations, with the result going into rs3. All the bits of
-the immediate are re-interpreted for different purposes, to extend the
-number of comparator operations to beyond the original specification,
-but also to cater for floating-point comparisons as well as integer ones.
-
-[[!table data="""
-15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
-funct3 | imm | rs10 | imm | rs20 | op | |
-3 | 3 | 3 | 2 | 3 | 2 | |
-C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
-110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
-111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
-110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
-111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
-"""]]
-
-Notes:
-
-* Bits 5, 13, 14 and 15 make up the comparator type
-* Bit 6 indicates whether to use integer or floating-point comparisons
-* In both floating-point and integer cases there are four predication
- comparators: EQ/NEQ/LT/LE (with GT and GE being synthesised by inverting
- src1 and src2).
-
-## LOAD / STORE Instructions <a name="load_store"></a>
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction,
-and likewise for STORE.
-
-Revised LOAD:
-
-[[!table data="""
-31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
-imm[11:0] |||| rs1 | funct3 | rd | opcode |
-1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
-? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format. Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
- (for both integer and floating-point variants).
-* The predication CSR-marking register is not explicitly shown in the
-  instruction: it is implicit, based on the CSR predicate state for the
-  rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
- is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-
-    preg = int_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-      if ([!](preg & 1<<i))
-        for (int j=0; j<seglen+1; j++)
-        {
-          if (CSRvectorised[rs2])
-            offs = vreg[rs2+i]
-          else
-            offs = i*(seglen+1)*stride;
-          vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
-        }
-
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
-
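-A runnable sketch of the LOAD loop above (Python for testability;
-seglen and SIMD bitwidth are omitted for brevity, memory is modelled
-as a dict, and all names are illustrative assumptions):

```python
# Sketch of the vectorised LOAD loop: unit-stride, constant-stride
# and indexed (LD.X) variants, with predication (names illustrative).
def vec_load(mem, vreg, rd, base_addr, rs2, rs2_is_vector, pred, vl,
             elsize=8, unit_strided=True, const_stride=0):
    stride = elsize if unit_strided else const_stride
    for i in range(vl):
        if not (pred >> i) & 1:          # predicated-out element: skip
            continue
        if rs2_is_vector:                # LD.X: offset from index vector
            offs = vreg[rs2 + i]
        else:                            # LD / LD.S: computed offset
            offs = i * stride
        vreg[rd + i] = mem[base_addr + offs]

mem = {0x100 + 8 * i: i * 10 for i in range(4)}   # 4 doublewords
vreg = [0] * 32
vec_load(mem, vreg, rd=8, base_addr=0x100, rs2=0, rs2_is_vector=False,
         pred=0b1011, vl=4)
# vreg[8..11] == [0, 10, 0, 30]: element 2 is masked out
```

-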
-A similar instruction exists for STORE, with identical topological
-translation of all features. **TODO**
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are of the same format, where bits 2-4 are
-a src register instead of dest:
-
-[[!table data="""
-15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
-funct3 | imm | rs10 | imm | rd0 | op |
-3 | 3 | 3 | 2 | 3 | 2 |
-C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled. In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments
-on page 76, "For virtual memory systems some data accesses could be
-resident in physical memory and some not". The interesting question
-then arises: how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults and adjusting the TLB
-in advance, accordingly: other strategies are explored in the Appendix
-Section "Virtual Memory Page Faults".
-
-## Vectorised Copy/Move (and conversion) instructions
-
-There is a series of 2-operand instructions involving copying (and
-alteration): C.MV, FMV, FNEG, FABS, FCVT, FSGNJ. These operations all
-follow the same pattern, as it is *both* the source *and* destination
-predication masks that are taken into account. This is different from
-the three-operand arithmetic instructions, where the predication mask
-is taken from the *destination* register, and applied uniformly to the
-elements of the source register(s), element-for-element.
-
-### C.MV Instruction <a name="c_mv"></a>
-
-There is no MV instruction in RV however there is a C.MV instruction.
-It is used for copying integer-to-integer registers (vectorised FMV
-is used for copying floating-point).
-
-If either the source or the destination register are marked as vectors
-C.MV is reinterpreted to be a vectorised (multi-register) predicated
-move operation. The actual instruction's format does not change:
-
-[[!table data="""
-15 12 | 11 7 | 6 2 | 1 0 |
-funct4 | rd | rs | op |
-4 | 5 | 5 | 2 |
-C.MV | dest | src | C0 |
-"""]]
-
-A simplified version of the pseudocode for this operation is as follows:
-
-    function op_mv(rd, rs) # MV not VMV!
-      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
-      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
-      ps = get_pred_val(FALSE, rs); # predication on src
-      pd = get_pred_val(FALSE, rd); # ... AND on dest
-      for (int i = 0, j = 0; i < VL && j < VL;)
-      {
-        if (int_vec[rs].isvec) while (!(ps & 1<<i)) i++;
-        if (int_vec[rd].isvec) while (!(pd & 1<<j)) j++;
-        ireg[rd+j] <= ireg[rs+i];
-        if (int_vec[rs].isvec) i++;
-        if (int_vec[rd].isvec) j++;
-      }
-
-Note that:
-
-* elwidth (SIMD) is not covered above
-* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
- not covered
-
-There are several different instructions from RVV that are covered by
-this one opcode:
-
-[[!table data="""
-src | dest | predication | op |
-scalar | vector | none | VSPLAT |
-scalar | vector | destination | sparse VSPLAT |
-scalar | vector | 1-bit dest | VINSERT |
-vector | scalar | 1-bit? src | VEXTRACT |
-vector | vector | none | VCOPY |
-vector | vector | src | Vector Gather |
-vector | vector | dest | Vector Scatter |
-vector | vector | src & dest | Gather/Scatter |
-vector | vector | src == dest | sparse VCOPY |
-"""]]
-
-Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
-operations with inversion on the src and dest predication for one of the
-two C.MV operations.
-
-Note that in the instance where the Compressed Extension is not implemented,
-MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
-Note that the behaviour is **different** from C.MV because with addi the
-predication mask to use is taken **only** from rd and is applied against
-all elements: rd[i] = rs[i].
-
-### FMV, FNEG and FABS Instructions
-
-These are identical in form to C.MV, except covering floating-point
-register copying. The same double-predication rules also apply.
-However when elwidth is not set to default the instruction is implicitly
-and automatically converted to a (vectorised) floating-point type conversion
-operation of the appropriate size covering the source and destination
-register bitwidths.
-
-(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
-
-### FCVT Instructions
-
-These are again identical in form to C.MV, except that they cover
-floating-point to integer and integer to floating-point. When element
-width in each vector is set to default, the instructions behave exactly
-as they are defined for standard RV (scalar) operations, except vectorised
-in exactly the same fashion as outlined in C.MV.
-
-However when the source or destination element width is not set to default,
-the opcode's explicit element widths are *overridden* by new definitions,
-and the opcode's element width is taken as indicative of the SIMD width
-(if applicable i.e. if packed SIMD is requested) instead.
-
-For example FCVT.S.L would normally be used to convert a 64-bit integer
-in register rs1 to a single-precision floating-point number in rd.
-If however the source rs1 is set to be a vector, where elwidth is set to
-default/2 and "packed SIMD" is enabled, then the first 32 bits of
-rs1 are converted to a floating-point number to be stored in rd's
-first element and the higher 32-bits *also* converted to floating-point
-and stored in the second. The 32 bit size comes from the fact that
-FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
-divide that by two it means that rs1 element width is to be taken as 32.
-
-Similar rules apply to the destination register.
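-
-The packed-SIMD example above can be sketched as follows (Python;
-the helper names are hypothetical, and single-precision rounding is
-ignored for brevity):

```python
# Sketch of the packed-SIMD FCVT.S.L example: with elwidth on rs1 set
# to default/2, the 64-bit source is treated as two 32-bit signed
# integers, each converted to floating-point (layout illustrative).
def to_signed32(v):
    return v - (1 << 32) if v >= (1 << 31) else v

def fcvt_s_l_packed(rs1_bits):
    """Convert the two 32-bit elements of a 64-bit register to floats."""
    lo = to_signed32(rs1_bits & 0xFFFFFFFF)
    hi = to_signed32(rs1_bits >> 32)
    return [float(lo), float(hi)]   # rd element 0, rd element 1

# 64-bit register holding elements 7 (low 32 bits) and -2 (high 32 bits)
bits = (0xFFFFFFFE << 32) | 7
# fcvt_s_l_packed(bits) == [7.0, -2.0]
```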
-
-# Exceptions
-
-> What does an ADD of two different-sized vectors do in simple-V?
-
-* if the two source operands are not the same length, throw an exception.
-* if the destination operand is also a vector, and the source is longer
- than the destination, throw an exception.
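-
-The two rules can be expressed as a small sketch (Python; the function
-name and error type are illustrative stand-ins for a hardware trap):

```python
# Sketch of the vector-length legality rules for ADD (names illustrative;
# a real implementation would raise a trap, not a Python exception).
def check_add_lengths(src1_len, src2_len, dest_len, dest_is_vector):
    if src1_len != src2_len:
        raise ValueError("source vector lengths differ")
    if dest_is_vector and src1_len > dest_len:
        raise ValueError("source vector longer than destination")

check_add_lengths(4, 4, 4, True)     # legal: all lengths match
# check_add_lengths(4, 2, 4, True)   # would raise: lengths differ
```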
-
-> And what about instructions like JALR?
-> What does jumping to a vector do?
-
-* Throw an exception. Whether that actually results in spawning threads
- as part of the trap-handling remains to be seen.
-
-# Under consideration <a name="issues"></a>
-
-From the Chennai 2018 slides the following issues were raised.
-Efforts to analyse and answer these questions are below.
-
-* Should future extra bank be included now?
-* How many Register and Predication CSRs should there be?
- (and how many in RV32E)
-* How many in M-Mode (for doing context-switch)?
-* Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-* Can CLIP be done as a CSR (mode, like elwidth)
-* SIMD saturation (etc.) also set as a mode?
-* Include src1/src2 predication on Comparison Ops?
- (same arrangement as C.MV, with same flexibility/power)
-* 8/16-bit ops is it worthwhile adding a "start offset"?
- (a bit like misaligned addressing... for registers)
- or just use predication to skip start?
-
-## Should the future (extra) bank be included (made mandatory)?
-
-Expanding the *standard* register file from 32 entries per bank to
-64 per bank would be quite an extensive architectural change. It also
-has implications for context-switching.
-
-Therefore, on balance, it is not recommended and certainly should
-not be made a *mandatory* requirement for the use of SV. SV's design
-ethos is to be minimally-disruptive for implementors to shoe-horn
-into an existing design.
-
-## How large should the Register and Predication CSR key-value stores be?
-
-This is something that definitely needs actual evaluation and for
-code to be run and the results analysed. At the time of writing
-(12jul2018) that is too early to tell. An approximate best-guess
-however would be 16 entries.
-
-RV32E however is a special case, given that SV would be highly unlikely
-(but not outside the realm of possibility) to be used there for
-performance reasons, but instead for reducing instruction count.
-The number of CSR entries therefore has to be considered extremely
-carefully.
-
-## How many CSR entries in M-Mode or S-Mode (for context-switching)?
-
-The minimum required CSR entries would be 1 for each register-bank:
-one for integer and one for floating-point. However, as shown
-in the "Context Switch Example" section, for optimal efficiency
-(minimal instructions in a low-latency situation) the CSRs for
-the context-switch should be set up *and left alone*.
-
-This means that it is not really a good idea to touch the CSRs
-used for context-switching in the M-Mode (or S-Mode) trap, so
-if there is ever demonstrated a need for vectors then there would
-need to be *at least* one more free. However just one does not make
-much sense (as one alone only covers scalar-vector ops) so it is more
-likely that at least two extra would be needed.
-
-This applies *in addition* in the RV32E case, if an RV32E implementation
-happens also to support U/S/M modes. This would be considered quite
-rare but is not outside the realm of possibility.
-
-Conclusion: all needs careful analysis and future work.
-
-## Should use of registers be allowed to "wrap" (x30 x31 x1 x2)?
-
-On balance it's a neat idea, however the benefits are not really clear.
-It would obviate the need for an exception to be raised if the VL runs
-out of registers to put things in (gets to x31, tries a non-existent
-x32 and fails); however, the "fly in the ointment" is that x0 is
-hard-coded to "zero". The increment therefore would need to be
-double-stepped to skip over x0. Some microarchitectures could run into
-difficulties (SIMD-like ones in particular) so it needs a lot more
-thought.
-
-## Can CLIP be done as a CSR (mode, like elwidth)
-
-RVV appears to be going this way. At the time of writing (12jun2018)
-it's noted that in V2.3-Draft V0.4 RVV Chapter, RVV intends to do
-clip by way of exactly this method: setting a "clip mode" in a CSR.
-
-No details are given, however the most sensible thing to have would be
-to extend the 16-bit Register CSR table to 24-bit (or 32-bit) and have
-extra bits specifying the type of clipping to be carried out, on
-a per-register basis. Other bits may be used for other purposes
-(see SIMD saturation below).
-
-## SIMD saturation (etc.) also set as a mode?