+# Note on implementation of parallelism
+
+One extremely important aspect of this proposal is to respect and support
+implementors' desire to focus on power, area or performance. In that regard,
+it is proposed that implementors be free to choose whether to implement
+the Vector (or variable-width SIMD) parallelism as sequential operations
+with a single ALU, fully parallel (if practical) with multiple ALUs, or
+a hybrid combination of both.
+
+Broadcom's Videocore-IV takes the hybrid approach, which Broadcom calls
+"Virtual Parallelism". It achieves 16-way SIMD at an **instruction** level
+by combining a 4-way parallel ALU *with* an externally transparent loop
+that feeds 4 sequential sets of data into each of the 4 ALUs.
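+
+The hybrid scheme described above can be sketched in a few lines of Python.
+This is only an illustration under stated assumptions, not Broadcom's design:
+the function name `virtual_simd_add` and the `alu_width` parameter are
+hypothetical, and an add stands in for whatever operation the ALU performs.
+

```python
def virtual_simd_add(a, b, alu_width=4):
    """Add two equal-length vectors with an ALU only alu_width lanes wide.

    Returns the result and the number of passes through the ALU.
    """
    assert len(a) == len(b)
    result = []
    passes = 0
    # Hidden loop: each pass pushes one group of lanes through the ALU.
    for start in range(0, len(a), alu_width):
        lanes_a = a[start:start + alu_width]
        lanes_b = b[start:start + alu_width]
        result.extend(x + y for x, y in zip(lanes_a, lanes_b))
        passes += 1
    return result, passes


# A 16-element operation on a 4-lane ALU takes 4 passes.
out, passes = virtual_simd_add(list(range(16)), [10] * 16)
```

+
+Changing `alu_width` to 1 (fully sequential) or 16 (fully parallel) changes
+only the number of passes, never the result, which is the property the
+proposal relies on.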
+
+Also in the same core, it is worth noting that particularly uncommon
+but essential operations (Reciprocal-Square-Root for example) are
+*not* part of the 4-way parallel ALU but instead have a *single* ALU.
+Under the proposed Vector (variable-width SIMD) scheme, implementors would
+be free to do precisely that: i.e. free to choose *on a per operation
+basis* whether and how much "Virtual Parallelism" to deploy.
+
+It is proposed, and absolutely critical, that such choices MUST be
+**entirely transparent** to the end-user and the compiler. Whilst
+a Vector (variable-width SIMD) may not precisely match the width of the
+parallelism within the implementation, the end-user **should not care**:
+the performance benefits are gained while the ISA remains
+straightforward. All that happens at the end of an instruction run is that
+some parallel units (if there are any) remain offline, completely
+transparently to the ISA, the program, and the compiler.
+
+To make that clear: should an implementor choose a particularly wide
+SIMD-style ALU, each parallel unit *must* have predication so that
+the parallel SIMD ALU may emulate variable-length parallel operations.
+Thus the "SIMD considered harmful" trap of having huge complexity and extra
+instructions to deal with corner-cases is avoided, and implementors
+get to choose precisely where to focus and target the benefits of their
+implementation efforts, without "extra baggage".
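+
+As a rough sketch of why per-lane predication is enough, the following
+Python models a fixed-width ALU handling a vector length that is not a
+multiple of its width. The name `simd_op_with_tail` and the 8-lane default
+are assumptions made for illustration only.
+

```python
def simd_op_with_tail(dest, a, b, vl, lanes=8):
    """Add the first vl elements of a and b using a fixed 'lanes'-wide ALU."""
    for start in range(0, vl, lanes):
        # Per-lane enable bits: lanes beyond the vector length stay off.
        enable = [(start + lane) < vl for lane in range(lanes)]
        for lane, on in enumerate(enable):
            if on:
                dest[start + lane] = a[start + lane] + b[start + lane]
    return dest


# vl=11 on an 8-lane ALU: the second pass uses 3 lanes and idles the other 5.
dest = simd_op_with_tail([0] * 12, list(range(12)), [100] * 12, vl=11)
```

+
+No separate tail-handling instruction is involved: the disabled lanes
+simply do nothing, and element 11 of `dest` is left untouched.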
+
+In addition, implementors will be free to choose anything from an
+absolute bare minimum level of compliance with the "API" (software-traps
+when vectorisation is detected) all the way up to full supercomputing-level
+all-hardware parallelism. Options are covered in the Appendix.
+
+# CSRs <a name="csrs"></a>
+
+Two CSR tables are needed to create the lookup tables used at the
+register decode phase. Between them, they hold the following information:
+
+* Integer Register N is Vector
+* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
+* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
+* Integer Register N is a Predication Register (note: a key-value store)
+
+Also (see Appendix, "Context Switch Example") it may turn out to be important
+to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
+Vectorised LOAD / STORE may be used to load and store multiple registers:
+something that is missing from the Base RV ISA.
+
+Notes:
+
+* for the purposes of LOAD / STORE, Integer Registers which are
+ marked as a Vector will result in a Vector LOAD / STORE.
+* Vector Lengths are *not* the same as vsetl but are an integral part
+ of vsetl.
+* Actual vector length is *multiplied* by how many blocks of length
+ "bitwidth" may fit into an XLEN-sized register file.
+* Predication is a key-value store due to the implicit referencing,
+ as opposed to having the predicate register explicitly in the instruction.
+* Whilst the predication CSR is a key-value store it *generates* easier-to-use
+ state information.
+* TODO: assess whether the same technique could be applied to the other
+ Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
+ V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
+ needed for context-switches (empty slots need never be stored).
+
+## Predication CSR <a name="predication_csr_table"></a>
+
+The Predication CSR is a key-value store indicating whether, if a given
+destination register (integer or floating-point) is referred to in an
+instruction, it is to be predicated. However it is important to note
+that the *actual* register is *different* from the one that ends up
+being used, due to the level of indirection through the lookup table.
+This includes (as a future option) redirecting to a *second* bank of
+integer registers.
+
+* regidx is the actual register that in combination with the
+ i/f flag, if that integer or floating-point register is referred to,
+ results in the lookup table being referenced to find the predication
+ mask to use on the operation in which that (regidx) register has
+ been used
+* predidx (in combination with the bank bit in the future) is the
+ *actual* register to be used for the predication mask. Note:
+ in effect predidx is actually a 6-bit register address, as the bank
+ bit is the MSB (and is nominally set to zero for now).
+* inv indicates that the predication mask bits are to be inverted
+ prior to use *without* actually modifying the contents of the
+ register itself.
+* zeroing is either 1 or 0. If set to 1, the operation must
+ place zeros in any element position where the predication mask is
+ set to zero. If zeroing is set to 0, unpredicated elements *must*
+ be left alone. Some microarchitectures may choose to interpret
+ this as skipping the operation entirely. Others which wish to
+ stick more closely to a SIMD architecture may choose instead to
+ interpret unpredicated elements as an internal "copy element"
+ operation (which would be necessary in SIMD microarchitectures
+ that perform register-renaming).
+
+| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
+| ----- | - | - | - | - | ------- | ------- |
+| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
+| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
+| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
+| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |
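+
+To make the bit layout concrete, here is a minimal Python sketch that
+unpacks one 14-bit table entry into its fields. The names `PredEntry` and
+`decode_pred_csr` are hypothetical, and the polarity of the i/f bit
+(0 = integer, 1 = floating-point) is an assumption.
+

```python
from collections import namedtuple

PredEntry = namedtuple("PredEntry", "bank zero inv is_fp regidx predidx")


def decode_pred_csr(value):
    """Split one Predication CSR entry into fields, per the table layout."""
    return PredEntry(
        bank=(value >> 13) & 1,      # bit 13
        zero=(value >> 12) & 1,      # bit 12
        inv=(value >> 11) & 1,       # bit 11
        is_fp=(value >> 10) & 1,     # bit 10: assumed 0=int, 1=fp
        regidx=(value >> 5) & 0x1F,  # bits 9..5: register named in the opcode
        predidx=value & 0x1F,        # bits 4..0: register holding the mask
    )


# Integer register 5 is predicated by register 9, with zeroing enabled.
entry = decode_pred_csr((1 << 12) | (5 << 5) | 9)
```
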
+
+The Predication CSR Table is a key-value store, so implementation-wise
+it will be faster to turn the table around (maintain topologically
+equivalent state):
+
+    struct pred {
+        bool zero;
+        bool inv;
+        bool bank;    // 0 for now, 1=rsvd
+        bool enabled;
+        int predidx;  // redirection: actual int register to use
+    };
+
+    struct pred fp_pred_reg[32];   // 64 in future (bank=1)
+    struct pred int_pred_reg[32];  // 64 in future (bank=1)
+
+    for (i = 0; i < 16; i++)
+        // "type" is the i/f bit of the table: 0 selects the integer table
+        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
+        idx = CSRpred[i].regidx
+        tb[idx].zero = CSRpred[i].zero
+        tb[idx].inv = CSRpred[i].inv
+        tb[idx].bank = CSRpred[i].bank
+        tb[idx].predidx = CSRpred[i].predidx
+        tb[idx].enabled = true
+
+So when an operation is to be predicated, it is the internal state that
+is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
+pseudo-code for operations is given, where p is the explicit (direct)
+reference to the predication register to be used:
+
+    for (int i=0; i<vl; ++i)
+        if ([!]preg[p][i])
+            (d ? vreg[rd][i] : sreg[rd]) =
+                iop(s1 ? vreg[rs1][i] : sreg[rs1],
+                    s2 ? vreg[rs2][i] : sreg[rs2]); // for insts with 2 inputs
+
+This instead becomes an *indirect* reference using the *internal* state
+table generated from the Predication CSR key-value store, which is used
+as follows.
+
+    if type(iop) == INT:
+        preg = int_pred_reg[rd]
+    else:
+        preg = fp_pred_reg[rd]
+
+    for (int i=0; i<vl; ++i)
+        predidx = preg.predidx;   // the indirection takes place HERE
+        if (!preg.enabled)
+            predicate = ~0x0;     // all parallel ops enabled
+        else:
+            predicate = intregfile[predidx]; // get actual reg contents HERE
+            if (preg.inv)         // invert if requested
+                predicate = ~predicate;
+        if (predicate & (1<<i))   // bitwise test of element i's mask bit
+            (d ? regfile[rd+i] : regfile[rd]) =
+                iop(s1 ? regfile[rs1+i] : regfile[rs1],
+                    s2 ? regfile[rs2+i] : regfile[rs2]); // for insts with 2 inputs
+        else if (preg.zero)
+            // TODO: place zero in dest reg
+
+Note:
+
+* d, s1 and s2 are booleans indicating whether destination,
+ source1 and source2 are vector or scalar
+* key-value CSR-redirection of rd, rs1 and rs2 has NOT been included
+ above, for clarity. rd, rs1 and rs2 must ALSO go through
+ register-level redirection (from the Register CSR table) if they are
+ vectors.
+
+If written as a function, obtaining the predication mask (but not whether
+zeroing takes place) may be done as follows:
+
+    def get_pred_val(bool is_fp_op, int reg):
+        tb = fp_pred_reg if is_fp_op else int_pred_reg
+        if (!tb[reg].enabled):
+            return ~0x0                  // all ops enabled
+        predidx = tb[reg].predidx       // redirection occurs HERE
+        predicate = intregfile[predidx] // actual predicate HERE
+        if (tb[reg].inv):
+            predicate = ~predicate       // invert ALL bits
+        return predicate
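+
+For readers who want something executable, the following is a minimal
+Python sketch of the indirect predicate lookup driving a vector add. It
+rests on several assumptions: XLEN is 32, all three operands are vectors in
+a single integer register file, register-level redirection is omitted, and
+zeroing writes 0 to masked-out destination elements (the behaviour left as
+a TODO above). The names `Pred`, `predicated_add` and `demo` are
+hypothetical.
+

```python
from dataclasses import dataclass

XLEN = 32  # assumed register width for this sketch


@dataclass
class Pred:
    enabled: bool = False
    zero: bool = False
    inv: bool = False
    predidx: int = 0


def get_pred_val(tb, intregfile, reg):
    """Return the predicate mask that applies to destination register reg."""
    all_ones = (1 << XLEN) - 1
    entry = tb[reg]
    if not entry.enabled:
        return all_ones                    # every element enabled
    predicate = intregfile[entry.predidx]  # indirection happens here
    if entry.inv:
        predicate = ~predicate & all_ones  # invert without touching the reg
    return predicate


def predicated_add(regfile, tb, rd, rs1, rs2, vl):
    mask = get_pred_val(tb, regfile, rd)
    for i in range(vl):
        if mask & (1 << i):
            regfile[rd + i] = regfile[rs1 + i] + regfile[rs2 + i]
        elif tb[rd].zero:
            regfile[rd + i] = 0            # assumed zeroing behaviour


def demo(zero=False, inv=False):
    regfile = [0] * 32
    regfile[5] = 0b0101                    # predicate mask lives in x5
    regfile[8:12] = [99] * 4               # old destination contents
    regfile[16:20] = [1, 2, 3, 4]
    regfile[24:28] = [10, 20, 30, 40]
    tb = [Pred() for _ in range(32)]
    tb[8] = Pred(enabled=True, zero=zero, inv=inv, predidx=5)
    predicated_add(regfile, tb, rd=8, rs1=16, rs2=24, vl=4)
    return regfile[8:12]
```

+
+With these inputs, `demo()` leaves the masked-out elements at their old
+value of 99, while `demo(zero=True)` clears them instead.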
+
+## MAXVECTORLENGTH
+
+MAXVECTORLENGTH is the same concept as MVL in RVV. However in Simple-V,
+given that its primary (base, unextended) purpose is for 3D, Video and
+other purposes (not requiring supercomputing capability), it makes sense
+to limit MAXVECTORLENGTH to the regfile bitwidth (32 for RV32, 64 for RV64
+and so on).
+
+The reason for setting this limit is so that predication registers, when
+marked as such, may fit into a single register as opposed to fanning out
+over several registers. This keeps the implementation a little simpler.
+Note (as also described in the VSETVL section) that the *minimum*
+for MAXVECTORLENGTH must be the total number of registers (15 for RV32E
+and 31 for RV32 or RV64).
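+
+The upper limit can be illustrated with a short Python sketch: if the
+vector length never exceeds XLEN, a one-bit-per-element predicate mask
+always fits in a single register. The helper names `set_vl` and
+`full_predicate` are hypothetical, and taking MAXVECTORLENGTH to be exactly
+XLEN is an assumption of this sketch.
+

```python
def set_vl(requested, xlen):
    """Clamp a requested vector length to the assumed MAXVECTORLENGTH."""
    max_vector_length = xlen  # assumption: the cap equals the register width
    return min(requested, max_vector_length)


def full_predicate(vl):
    """Mask with one bit set per active element."""
    return (1 << vl) - 1


# Even an oversized request yields a mask no wider than one RV32 register.
mask = full_predicate(set_vl(100, xlen=32))
```
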
+
+Note that RVV on top of Simple-V may choose to over-ride this decision.
+
+## Register CSR key-value (CAM) table
+
+The purpose of the Register CSR table is four-fold:
+
+* To mark integer and floating-point registers as requiring "redirection"
+ if they are ever used as a source or destination in any given operation.
+ This involves a level of indirection through a 5-to-6-bit lookup table
+ (where the 6th bit - bank - is always set to 0 for now).
+* To indicate whether, after redirection through the lookup table, the
+ register is a vector (or remains a scalar).
+* To over-ride the implicit or explicit bitwidth that the operation would
+ normally give the register.
+* To indicate if the register is to be interpreted as "packed" (SIMD)
+ i.e. containing multiple contiguous elements of size equal to "bitwidth".
+
+| RgCSR | 15 | 14 | 13 | (12..11) | 10 | (9..5) | (4..0) |
+| ----- | - | - | - | - | - | ------- | ------- |
+| 0 | simd0 | bank0 | isvec0 | vew0 | i/f | regkey | regidx |
+| 1 | simd1 | bank1 | isvec1 | vew1 | i/f | regkey | regidx |
+| .. | simd.. | bank.. | isvec.. | vew.. | i/f | regkey | regidx |
+| 15 | simd15 | bank15 | isvec15 | vew15 | i/f | regkey | regidx |
+
+vew may be one of the following (giving a table "bytestable", used below):
+
+| vew | bitwidth |
+| --- | --------- |
+| 00 | default |
+| 01 | default/2 |
+| 10 | 8 |
+| 11 | 16 |
+
+Extending this table (with extra bits) is covered in the section
+"Implementing RVV on top of Simple-V".
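+
+As an illustration of how one 16-bit Register CSR entry and the vew table
+fit together, here is a small Python sketch. The names `RegEntry`,
+`decode_reg_csr` and `element_bitwidth` are hypothetical; the two 5-bit
+fields are called `key` (bits 9..5, the register named in the opcode) and
+`redirect` (bits 4..0) purely for this example, and "default" is taken to
+mean the operation's normal width in bits.
+

```python
from collections import namedtuple

RegEntry = namedtuple("RegEntry", "simd bank isvec vew is_fp key redirect")


def decode_reg_csr(value):
    """Split one Register CSR entry into fields, per the table layout."""
    return RegEntry(
        simd=(value >> 15) & 1,
        bank=(value >> 14) & 1,
        isvec=(value >> 13) & 1,
        vew=(value >> 11) & 0b11,
        is_fp=(value >> 10) & 1,    # assumed 0=int, 1=fp
        key=(value >> 5) & 0x1F,
        redirect=value & 0x1F,
    )


def element_bitwidth(vew, default_bits):
    """Map the 2-bit vew field to an element width in bits."""
    return (default_bits, default_bits // 2, 8, 16)[vew]


# Packed vector of 16-bit elements: opcode register 3 redirects to 12.
raw = (1 << 15) | (1 << 13) | (0b11 << 11) | (3 << 5) | 12
entry = decode_reg_csr(raw)
width = element_bitwidth(entry.vew, default_bits=64)
```
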
+
+As the above table is a CAM (key-value store) it may be appropriate
+to expand it as follows:
+
+    struct vectorised fp_vec[32], int_vec[32]; // 64 in future
+
+    for (i = 0; i < 16; i++) // 16 CSRs?
+        tb = int_vec if CSRvec[i].type == 0 else fp_vec // i/f bit
+        idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode
+        tb[idx].elwidth = CSRvec[i].elwidth   // vew field
+        tb[idx].regidx = CSRvec[i].regidx     // indirection
+        tb[idx].isvector = CSRvec[i].isvector // isvec field, 0=scalar
+        tb[idx].packed = CSRvec[i].packed     // simd field: SIMD or not
+        tb[idx].bank = CSRvec[i].bank         // 0 (1=rsvd)
+
+TODO: move elsewhere
+
+    # TODO: use elsewhere (retire for now)
+    vew = CSRbitwidth[rs1]
+    if (vew == 0):
+        bytesperreg = (XLEN/8) # or FLEN as appropriate
+    elif (vew == 1):
+        bytesperreg = (XLEN/4) # or FLEN/2 as appropriate
+    else:
+        bytesperreg = bytestable[vew] # 8 or 16
+    simdmult = (XLEN/8) / bytesperreg # or FLEN as appropriate
+    vlen = CSRvectorlen[rs1] * simdmult
+    CSRvlength = MIN(MIN(vlen, MAXVECTORLENGTH), rs2)
+
+The reason for multiplying the vector length by the number of SIMD elements
+(in each individual register) is so that each SIMD element may optionally be
+predicated.
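+
+A worked example may help here. The sketch below assumes element widths
+are given in bits, XLEN is 64, and the result is capped at a maximum vector
+length equal to XLEN; the function names are hypothetical and this is one
+reading of the multiplication rule, not a definitive algorithm.
+

```python
def elements_per_register(bitwidth, xlen=64):
    """How many packed elements of 'bitwidth' bits fit in one register."""
    return xlen // bitwidth


def effective_vector_length(vl_regs, bitwidth, xlen=64, mvl=None):
    """Vector length counted in elements, so each one gets a predicate bit."""
    if mvl is None:
        mvl = xlen  # assumed cap: one predicate bit per element, one register
    return min(vl_regs * elements_per_register(bitwidth, xlen), mvl)


# 3 registers of packed 16-bit elements on RV64: 4 per register, 12 in all.
n = effective_vector_length(vl_regs=3, bitwidth=16)
```

+
+Twelve elements need twelve predicate bits, which is why the length is
+counted per element rather than per register.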
+
+An example of how to subdivide the register file when bitwidth != default
+is given in the section "Bitwidth Virtual Register Reordering".
+
+# Instructions
+
+Despite being a 98% complete and accurate topological remap of RVV
+concepts and functionality, the only instructions needed are VSETVL
+and VGETVL. *All* RVV instructions can be re-mapped, and despite the
+removal of all but VSETVL and VGETVL they *retain their complete
+functionality, intact*. However, xBitManip becomes a critical dependency
+for efficient manipulation of predication masks (as a bit-field).
+
+Three instructions, VSELECT, VCLIP and VCLIPI, do not have RV Standard
+equivalents, so are left out of Simple-V. VSELECT could be included if
+there existed a MV.X instruction in RV (MV.X is a hypothetical
+non-immediate variant of MV that would allow another register to
+specify which register was to be copied). Note that if any of these three
+instructions are added to any given RV extension, their functionality
+will be inherently parallelised.
+
+## Instruction Format