    for (int i=0; i<vl; ++i)
        if ([!]preg[p][i])
            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
                              s2 ? vreg[rs2][i] : sreg[rs2]);

With associated predication, vector-length adjustments and so on,
and temporarily ignoring bitwidth (which makes the comparisons more
complex), this becomes:

    s1 = reg_is_vectorised(src1);
    s2 = reg_is_vectorised(src2);

    if not s1 && not s2
        if cmp(rs1, rs2) # scalar compare
            goto branch
        return

    preg = int_pred_reg[rd]
    reg = int_regfile

    ps = get_pred_val(I/F==INT, rs1);
    rd = get_pred_val(I/F==INT, rs2); # this may not exist

    if not exists(rd) or zeroing:
        result = 0
    else
        result = preg[rd]

    for (int i = 0; i < VL; ++i)
        if (zeroing)
            if not (ps & (1<<i))
                result &= ~(1<<i);
        else if (ps & (1<<i))
            if (cmp(s1 ? reg[src1+i] : reg[src1],
                    s2 ? reg[src2+i] : reg[src2]))
                result |= 1<<i;
            else
                result &= ~(1<<i);

    if not exists(rd)
        if result == ps
            goto branch
    else
        preg[rd] = result # store in destination
        if preg[rd] == ps
            goto branch

Notes:

* Predicated SIMD comparisons would break src1 and src2 further down
  into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
  Reordering"), setting Vector-Length times (number of SIMD elements)
  bits in Predicate Register rd, as opposed to just Vector-Length bits.
* The execution of "parallelised" instructions **must** be implemented
  as "re-entrant" (to use a term from software). If an exception (trap)
  occurs in the middle of a vectorised
  Branch (now an SV predicated compare) operation, the partial results
  of any comparisons must be written out to the destination
  register before the trap is permitted to begin. If however there
  is no predicate target register, the **entire** set of comparisons must
  be **restarted**, with the offset loop indices set back to zero. This
  is because there is no place to store the temporary result during
  the handling of traps.

TODO: predication now taken from src2. also branch goes ahead
if all compares are successful.

Note also that, whereas predication normally requires the register being
used to have a CSR register entry in order for the **predication** CSR
register entry to be active, for branches this is **not** the case:
src2 does **not** have to have its CSR register entry marked as active
in order for predication on src2 to be active.

Also note: SV Branch operations are **not** twin-predicated
(see Twin Predication section). This would require three
element offsets: one to track src1, one to track src2 and a third
to track where to store the accumulation of the results. Given
that the element offsets need to be exposed via CSRs so that
the parallel hardware looping may be made re-entrant on traps
and exceptions, the decision was made not to make SV Branches
twin-predicated.

### Floating-point Comparisons

There are no floating-point branch operations, only compares.
Interestingly no change is needed to the instruction format because
FP Compare already stores a 1 or a zero in its "rd" integer register
target, i.e. it's not actually a Branch at all: it's a compare.

In RV (scalar) Base, a branch on a floating-point compare is
done via the sequence "FEQ x1, f0, f5; BEQ x1, x0, #jumploc".
This does extend to SV, as long as x1 (in the example sequence given)
is vectorised. When that is the case, x1..x(1+VL-1) will also be
set to 0 or 1 depending on whether f0==f5, f1==f6, f2==f7 and so on.
The BEQ that follows will *also* compare x1==x0, x2==x0, x3==x0 and
so on. Consequently, unlike integer-branch, FP Compare needs no
modification in its behaviour.

In addition, it is noted that an "FNE" instruction (the inverse of FEQ)
is missing. Whilst in ordinary branch code this is fine, because the
standard RVF compare can always be followed up with an integer BEQ or
BNE (or a compressed comparison to zero or non-zero), in predication
terms the omission has a greater impact. To deal with this, SV's
predication has had "invert" added to it.

Also: note that FP Compare may be predicated, using the destination
integer register (rd) to determine the predicate. FP Compare is **not**
a twin-predication operation, as, again, just as with SV Branches,
there are three registers involved: FP src1, FP src2 and INT rd.

### Compressed Branch Instruction

Compressed Branch instructions are, just like standard Branch instructions,
reinterpreted to be vectorised and predicated based on the source register
(rs1s) CSR entries. As however there is only the one source register,
given that c.beqz a0 is equivalent to beq a0, x0, the optional target
in which to store the results of the comparisons is taken from the CSR
predication table entries for **x0**.

The specific required use of x0 is, with a little thought, quite obvious,
although it is counterintuitive at first. Clearly it is **not** recommended
to redirect x0 with a CSR register entry, however as a means to opaquely
obtain a predication target it is the only sensible option that does not
involve additional special CSRs (or, worse, additional special opcodes).

Note also that, just as with standard branches, the 2nd source
(in this case x0 rather than src2) does **not** have to have its CSR
register table marked as "active" in order for predication to work.

## Vectorised Dual-operand instructions

There is a series of 2-operand instructions involving copying (and
sometimes alteration):

* C.MV
* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
* LOAD(-FP) and STORE(-FP)

All of these operations follow the same two-operand pattern, so it is
*both* the source *and* destination predication masks that are taken into
account. This is different from
the three-operand arithmetic instructions, where the predication mask
is taken from the *destination* register, and applied uniformly to the
elements of the source register(s), element-for-element.

The pseudo-code pattern for twin-predicated operations is as
follows:

    function op(rd, rs):
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps = get_pred_val(FALSE, rs); # predication on src
       pd = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL;):
         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++; # skip masked-out src elements
         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++; # skip masked-out dest elements
         reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
         if (int_csr[rs].isvec) i++;
         if (int_csr[rd].isvec) j++; else break

This pattern covers scalar-scalar, scalar-vector, vector-scalar
and vector-vector, and predicated variants of all of those.
Zeroing is not presently included (TODO). As such, when compared
to RVV, the twin-predicated variants of C.MV and FMV cover
**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.

Note that:

* elwidth (SIMD) is not covered in the pseudo-code above
* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
  not covered
* zero predication is also not shown (TODO).
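
To illustrate the pattern, the following is a minimal, self-contained C
sketch of the loop above (hypothetical register values and predicate
masks, with simple flags standing in for the CSR "isvec" lookups): a
scalar source with a vector destination behaves as VSPLAT, and adding a
sparse destination predicate turns it into a sparse VSPLAT.

    #include <stdio.h>
    #include <stdint.h>

    #define VL 4

    static uint64_t reg[32];  /* hypothetical flat integer register file */

    /* the twin-predicated element-copy loop from the pseudo-code above */
    static void twin_pred_mv(int rd, int rs, int rd_isvec, int rs_isvec,
                             uint32_t pd, uint32_t ps)
    {
        for (int i = 0, j = 0; i < VL && j < VL; ) {
            if (rs_isvec) while (i < VL && !(ps & (1u << i))) i++; /* skip masked src */
            if (rd_isvec) while (j < VL && !(pd & (1u << j))) j++; /* skip masked dest */
            if (i >= VL || j >= VL) break;
            reg[rd + j] = reg[rs + i];        /* the scalar MV operation */
            if (rs_isvec) i++;
            if (rd_isvec) j++; else break;    /* scalar destination: one element only */
        }
    }

    int main(void)
    {
        reg[5] = 0x99;                        /* scalar source */
        twin_pred_mv(10, 5, 1, 0, 0xF, 0xF);  /* scalar->vector, all-1s pd: VSPLAT */
        twin_pred_mv(20, 5, 1, 0, 0x5, 0xF);  /* scalar->vector, sparse pd: sparse VSPLAT */
        for (int k = 0; k < VL; k++)
            printf("x%d=%llx  x%d=%llx\n",
                   10 + k, (unsigned long long)reg[10 + k],
                   20 + k, (unsigned long long)reg[20 + k]);
        return 0;
    }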

### C.MV Instruction <a name="c_mv"></a>

There is no MV instruction in RV however there is a C.MV instruction.
It is used for copying integer-to-integer registers (vectorised FMV
is used for copying floating-point).

If either the source or the destination register is marked as a vector,
C.MV is reinterpreted to be a vectorised (multi-register) predicated
move operation. The actual instruction's format does not change:

[[!table  data="""
15    12 | 11  7 | 6  2 | 1  0 |
funct4   | rd    | rs   | op   |
4        | 5     | 5    | 2    |
C.MV     | dest  | src  | C0   |
"""]]

A simplified version of the pseudocode for this operation is as follows:

    function op_mv(rd, rs) # MV not VMV!
       rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps = get_pred_val(FALSE, rs); # predication on src
       pd = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL;):
         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
         ireg[rd+j] <= ireg[rs+i];
         if (int_csr[rs].isvec) i++;
         if (int_csr[rd].isvec) j++; else break

There are several different instructions from RVV that are covered by
this one opcode:

[[!table  data="""
src    | dest   | predication  | op             |
scalar | vector | none         | VSPLAT         |
scalar | vector | destination  | sparse VSPLAT  |
scalar | vector | 1-bit dest   | VINSERT        |
vector | scalar | 1-bit? src   | VEXTRACT       |
vector | vector | none         | VCOPY          |
vector | vector | src          | Vector Gather  |
vector | vector | dest         | Vector Scatter |
vector | vector | src & dest   | Gather/Scatter |
vector | vector | src == dest  | sparse VCOPY   |
"""]]

Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
operations with inversion on the src and dest predication for one of the
two C.MV operations.

Note that in the instance where the Compressed Extension is not implemented,
MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
Note that the behaviour is **different** from C.MV because with addi the
predication mask to use is taken **only** from rd and is applied against
all elements: rd[i] = rs[i].
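
As a contrast with the twin-predicated sketch shown earlier, a minimal
sketch of the addi-based MV behaviour just described (reusing the
hypothetical `reg`, `VL` and predicate-mask conventions from that sketch):
the single mask taken from rd is applied element-for-element, with no
separate source predicate and no element skipping.

    /* single-predicated MV via "addi rd, rs, 0": mask from rd only */
    static void single_pred_mv(int rd, int rs, uint32_t pd)
    {
        for (int i = 0; i < VL; i++)
            if (pd & (1u << i))
                reg[rd + i] = reg[rs + i];
    }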

### FMV, FNEG and FABS Instructions

These are identical in form to C.MV, except covering floating-point
register copying. The same double-predication rules also apply.
However when elwidth is not set to default the instruction is implicitly
and automatically converted to a (vectorised) floating-point type conversion
operation of the appropriate size covering the source and destination
register bitwidths.

(Note that FMV, FNEG and FABS are all actually pseudo-instructions)

### FCVT Instructions

These are again identical in form to C.MV, except that they cover
floating-point to integer and integer to floating-point. When element
width in each vector is set to default, the instructions behave exactly
as they are defined for standard RV (scalar) operations, except vectorised
in exactly the same fashion as outlined in C.MV.

However when the source or destination element width is not set to default,
the opcode's explicit element widths are *over-ridden* to new definitions,
and the opcode's element width is taken as indicative of the SIMD width
(if applicable i.e. if packed SIMD is requested) instead.

For example FCVT.S.L would normally be used to convert a 64-bit
integer in register rs1 to a single-precision floating-point number in rd.
If however the source rs1 is set to be a vector, where elwidth is set to
default/2 and "packed SIMD" is enabled, then the first 32 bits of
rs1 are converted to a floating-point number to be stored in rd's
first element, and the higher 32 bits are *also* converted to floating-point
and stored in the second. The 32-bit size comes from the fact that
FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
divide that by two it means that rs1's element width is to be taken as 32.

Similar rules apply to the destination register.
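
A minimal, self-contained C sketch of the packed-SIMD FCVT.S.L example
above (hypothetical values; the two 32-bit elements are assumed to be the
low and high halves of the 64-bit rs1 register):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* rs1 elwidth = default/2 (32-bit on RV64), packed SIMD enabled:
           one 64-bit source register holds two 32-bit integer elements */
        uint64_t rs1 = 0x0000000500000003ULL;        /* elements: 3 and 5 */
        int32_t  lo  = (int32_t)(rs1 & 0xFFFFFFFFu);
        int32_t  hi  = (int32_t)(rs1 >> 32);

        float rd_el0 = (float)lo;   /* conversion result for rd's first element  */
        float rd_el1 = (float)hi;   /* conversion result for rd's second element */

        printf("%f %f\n", rd_el0, rd_el1);
        return 0;
    }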

## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>

An earlier draft of SV modified the behaviour of LOAD/STORE (modified
the interpretation of the instruction fields). This
actually undermined the fundamental principle of SV, namely that there
be no modifications to the scalar behaviour (except where absolutely
necessary), in order to simplify an implementor's task if considering
converting a pre-existing scalar design to support parallelism.

So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
does not change in SV; however, just as with C.MV, it is important to note
that dual-predication is possible.

In vectorised architectures there are usually at least two different modes
for LOAD/STORE:

* Read (or write for STORE) from sequential locations, where one
  register specifies the address, and the one address is incremented
  by a fixed amount. This is usually known as "Unit Stride" mode.
* Read (or write) from multiple indirected addresses, where the
  vector elements each specify separate and distinct addresses.

To support these different addressing modes, the CSR Register "isvector"
bit is used. So, for a LOAD, when the src register is set to
scalar, the LOADs are sequentially incremented by the src register
element width, and when the src register is set to "vector", the
elements are treated as indirection addresses. Simplified
pseudo-code would look like this:

    function op_ld(rd, rs) # LD not VLD!
       rdv = int_csr[rd].active ? int_csr[rd].regidx : rd;
       rsv = int_csr[rs].active ? int_csr[rs].regidx : rs;
       ps = get_pred_val(FALSE, rs); # predication on src
       pd = get_pred_val(FALSE, rd); # ... AND on dest
       for (int i = 0, int j = 0; i < VL && j < VL;):
         if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
         if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
         if (int_csr[rs].isvec)
           # indirect mode (multi mode): each src element is an address
           srcbase = ireg[rsv+i];
         else
           # unit stride mode: scalar rs, offset advances with the
           # destination element index j
           srcbase = ireg[rsv] + j * XLEN/8; # offset in bytes
         ireg[rdv+j] <= mem[srcbase + imm_offs];
         if (!int_csr[rs].isvec &&
             !int_csr[rd].isvec) break # scalar-scalar LD
         if (int_csr[rs].isvec) i++;
         if (int_csr[rd].isvec) j++;

Notes:

* For simplicity, zeroing and elwidth are not included in the above:
  the key focus here is the decision-making for srcbase; vectorised
  rs means use sequentially-numbered registers as the indirection
  address, and scalar rs is "offset" mode (see the worked example
  after these notes).
* The test towards the end for whether both source and destination are
  scalar is what makes the above pseudo-code provide the "standard" RV
  Base behaviour for LD operations.
* The offset in bytes (XLEN/8) changes depending on whether the
  operation is a LB (1 byte), LH (2 bytes), LW (4 bytes) or LD
  (8 bytes), and also whether the element width is over-ridden
  (see special element width section).
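
As a worked example of the two addressing modes, the following
self-contained C sketch (hypothetical base addresses, XLEN=64, imm=8)
simply prints the effective address each element would use in
unit-stride mode versus indirection mode:

    #include <stdint.h>
    #include <stdio.h>

    #define VL 4

    int main(void)
    {
        /* hypothetical contents of the source register(s) */
        uint64_t ireg[8] = {0x1000, 0x2000, 0x3000, 0x4000};
        int rs = 0, imm = 8;

        for (int i = 0; i < VL; i++) {
            /* scalar rs: base + element index * (XLEN/8 = 8 bytes) */
            uint64_t unit_stride = ireg[rs] + (uint64_t)i * 8;
            /* vector rs: element i of rs *is* the address */
            uint64_t indirect = ireg[rs + i];
            printf("element %d: unit-stride addr=%#llx  indirect addr=%#llx\n",
                   i, (unsigned long long)(unit_stride + imm),
                   (unsigned long long)(indirect + imm));
        }
        return 0;
    }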

## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>

C.LWSP / C.SWSP and floating-point etc. are also source-dest twin-predicated,
where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register.
It is therefore possible to use predicated C.LWSP to efficiently
pop registers off the stack (by predicating x2 as the source), cherry-picking
which registers to store to (by predicating the destination). Likewise
for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved.

The two modes ("unit stride" and multi-indirection) are still supported,
as with standard LD/ST. Essentially, the only difference is that the
use of x2 is hard-coded into the instruction.

**Note**: it is still possible to redirect x2 to an alternative target
register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
general-purpose LOAD/STORE operations.

## Compressed LOAD / STORE Instructions

Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE,
where the same rules and the same pseudo-code apply as for
non-compressed LOAD/STORE. Again: setting scalar or vector mode
on the src for LOAD (and on the dest for STORE) selects "Unit Stride"
or "Multi-indirection" mode, respectively.

# Element bitwidth polymorphism <a name="elwidth"></a>

Element bitwidth is best covered as its own special section, as it
is quite involved and applies uniformly across-the-board. SV restricts
bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit.

The effect of setting an element bitwidth is to re-cast each entry
in the register table, and for all memory operations involving
load/stores of certain specific sizes, to a completely different width.
Thus, in C-style terms, on an RV64 architecture, each register
effectively now looks like this:

    typedef union {
        uint8_t  b[8];
        uint16_t s[4];
        uint32_t i[2];
        uint64_t l[1];
    } reg_t;

    // integer table: assume maximum SV 7-bit regfile size
    reg_t int_regfile[128];

where the CSR Register table entry (not the instruction alone) determines
which of those union entries is to be used on each operation, and the
VL element offset in the hardware-loop specifies the index into each array.

However a naive interpretation of the data structure above masks the
fact that, when VL is set greater than 8 (for example) and the bitwidth
is 8, accessing one specific register "spills over" into the following
entries of the register file in a sequential fashion. So a much more
accurate way to reflect this would be:

    typedef union {
        uint8_t   actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
        uint8_t   b[0];            // array of type uint8_t
        uint16_t  s[0];
        uint32_t  i[0];
        uint64_t  l[0];
        uint128_t d[0];
    } reg_t;

    reg_t int_regfile[128];

where, when accessing any individual regfile[n].b entry, it is permitted
(in C) to arbitrarily over-run the *declared* length of the array (zero),
and thus "overspill" into consecutive register file entries, in a fashion
that is completely transparent to a greatly-simplified software / pseudo-code
representation.
It is however critical to note that it is clearly the responsibility of
the implementor to ensure that, towards the end of the register file,
an exception is thrown if any attempt is ever made to access beyond the
"real" register bytes.
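
A minimal, self-contained C sketch of the "overspill" indexing just
described (hypothetical helper name and trap handling; the register file
is simply viewed as one flat byte array, so an 8-bit element index that
runs past one register lands in the next one, and anything past the end
of the real register bytes raises an error):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREGS      128
    #define XLEN_BYTES 8            /* RV64 */

    static uint8_t regfile_bytes[NREGS * XLEN_BYTES];

    static uint8_t read_elwidth8(int reg, int offset)
    {
        /* byte index within the flat register file: spills from reg
           into reg+1, reg+2, ... as the offset grows */
        size_t idx = (size_t)reg * XLEN_BYTES + (size_t)offset;
        if (idx >= sizeof regfile_bytes) {
            fprintf(stderr, "trap: access beyond end of register file\n");
            exit(1);
        }
        return regfile_bytes[idx];
    }

    int main(void)
    {
        regfile_bytes[5 * XLEN_BYTES + 9] = 0xAB; /* element 9 of "x5" lives in x6 */
        printf("%#x\n", (unsigned)read_elwidth8(5, 9)); /* prints 0xab */
        return 0;
    }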

Now we may modify the pseudo-code for an operation where all element
bitwidths have been set to the same size; this pseudo-code is otherwise
identical to its "non"-polymorphic version (above):

    function op_add(rd, rs1, rs2) # add not VADD!
      ...
      ...
      for (i = 0; i < VL; i++)
           ...
           ...
           // TODO, calculate if over-run occurs, for each elwidth
           if (elwidth == 8) {
               int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
                                        int_regfile[rs2].b[irs2];
           } else if elwidth == 16 {
               int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
                                        int_regfile[rs2].s[irs2];
           } else if elwidth == 32 {
               int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
                                        int_regfile[rs2].i[irs2];
           } else { // elwidth == 64
               int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
                                        int_regfile[rs2].l[irs2];
           }
           ...
           ...

So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
registers following sequentially on from each of them) are "type-cast"
to 8-bit; for 16-bit entries likewise, and so on.

However that only covers the case where the element widths are the same.
Where the element widths are different, the following algorithm applies:

* Analyse the bitwidth of all source operands and work out the
  maximum. Record this as "maxsrcbitwidth".
* If any given source operand requires sign-extension or zero-extension
  (lb, div, rem, mul, sll, srl, sra etc.), instead of mandatory 32-bit
  sign-extension / zero-extension or whatever is specified in the standard
  RV specification, **change** that to sign-extending from the respective
  individual source operand's bitwidth from the CSR table out to
  "maxsrcbitwidth" (previously calculated).
* Following separate and distinct (optional) sign/zero-extension of all
  source operands as specifically required for that operation, carry out the
  operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
  this may be a "null" (copy) operation, and that with FCVT, the changes
  to the source and destination bitwidths may also turn FCVT effectively
  into a copy).
* If the destination operand requires sign-extension or zero-extension,
  instead of a mandatory fixed size (typically 32-bit for arithmetic,
  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
  etc.), overload the RV specification with the bitwidth from the
  destination register's elwidth entry.
* Finally, store the (optionally) sign/zero-extended value into its
  destination: memory for sb/sw etc., or an offset section of the register
  file for an arithmetic operation.

In this way, polymorphic bitwidths are achieved without requiring a
massive 64-way permutation of calculations **per opcode**, for example
(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
rd bitwidths). The pseudo-code is therefore as follows:

    typedef union {
        uint8_t  b;
        uint16_t s;
        uint32_t i;
        uint64_t l;
    } el_reg_t;

    bw(elwidth):
        if elwidth == 0:
            return xlen
        if elwidth == 1:
            return xlen / 2
        if elwidth == 2:
            return xlen * 2
        // elwidth == 3:
        return 8

    get_max_elwidth(rs1, rs2):
        return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
                   bw(int_csr[rs2].elwidth)) # again XLEN if no entry

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!int_csr[reg].isvec):
            # sign/zero-extend depending on opcode requirements, from
            # the reg's bitwidth out to the full bitwidth of the regfile
            val = sign_or_zero_extend(val, bitwidth, xlen)
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

    maxsrcwid = get_max_elwidth(rs1, rs2)  # source element width(s)
    destwid = bw(int_csr[rd].elwidth)      # destination element width
    for (i = 0; i < VL; i++)
        if (predval & 1<<i)                # predication uses intregs
           // TODO, calculate if over-run occurs, for each elwidth
           src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
           // TODO, sign/zero-extend src1 and src2 as operation requires
           if (op_requires_sign_extend_src1)
              src1 = sign_extend(src1, maxsrcwid)
           src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
           result = src1 + src2 # actual add here
           // TODO, sign/zero-extend result, as operation requires
           if (op_requires_sign_extend_dest)
              result = sign_extend(result, maxsrcwid)
           set_polymorphed_reg(rd, destwid, ird, result)
           if (!int_csr[rd].isvec) break
        if (int_csr[rd ].isvec) { ird  += 1; }
        if (int_csr[rs1].isvec) { irs1 += 1; }
        if (int_csr[rs2].isvec) { irs2 += 1; }

Whilst the specific sign-extension and zero-extension pseudocode call
details are left out, due to each operation being different, the above
should make clear the following (a small worked example is given after
this list):

* the source operands are extended out to the maximum bitwidth of all
  source operands
* the operation takes place at that maximum source bitwidth (the
  destination bitwidth is not involved at this point, at all)
* the result is extended (or potentially even truncated) before being
  stored in the destination. i.e. truncation (if required) to the
  destination width occurs **after** the operation **not** before.
* when the destination is not marked as "vectorised", the **full**
  (standard, scalar) register file entry is taken up, i.e. the
  element is either sign-extended or zero-extended to cover the
  full register bitwidth (XLEN) if it is not already XLEN bits long.
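
As a worked example, a minimal, self-contained C sketch (hypothetical
values) of those steps for an add where src1 elwidth=8, src2 elwidth=16
and the destination elwidth is 32: both sources are sign-extended to the
16-bit maximum source width, the add happens at 16 bits, and only then is
the result extended out to the 32-bit destination element:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int8_t  src1 = -1;     /* 8-bit source element  (0xFF)   */
        int16_t src2 = 256;    /* 16-bit source element (0x0100) */

        /* operation carried out at max(8, 16) = 16 bits */
        int16_t at_max_src_width = (int16_t)((int16_t)src1 + src2);

        /* result sign-extended out to the 32-bit destination element */
        int32_t dest = (int32_t)at_max_src_width;

        printf("16-bit result = %#06x, 32-bit destination element = %#010x\n",
               (unsigned)(uint16_t)at_max_src_width, (unsigned)(uint32_t)dest);
        return 0;
    }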

Implementors are entirely free to optimise the above, particularly
if it is specifically known that any given operation will complete
accurately in fewer bits, as long as the results produced are
directly equivalent and equal, for all inputs and all outputs,
to those produced by the above algorithm.

## Polymorphic floating-point operation exceptions and error-handling

For floating-point operations, conversion takes place without
raising any kind of exception. Exactly as specified in the standard
RV specification, NAN (or appropriate) is stored if the result
is beyond the range of the destination, and, again exactly as with
standard scalar RV operations, the floating-point flag is raised (FCSR).
And, again just as with scalar operations, it is software's
responsibility to check this flag. Given that the FCSR flags are
"accrued", the fact that multiple element operations could have
occurred is not a problem.

Note that it is perfectly legitimate for floating-point bitwidths of
only 8 to be specified. However whilst it is possible to apply IEEE 754
principles, no actual standard yet exists. Implementors wishing to
provide hardware-level 8-bit support rather than throw a trap to emulate
in software should contact the author of this specification before
proceeding.

## Polymorphic shift operators

A special note is needed for changing the element width of left and right
shift operators, particularly right-shift. Even for standard RV base,
in order for correct results to be returned, the second operand RS2 must
be truncated to be within the range of RS1's bitwidth. spike's implementation
of sll for example is as follows:

    WRITE_RD(sext_xlen(zext_xlen(RS1) << (RS2 & (xlen-1))));

which means: where XLEN is 32 (for RV32), restrict RS2 to cover the
range 0..31 so that RS1 will only be left-shifted by the amount that
is possible to fit into a 32-bit register. Whilst this appears not
to matter for hardware, it matters greatly in software implementations,
and it also matters where an RV64 system is set to "RV32" mode, such
that the underlying registers RS1 and RS2 comprise 64 hardware bits
each.

For SV, where each operand's element bitwidth may be over-ridden, the
rule about determining the operation's bitwidth *still applies*, being
defined as the maximum bitwidth of RS1 and RS2. *However*, this rule
**also applies to the truncation of RS2**. In other words, *after*
determining the maximum bitwidth, RS2's range must **also be truncated**
to ensure a correct answer. Example:

* RS1 is over-ridden to a 16-bit width
* RS2 is over-ridden to an 8-bit width
* RD is over-ridden to a 64-bit width
* the maximum bitwidth is thus determined to be 16-bit: max(8, 16)
* RS2 is **truncated to a range of values from 0 to 15**: RS2 & (16-1)

Pseudocode (in spike) for this example would therefore be:

    WRITE_RD(sext_xlen(zext_16bit(RS1) << (RS2 & (16-1))));

This example illustrates that considerable care therefore needs to be
taken to ensure that left and right shift operations are implemented
correctly. The key points are that:

* The operation bitwidth is determined by the maximum bitwidth
  of the *source registers*, **not** the destination register bitwidth
* The result is then sign-extended (or truncated) as appropriate.
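
A minimal, self-contained C sketch of the 16-bit/8-bit shift example
above (hypothetical values): the operation width is max(16, 8) = 16, so
the shift amount is masked with (16-1) before use, and the result is then
extended out to the 64-bit destination element:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t rs1     = 0x00F0;   /* 16-bit source element           */
        uint8_t  rs2     = 0x12;     /* 8-bit shift amount (18 decimal) */
        int      opwidth = 16;       /* max of the two source elwidths  */

        /* shift performed at the 16-bit operation width; 18 & 15 = 2 */
        uint16_t shifted = (uint16_t)(rs1 << (rs2 & (opwidth - 1)));

        /* zero-extended out to the 64-bit destination element */
        uint64_t rd = (uint64_t)shifted;

        printf("shift amount used = %d, rd = %#llx\n",
               rs2 & (opwidth - 1), (unsigned long long)rd);
        return 0;
    }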

## Polymorphic MULH/MULHU/MULHSU

MULH is designed to take the top half MSBs of a multiply that
does not fit within the range of the source operands, such that
smaller width operations may produce a full double-width multiply
in two cycles. The issue is: SV allows the source operands to
have variable bitwidth.

Here again special attention has to be paid to the rules regarding
bitwidth, which, again, are that the operation is performed at
the maximum bitwidth of the **source** registers. Therefore:

* An 8-bit x 8-bit multiply will create a 16-bit result that must
  be shifted down by 8 bits
* A 16-bit x 8-bit multiply will create a 24-bit result that must
  be shifted down by 16 bits (top 8 bits being zero)
* A 16-bit x 16-bit multiply will create a 32-bit result that must
  be shifted down by 16 bits
* A 32-bit x 16-bit multiply will create a 48-bit result that must
  be shifted down by 32 bits
* A 32-bit x 8-bit multiply will create a 40-bit result that must
  be shifted down by 32 bits

So again, just as with shift-left and shift-right, the result
is shifted down by the maximum of the two source register bitwidths.
And, exactly again, truncation or sign-extension is performed on the
result. If sign-extension is to be carried out, it is performed
from the same maximum of the two source register bitwidths out
to the result element's bitwidth.

If truncation occurs, i.e. the top MSBs of the result are lost,
this is "Officially Not Our Problem", i.e. it is assumed that the
programmer actually desires the result to be truncated. i.e. if the
programmer wanted all of the bits, they would have set the destination
elwidth to accommodate them.
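
A minimal, self-contained C sketch of the first bullet above
(hypothetical values): an 8-bit x 8-bit MULH, carried out as a full
16-bit product and then shifted down by the 8-bit maximum source width:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int8_t a = 100, b = 100;                 /* 8-bit source elements */
        int16_t full = (int16_t)a * (int16_t)b;  /* full 16-bit product   */
        int8_t  mulh = (int8_t)(full >> 8);      /* top half: MULH result */
        printf("full=%d mulh=%d\n", full, mulh); /* full=10000 mulh=39    */
        return 0;
    }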

## Polymorphic elwidth on LOAD/STORE <a name="elwidth_loadstore"></a>

Polymorphic element widths in vectorised form means that the data
being loaded (or stored) across multiple registers needs to be treated
(reinterpreted) as a contiguous stream of elwidth-wide items, where
the source register's element width is **independent** from the destination's.

This makes for a slightly more complex algorithm when using indirection
on the "addressed" register (source for LOAD and destination for STORE),
particularly given that the LOAD/STORE instruction provides important
information about the width of the data to be reinterpreted.

Let's illustrate the "load" part, where the pseudo-code for elwidth=default
was as follows, and i is the loop from 0 to VL-1:

    srcbase = ireg[rs+i];
    return mem[srcbase + imm]; // returns XLEN bits

Instead, when elwidth != default, for a LW (32-bit LOAD), elwidth-wide
chunks are taken from the source memory location addressed by the current
indexed source address register, and only when a full 32-bits-worth
are taken will the index be moved on to the next contiguous source
address register:

    bitwidth = bw(elwidth);             // source elwidth from CSR reg entry
    elsperblock = 32 / bitwidth         // 1 if bw=32, 2 if bw=16, 4 if bw=8
    srcbase = ireg[rs+i/(elsperblock)]; // integer divide
    offs = i % elsperblock;             // modulo
    return &mem[srcbase + imm + offs];  // re-cast to uint8_t*, uint16_t* etc.

Note that the constant "32" above is replaced by 8 for LB, 16 for LH, 64 for LD
and 128 for LQ.

The principle is basically exactly the same as if the srcbase were pointing
at the memory of the *register* file: memory is re-interpreted as containing
groups of elwidth-wide discrete elements.
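
As a worked example of the chunking above, a minimal, self-contained C
sketch (hypothetical values) for a LW (32-bit) load with an 8-bit source
element width: elsperblock = 32/8 = 4, so elements 0..3 take their address
from the first indirection register and elements 4..7 from the next, with
"offs" selecting the byte within each 32-bit block:

    #include <stdio.h>

    int main(void)
    {
        int opwidth = 32, bitwidth = 8;
        int elsperblock = opwidth / bitwidth;   /* 4 elements per 32-bit block */

        for (int i = 0; i < 8; i++) {
            int regidx = i / elsperblock;       /* which ireg[rs+regidx] holds the address */
            int offs   = i % elsperblock;       /* byte offset within that block           */
            printf("element %d: address register rs+%d, byte offset %d\n",
                   i, regidx, offs);
        }
        return 0;
    }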

When storing the result from a load, it's important to respect the fact
that the destination register has its *own separate element width*. Thus,
when each element is loaded (at the source element width), any sign-extension
or zero-extension (or truncation) needs to be done to the *destination*
bitwidth. Also, the storing has the exact same analogous algorithm as
above, where in fact it is just the set\_polymorphed\_reg pseudocode
(completely unchanged) used above.

One issue remains: when the source element width is **greater** than
the width of the operation, it is obvious that a single LB for example
cannot possibly obtain 16-bit-wide data. This condition may be detected
where, when using integer divide, elsperblock (the width of the LOAD
divided by the bitwidth of the element) is zero.

The issue is "fixed" by ensuring that elsperblock is a minimum of 1:

    elsperblock = max(1, LD_OP_BITWIDTH / element_bitwidth)

The elements, if the element bitwidth is larger than the LD operation's
size, will then be sign/zero-extended to the full LD operation size, as
specified by the LOAD (LBU instead of LB, LHU instead of LH), before
being passed on to the second phase.

As LOAD/STORE may be twin-predicated, it is important to note that
the rules on twin predication still apply, except where in previous
pseudo-code (elwidth=default for both source and target) it was
the *registers* that the predication was applied to, it is now the
**elements** that the predication is applied to.

Thus the full pseudocode for all LD operations may be written out
as follows:

    function LBU(rd, rs):
        load_elwidthed(rd, rs, 8, true)
    function LB(rd, rs):
        load_elwidthed(rd, rs, 8, false)
    function LH(rd, rs):
        load_elwidthed(rd, rs, 16, false)
    ...
    ...
    function LQ(rd, rs):
        load_elwidthed(rd, rs, 128, false)

    # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16..
    function load_memory(rs, imm, i, opwidth):
        elwidth = int_csr[rs].elwidth
        bitwidth = bw(elwidth);
        elsperblock = max(1, opwidth / bitwidth)
        srcbase = ireg[rs+i/(elsperblock)];
        offs = i % elsperblock;
        return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes

    function load_elwidthed(rd, rs, opwidth, unsigned):
        destwid = bw(int_csr[rd].elwidth)  # destination element width
        bitwidth = bw(int_csr[rs].elwidth) # source element width
        rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
        rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
        ps = get_pred_val(FALSE, rs); # predication on src
        pd = get_pred_val(FALSE, rd); # ... AND on dest
        for (int i = 0, int j = 0; i < VL && j < VL;):
            if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
            if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
            val = load_memory(rs, imm, i, opwidth)
            if unsigned:
                val = zero_extend(val, min(opwidth, bitwidth))
            else:
                val = sign_extend(val, min(opwidth, bitwidth))
            set_polymorphed_reg(rd, destwid, j, val)
            if (int_csr[rs].isvec) i++;
            if (int_csr[rd].isvec) j++; else break;