- for (int i=0; i<vl; ++i)
- if ([!]preg[p][i])
- preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
- s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
- s1 = reg_is_vectorised(src1);
- s2 = reg_is_vectorised(src2);
-
- if not s1 && not s2
- if cmp(rs1, rs2) # scalar compare
- goto branch
- return
-
-    preg = int_pred_reg        # predicate register file (indexed by rd below)
- reg = int_regfile
-
- ps = get_pred_val(I/F==INT, rs1);
- rd = get_pred_val(I/F==INT, rs2); # this may not exist
-
- if not exists(rd)
- temporary_result = 0
- else
- preg[rd] = 0; # initialise to zero
-
- for (int i = 0; i < VL; ++i)
-      if ((ps & (1<<i)) && cmp(s1 ? reg[src1+i]:reg[src1],
-                               s2 ? reg[src2+i]:reg[src2]))
- if not exists(rd)
- temporary_result |= 1<<i;
- else
- preg[rd] |= 1<<i; # bitfield not vector
-
- if not exists(rd)
- if temporary_result == ps
- goto branch
- else
- if preg[rd] == ps
- goto branch
-
-Notes:
-
-* zeroing has been temporarily left out of the above pseudo-code,
- for clarity
-* Predicated SIMD comparisons would break src1 and src2 further down
- into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length times (number of SIMD elements) bits
- in Predicate Register rd, as opposed to just Vector-Length bits.
-
-TODO: predication now taken from src2. also branch goes ahead
-if all compares are successful.
-
-Note also that, whilst predication normally requires the register being
-used to have an active CSR register entry before its **predication**
-CSR register entry can take effect, for branches this is **not** the
-case: src2 does **not** have to have its CSR register entry marked as
-active in order for predication on src2 to be active.
-
-### Floating-point Comparisons
-
-There are no floating-point branch operations, only compares.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it is not actually a Branch at all: it is a compare.
-Thus, no change is made to the floating-point comparison instructions
-themselves.
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing,
-and whilst in ordinary branch code this is fine because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero), in predication terms that
-becomes more of an impact. To deal with this, SV's predication has
-had "invert" added to it.
-
-### Compressed Branch Instruction
-
-Compressed Branch instructions are, just like standard Branch instructions,
-reinterpreted to be vectorised and predicated based on the source register
-(rs1s) CSR entries. As however there is only the one source register,
-given that c.beqz a0 is equivalent to beq a0,x0, the optional target
-to store the results of the comparisons is taken from CSR predication
-table entries for **x0**.
-
-The specific required use of x0 becomes, with a little thought, quite
-logical, although it is initially counterintuitive. Clearly it is **not**
-recommended to redirect
-x0 with a CSR register entry, however as a means to opaquely obtain
-a predication target it is the only sensible option that does not involve
-additional special CSRs (or, worse, additional special opcodes).
-
-Note also that, just as with standard branches, the 2nd source
-(in this case x0 rather than src2) does **not** have to have its CSR
-register table marked as "active" in order for predication to work.
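-
-A minimal sketch (illustrative only) of how this maps onto the vectorised
-branch pseudo-code given earlier: the compressed form simply substitutes
-x0 for rs2:
-
-    ps = get_pred_val(I/F==INT, rs1s)  # predication on the single source
-    rd = get_pred_val(I/F==INT, x0)    # optional predicate target: x0's entry
-    # ... the remainder is identical to the vectorised branch loop above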
-
-## Vectorised Dual-operand instructions
-
-There is a series of 2-operand instructions involving copying (and
-sometimes alteration):
-
-* C.MV
-* FMV, FNEG, FABS, FCVT, FSGNJ, FSGNJN and FSGNJX
-* C.LWSP, C.SWSP, C.LDSP, C.FLWSP etc.
-* LOAD(-FP) and STORE(-FP)
-
-All of these operations follow the same two-operand pattern, so it is
-*both* the source *and* destination predication masks that are taken into
-account. This is different from
-the three-operand arithmetic instructions, where the predication mask
-is taken from the *destination* register, and applied uniformly to the
-elements of the source register(s), element-for-element.
-
-The pseudo-code pattern for twin-predicated operations is as
-follows:
-
- function op(rd, rs):
- rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
- rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
- ps = get_pred_val(FALSE, rs); # predication on src
- pd = get_pred_val(FALSE, rd); # ... AND on dest
- for (int i = 0, int j = 0; i < VL && j < VL;):
- if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
- if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
- reg[rd+j] = SCALAR_OPERATION_ON(reg[rs+i])
- if (int_csr[rs].isvec) i++;
- if (int_csr[rd].isvec) j++;
-
-This pattern covers scalar-scalar, scalar-vector, vector-scalar
-and vector-vector, and predicated variants of all of those.
-Zeroing is not presently included (TODO). When compared
-to RVV, the twin-predicated variants of C.MV and FMV cover
-**all** standard vector operations: VINSERT, VSPLAT, VREDUCE,
-VEXTRACT, VSCATTER, VGATHER, VCOPY, and more.
-
-Note that:
-
-* elwidth (SIMD) is not covered in the pseudo-code above
-* ending the loop early in scalar cases (VINSERT, VEXTRACT) is also
- not covered
-* zero predication is also not shown (TODO).
-
-### C.MV Instruction <a name="c_mv"></a>
-
-There is no MV instruction in RV; however, there is a C.MV instruction.
-It is used for copying integer-to-integer registers (vectorised FMV
-is used for copying floating-point).
-
-If either the source or the destination register is marked as a vector,
-C.MV is reinterpreted to be a vectorised (multi-register) predicated
-move operation. The actual instruction's format does not change:
-
-[[!table data="""
-15 12 | 11 7 | 6 2 | 1 0 |
-funct4 | rd | rs | op |
-4 | 5 | 5 | 2 |
-C.MV | dest | src | C0 |
-"""]]
-
-A simplified version of the pseudocode for this operation is as follows:
-
- function op_mv(rd, rs) # MV not VMV!
- rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
- rs = int_csr[rs].active ? int_csr[rs].regidx : rs;
- ps = get_pred_val(FALSE, rs); # predication on src
- pd = get_pred_val(FALSE, rd); # ... AND on dest
- for (int i = 0, int j = 0; i < VL && j < VL;):
- if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
- if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
- ireg[rd+j] <= ireg[rs+i];
- if (int_csr[rs].isvec) i++;
- if (int_csr[rd].isvec) j++;
-
-There are several different instructions from RVV that are covered by
-this one opcode:
-
-[[!table data="""
-src | dest | predication | op |
-scalar | vector | none | VSPLAT |
-scalar | vector | destination | sparse VSPLAT |
-scalar | vector | 1-bit dest | VINSERT |
-vector | scalar | 1-bit? src | VEXTRACT |
-vector | vector | none | VCOPY |
-vector | vector | src | Vector Gather |
-vector | vector | dest | Vector Scatter |
-vector | vector | src & dest | Gather/Scatter |
-vector | vector | src == dest | sparse VCOPY |
-"""]]
-
-Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV
-operations with inversion on the src and dest predication for one of the
-two C.MV operations.
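-
-A sketch of that fused sequence (illustrative only), with p as the
-predicate applied to both src and dest of each C.MV:
-
-    # first  C.MV: src/dest predicated on  p  -> rd[i] = rs1[i] where p[i] = 1
-    # second C.MV: src/dest predicated on !p  -> rd[i] = rs2[i] where p[i] = 0
-    # net effect:  rd[i] = p[i] ? rs1[i] : rs2[i]   (i.e. VMERGE)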
-
-Note that in the instance where the Compressed Extension is not implemented,
-MV may be used, but that is a pseudo-operation mapping to addi rd, rs, 0.
-Note that the behaviour is **different** from C.MV because with addi the
-predication mask to use is taken **only** from rd and is applied against
-all elements: rd[i] = rs[i].
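-
-For clarity, a non-normative sketch of that addi-based behaviour, showing
-the single mask taken from rd:
-
-    pd = get_pred_val(FALSE, rd)        # only the destination mask is used
-    for (int i = 0; i < VL; i++)
-        if (pd & 1<<i)
-            ireg[rd+i] <= ireg[rs+i] + 0   # addi rd, rs, 0, element-for-element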
-
-### FMV, FNEG and FABS Instructions
-
-These are identical in form to C.MV, except covering floating-point
-register copying. The same double-predication rules also apply.
-However, when elwidth is not set to default, the instruction is implicitly
-and automatically converted to a (vectorised) floating-point type conversion
-operation of the appropriate size covering the source and destination
-register bitwidths.
-
-(Note that FMV, FNEG and FABS are all actually pseudo-instructions)
-
-### FCVT Instructions
-
-These are again identical in form to C.MV, except that they cover
-floating-point to integer and integer to floating-point. When element
-width in each vector is set to default, the instructions behave exactly
-as they are defined for standard RV (scalar) operations, except vectorised
-in exactly the same fashion as outlined in C.MV.
-
-However, when the source or destination element width is not set to default,
-the opcode's explicit element widths are *overridden* by the new definitions,
-and the opcode's element width is taken as indicative of the SIMD width
-(if applicable, i.e. if packed SIMD is requested) instead.
-
-For example FCVT.S.L would normally be used to convert a 64-bit
-integer in register rs1 to a (32-bit) single-precision floating-point
-number in rd.
-If however the source rs1 is set to be a vector, where elwidth is set to
-default/2 and "packed SIMD" is enabled, then the first 32 bits of
-rs1 are converted to a floating-point number to be stored in rd's
-first element and the higher 32-bits *also* converted to floating-point
-and stored in the second. The 32 bit size comes from the fact that
-FCVT.S.L's integer width is 64 bit, and with elwidth on rs1 set to
-divide that by two it means that rs1 element width is to be taken as 32.
-
-Similar rules apply to the destination register.
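-
-A non-normative sketch of the example above (the helper names are purely
-illustrative), with packed SIMD on rs1 at 32-bit element width:
-
-    for (int i = 0; i < VL; i++)
-        src = get_int_element(rs1, 32, i)        # i-th 32-bit chunk of rs1
-        set_fp_element(rd, i, int_to_float(src)) # one converted result per element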
-
-## LOAD / STORE Instructions and LOAD-FP/STORE-FP <a name="load_store"></a>
-
-An earlier draft of SV modified the behaviour of LOAD/STORE. This
-actually undermined the fundamental principle of SV, namely that there
-be no modifications to the scalar behaviour (except where absolutely
-necessary), in order to simplify an implementor's task if considering
-converting a pre-existing scalar design to support parallelism.
-
-So the original RISC-V scalar LOAD/STORE and LOAD-FP/STORE-FP functionality
-does not change in SV; however, just as with C.MV it is important to note
-that dual-predication is possible. Using the template outlined in
-the section "Vectorised dual-op instructions", the pseudo-code covering
-scalar-scalar, scalar-vector, vector-scalar and vector-vector applies,
-where SCALAR\_OPERATION is as follows, exactly as for a standard
-scalar RV LOAD operation:
-
- srcbase = ireg[rs+i];
- return mem[srcbase + imm];
-
-Whilst LOAD and STORE remain as-is when compared to their scalar
-counterparts, the incrementing on the source register (for LOAD)
-means that pointers-to-structures can be easily implemented, and
-if contiguous offsets are required, those pointers (the contents
-of the contiguous source registers) may simply be set up to point
-to contiguous locations.
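-
-As a short illustration (not normative): with both rs and rd vectorised
-and no predication, the LOAD above unrolls to:
-
-    # reg[rd+0]    = mem[ireg[rs+0]    + imm]  # each element uses its own pointer
-    # reg[rd+1]    = mem[ireg[rs+1]    + imm]
-    # ...
-    # reg[rd+VL-1] = mem[ireg[rs+VL-1] + imm]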
-
-## Compressed Stack LOAD / STORE Instructions <a name="c_ld_st"></a>
-
-C.LWSP / C.SWSP and their floating-point equivalents are also source-dest
-twin-predicated, where it is implicit in C.LWSP/C.FLWSP that x2 is the
-source register. It is therefore possible to use predicated C.LWSP to
-efficiently pop registers off the stack (by predicating x2 as the source),
-cherry-picking which registers to load into (by predicating the
-destination). Likewise for C.SWSP. In this way, LOAD/STORE-Multiple is
-efficiently achieved.
-
-However, to do so, the behaviour of C.LWSP/C.SWSP needs to be slightly
-different: where x2 is marked as vectorised, rather than incrementing
-the register on each loop iteration (x2, x3, x4...), it is the *immediate*
-that must be incremented. Pseudo-code follows:
-
- function lwsp(rd, rs):
- rd = int_csr[rd].active ? int_csr[rd].regidx : rd;
- rs = x2 # effectively no redirection on x2.
- ps = get_pred_val(FALSE, rs); # predication on src
- pd = get_pred_val(FALSE, rd); # ... AND on dest
- for (int i = 0, int j = 0; i < VL && j < VL;):
- if (int_csr[rs].isvec) while (!(ps & 1<<i)) i++;
- if (int_csr[rd].isvec) while (!(pd & 1<<j)) j++;
- reg[rd+j] = mem[x2 + ((offset+i) * 4)]
- if (int_csr[rs].isvec) i++;
- if (int_csr[rd].isvec) j++;
-
-For C.LDSP, the offset (and loop) multiplier would be 8, and for
-C.LQSP it would be 16. Effectively this makes C.LWSP etc. a Vector
-"Unit Stride" Load instruction.
-
-**Note**: it is still possible to redirect x2 to an alternative target
-register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as
-general-purpose Vector "Unit Stride" LOAD/STORE operations.
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE:
-the same rules and the same pseudo-code apply as for non-compressed
-LOAD/STORE. This is **different** from Compressed Stack
-LOAD/STORE (C.LWSP / C.SWSP), which have been augmented to become
-Vector "Unit Stride" capable.
-
-Just as with uncompressed LOAD/STORE, C.LD / C.SD increment the *register*
-during the hardware loop, **not** the offset.
-
-# Element bitwidth polymorphism <a name="elwidth"></a>
-
-Element bitwidth is best covered as its own special section, as it
-is quite involved and applies uniformly across-the-board.
-
-The effect of setting an element bitwidth is to re-cast each entry
-in the register table to a completely different width. In C-style terms,
-on an RV64 architecture, each register effectively looks like this:
-
- typedef union {
- uint8_t b[8];
- uint16_t s[4];
- uint32_t i[2];
- uint64_t l[1];
- } reg_t;
-
- // integer table: assume maximum SV 7-bit regfile size
- reg_t int_regfile[128];
-
-However this hides the fact that setting VL greater than 8, for example,
-when the bitwidth is 8, means that accessing one specific register "spills
-over" to the following parts of the register file in a sequential fashion.
-So a much more accurate way to reflect this would be:
-
- typedef union {
- uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128
- uint8_t *b;
- uint16_t *s;
- uint32_t *i;
- uint64_t *l;
- uint128_t *d;
- } reg_t;
-
- reg_t int_regfile[128];
-
-Where it is up to the implementor to ensure that, towards the end
-of the register file, an exception is thrown if any attempt is made
-to access beyond the "real" register bytes.
-
-Now we may modify the pseudo-code of an operation where all element
-bitwidths have been set to the same size; this pseudo-code is otherwise
-identical to its non-polymorphic version (above):
-
- function op_add(rd, rs1, rs2) # add not VADD!
- ...
- ...
- for (i = 0; i < VL; i++)
- ...
- ...
- // TODO, calculate if over-run occurs, for each elwidth
- if (elwidth == 8) {
-           int_regfile[rd].b[id] <= int_regfile[rs1].b[irs1] +
-                                    int_regfile[rs2].b[irs2];
- } else if elwidth == 16 {
- int_regfile[rd].s[id] <= int_regfile[rs1].s[irs1] +
- int_regfile[rs2].s[irs2];
- } else if elwidth == 32 {
- int_regfile[rd].i[id] <= int_regfile[rs1].i[irs1] +
- int_regfile[rs2].i[irs2];
- } else { // elwidth == 64
- int_regfile[rd].l[id] <= int_regfile[rs1].l[irs1] +
- int_regfile[rs2].l[irs2];
- }
- ...
- ...
-
-So here we can see clearly: for 8-bit entries, rd, rs1 and rs2 (and the
-registers that follow on sequentially from each of them) are "type-cast"
-to 8-bit; for 16-bit entries likewise, and so on.
-
-However that only covers the case where the element widths are the same.
-Where the element widths are different, the following algorithm applies:
-
-* Analyse the bitwidth of all source operands and work out the
- maximum. Record this as "maxsrcbitwidth"
-* If any given source operand requires sign-extension or zero-extension
-  (lb, div, rem, mul, sll, srl, sra etc.), instead of the mandatory 32-bit
-  sign-extension / zero-extension (or whatever is specified in the standard
-  RV specification), **change** that to sign/zero-extending from the
-  respective individual source operand's bitwidth (from the CSR table)
-  out to "maxsrcbitwidth" (previously calculated).
-* Following separate and distinct (optional) sign/zero-extension of all
- source operands as specifically required for that operation, carry out the
- operation at "maxsrcbitwidth". (Note that in the case of LOAD/STORE or MV
- this may be a "null" (copy) operation, and that with FCVT, the changes
-   to the source and destination bitwidths may also turn FCVT effectively
- into a copy).
-* If the destination operand requires sign-extension or zero-extension,
- instead of a mandatory fixed size (typically 32-bit for arithmetic,
-  for subw for example, and otherwise various: 8-bit for sb, 16-bit for sh
- etc.), overload the RV specification with the bitwidth from the
- destination register's elwidth entry.
-* Finally, store the (optionally) sign/zero-extended value into its
- destination: memory for sb/sw etc., or an offset section of the register
- file for an arithmetic operation.
-
-In this way, polymorphic bitwidths are achieved without requiring a
-massive 64-way permutation of calculations **per opcode**, for example
-(4 possible rs1 bitwidths times 4 possible rs2 bitwidths times 4 possible
-rd bitwidths). The pseudo-code is therefore as follows:
-
- typedef union {
- uint8_t b;
- uint16_t s;
- uint32_t i;
- uint64_t l;
- } el_reg_t;
-
- bw(elwidth):
- if elwidth == 0:
- return xlen
- if elwidth == 1:
- return xlen / 2
- if elwidth == 2:
- return xlen * 2
- // elwidth == 3:
- return 8
-
- get_max_elwidth(rs1, rs2):
- return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set
- bw(int_csr[rs2].elwidth)) # again XLEN if no entry
-
- get_polymorphed_reg(reg, bitwidth, offset):
- el_reg_t res;
- res.l = 0; // TODO: going to need sign-extending / zero-extending
- if bitwidth == 8:
-            res.b = int_regfile[reg].b[offset]
-        elif bitwidth == 16:
-            res.s = int_regfile[reg].s[offset]
-        elif bitwidth == 32:
-            res.i = int_regfile[reg].i[offset]
-        elif bitwidth == 64:
-            res.l = int_regfile[reg].l[offset]
- return res
-
- set_polymorphed_reg(reg, bitwidth, offset, val):
- if bitwidth == 8:
- int_regfile[reg].b[offset] = val
- elif bitwidth == 16:
- int_regfile[reg].s[offset] = val
- elif bitwidth == 32:
- int_regfile[reg].i[offset] = val
- elif bitwidth == 64:
- int_regfile[reg].l[offset] = val
-
- maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s)
-    destwid = int_csr[rd].elwidth  # destination element width
- for (i = 0; i < VL; i++)
- if (predval & 1<<i) # predication uses intregs
- // TODO, calculate if over-run occurs, for each elwidth
- src1 = get_polymorphed_reg(rs1, maxsrcwid, irs1)
- // TODO, sign/zero-extend src1 and src2 as operation requires
- if (op_requires_sign_extend_src1)
- src1 = sign_extend(src1, maxsrcwid)
-          src2 = get_polymorphed_reg(rs2, maxsrcwid, irs2)
-          if (op_requires_sign_extend_src2)
-              src2 = sign_extend(src2, maxsrcwid)
- result = src1 + src2 # actual add here
- // TODO, sign/zero-extend result, as operation requires
- if (op_requires_sign_extend_dest)
- result = sign_extend(result, maxsrcwid)
- set_polymorphed_reg(rd, destwid, ird, result)
-      if (int_csr[rd ].isvec) { ird  += 1; }
-      if (int_csr[rs1].isvec) { irs1 += 1; }
-      if (int_csr[rs2].isvec) { irs2 += 1; }
-
-Whilst specific sign-extension and zero-extension pseudocode calls
-are left out, due to each operation being different, the above should
-make clear (see the numeric illustration after this list) that:
-
-* the source operands are extended out to the maximum bitwidth of all
- source operands
-* the operation takes place at that maximum source bitwidth
-* the result is extended (or potentially even, truncated) before being
- stored in the destination. i.e. truncation (if required) to the
- destination width occurs **after** the operation **not** before.
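-
-A short numeric illustration of those three rules (signed add, rs1
-elwidth = 8, rs2 elwidth = 16, rd elwidth = 8):
-
-    # rs1 element = 0xF0   (-16), sign-extended to 16 bits  -> 0xFFF0
-    # rs2 element = 0x0014 (+20)
-    # add performed at 16 bits (the max source width)       -> 0x0004 (+4)
-    # stored to rd, truncated to the 8-bit destination width -> 0x04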
-
-For floating-point operations, conversion takes place without
-raising any kind of exception. Exactly as specified in the standard
-RV specification, NaN (or the appropriate value) is stored if the result
-is beyond the range of the destination and, just as with scalar
-operations, the relevant floating-point flag is raised in FCSR. Again
-just as with scalar operations, it is software's responsibility to check
-this flag. Given that the FCSR flags are "accrued", the fact that multiple
-element operations could have occurred is not a problem.
-
-Note that it is perfectly legitimate for floating-point bitwidths of
-only 8 to be specified. However whilst it is possible to apply IEEE 754
-principles, no actual standard yet exists. Implementors wishing to
-provide hardware-level 8-bit support rather than throw a trap to emulate
-in software should contact the author of this specification before
-proceeding.