-# Instructions
-
-Because Simple-V is a topological remap of RVV concepts, the following RVV
-instructions remain exactly the same: VMPOP, VMFIRST, VEXTRACT, VINSERT,
-VMERGE, VSELECT, VSLIDE, VCLASS and VPOPC. Two instructions, VCLIP and
-VCLIPI, do not have RV Standard equivalents, and so are left out of Simple-V.
-All other instructions from RVV are topologically re-mapped and retain
-their complete functionality.
-
-## Instruction Format
-
-The instruction format for Simple-V does not actually have *any* explicit
-compare operations, *any* arithmetic, floating point or *any*
-memory instructions.
-Instead it *overloads* pre-existing branch operations into predicated
-variants, and implicitly overloads arithmetic operations and LOAD/STORE
-depending on implicit CSR configurations for both vector length and
-bitwidth. *This includes Compressed instructions* as well as any
-future ones, *including* future Extensions.
-
-* For analysis of RVV see [[v_comparative_analysis]] which begins to
- outline topologically-equivalent mappings of instructions
-* Also see Appendix "Retro-fitting Predication into branch-explicit ISA"
- for format of Branch opcodes.
-
-**TODO**: *analyse and decide whether the implicit nature of predication
-as proposed is or is not a lot of hassle, and if explicit prefixes are
-a better idea instead. Parallelism therefore effectively may end up
-as always being 64-bit opcodes (32 for the prefix, 32 for the instruction)
-with some opportunities to use Compressed bringing it down to 48.
-Also to consider is whether one or both of the last two remaining Compressed
-instruction codes in Quadrant 1 could be used as a parallelism prefix,
-bringing parallelised opcodes down to 32-bit (when combined with C)
-and having the benefit of being explicit.*
-
-## Branch Instruction:
-
-This is the overloaded table for Integer-base Branch operations. Opcode
-(bits 6..0) is set in all cases to 1100011.
-
-[[!table data="""
-31 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 .. 8 | 7 | 6 ... 0 |
-imm[12|10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
-7 | 5 | 5 | 3 | 4 | 1 | 7 |
-reserved | src2 | src1 | BPR | predicate rs3 || BRANCH |
-reserved | src2 | src1 | 000 | predicate rs3 || BEQ |
-reserved | src2 | src1 | 001 | predicate rs3 || BNE |
-reserved | src2 | src1 | 010 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 011 | predicate rs3 || rsvd |
-reserved | src2 | src1 | 100 | predicate rs3 || BLT |
-reserved | src2 | src1 | 101 | predicate rs3 || BGE |
-reserved | src2 | src1 | 110 | predicate rs3 || BLTU |
-reserved | src2 | src1 | 111 | predicate rs3 || BGEU |
-"""]]
-
-This is the overloaded table for Floating-point Predication operations.
-Interestingly no change is needed to the instruction format because
-FP Compare already stores a 1 or a zero in its "rd" integer register
-target, i.e. it's not actually a Branch at all: it's a compare.
-The target needs to simply change to be a predication bitfield.
-
-As with Standard RVF/D/Q, Opcode (bits 6..0) is set in all cases to 1010011.
-Likewise for Single-precision, fmt (bits 26..25) is still set to 00.
-Double-precision is still set to 01, whilst Quad-precision
-appears not to have a definition in V2.3-Draft (but should be unaffected).
-
-It is however noted that an entry "FNE" (the opposite of FEQ) is missing.
-In ordinary branch code this is fine, because the standard
-RVF compare can always be followed up with an integer BEQ or a BNE (or
-a compressed comparison to zero or non-zero). In predication terms the
-impact is greater, as an explicit (scalar) instruction is needed
-to invert the predicate. An additional encoding funct3=011 is therefore
-proposed to cater for this.
-
-[[!table data="""
-31 .. 27| 26 .. 25 |24 ... 20 | 19 .. 15 | 14 .. 12 | 11 .. 7 | 6 ... 0 |
-funct5 | fmt | rs2 | rs1 | funct3 | rd | opcode |
-5 | 2 | 5 | 5 | 3 | 5 | 7 |
-10100 | 00/01/11 | src2 | src1 | 010 | pred rs3 | FEQ |
-10100 | 00/01/11 | src2 | src1 | *011* | pred rs3 | FNE |
-10100 | 00/01/11 | src2 | src1 | 001 | pred rs3 | FLT |
-10100 | 00/01/11 | src2 | src1 | 000 | pred rs3 | FLE |
-"""]]
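
The field layout in the table above can be sketched as a small decoder. This is only an illustration, assuming the bit positions exactly as tabulated; the function names `bits`, `encode_fp_compare` and `decode_fp_compare` are hypothetical and not part of the proposal.

```python
# Illustrative sketch only: field positions are taken from the table above,
# all function and constant names are hypothetical.
FCMP_OPCODE = 0b1010011
FCMP_FUNCT5 = 0b10100
# funct3 -> mnemonic; 011 is the additional FNE encoding proposed in the text.
FCMP_NAMES = {0b010: "FEQ", 0b011: "FNE", 0b001: "FLT", 0b000: "FLE"}


def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of an instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def encode_fp_compare(funct3, fmt, src2, src1, pred):
    """Pack the table's fields into a 32-bit word."""
    return (
        (FCMP_FUNCT5 << 27) | (fmt << 25) | (src2 << 20)
        | (src1 << 15) | (funct3 << 12) | (pred << 7) | FCMP_OPCODE
    )


def decode_fp_compare(word):
    """Split a 32-bit word into the table's fields, or return None."""
    if bits(word, 6, 0) != FCMP_OPCODE or bits(word, 31, 27) != FCMP_FUNCT5:
        return None
    name = FCMP_NAMES.get(bits(word, 14, 12))
    if name is None:
        return None
    return {
        "name": name,
        "fmt": bits(word, 26, 25),
        "src2": bits(word, 24, 20),
        "src1": bits(word, 19, 15),
        "pred": bits(word, 11, 7),  # rd field, reinterpreted as predicate rs3
    }
```

Round-tripping `decode_fp_compare(encode_fp_compare(0b011, 0b01, 2, 1, 3))` gives back the FNE entry with the same field values, which is a quick way to check the bit positions against the table.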
-
-Note (**TBD**): floating-point exceptions will need to be extended
-to cater for multiple exceptions (and statuses of the same). The
-usual approach is to have an array of status codes and bit-fields,
-and one exception, rather than throw separate exceptions for each
-Vector element.
-
-In Hwacha EECS-2015-262 Section 6.7.2 the following pseudocode is given
-for predicated compare operations of function "cmp":
-
-    for (int i=0; i<vl; ++i)
-        if ([!]preg[p][i])
-            preg[pd][i] = cmp(s1 ? vreg[rs1][i] : sreg[rs1],
-                              s2 ? vreg[rs2][i] : sreg[rs2]);
-
-With associated predication, vector-length adjustments and so on,
-and temporarily ignoring bitwidth (which makes the comparisons more
-complex), this becomes:
-
-    if I/F == INT: # integer type cmp
-        pred_enabled = int_pred_enabled # TODO: exception if not set!
-        preg = int_pred_reg[rd]
-    else:
-        pred_enabled = fp_pred_enabled # TODO: exception if not set!
-        preg = fp_pred_reg[rd]
-
-    s1 = CSRvectorlen[src1] > 1;
-    s2 = CSRvectorlen[src2] > 1;
-    for (int i=0; i<vl; ++i)
-        preg[rs3][i] = cmp(s1 ? reg[src1+i] : reg[src1],
-                           s2 ? reg[src2+i] : reg[src2]);
-
-Notes:
-
-* Predicated SIMD comparisons would break src1 and src2 further down
- into bitwidth-sized chunks (see Appendix "Bitwidth Virtual Register
- Reordering") setting Vector-Length * (number of SIMD elements) bits
- in Predicate Register rs3 as opposed to just Vector-Length bits.
-* Predicated Branches do not actually have an adjustment to the Program
- Counter, so bits 25 through 30 are not needed in any case.
-* There are plenty of reserved opcodes for which bits 25 through 30 could
- be put to good use if there is a suitable use-case.
-* FEQ and FNE (and BEQ and BNE) are included in order to save one
- instruction having to invert the resultant predicate bitfield.
- FLT and FLE may be inverted to FGT and FGE if needed by swapping
- src1 and src2 (likewise the integer counterparts).
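
The core loop of the predicated compare can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: the `vectorlen` dictionary is a hypothetical stand-in for the `CSRvectorlen` table, the result is returned as an integer bitfield rather than written to a predicate register, and both bitwidth handling and masking by an existing predicate are left out.

```python
import operator


def predicated_compare(cmp, regs, vectorlen, src1, src2, vl):
    """Return a vl-bit predicate bitfield, bit i holding cmp() for element i.

    regs      -- flat register file (list of values)
    vectorlen -- hypothetical stand-in for CSRvectorlen: register number ->
                 vector length; registers not listed are treated as scalar
    """
    s1 = vectorlen.get(src1, 1) > 1  # is src1 marked as a vector?
    s2 = vectorlen.get(src2, 1) > 1  # is src2 marked as a vector?
    pred = 0
    for i in range(vl):
        a = regs[src1 + i] if s1 else regs[src1]
        b = regs[src2 + i] if s2 else regs[src2]
        if cmp(a, b):
            pred |= 1 << i
    return pred


# Example: a 4-element vector starting at register 4, compared to scalar register 10.
regs = [0] * 32
regs[4:8] = [1, 5, 3, 9]
regs[10] = 4
less_than = predicated_compare(operator.lt, regs, {4: 4}, 4, 10, 4)
# Swapping src1 and src2 turns LT into GT, as described in the notes above.
greater_than = predicated_compare(operator.lt, regs, {4: 4}, 10, 4, 4)
```

With these values `less_than` has bits 0 and 2 set and `greater_than` has bits 1 and 3 set, so the operand swap yields the inverse predicate here without a separate invert instruction (elements equal to the scalar would be clear in both).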
-
-## Compressed Branch Instruction:
-
-[[!table data="""
-15..13 | 12...10 | 9..7 | 6..5 | 4..2 | 1..0 | name |
-funct3 | imm | rs1' | imm | | op | |
-3 | 3 | 3 | 2 | 3 | 2 | |
-C.BPR | pred rs3 | src1 | I/F B | src2 | C1 | |
-110 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.EQ |
-111 | pred rs3 | src1 | I/F 0 | src2 | C1 | P.NE |
-110 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LT |
-111 | pred rs3 | src1 | I/F 1 | src2 | C1 | P.LE |
-"""]]
-
-Notes:
-
-* Bits 5, 13, 14 and 15 make up the comparator type
-* In both floating-point and integer cases there are four predication
- comparators: EQ/NE/LT/LE (with GT and GE being synthesised by swapping
- src1 and src2).
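
As a rough illustration of how bits 5, 13, 14 and 15 select the comparator, the table can be turned into a lookup. The names below are hypothetical. The value 01 for "C1" is the standard Compressed Quadrant 1 op encoding. The meaning of each I/F value is not given in the table, so the raw bit is returned uninterpreted.

```python
# Illustrative sketch only: a decoder for the compressed predication format
# tabulated above. Names are hypothetical.
QUADRANT_C1 = 0b01

# (funct3, bit 5) -> comparator name, straight from the table rows.
P_NAMES = {
    (0b110, 0): "P.EQ",
    (0b111, 0): "P.NE",
    (0b110, 1): "P.LT",
    (0b111, 1): "P.LE",
}


def decode_compressed_predicate(half):
    """Split a 16-bit word into the table's fields, or return None."""
    if half & 0b11 != QUADRANT_C1:
        return None
    funct3 = (half >> 13) & 0b111
    b = (half >> 5) & 1
    name = P_NAMES.get((funct3, b))
    if name is None:
        return None
    return {
        "name": name,
        "pred": (half >> 10) & 0b111,  # predicate rs3, bits 12..10
        "src1": (half >> 7) & 0b111,   # bits 9..7
        "i_f": (half >> 6) & 1,        # integer/float selector, left uninterpreted
        "src2": (half >> 2) & 0b111,   # bits 4..2
    }
```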
-
-## LOAD / STORE Instructions
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction.
-
-Revised LOAD:
-
-[[!table data="""
-31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
-imm[11:0] |||| rs1 | funct3 | rd | opcode |
-1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
-? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format. Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
- (for both integer and floating-point variants).
-* The predication CSR-marking register is not explicitly shown in the
- instruction: it is implicit, based on the CSR predicate state for the
- rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
- is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-
-    pred_enabled = int_pred_enabled
-    preg = int_pred_reg[rd]
-
-    for (int i=0; i<vl; ++i)
-        if (pred_enabled && [!]preg[i])
-            for (int j=0; j<seglen+1; j++)
-            {
-                if (CSRvectorised[rs2])
-                    offs = vreg[rs2][i]
-                else
-                    offs = i*(seglen+1)*stride;
-                vreg[rd+j][i] = mem[sreg[base] + offs + j*stride];
-            }
-
-Taking CSR (SIMD) bitwidth into account involves using the vector
-length and register encoding according to the "Bitwidth Virtual Register
-Reordering" scheme shown in the Appendix (see function "regoffs").
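
The address calculation in the pseudo-code above can be exercised with a small Python model. This is a sketch under several assumptions: memory is a plain list indexed by address, the predicate is an integer bitfield (with `None` standing for "predication not enabled"), the optional predicate inversion is omitted, and CSR SIMD bitwidth is ignored just as in the pseudo-code. The function name and its parameters are hypothetical.

```python
def vector_load(mem, base_addr, vl, stride, seglen=0, index=None, pred=None):
    """Return {(j, i): value} for segment field j of element i.

    index -- per-element offsets (the "Indexed" LD.X case, rs2 marked as a
             vector); None selects unit or constant stride
    pred  -- predicate bitfield; None means predication is not enabled
    """
    loaded = {}
    for i in range(vl):
        if pred is not None and not (pred >> i) & 1:
            continue  # element i is masked out
        for j in range(seglen + 1):
            if index is not None:
                offs = index[i]
            else:
                offs = i * (seglen + 1) * stride
            loaded[(j, i)] = mem[base_addr + offs + j * stride]
    return loaded


# Toy memory in which the value at address a is 100 + a.
mem = list(range(100, 200))
strided = vector_load(mem, base_addr=8, vl=4, stride=4)
masked = vector_load(mem, base_addr=8, vl=4, stride=4, pred=0b0101)
indexed = vector_load(mem, base_addr=8, vl=4, stride=4, index=[3, 0, 7, 1])
```

Here `strided` reads addresses 8, 12, 16 and 20, `masked` reads only elements 0 and 2, and `indexed` reads addresses 11, 8, 15 and 9.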
-
-A similar instruction exists for STORE, with identical topological
-translation of all features. **TODO**
-
-## Compressed LOAD / STORE Instructions
-
-Compressed LOAD and STORE are of the same format, where for STORE
-bits 2-4 are a src register instead of dest:
-
-[[!table data="""
-15 13 | 12 10 | 9 7 | 6 5 | 4 2 | 1 0 |
-funct3 | imm | rs1' | imm | rd' | op |
-3 | 3 | 3 | 2 | 3 | 2 |
-C.LW | offset[5:3] | base | offset[2|6] | dest | C0 |
-"""]]
-
-Unfortunately it is not possible to fit the full functionality
-of vectorised LOAD / STORE into C.LD / C.ST: the "X" variants (Indexed)
-require another operand (rs2) in addition to the operand width
-(which is also missing), offset, base, and src/dest.
-
-However a close approximation may be achieved by taking the top bit
-of the offset in each of the five types of LD (and ST), reducing the
-offset to 4 bits and utilising the 5th bit to indicate whether "stride"
-is to be enabled. In this way it is at least possible to introduce
-that functionality.
-
-(**TODO**: *assess whether the loss of one bit from offset is worth having
-"stride" capability.*)
-
-We also assume (including for the "stride" variant) that the "width"
-parameter, which is missing, is derived and implicit, just as it is
-with the standard Compressed LOAD/STORE instructions. For C.LW, C.LD
-and C.LQ, the width is implicitly 4, 8 and 16 respectively, whilst for
-C.FLW and C.FLD the width is implicitly 4 and 8 respectively.
-
-Interestingly we note that the Vectorised Simple-V variant of
-LOAD/STORE (Compressed and otherwise), due to it effectively using the
-standard register file(s), is the direct functional equivalent of
-standard load-multiple and store-multiple instructions found in other
-processors.
-
-Section 12.3 of the riscv-isa manual V2.3-draft notes, in the comments on
-page 76, "For virtual memory systems some data accesses could be resident
-in physical memory and some not". The interesting question then arises:
-how does RVV deal with the exact same scenario?
-Expired U.S. Patent 5895501 (Filing Date Sep 3 1996) describes a method
-of detecting early page / segmentation faults and adjusting the TLB
-in advance, accordingly: other strategies are explored in the Appendix
-Section "Virtual Memory Page Faults".
-