From: Luke Kenneth Casson Leighton Date: Tue, 25 Jun 2019 10:16:36 +0000 (+0100) Subject: add abridged spec, split out vblock format to own file X-Git-Tag: convert-csv-opcode-to-binary~4456 X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=3bf64d6e64bb8b7c5fd86cca0e39e04a27914d61;p=libreriscv.git add abridged spec, split out vblock format to own file --- diff --git a/simple_v_extension/abridged_spec.mdwn b/simple_v_extension/abridged_spec.mdwn new file mode 100644 index 000000000..d4c008f44 --- /dev/null +++ b/simple_v_extension/abridged_spec.mdwn @@ -0,0 +1,1104 @@ +# Simple-V (Parallelism Extension Proposal) Specification (Abridged) + +* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton +* Status: DRAFTv0.6 +* Last edited: 25 jun 2019 + +[[!toc ]] + +# Introduction + +Simple-V is a uniform parallelism API for RISC-V hardware that allows +the Program Counter to enter "sub-contexts" in which, ultimately, standard +RISC-V scalar opcodes are executed. + +The sub-context execution is "nested" in "re-entrant" form, in the +following order: + +* Main standard RISC-V Program Counter (PC) +* VBLOCK sub-execution context (PCVBLK increments whilst PC is paused) +* VL element loops (STATE srcoffs and destoffs increment, PC and PCVBLK pause) +* SUBVL element loops (STATE svsrcoffs/svdestoffs increment, VL pauses) + +Note: **there are *no* new opcodes**. The scheme works *entirely* +on hidden context that augments *scalar* RISCV instructions. 
+ +# CSRs + +There are five CSRs, available in any privilege level: + +* MVL (the Maximum Vector Length) +* VL (which has different characteristics from standard CSRs) +* SUBVL (effectively a kind of SIMD) +* STATE (containing copies of MVL, VL and SUBVL as well as context information) +* PCVBLK (the current operation being executed within a VBLOCK Group) + +For Privilege Levels (trap handling) there are the following CSRs, +where x may be u, m, s or h for User, Machine, Supervisor or Hypervisor +Modes respectively: + +* (x)ePCVBLK (a copy of the sub-execution Program Counter, that is relative + to the start of the current VBLOCK Group, set on a trap). +* (x)eSTATE (useful for saving and restoring during context switch, + and for providing fast transitions) + +The u/m/s CSRs are treated and handled exactly like their (x)epc +equivalents. On entry to or exit from a privilege level, the contents +of its (x)eSTATE are swapped with STATE. + +(x)ePCVBLK CSRs must be treated exactly like their corresponding (x)epc +equivalents. See VBLOCK section for details. + +## MAXVECTORLENGTH (MVL) + +MAXVECTORLENGTH is the same concept as MVL in RVV, except that it +is variable length and may be dynamically set. MVL is +however limited to the regfile bitwidth XLEN (1-32 for RV32, +1-64 for RV64 and so on). + +## Vector Length (VL) + +VSETVL is slightly different from RVV. Similar to RVV, VL is set to be within +the range 1 <= VL <= MVL (where MVL in turn is limited to 1 <= MVL <= XLEN) + + VL = rd = MIN(vlen, MVL) + +where 1 <= MVL <= XLEN + +## SUBVL - Sub Vector Length + +This is a "group by quantity" that effectively asks each iteration +of the hardware loop to load SUBVL elements of width elwidth at a +time. Effectively, SUBVL is like a SIMD multiplier: instead of just 1 +operation issued, SUBVL operations are issued. + +The main effect of SUBVL is that predication bits are applied per +**group**, rather than by individual element. 
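
As a rough illustration of the per-group predication described above, the following Python sketch lists which flat element indices would be processed for a given VL, SUBVL and predicate mask. It is not part of the specification: `active_elements` is a hypothetical helper name, and the flat index `i * SUBVL + s` is the layout used by the op_add pseudo-code later in this document.

```python
def active_elements(vl, subvl, predicate):
    """Return the flat element indices enabled by a per-group predicate.

    Assumption: bit i of `predicate` enables or disables the whole of
    group i, and group i covers elements i*subvl .. i*subvl + subvl - 1.
    """
    enabled = []
    for i in range(vl):               # outer VL loop: one predicate bit each
        if not (predicate >> i) & 1:
            continue                  # the whole sub-vector is masked out
        for s in range(subvl):        # inner SUBVL loop shares that one bit
            enabled.append(i * subvl + s)
    return enabled


# VL=3 groups of SUBVL=2 elements, with groups 0 and 2 enabled (mask 0b101)
print(active_elements(3, 2, 0b101))
```

With SUBVL=1 the sketch degenerates to ordinary per-element predication, which is why SUBVL can be viewed as a multiplier on top of the VL loop.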
+ +## STATE + +This is a standard CSR that contains sufficient information for a +full context save/restore. It contains (and permits setting of): + +* MVL +* VL +* destoffs - the destination element offset of the current parallel + instruction being executed +* srcoffs - for twin-predication, the source element offset as well. +* SUBVL +* svdestoffs - the subvector destination element offset of the current + parallel instruction being executed +* svsrcoffs - for twin-predication, the subvector source element offset + as well. + +The format of the STATE CSR is as follows: + +| (29..28) | (27..26) | (25..24) | (23..18) | (17..12) | (11..6) | (5..0) | +| -------- | -------- | -------- | -------- | -------- | ------- | ------- | +| dsvoffs | ssvoffs | subvl | destoffs | srcoffs | vl | maxvl | + +Notes: + +* The entries are truncated to be within range. Attempts to set VL to + greater than MAXVL will truncate VL. +* Both VL and MAXVL are stored offset by one. 0b000000 represents VL=1, + 0b000001 represents VL=2. This allows the full range 1 to XLEN instead + of 0 to only 63. + +## VL, MVL and SUBVL instruction aliases + +This table contains pseudo-assembly instruction aliases. Note the +subtraction of 1 from the CSRRWI pseudo variants, to compensate for the +reduced range of the 5 bit immediate. + +| alias | CSR | +| - | - | +| SETVL rd, rs | CSRRW VL, rd, rs | +| SETVLi rd, #n | CSRRWI VL, rd, #n-1 | +| GETVL rd | CSRRW VL, rd, x0 | +| SETMVL rd, rs | CSRRW MVL, rd, rs | +| SETMVLi rd, #n | CSRRWI MVL, rd, #n-1 | +| GETMVL rd | CSRRW MVL, rd, x0 | + +Note: CSRRC and other bitsetting may still be used, they are however not particularly useful (very obscure). + +## Register key-value (CAM) table + +The purpose of the Register table is to mark which registers change behaviour +if used in a "Standard" (normally scalar) opcode. 
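
Returning briefly to the STATE CSR: its bit layout and the offset-by-one storage of VL and MAXVL (both given in the STATE section above) can be sketched in Python as follows. This is only an illustration under stated assumptions: `pack_state` and `unpack_state` are hypothetical names, and the subvl and offset fields are treated as raw bit patterns because their encoding is not spelled out here.

```python
# (shift, width) of each STATE field, taken from the STATE format table
FIELDS = {
    "maxvl":    (0, 6),
    "vl":       (6, 6),
    "srcoffs":  (12, 6),
    "destoffs": (18, 6),
    "subvl":    (24, 2),
    "ssvoffs":  (26, 2),
    "dsvoffs":  (28, 2),
}


def pack_state(maxvl, vl, srcoffs=0, destoffs=0, subvl=0, ssvoffs=0, dsvoffs=0):
    """Build a STATE value; VL is truncated to MAXVL before being stored."""
    vl = min(vl, maxvl)
    raw = {
        "maxvl": maxvl - 1,           # stored offset by one
        "vl": vl - 1,                 # stored offset by one
        "srcoffs": srcoffs,
        "destoffs": destoffs,
        "subvl": subvl,               # assumption: raw 2-bit field
        "ssvoffs": ssvoffs,
        "dsvoffs": dsvoffs,
    }
    state = 0
    for name, (shift, width) in FIELDS.items():
        state |= (raw[name] & ((1 << width) - 1)) << shift
    return state


def unpack_state(state):
    """Split a STATE value back into fields, undoing the offset by one."""
    out = {name: (state >> shift) & ((1 << width) - 1)
           for name, (shift, width) in FIELDS.items()}
    out["maxvl"] += 1
    out["vl"] += 1
    return out


print(unpack_state(pack_state(maxvl=8, vl=12, srcoffs=3)))
```

Storing the two lengths offset by one is what lets a 6-bit field express the full range 1 to 64 rather than 0 to 63.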
+ +16 bit format: + +| RegCAM | 15 | (14..8) | 7 | (6..5) | (4..0) | +| ------ | - | - | - | ------ | ------- | +| 0 | isvec0 | regidx0 | i/f | vew0 | regkey | +| 1 | isvec1 | regidx1 | i/f | vew1 | regkey | +| 2 | isvec2 | regidx2 | i/f | vew2 | regkey | +| 3 | isvec3 | regidx3 | i/f | vew3 | regkey | + +8 bit format: + +| RegCAM | | 7 | (6..5) | (4..0) | +| ------ | | - | ------ | ------- | +| 0 | | i/f | vew0 | regnum | + +Mapping the 8-bit to 16-bit format: + +| RegCAM | 15 | (14..8) | 7 | (6..5) | (4..0) | +| ------ | - | - | - | ------ | ------- | +| 0 | isvec=1 | regnum0<<2 | i/f | vew0 | regnum0 | +| 1 | isvec=1 | regnum1<<2 | i/f | vew1 | regnum1 | +| 2 | isvec=1 | regnum2<<2 | i/f | vew2 | regnum2 | +| 3 | isvec=1 | regnum3<<2 | i/f | vew3 | regnum3 | + +Fields: + +* i/f is set to "1" to indicate that the redirection/tag entry is to + be applied to integer registers; 0 indicates that it is relevant to + floating-point registers. +* isvec indicates that the register (whether a src or dest) is to progress + incrementally forward on each loop iteration. This gives the "effect" + of vectorisation. isvec set to zero indicates "do not progress", giving + the "effect" of that register being scalar. +* vew overrides the operation's default width. 
See table below +* regkey is the register which, if encountered in an op (as src or dest) + is to be "redirected" +* in the 16-bit format, regidx is the *actual* register to be used + for the operation (note that it is 7 bits wide) + +| vew | bitwidth | +| --- | ------------------- | +| 00 | default (XLEN/FLEN) | +| 01 | 8 bit | +| 10 | 16 bit | +| 11 | 32 bit | + +As the above table is a CAM (key-value store), it may be appropriate +(faster, fewer gates, implementation-wise) to expand it into a +directly indexed table, as follows: + + struct vectorised { + bool isvector:1; + int elwidth:2; + bool enabled:1; + int regidx:7; + } + + struct vectorised fp_vec[32], int_vec[32]; + + for (i = 0; i < len; i++) // from VBLOCK Format + tb = int_vec if CSRvec[i].type == 0 else fp_vec + idx = CSRvec[i].regkey // INT/FP src/dst reg in opcode + tb[idx].elwidth = CSRvec[i].elwidth + tb[idx].regidx = CSRvec[i].regidx // indirection + tb[idx].isvector = CSRvec[i].isvector // 0=scalar + tb[idx].enabled = true; + +## Predication Table + +The Predication Table is a key-value store indicating whether, if a +given destination register (integer or floating-point) is referred to +in an instruction, it is to be predicated. Like the Register table, it +is an indirect lookup that allows the RV opcodes to not need modification. + +* regidx is the lookup key, in combination with the i/f flag: if that + integer or floating-point register is referred to in a (standard RV) + instruction, the lookup table is referenced to find the predication + mask to use for this operation. +* predidx is the *actual* (full, 7 bit) register to be used for the + predication mask. +* inv indicates that the predication mask bits are to be inverted + prior to use *without* actually modifying the contents of the + register from which those bits originated. 
+* zeroing is either 1 or 0, and if set to 1, the operation must + place zeros in any element position where the predication mask is + set to zero. If zeroing is set to 0, unpredicated elements *must* + be left alone. Some microarchitectures may choose to interpret + this as skipping the operation entirely. Others which wish to + stick more closely to a SIMD architecture may choose instead to + interpret unpredicated elements as an internal "copy element" + operation (which would be necessary in SIMD microarchitectures + that perform register-renaming) +* ffirst is a special mode that stops sequential element processing when + a data-dependent condition occurs, whether a trap or a conditional test. + The handling of each (trap or conditional test) is slightly different: + see Instruction sections for further details + +16 bit format: + +| PrCSR | (15..11) | 10 | 9 | 8 | (7..1) | 0 | +| ----- | - | - | - | - | ------- | ------- | +| 0 | predidx | zero0 | inv0 | i/f | regidx | ffirst0 | +| 1 | predidx | zero1 | inv1 | i/f | regidx | ffirst1 | +| 2 | predidx | zero2 | inv2 | i/f | regidx | ffirst2 | +| 3 | predidx | zero3 | inv3 | i/f | regidx | ffirst3 | + +Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding. Its use must +generate an illegal instruction trap. 
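
The 16 bit predication entry layout and its reserved encoding can be sketched in Python as below. This is an illustrative decoder only: `decode_pred_entry` is a hypothetical name, the field positions are read directly from the 16 bit table above (so predidx is taken as the 5 bits 15..11 shown there), and a Python exception stands in for the illegal instruction trap.

```python
def decode_pred_entry(entry):
    """Split a 16-bit predication entry into its fields."""
    fields = {
        "predidx": (entry >> 11) & 0x1F,  # bits 15..11: predicate register
        "zero":    (entry >> 10) & 1,     # bit 10: zeroing
        "inv":     (entry >> 9) & 1,      # bit 9: invert the mask
        "i_f":     (entry >> 8) & 1,      # bit 8: raw i/f flag
        "regidx":  (entry >> 1) & 0x7F,   # bits 7..1: lookup key
        "ffirst":  entry & 1,             # bit 0: fail-on-first
    }
    # predidx=x0 with zero=1 and inv=1 is RESERVED
    if fields["predidx"] == 0 and fields["zero"] and fields["inv"]:
        raise ValueError("reserved encoding: illegal instruction trap")
    return fields


# predidx=x9, zeroing on, not inverted, i/f=1, key register 5, ffirst set
entry = (9 << 11) | (1 << 10) | (0 << 9) | (1 << 8) | (5 << 1) | 1
print(decode_pred_entry(entry))
```

Note that predidx=x0 with inv=1 and zero=0 is not rejected by this check: only the combination with zeroing set is reserved.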
+ +8 bit format: + +| PrCSR | 7 | 6 | 5 | (4..0) | +| ----- | - | - | - | ------- | +| 0 | zero0 | inv0 | i/f | regnum | + +Mapping from 8 to 16 bit format, the table becomes: + +| PrCSR | (15..11) | 10 | 9 | 8 | (7..1) | 0 | +| ----- | - | - | - | - | ------- | ------- | +| 0 | x9 | zero0 | inv0 | i/f | regnum | ff=0 | +| 1 | x10 | zero1 | inv1 | i/f | regnum | ff=0 | +| 2 | x11 | zero2 | inv2 | i/f | regnum | ff=0 | +| 3 | x12 | zero3 | inv3 | i/f | regnum | ff=0 | + +Pseudocode for predication: + + struct pred { + bool zero; // zeroing + bool inv; // register at predidx is inverted + bool ffirst; // fail-on-first + bool enabled; // use this to tell if the table-entry is active + int predidx; // redirection: actual int register to use + } + + struct pred fp_pred_reg[32]; + struct pred int_pred_reg[32]; + + for (i = 0; i < len; i++) // number of Predication entries in VBLOCK + tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg; + idx = CSRpred[i].regidx + tb[idx].zero = CSRpred[i].zero + tb[idx].inv = CSRpred[i].inv + tb[idx].ffirst = CSRpred[i].ffirst + tb[idx].predidx = CSRpred[i].predidx + tb[idx].enabled = true + + def get_pred_val(bool is_fp_op, int reg): + tb = fp_vec if is_fp_op else int_vec // Register table + if (!tb[reg].enabled): + return ~0x0, False // all enabled; no zeroing + tb = fp_pred_reg if is_fp_op else int_pred_reg // Predication table + if (!tb[reg].enabled): + return ~0x0, False // all enabled; no zeroing + predidx = tb[reg].predidx // redirection occurs HERE + predicate = intreg[predidx] // actual predicate HERE + if (tb[reg].inv): + predicate = ~predicate // invert ALL bits + return predicate, tb[reg].zero + +## Fail-on-First Mode + +ffirst is a special data-dependent predicate mode. There are two +variants: one is for faults: typically for LOAD/STORE operations, +which may encounter end of page faults during a series of operations. 
+The other variant is comparisons such as FEQ (or the augmented behaviour +of Branch), and any operation that returns a result of zero (whether +integer or floating-point). In the FP case, this includes negative-zero. + +Note that the execution order must "appear" to be sequential for ffirst +mode to work correctly. An in-order architecture must execute the element +operations in sequence, whilst an out-of-order architecture must *commit* +the element operations in sequence (giving the appearance of in-order +execution). + +Note also, that if ffirst mode is needed without predication, a special +"always-on" Predicate Table Entry may be constructed by setting +inverse-on and using x0 as the predicate register. This +will have the effect of creating a mask of all ones, allowing ffirst +to be set. + +### Fail-on-first traps + +Except for the first element, ffault stops sequential element processing +when a trap occurs. The first element is treated normally (as if ffirst +is clear). Should any subsequent element instruction require a trap, +instead it and subsequent indexed elements are ignored (or cancelled in +out-of-order designs), and VL is set to the *last* instruction that did +not take the trap. + +Note that predicated-out elements (where the predicate mask bit is zero) +are clearly excluded (i.e. the trap will not occur). However, note that +the loop still had to test the predicate bit: thus on return, +VL is set to include elements that did not take the trap *and* includes +the elements that were predicated (masked) out (not tested up to the +point where the trap occurred). + +If SUBVL is being used (SUBVL!=1), the first *sub-group* of elements +will cause a trap as normal (as if ffirst is not set); subsequently, +the trap must not occur in the *sub-group* of elements. SUBVL will **NOT** +be modified. 
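
The trap variant of fail-on-first can be sketched in Python as follows, for the SUBVL=1 case only. This is a simplified model, not the specification's pseudo-code: `Trap`, `load_word`, `memory` and `ffirst_load` are hypothetical names, "the first element" is assumed to mean element index 0, and a Python exception stands in for a hardware trap.

```python
class Trap(Exception):
    """Stand-in for a hardware trap such as a page fault."""


memory = {0: 10, 4: 11, 8: 12}        # toy memory: only these addresses exist


def load_word(addr):
    if addr not in memory:
        raise Trap(addr)
    return memory[addr]


def ffirst_load(addresses, predicate, vl):
    """Fail-on-first load over vl elements; returns (results, new_vl)."""
    results = {}
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue                  # masked out: no access, so no trap
        try:
            results[i] = load_word(addresses[i])
        except Trap:
            if i == 0:
                raise                 # element 0 traps as normal
            return results, i         # truncate VL; later elements ignored
    return results, vl


# element 1 is masked out, element 3 would fault: VL is truncated to 3
print(ffirst_load([0, 4, 8, 12, 16], 0b11101, 5))
```

The truncated VL counts the masked-out element 1 as well as the two elements that loaded successfully, matching the rule that predicated-out elements are included in the value VL is set to.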
+ +Given that predication bits apply to SUBVL groups, the same rules apply +to predicated-out (masked-out) sub-groups in calculating the value that VL +is set to. + +### Fail-on-first conditional tests + +ffault stops sequential element conditional testing on the first element result +being zero. VL is set to the number of elements that were processed before +the fail-condition was encountered. + +Note that just as with traps, if SUBVL!=1, the first fail-condition in any +element of a *sub-group* will cause the processing to end, and, even if there +were elements within the *sub-group* that passed the test, that sub-group is +still (entirely) excluded from the count (from setting VL). i.e. VL is set to +the total number of *sub-groups* that had no fail-condition up until execution +was stopped. + +Note again that, just as with traps, predicated-out (masked-out) elements +are included in the count leading up to the fail-condition, even though they +were not tested. + +The pseudo-code for Predication makes this clearer and simpler than it is +in words (the loop ends, VL is set to the current element index, "i"). + +# Instructions + +To illustrate how Scalar operations are turned "vector" and "predicated", +simplified example pseudo-code for an integer ADD operation is shown below. +Floating-point would use the FP Register Table. + + function op_add(rd, rs1, rs2) # add not VADD! + int i, id=0, irs1=0, irs2=0; + predval = get_pred_val(FALSE, rd); + rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd; + rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1; + rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2; + for (i = 0; i < VL; i++) + xSTATE.srcoffs = i # save context + if (predval & 1< + +Adding in support for SUBVL is a matter of adding in an extra inner +for-loop, where register src and dest are still incremented inside the +inner part. Note that the predication is still taken from the VL index. 
+ +So whilst elements are indexed by "(i * SUBVL + s)", predicate bits are +indexed by "(i)". + + function op_add(rd, rs1, rs2) # add not VADD! + int i, id=0, irs1=0, irs2=0; + predval = get_pred_val(FALSE, rd); + rd = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd; + rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1; + rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2; + for (i = 0; i < VL; i++) + xSTATE.srcoffs = i # save context + for (s = 0; s < SUBVL; s++) + xSTATE.ssvoffs = s # save context + if (predval & 1< + +Branch operations use standard RV opcodes that are reinterpreted to +be "predicate variants" in the instance where either of the two src +registers is marked as a vector (active=1, vector=1). + +Note that the predication register to use (if one is enabled) is taken from +the *first* src register, and that this is used, just as with predicated +arithmetic operations, to mask whether the comparison operations take +place or not. If the second register is also marked as predicated, +that (scalar) predicate register is used as a **destination** to store +the results of all the comparisons. + +In instances where no vectorisation is detected on either src register, +the operation is treated as an absolutely standard scalar branch operation. +Where vectorisation is present on either or both src registers, the +branch may still go ahead if and only if *all* tests succeed (i.e. excluding +those tests that are predicated out). 
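
The "all tests must succeed" rule for vectorised branches can be illustrated with a short Python sketch. It deliberately ignores zeroing, fail-on-first and SUBVL, and it is not the specification's pseudo-code: `vector_branch_taken` is a hypothetical name, and the returned `results` bitmask merely models the comparison results that would be written to the destination predicate register when the second src register is predicated.

```python
import operator


def vector_branch_taken(src1, src2, vl, predicate=~0, cmp=operator.eq):
    """Return (taken, results) for a predicated vector compare-and-branch."""
    results = 0
    taken = True
    for i in range(vl):
        if not (predicate >> i) & 1:
            continue                  # predicated out: excluded from the test
        if cmp(src1[i], src2[i]):
            results |= 1 << i         # record the per-element outcome
        else:
            taken = False             # a single failing test cancels the branch
    return taken, results


# element 1 differs, so the unpredicated branch is not taken...
print(vector_branch_taken([1, 2, 3], [1, 9, 3], 3))
# ...but masking element 1 out (predicate 0b101) lets the branch go ahead
print(vector_branch_taken([1, 2, 3], [1, 9, 3], 3, predicate=0b101))
```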
+ +Pseudo-code for branch: + + s1 = reg_is_vectorised(src1); + s2 = reg_is_vectorised(src2); + + if not s1 && not s2 + if cmp(rs1, rs2) # scalar compare + goto branch + return + + preg = int_pred_reg[rd] + reg = int_regfile + + ps = get_pred_val(I/F==INT, rs1); + rd = get_pred_val(I/F==INT, rs2); # this may not exist + + if not exists(rd) or zeroing: + result = 0 + else + result = preg[rd] + + for (int i = 0; i < VL; ++i) + if (zeroing) + if not (ps & (1< + +There is no MV instruction in RV; however, there is a C.MV instruction. +It is used for copying integer-to-integer registers (vectorised FMV +is used for copying floating-point). + +If either the source or the destination register is marked as a vector, +C.MV is reinterpreted to be a vectorised (multi-register) predicated +move operation. The actual instruction's format does not change. + +There are several different instructions from RVV that are covered by +this one opcode: + +[[!table data=""" +src | dest | predication | op | +scalar | vector | none | VSPLAT | +scalar | vector | destination | sparse VSPLAT | +scalar | vector | 1-bit dest | VINSERT | +vector | scalar | 1-bit? src | VEXTRACT | +vector | vector | none | VCOPY | +vector | vector | src | Vector Gather | +vector | vector | dest | Vector Scatter | +vector | vector | src & dest | Gather/Scatter | +vector | vector | src == dest | sparse VCOPY | +"""]] + +Also, VMERGE may be implemented as back-to-back (macro-op fused) C.MV +operations with inversion on the src and dest predication for one of the +two C.MV operations. + +### FMV, FNEG and FABS Instructions + +These are identical in form to C.MV, except covering floating-point +register copying. The same double-predication rules also apply. +However, when elwidth is not set to default, the instruction is implicitly +and automatically converted to a (vectorised) floating-point type conversion +operation of the appropriate size covering the source and destination +register bitwidths. 
+ +(Note that FMV, FNEG and FABS are all actually pseudo-instructions) + +### FCVT Instructions + +These are again identical in form to C.MV, except that they cover +floating-point to integer and integer to floating-point. When element +width in each vector is set to default, the instructions behave exactly +as they are defined for standard RV (scalar) operations, except vectorised +in exactly the same fashion as outlined in C.MV. + +However when the source or destination element width is not set to default, +the opcode's explicit element widths are *over-ridden* to new definitions, +and the opcode's element width is taken as indicative of the SIMD width +(if applicable i.e. if packed SIMD is requested) instead. + +## LOAD / STORE Instructions and LOAD-FP/STORE-FP + +In vectorised architectures there are usually at least two different modes +for LOAD/STORE: + +* Read (or write for STORE) from sequential locations, where one + register specifies the address, and the one address is incremented + by a fixed amount. This is usually known as "Unit Stride" mode. +* Read (or write) from multiple indirected addresses, where the + vector elements each specify separate and distinct addresses. + +To support these different addressing modes, the CSR Register "isvector" +bit is used. So, for a LOAD, when the src register is set to +scalar, the LOADs are sequentially incremented by the src register +element width, and when the src register is set to "vector", the +elements are treated as indirection addresses. Simplified +pseudo-code would look like this: + + function op_ld(rd, rs) # LD not VLD! + rdv = int_csr[rd].active ? int_csr[rd].regidx : rd; + rsv = int_csr[rs].active ? int_csr[rs].regidx : rs; + ps = get_pred_val(FALSE, rs); # predication on src + pd = get_pred_val(FALSE, rd); # ... AND on dest + for (int i = 0, int j = 0; i < VL && j < VL;): + if (int_csr[rs].isvec) while (!(ps & 1< + +C.LWSP / C.SWSP and floating-point etc. 
are also source-dest twin-predicated, +where it is implicit in C.LWSP/FLWSP etc. that x2 is the source register. +It is therefore possible to use predicated C.LWSP to efficiently +pop registers off the stack (by predicating x2 as the source), cherry-picking +which registers to store to (by predicating the destination). Likewise +for C.SWSP. In this way, LOAD/STORE-Multiple is efficiently achieved. + +**Note**: it is still possible to redirect x2 to an alternative target +register. With care, this allows C.LWSP / C.SWSP (and C.FLWSP) to be used as +general-purpose LOAD/STORE operations. + +## Compressed LOAD / STORE Instructions + +Compressed LOAD and STORE are again exactly the same as scalar LOAD/STORE, +where the same rules apply and the same pseudo-code apply as for +non-compressed LOAD/STORE. Again: setting scalar or vector mode +on the src for LOAD and dest for STORE switches mode from "Unit Stride" +to "Multi-indirection", respectively. + +# Element bitwidth polymorphism + +Element bitwidth is best covered as its own special section, as it +is quite involved and applies uniformly across-the-board. SV restricts +bitwidth polymorphism to default, 8-bit, 16-bit and 32-bit. + +The effect of setting an element bitwidth is to re-cast each entry +in the register table, and for all memory operations involving +load/stores of certain specific sizes, to a completely different width. 
+Thus, in C-style terms, on an RV64 architecture, effectively each register +now looks like this: + + typedef union { + uint8_t actual_bytes[8]; // 8 for RV64, 4 for RV32, 16 for RV128 + uint8_t b[0]; // array of type uint8_t + uint16_t s[0]; + uint32_t i[0]; + uint64_t l[0]; + uint128_t d[0]; + } reg_t; + + reg_t int_regfile[128]; + +where when accessing any individual regfile[n].b entry it is permitted +(in C) to arbitrarily over-run the *declared* length of the array (zero), +and thus "overspill" to consecutive register file entries in a fashion +that is completely transparent to a greatly-simplified software / pseudo-code +representation. +It is however critical to note that it is clearly the responsibility of +the implementor to ensure that, towards the end of the register file, +an exception is thrown if an access beyond the "real" register +bytes is ever attempted. + +The pseudo-code is as follows, to demonstrate how the sign-extending +and width-extending works: + + typedef union { + uint8_t b; + uint16_t s; + uint32_t i; + uint64_t l; + } el_reg_t; + + bw(elwidth): // see the vew table: default, 8, 16, 32 bit + if elwidth == 0: + return xlen + if elwidth == 1: + return 8 + if elwidth == 2: + return 16 + // elwidth == 3: + return 32 + + get_max_elwidth(rs1, rs2): + return max(bw(int_csr[rs1].elwidth), # default (XLEN) if not set + bw(int_csr[rs2].elwidth)) # again XLEN if no entry + + get_polymorphed_reg(reg, bitwidth, offset): + el_reg_t res; + res.l = 0; // TODO: going to need sign-extending / zero-extending + if bitwidth == 8: + res.b = int_regfile[reg].b[offset] + elif bitwidth == 16: + res.s = int_regfile[reg].s[offset] + elif bitwidth == 32: + res.i = int_regfile[reg].i[offset] + elif bitwidth == 64: + res.l = int_regfile[reg].l[offset] + return res + + set_polymorphed_reg(reg, bitwidth, offset, val): + if (!int_csr[reg].isvec): + # sign/zero-extend depending on opcode requirements, from + # the reg's bitwidth out to the full bitwidth of the regfile + val = 
sign_or_zero_extend(val, bitwidth, xlen) + int_regfile[reg].l[0] = val + elif bitwidth == 8: + int_regfile[reg].b[offset] = val + elif bitwidth == 16: + int_regfile[reg].s[offset] = val + elif bitwidth == 32: + int_regfile[reg].i[offset] = val + elif bitwidth == 64: + int_regfile[reg].l[offset] = val + + maxsrcwid = get_max_elwidth(rs1, rs2) # source element width(s) + destwid = int_csr[rs1].elwidth # destination element width +  for (i = 0; i < VL; i++) + if (predval & 1< + +Polymorphic element widths in vectorised form means that the data +being loaded (or stored) across multiple registers needs to be treated +(reinterpreted) as a contiguous stream of elwidth-wide items, where +the source register's element width is **independent** from the destination's. + +This makes for a slightly more complex algorithm when using indirection +on the "addressed" register (source for LOAD and destination for STORE), +particularly given that the LOAD/STORE instruction provides important +information about the width of the data to be reinterpreted. + +As LOAD/STORE may be twin-predicated, it is important to note that +the rules on twin predication still apply. Where in previous +pseudo-code (elwidth=default for both source and target) it was +the *registers* that the predication was applied to, it is now the +**elements** that the predication is applied to. + +The full pseudocode for all LD operations may be written out +as follows: + + function LBU(rd, rs): + load_elwidthed(rd, rs, 8, true) + function LB(rd, rs): + load_elwidthed(rd, rs, 8, false) + function LH(rd, rs): + load_elwidthed(rd, rs, 16, false) + ... + ... + function LQ(rd, rs): + load_elwidthed(rd, rs, 128, false) + + # returns 1 byte of data when opwidth=8, 2 bytes when opwidth=16.. 
+ function load_memory(rs, imm, i, opwidth): + elwidth = int_csr[rs].elwidth + bitwidth = bw(elwidth); + elsperblock = min(1, opwidth / bitwidth) + srcbase = ireg[rs+i/(elsperblock)]; + offs = i % elsperblock; + return mem[srcbase + imm + offs]; # 1/2/4/8/16 bytes + + function load_elwidthed(rd, rs, opwidth, unsigned): + destwid = int_csr[rd].elwidth # destination element width +  rd = int_csr[rd].active ? int_csr[rd].regidx : rd; +  rs = int_csr[rs].active ? int_csr[rs].regidx : rs; +  ps = get_pred_val(FALSE, rs); # predication on src +  pd = get_pred_val(FALSE, rd); # ... AND on dest +  for (int i = 0, int j = 0; i < VL && j < VL;): + if (int_csr[rs].isvec) while (!(ps & 1< + +See ancillary resource: [[vblock_format]] + +# Subsets of RV functionality + +It is permitted to only implement SVprefix and not the VBLOCK instruction +format option, and vice-versa. UNIX Platforms **MUST** raise illegal +instruction on seeing an unsupported VBLOCK or SVprefix opcode, so that +traps may emulate the format. + +It is permitted in SVprefix to either not implement VL or not implement +SUBVL (see [[sv_prefix_proposal]] for full details. Again, UNIX Platforms +*MUST* raise illegal instruction on implementations that do not support +VL or SUBVL. + +It is permitted to limit the size of either (or both) the register files +down to the original size of the standard RV architecture. However, below +the mandatory limits set in the RV standard will result in non-compliance +with the SV Specification. 
+ diff --git a/simple_v_extension/specification.mdwn b/simple_v_extension/specification.mdwn index 6b57f8c5a..3c7caf337 100644 --- a/simple_v_extension/specification.mdwn +++ b/simple_v_extension/specification.mdwn @@ -3,7 +3,10 @@ * Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton * Status: DRAFTv0.6 * Last edited: 21 jun 2019 -* Ancillary resource: [[opcodes]] [[sv_prefix_proposal]] +* Ancillary resource: [[opcodes]] +* Ancillary resource: [[sv_prefix_proposal]] +* Ancillary resource: [[abridged_spec]] +* Ancillary resource: [[vblock_format]] With thanks to: @@ -2380,186 +2383,7 @@ No specific hints are yet defined in Simple-V # Vector Block Format -One issue with a former revision of SV was the setup and teardown -time of the CSRs. The cost of the use of a full CSRRW (requiring LI) -to set up registers and predicates was quite high. A VLIW-like format -therefore makes sense (named VBLOCK), and is conceptually reminiscent of -the ARM Thumb2 "IT" instruction. - -The format is: - -* the standard RISC-V 80 to 192 bit encoding sequence, with bits - defining the options to follow within the block -* An optional VL Block (16-bit) -* Optional predicate entries (8/16-bit blocks: see Predicate Table, above) -* Optional register entries (8/16-bit blocks: see Register Table, above) -* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow. - -Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used -as follows: - -| base+4 ... 
base+2 | base | number of bits | -| ------ ----------------- | ---------------- | -------------------------- | -| ..xxxx xxxxxxxxxxxxxxxx | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 | -| {ops}{Pred}{Reg}{VL Block} | SV Prefix | | - -A suitable prefix, which fits the Expanded Instruction-Length encoding -for "(80 + 16 times instruction-length)", as defined in Section 1.5 -of the RISC-V ISA, is as follows: - -| 15 | 14:12 | 11:10 | 9:8 | 7 | 6:0 | -| - | ----- | ----- | ----- | --- | ------- | -| vlset | 16xil | pplen | rplen | mode | 1111111 | - -The VL/MAXVL/SubVL Block format: - -| 31-30 | 29:28 | 27:22 | 21:17 - 16 | -| - | ----- | ------ | ------ - - | -| 0 | SubVL | VLdest | VLEN vlt | -| 1 | SubVL | VLdest | VLEN | - -Note: this format is very similar to that used in [[sv_prefix_proposal]] - -If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e -a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1, -it specifies the scalar register from which VL is set by this VBLOCK -instruction group. VL, whether set from the register or the immediate, -is then modified (truncated) to be MIN(VL, MAXVL), and the result stored -in the scalar register specified in VLdest. If VLdest is zero, no store -in the regfile occurs (however VL is still set). - -This option will typically be used to start vectorised loops, where -the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL" -sequence (in compact form). - -When bit 15 is set to 1, MAXVL and VL are both set to the immediate, -VLEN (again, offset by one), which is 6 bits in length, and the same -value stored in scalar register VLdest (if that register is nonzero). -A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will -set MAXVL=VL= 2 and so on. 
- -This option will typically not be used so much for loops as it will be -for one-off instructions such as saving the entire register file to the -stack with a single one-off Vectorised and predicated LD/ST, or as a way -to save or restore registers in a function call with a single instruction. - -CSRs needed: - -* mepcvliw -* sepcvliw -* uepcvliw -* hepcvliw - -Notes: - -* Bit 7 specifies if the prefix block format is the full 16 bit format - (1) or the compact less expressive format (0). In the 8 bit format, - pplen is multiplied by 2. -* 8 bit format predicate numbering is implicit and begins from x9. Thus - it is critical to put blocks in the correct order as required. -* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit - (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number - of entries are needed the last may be set to 0x00, indicating "unused". -* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block - immediately follows the VBLOCK instruction Prefix -* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 15 is 1, - otherwise 0 to 6) follow the (optional) VL Block. -* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1, - otherwise 0 to 6) follow the (optional) RegCam entries -* Bits 14 to 12 (IL) define the actual length of the instruction: total - number of bits is 80 + 16 times IL. Standard RV32, RVC and also - SVPrefix (P48/64-\*-Type) instructions fit into this space, after the - (optional) VL / RegCam / PredCam entries -* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed - format *MUST* have the RegCam and PredCam entries applied to the - operation (and the Vectorisation loop activated) -* P48 and P64 opcodes do **not** take their Register or predication - context from the VBLOCK tables: they do however have VL or SUBVL - applied (unless VLtyp or svlen are set). -* At the end of the VBLOCK Group, the RegCam and PredCam entries - *no longer apply*. 
VL, MAXVL and SUBVL on the other hand remain at - the values set by the last instruction (whether a CSRRW or the VL - Block header). -* Although an inefficient use of resources, it is fine to set the MAXVL, - VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK. - -All this would greatly reduce the amount of space utilised by Vectorised -instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes: -the CSR itself, a LI, and the setting up of the value into the RS -register of the CSR, which, again, requires a LI / LUI to get the 32 -bit data into the CSR. To get 64-bit data into the register in order -to put it into the CSR(s), LOAD operations from memory are needed! - -Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM -entries), that's potentially 6 to eight 32-bit instructions, just to -establish the Vector State! - -Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more -bits if VL needs to be set to greater than 32). Bear in mind that in SV, -both MAXVL and VL need to be set. - -By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is -only 16 bits, and as long as not too many predicates and register vector -qualifiers are specified, several 32-bit and 16-bit opcodes can fit into -the format. If the full flexibility of the 16 bit block formats are not -needed, more space is saved by using the 8 bit formats. - -In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries -into a VBLOCK format makes a lot of sense. - -Bear in mind the warning in an earlier section that use of VLtyp or svlen -in a P48 or P64 opcode within a VBLOCK Group will result in corruption -(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To -avoid this situation, the STATE CSR may be copied into a temp register -and restored afterwards. - -Open Questions: - -* Is it necessary to stick to the RISC-V 1.5 format? Why not go with - using the 15th bit to allow 80 + 16\*0bnnnn bits? 
Perhaps to be sane, - limit to 256 bits (16 times 0-11). -* Could a "hint" be used to set which operations are parallel and which - are sequential? -* Could a new sub-instruction opcode format be used, one that does not - conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes? - no need for byte or bit-alignment -* Could a hardware compression algorithm be deployed? Quite likely, - because of the sub-execution context (sub-VBLOCK PC) - -## Limitations on instructions. - -To greatly simplify implementations, it is required to treat the VBLOCK -group as a separate sub-program with its own separate PC. The sub-pc -advances separately whilst the main PC remains pointing at the beginning -of the VBLOCK instruction (not to be confused with how VL works, which -is exactly the same principle, except it is VStart in the STATE CSR -that increments). - -This has implications, namely that a new set of CSRs identical to xepc -(mepc, sepc, hepc and uepc) must be created and managed and respected -as being a sub extension of the xepc set of CSRs. Thus, xepcvliw CSRs -must be context switched and saved / restored in traps. - -The srcoffs and destoffs indices in the STATE CSR may be similarly -regarded as another sub-execution context, giving in effect two sets of -nested sub-levels of the RISCV Program Counter (actually, three including -SUBVL and ssvoffs). - -In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK, -branches MUST be restricted to within (relative to) the block, -i.e. addressing is now restricted to the start (and very short) length -of the block. - -Also: calling subroutines is inadvisable, unless they can be entirely -accomplished within a block. - -A normal jump, normal branch and a normal function call may only be taken -by letting the VBLOCK group end, returning to "normal" standard RV mode, -and then using standard RVC, 32 bit or P48/64-\*-type opcodes. 
- -## Links - -* +See ancillary resource: [[vblock_format]] # Subsets of RV functionality diff --git a/simple_v_extension/vblock_format.mdwn b/simple_v_extension/vblock_format.mdwn new file mode 100644 index 000000000..5eda93016 --- /dev/null +++ b/simple_v_extension/vblock_format.mdwn @@ -0,0 +1,188 @@ +# Simple-V (Parallelism Extension Proposal) Vector Block Format + +* Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton +* Status: DRAFTv0.6 +* Last edited: 21 jun 2019 + +[[!toc ]] + +# Vector Block Format + +This is a way to give Vector and Predication Context to a group of +standard scalar RISC-V instructions, in a highly compact form. + +The format is: + +* the standard RISC-V 80 to 192 bit encoding sequence, with bits + defining the options to follow within the block +* An optional VL Block (16-bit) +* Optional predicate entries (8/16-bit blocks: see Predicate Table, above) +* Optional register entries (8/16-bit blocks: see Register Table, above) +* finally some 16/32/48 bit standard RV or SVPrefix opcodes follow. + +Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used +as follows: + +| base+4 ... 
base+2 | base | number of bits | +| ------ ----------------- | ---------------- | -------------------------- | +| ..xxxx xxxxxxxxxxxxxxxx | xnnnxxxxx1111111 | (80+16\*nnn)-bit, nnn!=111 | +| {ops}{Pred}{Reg}{VL Block} | SV Prefix | | + +A suitable prefix, which fits the Expanded Instruction-Length encoding +for "(80 + 16 times instruction-length)", as defined in Section 1.5 +of the RISC-V ISA, is as follows: + +| 15 | 14:12 | 11:10 | 9:8 | 7 | 6:0 | +| - | ----- | ----- | ----- | --- | ------- | +| vlset | 16xil | pplen | rplen | mode | 1111111 | + +The VL/MAXVL/SubVL Block format: + +| 31-30 | 29:28 | 27:22 | 21:17 - 16 | +| - | ----- | ------ | ------ - - | +| 0 | SubVL | VLdest | VLEN vlt | +| 1 | SubVL | VLdest | VLEN | + +Note: this format is very similar to that used in [[sv_prefix_proposal]] + +If vlt is 0, VLEN is a 5 bit immediate value, offset by one (i.e +a bit sequence of 0b00000 represents VL=1 and so on). If vlt is 1, +it specifies the scalar register from which VL is set by this VBLOCK +instruction group. VL, whether set from the register or the immediate, +is then modified (truncated) to be MIN(VL, MAXVL), and the result stored +in the scalar register specified in VLdest. If VLdest is zero, no store +in the regfile occurs (however VL is still set). + +This option will typically be used to start vectorised loops, where +the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL" +sequence (in compact form). + +When bit 15 is set to 1, MAXVL and VL are both set to the immediate, +VLEN (again, offset by one), which is 6 bits in length, and the same +value stored in scalar register VLdest (if that register is nonzero). +A value of 0b000000 will set MAXVL=VL=1, a value of 0b000001 will +set MAXVL=VL= 2 and so on. 
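As an editorial illustration (not part of the commit), the 16-bit prefix layout and the immediate-VL rule described above can be sketched in Python. This is a minimal sketch under stated assumptions: the names `VBlockPrefix`, `decode_vblock_prefix` and `vl_from_immediate` are hypothetical and do not come from the specification, and the VL Block itself is deliberately not decoded because its exact bit positions are not pinned down by the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VBlockPrefix:
    """Raw fields of the 16-bit VBLOCK prefix (hypothetical helper type)."""
    vlset: int       # bit 15: 1 if a VL Block follows the prefix
    il: int          # bits 14:12: instruction-length field (nnn)
    pplen: int       # bits 11:10: raw PredCam length field
    rplen: int       # bits 9:8: raw RegCam length field
    mode: int        # bit 7: table format selector
    total_bits: int  # 80 + 16 * il, per the RISC-V length encoding


def decode_vblock_prefix(halfword: int) -> VBlockPrefix:
    """Split a 16-bit prefix into its fields, following the table above."""
    if not 0 <= halfword <= 0xFFFF:
        raise ValueError("prefix must fit in 16 bits")
    if halfword & 0x7F != 0x7F:
        raise ValueError("bits 6:0 must be 1111111")
    il = (halfword >> 12) & 0x7
    if il == 0x7:
        # The encoding table states nnn != 111 for the (80+16*nnn)-bit form.
        raise ValueError("nnn=111 is not an (80+16*nnn)-bit encoding")
    return VBlockPrefix(
        vlset=(halfword >> 15) & 0x1,
        il=il,
        pplen=(halfword >> 10) & 0x3,
        rplen=(halfword >> 8) & 0x3,
        mode=(halfword >> 7) & 0x1,
        total_bits=80 + 16 * il,
    )


def vl_from_immediate(vlen_field: int, maxvl: int) -> int:
    """VL from an offset-by-one immediate, truncated to MIN(VL, MAXVL)."""
    return min(vlen_field + 1, maxvl)


if __name__ == "__main__":
    prefix = decode_vblock_prefix(0xA7FF)
    print(prefix)
    print(vl_from_immediate(0b00000, maxvl=8))
```

The fields are returned raw: scaling of pplen and rplen for the 8-bit table formats is left to the caller, since it depends on the notes that follow.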
+ +This option will typically not be used so much for loops as it will be +for one-off instructions such as saving the entire register file to the +stack with a single one-off Vectorised and predicated LD/ST, or as a way +to save or restore registers in a function call with a single instruction. + +CSRs needed: + +* mepcvliw +* sepcvliw +* uepcvliw +* hepcvliw + +Notes: + +* Bit 7 specifies if the prefix block format is the full 16 bit format + (1) or the compact less expressive format (0). In the 8 bit format, + pplen is multiplied by 2. +* 8 bit format predicate numbering is implicit and begins from x9. Thus + it is critical to put blocks in the correct order as required. +* Bit 7 also specifies if the register block format is 16 bit (1) or 8 bit + (0). In the 8 bit format, rplen is multiplied by 2. If only an odd number + of entries are needed the last may be set to 0x00, indicating "unused". +* Bit 15 specifies if the VL Block is present. If set to 1, the VL Block + immediately follows the VBLOCK instruction Prefix +* Bits 8 and 9 define how many RegCam entries (0 to 3 if bit 15 is 1, + otherwise 0 to 6) follow the (optional) VL Block. +* Bits 10 and 11 define how many PredCam entries (0 to 3 if bit 7 is 1, + otherwise 0 to 6) follow the (optional) RegCam entries +* Bits 14 to 12 (IL) define the actual length of the instruction: total + number of bits is 80 + 16 times IL. Standard RV32, RVC and also + SVPrefix (P48/64-\*-Type) instructions fit into this space, after the + (optional) VL / RegCam / PredCam entries +* In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed + format *MUST* have the RegCam and PredCam entries applied to the + operation (and the Vectorisation loop activated) +* P48 and P64 opcodes do **not** take their Register or predication + context from the VBLOCK tables: they do however have VL or SUBVL + applied (unless VLtyp or svlen are set). +* At the end of the VBLOCK Group, the RegCam and PredCam entries + *no longer apply*. 
VL, MAXVL and SUBVL on the other hand remain at + the values set by the last instruction (whether a CSRRW or the VL + Block header). +* Although an inefficient use of resources, it is fine to set the MAXVL, + VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK. + +All this would greatly reduce the amount of space utilised by Vectorised +instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes: +the CSR itself, a LI, and the setting up of the value into the RS +register of the CSR, which, again, requires a LI / LUI to get the 32 +bit data into the CSR. To get 64-bit data into the register in order +to put it into the CSR(s), LOAD operations from memory are needed! + +Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM +entries), that's potentially 6 to eight 32-bit instructions, just to +establish the Vector State! + +Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more +bits if VL needs to be set to greater than 32). Bear in mind that in SV, +both MAXVL and VL need to be set. + +By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is +only 16 bits, and as long as not too many predicates and register vector +qualifiers are specified, several 32-bit and 16-bit opcodes can fit into +the format. If the full flexibility of the 16 bit block formats are not +needed, more space is saved by using the 8 bit formats. + +In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries +into a VBLOCK format makes a lot of sense. + +Bear in mind the warning in an earlier section that use of VLtyp or svlen +in a P48 or P64 opcode within a VBLOCK Group will result in corruption +(use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To +avoid this situation, the STATE CSR may be copied into a temp register +and restored afterwards. + +Open Questions: + +* Is it necessary to stick to the RISC-V 1.5 format? Why not go with + using the 15th bit to allow 80 + 16\*0bnnnn bits? 
Perhaps to be sane, + limit to 256 bits (16 times 0-11). +* Could a "hint" be used to set which operations are parallel and which + are sequential? +* Could a new sub-instruction opcode format be used, one that does not + conform precisely to RISC-V rules, but *unpacks* to RISC-V opcodes? + no need for byte or bit-alignment +* Could a hardware compression algorithm be deployed? Quite likely, + because of the sub-execution context (sub-VBLOCK PC) + +## Limitations on instructions. + +To greatly simplify implementations, it is required to treat the VBLOCK +group as a separate sub-program with its own separate PC. The sub-pc +advances separately whilst the main PC remains pointing at the beginning +of the VBLOCK instruction (not to be confused with how VL works, which +is exactly the same principle, except it is VStart in the STATE CSR +that increments). + +This has implications, namely that a new set of CSRs identical to xepc +(mepc, sepc, hepc and uepc) must be created and managed and respected +as being a sub extension of the xepc set of CSRs. Thus, xepcvliw CSRs +must be context switched and saved / restored in traps. + +The srcoffs and destoffs indices in the STATE CSR may be similarly +regarded as another sub-execution context, giving in effect two sets of +nested sub-levels of the RISCV Program Counter (actually, three including +SUBVL and ssvoffs). + +In addition, as xepcvliw CSRs are relative to the beginning of the VBLOCK, +branches MUST be restricted to within (relative to) the block, +i.e. addressing is now restricted to the start (and very short) length +of the block. + +Also: calling subroutines is inadvisable, unless they can be entirely +accomplished within a block. + +A normal jump, normal branch and a normal function call may only be taken +by letting the VBLOCK group end, returning to "normal" standard RV mode, +and then using standard RVC, 32 bit or P48/64-\*-type opcodes. + +## Links + +* +