X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=simple_v_extension.mdwn;h=0642ce926d702469769bae2210773b1035398a95;hb=fa430b3a7c6a80d54aa5d040bd3dd4add9fcf5d0;hp=723cc07259960db0b00ed61b10873a310b9e3ab8;hpb=77fdb0e132155a3060cad2104ea528b1ab9233c2;p=libreriscv.git
diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn
index 723cc0725..0642ce926 100644
--- a/simple_v_extension.mdwn
+++ b/simple_v_extension.mdwn
@@ -1,5 +1,7 @@
 # Variable-width Variable-packed SIMD / Simple-V / Parallelism Extension Proposal
 
+**Note: this document is out of date and involved early ideas and discussions**
+
 Key insight: Simple-V is intended as an abstraction layer to provide
 a consistent "API" to parallelisation of existing *and future* operations.
 *Actual* internal hardware-level parallelism is *not* required, such
@@ -12,6 +14,11 @@ of Out-of-order restructuring (including parallel ALU lanes) or VLIW
 implementations, or SIMD, or anything else, would then benefit from
 the uniformity of a consistent API.
 
+**No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequentially is suitable for use even with RV32E.
+
+* Talk slides:
+* Specification: now move to its own page: [[specification]]
+
 [[!toc ]]
 
 # Introduction
@@ -355,659 +362,6 @@ absolute bare minimum level of compliance with the "API" (software-traps
 when vectorisation is detected), all the way up to full supercomputing
 level all-hardware parallelism. Options are covered in the Appendix.
 
-# CSRs
-
-There are two CSR tables needed to create lookup tables which are used at
-the register decode phase.
-
-* Integer Register N is Vector
-* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1)
-* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64)
-* Integer Register N is a Predication Register (note: a key-value store)
-
-Also (see Appendix, "Context Switch Example") it may turn out to be important
-to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that
-Vectorised LOAD / STORE may be used to load and store multiple registers:
-something that is missing from the Base RV ISA.
-
-Notes:
-
-* for the purposes of LOAD / STORE, Integer Registers which are
-  marked as a Vector will result in a Vector LOAD / STORE.
-* Vector Lengths are *not* the same as vsetl but are an integral part
-  of vsetl.
-* Actual vector length is *multipled* by how many blocks of length
-  "bitwidth" may fit into an XLEN-sized register file.
-* Predication is a key-value store due to the implicit referencing,
-  as opposed to having the predicate register explicitly in the instruction.
-* Whilst the predication CSR is a key-value store it *generates* easier-to-use
-  state information.
-* TODO: assess whether the same technique could be applied to the other
-  Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4,
-  V2.3-Draft ISA Reference) it becomes possible to greatly reduce state
-  needed for context-switches (empty slots need never be stored).
-
-## Predication CSR
-
-The Predication CSR is a key-value store indicating whether, if a given
-destination register (integer or floating-point) is referred to in an
-instruction, it is to be predicated. However it is important to note
-that the *actual* register is *different* from the one that ends up
-being used, due to the level of indirection through the lookup table.
-This includes (in the future) redirecting to a *second* bank of
-integer registers (as a future option)
-
-* regidx is the actual register that in combination with the
-  i/f flag, if that integer or floating-point register is referred to,
-  results in the lookup table being referenced to find the predication
-  mask to use on the operation in which that (regidx) register has
-  been used
-* predidx (in combination with the bank bit in the future) is the
-  *actual* register to be used for the predication mask. Note:
-  in effect predidx is actually a 6-bit register address, as the bank
-  bit is the MSB (and is nominally set to zero for now).
-* inv indicates that the predication mask bits are to be inverted
-  prior to use *without* actually modifying the contents of the
-  register itself.
-* zeroing is either 1 or 0, and if set to 1, the operation must
-  place zeros in any element position where the predication mask is
-  set to zero. If zeroing is set to 1, unpredicated elements *must*
-  be left alone. Some microarchitectures may choose to interpret
-  this as skipping the operation entirely. Others which wish to
-  stick more closely to a SIMD architecture may choose instead to
-  interpret unpredicated elements as an internal "copy element"
-  operation (which would be necessary in SIMD microarchitectures
-  that perform register-renaming)
-
-| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) |
-| ----- | - | - | - | - | ------- | ------- |
-| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx |
-| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx |
-| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx |
-| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx |
-
-The Predication CSR Table is a key-value store, so implementation-wise
-it will be faster to turn the table around (maintain topologically
-equivalent state):
-
-    struct pred {
-        bool zero;
-        bool inv;
-        bool bank; // 0 for now, 1=rsvd
-        bool enabled;
-        int predidx; // redirection: actual int register to use
-    }
-
-    struct pred fp_pred_reg[32]; // 64 in future (bank=1)
-    struct pred int_pred_reg[32]; // 64 in future (bank=1)
-
-    for (i = 0; i < 16; i++)
-        tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg;
-        idx = CSRpred[i].regidx
-        tb[idx].zero = CSRpred[i].zero
-        tb[idx].inv = CSRpred[i].inv
-        tb[idx].bank = CSRpred[i].bank
-        tb[idx].predidx = CSRpred[i].predidx
-        tb[idx].enabled = true
-
-So when an operation is to be predicated, it is the internal state that
-is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following
-pseudo-code for operations is given, where p is the explicit (direct)
-reference to the predication register to be used:
-
-    for (int i=0; i
-
-For full analysis of topological adaptation of RVV LOAD/STORE
-see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X)
-may be implicitly overloaded into the one base RV LOAD instruction,
-and likewise for STORE.
-
-Revised LOAD:
-
-[[!table data="""
-31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 |
-imm[11:0] |||| rs1 | funct3 | rd | opcode |
-1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 |
-? | s | rs2 | imm[4:0] | base | width | dest | LOAD |
-"""]]
-
-The exact same corresponding adaptation is also carried out on the single,
-double and quad precision floating-point LOAD-FP and STORE-FP operations,
-which fit the exact same instruction format. Thus all three types
-(unit, stride and indexed) may be fitted into FLW, FLD and FLQ,
-as well as FSW, FSD and FSQ.
-
-Notes:
-
-* LOAD remains functionally (topologically) identical to RVV LOAD
-  (for both integer and floating-point variants).
-* Predication CSR-marking register is not explicitly shown in instruction, it's
-  implicit based on the CSR predicate state for the rd (destination) register
-* rs2, the source, may *also be marked as a vector*, which implicitly
-  is taken to indicate "Indexed Load" (LD.X)
-* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S)
-* Bit 31 is reserved (ideas under consideration: auto-increment)
-* **TODO**: include CSR SIMD bitwidth in the pseudo-code below.
-* **TODO**: clarify where width maps to elsize
-
-Pseudo-code (excludes CSR SIMD bitwidth for simplicity):
-
-    if (unit-strided) stride = elsize;
-    else stride = areg[as2]; // constant-strided
-
-    preg = int_pred_reg[rd]
-
-    for (int i=0; i
-
-There is no MV instruction in RV however there is a C.MV instruction.
-It is used for copying integer-to-integer registers (vectorised FMV
-is used for copying floating-point).
-
-If either the source or the destination register are marked as vectors
-C.MV is reinterpreted to be a vectorised (multi-register) predicated
-move operation. The actual instruction's format does not change:
-
-[[!table data="""
-15 12 | 11 7 | 6 2 | 1 0 |
-funct4 | rd | rs | op |
-4 | 5 | 5 | 2 |
-C.MV | dest | src | C0 |
-"""]]
-
-A simplified version of the pseudocode for this operation is as follows:
-
-    function op_mv(rd, rs) # MV not VMV!
-      rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd;
-      rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs;
-      ps = get_pred_val(FALSE, rs); # predication on src
-      pd = get_pred_val(FALSE, rd); # ... AND on dest
-      for (int i = 0, int j = 0; i < VL && j < VL;):
-        if (int_vec[rs].isvec) while (!(ps & 1<
+
+> > > so, let's say instead of another LR *cancelling* the load
+> > > reservation, the SMP core / hardware thread *blocks* for
+> > > up to 63 further instructions, waiting for the reservation
+> > > to clear.
+> >
+> > Can you explain what you mean by this paragraph?
+>
+> best put in sequential events, probably.
+>
+> LR <-- 64-instruction countdown starts here
+> ... 63
+> ... 62
+> LR same address <--- notes that core1 is on 61,
+> so pauses for **UP TO** 61 cycles
+> ... 32
+> SC <- core1 didn't reach zero, therefore valid, therefore
+> core2 is now **UNBLOCKED**, is granted the
+> load-reservation (and begins its **own** 64-cycle
+> LR instruction countdown)
+> ... 63
+> ... 62
+> ...
+> ...
+> SC <- also valid
+
+Looks to me that you could effect the same functionality by simply
+holding onto the cache line in core 1 preventing core 2 from
+ getting past the LR.
+
+On the other hand, the freeze is similar to how the MP CRAYs did
+ATOMIC stuff.
+
 # References
 
 * SIMD considered harmful