From 999df9c317daa04d7af6e13abc40a62aa1fe2aff Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Sun, 30 Sep 2018 11:10:04 +0100 Subject: [PATCH] split simple-v specification into separate page --- simple_v_extension.mdwn | 656 +--------------- simple_v_extension/simple_v_chennai_2018.tex | 2 +- simple_v_extension/specification.mdwn | 773 +++++++++++++++++++ 3 files changed, 776 insertions(+), 655 deletions(-) create mode 100644 simple_v_extension/specification.mdwn diff --git a/simple_v_extension.mdwn b/simple_v_extension.mdwn index ea10abce9..3a16606a1 100644 --- a/simple_v_extension.mdwn +++ b/simple_v_extension.mdwn @@ -14,7 +14,8 @@ the uniformity of a consistent API. **No arithmetic operations are added or required to be added.** SV is purely a parallelism API and consequentially is suitable for use even with RV32E. -Talk slides: +* Talk slides: +* Specification: now moved to its own page: [[specification]] [[!toc ]] @@ -359,659 +360,6 @@ absolute bare minimum level of compliance with the "API" (software-traps when vectorisation is detected), all the way up to full supercomputing level all-hardware parallelism. Options are covered in the Appendix. -# CSRs - -There are two CSR tables needed to create lookup tables which are used at -the register decode phase. - -* Integer Register N is Vector -* Integer Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Floating-point Register N is Vector of length M: r(N) -> r(N..N+M-1) -* Floating-point Register N is of implicit bitwidth M (M=default,8,16,32,64) -* Integer Register N is a Predication Register (note: a key-value store) - -Also (see Appendix, "Context Switch Example") it may turn out to be important -to have a separate (smaller) set of CSRs for M-Mode (and S-Mode) so that -Vectorised LOAD / STORE may be used to load and store multiple registers: -something that is missing from the Base RV ISA.
- -Notes: - -* for the purposes of LOAD / STORE, Integer Registers which are - marked as a Vector will result in a Vector LOAD / STORE. -* Vector Lengths are *not* the same as vsetl but are an integral part - of vsetl. -* Actual vector length is *multipled* by how many blocks of length - "bitwidth" may fit into an XLEN-sized register file. -* Predication is a key-value store due to the implicit referencing, - as opposed to having the predicate register explicitly in the instruction. -* Whilst the predication CSR is a key-value store it *generates* easier-to-use - state information. -* TODO: assess whether the same technique could be applied to the other - Vector CSRs, particularly as pointed out in Section 17.8 (Draft RV 0.4, - V2.3-Draft ISA Reference) it becomes possible to greatly reduce state - needed for context-switches (empty slots need never be stored). - -## Predication CSR - -The Predication CSR is a key-value store indicating whether, if a given -destination register (integer or floating-point) is referred to in an -instruction, it is to be predicated. However it is important to note -that the *actual* register is *different* from the one that ends up -being used, due to the level of indirection through the lookup table. -This includes (in the future) redirecting to a *second* bank of -integer registers (as a future option) - -* regidx is the actual register that in combination with the - i/f flag, if that integer or floating-point register is referred to, - results in the lookup table being referenced to find the predication - mask to use on the operation in which that (regidx) register has - been used -* predidx (in combination with the bank bit in the future) is the - *actual* register to be used for the predication mask. Note: - in effect predidx is actually a 6-bit register address, as the bank - bit is the MSB (and is nominally set to zero for now). 
-* inv indicates that the predication mask bits are to be inverted - prior to use *without* actually modifying the contents of the - register itself. -* zeroing is either 1 or 0, and if set to 1, the operation must - place zeros in any element position where the predication mask is - set to zero. If zeroing is set to 0, unpredicated elements *must* - be left alone. Some microarchitectures may choose to interpret - this as skipping the operation entirely. Others which wish to - stick more closely to a SIMD architecture may choose instead to - interpret unpredicated elements as an internal "copy element" - operation (which would be necessary in SIMD microarchitectures - that perform register-renaming) - -| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) | -| ----- | - | - | - | - | ------- | ------- | -| 0 | bank0 | zero0 | inv0 | i/f | regidx | predidx | -| 1 | bank1 | zero1 | inv1 | i/f | regidx | predidx | -| .. | bank.. | zero.. | inv.. | i/f | regidx | predidx | -| 15 | bank15 | zero15 | inv15 | i/f | regidx | predidx | - -The Predication CSR Table is a key-value store, so implementation-wise -it will be faster to turn the table around (maintain topologically -equivalent state): - - struct pred { - bool zero; - bool inv; - bool bank; // 0 for now, 1=rsvd - bool enabled; - int predidx; // redirection: actual int register to use - } - - struct pred fp_pred_reg[32]; // 64 in future (bank=1) - struct pred int_pred_reg[32]; // 64 in future (bank=1) - - for (i = 0; i < 16; i++) - tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg; - idx = CSRpred[i].regidx - tb[idx].zero = CSRpred[i].zero - tb[idx].inv = CSRpred[i].inv - tb[idx].bank = CSRpred[i].bank - tb[idx].predidx = CSRpred[i].predidx - tb[idx].enabled = true - -So when an operation is to be predicated, it is the internal state that -is used. 
In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following -pseudo-code for operations is given, where p is the explicit (direct) -reference to the predication register to be used: - - for (int i=0; i - -For full analysis of topological adaptation of RVV LOAD/STORE -see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X) -may be implicitly overloaded into the one base RV LOAD instruction, -and likewise for STORE. - -Revised LOAD: - -[[!table data=""" -31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 | -imm[11:0] |||| rs1 | funct3 | rd | opcode | -1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 | -? | s | rs2 | imm[4:0] | base | width | dest | LOAD | -"""]] - -The exact same corresponding adaptation is also carried out on the single, -double and quad precision floating-point LOAD-FP and STORE-FP operations, -which fit the exact same instruction format. Thus all three types -(unit, stride and indexed) may be fitted into FLW, FLD and FLQ, -as well as FSW, FSD and FSQ. - -Notes: - -* LOAD remains functionally (topologically) identical to RVV LOAD - (for both integer and floating-point variants). -* Predication CSR-marking register is not explicitly shown in instruction, it's - implicit based on the CSR predicate state for the rd (destination) register -* rs2, the source, may *also be marked as a vector*, which implicitly - is taken to indicate "Indexed Load" (LD.X) -* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S) -* Bit 31 is reserved (ideas under consideration: auto-increment) -* **TODO**: include CSR SIMD bitwidth in the pseudo-code below. -* **TODO**: clarify where width maps to elsize - -Pseudo-code (excludes CSR SIMD bitwidth for simplicity): - - if (unit-strided) stride = elsize; - else stride = areg[as2]; // constant-strided - - preg = int_pred_reg[rd] - - for (int i=0; i - -There is no MV instruction in RV however there is a C.MV instruction. 
-It is used for copying integer-to-integer registers (vectorised FMV -is used for copying floating-point). - -If either the source or the destination register are marked as vectors -C.MV is reinterpreted to be a vectorised (multi-register) predicated -move operation. The actual instruction's format does not change: - -[[!table data=""" -15 12 | 11 7 | 6 2 | 1 0 | -funct4 | rd | rs | op | -4 | 5 | 5 | 2 | -C.MV | dest | src | C0 | -"""]] - -A simplified version of the pseudocode for this operation is as follows: - - function op_mv(rd, rs) # MV not VMV! -  rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd; -  rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs; -  ps = get_pred_val(FALSE, rs); # predication on src -  pd = get_pred_val(FALSE, rd); # ... AND on dest -  for (int i = 0, int j = 0; i < VL && j < VL;): - if (int_vec[rs].isvec) while (!(ps & 1< + +[[!toc ]] + +# Introduction + +# CSRs + +There are two CSR tables needed to create lookup tables which are used at +the register decode phase. + +## MAXVECTORLENGTH + +MAXVECTORLENGTH is the same concept as MVL in RVV. However in Simple-V, +given that its primary (base, unextended) purpose is for 3D, Video and +other purposes (not requiring supercomputing capability), it makes sense +to limit MAXVECTORLENGTH to the regfile bitwidth (32 for RV32, 64 for RV64 +and so on). + +The reason for setting this limit is so that predication registers, when +marked as such, may fit into a single register as opposed to fanning out +over several registers. This keeps the implementation a little simpler. +Note also (as described in the VSETVL section) that the *minimum* +for MAXVECTORLENGTH must be the total number of registers (15 for RV32E +and 31 for RV32 or RV64). + +Note that RVV on top of Simple-V may choose to over-ride this decision. + +## Predication CSR + +The Predication CSR is a key-value store indicating whether, if a given +destination register (integer or floating-point) is referred to in an +instruction, it is to be predicated. However it is important to note +that the *actual* register is *different* from the one that ends up +being used, due to the level of indirection through the lookup table. + +* regidx is the actual register that in combination with the + i/f flag, if that integer or floating-point register is referred to, + results in the lookup table being referenced to find the predication + mask to use on the operation in which that (regidx) register has + been used +* predidx (in combination with the bank bit in the future) is the + *actual* register to be used for the predication mask. Note: + in effect predidx is actually a 6-bit register address, as the bank + bit is the MSB (and is nominally set to zero for now). +* inv indicates that the predication mask bits are to be inverted + prior to use *without* actually modifying the contents of the + register itself. +* zeroing is either 1 or 0, and if set to 1, the operation must + place zeros in any element position where the predication mask is + set to zero.
If zeroing is set to 0, unpredicated elements *must* + be left alone. Some microarchitectures may choose to interpret + this as skipping the operation entirely. Others which wish to + stick more closely to a SIMD architecture may choose instead to + interpret unpredicated elements as an internal "copy element" + operation (which would be necessary in SIMD microarchitectures + that perform register-renaming) + +| PrCSR | 13 | 12 | 11 | 10 | (9..5) | (4..0) | +| ----- | - | - | - | - | ------- | ------- | +| 0 | bank0 | zero0 | inv0 | i/f | regidx | predkey | +| 1 | bank1 | zero1 | inv1 | i/f | regidx | predkey | +| .. | bank.. | zero.. | inv.. | i/f | regidx | predkey | +| 15 | bank15 | zero15 | inv15 | i/f | regidx | predkey | + +The Predication CSR Table is a key-value store, so implementation-wise +it will be faster to turn the table around (maintain topologically +equivalent state): + + struct pred { + bool zero; + bool inv; + bool bank; // 0 for now, 1=rsvd + bool enabled; + int predidx; // redirection: actual int register to use + } + + struct pred fp_pred_reg[32]; // 64 in future (bank=1) + struct pred int_pred_reg[32]; // 64 in future (bank=1) + + for (i = 0; i < 16; i++) + tb = int_pred_reg if CSRpred[i].type == 0 else fp_pred_reg; + idx = CSRpred[i].regidx + tb[idx].zero = CSRpred[i].zero + tb[idx].inv = CSRpred[i].inv + tb[idx].bank = CSRpred[i].bank + tb[idx].predidx = CSRpred[i].predidx + tb[idx].enabled = true + +So when an operation is to be predicated, it is the internal state that +is used. In Section 6.4.2 of Hwacha's Manual (EECS-2015-262) the following +pseudo-code for operations is given, where p is the explicit (direct) +reference to the predication register to be used: + + for (int i=0; i + +For full analysis of topological adaptation of RVV LOAD/STORE +see [[v_comparative_analysis]]. All three types (LD, LD.S and LD.X) +may be implicitly overloaded into the one base RV LOAD instruction, +and likewise for STORE. 
+ +Revised LOAD: + +[[!table data=""" +31 | 30 | 29 25 | 24 20 | 19 15 | 14 12 | 11 7 | 6 0 | +imm[11:0] |||| rs1 | funct3 | rd | opcode | +1 | 1 | 5 | 5 | 5 | 3 | 5 | 7 | +? | s | rs2 | imm[4:0] | base | width | dest | LOAD | +"""]] + +The exact same corresponding adaptation is also carried out on the single, +double and quad precision floating-point LOAD-FP and STORE-FP operations, +which fit the exact same instruction format. Thus all three types +(unit, stride and indexed) may be fitted into FLW, FLD and FLQ, +as well as FSW, FSD and FSQ. + +Notes: + +* LOAD remains functionally (topologically) identical to RVV LOAD + (for both integer and floating-point variants). +* Predication CSR-marking register is not explicitly shown in instruction, it's + implicit based on the CSR predicate state for the rd (destination) register +* rs2, the source, may *also be marked as a vector*, which implicitly + is taken to indicate "Indexed Load" (LD.X) +* Bit 30 indicates "element stride" or "constant-stride" (LD or LD.S) +* Bit 31 is reserved (ideas under consideration: auto-increment) +* **TODO**: include CSR SIMD bitwidth in the pseudo-code below. +* **TODO**: clarify where width maps to elsize + +Pseudo-code (excludes CSR SIMD bitwidth for simplicity): + + if (unit-strided) stride = elsize; + else stride = areg[as2]; // constant-strided + + preg = int_pred_reg[rd] + + for (int i=0; i + +There is no MV instruction in RV however there is a C.MV instruction. +It is used for copying integer-to-integer registers (vectorised FMV +is used for copying floating-point). + +If either the source or the destination register are marked as vectors +C.MV is reinterpreted to be a vectorised (multi-register) predicated +move operation. 
The actual instruction's format does not change: + +[[!table data=""" +15 12 | 11 7 | 6 2 | 1 0 | +funct4 | rd | rs | op | +4 | 5 | 5 | 2 | +C.MV | dest | src | C0 | +"""]] + +A simplified version of the pseudocode for this operation is as follows: + + function op_mv(rd, rs) # MV not VMV! +  rd = int_vec[rd].isvector ? int_vec[rd].regidx : rd; +  rs = int_vec[rs].isvector ? int_vec[rs].regidx : rs; +  ps = get_pred_val(FALSE, rs); # predication on src +  pd = get_pred_val(FALSE, rd); # ... AND on dest +  for (int i = 0, int j = 0; i < VL && j < VL;): + if (int_vec[rs].isvec) while (!(ps & 1< What does an ADD of two different-sized vectors do in simple-V? + +* if the two source operands are not the same length, throw an exception. +* if the destination operand is also a vector, and the source is longer + than the destination, throw an exception. + +> And what about instructions like JALR?  +> What does jumping to a vector do? + +* Throw an exception. Whether that actually results in spawning threads + as part of the trap-handling remains to be seen. + +# Under consideration + +## Retro-fitting Predication into branch-explicit ISA + +One of the goals of this parallelism proposal is to avoid instruction +duplication. However, with the base ISA having been designed explicitly +to *avoid* condition-codes entirely, shoe-horning predication into it +becomes quite challenging. + +However what if all branch instructions, if referencing a vectorised +register, were instead given *completely new analogous meanings* that +resulted in a parallel bit-wise predication register being set? This +would have to be done for both C.BEQZ and C.BNEZ, as well as BEQ, BNE, +BLT and BGE. + +We might imagine that FEQ, FLT and FLE would also need to be converted, +however these are effectively *already* in the precise form needed and +do not need to be converted *at all*! The difference is that FEQ, FLT +and FLE *specifically* write a 1 to an integer register if the condition +holds, and 0 if not.
All that needs to be done here is to say, "if +the integer register is tagged with a bit that says it is a predication +register, the **bit** in the integer register is set based on the +current vector index" instead. + +There is, in the standard Conditional Branch instruction, more than +adequate space to interpret it in a similar fashion: + +[[!table data=""" +31 |30 ..... 25 |24..20|19..15| 14...12| 11.....8 | 7 | 6....0 | +imm[12] | imm[10:5] |rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | + 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | + offset[12,10:5] || src2 | src1 | BEQ | offset[11,4:1] || BRANCH | +"""]] + +This would become: + +[[!table data=""" +31 | 30 .. 25 |24 ... 20 | 19 15 | 14 12 | 11 .. 8 | 7 | 6 ... 0 | +imm[12] | imm[10:5]| rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode | +1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 | +reserved || src2 | src1 | BEQ | predicate rs3 || BRANCH | +"""]] + +Similarly the C.BEQZ and C.BNEZ instruction format may be retro-fitted, +with the interesting side-effect that there is space within what is presently +the "immediate offset" field to reinterpret that to add in not only a bit +field to distinguish between floating-point compare and integer compare, +not only to add in a second source register, but also use some of the bits as +a predication target as well. + +[[!table data=""" +15..13 | 12 ....... 10 | 9...7 | 6 ......... 2 | 1 .. 0 | +funct3 | imm | rs10 | imm | op | +3 | 3 | 3 | 5 | 2 | +C.BEQZ | offset[8,4:3] | src | offset[7:6,2:1,5] | C1 | +"""]] + +Now uses the CS format: + +[[!table data=""" +15..13 | 12 . 10 | 9 .. 7 | 6 .. 5 | 4..2 | 1 .. 
0 | +funct3 | imm | rs10 | imm | | op | +3 | 3 | 3 | 2 | 3 | 2 | +C.BEQZ | pred rs3 | src1 | I/F B | src2 | C1 | +"""]] + +Bit 6 would be decoded as "operation refers to Integer or Float" including +interpreting src1 and src2 accordingly as outlined in Table 12.2 of the +"C" Standard, version 2.0, +whilst Bit 5 would allow the operation to be extended, in combination with +funct3 = 110 or 111: a combination of four distinct (predicated) comparison +operators. In both floating-point and integer cases those could be +EQ/NEQ/LT/LE (with GT and GE being synthesised by swapping src1 and src2). + + -- 2.30.2
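The CS-format reinterpretation of C.BEQZ / C.BNEZ proposed at the end of the patch can be sketched as a plain field decoder. This is only an illustration under stated assumptions, not part of the commit: the names `sv_cbranch`, `sv_decode_cbranch`, `field` and `rvc_full_reg` are hypothetical, the polarity of bit 6 (which value means "float") is not given in the text and is therefore exposed as a raw bit, and the mapping from (funct3, bit 5) to EQ/NEQ/LT/LE is left undecoded because the proposal does not fix it. The 3-bit register fields are expanded using the usual RVC convention (field n names register 8+n).

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the *proposed* predicated compressed-branch layout
 * (hypothetical struct; bit positions taken from the table in the patch). */
struct sv_cbranch {
    unsigned funct3; /* bits 15..13: 110 (C.BEQZ) or 111 (C.BNEZ)         */
    unsigned pred;   /* bits 12..10: predicate target ("pred rs3")        */
    unsigned src1;   /* bits  9..7 : first source, 3-bit RVC form         */
    unsigned if_bit; /* bit   6    : integer/float select (polarity is an */
                     /*              open question in the proposal)       */
    unsigned b_bit;  /* bit   5    : with funct3, selects 1 of 4 compares */
    unsigned src2;   /* bits  4..2 : second source, 3-bit RVC form        */
    unsigned op;     /* bits  1..0 : quadrant, C1 = 01                    */
};

/* Extract the inclusive bit range [hi:lo] from a 16-bit instruction. */
static unsigned field(uint16_t insn, unsigned hi, unsigned lo)
{
    return ((unsigned)insn >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

static struct sv_cbranch sv_decode_cbranch(uint16_t insn)
{
    struct sv_cbranch d;
    d.funct3 = field(insn, 15, 13);
    d.pred   = field(insn, 12, 10);
    d.src1   = field(insn, 9, 7);
    d.if_bit = field(insn, 6, 6);
    d.b_bit  = field(insn, 5, 5);
    d.src2   = field(insn, 4, 2);
    d.op     = field(insn, 1, 0);
    return d;
}

/* RVC convention: a 3-bit register field n refers to register 8+n. */
static unsigned rvc_full_reg(unsigned n3)
{
    return 8u + (n3 & 7u);
}
```

A decoder along these lines would only take this path when one of the referenced registers is marked as vectorised; otherwise the instruction keeps its standard C.BEQZ / C.BNEZ meaning and the same bits are read as a branch offset.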